DP-100 Data Science Questions Topic 7


Cross-validation - You must create three equal partitions for cross-validation. You must also configure the cross-validation process so that the rows in the test and training datasets are divided evenly by properties that are near each city's main river. You must complete this task before the data goes through the sampling process.

Linear regression module - When you train a Linear Regression module, you must determine the best features to use in a model. You can choose standard metrics provided to measure performance before and after the feature importance process completes. The distribution of features across multiple training models must be consistent.

Data visualization - You need to provide the test results to the Fabrikam Residences team. You create data visualizations to aid in presenting the results. You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test evaluation of the model. You need to select appropriate methods for producing the ROC curve in Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class Decision Jungle modules with one another.


Data issues -

Missing values - The AccessibilityToHighway column in both datasets contains missing values. The missing data must be replaced with new data so that it is modeled conditionally using the other variables in the data before filling in the missing values. Columns in each dataset contain missing and null values. The datasets also contain many outliers. The Age column has a high proportion of outliers. You need to remove the rows that have outliers in the Age column. The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.

Model fit - The model shows signs of overfitting. You need to produce a more refined regression model that reduces the overfitting.


Datasets - There are two datasets in CSV format that contain property details for two cities, London and Paris. You add both files to Azure Machine Learning Studio as separate datasets at the starting point for an experiment. Both datasets contain the following columns:

ColumnHeading - Description
CapitaCrimeRate - per capita crime rate by town
Zoned - proportion of residential land zoned for lots over 25K square feet
NonRetailAccess - proportion of non-retail business acres per town
NextToRiver - proximity of a property to the river
NitrogenOxideConcentration - nitric oxides concentration (parts per 10M)
AvgRoomsPerHouse - average number of rooms per dwelling
Age - proportion of owner-occupied units built prior to 1940
DistanceToEmploymentCenter - weighted distances to employment centers
AccessibilityToHighway - index of accessibility to radial highways, to a value of two decimal places
Tax - full value property tax rate per $10K
PupilTeacherRatio - pupil to teacher ratio by town
ProfessionalClass - professional class percentage
LowerStatus - percentage lower status of the population
MedianValue - median value of owner-occupied home in $1K

An initial investigation shows that the datasets are identical in structure apart from the MedianValue column. The smaller Paris dataset contains the MedianValue in text format, whereas the larger London dataset contains the MedianValue in numerical format.


Experiment requirements - You must set up the experiment to cross-validate the Linear Regression and Bayesian Linear Regression modules to evaluate performance. In each case, the predictor of the dataset is the column named MedianValue. You must ensure that the datatype of the MedianValue column of the Paris dataset matches the structure of the London dataset. You must prioritize the columns of data for predicting the outcome. You must use non-parametric statistics to measure relationships. You must use a feature selection algorithm to analyze the relationship between the MedianValue and AvgRoomsInHouse columns.

Model training - Permutation Feature Importance - Given a trained model and a test dataset, you must compute the Permutation Feature Importance scores of feature variables. You must determine the absolute fit of the model.


Hyperparameters - You must configure hyperparameters in the model learning process to speed the learning phase. In addition, this configuration should cancel the lowest performing runs at each evaluation interval, thereby directing effort and resources towards models that are more likely to be successful. You are concerned that the model might not efficiently use compute resources in hyperparameter tuning. You also want to prevent an increase in the overall tuning time. Therefore, you must implement an early stopping criterion on models that provides savings without terminating promising jobs.

Testing - You must produce multiple partitions of a dataset based on sampling using the Partition and Sample module in Azure Machine Learning Studio.


Introductory Info - Case study - This is a case study. Case studies are not timed separately. You can use as much exam time as you would like to complete each case. However, there may be additional case studies and sections on this exam. You must manage your time to ensure that you are able to complete all questions included on this exam in the time provided. To answer the questions included in a case study, you will need to reference information that is provided in the case study. Case studies might contain exhibits and other resources that provide more information about the scenario that is described in the case study. Each question is independent of the other questions in this case study. At the end of this case study, a review screen will appear. This screen allows you to review your answers and to make changes before you move to the next section of the exam. After you begin a new section, you cannot return to this section.


To start the case study - To display the first question in this case study, click the Next button. Use the buttons in the left pane to explore the content of the case study before you answer the questions. Clicking these buttons displays information such as business requirements, existing environment, and problem statements. If the case study has an All Information tab, note that the information displayed is identical to the information displayed on the subsequent tabs. When you are ready to answer a question, click the Question button to return to the question.

Overview - You are a data scientist for Fabrikam Residences, a company specializing in quality private and commercial property in the United States. Fabrikam Residences is considering expanding into Europe and has asked you to investigate prices for private residences in major European cities. You use Azure Machine Learning Studio to measure the median value of properties. You produce a regression model to predict property prices by using the Linear Regression and Bayesian Linear Regression modules.


DRAG DROP - You need to implement early stopping criteria as stated in the model training requirements. Which three code segments should you use to develop the solution? To answer, move the appropriate code segments from the list of code segments to the answer area and arrange them in the correct order. NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.

Select and Place:
- early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)
- import BanditPolicy
- import TruncationSelectionPolicy
- early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)
- from azureml.train.hyperdrive
- early_termination_policy = MedianStoppingPolicy(evaluation_interval = 1, delay_evaluation=5)
- import MedianStoppingPolicy

Correct Answer:
- from azureml.train.hyperdrive
- import TruncationSelectionPolicy
- early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)

Step 1: from azureml.train.hyperdrive
Step 2: import TruncationSelectionPolicy
Truncation selection cancels a given percentage of lowest performing runs at each evaluation interval. Runs are compared based on their performance on the primary metric and the lowest X% are terminated.
Scenario: You must configure hyperparameters in the model learning process to speed the learning phase. In addition, this configuration should cancel the lowest performing runs at each evaluation interval, thereby directing effort and resources towards models that are more likely to be successful.
Step 3: early_termination_policy = TruncationSelectionPolicy(...)
Example:
from azureml.train.hyperdrive import TruncationSelectionPolicy
early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)
In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run will be terminated at interval 5 if its performance at interval 5 is in the lowest 20% of performance of all runs at interval 5.
Incorrect Answers:
Median: Median stopping is an early termination policy based on running averages of primary metrics reported by the runs. This policy computes running averages across all training runs and terminates runs whose performance is worse than the median of the running averages.
Bandit: Bandit is a termination policy based on slack factor/slack amount and evaluation interval. The policy early terminates any runs where the primary metric is not within the specified slack factor/slack amount with respect to the best performing training run.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters
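The truncation rule can also be sketched in plain Python. This is a hypothetical illustration of the policy's behavior only, not the azureml implementation; the helper `apply_truncation` and the run metrics are invented for the example:

```python
# Hypothetical sketch of what a truncation selection policy does:
# at each evaluation interval, cancel the lowest-performing X% of runs.
# This is NOT the azureml code, just the rule it implements.

def apply_truncation(metrics_by_run, truncation_percentage):
    """Return the set of run ids to terminate: the lowest X% by metric.

    metrics_by_run: dict mapping run id -> primary metric (higher is better).
    """
    n_to_cut = int(len(metrics_by_run) * truncation_percentage / 100)
    # Rank runs from worst to best on the primary metric.
    ranked = sorted(metrics_by_run, key=metrics_by_run.get)
    return set(ranked[:n_to_cut])

# Ten runs reporting accuracy at some evaluation interval.
metrics = {f"run_{i}": 0.50 + 0.05 * i for i in range(10)}
terminated = apply_truncation(metrics, truncation_percentage=20)
# With 10 runs and truncation_percentage=20, the two worst runs are cancelled.
```

With `truncation_percentage=20` the two lowest-scoring of the ten runs are returned for termination, mirroring "the lowest X% are terminated" from the explanation above.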

DRAG DROP - You need to implement an early stopping criteria policy for model training. Which three code segments should you use to develop the solution? To answer, move the appropriate code segments from the list of code segments to the answer area and arrange them in the correct order. NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.

Select and Place: Code Segments
- early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)
- import TruncationSelectionPolicy
- from azureml.train.hyperdrive
- import BanditPolicy
- early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)

Correct Answer:
- from azureml.train.hyperdrive
- import TruncationSelectionPolicy
- early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)

You need to implement an early stopping criterion on models that provides savings without terminating promising jobs. Truncation selection cancels a given percentage of lowest performing runs at each evaluation interval. Runs are compared based on their performance on the primary metric and the lowest X% are terminated.
Example:
from azureml.train.hyperdrive import TruncationSelectionPolicy
early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)
Incorrect Answers:
Bandit is a termination policy based on slack factor/slack amount and evaluation interval. The policy early terminates any runs where the primary metric is not within the specified slack factor/slack amount with respect to the best performing training run.
Example:
from azureml.train.hyperdrive import BanditPolicy
early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters
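The Bandit slack-factor rule mentioned above can be sketched without azureml. This is a hypothetical illustration of the comparison the policy makes, assuming a higher-is-better primary metric; the helper name is invented:

```python
# Hypothetical sketch of the Bandit policy rule (not the azureml code):
# terminate any run whose primary metric falls outside a slack factor
# of the best run seen so far.

def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """True if run_metric is worse than best_metric / (1 + slack_factor).

    Assumes a higher-is-better primary metric such as accuracy.
    """
    return run_metric < best_metric / (1 + slack_factor)

best = 0.80
# 0.80 / 1.1 is roughly 0.727, so 0.75 survives but 0.70 is terminated.
keep = bandit_should_terminate(0.75, best, slack_factor=0.1)
cut = bandit_should_terminate(0.70, best, slack_factor=0.1)
```

This shows why Bandit is the wrong fit for the scenario: its cutoff is relative to the single best run, rather than cancelling a fixed percentage of the worst runs per interval.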

HOTSPOT - You need to set up the Permutation Feature Importance module according to the model training requirements. Which properties should you select? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point.

Hot Area: Tune Model Hyperparameters
Specify parameter sweeping mode: Random sweep
Maximum number of runs on random sweep: 5
Random seed: 0
Label Column: Median Value (Launch column selector)
Metric for measuring performance for classification: F-score; Precision; Recall; Accuracy
Metric for measuring performance for regression: Root of mean squared error; R-squared; Mean zero one error; Mean absolute error

Correct Answer:
- Accuracy
- R-squared

Alternate Answer: YES (90% of responses)
- F-score
- Mean absolute error
(http://breaking-bi.blogspot.com/2017/01/azure-machine-learning-model-evaluation.html)

Box 1: Accuracy - Scenario: You want to configure hyperparameters in the model learning process to speed the learning phase by using hyperparameters. In addition, this configuration should cancel the lowest performing runs at each evaluation interval, thereby directing effort and resources towards models that are more likely to be successful.
Box 2: R-squared

HOTSPOT - You need to identify the methods for dividing the data according to the testing requirements. Which properties should you select? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point.

Hot Area: Partition and Sample
Partition or sample mode: Assign to Folds; Sampling; Head; Randomized split
Random seed: 0
?: True; False; Partition evenly; Partition with custom partitions
Specify the partitioner method: Partition evenly
Specify number of folds to split evenly into: 3

Correct Answer:
- Assign to Folds
- Partition evenly

Scenario: Testing - You must produce multiple partitions of a dataset based on sampling using the Partition and Sample module in Azure Machine Learning Studio.
Box 1: Assign to Folds - Use the Assign to Folds option when you want to divide the dataset into subsets of the data. This option is also useful when you want to create a custom number of folds for cross-validation, or to split rows into several groups.
Not Head: Use Head mode to get only the first n rows. This option is useful if you want to test a pipeline on a small number of rows, and don't need the data to be balanced or sampled in any way.
Not Sampling: The Sampling option supports simple random sampling or stratified random sampling. This is useful if you want to create a smaller representative sample dataset for testing.
Box 2: Partition evenly - Specify the partitioner method: Indicate how you want data to be apportioned to each partition. Partition evenly: Use this option to place an equal number of rows in each partition. To specify the number of output partitions, type a whole number in the Specify number of folds to split evenly into text box.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/partition-and-sample
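"Partition evenly" simply places an equal number of rows in each fold, which can be sketched in a few lines of Python. This is an illustrative round-robin assignment, not the module's exact internal logic:

```python
# A minimal sketch of "Assign to Folds" with "Partition evenly": place an
# equal number of rows in each of n folds. The round-robin order here is
# illustrative only, not the Partition and Sample module's exact behavior.

def assign_to_folds(rows, n_folds):
    """Distribute rows round-robin into n_folds equally sized partitions."""
    folds = [[] for _ in range(n_folds)]
    for i, row in enumerate(rows):
        folds[i % n_folds].append(row)
    return folds

rows = list(range(9))          # nine rows, three folds -> three rows each
folds = assign_to_folds(rows, n_folds=3)
sizes = [len(f) for f in folds]
```

Nine rows split into three folds yields three rows per fold, matching the scenario's "three equal partitions for cross-validation".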

DRAG DROP - You need to correct the model fit issue. Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

Select and Place: Actions
- Add the Ordinal Regression module
- Add the Two-Class Averaged Perceptron module
- Augment the data
- Add the Bayesian Linear Regression module
- Decrease the memory size for L-BFGS
- Add the Multiclass Decision Jungle module
- Configure the regularization weight

Correct Answer:
- Augment the data
- Add the Bayesian Linear Regression module
- Configure the regularization weight

Step 1: Augment the data - Scenario: Columns in each dataset contain missing and null values. The datasets also contain many outliers.
Step 2: Add the Bayesian Linear Regression module - Scenario: You produce a regression model to predict property prices by using the Linear Regression and Bayesian Linear Regression modules.
Step 3: Configure the regularization weight - Regularization typically is used to avoid overfitting. For example, for the L2 regularization weight, type the value to use as the weight for L2 regularization. We recommend that you use a non-zero value to avoid overfitting. Scenario: Model fit - The model shows signs of overfitting. You need to produce a more refined regression model that reduces the overfitting.
Incorrect Answers:
Multiclass Decision Jungle module: Decision jungles are a recent extension to decision forests. A decision jungle consists of an ensemble of decision directed acyclic graphs (DAGs).
L-BFGS: L-BFGS stands for "limited memory Broyden-Fletcher-Goldfarb-Shanno". It can be found in the Two-Class Logistic Regression module, which is used to create a logistic regression model that can be used to predict two (and only two) outcomes.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/linear-regression
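Why a non-zero regularization weight reduces overfitting can be seen in a toy one-feature least-squares fit, where the L2-penalized slope has a closed form. This is a sketch of the general idea only, not the Linear Regression module's implementation:

```python
# Toy illustration of the effect of an L2 regularization weight: for
# y ~ w*x with no intercept, the penalized slope is
#     w = sum(x*y) / (sum(x^2) + lam)
# so a larger weight lam shrinks the coefficient toward zero, which
# reduces the model's capacity to overfit. Sketch only, not the module.

def ridge_slope(xs, ys, lam):
    """Closed-form slope for y ~ w*x with L2 penalty weight lam."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                            # exact fit has slope 2
w_unregularized = ridge_slope(xs, ys, lam=0.0)  # recovers slope 2
w_regularized = ridge_slope(xs, ys, lam=2.0)    # shrunk below 2
```

With `lam=0` the fit reproduces the data exactly; any positive weight pulls the coefficient toward zero, trading a little bias for less variance.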

HOTSPOT - You need to configure the Edit Metadata module so that the structure of the datasets match. Which configuration options should you select? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point.

Hot Area: Edit Metadata
Column: Selected columns: MedianValue (Launch column selector)
?: Floating point; DateTime; TimeSpan; Integer
?: Unchanged; Make Categorical; Make Uncategorical
Fields: 5

Correct Answer:
- Floating point
- Unchanged

Alternate Answer: YES
- Floating point
- Make Categorical

Box 1: Floating point - Floating point is needed for the Median values. Scenario: An initial investigation shows that the datasets are identical in structure apart from the MedianValue column. The smaller Paris dataset contains the MedianValue in text format, whereas the larger London dataset contains the MedianValue in numerical format.
Box 2: Unchanged - Note: Select the Categorical option to specify that the values in the selected columns should be treated as categories. For example, you might have a column that contains the numbers 0, 1 and 2, but know that the numbers actually mean "Smoker", "Non smoker" and "Unknown". In that case, by flagging the column as categorical you can ensure that the values are not used in numeric calculations, only to group data.

HOTSPOT - You need to configure the Filter Based Feature Selection module based on the experiment requirements and datasets. How should you configure the module properties? To answer, select the appropriate options in the dialog box in the answer area. NOTE: Each correct selection is worth one point.

Hot Area: Filter Based Feature Selection
Feature scoring method: Fisher Score; Chi-squared; Mutual information; Counts
Target column: MedianValue; AvgRoomsInHouse
Number of desired features: 1

Correct Answer:
- Mutual information
- MedianValue

Box 1: Mutual information - The mutual information score is particularly useful in feature selection because it maximizes the mutual information between the joint distribution and target variables in datasets with many dimensions.
Box 2: MedianValue - MedianValue is the target column; it is the predictor of the dataset.
Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/filter-based-feature-selection
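The mutual information score itself is a short formula: I(X;Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x)p(y))). A self-contained sketch for discrete variables, illustrating the score the module computes per feature against the target (not the module's own code):

```python
# Self-contained sketch of the mutual information score for two discrete
# variables. Dependent variables score high; independent ones score ~0.
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in nats from empirical joint and marginal frequencies."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p_x * p_y) simplifies to c*n / (px[x]*py[y]).
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])    # = ln 2
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # = 0
```

Perfectly dependent columns give I = ln 2 here, while independent columns give exactly zero, which is why the score ranks informative features highly.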

HOTSPOT - You need to replace the missing data in the AccessibilityToHighway columns. How should you configure the Clean Missing Data module? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point.

Hot Area: Clean Missing Data
Columns to be cleaned: AccessibilityToHighway (Launch column selector)
Minimum missing value ratio: 0
Maximum missing value ratio: 1
Cleaning mode - Replace with: MICE; Mean; Median; Mode
Cols with all missing values: Propagate; (remaining options not visible)

Correct Answer:
- Replace using MICE
- Propagate

Box 1: Replace using MICE - For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as "Multivariate Imputation using Chained Equations" or "Multiple Imputation by Chained Equations". With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.
Scenario: The AccessibilityToHighway column in both datasets contains missing values. The missing data must be replaced with new data so that it is modeled conditionally using the other variables in the data before filling in the missing values.
Box 2: Propagate - Cols with all missing values indicates whether columns of all missing values should be preserved in the output.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data
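The core idea of MICE, modeling a column with missing values conditionally on other columns, can be shown with a drastically simplified single-pass sketch: one least-squares regression fills the gaps. Real MICE chains many such models and iterates; the column values and helper name below are invented for illustration:

```python
# Drastically simplified illustration of the idea behind MICE: fit a
# conditional model (here one y = a + b*x regression) on the observed
# rows, then fill the missing entries from that model. Real MICE chains
# many such models over all incomplete variables and iterates.

def impute_conditionally(target, predictor):
    """Fill None entries of target from a least-squares fit on observed rows."""
    obs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    n = len(obs)
    mean_x = sum(x for x, _ in obs) / n
    mean_y = sum(y for _, y in obs) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in obs)
         / sum((x - mean_x) ** 2 for x, _ in obs))
    a = mean_y - b * mean_x
    return [y if y is not None else a + b * x
            for x, y in zip(predictor, target)]

highway_access = [1.0, None, 3.0, 4.0]   # column with a missing value
rooms = [2.0, 4.0, 6.0, 8.0]             # conditioning column
filled = impute_conditionally(highway_access, rooms)
```

The missing value is imputed from its relationship with the other column rather than from a plain column mean, which is the property the scenario requires.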

DRAG DROP - You need to produce a visualization for the diagnostic test evaluation according to the data visualization requirements. Which three modules should you recommend be used in sequence? To answer, move the appropriate modules from the list of modules to the answer area and arrange them in the correct order.

Select and Place: Modules
- Score Matchbox Recommender
- Apply Transformation
- Evaluate Recommender
- Evaluate Model
- Train Model
- Sweep Clustering
- Score Model
- Load Trained Model

Correct Answer:
- Sweep Clustering
- Train Model
- Evaluate Model

Alternate Answer: ?
- Train Model
- Score Model
- Evaluate Model
(https://docs.microsoft.com/en-us/azure/machine-learning/classic/evaluate-model-performance)

Step 1: Sweep Clustering - Start by using the "Tune Model Hyperparameters" module to select the best sets of parameters for each of the models we're considering. One of the interesting things about the "Tune Model Hyperparameters" module is that it not only outputs the results from the tuning, it also outputs the trained model.
Step 2: Train Model
Step 3: Evaluate Model
Scenario: You need to provide the test results to the Fabrikam Residences team. You create data visualizations to aid in presenting the results. You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test evaluation of the model. You need to select appropriate methods for producing the ROC curve in Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class Decision Jungle modules with one another.
Reference: http://breaking-bi.blogspot.com/2017/01/azure-machine-learning-model-evaluation.html
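An ROC curve like the one Evaluate Model draws is just a set of (false positive rate, true positive rate) points, one per score threshold. A minimal sketch of one such point (labels and scores are made up for the example):

```python
# Minimal sketch of how one ROC curve point is computed: at a score
# threshold, plot true positive rate against false positive rate.
# Sweeping the threshold over all scores traces the full curve.

def roc_point(labels, scores, threshold):
    """Return (false_positive_rate, true_positive_rate) at a threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return fp / neg, tp / pos

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
point = roc_point(labels, scores, threshold=0.5)   # only 0.8 passes
```

Comparing two models (e.g. Decision Forest vs. Decision Jungle) means plotting both models' point sets on the same axes, which is what connecting two scored datasets to Evaluate Model produces.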

HOTSPOT - You need to configure the Permutation Feature Importance module for the model training requirements. What should you do? To answer, select the appropriate options in the dialog box in the answer area. NOTE: Each correct selection is worth one point.

Hot Area:
Random seed: 0; 500
?: Root Mean Square Error; R-squared; Mean Zero One Error; Mean Absolute Error

Correct Answer:
- 500
- Mean Absolute Error

Alternate Answer: 500; should be RMSE. YES (https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/ - RMSE indicates the absolute fit of the model to the data, i.e. how close the observed data points are to the model's predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. See: "You must determine the absolute fit of the model".)
Box 1: 500 - For Random seed, type a value to use as seed for randomization. If you specify 0 (the default), a number is generated based on the system clock. A seed value is optional, but you should provide a value if you want reproducibility across runs of the same experiment. Here we must replicate the findings.
Box 2: Mean Absolute Error - Scenario: Given a trained model and a test dataset, you must compute the Permutation Feature Importance scores of feature variables. You need to set up the Permutation Feature Importance module to select the correct metric to investigate the model's accuracy and replicate the findings.
Regression metrics - choose one of the following: Precision, Recall, Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error, Relative Squared Error, Coefficient of Determination.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/permutation-feature-importance
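The absolute-vs-relative distinction in the alternate answer is easy to make concrete: RMSE is expressed in the units of the target, while R-squared is a unitless fraction of variance explained. A self-contained sketch with toy numbers:

```python
# Definitions of the two metrics debated above. RMSE is an absolute
# measure of fit (same units as the target); R-squared is relative
# (fraction of variance explained). The data below are toy numbers.
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
```

On these numbers RMSE is about 0.158 (in target units) and R-squared is 0.98 (unitless), which is exactly the "absolute vs relative fit" contrast the cited article draws.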

You need to visually identify whether outliers exist in the Age column and quantify the outliers before the outliers are removed. Which three Azure Machine Learning Studio modules should you use? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.
A. Create Scatterplot
B. Summarize Data
C. Clip Values
D. Replace Discrete Values
E. Build Counting Transform

Correct Answer: ABC
A: One way to quickly identify outliers visually is to create scatter plots.
B: To get a global view, the Summarize Data module can be used. Add the module and connect it to the dataset that needs to be visualized.
C: The easiest way to treat outliers in Azure ML is to use the Clip Values module. It can identify and optionally replace data values that are above or below a specified threshold. This is useful when you want to remove outliers or replace them with a mean, a constant, or another substitute value.
Reference: https://blogs.msdn.microsoft.com/azuredev/2017/05/27/data-cleansing-tools-in-azure-machine-learning/ https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clip-values
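What Clip Values does with a threshold pair can be sketched in one line of Python: values outside the bounds are replaced, here with the threshold itself (the module can also substitute a mean or constant). The Age values are invented for the example:

```python
# Sketch of the Clip Values behavior: identify values above/below a
# specified threshold and replace them -- here by clipping to the
# threshold. The module can also substitute a mean or a constant.

def clip_values(values, lower, upper):
    """Clamp each value into the [lower, upper] range."""
    return [min(max(v, lower), upper) for v in values]

ages = [12.0, 35.0, 150.0, -4.0]           # 150 and -4 are outliers
clipped = clip_values(ages, lower=0.0, upper=100.0)
```

The two outliers are pulled to the nearest bound while in-range rows pass through untouched.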

You need to select a feature extraction method. Which method should you use?
A. Mutual information
B. Mood's median test
C. Kendall correlation
D. Permutation Feature Importance

Correct Answer: C
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities. It is a supported method of Azure Machine Learning's feature selection modules.
Note: Both Spearman's and Kendall's can be formulated as special cases of a more general correlation coefficient, and they are both appropriate in this scenario.
Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-selection-modules
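Kendall's tau measures ordinal association as the normalized difference between concordant and discordant pairs, which takes only a few lines to sketch (assuming no ties, for simplicity):

```python
# Pure-Python sketch of Kendall's tau for untied data: count pairs that
# agree in ordering (concordant) vs. disagree (discordant), then
# normalize by the total number of pairs n*(n-1)/2.

def kendall_tau(xs, ys):
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

tau_agree = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])    # all concordant
tau_oppose = kendall_tau([1, 2, 3, 4], [40, 30, 20, 10])   # all discordant
```

Identical orderings give +1 and exactly reversed orderings give -1, showing why tau is a non-parametric (rank-based) statistic, as the experiment requirements demand.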

You need to select a feature extraction method. Which method should you use?
A. Mutual information
B. Pearson's correlation
C. Spearman correlation
D. Fisher Linear Discriminant Analysis

Correct Answer: C
Spearman's rank correlation coefficient assesses how well the relationship between two variables can be described using a monotonic function.
Note: Both Spearman's and Kendall's can be formulated as special cases of a more general correlation coefficient, and they are both appropriate in this scenario.
Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.
Incorrect Answers:
B: The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not).
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-selection-modules
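The "Pearson on ranks" identity mentioned in the incorrect-answer note can be demonstrated directly: rank both variables, compute Pearson on the ranks, and a monotonic but non-linear relationship still scores 1.0. A self-contained sketch (assumes no tied values):

```python
# Sketch of Spearman correlation as Pearson correlation of ranks.
# A monotonic but non-linear relationship (y = x**3) still scores 1.0,
# which is what distinguishes Spearman from plain Pearson.

def ranks(values):
    """1-based ranks, assuming no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

rho = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 8.0, 27.0, 64.0])
```

Pearson on the raw cubic data would be below 1.0, but the ranks are identical in both columns, so Spearman reports a perfect monotonic relationship.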

