DP-100 Data Science Questions Topic 2


You are building a recurrent neural network to perform a binary classification. You review the training loss, validation loss, training accuracy, and validation accuracy for each training epoch. You need to analyze model performance. You need to identify whether the classification model is overfitted. Which of the following is correct? A. The training loss stays constant and the validation loss stays on a constant value and close to the training loss value when training the model. B. The training loss decreases while the validation loss increases when training the model. C. The training loss stays constant and the validation loss decreases when training the model. D. The training loss increases while the validation loss decreases when training the model.

Correct Answer: B. An overfit model is one where performance on the training set is good and continues to improve, whereas performance on the validation set improves to a point and then begins to degrade. Reference: https://machinelearningmastery.com/diagnose-overfitting-underfitting-lstm-models/
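
The pattern described in answer B can be checked from the per-epoch loss histories. The following is a minimal sketch, assuming the losses are available as plain Python lists; the function name first_overfit_epoch, the patience parameter, and the sample numbers are hypothetical and not part of any Azure or deep learning API.

```python
def first_overfit_epoch(train_loss, val_loss, patience=2):
    """Return the index of the epoch that starts a run of `patience`
    consecutive epochs in which training loss falls while validation
    loss rises, or None if no such run exists."""
    streak = 0
    for epoch in range(1, min(len(train_loss), len(val_loss))):
        train_improving = train_loss[epoch] < train_loss[epoch - 1]
        val_degrading = val_loss[epoch] > val_loss[epoch - 1]
        streak = streak + 1 if (train_improving and val_degrading) else 0
        if streak == patience:
            # Report the epoch where the divergence started.
            return epoch - patience + 1
    return None


# Made-up loss curves: training loss keeps falling, validation loss
# bottoms out at epoch 2 and rises from epoch 3 onward.
train = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32]
val = [0.95, 0.80, 0.72, 0.74, 0.79, 0.85]
print(first_overfit_epoch(train, val))
```

A larger patience value makes the check less sensitive to a single noisy epoch.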

You plan to run a Python script as an Azure Machine Learning experiment. The script contains the following code: import os, argparse, glob from azureml.core import Run parser = argparse.ArgumentParser() parser.add_argument('--input-data', type=str, dest='data_folder') args = parser.parse_args() data_path = args.data_folder file_paths = glob.glob(data_path + "/*.jpg") You must specify a file dataset as an input to the script. The dataset consists of multiple large image files and must be streamed directly from its source. You need to write code to define a ScriptRunConfig object for the experiment and pass the ds dataset as an argument. Which code segment should you use? A. arguments = ['--input-data', ds.to_pandas_dataframe()] B. arguments = ['--input-data', ds.as_mount()] C. arguments = ['--data-data', ds] D. arguments = ['--input-data', ds.as_download()]

Correct Answer: A Alternate Answer: B, YES (1. the data are images, not tabular; 2. the data are large, so they should be mounted rather than downloaded; as_mount() streams data from the data source). The explanation supplied with answer A applies to tabular data only: If you have structured data not yet registered as a dataset, create a TabularDataset and use it directly in your training script for your local or remote experiment. To load the TabularDataset into a pandas DataFrame: df = dataset.to_pandas_dataframe() Note: TabularDataset represents data in a tabular format created by parsing the provided file or list of files. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-with-datasets

You are solving a classification task. You must evaluate your model on a limited data sample by using k-fold cross-validation. You start by configuring a k parameter as the number of splits. You need to configure the k parameter for the cross-validation. Which value should you use? A. k=1 B. k=10 C. k=0.5 D. k=0.9

Correct Answer: B. The number of splits must be an integer of at least 2, which rules out the other options. Leave-one-out (LOO) cross-validation: setting K = n (the number of observations) yields n-fold cross-validation, a special case of the K-fold approach. LOO CV is sometimes useful but typically doesn't shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance. This is why the usual choice is K=5 or 10, which provides a good compromise for the bias-variance tradeoff.
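
To make the role of k concrete, here is a minimal sketch of how k-fold splitting partitions sample indices. It is written in plain Python for illustration and is not the scikit-learn or Azure implementation; the function name k_fold_indices and the sample size of 25 are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds and return a list of
    (train_indices, validation_indices) pairs, one pair per fold."""
    if not isinstance(k, int) or k < 2 or k > n_samples:
        raise ValueError("k must be an integer between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i, val_idx in enumerate(folds):
        # Training indices are every fold except the held-out one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, val_idx))
    return splits


splits = k_fold_indices(n_samples=25, k=10)
print(len(splits))   # one split per fold
print(splits[0][1])  # validation indices of the first fold
```

Calling the function with k=1, k=0.5, or k=0.9 raises ValueError, which mirrors why those options cannot be a number of splits.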

You plan to run a Python script as an Azure Machine Learning experiment. The script must read files from a hierarchy of folders. The files will be passed to the script as a dataset argument. You must specify an appropriate mode for the dataset argument. Which two modes can you use? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. to_pandas_dataframe() B. as_download() C. as_upload() D. as_mount()

Correct Answer: B, D Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py

You are creating a machine learning model. You need to identify outliers in the data. Which two visualizations can you use? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. Venn diagram B. Box plot C. ROC curve D. Random forest diagram E. Scatter plot

Correct Answer: BE A box plot can be used to display outliers. Another way to quickly identify outliers visually is to create scatter plots. Reference: https://blogs.msdn.microsoft.com/azuredev/2017/05/27/data-cleansing-tools-in-azure-machine-learning/
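
A box plot usually draws as outliers the points that fall beyond the whiskers. The sketch below applies the common 1.5 x IQR whisker rule numerically, assuming that convention; the function name iqr_outliers and the sample values are made up for illustration.

```python
import statistics


def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside [Q1 - whisker*IQR, Q3 + whisker*IQR],
    the range a conventional box plot covers with its whiskers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one obviously extreme reading.
readings = [10, 12, 11, 13, 12, 11, 14, 13, 12, 95]
print(iqr_outliers(readings))
```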

You create and register a model in an Azure Machine Learning workspace. You must use the Azure Machine Learning SDK to implement a batch inference pipeline that uses a ParallelRunStep to score input data using the model. You must specify a value for the ParallelRunConfig compute_target setting of the pipeline step. You need to create the compute target. Which class should you use? A. BatchCompute B. AdlaCompute C. AmlCompute D. AksCompute

Correct Answer: C Compute target to use for ParallelRunStep. This parameter may be specified as a compute target object or the string name of a compute target in the workspace. The compute_target value is an AmlCompute object or a string. Note: An Azure Machine Learning Compute (AmlCompute) is a managed-compute infrastructure that allows you to easily create a single or multi-node compute. The compute is created within your workspace region as a resource that can be shared with other users. Reference: https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute(class)

You use the following code to define the steps for a pipeline: from azureml.core import Workspace, Experiment, Run from azureml.pipeline.core import Pipeline from azureml.pipeline.steps import PythonScriptStep ws = Workspace.from_config() . . . step1 = PythonScriptStep(name="step1", ...) step2 = PythonScriptStep(name="step2", ...) pipeline_steps = [step1, step2] You need to add code to run the steps. Which two code segments can you use to achieve this goal? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. experiment = Experiment(workspace=ws, name='pipeline-experiment') run = experiment.submit(config=pipeline_steps) B. run = Run(pipeline_steps) C. pipeline = Pipeline(workspace=ws, steps=pipeline_steps) experiment = Experiment(workspace=ws, name='pipeline-experiment') run = experiment.submit(pipeline) D. pipeline = Pipeline(workspace=ws, steps=pipeline_steps) run = pipeline.submit(experiment_name='pipeline-experiment')

Correct Answer: CD After you define your steps, you build the pipeline by using some or all of those steps. Example: # Build the pipeline pipeline1 = Pipeline(workspace=ws, steps=[compare_models]) # Submit the pipeline to be run pipeline_run1 = Experiment(ws, 'Compare_Models_Exp').submit(pipeline1) Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-machine-learning-pipelines

HOTSPOT - You create an Azure Databricks workspace and a linked Azure Machine Learning workspace. You have the following Python code segment in the Azure Machine Learning workspace: import mlflow import mlflow.azureml import azureml.mlflow import azureml.core from azureml.core import Workspace subscription_id = 'subscription_id' resource_group = 'resource_group_name' workspace_name = 'workspace_name' ws = Workspace.get(name=workspace_name, subscription_id=subscription_id, resource_group=resource_group) experimentName = "/Users/{user_name}/{experiment_folder}/{experiment_name}" mlflow.set_experiment(experimentName) uri = ws.get_mlflow_tracking_uri() mlflow.set_tracking_uri(uri) Instructions: For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - A resource group and Azure Machine Learning workspace will be created - An Azure Databricks experiment will be tracked only in the Azure Machine Learning workspace - The epoch loss metric is set to be tracked

Correct Answer: No, Yes, Yes Box 1: No - The Workspace.get method loads an existing workspace without using configuration files. ws = Workspace.get(name="myworkspace", subscription_id='<azure-subscription-id>', resource_group='myresourcegroup') Box 2: Yes - MLflow Tracking with Azure Machine Learning lets you store the logged metrics and artifacts from your local runs into your Azure Machine Learning workspace. The get_mlflow_tracking_uri() method assigns a unique tracking URI address to the workspace, ws, and set_tracking_uri() points the MLflow tracking URI to that address. Box 3: Yes - Note: In Deep Learning, epoch means the total dataset is passed forward and backward in a neural network once. Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow

DRAG DROP - You previously deployed a model that was trained using a tabular dataset named training-dataset, which is based on a folder of CSV files. Over time, you have collected the features and predicted labels generated by the model in a folder containing a CSV file for each month. You have created two tabular datasets based on the folder containing the inference data: one named predictions-dataset with a schema that matches the training data exactly, including the predicted label; and another named features-dataset with a schema containing all of the feature columns and a timestamp column based on the filename, which includes the day, month, and year. You need to create a data drift monitor to identify any changing trends in the feature data since the model was trained. To accomplish this, you must define the required datasets for the data drift monitor. Which datasets should you use to configure the data drift monitor? To answer, drag the appropriate datasets to the correct data drift monitor options. Each source may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Target datasets: - training-dataset; predictions-dataset; features-dataset Answer Area: Baseline dataset: Target dataset:

Correct Answer: training-dataset; predictions-dataset Alternate Answer: training-dataset; features-dataset, YES ("target_dataset: Required. Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series." (https://docs.microsoft.com/en-us/python/api/azureml-datadrift/azureml.datadrift.datadriftdetector(class)?view=azure-ml-py)); in this scenario only features-dataset has a timestamp column. Box 1: training-dataset - Baseline dataset - usually the training dataset for a model. Box 2: predictions-dataset - Target dataset - usually model input data - is compared over time to your baseline dataset. This comparison means that your target dataset must have a timestamp column specified. The monitor will compare the baseline and target datasets. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-monitor-datasets

DRAG DROP - You create a multi-class image classification deep learning experiment by using the PyTorch framework. You plan to run the experiment on an Azure Compute cluster that has nodes with GPU's. You need to define an Azure Machine Learning service pipeline to perform the monthly retraining of the image classification model. The pipeline must run with minimal cost and minimize the time required to train the model. Which three pipeline steps should you run in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: - Configure a DataTransferStep() to fetch new image data from public web portal, running on the cpu-compute compute target - Configure an EstimatorStep() to run an estimator that runs the bird_classifier_train.py model training script on the gpu_compute compute target - Configure a PythonScriptStep() to run both image_fetcher.py and image_resize.py on the cpu-compute compute target - Configure an EstimatorStep() to run an estimator that runs the bird_classifier_train.py model training script on the cpu_compute compute target - Configure a PythonScriptStep() to run image_fetcher.py on the cpu-compute compute target - Configure a PythonScriptStep() to run image_resize.py on the cpu-compute compute target - Configure a PythonScriptStep() to run bird_classifier_train.py on the cpu-compute compute target - Configure a PythonScriptStep() to run bird_classifier_train.py on the gpu-compute compute target

Correct Answer: - Configure a DataTransferStep() to fetch new image data from public web portal, running on the cpu-compute compute target - Configure a PythonScriptStep() to run image_resize.py on the cpu-compute compute target - Configure an EstimatorStep() to run an estimator that runs the bird_classifier_train.py model training script on the gpu_compute compute target Step 1: Configure a DataTransferStep() to fetch new image data... Step 2: Configure a PythonScriptStep() to run image_resize.py on the cpu-compute compute target. Step 3: Configure the EstimatorStep() to run the training script on the gpu_compute compute target. The PyTorch estimator provides a simple way of launching a PyTorch training job on a compute target. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-pytorch

DRAG DROP - You have a dataset that contains over 150 features. You use the dataset to train a Support Vector Machine (SVM) binary classifier. You need to use the Permutation Feature Importance module in Azure Machine Learning Studio to compute a set of feature importance scores for the dataset. In which order should you perform the actions? To answer, move all actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: - Add a Two-Class Support Vector Machine module to initialize the SVM classifier - Set the Metric for measuring performance property to Classification - Accuracy and then run the experiment - Add a Permutation Feature Importance module and connect the trained model and test dataset - Add a dataset to the experiment - Add a Split Data module to create training and test datasets

Correct Answer: - Add a Two-Class Support Vector Machine module to initialize the SVM classifier - Add a dataset to the experiment - Add a Split Data module to create training and test datasets - Add a Permutation Feature Importance module and connect the trained model and test dataset - Set the Metric for measuring performance property to Classification - Accuracy and then run the experiment Alternate Answer: Add dataset, Add split, Add Two Class, Add Permutation, Set Accuracy Step 1: Add a Two-Class Support Vector Machine module to initialize the SVM classifier. Step 2: Add a dataset to the experiment Step 3: Add a Split Data module to create training and test dataset. To generate a set of feature scores requires that you have an already trained model, as well as a test dataset. Step 4: Add a Permutation Feature Importance module and connect to the trained model and test dataset. Step 5: Set the Metric for measuring performance property to Classification - Accuracy and then run the experiment. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/two-class-support-vector-machine https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/permutation-feature-importance

DRAG DROP - You are creating a machine learning model that can predict the species of a penguin from its measurements. You have a file that contains measurements for three species of penguin in comma-delimited format. The model must be optimized for area under the receiver operating characteristic curve performance metric, averaged for each class. You need to use the Automated Machine Learning user interface in Azure Machine Learning studio to run an experiment and find the best performing model. Which five actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: - Create and select a new dataset by uploading the comma-delimited file of penguin data - Configure the automated machine learning run by selecting the experiment name, target column, and compute target - Set the Primary metric configuration setting to Accuracy - Set the Classification task type - Select the Regression task type - Run the automated machine learning experiment and review the results - Set the Primary metric configuration setting to AUC Weighted

Correct Answer: - Create and select a new dataset by uploading the comma-delimited file of penguin data - Set the Classification task type - Set the Primary metric configuration setting to Accuracy - Configure the automated machine learning run by selecting the experiment name, target column, and compute target - Run the automated machine learning experiment and review the results Alternate Answer: all responses YES (create dataset > configure experiment > select classification > select metric (AUC) > run experiment). AUC Weighted is correct as shown here: https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-first-experiment-automated-ml Step 1: Create and select a new dataset by uploading the comma-delimited file of penguin data. Step 2: Select the Classification task type. Step 3: Set the Primary metric configuration setting to Accuracy. The available metrics you can select are determined by the task type you choose. Primary metrics for classification scenarios: Post-thresholded metrics, like accuracy, average_precision_score_weighted, norm_macro_recall, and precision_score_weighted may not optimize as well for datasets which are very small, have very large class skew (class imbalance), or when the expected metric value is very close to 0.0 or 1.0. In those cases, AUC_weighted can be a better choice for the primary metric. Step 4: Configure the automated machine learning run by selecting the experiment name, target column, and compute target. Step 5: Run the automated machine learning experiment and review the results. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train

DRAG DROP - You have an Azure Machine Learning workspace that contains a CPU-based compute cluster and an Azure Kubernetes Service (AKS) inference cluster. You create a tabular dataset containing data that you plan to use to create a classification model. You need to use the Azure Machine Learning designer to create a web service through which client applications can consume the classification model by submitting new data and getting an immediate prediction as a response. Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: - Create and run a batch inference pipeline on the compute cluster - Deploy a real-time endpoint on the inference cluster - Create and run a real-time inference pipeline on the compute cluster - Create and run a training pipeline that prepares the data and trains a classification model on the compute cluster - Use the automated ML user interface to train a classification model on the compute cluster - Create and start a Compute Instance

Correct Answer: - Create and start a Compute Instance - Create and run a training pipeline that prepares the data and trains a classification model on the compute cluster - Create and run a real-time inference pipeline on the compute cluster Alternate Answer: YES - Create and run a training pipeline - Create and run a real-time inference pipeline. - Deploy a real-time endpoint. Step 1: Create and start a Compute Instance To train and deploy models using Azure Machine Learning designer, you need compute on which to run the training process, test the model, and host the model in a deployed service. There are four kinds of compute resource you can create: Compute Instances: Development workstations that data scientists can use to work with data and models. Compute Clusters: Scalable clusters of virtual machines for on-demand processing of experiment code. Inference Clusters: Deployment targets for predictive services that use your trained models. Attached Compute: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters. Step 2: Create and run a training pipeline. After you've used data transformations to prepare the data, you can use it to train a machine learning model. Create and run a training pipeline Step 3: Create and run a real-time inference pipeline After creating and running a pipeline to train the model, you need a second pipeline that performs the same data transformations for new data, and then uses the trained model to inference (in other words, predict) label values based on its features. This pipeline will form the basis for a predictive service that you can publish for applications to use. Reference: https://docs.microsoft.com/en-us/learn/modules/create-classification-model-azure-machine-learning-designer/

DRAG DROP - You create a training pipeline using the Azure Machine Learning designer. You upload a CSV file that contains the data from which you want to train your model. You need to use the designer to create a pipeline that includes steps to perform the following tasks: ✑ Select the training features using the pandas filter method. ✑ Train a model based on the naive_bayes.GaussianNB algorithm. ✑ Return only the Scored Labels column by using the query ✑ SELECT [Scored Labels] FROM t1; Which modules should you use? To answer, drag the appropriate modules to the appropriate locations. Each module name may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Create Python Model Train Model Two Class Neural Network Execute Python Script Apply SQL Transformation Select Column in Dataset Answer Area - image of ML Designer nodes ---- training-data ---____ _____ -- Split Data Train Model ---- Score Model ----____

Correct Answer: ---- training-data ---"Two Class Neural Network" "Execute Python Script" -- Split Data Train Model ---- Score Model ----"Select Columns in Dataset" Alternate Answer: YES, use this (top to bottom): Select Columns in Dataset, Create Python Model, Execute Python Script; checked against this link: https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/create-python-model {But there's a Microsoft link (second reference) to a diagram with the following options (top to bottom):} Box 1: Two-Class Neural Network - The Two-Class Neural Network creates a binary classifier using a neural network algorithm. Train a model based on the naive_bayes.GaussianNB algorithm. Box 2: Execute Python Script - Select the training features using the pandas filter method. Box 3: Select Columns in Dataset - Return only the Scored Labels column by using the query SELECT [Scored Labels] FROM t1; Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/two-class-neural-network https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/create-python-model

HOTSPOT - You have a feature set containing the following numerical features: X, Y, and Z. The Poisson (Pearson - sic) correlation coefficient (r-value) of X, Y, and Z features is shown in the following image:
     X           Y           Z
X    1           0.149676    -0.106276
Y    0.149676    1           0.859122
Z    -0.106276   0.859122    1
Use the drop-down menus to select the answer choice that answers each question based on the information presented in the graphic. NOTE: Each correct selection is worth one point. Hot Area: What is the r-value for the correlation of Y to Z? : -0.106276 : 0.149676 : 0.859122 : 1 Which type of relationship exists between Z and Y in the feature set? : a positive linear relationship : a negative linear relationship : no linear relationship

Correct Answer: 0.859122, a positive linear relationship Box 1: 0.859122 - Box 2: a positive linear relationship. An r-value of +1 indicates a strong positive linear relationship, -1 indicates a strong negative linear relationship, and 0 denotes no linear relationship between the two variables. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/compute-linear-correlation
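
The r-values in the matrix can be reproduced from raw columns with the Pearson formula. The following is a minimal sketch in plain Python; the underlying X, Y, and Z data are not given in the question, so the sample lists below are made up purely to show the +1 and -1 extremes, and pearson_r is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative data only: a perfectly increasing and a perfectly decreasing pair.
a = [1, 2, 3, 4, 5]
r_up = pearson_r(a, [2, 4, 6, 8, 10])
r_down = pearson_r(a, [10, 8, 6, 4, 2])
print(r_up, r_down)
```

A value such as 0.859122 sits close to the +1 end of this scale, which is why the Y to Z relationship is read as positive and linear.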

HOTSPOT - You create an experiment in Azure Machine Learning Studio. You add a training dataset that contains 10,000 rows. The first 9,000 rows represent class 0 (90 percent). The remaining 1,000 rows represent class 1 (10 percent). The training set is imbalanced between the two classes. You must increase the number of training examples for class 1 to 4,000 by using 5 data rows. You add the Synthetic Minority Oversampling Technique (SMOTE) module to the experiment. You need to configure the module. Which values should you use? To answer, select the appropriate options in the dialog box in the answer area. NOTE: Each correct selection is worth one point. Hot Area: SMOTE percentage: 0, 300, 3000, 4000 Number of nearest neighbors: 0, 1, 5, 4000

Correct Answer: 300, 5 Box 1: 300 - If you type 300 (%), the module adds synthetic minority cases equal to 300 percent of the original count (3,000 new rows), so class 1 grows from 1,000 to 4,000 rows. Box 2: 5 - We should use 5 data rows. Use the Number of nearest neighbors option to determine the size of the feature space that the SMOTE algorithm uses when building new cases. A nearest neighbor is a row of data (a case) that is very similar to some target case. The distance between any two cases is measured by combining the weighted vectors of all features. By increasing the number of nearest neighbors, you get features from more cases. By keeping the number of nearest neighbors low, you use features that are more like those in the original sample. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote
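
The arithmetic behind Box 1 can be written down as a short calculation. This is a sketch under the reading used in the answer above, namely that the percentage expresses the synthetic rows added relative to the original minority count; smote_percentage is a hypothetical helper, not a function of the SMOTE module.

```python
def smote_percentage(current_minority, desired_minority):
    """Percentage to enter so that original + synthetic rows equals the
    desired minority count, assuming percentage = synthetic / original * 100."""
    if current_minority <= 0 or desired_minority < current_minority:
        raise ValueError("desired count must be >= a positive current count")
    synthetic_needed = desired_minority - current_minority
    return 100 * synthetic_needed / current_minority


# Class 1 has 1,000 rows and must grow to 4,000 rows.
print(smote_percentage(1000, 4000))
```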

HOTSPOT - You are tuning a hyperparameter for an algorithm. The following table shows a data set with different hyperparameter, training error, and validation error values.
Hyperparameter (H)   Training error (TE)   Validation error (VE)
1                    105                   95
2                    200                   85
3                    250                   100
4                    105                   100
5                    400                   50
Use the drop-down menus to select the answer choice that answers each question based on the information presented in the graphic. Hot Area: Which H value should you select based on the data? 1, 2, 3, 4, 5 What H value displays the poorest training result? 1, 2, 3, 4, 5

Correct Answer: 4, 5 Box 1: 4 - Choose the value that has low training and validation errors that are also the closest match, which minimizes variance (the difference between validation error and training error). Box 2: 5 - H=5 has the highest training error (400), so it is the poorest training result. Reference: https://medium.com/comet-ml/organizing-machine-learning-projects-project-management-guidelines-2d2b85651bbd
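
The two selections can be reproduced from the table with a few lines of Python. This sketch assumes that "closest match" means the smallest absolute gap between training and validation error, which is one reading of the explanation above rather than a rule stated in the question.

```python
# Table from the question: H -> (training error, validation error).
results = {
    1: (105, 95),
    2: (200, 85),
    3: (250, 100),
    4: (105, 100),
    5: (400, 50),
}

# Assumed criterion: smallest |training error - validation error|.
best_h = min(results, key=lambda h: abs(results[h][0] - results[h][1]))

# Poorest training result: the largest training error.
worst_training_h = max(results, key=lambda h: results[h][0])

print(best_h, worst_training_h)
```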

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are using Azure Machine Learning to run an experiment that trains a classification model. You want to use Hyperdrive to find parameters that optimize the AUC metric for the model. You configure a HyperDriveConfig for the experiment by running the following code: hyperdrive = HyperDriveConfig(estimator=your_estimator, hyperparameter_sampling=your_parms, policy=policy, primary_metric_name='AUC', primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, max_total_runs=6, max_concurrent_runs=4) You plan to use this configuration to run a script that trains a random forest model and then tests it with validation data. The label values for the validation data are stored in a variable named y_test, and the predicted probabilities from the model are stored in a variable named y_predicted. You need to add logging to the script to allow Hyperdrive to optimize hyperparameters for the AUC metric. Solution: Run the following code: from sklearn.metrics import roc_auc_score import logging # code to train model omitted auc = roc_auc_score(y_test, y_predicted) logging.info("AUC: " + str(auc)) Does the solution meet the goal? A. Yes B. No

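Correct Answer: A Alternate Answer: B (logging.info writes only to the driver logs; for Hyperdrive to read the primary metric, the script must log a metric named AUC to the run, for example with run.log('AUC', auc)). Python printing/logging example: logging.info(message) Destination: Driver logs, Azure Machine Learning designer Reference: YES https://docs.microsoft.com/en-us/azure/machine-learning/how-to-debug-pipelines - under "Logging options and Behavior"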

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are creating a new experiment in Azure Machine Learning Studio. One class has a much smaller number of observations than the other classes in the training set. You need to select an appropriate data sampling strategy to compensate for the class imbalance. Solution: You use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode. Does the solution meet the goal? A. Yes B. No

Correct Answer: A SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote
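
The difference between SMOTE and plain duplication is that each new case is interpolated between an existing minority case and one of its nearest minority neighbors. The following is a minimal sketch of that idea in plain Python, assuming Euclidean distance and uniform random interpolation; it is not the Studio module's implementation, and smote_sample and the sample points are hypothetical.

```python
import math
import random


def smote_sample(minority, k=5, n_new=1, seed=0):
    """Create n_new synthetic points, each placed on the line segment between
    a randomly chosen minority point and one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbors of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up two-feature minority cases.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2), (1.8, 2.1), (2.2, 1.9)]
new_points = smote_sample(minority_points, k=5, n_new=4)
print(len(new_points))
```

Because every synthetic point lies between two real minority cases, the new rows stay inside the region the minority class already occupies.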

You write five Python scripts that must be processed in the order specified in Exhibit A (https://www.examtopics.com/exams/microsoft/dp-100/view/12/) which allows the same modules to run in parallel, but will wait for modules with dependencies. You must create an Azure Machine Learning pipeline using the Python SDK, because you want the script that creates the pipeline to be tracked in your version control system. You have created five PythonScriptStep steps and have named the variables to match the module names. You need to create the pipeline shown. Assume all relevant imports have been done. Which Python code segment should you use? A. p = Pipeline(ws, steps=[[[[step_1_a, step_1_b], step_2_a], step_2_b], step_3]) B. pipeline_steps = { "Pipeline": { "run": step_3, "run_after": step_2_a, {"run": step_2_a, "run_after": [{"run": step_1_a},{"run": step_1_b}] }}} p = Pipeline(ws, steps=pipeline_steps) C. step_2_a.run_after(step_1_b) step_2_a.run_after(step_1_a) step_3.run_after(step_2_b) step_3.run_after(step_2_a) p = Pipeline(ws, steps=[step_2]) D. p = Pipeline(ws, steps=[step_1_a, step_1_b, step_2_a, step_2_b, step_3])

Correct Answer: A The steps parameter is an array of steps. To build pipelines that have multiple steps, place the steps in order in this array. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-parallel-run-step https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.stepsequence?view=azure-ml-py

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. An IT department creates the following Azure resource groups and resources: - ml_resources: an Azure Machine Learning workspace named amlworkspace, an Azure Storage account named amlworkspace12345, an Application Insights instance named amlworkspace54321, an Azure Key Vault named amlworkspace67890, an Azure Container Registry named amlworkspace09876 - general_compute: A virtual machine named mlvm with the following configuration: Operating system: Ubuntu Linux, Software installed: Python 3.6 and Jupyter Notebooks, Size: NC6 (6 vCPUs, 1 vGPU, 56 GB RAM) The IT department creates an Azure Kubernetes Service (AKS)-based inference compute target named aks-cluster in the Azure Machine Learning workspace. You have a Microsoft Surface Book computer with a GPU. Python 3.6 and Visual Studio Code are installed. You need to run a script that trains a deep neural network (DNN) model and logs the loss and accuracy metrics. Solution: Attach the mlvm virtual machine as a compute target in the Azure Machine Learning workspace. Install the Azure ML SDK on the Surface Book and run Python code to connect to the workspace. Run the training script as an experiment on the mlvm remote compute resource. Does the solution meet the goal? A. Yes B. No

Correct Answer: A Use the VM as a compute target. Note: A compute target is a designated compute resource/environment where you run your training script or host your service deployment. This location may be your local machine or a cloud-based compute resource. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target

You are performing feature engineering on a dataset. You must add a feature named CityName and populate the column value with the text London. You need to add the new feature to the dataset. Which Azure Machine Learning Studio module should you use? A. Edit Metadata B. Filter Based Feature Selection C. Execute Python Script D. Latent Dirichlet Allocation

Correct Answer: A Alternate Answer: C? Typical metadata changes might include marking columns as features. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/edit-metadata

You plan to use automated machine learning to train a regression model. You have data that has features which have missing values, and categorical features with few distinct values. You need to configure automated machine learning to automatically impute missing values and encode categorical features as part of the training task. Which parameter and value pair should you use in the AutoMLConfig class? A. featurization = 'auto' B. enable_voting_ensemble = True C. task = 'classification' D. exclude_nan_labels = True E. enable_tf = True

Correct Answer: A Featurization str or FeaturizationConfig Values: 'auto' / 'off' / FeaturizationConfig Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used. Column type is automatically detected. Based on the detected column type preprocessing/featurization is done as follows: Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values. Numeric: Impute missing values, cluster distance, weight of evidence. DateTime: Several features such as day, seconds, minutes, hours etc. Text: Bag of words, pre-trained Word embedding, text target encoding. Reference: https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. from azureml.core import Run import pandas as pd run = Run.get_context() data = pd.read_csv('data.csv') label_vals = data['label'].unique() # Add code to record metrics here run.complete() You plan to use a Python script to run an Azure Machine Learning experiment. The script creates a reference to the experiment run context, loads data from a file, identifies the set of unique values for the label column, and completes the experiment run: The experiment must record the unique labels in the data as metrics for the run that can be reviewed later. You must add code to the script to record the unique label values as run metrics at the point indicated by the comment. Solution: Replace the comment with the following code: run.log('Label Values', label_val) Does the solution meet the goal? A. Yes B. No

Correct Answer: A The run.log method is called once for each value in label_vals: for label_val in label_vals: run.log('Label Values', label_val) Reference: https://www.element61.be/en/resource/azure-machine-learning-services-complete-toolbox-ai
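The loop in the answer can be sketched without the Azure ML SDK. FakeRun below is a hypothetical stand-in for azureml.core.Run that only records calls, and the label values are made up for illustration; in a real script the run object would come from Run.get_context().

```python
class FakeRun:
    """Hypothetical stand-in for azureml.core.Run; it only records log calls."""

    def __init__(self):
        self.metrics = {}

    def log(self, name, value):
        # Repeated calls under the same name accumulate a list of values.
        self.metrics.setdefault(name, []).append(value)


run = FakeRun()
label_vals = ["cat", "dog", "bird"]  # assumed unique labels, for illustration only

# One run.log call per unique label value.
for label_val in label_vals:
    run.log("Label Values", label_val)

print(run.metrics)
```

Logging each value separately under one metric name is what lets the values be reviewed later as a list for that run.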

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create a model to forecast weather conditions based on historical data. You need to create a pipeline that runs a processing script to load data from a datastore and pass the processed data to a machine learning model training script. Solution: Run the following code: from azureml.train.sklearn import SKLearn sk_est = SKLearn(source_directory='./scripts', compute_target=aml_compute, entry_script='train.py') Does the solution meet the goal? A. Yes B. No

Correct Answer: A The scikit-learn estimator provides a simple way of launching a scikit-learn training job on a compute target. It is implemented through the SKLearn class, which can be used to support single-node CPU training. Example: from azureml.train.sklearn import SKLearn estimator = SKLearn(source_directory=project_folder, compute_target=compute_target, entry_script='train_iris.py') Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-scikit-learn

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create a model to forecast weather conditions based on historical data. You need to create a pipeline that runs a processing script to load data from a datastore and pass the processed data to a machine learning model training script. Solution: Run the following code: data_store = Datastore.get(ws, "ml-data") data_input = DataReference( datastore = data_store, data_reference_name = "training_data", path_on_datastore = "train/data.txt") data_output = PipelineData("processed_data", datastore=data_store) process_step = PythonScriptStep(script_name="process.py", arguments=["--data", data_input], outputs=[data_output], compute_target=aml_compute, source_directory=train_directory) train_step = PythonScriptStep(script_name="train.py", arguments=["--data", data_output], inputs=[data_output], compute_target=aml_compute, source_directory=train_directory) pipeline = Pipeline(workspace=ws, steps=[process_step, train_step]) Does the solution meet the goal? A. Yes B. No

Correct Answer: A The two steps are present: process_step and train_step Data_input correctly references the data in the data store. Note: Data used in pipeline can be produced by one step and consumed in another step by providing a PipelineData object as an output of one step and an input of one or more subsequent steps. PipelineData objects are also used when constructing Pipelines to describe step dependencies. To specify that a step requires the output of another step as input, use a PipelineData object in the constructor of both steps. For example, the pipeline train step depends on the process_step_output output of the pipeline process step: from azureml.pipeline.core import Pipeline, PipelineData from azureml.pipeline.steps import PythonScriptStep datastore = ws.get_default_datastore() process_step_output = PipelineData("processed_data", datastore=datastore) process_step = PythonScriptStep(script_name="process.py", arguments=["--data_for_train", process_step_output], outputs=[process_step_output], compute_target=aml_compute, source_directory=process_directory) train_step = PythonScriptStep(script_name="train.py", arguments=["--data_for_train", process_step_output], inputs=[process_step_output], compute_target=aml_compute, source_directory=train_directory) pipeline = Pipeline(workspace=ws, steps=[process_step, train_step]) Reference: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create an Azure Machine Learning service datastore in a workspace. The datastore contains the following files: ✑ /data/2018/Q1.csv ✑ /data/2018/Q2.csv ✑ /data/2018/Q3.csv ✑ /data/2018/Q4.csv ✑ /data/2019/Q1.csv All files store data in the following format: id,f1,f2,I 1,1,2,0 2,1,1,1 3,2,1,0 4,2,2,1 You run the following code: data_store = Datastore.register_azure_blob_container(workspace=ws , datastore_name= 'data_store' , container_name= 'quarterly_data' , account_name='companydata' , account_key='NRPxk8duxbM3...' , create_if_not_exists=False) You need to create a dataset named training_data and load the data from all files into a single data frame by using the following code: data_frame = training_data.to_pandas_dataframe() Solution: Run the following code: from azureml.core import Dataset paths = [(data_store, 'data/2018/%.csv'), (data_store, 'data/2019/%.csv')] training_data = Dataset.Tabular.from_delimited_files(paths) Does the solution meet the goal? A. Yes B. No

Correct Answer: A Use two file paths. Use Dataset.Tabular.from_delimited_files because the data isn't cleansed. Note: A TabularDataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame so you can work with familiar data preparation and training libraries without having to leave your notebook. You can create a TabularDataset object from .csv, .tsv, .parquet, .jsonl files, and from SQL query results. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets

You create a Python script that runs a training experiment in Azure Machine Learning. The script uses the Azure Machine Learning SDK for Python. You must add a statement that retrieves the names of the logs and outputs generated by the script. You need to reference a Python class object from the SDK for the statement. Which class object should you use? A. Run B. ScriptRunConfig C. Workspace D. Experiment

Correct Answer: A A run represents a single trial of an experiment. Runs are used to monitor the asynchronous execution of a trial, log metrics and store output of the trial, and to analyze results and access artifacts generated by the trial. The Run class get_file_names method lists the files that are stored in association with the run, and its get_all_logs method downloads all logs for the run to a directory. Incorrect Answers: B: A ScriptRunConfig packages together the configuration information needed to submit a run in Azure ML, including the script, compute target, environment, and any distributed job-specific configs. Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)

You are training machine learning models in Azure Machine Learning. You use Hyperdrive to tune the hyperparameters. In previous model training and tuning runs, many models showed similar performance. You need to select an early termination policy that meets the following requirements: ✑ accounts for the performance of all previous runs when evaluating the current run ✑ avoids comparing the current run with only the best performing run to date Which two early termination policies should you use? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. Median stopping B. Bandit C. Default D. Truncation selection

Correct Answer: AC Alternate Answer: AD YES (https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.truncationselectionpolicy?view=azure-ml-py) The Median Stopping policy computes running averages across all runs and cancels runs whose best performance is worse than the median of the running averages. If no policy is specified, the hyperparameter tuning service will let all training runs execute to completion. Incorrect Answers: B: BanditPolicy defines an early termination policy based on slack criteria, and a frequency and delay interval for evaluation. The Bandit policy takes the following configuration parameters: slack_factor: The amount of slack allowed with respect to the best performing training run. This factor specifies the slack as a ratio. D: The Truncation selection policy periodically cancels the given percentage of runs that rank the lowest for their performance on the primary metric. The policy strives for fairness in ranking the runs by accounting for improving model performance with training time. When ranking a relatively young run, the policy uses the corresponding (and earlier) performance of older runs for comparison. Therefore, runs aren't terminated for having a lower performance because they have run for less time than other runs. Reference: https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.medianstoppingpolicy https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.truncationselectionpolicy https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy
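The median stopping rule described above can be sketched in plain Python. This is a simplified illustration of the idea and not the Hyperdrive implementation; the function name and the sample metric values are assumptions, and a higher metric is taken to be better.

```python
from statistics import mean, median


def median_stopping(history, current):
    """Return True if the current run should be cancelled.

    history: metric values (higher is better) reported so far by each earlier run.
    current: metric values reported so far by the run under evaluation.
    """
    # Running average of every earlier run, so all runs influence the decision.
    running_averages = [mean(values) for values in history if values]
    if not running_averages or not current:
        return False
    # Cancel when the run's best value is worse than the median running average.
    return max(current) < median(running_averages)


previous_runs = [[0.60, 0.70, 0.80], [0.50, 0.55, 0.60], [0.70, 0.75, 0.80]]
print(median_stopping(previous_runs, [0.40, 0.50]))
print(median_stopping(previous_runs, [0.60, 0.78]))
```

Because the threshold is a median over every earlier run, the current run is never compared with only the single best run, which is the contrast with the Bandit policy drawn in the answer.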

You are performing clustering by using the K-means algorithm. You need to define the possible termination conditions. Which three conditions can you use? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. Centroids do not change between iterations. B. The residual sum of squares (RSS) rises above a threshold. C. The residual sum of squares (RSS) falls below a threshold. D. A fixed number of iterations is executed. E. The sum of distances between centroids reaches a maximum.

Correct Answer: ACD AD: The algorithm terminates when the centroids stabilize or when a specified number of iterations are completed. C: A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid summed over all vectors. RSS is the objective function and our goal is to minimize it. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/k-means-clustering https://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html
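The three termination conditions can be seen together in a toy one-dimensional K-means. This is a minimal sketch with made-up points and starting centroids, not the Azure Machine Learning Studio module; the function and parameter names are illustrative.

```python
def kmeans_1d(points, centroids, max_iter=100, rss_threshold=0.0):
    """Toy 1-D k-means that reports which termination condition fired."""
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # RSS: squared distance of each point from its centroid, summed over all points.
        rss = sum((p - new_centroids[i]) ** 2 for i, c in enumerate(clusters) for p in c)
        if new_centroids == centroids:
            return new_centroids, rss, "centroids unchanged"
        if rss < rss_threshold:
            return new_centroids, rss, "RSS below threshold"
        centroids = new_centroids
    return centroids, rss, "iteration limit reached"


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_1d(points, [0.0, 5.0]))
print(kmeans_1d(points, [0.0, 5.0], rss_threshold=5.0))
print(kmeans_1d(points, [0.0, 5.0], max_iter=1))
```

Each call stops for a different reason: stable centroids (A), RSS falling below the threshold (C), and the fixed iteration budget (D).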

No pictures: look here, #73: https://www.examtopics.com/exams/microsoft/dp-100/view/26/ You create a pipeline in designer to train a model that predicts automobile prices. Because of non-linear relationships in the data, the pipeline calculates the natural log (Ln) of the prices in the training data, trains a model to predict this natural log of price value, and then calculates the exponential of the scored label to get the predicted price. The training pipeline is shown in the exhibit. (Click the Training pipeline tab.) >> picture << You create a real-time inference pipeline from the training pipeline, as shown in the exhibit. (Click the Real-time pipeline tab.) Real-time pipeline - >> picture << You need to modify the inference pipeline to ensure that the web service returns the exponential of the scored label as the predicted automobile price and that client applications are not required to include a price value in the input values. Which three modifications must you make to the inference pipeline? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. Connect the output of the Apply SQL Transformation to the Web Service Output module. B. Replace the Web Service Input module with a data input that does not include the price column. C. Add a Select Columns module before the Score Model module to select all columns other than price. D. Replace the training dataset module with a data input that does not include the price column. E. Remove the Apply Math Operation module that replaces price with its natural log from the data flow. F. Remove the Apply SQL Transformation module from the data flow.

Correct Answer: ACE https://docs.microsoft.com/en-us/learn/modules/create-regression-model-azure-machine-learning-designer/inference-pipeline?ns-enrollment-type=LearningPath&ns-enrollment-id=learn.wwl.create-no-code-predictive-models-with-azure-machine-learning

You are analyzing a dataset containing historical data from a local taxi company. You are developing a regression model. You must predict the fare of a taxi trip. You need to select performance metrics to correctly evaluate the regression model. Which two metrics can you use? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. a Root Mean Square Error value that is low B. an R-Squared value close to 0 C. an F1 score that is low D. an R-Squared value close to 1 E. an F1 score that is high F. a Root Mean Square Error value that is high

Correct Answer: AD RMSE and R2 are both metrics for regression models. A: Root mean squared error (RMSE) creates a single value that summarizes the error in the model. By squaring the difference, the metric disregards the difference between over-prediction and under-prediction. D: Coefficient of determination, often referred to as R2, represents the predictive power of the model as a value between 0 and 1. Zero means the model is random (explains nothing); 1 means there is a perfect fit. However, caution should be used in interpreting R2 values, as low values can be entirely normal and high values can be suspect. Incorrect Answers: C, E: F-score is used for classification models, not for regression models. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/evaluate-model
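Both metrics are short formulas. The sketch below computes them by hand on made-up taxi fares (the numbers are assumptions chosen for easy arithmetic, not data from the question).

```python
from math import sqrt


def rmse(actual, predicted):
    # Square each error, average, then take the root.
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot


fares = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(round(rmse(fares, predicted), 3))       # 2.55
print(round(r_squared(fares, predicted), 3))  # 0.948
```

A low RMSE together with an R-squared close to 1 is the pattern options A and D describe.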

You create a multi-class image classification deep learning model that uses the PyTorch deep learning framework. You must configure Azure Machine Learning Hyperdrive to optimize the hyperparameters for the classification model. You need to define a primary metric to determine the hyperparameter values that result in the model with the best accuracy score. Which three actions must you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. Set the primary_metric_goal of the estimator used to run the bird_classifier_train.py script to maximize. B. Add code to the bird_classifier_train.py script to calculate the validation loss of the model and log it as a float value with the key loss. C. Set the primary_metric_goal of the estimator used to run the bird_classifier_train.py script to minimize. D. Set the primary_metric_name of the estimator used to run the bird_classifier_train.py script to accuracy. E. Set the primary_metric_name of the estimator used to run the bird_classifier_train.py script to loss. F. Add code to the bird_classifier_train.py script to calculate the validation accuracy of the model and log it as a float value with the key accuracy.

Correct Answer: ADF AD: primary_metric_name="accuracy", primary_metric_goal=PrimaryMetricGoal.MAXIMIZE Optimize the runs to maximize "accuracy". Make sure to log this value in your training script. Note: primary_metric_name: The name of the primary metric to optimize. The name of the primary metric needs to exactly match the name of the metric logged by the training script. primary_metric_goal: It can be either PrimaryMetricGoal.MAXIMIZE or PrimaryMetricGoal.MINIMIZE and determines whether the primary metric will be maximized or minimized when evaluating the runs. F: The training script calculates the val_accuracy and logs it as "accuracy", which is used as the primary metric.

You run a script as an experiment in Azure Machine Learning. You have a Run object named run that references the experiment run. You must review the log files that were generated during the experiment run. You need to download the log files to a local folder for review. Which two code segments can you run to achieve this goal? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. run.get_details() B. run.get_file_names() C. run.get_metrics() D. run.download_files(output_directory='./runfiles') E. run.get_all_logs(destination='./runlogs')

Correct Answer: AE Alternative Answer: DE YES (all responses) (Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#azureml-core-run-get-all-logs and https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#azureml-core-run-download-files) The run Class get_all_logs method downloads all logs for the run to a directory. The run Class get_details gets the definition, status information, current log files, and other details of the run. Incorrect Answers: B: The run get_file_names list the files that are stored in association with the run. Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)

HOTSPOT - You are using C-Support Vector classification to do a multi-class classification with an unbalanced training dataset. The C-Support Vector classification uses the Python code shown below: from sklearn.svm import SVC import numpy as np svc = SVC(kernel='linear', class_weight='balanced', C=1.0, random_state=0) model1 = svc.fit(X_train, y) You need to evaluate the C-Support Vector classification code. Which evaluation statement should you use? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: class_weight=balanced - Automatically select the performance metrics for the classification - Automatically adjust weights directly proportional to class frequencies in the input data - Automatically adjust weights inversely proportional to class frequencies in the input data C parameter - Penalty parameter - Degree of polynomial kernel function - Size of the kernel cache

Correct Answer: Automatically adjust weights inversely proportional to class frequencies in the input data; Penalty parameter Box 1: Automatically adjust weights inversely proportional to class frequencies in the input data The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Box 2: Penalty parameter - Parameter: C : float, optional (default=1.0) Penalty parameter C of the error term. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
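The formula n_samples / (n_classes * np.bincount(y)) can be worked through by hand. The sketch below reproduces it with the standard library instead of NumPy; the label list is a made-up example and the helper name is illustrative.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


y = [0, 0, 0, 0, 0, 0, 1, 1, 2]  # assumed labels: class 0 is common, class 2 is rare
print(balanced_class_weights(y))  # {0: 0.5, 1: 1.5, 2: 3.0}
```

The rarest class receives the largest weight, which is what "inversely proportional to class frequencies" means.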

You create a datastore named training_data that references a blob container in an Azure Storage account. The blob container contains a folder named csv_files in which multiple comma-separated values (CSV) files are stored. You have a script named train.py in a local folder named ./script that you plan to run as an experiment using an estimator. The script includes the following code to read data from the csv_files folder: import os, argparse, pandas as pd from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from azureml.core import Run run = Run.get_context() parser = argparse.ArgumentParser() parser.add_argument('--data-folder', type=str, dest='data_folder', help='data reference') args = parser.parse_args() data_folder = args.data_folder csv_files = os.listdir(data_folder) training_data = pd.concat((pd.read_csv(os.path.join(data_folder,csv_file)) for csv_file in csv_files)) You have the following script. from azureml.core import Workspace, Datastore, Experiment from azureml.train.sklearn import SKLearn ws = Workspace.from_config() exp = Experiment(workspace=ws, name='csv_training') ds = Datastore.get(ws, datastore_name='training_data') data_ref = ds.path('csv_files') # code to define estimator goes here run = exp.submit(config=estimator) run.wait_for_completion(show_output=True) You need to configure the estimator for the experiment so that the script can read the data from a data reference named data_ref that references the csv_files folder in the training_data datastore. Which code should you use to configure the estimator? A. estimator=SKLearn..., inputs= [data_ref.as_named_input('data-folder').to_pandas_dataframe()], compute_target=..., entry_script=...) B. script_parms = {'--data-folder': data_ref.as_mount()} estimator=SKLearn..., script_params=..., compute_target=..., entry_script=... C. estimator=SKLearn..., inputs=[data_ref.as_named_input('data-folder').as_mount()], compute_target=..., entry_script=...) D. script_parms = {'--data-folder': data_ref.as_download(path_on_compute='csv_files')} estimator=SKLearn..., script_params=..., compute_target=..., entry_script=... E. estimator=SKLearn..., inputs=[data_ref.as_named_input('data-folder').as_download(path_on_compute='csv_files')], compute_target=..., entry_script=...)

Correct Answer: B Besides passing the dataset through the input parameters in the estimator, you can also pass the dataset through script_params and get the data path (mounting point) in your training script via arguments. This way, you can keep your training script independent of azureml-sdk. In other words, you will be able to use the same training script for local debugging and remote training on any cloud platform. Example: from azureml.train.sklearn import SKLearn script_params = { # mount the dataset on the remote compute and pass the mounted path as an argument to the training script '--data-folder': mnist_ds.as_named_input('mnist').as_mount(), '--regularization': 0.5} est = SKLearn(source_directory=script_folder, script_params=script_params, compute_target=compute_target, environment_definition=env, entry_script='train_mnist.py') # Run the experiment run = experiment.submit(est) run.wait_for_completion(show_output=True) Incorrect Answers: A: Pandas DataFrame not used. Reference: https://docs.microsoft.com/es-es/azure/machine-learning/how-to-train-with-datasets
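The point about keeping the training script independent of azureml-sdk can be checked locally. The sketch below shows only the script side: it parses --data-folder with argparse and never imports the SDK. The /tmp/csv_files path is a made-up stand-in for whatever mount path the remote compute would pass.

```python
import argparse


def parse_args(argv=None):
    """Parse the same --data-folder argument the estimator passes via script_params."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-folder", type=str, dest="data_folder", help="data reference")
    return parser.parse_args(argv)


# Locally the path is supplied by hand; remotely it would be the mounted path.
args = parse_args(["--data-folder", "/tmp/csv_files"])
print(args.data_folder)
```

The argument name on the command line must match the key used in script_params exactly, including the hyphen.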

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are creating a new experiment in Azure Machine Learning Studio. One class has a much smaller number of observations than the other classes in the training set. You need to select an appropriate data sampling strategy to compensate for the class imbalance. Solution: You use the Stratified split for the sampling mode. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Instead use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode. Note: SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote
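The difference between SMOTE and plain duplication is that new cases are interpolated between existing minority cases. The sketch below is a deliberately simplified version that always uses the single nearest neighbour (real SMOTE picks among k nearest neighbours); the function name and sample points are assumptions for illustration.

```python
import random


def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points between minority samples and their nearest neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest other minority sample by squared Euclidean distance.
        neighbor = min(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )
        t = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0)]
for point in smote_like(minority, 5):
    print(point)
```

Every synthetic point lies on a line segment between two real minority cases, so the new cases are similar to the existing ones without being exact copies.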

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. An IT department creates the following Azure resource groups and resources: - ml_resources: an Azure Machine Learning workspace named amlworkspace, an Azure Storage account named amlworkspace12345, an Application Insights instance named amlworkspace54321, an Azure Key Vault named amlworkspace67890, an Azure Container Registry named amlworkspace09876 - general_compute: A virtual machine named mlvm with the following configuration: Operating system: Ubuntu Linux, Software installed: Python 3.6 and Jupyter Notebooks, Size: NC6 (6 vCPUs, 1 vGPU, 56 GB RAM) The IT department creates an Azure Kubernetes Service (AKS)-based inference compute target named aks-cluster in the Azure Machine Learning workspace. You have a Microsoft Surface Book computer with a GPU. Python 3.6 and Visual Studio Code are installed. You need to run a script that trains a deep neural network (DNN) model and logs the loss and accuracy metrics. Solution: Install the Azure ML SDK on the Surface Book. Run Python code to connect to the workspace. Run the training script as an experiment on the aks-cluster compute target. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Need to attach the mlvm virtual machine as a compute target in the Azure Machine Learning workspace. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. An IT department creates the following Azure resource groups and resources: - ml_resources: an Azure Machine Learning workspace named amlworkspace, an Azure Storage account named amlworkspace12345, an Application Insights instance named amlworkspace54321, an Azure Key Vault named amlworkspace67890, an Azure Container Registry named amlworkspace09876 - general_compute: A virtual machine named mlvm with the following configuration: Operating system: Ubuntu Linux, Software installed: Python 3.6 and Jupyter Notebooks, Size: NC6 (6 vCPUs, 1 vGPU, 56 GB RAM) The IT department creates an Azure Kubernetes Service (AKS)-based inference compute target named aks-cluster in the Azure Machine Learning workspace. You have a Microsoft Surface Book computer with a GPU. Python 3.6 and Visual Studio Code are installed. You need to run a script that trains a deep neural network (DNN) model and logs the loss and accuracy metrics. Solution: Install the Azure ML SDK on the Surface Book. Run Python code to connect to the workspace and then run the training script as an experiment on local compute. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Need to attach the mlvm virtual machine as a compute target in the Azure Machine Learning workspace. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target

You are creating a classification model for a banking company to identify possible instances of credit card fraud. You plan to create the model in Azure Machine Learning by using automated machine learning. The training dataset that you are using is highly unbalanced. You need to evaluate the classification model. Which primary metric should you use? A. normalized_mean_absolute_error B. AUC_weighted C. accuracy D. normalized_root_mean_squared_error E. spearman_correlation

Correct Answer: B AUC_weighted is a Classification metric. Note: AUC is the Area under the Receiver Operating Characteristic Curve. Weighted is the arithmetic mean of the score for each class, weighted by the number of true instances in each class. Incorrect Answers: A: normalized_mean_absolute_error is a regression metric, not a classification metric. C: When comparing approaches to imbalanced classification problems, consider using metrics beyond accuracy such as recall, precision, and AUROC. It may be that switching the metric you optimize for during parameter selection or model selection is enough to provide desirable performance detecting the minority class. D: normalized_root_mean_squared_error is a regression metric, not a classification metric. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml
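Why accuracy misleads on unbalanced data can be shown with a model that never flags fraud. The sketch below uses plain binary AUC computed from pairwise comparisons, not AutoML's AUC_weighted, and the 98-to-2 class split is a made-up example.

```python
def accuracy(labels, predictions):
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0] * 98 + [1] * 2      # assumed data: 2 fraud cases in 100 transactions
always_negative = [0] * 100      # a model that never predicts fraud
constant_scores = [0.0] * 100    # the same model expressed as scores

print(accuracy(labels, always_negative))  # 0.98
print(auc(labels, constant_scores))       # 0.5
```

The useless model scores 98% accuracy but only 0.5 AUC, the same as random guessing, which is why an AUC-based metric is preferred here.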

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create an Azure Machine Learning service datastore in a workspace. The datastore contains the following files: ✑ /data/2018/Q1.csv ✑ /data/2018/Q2.csv ✑ /data/2018/Q3.csv ✑ /data/2018/Q4.csv ✑ /data/2019/Q1.csv All files store data in the following format: id,f1,f2,I 1,1,2,0 2,1,1,1 3,2,1,0 4,2,2,1 You run the following code: data_store = Datastore.register_azure_blob_container(workspace=ws , datastore_name= 'data_store' , container_name= 'quarterly_data' , account_name='companydata' , account_key='NRPxk8duxbM3...' , create_if_not_exists=False) You need to create a dataset named training_data and load the data from all files into a single data frame by using the following code: data_frame = training_data.to_pandas_dataframe() Solution: Run the following code: from azureml.core import Dataset paths = (data_store, 'data/%/%.csv') training_data = Dataset.Tabular.from_delimited_files(paths) Does the solution meet the goal? A. Yes B. No

Correct Answer: B (an alternate answer of A has been suggested). The paths value in this solution is a single tuple with a wildcard pattern, which differs from the documented tabular example that passes a list of two file paths. Define paths with two file paths instead. Use Dataset.Tabular.from_delimited_files, as the data isn't cleansed. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets

You use the Azure Machine Learning service to create a tabular dataset named training_data. You plan to use this dataset in a training script. You create a variable that references the dataset using the following code: training_ds = workspace.datasets.get("training_data") You define an estimator to run the script. You need to set the correct property of the estimator to ensure that your script can access the training_data dataset. Which property should you set? A. environment_definition = {"training_data":training_ds} B. inputs = [training_ds.as_named_input('training_ds')] C. script_params = {"--training_ds":training_ds} D. source_directory = training_ds

Correct Answer: B Example:
# Get the training dataset
diabetes_ds = ws.datasets.get("Diabetes Dataset")
# Create an estimator that uses the remote compute
hyper_estimator = SKLearn(source_directory=experiment_folder,
    inputs=[diabetes_ds.as_named_input('diabetes')],  # Pass the dataset as an input
    compute_target=cpu_cluster,
    conda_packages=['pandas','ipykernel','matplotlib'],
    pip_packages=['azureml-sdk','argparse','pyarrow'],
    entry_script='diabetes_training.py')
Reference: https://notebooks.azure.com/GraemeMalcolm/projects/azureml-primers/html/04%20-%20Optimizing%20Model%20Training.ipynb

You register a file dataset named csv_folder that references a folder. The folder includes multiple comma-separated values (CSV) files in an Azure storage blob container. You plan to use the following code to run a script that loads data from the file dataset. You create and instantiate the following variables: remote_cluster: References the Azure Machine Learning compute cluster ws: References the Azure Machine Learning workspace You have the following code: from azureml.train.estimator import Estimator file_dataset = ws.datasets.get('csv_folder') estimator = Estimator(source_directory=script_folder, >>------------<< compute_target = remote_cluster, entry_script = 'script.py') run = experiment.submit(config=estimator) run.wait_for_completion(show_output=True) You need to pass the dataset to ensure that the script can read the files it references. Which code segment should you insert to replace the code comment? A. inputs=[file_dataset.as_named_input('training_files')], B. inputs=[file_dataset.as_named_input('training_files').as_mount()], C. inputs=[file_dataset.as_named_input('training_files').to_pandas_dataframe()], D. script_params={'--training_files': file_dataset},

Correct Answer: B Example:
from azureml.train.estimator import Estimator
script_params = {
    # to mount files referenced by mnist dataset
    '--data-folder': mnist_file_dataset.as_named_input('mnist_opendataset').as_mount(),
    '--regularization': 0.5}
est = Estimator(source_directory=script_folder,
    script_params=script_params,
    compute_target=compute_target,
    environment_definition=env,
    entry_script='train.py')
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-train-models-with-aml

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are using Azure Machine Learning to run an experiment that trains a classification model. You want to use Hyperdrive to find parameters that optimize the AUC metric for the model. You configure a HyperDriveConfig for the experiment by running the following code: hyperdrive = HyperDriveConfig(estimator=your_estimator, hyperparameter_sampling=your_parms, policy=policy, primary_metric_name='AUC', primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, max_total_runs=6, max_concurrent_runs=4) You plan to use this configuration to run a script that trains a random forest model and then tests it with validation data. The label values for the validation data are stored in a variable named y_test, and the predicted probabilities from the model are stored in a variable named y_predicted. You need to add logging to the script to allow Hyperdrive to optimize hyperparameters for the AUC metric. Solution: Run the following code: import json, os from sklearn.metrics import roc_auc_score # code to train model omitted auc = roc_auc_score(y_test, y_predicted) os.makedirs("outputs", exist_ok = True) with open("outputs/AUC.txt", "w") as file_cur: file_cur.write(auc) Does the solution meet the goal? A. Yes B. No

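Correct Answer: B Explanation - Writing the AUC value to a file in the outputs folder does not record it as a run metric. Hyperdrive reads the primary metric from the metrics logged by each child run, so the script must get the run context and log the value under the same name as primary_metric_name, for example run.log('AUC', auc). Note: Python printing/logging example: logging.info(message) Destination: Driver logs, Azure Machine Learning designer Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-debug-pipelines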

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are using Azure Machine Learning to run an experiment that trains a classification model. You want to use Hyperdrive to find parameters that optimize the AUC metric for the model. You configure a HyperDriveConfig for the experiment by running the following code: hyperdrive = HyperDriveConfig(estimator=your_estimator, hyperparameter_sampling=your_parms, policy=policy, primary_metric_name='AUC', primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, max_total_runs=6, max_concurrent_runs=4) You plan to use this configuration to run a script that trains a random forest model and then tests it with validation data. The label values for the validation data are stored in a variable named y_test, and the predicted probabilities from the model are stored in a variable named y_predicted. You need to add logging to the script to allow Hyperdrive to optimize hyperparameters for the AUC metric. Solution: Run the following code: import numpy as np from sklearn.metrics import roc_auc_score # code to train model omitted auc = roc_auc_score(y_test, y_predicted) print(np.float(auc)) Does the solution meet the goal? A. Yes B. No

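Correct Answer: B Explanation - Printing the AUC value sends it to the driver logs only; it does not record a run metric. Hyperdrive reads the primary metric from the metrics logged by each child run, so the script must get the run context and log the value under the same name as primary_metric_name, for example run.log('AUC', auc). Note: Python printing/logging example: logging.info(message) Destination: Driver logs, Azure Machine Learning designer Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-debug-pipelines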

You create a multi-class image classification deep learning model that uses a set of labeled images. You create a script file named train.py that uses the PyTorch1.3 framework to train the model. You must run the script by using an estimator. The code must not require any additional Python libraries to be installed in the environment for the estimator. The time required for model training must be minimized. You need to define the estimator that will be used to run the script. Which estimator type should you use? A. TensorFlow B. PyTorch C. SKLearn D. Estimator

Correct Answer: B For PyTorch, TensorFlow and Chainer tasks, Azure Machine Learning provides respective PyTorch, TensorFlow, and Chainer estimators to simplify using these frameworks. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-ml-models

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are creating a new experiment in Azure Machine Learning Studio. One class has a much smaller number of observations than the other classes in the training set. You need to select an appropriate data sampling strategy to compensate for the class imbalance. Solution: You use the Principal Components Analysis (PCA) sampling mode. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Instead, use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode. Note: SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases. Incorrect Answers: The Principal Component Analysis module in Azure Machine Learning Studio (classic) is used to reduce the dimensionality of your training data. The module analyzes your data and creates a reduced feature set that captures all the information contained in the dataset, but in a smaller number of features. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
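To show why SMOTE differs from duplicating rare cases, here is a minimal sketch of its core idea: each synthetic point is placed somewhere on the line between a minority sample and one of its nearest minority neighbours. The function name smote_sketch, its parameters, and the sample points are all hypothetical, and this is not the implementation used by the Studio SMOTE module.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class feature vectors (two features each).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print(point)
```

Because every new point lies between two existing minority samples, the synthetic cases stay inside the region the minority class already occupies instead of being exact copies.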

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You plan to use a Python script to run an Azure Machine Learning experiment. The script creates a reference to the experiment run context, loads data from a file, identifies the set of unique values for the label column, and completes the experiment run: from azureml.core import Run import pandas as pd run = Run.get_context() data = pd.read_csv('data.csv') label_vals = data['label'].unique() # Add code to record metrics here run.complete() The experiment must record the unique labels in the data as metrics for the run that can be reviewed later. You must add code to the script to record the unique label values as run metrics at the point indicated by the comment. Solution: Replace the comment with the following code: run.log_table('Label Values', label_vals) Does the solution meet the goal? A. Yes B. No

Correct Answer: B run.log_table expects a dictionary that maps column names to lists of values, not an array of labels. Instead, use the run.log method to log each value in label_vals: for label_val in label_vals: run.log('Label Values', label_val) Reference: https://www.element61.be/en/resource/azure-machine-learning-services-complete-toolbox-ai

You write a Python script that processes data in a comma-separated values (CSV) file. You plan to run this script as an Azure Machine Learning experiment. The script loads the data and determines the number of rows it contains using the following code: from azureml.core import Run import pandas as pd run = Run.get_context() data = pd.read_csv('./data.csv') rows = (len(data)) # record row_count metric here ... You need to record the row count as a metric named row_count that can be returned using the get_metrics method of the Run object after the experiment run completes. Which code should you use? A. run.upload_file('row_count', './data.csv') B. run.log('row_count', rows) C. run.tag('row_count', rows) D. run.log_table('row_count', rows) E. run.log_row('row_count', rows)

Correct Answer: B Log a numerical or string value to the run with the given name using log(name, value, description=''). Logging a metric to a run causes that metric to be stored in the run record in the experiment. You can log the same metric multiple times within a run, the result being considered a vector of that metric. Example: run.log("accuracy", 0.95) Incorrect Answers: E: Using log_row(name, description=None, **kwargs) creates a metric with multiple columns as described in kwargs. Each named parameter generates a column with the value specified. log_row can be called once to log an arbitrary tuple, or multiple times in a loop to generate a complete table. Example: run.log_row("Y over X", x=1, y=0.4) Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run
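The statement that a metric logged several times becomes a vector can be pictured with a toy stand-in. ToyRun below is a hypothetical in-memory class written only for this illustration; it is not the azureml.core.Run class, and returning a scalar for a metric logged once is an assumption about how the real get_metrics behaves.

```python
class ToyRun:
    """Hypothetical in-memory stand-in for a run record (not the Azure ML SDK)."""

    def __init__(self):
        self._metrics = {}

    def log(self, name, value):
        # Each call appends one value under the metric name.
        self._metrics.setdefault(name, []).append(value)

    def get_metrics(self):
        # Assumption: a metric logged once comes back as a scalar,
        # a metric logged repeatedly comes back as a list (a vector).
        return {
            name: values[0] if len(values) == 1 else list(values)
            for name, values in self._metrics.items()
        }


run = ToyRun()
run.log("row_count", 1500)          # logged once: single value
for label in ["a", "b", "c"]:       # logged in a loop: vector of values
    run.log("Label Values", label)

metrics = run.get_metrics()
print(metrics)
```

The same pattern explains the series questions about label values: calling log once per label under one name stores all labels under that metric.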

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create a model to forecast weather conditions based on historical data. You need to create a pipeline that runs a processing script to load data from a datastore and pass the processed data to a machine learning model training script. Solution: Run the following code: datastore = ws.get_default_datastore() data_input = PipelineData("raw_data", datastore=rawdatastore) data_output = PipelineData("processed_data", datastore=datastore) process_step = PythonScriptStep(script_name="process.py", arguments=["--data_for_train", data_input], outputs=[data_output], compute_target=aml_compute, source_directory=process_directory) train_step = PythonScriptStep(script_name="train.py", arguments=["--data_for_train", data_input], inputs=[data_output], compute_target=aml_compute, source_directory=train_directory) pipeline = Pipeline(workspace=ws, steps=[process_step, train_step]) Does the solution meet the goal? A. Yes B. No

Correct Answer: B Note: Data used in pipeline can be produced by one step and consumed in another step by providing a PipelineData object as an output of one step and an input of one or more subsequent steps. Compare with this example, the pipeline train step depends on the process_step_output output of the pipeline process step: from azureml.pipeline.core import Pipeline, PipelineData from azureml.pipeline.steps import PythonScriptStep datastore = ws.get_default_datastore() process_step_output = PipelineData("processed_data", datastore=datastore) process_step = PythonScriptStep(script_name="process.py", arguments=["--data_for_train", process_step_output], outputs=[process_step_output], compute_target=aml_compute, source_directory=process_directory) train_step = PythonScriptStep(script_name="train.py", arguments=["--data_for_train", process_step_output], inputs=[process_step_output], compute_target=aml_compute, source_directory=train_directory) pipeline = Pipeline(workspace=ws, steps=[process_step, train_step]) Reference: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create a model to forecast weather conditions based on historical data. You need to create a pipeline that runs a processing script to load data from a datastore and pass the processed data to a machine learning model training script. Solution: Run the following code: from azureml.train.dnn import TensorFlow sk_est = TensorFlow(source_directory='./scripts', compute_target=aml-compute, entry_script='train.py') Does the solution meet the goal? A. Yes B. No

Correct Answer: B The scikit-learn estimator provides a simple way of launching a scikit-learn training job on a compute target. It is implemented through the SKLearn class, which can be used to support single-node CPU training. Example: from azureml.train.sklearn import SKLearn estimator = SKLearn(source_directory=project_folder, compute_target=compute_target, entry_script='train_iris.py') Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-scikit-learn

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create a model to forecast weather conditions based on historical data. You need to create a pipeline that runs a processing script to load data from a datastore and pass the processed data to a machine learning model training script. Solution: Run the following code: from azureml.train.estimator import Estimator sk_est = Estimator(source_directory='./scripts', compute_target=aml-compute, entry_script='train.py', conda_packages=['scikit-learn']) Does the solution meet the goal? A. Yes B. No

Correct Answer: B The scikit-learn estimator provides a simple way of launching a scikit-learn training job on a compute target. It is implemented through the SKLearn class, which can be used to support single-node CPU training. Example: from azureml.train.sklearn import SKLearn estimator = SKLearn(source_directory=project_folder, compute_target=compute_target, entry_script='train_iris.py') Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-scikit-learn

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create a model to forecast weather conditions based on historical data. You need to create a pipeline that runs a processing script to load data from a datastore and pass the processed data to a machine learning model training script. Solution: Run the following code: datastore = ws.get_default_datastore() data_output = pd.read_csv("traindata.csv") process_step = PythonScriptStep(script_name="process.py", arguments=["--data_for_train", data_output], outputs=[data_output], compute_target=aml_compute, source_directory=process_directory) train_step = PythonScriptStep(script_name="train.py", arguments=["--data_for_train", data_output], inputs=[data_output], compute_target=aml_compute, source_directory=train_directory) pipeline = Pipeline(workspace=ws, steps=[process_step, train_step]) Does the solution meet the goal? A. Yes B. No

Correct Answer: B The two steps are present: process_step and train_step The training data input is not setup correctly. Note: Data used in pipeline can be produced by one step and consumed in another step by providing a PipelineData object as an output of one step and an input of one or more subsequent steps. PipelineData objects are also used when constructing Pipelines to describe step dependencies. To specify that a step requires the output of another step as input, use a PipelineData object in the constructor of both steps. For example, the pipeline train step depends on the process_step_output output of the pipeline process step: from azureml.pipeline.core import Pipeline, PipelineData from azureml.pipeline.steps import PythonScriptStep datastore = ws.get_default_datastore() process_step_output = PipelineData("processed_data", datastore=datastore) process_step = PythonScriptStep(script_name="process.py", arguments=["--data_for_train", process_step_output], outputs=[process_step_output], compute_target=aml_compute, source_directory=process_directory) train_step = PythonScriptStep(script_name="train.py", arguments=["--data_for_train", process_step_output], inputs=[process_step_output], compute_target=aml_compute, source_directory=train_directory) pipeline = Pipeline(workspace=ws, steps=[process_step, train_step]) Reference: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You have a Python script named train.py in a local folder named scripts. The script trains a regression model by using scikit-learn. The script includes code to load a training data file which is also located in the scripts folder. You must run the script as an Azure ML experiment on a compute cluster named aml-compute. You need to configure the run to ensure that the environment includes the required packages for model training. You have instantiated a variable named aml-compute that references the target compute cluster. Solution: Run the following code: from azureml.train.estimator import Estimator sk_est = Estimator(source_directory='./scripts', compute_target=aml-compute, entry_script='train.py') Does the solution meet the goal? A. Yes B. No

Correct Answer: B There is a missing line: conda_packages=['scikit-learn'], which is needed. Correct example: sk_est = Estimator(source_directory='./my-sklearn-proj', script_params=script_params, compute_target=compute_target, entry_script='train.py', conda_packages=['scikit-learn']) Note: The Estimator class represents a generic estimator to train data using any supplied framework. This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn. Example: from azureml.train.estimator import Estimator script_params = { # to mount files referenced by mnist dataset '--data-folder': ds.as_named_input('mnist').as_mount(), '--regularization': 0.8} Reference: https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator.estimator

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create an Azure Machine Learning service datastore in a workspace. The datastore contains the following files: ✑ /data/2018/Q1.csv ✑ /data/2018/Q2.csv ✑ /data/2018/Q3.csv ✑ /data/2018/Q4.csv ✑ /data/2019/Q1.csv All files store data in the following format: id,f1,f2,I 1,1,2,0 2,1,1,1 3,2,1,0 4,2,2,1 You run the following code: data_store = Datastore.register_azure_blob_container(workspace=ws , datastore_name= 'data_store' , container_name= 'quarterly_data' , account_name='companydata' , account_key='NRPxk8duxbM3...' , create_if_not_exists=False) You need to create a dataset named training_data and load the data from all files into a single data frame by using the following code: data_frame = training_data.to_pandas_dataframe() Solution: Run the following code: from azureml.core import Dataset paths = [(data_store, 'data/2018/%.csv'), (data_store, 'data/2019/%.csv')] training_data = Dataset.File.from_files(paths) Does the solution meet the goal? A. Yes B. No

Correct Answer: B Using two file paths is correct, but use Dataset.Tabular.from_delimited_files instead of Dataset.File.from_files, as the data isn't cleansed. Note: A FileDataset references single or multiple files in your datastores or public URLs. If your data is already cleansed, and ready to use in training experiments, you can download or mount the files to your compute as a FileDataset object. A TabularDataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame so you can work with familiar data preparation and training libraries without having to leave your notebook. You can create a TabularDataset object from .csv, .tsv, .parquet, .jsonl files, and from SQL query results. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets

You use the following code to run a script as an experiment in Azure Machine Learning: from azureml.core import Workspace, Experiment, Run from azureml.core import RunConfiguration, ScriptRunConfig ws = Workspace.from_config() run_config = RunConfiguration() run_config.target='local' script_config = ScriptRunConfig(source_directory='./script', script='experiment.py', run_config=run_config) experiment = Experiment(workspace=ws, name='script experiment') run = experiment.submit(config=script_config) run.wait_for_completion() You must identify the output files that are generated by the experiment run. You need to add code to retrieve the output file names. Which code segment should you add to the script? A. files = run.get_properties() B. files = run.get_file_names() C. files = run.get_details_with_logs() D. files = run.get_metrics() E. files = run.get_details()

Correct Answer: B You can list all of the files that are associated with this run record by calling run.get_file_names() Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-track-experiments

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You plan to use a Python script to run an Azure Machine Learning experiment. The script creates a reference to the experiment run context, loads data from a file, identifies the set of unique values for the label column, and completes the experiment run: from azureml.core import Run import pandas as pd run = Run.get_context() data = pd.read_csv('data.csv') label_vals = data['label'].unique() # Add code to record metrics here run.complete() The experiment must record the unique labels in the data as metrics for the run that can be reviewed later. You must add code to the script to record the unique label values as run metrics at the point indicated by the comment. Solution: Replace the comment with the following code: run.upload_file('outputs/labels.csv', './data.csv') Does the solution meet the goal? A. Yes B. No

Correct Answer: B label_vals holds the unique labels (from the statement label_vals = data['label'].unique()), and those values have to be logged as metrics; uploading the data file does not do that. Note: Instead, use the run.log method to log each value in label_vals: for label_val in label_vals: run.log('Label Values', label_val) Reference: https://www.element61.be/en/resource/azure-machine-learning-services-complete-toolbox-ai

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You create a model to forecast weather conditions based on historical data. You need to create a pipeline that runs a processing script to load data from a datastore and pass the processed data to a machine learning model training script. Solution: Run the following code: datastore = ws.get_default_datastore() data_output = PipelineData("processed_data", datastore=datastore) process_step = PythonScriptStep(script_name="process.py", arguments=["--data_for_train", data_output], outputs=[data_output], compute_target=aml_compute, source_directory=process_directory) pipeline = Pipeline(workspace=ws, steps=[process_step]) Does the solution meet the goal? A. Yes B. No

Correct Answer: B train_step is missing. Reference: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py

You create a binary classification model. You need to evaluate the model performance. Which two metrics can you use? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. relative absolute error B. precision C. accuracy D. mean absolute error E. coefficient of determination

Correct Answer: BC The evaluation metrics available for binary classification models are: Accuracy, Precision, Recall, F1 Score, and AUC. Note: A very natural question is: 'Out of the individuals whom the model predicted to be positive (TP+FP), how many were classified correctly (TP)?' This question can be answered by looking at the Precision of the model, which is the proportion of predicted positives that are classified correctly. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio/evaluate-model-performance
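The metrics named above can be computed directly from the confusion-matrix counts. The sketch below uses plain Python and a small made-up set of labels and predictions; the function name binary_metrics is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from true/false positive/negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predictions: TP=2, FN=1, FP=1, TN=4.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

results = binary_metrics(y_true, y_pred)
print(results)
```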

You plan to use the Hyperdrive feature of Azure Machine Learning to determine the optimal hyperparameter values when training a model. You must use Hyperdrive to try combinations of the following hyperparameter values: ✑ learning_rate: any value between 0.001 and 0.1 ✑ batch_size: 16, 32, or 64 You need to configure the search space for the Hyperdrive experiment. Which two parameter expressions should you use? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. a choice expression for learning_rate B. a uniform expression for learning_rate C. a normal expression for batch_size D. a choice expression for batch_size E. a uniform expression for batch_size

Correct Answer: BD B: Continuous hyperparameters are specified as a distribution over a continuous range of values. Supported distributions include: ✑ uniform(low, high) - Returns a value uniformly distributed between low and high D: Discrete hyperparameters are specified as a choice among discrete values. choice can be: one or more comma-separated values ✑ a range object ✑ any arbitrary list object Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters
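
To make the difference between the two expression types concrete, here is a small stand-in written with the standard library only. The helpers uniform, choice, and sample below are local functions that imitate the idea; they are not the azureml.train.hyperdrive functions of the same name.

```python
import random


def uniform(low, high):
    """Continuous expression: any value between low and high."""
    return lambda rng: rng.uniform(low, high)


def choice(*options):
    """Discrete expression: exactly one of the listed values."""
    return lambda rng: rng.choice(options)


# Search space matching the question: continuous learning rate,
# discrete batch size.
search_space = {
    "learning_rate": uniform(0.001, 0.1),
    "batch_size": choice(16, 32, 64),
}


def sample(space, rng):
    """Draw one hyperparameter combination from the space."""
    return {name: draw(rng) for name, draw in space.items()}


rng = random.Random(0)  # fixed seed so repeated runs draw the same values
samples = [sample(search_space, rng) for _ in range(20)]
print(samples[0])
```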

You are building a regression model for estimating the number of calls during an event. You need to determine whether the feature values achieve the conditions to build a Poisson regression model. Which two conditions must the feature set contain? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. The label data must be a negative value. B. The label data must be whole numbers. C. The label data must be non-discrete. D. The label data must be a positive value. E. The label data can be positive or negative.

Correct Answer: BD Poisson regression is intended for use in regression models that are used to predict numeric values, typically counts. Therefore, you should use this module to create your regression model only if the values you are trying to predict fit the following conditions: ✑ The response variable has a Poisson distribution. ✑ Counts cannot be negative. The method will fail outright if you attempt to use it with negative labels. ✑ A Poisson distribution is a discrete distribution; therefore, it is not meaningful to use this method with non-whole numbers. Reference:https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/poisson-regression
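
The two label conditions can be checked before training. This is a minimal sketch with a hypothetical helper name (poisson_label_problems) and made-up label values; note that it accepts zero, since a count of zero is valid for a Poisson distribution, and rejects only negative or fractional labels.

```python
def poisson_label_problems(labels):
    """List (index, value, reason) for labels unsuitable as Poisson counts."""
    problems = []
    for i, value in enumerate(labels):
        if value < 0:
            problems.append((i, value, "negative"))
        elif float(value) != int(value):
            problems.append((i, value, "not a whole number"))
    return problems


call_counts = [0, 3, 12, 7]   # assumed sample of valid count labels
bad_labels = [4, -1, 2.5]     # one negative and one fractional label

print(poisson_label_problems(call_counts))
print(poisson_label_problems(bad_labels))
```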

You use the Azure Machine Learning SDK in a notebook to run an experiment using a script file in an experiment folder. The experiment fails. You need to troubleshoot the failed experiment. What are two possible ways to achieve this goal? Each correct answer presents a complete solution. A. Use the get_metrics() method of the run object to retrieve the experiment run logs. B. Use the get_details_with_logs() method of the run object to display the experiment run logs. C. View the log files for the experiment run in the experiment folder. D. View the logs for the experiment run in Azure Machine Learning studio. E. Use the get_output() method of the run object to retrieve the experiment run logs.

Correct Answer: BD Use get_details_with_logs() to fetch the run details and logs created by the run. You can monitor Azure Machine Learning runs and view their logs with the Azure Machine Learning studio. Incorrect Answers: A: You can view the metrics of a trained model using run.get_metrics(). E: get_output() gets the output of the step as PipelineData. Reference: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun https://docs.microsoft.com/en-us/azure/machine-learning/how-to-monitor-view-training-logs

You have a comma-separated values (CSV) file containing data from which you want to train a classification model. You are using the Automated Machine Learning interface in Azure Machine Learning studio to train the classification model. You set the task type to Classification. You need to ensure that the Automated Machine Learning process evaluates only linear models. What should you do? A. Add all algorithms other than linear ones to the blocked algorithms list. B. Set the Exit criterion option to a metric score threshold. C. Clear the option to perform automatic featurization. D. Clear the option to enable deep learning. E. Set the task type to Regression.

Correct Answer: C Alternate Answer: A (supported by all community responses). Blocked algorithms are the algorithms you want to exclude from the training job, so adding all algorithms other than linear ones to the blocked algorithms list leaves only linear models to evaluate (https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-first-experiment-automated-ml). Rationale given for C: Automatic featurization can fit non-linear models. Reference: https://econml.azurewebsites.net/spec/estimation/dml.html https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-automated-ml-for-ml-models

You plan to run a script as an experiment using a Script Run Configuration. The script uses modules from the scipy library as well as several Python packages that are not typically installed in a default conda environment. You plan to run the experiment on your local workstation for small datasets and scale out the experiment by running it on more powerful remote compute clusters for larger datasets. You need to ensure that the experiment runs successfully on local and remote compute with the least administrative effort. What should you do? A. Do not specify an environment in the run configuration for the experiment. Run the experiment by using the default environment. B. Create a virtual machine (VM) with the required Python configuration and attach the VM as a compute target. Use this compute target for all experiment runs. C. Create and register an Environment that includes the required packages. Use this Environment for all experiment runs. D. Create a config.yaml file defining the conda packages that are required and save the file in the experiment folder. E. Always run the experiment with an Estimator by using the default packages.

Correct Answer: C If you have an existing Conda environment on your local computer, then you can use the service to create an environment object. By using this strategy, you can reuse your local interactive environment on remote runs. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-environments

You are solving a classification task. You must evaluate your model on a limited data sample by using k-fold cross-validation. You start by configuring a k parameter as the number of splits. You need to configure the k parameter for the cross-validation. Which value should you use? A. k=0.5 B. k=0.01 C. k=5 D. k=1

Correct Answer: C Leave One Out (LOO) cross-validation Setting K = n (the number of observations) yields n-fold and is called leave-one out cross-validation (LOO), a special case of the K-fold approach. LOO CV is sometimes useful but typically doesn't shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance. This is why the usual choice is K=5 or 10. It provides a good compromise for the bias-variance tradeoff.
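
The reason only k=5 is usable among the options is that k is a count of splits, so it has to be an integer of at least 2 and at most the number of observations. A minimal sketch of the splitting logic, with hypothetical function names and no shuffling:

```python
def is_valid_k(k, n_samples):
    """k counts folds, so it must be an integer with 2 <= k <= n_samples."""
    return isinstance(k, int) and 2 <= k <= n_samples


def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, validation_indices), one per fold."""
    if not is_valid_k(k, n_samples):
        raise ValueError("k must be an integer with 2 <= k <= n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=5)
for train, validation in folds:
    print(validation, train)
```

Setting k equal to n_samples in this sketch gives the leave-one-out case described above.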

You are building a machine learning model for translating English language textual content into French language textual content. You need to build and train the machine learning model to learn the sequence of the textual content. Which type of neural network should you use? A. Multilayer Perceptrons (MLPs) B. Convolutional Neural Networks (CNNs) C. Recurrent Neural Networks (RNNs) D. Generative Adversarial Networks (GANs)

Correct Answer: C To translate a corpus of English text to French, we need to build a recurrent neural network (RNN). Note: RNNs are designed to take sequences of text as inputs or return sequences of text as outputs, or both. They're called recurrent because the network's hidden layers have a loop in which the output and cell state from each time step become inputs at the next time step. This recurrence serves as a form of memory. It allows contextual information to flow through the network so that relevant outputs from previous time steps can be applied to network operations at the current time step. Reference: https://towardsdatascience.com/language-translation-with-rnns-d84d43b40571
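
The loop described in the note can be illustrated with a deliberately tiny recurrent cell. This sketch uses a single scalar hidden state and made-up weights; it is a plain recurrent update, not an LSTM with a separate cell state, and not a translation model.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar recurrent cell over a sequence and return every state."""
    h = h0
    states = []
    for x in inputs:
        # The previous hidden state h feeds back into the next step.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Only the first input is non-zero, yet later states stay non-zero:
# the recurrence carries information forward as a form of memory.
states = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
print(states)
```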

You plan to use the Hyperdrive feature of Azure Machine Learning to determine the optimal hyperparameter values when training a model. You must use Hyperdrive to try combinations of the following hyperparameter values. You must not apply an early termination policy. ✑ learning_rate: any value between 0.001 and 0.1 ✑ batch_size: 16, 32, or 64 You need to configure the sampling method for the Hyperdrive experiment. Which two sampling methods can you use? Each correct answer is a complete solution. NOTE: Each correct selection is worth one point. A. No sampling B. Grid sampling C. Bayesian sampling D. Random sampling

Correct Answer: CD C: Bayesian sampling is based on the Bayesian optimization algorithm and makes intelligent choices on the hyperparameter values to sample next. It picks the sample based on how the previous samples performed, such that the new sample improves the reported primary metric. Bayesian sampling does not support any early termination policy Example: from azureml.train.hyperdrive import BayesianParameterSampling from azureml.train.hyperdrive import uniform, choice param_sampling = BayesianParameterSampling( { "learning_rate": uniform(0.05, 0.1), "batch_size": choice(16, 32, 64, 128)}) D: In random sampling, hyperparameter values are randomly selected from the defined search space. Random sampling allows the search space to include both discrete and continuous hyperparameters. Incorrect Answers: B: Grid sampling can be used if your hyperparameter space can be defined as a choice among discrete values and if you have sufficient budget to exhaustively search over all values in the defined search space. Additionally, one can use automated early termination of poorly performing runs, which reduces wastage of resources. Example, the following space has a total of six samples: from azureml.train.hyperdrive import GridParameterSampling from azureml.train.hyperdrive import choice param_sampling = GridParameterSampling( {"num_hidden_layers": choice(1, 2, 3), "batch_size": choice(16, 32)}) Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters
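
The "six samples" figure in the grid example comes from multiplying the number of choices per hyperparameter (3 x 2). A minimal standard-library sketch of that enumeration follows; grid_samples is a hypothetical helper, not part of the Azure SDK.

```python
from itertools import product


def grid_samples(space):
    """Enumerate every combination of the discrete values in space."""
    names = list(space)
    return [dict(zip(names, combo))
            for combo in product(*(space[name] for name in names))]


space = {"num_hidden_layers": [1, 2, 3], "batch_size": [16, 32]}
samples = grid_samples(space)
for s in samples:
    print(s)
```

An enumeration like this needs a finite list per hyperparameter, which is why a learning_rate defined as any value between 0.001 and 0.1 rules out grid sampling in this question.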

You create a machine learning model by using the Azure Machine Learning designer. You publish the model as a real-time service on an Azure Kubernetes Service (AKS) inference compute cluster. You make no changes to the deployed endpoint configuration. You need to provide application developers with the information they need to consume the endpoint. Which two values should you provide to application developers? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. The name of the AKS cluster where the endpoint is hosted. B. The name of the inference pipeline for the endpoint. C. The URL of the endpoint. D. The run ID of the inference pipeline experiment for the endpoint. E. The key for the endpoint.

Correct Answer: CE Deploying an Azure Machine Learning model as a web service creates a REST API endpoint. You can send data to this endpoint and receive the prediction returned by the model. You create a web service when you deploy a model to your local environment, Azure Container Instances, Azure Kubernetes Service, or field-programmable gate arrays (FPGA). You retrieve the URI used to access the web service by using the Azure Machine Learning SDK. If authentication is enabled, you can also use the SDK to get the authentication keys or tokens. Example: # URL for the web service scoring_uri = '<your web service URI>' # If the service is authenticated, set the key or token key = '<your key or token>' Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-consume-web-service
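
A sketch of how a client might combine the two values follows. It only builds the request and does not send it. The URL and key are placeholders, build_scoring_request is a hypothetical helper, and the {"data": ...} payload shape is an assumption, since the expected input format depends on the deployed model.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, key, rows):
    """Build (but do not send) a POST request for a scoring endpoint."""
    body = json.dumps({"data": rows}).encode("utf-8")  # assumed payload shape
    headers = {
        "Content-Type": "application/json",
        # The key or token is passed as a bearer credential.
        "Authorization": f"Bearer {key}",
    }
    return urllib.request.Request(scoring_uri, data=body,
                                  headers=headers, method="POST")


request = build_scoring_request(
    "https://example.invalid/score",   # placeholder for the endpoint URL
    "<your key or token>",             # placeholder for the endpoint key
    [[1.0, 2.0]],
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the call against a real endpoint.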

DRAG DROP - You create machine learning models by using Azure Machine Learning. You plan to train and score models by using a variety of compute contexts. You also plan to create a new compute resource in Azure Machine Learning studio. You need to select the appropriate compute types. Which compute types should you select? To answer, drag the appropriate compute types to the correct requirements. Each compute type may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: (Put Compute type by each Requirement) Compute types: Attached compute, Inference cluster, Compute cluster Requirements: - Train models by using the Azure Machine Learning Designer - Score new data through a trained model published as a real-time web service - Train models by using an Azure Databricks cluster - Deploy models by using the Azure Machine Learning designer

Correct Answer: Compute cluster, Inference cluster, Attached compute, Compute cluster Alternative Answer: the last one is Inference cluster (supported by community responses; see https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-designer-automobile-price-deploy and https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target). Box 1: Compute cluster. Create a single or multi node compute cluster for your training, batch inferencing or reinforcement learning workloads. Box 2: Inference cluster. Box 3: Attached compute. The compute types that can currently be attached for training include: a remote VM, Azure Databricks (for use in machine learning pipelines), Azure Data Lake Analytics (for use in machine learning pipelines), and Azure HDInsight. Box 4: Compute cluster. Note: There are four compute types: Compute instance, Compute clusters, Inference clusters, and Attached compute. Note 2: Compute clusters: Create a single or multi node compute cluster for your training, batch inferencing or reinforcement learning workloads. Attached compute: To use compute targets created outside the Azure Machine Learning workspace, you must attach them. Attaching a compute target makes it available to your workspace. Use Attached compute to attach a compute target for training. Use Inference clusters to attach an AKS cluster for inferencing. Inference clusters: Create or attach an Azure Kubernetes Service (AKS) cluster for large scale inferencing. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-studio

You run an automated machine learning experiment in an Azure Machine Learning workspace. Information about the run is listed in the table below: Columns: Experiment, RunId, Status, CreatedOn, Duration. Row: auto_ml_classification, AutoML_123-123, Completed, 11/11/2019, 00:27:11. You need to write a script that uses the Azure Machine Learning SDK to retrieve the best iteration of the experiment run. Which Python code segment should you use? >> all start with this << from azureml.core import Workspace from azureml.train.automl.run import AutoMLRun A. ws = Workspace.from_config() automl_ex = ws.experiments.get('auto_ml_classification') best_iter = automl_ex.archived_time.find('11/11/2019') B. automl_ex = ws.experiments.get('auto_ml_classification') automl_run = AutoMLRun(automl_ex, 'AutoML_123-123') best_iter = automl_run.current_run C. ws = Workspace.from_config() automl_ex = ws.experiments.get('auto_ml_classification') best_iter = list(automl_ex.get_runs())[0] D. ws = Workspace.from_config() automl_ex = ws.experiments.get('auto_ml_classification') automl_run = AutoMLRun(automl_ex, 'AutoML_123-123') best_iter = automl_run.get_output()[0] E. ws = Workspace.from_config() automl_ex = ws.experiments.get('auto_ml_classification') best_iter = automl_ex.get_runs('AutoML_123-123')

Correct Answer: D The get_output method on automl_classifier returns the best run and the fitted model for the last invocation. Overloads on get_output allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration. Example: best_run, fitted_model = local_run.get_output() Reference: https://notebooks.azure.com/azureml/projects/azureml-getting-started/html/how-to-use-azureml/automated-machine-learning/classification-with-deployment/auto-ml-classification-with-deployment.ipynb

You are performing a filter-based feature selection for a dataset to build a multi-class classifier by using Azure Machine Learning Studio. The dataset contains categorical features that are highly correlated to the output label column. You need to select the appropriate feature scoring statistical method to identify the key predictors. Which method should you use? A. Kendall correlation B. Spearman correlation C. Chi-squared D. Pearson correlation

Correct Answer: D Alternate Answer: C (supported by all community responses). Pearson's correlation statistic, or Pearson's correlation coefficient, is also known in statistical models as the r value. For any two variables, it returns a value that indicates the strength of the correlation. Pearson's correlation coefficient is the test statistic that measures the statistical relationship, or association, between two continuous variables. It is known as the best method of measuring the association between variables of interest because it is based on the method of covariance. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship. Incorrect Answers: C: The two-way chi-squared test is a statistical method that measures how close expected values are to actual results. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/filter-based-feature-selection https://www.statisticssolutions.com/pearsons-correlation-coefficient/
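
The covariance-based definition of the r value can be written out in a few lines. This is a minimal sketch on made-up numeric data; it covers only Pearson's coefficient for two continuous variables and says nothing about which answer is right for categorical features.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])     # perfectly increasing relation
r_down = pearson_r(xs, [10, 8, 6, 4, 2])   # perfectly decreasing relation
print(r_up, r_down)
```

The sign gives the direction of the relationship and the absolute value its magnitude.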

You are creating a new Azure Machine Learning pipeline using the designer. The pipeline must train a model using data in a comma-separated values (CSV) file that is published on a website. You have not created a dataset for this file. You need to ingest the data from the CSV file into the designer pipeline using the minimal administrative effort. Which module should you add to the pipeline in Designer? A. Convert to CSV B. Enter Data Manually C. Import Data D. Dataset

Correct Answer: D Alternative Answer: C (supported by all community comments). The Import Data module loads data from web URLs or from various cloud-based storage in Azure, such as Azure SQL database and Azure blob storage (https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/import-data). Rationale given for D: The preferred way to provide data to a pipeline is a Dataset object. The Dataset object points to data that lives in or is accessible from a datastore or at a web URL. The Dataset class is abstract, so you will create an instance of either a FileDataset (referring to one or more files) or a TabularDataset that is created from one or more files with delimited columns of data. Example: from azureml.core import Dataset iris_tabular_dataset = Dataset.Tabular.from_delimited_files([(def_blob_store, 'train-dataset/iris.csv')]) Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline

You are evaluating a completed binary classification machine learning model. You need to use the precision as the evaluation metric. Which visualization should you use? A. Violin plot B. Gradient descent C. Box plot D. Binary classification confusion matrix

Correct Answer: D Incorrect Answers: A: A violin plot is a visual that traditionally combines a box plot and a kernel density plot. B: Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. C: A box plot lets you see basic distribution information about your data, such as median, mean, range and quartiles but doesn't show you how your data looks throughout its range. Reference: https://machinelearningknowledge.ai/confusion-matrix-and-performance-metrics-machine-learning/

You create a binary classification model by using Azure Machine Learning Studio. You must tune hyperparameters by performing a parameter sweep of the model. The parameter sweep must meet the following requirements: ✑ iterate all possible combinations of hyperparameters ✑ minimize computing resources required to perform the sweep You need to perform a parameter sweep of the model. Which parameter sweep mode should you use? A. Random sweep B. Sweep clustering C. Entire grid D. Random grid

Correct Answer: D Maximum number of runs on random grid: This option also controls the number of iterations over a random sampling of parameter values, but the values are not generated randomly from the specified range; instead, a matrix is created of all possible combinations of parameter values and a random sampling is taken over the matrix. This method is more efficient and less prone to regional oversampling or undersampling. If you are training a model that supports an integrated parameter sweep, you can also set a range of seed values to use and iterate over the random seeds as well. This is optional, but can be useful for avoiding bias introduced by seed selection. Incorrect Answers: B: If you are building a clustering model, use Sweep Clustering to automatically determine the optimum number of clusters and other parameters. C: Entire grid: When you select this option, the module loops over a grid predefined by the system, to try different combinations and identify the best learner. This option is useful for cases where you don't know what the best parameter settings might be and want to try all possible combination of values. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters
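
The random grid idea described above (build the full matrix of combinations, then sample from it) can be sketched with the standard library. The function name random_grid, the parameter names, and the values are assumptions for illustration and do not reproduce the Studio module's internals.

```python
import random
from itertools import product


def random_grid(space, max_runs, seed=0):
    """Build every combination, then draw max_runs of them without repeats."""
    names = list(space)
    grid = [dict(zip(names, combo))
            for combo in product(*(space[name] for name in names))]
    rng = random.Random(seed)
    return rng.sample(grid, min(max_runs, len(grid)))


space = {"learning_rate": [0.01, 0.05, 0.1], "iterations": [50, 100]}
runs = random_grid(space, max_runs=4)
for run_params in runs:
    print(run_params)
```

Capping max_runs below the grid size is what limits compute; raising it to the grid size makes the sweep cover every combination.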

You are evaluating a completed binary classification machine learning model. You need to use the precision as the evaluation metric. Which visualization should you use? A. Violin plot B. Gradient descent C. Scatter plot D. Receiver Operating Characteristic (ROC) curve

Correct Answer: D Receiver operating characteristic (or ROC) is a plot of the correctly classified labels vs. the incorrectly classified labels for a particular model. Incorrect Answers: A: A violin plot is a visual that traditionally combines a box plot and a kernel density plot. B: Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. C: A scatter plot graphs the actual values in your data against the values predicted by the model. The scatter plot displays the actual values along the X-axis, and displays the predicted values along the Y-axis. It also displays a line that illustrates the perfect prediction, where the predicted value exactly matches the actual value. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#confusion-matrix
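
More precisely, each point on a ROC curve pairs the false positive rate with the true positive rate at one decision threshold. The sketch below computes those points for made-up labels and scores; roc_points is a hypothetical helper and assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (false_positive_rate, true_positive_rate) per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


y_true = [1, 1, 0, 1, 0]              # assumed ground-truth labels
scores = [0.9, 0.8, 0.7, 0.4, 0.2]    # assumed predicted probabilities
points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```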

You use the Azure Machine Learning Python SDK to define a pipeline to train a model. The data used to train the model is read from a folder in a datastore. You need to ensure the pipeline runs automatically whenever the data in the folder changes. What should you do? A. Set the regenerate_outputs property of the pipeline to True B. Create a ScheduleRecurrence object with a Frequency of auto. Use the object to create a Schedule for the pipeline C. Create a PipelineParameter with a default value that references the location where the training data is stored D. Create a Schedule for the pipeline. Specify the datastore in the datastore property, and the folder containing the training data in the path_on_datastore property

Correct Answer: D Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-trigger-published-pipeline

You have the following code. The code prepares an experiment to run a script: from azureml.core import Workspace, Experiment, Run, ScriptRunConfig ws = Workspace.from_config() script_config = ScriptRunConfig(source_directory='experiment_files', script='experiment.py') script_experiment = Experiment(workspace=ws, name='script-experiment') The experiment must be run on local computer using the default environment. You need to add code to start the experiment and run the script. Which code segment should you use? A. run = script_experiment.start_logging() B. run = Run(experiment=script_experiment) C. ws.get_run(run_id=experiment.id) D. run = script_experiment.submit(config=script_config)

Correct Answer: D The submit method of the Experiment class submits an experiment and returns the active created run. Syntax: submit(config, tags=None, **kwargs) Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment

You run an experiment that uses an AutoMLConfig class to define an automated machine learning task with a maximum of ten model training iterations. The task will attempt to find the best performing model based on a metric named accuracy. You submit the experiment with the following code: from azureml.core.experiment import Experiment automl_experiment = Experiment(ws, 'automl_experiment') automl_run = automl_experiment.submit(automl_config, show_output=True) You need to create Python code that returns the best model that is generated by the automated machine learning task. Which code segment should you use? A. best_model = automl_run.get_details() B. best_model = automl_run.get_metrics() C. best_model = automl_run.get_file_names()[1] D. best_model = automl_run.get_output()[1]

Correct Answer: D The get_output method returns the best run and the fitted model. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train https://notebooks.azure.com/azureml/projects/azureml-getting-started/html/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb

You use the Two-Class Neural Network module in Azure Machine Learning Studio to build a binary classification model. You use the Tune Model Hyperparameters module to tune accuracy for the model. You need to configure the Tune Model Hyperparameters module. Which two values should you use? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. Number of hidden nodes B. Learning Rate C. The type of the normalizer D. Number of learning iterations E. Hidden layer specification

Correct Answer: DE Alternate Answer: BD (supported by community responses): the Two-Class Neural Network has only two parameters that can be set to "range" (instead of "single parameter") and can therefore be learned by the Tune Model Hyperparameters module: 1) number of learning iterations and 2) learning rate. D: For Number of learning iterations, specify the maximum number of times the algorithm should process the training cases. E: For Hidden layer specification, select the type of network architecture to create. Between the input and output layers you can insert multiple hidden layers. Most predictive tasks can be accomplished easily with only one or a few hidden layers. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/two-class-neural-network

You define a datastore named ml-data for an Azure Storage blob container. In the container, you have a folder named train that contains a file named data.csv. You plan to use the file to train a model by using the Azure Machine Learning SDK. You plan to train the model by using the Azure Machine Learning SDK to run an experiment on local compute. You define a DataReference object by running the following code: from azureml.core import Workspace, Datastore, Environment from azureml.train.estimator import Estimator ws = Workspace.from_config() ml_data = Datastore.get(ws, datastore_name='ml-data') data_ref = ml_data.path('train').as_download(path_on_compute='train_data') estimator = Estimator(source_directory='experiment_folder', script_params={'--data-folder': data_ref}, compute_target = 'local', entry_script='training.py') run = experiment.submit(config=estimator) run.wait_for_completion(show_output=True) You need to load the training data. Which code segment should you use? (beginning of all but C.) >>import os, argparse, pandas as pd parser = argparse.ArgumentParser() parser.add_argument('--data-folder', type=str, dest='data_folder') args = parser.parse_args() data_folder = args.data_folder<< A. data = pd.read_csv(os.path.join(data_folder, 'ml-data', 'train_data', 'data.csv')) B. data = pd.read_csv(os.path.join(data_folder, 'train', 'data.csv')) C. import pandas as pd data = pd.read_csv('./data.csv') D. data = pd.read_csv(os.path.join('ml_data', data_folder, 'data.csv')) E. data = pd.read_csv(os.path.join(data_folder, 'data.csv'))

Correct Answer: E Example: data_folder = args.data_folder # Load Train and Test data train_data = pd.read_csv(os.path.join(data_folder, 'data.csv')) Reference: https://www.element61.be/en/resource/azure-machine-learning-services-complete-toolbox-ai
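
The pattern behind answer E is that the script receives the folder path as an argument and joins only the file name onto it. The sketch below reproduces that pattern locally; it swaps pandas for the standard-library csv module and uses a temporary folder with made-up contents in place of the downloaded datastore path.

```python
import argparse
import csv
import os
import tempfile


def load_rows(argv):
    """Parse --data-folder and read data.csv from that folder."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-folder", type=str, dest="data_folder")
    args = parser.parse_args(argv)
    # The folder argument already points at the data location,
    # so only the file name is appended.
    with open(os.path.join(args.data_folder, "data.csv"), newline="") as handle:
        return list(csv.DictReader(handle))


# Stand-in for the folder that the data reference would resolve to.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "data.csv"), "w", newline="") as handle:
        handle.write("x,y\n1,2\n3,4\n")
    rows = load_rows(["--data-folder", folder])

print(rows)
```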

You create a script that trains a convolutional neural network model over multiple epochs and logs the validation loss after each epoch. The script includes arguments for batch size and learning rate. You identify a set of batch size and learning rate values that you want to try. You need to use Azure Machine Learning to find the combination of batch size and learning rate that results in the model with the lowest validation loss. What should you do? A. Run the script in an experiment based on an AutoMLConfig object B. Create a PythonScriptStep object for the script and run it in a pipeline C. Use the Automated Machine Learning interface in Azure Machine Learning studio D. Run the script in an experiment based on a ScriptRunConfig object E. Run the script in an experiment based on a HyperDriveConfig object

Correct Answer: E Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters

You create a batch inference pipeline by using the Azure ML SDK. You configure the pipeline parameters by executing the following code: from azureml.contrib.pipeline.steps import ParallelRunConfig parallel_run_config = ParallelRunConfig( source_directory=scripts_folder, entry_script= "batch_pipeline.py", mini_batch_size= "5", error_threshold=10, output_action= "append_row", environment=batch_env, compute_target=compute_target, logging_level= "DEBUG", node_count=4) You need to obtain the output from the pipeline execution. Where will you find the output? A. the digit_identification.py script B. the debug log C. the Activity Log in the Azure portal for the Machine Learning workspace D. the Inference Clusters tab in Machine Learning studio E. a file named parallel_run_step.txt located in the output folder

Correct Answer: E output_action (str): How the output is to be organized. Currently supported values are 'append_row' and 'summary_only'. 'append_row': All values output by run() method invocations will be aggregated into one unique file named parallel_run_step.txt that is created in the output location. Reference: https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig
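
The 'append_row' behavior can be pictured with a local simulation: each mini-batch is passed to run(), and every returned value ends up as a line in one file. This is only a sketch of the aggregation idea. The run() body, the helper append_rows, and the file names are invented, and the real service performs this across nodes.

```python
import os
import tempfile


def run(mini_batch):
    """Hypothetical scoring function: one result line per input item."""
    return [f"{item}: {len(item)}" for item in mini_batch]


def append_rows(items, mini_batch_size, output_dir):
    """Call run() per mini-batch and append all results to a single file."""
    output_path = os.path.join(output_dir, "parallel_run_step.txt")
    with open(output_path, "w") as out:
        for start in range(0, len(items), mini_batch_size):
            for row in run(items[start:start + mini_batch_size]):
                out.write(row + "\n")
    return output_path


with tempfile.TemporaryDirectory() as output_dir:
    path = append_rows(["a.png", "bb.png", "ccc.png"],
                       mini_batch_size=2, output_dir=output_dir)
    with open(path) as handle:
        lines = handle.read().splitlines()

print(lines)
```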

HOTSPOT - You are working on a classification task. You have a dataset indicating whether a student would like to play soccer and associated attributes. The dataset includes the following columns: - IsPlaySoccer - Values can be 1 and 0 - Gender - Values can be M or F - PrevExamMarks - Stores values from 0 to 100 - Height - Stores values in centimeters - Weight - Stores values in kilograms You need to classify variables by type. Which variable should you add to each category? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Categorical Variables - Gender, IsPlaySoccer - Gender, PrevExamMarks, Height, Weight - PrevExamMarks, Height, Weight - IsPlaySoccer Continuous Variables - Gender, IsPlaySoccer - Gender, PrevExamMarks, Height, Weight - PrevExamMarks, Height, Weight - IsPlaySoccer

Correct Answer: Gender, IsPlaySoccer; PrevExamMarks, Height, Weight Categorical variable - Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. Here, IsPlaySoccer and Gender each have two distinct values. Continuous variable - Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. Here, PrevExamMarks, Height, and Weight each take a range of values. Discrete variable - Discrete variables are numeric variables that have a countable number of values between any two values. A discrete variable is always numeric. For example, the number of customer complaints or the number of flaws or defects. Reference: https://www.edureka.co/blog/classification-algorithms/

HOTSPOT - You publish a batch inferencing pipeline that will be used by a business application. The application developers need to know which information should be submitted to and returned by the REST interface for the published pipeline. You need to identify the information required in the REST request and returned as a response from the published pipeline. Which values should you use in the REST request and to expect in the response? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: REST Request Request Header >1< Request Body >1< Response >2< >1<: JSON containing the run ID JSON containing the pipeline ID JSON containing the experiment name JSON containing an OAuth bearer token >2<: JSON containing the run ID JSON containing a list of predictions JSON containing the experiment name JSON containing a path to the parallel_run_step.txt output file

Correct Answer: JSON containing an OAuth bearer token, JSON containing the experiment name, JSON containing the run ID Box 1: JSON containing an OAuth bearer token Specify your authentication header in the request. To run the pipeline from the REST endpoint, you need an OAuth2 Bearer-type authentication header. Box 2: JSON containing the experiment name Add a JSON payload object that has the experiment name. Example: rest_endpoint = published_pipeline.endpoint response = requests.post(rest_endpoint, headers=auth_header, json={"ExperimentName": "batch_scoring", "ParameterAssignments": {"process_count_per_node": 6}}) run_id = response.json()["Id"] Box 3: JSON containing the run ID Make the request to trigger the run. Include code to access the Id key from the response dictionary to get the value of the run ID. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-pipeline-batch-scoring-classification
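The three parts of this exchange can be sketched without calling a real endpoint. The helper names build_request and extract_run_id, the token string, and the run ID below are hypothetical; only the ExperimentName payload key and the Id response key come from the example in the answer.

```python
import json


def build_request(token, experiment_name):
    """Return (headers, body) for a call to a published pipeline endpoint."""
    # The OAuth2 bearer token travels in the request header.
    headers = {"Authorization": "Bearer " + token}
    # The experiment name travels in the JSON request body.
    body = {"ExperimentName": experiment_name}
    return headers, body


def extract_run_id(response_text):
    """Read the run ID from the JSON text returned by the endpoint."""
    return json.loads(response_text)["Id"]


headers, body = build_request("hypothetical-token", "batch_scoring")

# Stand-in for the text a real endpoint would return (made-up run ID).
sample_response = json.dumps({"Id": "hypothetical-run-id"})
run_id = extract_run_id(sample_response)

print(headers)
print(body)
print(run_id)
```

In real code the headers and body would be passed to requests.post, as in the answer's example; the sketch only shows which piece of information sits where.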

HOTSPOT - You plan to use Hyperdrive to optimize the hyperparameters selected when training a model. You create the following code to define options for the hyperparameter experiment: import azureml.train.hyperdrive.parameter_expressions as pe from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal param_sampling = GridParameterSampling({ "max_depth" : pe.choice(6, 7, 8, 9), "learning_rate" : pe.choice(0.05, 0.1, 0.15)}) hyperdrive_run_config = HyperDriveConfig( estimator = estimator, hyperparameter_sampling = param_sampling, policy = None, primary_metric_name = "auc", primary_metric_goal = PrimaryMetricGoal.MAXIMIZE, max_total_runs = 50, max_concurrent_runs = 4) For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - There will be 50 runs for this hyperparameter tuning experiment - You can use the policy parameter in the HyperDriveConfig class to specify a security policy - The experiment will create a run for every possible value for the learning rate parameter between 0.05 and 0.15

Correct Answer: No, No, No (some answer keys give No, Yes, No, but the policy parameter sets an early termination policy, not a security policy; see https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig?view=azure-ml-py) Box 1: No - max_total_runs (50 here) is the maximum total number of runs to create. This is the upper bound; there may be fewer runs when the sample space is smaller than this value. Grid sampling over 4 max_depth values and 3 learning_rate values gives only 12 combinations. Box 2: No - policy takes an EarlyTerminationPolicy, the early termination policy to use. If None (the default), no early termination policy will be used. Box 3: No - Discrete hyperparameters are specified as a choice among discrete values, so only 0.05, 0.1, and 0.15 are tried. choice can be: ✑ one or more comma-separated values ✑ a range object ✑ any arbitrary list object Reference: https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters
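The run count for the grid in this question can be checked by enumerating it. The sketch below is plain Python rather than the Hyperdrive API; it assumes only that grid sampling tries each combination of the listed choices once, capped by max_total_runs.

```python
from itertools import product

# Discrete choices copied from the question's sampling definition.
max_depth = [6, 7, 8, 9]
learning_rate = [0.05, 0.1, 0.15]
max_total_runs = 50

# Grid sampling covers every combination of the listed values.
grid = list(product(max_depth, learning_rate))

# max_total_runs is only an upper bound on the number of runs.
expected_runs = min(len(grid), max_total_runs)

print(len(grid), expected_runs)
```

Only the three listed learning rates appear in the grid; values in between, such as 0.07, are never tried.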

HOTSPOT - You register the following versions of a model. Model name, Model version, Tags, Properties hcm, 3, 'Training context':'CPU Compute', value:87.43 hcm, 2, 'Training context':'CPU Compute', value:54.98 hcm, 1, 'Training context':'CPU Compute', value:23.56 You use the Azure ML Python SDK to run a training experiment. You use a variable named run to reference the experiment run. After the run has been submitted and completed, you run the following code: run.register_model(model_path='outputs/model.pkl', model_name='healthcare_model', tags={'Training context':'CPU Compute'}) For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - The code will cause a previous version of the saved model to be overwritten - The version number will now be 4 - The latest version of the stored model will have a property of value: 87.43

Correct Answer: No, Yes, No Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where

HOTSPOT - You have an Azure blob container that contains a set of TSV files. The Azure blob container is registered as a datastore for an Azure Machine Learning service workspace. Each TSV file uses the same data schema. You plan to aggregate data for all of the TSV files together and then register the aggregated data as a dataset in an Azure Machine Learning workspace by using the Azure Machine Learning SDK for Python. You run the following code. from azureml.core.workspace import Workspace from azureml.core.datastore import Datastore from azureml.core.dataset import Dataset import pandas as pd datastore_paths = (datastore, './data/*.tsv') myDataset_1 = Dataset.File.from_files(path=datastore_paths) myDataset_2 = Dataset.Tabular.from_delimited_files(path=datastore_paths, separator='\t') For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - The myDataset_1 dataset can be converted into a pandas dataframe by using the following method: myDataset_1.to_pandas_dataframe() - The myDataset_1.to_path() method returns an array of file paths for all of the TSV files in the dataset - The myDataset_2 dataset can be converted into a pandas dataframe by using the following method: myDataset_2.to_pandas_dataframe()

Correct Answer: No, Yes, Yes Box 1: No - FileDataset references single or multiple files in datastores or from public URLs. The TSV files need to be parsed. Box 2: Yes - to_path() gets a list of file paths for each file stream defined by the dataset. Box 3: Yes - TabularDataset.to_pandas_dataframe loads all records from the dataset into a pandas DataFrame. TabularDataset represents data in a tabular format created by parsing the provided file or list of files. Note: TSV is a file extension for a tab-delimited file used with spreadsheet software. TSV stands for Tab Separated Values. TSV files are used for raw data and can be imported into and exported from spreadsheet software. TSV files are essentially text files, and the raw data can be viewed by text editors, though they are often used when moving raw data between spreadsheets. Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset

HOTSPOT - You have a dataset created for multiclass classification tasks that contains a normalized numerical feature set with 10,000 data points and 150 features. You use 75 percent of the data points for training and 25 percent for testing. You are using the scikit-learn machine learning library in Python. You use X to denote the feature set and Y to denote class labels. You create the following Python data frames: X_train - training feature set Y_train - training class labels x_test - testing feature set y_test - testing class labels You need to apply the Principal Component Analysis (PCA) method to reduce the dimensionality of the feature set to 10 features in both training and testing sets. How should you complete the code segment? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: from sklearn.decomposition import PCA - pca = PCA(); PCA(n_components = 150); PCA(n_components = 10); PCA(n_components = 10000) - X_train= >>pca; model; sklearn.decomposition<<.fit_transform(X_train) - x_test = pca.x_test; X_train; fit(x_test); transform(x_test)

Correct Answer: PCA(n_components = 10), pca, transform(x_test) Box 1: PCA(n_components = 10) Need to reduce the dimensionality of the feature set to 10 features in both training and testing sets. Example: from sklearn.decomposition import PCA pca = PCA(n_components=2) # 2 dimensions principalComponents = pca.fit_transform(x) Box 2: pca - fit_transform(X[, y]) fits the model with X and applies the dimensionality reduction to X. Box 3: transform(x_test) transform(X) applies the dimensionality reduction already fitted on the training set to X. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

DRAG DROP - You create a multi-class image classification deep learning model. The model must be retrained monthly with the new image data fetched from a public web portal. You create an Azure Machine Learning pipeline to fetch new data, standardize the size of images, and retrain the model. You need to use the Azure Machine Learning SDK to configure the schedule for the pipeline. Which four actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: - Publish the pipeline - Retrieve the pipeline ID - Create a ScheduleRecurrence(frequency='Month', interval=1, start_time='2019-01-01T00:00:00') object - Define a pipeline parameter named RunDate - Define a new Azure Machine Learning pipeline StepRun object with the step ID of the first step in the pipeline - Define an Azure Machine Learning pipeline schedule using the schedule.create method with the defined recurrence specification

Correct Answer: Publish the pipeline; Retrieve the pipeline ID; Create a ScheduleRecurrence(frequency='Month', interval=1, start_time='2019-01-01T00:00:00') object; Define an Azure Machine Learning pipeline schedule using the schedule.create method with the defined recurrence specification Step 1: Publish the pipeline. To schedule a pipeline, you'll need a reference to your workspace, the identifier of your published pipeline, and the name of the experiment in which you wish to create the schedule. Step 2: Retrieve the pipeline ID. Needed for the schedule. Step 3: Create a ScheduleRecurrence. To run a pipeline on a recurring basis, you'll create a schedule. A Schedule associates a pipeline, an experiment, and a trigger. First create the recurrence. Example: Create a Schedule that begins a run every 15 minutes: recurrence = ScheduleRecurrence(frequency="Minute", interval=15) Step 4: Define an Azure Machine Learning pipeline schedule. Example, continued: recurring_schedule = Schedule.create(ws, name="MyRecurringSchedule", description="Based on time", pipeline_id=pipeline_id, experiment_name=experiment_name, recurrence=recurrence) Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipelines

HOTSPOT - You are hired as a data scientist at a winery. The previous data scientist used Azure Machine Learning. You need to review the models and explain how each model makes decisions. Which explainer modules should you use? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: A random forest model for predicting the alcohol content in wine given a set of covariates >> Tabular, HAN, Text, Image A natural language processing model for analyzing field reports >> Tree, HAN, Text, Image An image classifier that determines the quality of the grape based upon its physical characteristics >> Kernel, HAN, Text, Image

Correct Answer: Tabular, Text, Image Meta explainers automatically select a suitable direct explainer and generate the best explanation info based on the given model and data sets. The meta explainers leverage all the libraries (SHAP, LIME, Mimic, etc.) that we have integrated or developed. The following are the meta explainers available in the SDK: Tabular Explainer: Used with tabular datasets. Text Explainer: Used with text datasets. Image Explainer: Used with image datasets. Box 1: Tabular - Box 2: Text - Box 3: Image - Incorrect Answers: Hierarchical Attention Network (HAN) HAN was proposed by Yang et al. in 2016. The key features that differentiate HAN from existing approaches to document classification are (1) it exploits the hierarchical nature of text data and (2) its attention mechanism is adapted for document classification. Reference: https://medium.com/microsoftazure/automated-and-interpretable-machine-learning-d07975741298

DRAG DROP - You are building an experiment using the Azure Machine Learning designer. You split a dataset into training and testing sets. You select the Two-Class Boosted Decision Tree as the algorithm. You need to determine the Area Under the Curve (AUC) of the model. Which three modules should you use in sequence? To answer, move the appropriate modules from the list of modules to the answer area and arrange them in the correct order. Select and Place: Modules: - Export Data - Tune Model Hyperparameters - Cross Validate Model - Evaluate Model - Score Model - Train Model

Correct Answer: Train Model, Score Model, Evaluate Model Step 1: Train Model - Two-Class Boosted Decision Tree - First, set up the boosted decision tree model. 1. Find the Two-Class Boosted Decision Tree module in the module palette and drag it onto the canvas. 2. Find the Train Model module, drag it onto the canvas, and then connect the output of the Two-Class Boosted Decision Tree module to the left input port of the Train Model module. The Two-Class Boosted Decision Tree module initializes the generic model, and Train Model uses training data to train the model. 3. Connect the left output of the left Execute R Script module to the right input port of the Train Model module (in this tutorial you used the data coming from the left side of the Split Data module for training). Step 2: Score Model - Score and evaluate the models - You use the testing data that was separated out by the Split Data module to score the trained models. You can then compare the results of the two models to see which generated better results. Add the Score Model modules - 1. Find the Score Model module and drag it onto the canvas. 2. Connect the Train Model module that's connected to the Two-Class Boosted Decision Tree module to the left input port of the Score Model module. 3. Connect the right Execute R Script module (the testing data) to the right input port of the Score Model module. Step 3: Evaluate Model - To evaluate the two scoring results and compare them, you use an Evaluate Model module. 1. Find the Evaluate Model module and drag it onto the canvas. 2. Connect the output port of the Score Model module associated with the boosted decision tree model to the left input port of the Evaluate Model module. 3. Connect the other Score Model module to the right input port.

You create a binary classification model to predict whether a person has a disease. You need to detect possible classification errors. Which error type should you choose for each description? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: - A person has a disease. The model classifies the case as having a disease - A person does not have a disease. The model classifies the case as having no disease - A person does not have a disease. The model classifies the case as having a disease - A person has a disease. The model classifies the case as having no disease For all, use these choices: True Positives, True Negatives, False Positives, False Negatives

Correct Answer: True Positives, True Negatives, False Positives, False Negatives Box 1: True Positive - A true positive is an outcome where the model correctly predicts the positive class Box 2: True Negative - A true negative is an outcome where the model correctly predicts the negative class. Box 3: False Positive - A false positive is an outcome where the model incorrectly predicts the positive class. Box 4: False Negative - A false negative is an outcome where the model incorrectly predicts the negative class. Note: Let's make the following definitions: "Wolf" is a positive class. "No wolf" is a negative class. We can summarize our "wolf-prediction" model using a 2x2 confusion matrix that depicts all four possible outcomes: Reference: https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative
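The four definitions reduce to two yes/no facts: what is actually true and what the model predicted. A minimal sketch follows; the function name outcome and the case list are illustrative, with "has a disease" taken as the positive class.

```python
def outcome(has_disease, predicted_disease):
    """Name the confusion-matrix cell for one case.

    "Positive"/"Negative" describes the prediction;
    "True"/"False" says whether the prediction matched reality.
    """
    if predicted_disease:
        return "True Positive" if has_disease else "False Positive"
    return "False Negative" if has_disease else "True Negative"


# The four cases from the question, as (actual, predicted) pairs.
cases = [(True, True), (False, False), (False, True), (True, False)]
labels = [outcome(actual, predicted) for actual, predicted in cases]

for case, label in zip(cases, labels):
    print(case, label)
```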

HOTSPOT - You have a multi-class image classification deep learning model that uses a set of labeled photographs. You create the following code to select hyperparameter values when training the model. from azureml.train.hyperdrive import BayesianParameterSampling param_sampling = BayesianParameterSampling ({ "learning_rate": uniform(0.01, 0.1), "batch_size": choice(16, 32, 64, 128)}) For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - Hyperparameter combinations for the runs are selected based on how previous samples performed in the previous experiment run - The learning rate value 0.09 might be used during model training - You can define an early termination policy for this hyperparameter tuning run

Correct Answer: Yes, Yes, No Additional Info: Bayesian sampling is based on the Bayesian optimization algorithm. It picks samples based on how previous samples did, so that new samples improve the primary metric. Bayesian sampling only supports choice, uniform, and quniform distributions over the search space. Bayesian sampling does not support early termination. When using Bayesian sampling, set early_termination_policy = None. Box 1: Yes - Hyperparameters are adjustable parameters you choose to train a model that govern the training process itself. Azure Machine Learning allows you to automate hyperparameter exploration in an efficient manner, saving you significant time and resources. You specify the range of hyperparameter values and a maximum number of training runs. The system then automatically launches multiple simultaneous runs with different parameter configurations and finds the configuration that results in the best performance, measured by the metric you choose. Poorly performing training runs are automatically early terminated, reducing wastage of compute resources. These resources are instead used to explore other hyperparameter configurations. Box 2: Yes - uniform(low, high) - Returns a value uniformly distributed between low and high Box 3: No - Bayesian sampling does not currently support any early termination policy. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters

HOTSPOT - You are using the Hyperdrive feature in Azure Machine Learning to train a model. You configure the Hyperdrive experiment by running the following code: from azureml.train.hyperdrive import RandomParameterSampling param_sampling = RandomParameterSampling( { "learning_rate": normal(10,3), "keep_probability": uniform(0.05, 0.1), "batch_size": choice(16, 32, 64, 128), "number_of_hidden_layers": choice(range(3,5)) } ) For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - By defining sampling in this manner, every possible combination of the parameters will be tested - Random values of the learning rate parameter will be selected from a normal distribution with a mean of 10 and a standard deviation of 3 - The keep_probability parameter value will always be either 0.05 or 0.1 - Random values for the number_of_hidden_layers parameter will be selected from a normal distribution with a mean of 4 and a standard deviation of 5

Correct Answer: No, Yes, No, No (some answer keys give Yes for Box 1, which contradicts the explanation) Box 1: No - In random sampling, hyperparameter values are randomly selected from the defined search space, so not every combination is tested. Random sampling allows the search space to include both discrete and continuous hyperparameters. Box 2: Yes - learning_rate has a normal distribution with mean value 10 and a standard deviation of 3. Box 3: No - keep_probability has a uniform distribution with a minimum value of 0.05 and a maximum value of 0.1, so any value in that interval can occur. Box 4: No - number_of_hidden_layers is a discrete choice over range(3,5), that is, one of the values [3, 4]. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters
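The behaviour of each distribution in this search space can be imitated with Python's standard random module. This is only an illustration, not the Hyperdrive sampler: the function sample_config, the seed, and the sample count are assumptions made for the sketch. Note that Python's range(3, 5) yields 3 and 4.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_config():
    """Draw one configuration, mirroring the question's search space."""
    return {
        # normal(10, 3): mean 10, standard deviation 3
        "learning_rate": rng.gauss(10, 3),
        # uniform(0.05, 0.1): any value in the interval, not just the ends
        "keep_probability": rng.uniform(0.05, 0.1),
        # choice(...): one of the listed discrete values
        "batch_size": rng.choice([16, 32, 64, 128]),
        # choice(range(3, 5)): one of the integers 3 or 4
        "number_of_hidden_layers": rng.choice(list(range(3, 5))),
    }


samples = [sample_config() for _ in range(200)]
keep_values = {s["keep_probability"] for s in samples}
layer_values = sorted({s["number_of_hidden_layers"] for s in samples})

print(len(keep_values), layer_values)
```

Because each configuration is drawn independently, nothing guarantees that every combination of batch_size and number_of_hidden_layers is visited, and the continuous parameters have infinitely many possible values.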

HOTSPOT - You create a script for training a machine learning model in Azure Machine Learning service. You create an estimator by running the following code: from azureml.core import Workspace, Datastore from azureml.core.compute import ComputeTarget from azureml.train.estimator import Estimator work_space = Workspace.from_config() data_source = work_space.get_default_datastore() train_cluster = ComputeTarget(workspace=work_space, name= 'train-cluster') estimator = Estimator(source_directory = 'training-experiment', script_params = { '--data-folder' : data_source.as_mount(), '--regularization':0.8}, compute_target = train_cluster, entry_script = 'train.py', conda_packages = ['scikit-learn']) For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - The estimator will look for the files it needs to run an experiment in the training-experiment directory of the local compute environment - The estimator will mount the local data-folder folder and make it available to the script through a parameter - The train.py script file will be created if it does not exist - The estimator can run Scikit-learn experiments

Correct Answer: Yes, Yes, No, Yes Box 1: Yes - Parameter source_directory is a local directory containing experiment configuration and code files needed for a training job. Box 2: Yes - script_params is a dictionary of command-line arguments to pass to the training script specified in entry_script. Box 3: No - Box 4: Yes - The conda_packages parameter is a list of strings representing conda packages to be added to the Python environment for the experiment.

HOTSPOT - You are using Azure Machine Learning to train machine learning models. You need a compute target on which to remotely run the training script. You run the following Python code: from azureml.core.compute import ComputeTarget, AmlCompute from azureml.core.compute_target import ComputeTargetException the_cluster_name = 'Newcompute' config = AmlCompute.provisioning_configuration(vm_size= 'STANDARD_D2', max_nodes=3) the_cluster = ComputeTarget.create(ws, the_cluster_name, config) For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - The compute is created in the same region as the Machine Learning service workspace - The compute resource created by the code is displayed as a compute cluster in Azure Machine Learning studio - The minimum number of nodes will be zero

Correct Answer: Yes, Yes, Yes Box 1: Yes - The compute is created within your workspace region as a resource that can be shared with other users. Box 2: Yes - It is displayed as a compute cluster. View compute targets - 1. To see all compute targets for your workspace, use the following steps: 2. Navigate to Azure Machine Learning studio. 3. Under Manage, select Compute. 4. Select tabs at the top to show each type of compute target. Box 3: Yes - min_nodes is not specified, so it defaults to 0. Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute.amlcomputeprovisioningconfiguration https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-studio

HOTSPOT - You are running a training experiment on remote compute in Azure Machine Learning. The experiment is configured to use a conda environment that includes the mlflow and azureml-contrib-run packages. You must use MLflow as the logging package for tracking metrics generated in the experiment. You need to complete the script for the experiment. How should you complete the code? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: import numpy as np # Import library to log metrics - from azureml.core import Run; import mlflow; import logging # Start logging for this run - run=Run.get_context(); mlflow.start_run(); logger = logging.getLogger('Run') reg_rate = 0.01 # Log the reg_rate metric - run.log('reg_rate', np.float(reg_rate)); mlflow.log_metric('reg_rate', np.float(reg_rate)); logger.info(np.float(reg_rate)) # Stop logging for this run - run.complete(); mlflow.end_run(); logger.setLevel(logging.INFO)

Correct Answer: all mlflow statements Box 1: import mlflow - Import the mlflow and Workspace classes to access MLflow's tracking URI and configure your workspace. Box 2: mlflow.start_run() Set the MLflow experiment name with set_experiment() and start your training run with start_run(). Box 3: mlflow.log_metric(' ..') Use log_metric() to activate the MLflow logging API and begin logging your training run metrics. Box 4: mlflow.end_run() Close the run with mlflow.end_run(). Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow https://www.mlflow.org/docs/latest/python_api/mlflow.html

HOTSPOT - You have a Python data frame named salesData in the following format: ----shop----2017--2018 0 - Shop X - 34 - 25 1 - Shop Y - 65 - 76 2 - Shop Z - 48 - 55 The data frame must be unpivoted to a long data format as follows: ----shop----year--value 0 - Shop X - 2017 - 34 1 - Shop Y - 2017 - 65 2 - Shop Z - 2017 - 48 3 - Shop X - 2018 - 25 4 - Shop Y - 2018 - 76 5 - Shop Z - 2018 - 55 You need to use the pandas.melt() function in Python to perform the transformation. How should you complete the code segment? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: import pandas as pd salesData = pd.melt( >1<, id_vars=' >2< ', value_vars= >3< >1<: dataFrame; pandas; salesData; year >2<: shop; year; value; Shop X, Shop Y, Shop Z >3<: 'shop'; 'year'; ['year']; ['2017', '2018']

Correct Answer: salesData, shop, ['2017','2018'] (some answer keys list dataFrame for the first box, but the data frame in the question is named salesData) Box 1: salesData - Syntax: pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None) where frame is the DataFrame to unpivot. Box 2: shop - Parameter id_vars: tuple, list, or ndarray, optional. Column(s) to use as identifier variables. Box 3: ['2017','2018'] - Parameter value_vars: tuple, list, or ndarray, optional. Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars. Example: df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'}, 'B': {0: 1, 1: 3, 2: 5}, 'C': {0: 2, 1: 4, 2: 6}}) pd.melt(df, id_vars=['A'], value_vars=['B', 'C']) returns the columns A, variable, value with the rows: 0 a B 1; 1 b B 3; 2 c B 5; 3 a C 2; 4 b C 4; 5 c C 6 Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html
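What the unpivot does to the sales table can be traced in plain Python, with no pandas required. The function melt below is a hypothetical stand-in written for this sketch, not the pandas implementation; it names the new column year directly, which in pandas would correspond to passing var_name='year'.

```python
# The wide table from the question, one dict per row.
sales_data = [
    {"shop": "Shop X", "2017": 34, "2018": 25},
    {"shop": "Shop Y", "2017": 65, "2018": 76},
    {"shop": "Shop Z", "2017": 48, "2018": 55},
]


def melt(rows, id_var, value_vars, var_name="year", value_name="value"):
    """Unpivot wide rows into long rows, one output row per (row, column)."""
    long_rows = []
    # Outer loop over columns so all 2017 rows come before the 2018 rows,
    # matching the row order of the target table in the question.
    for column in value_vars:
        for row in rows:
            long_rows.append(
                {id_var: row[id_var], var_name: column, value_name: row[column]}
            )
    return long_rows


long_data = melt(sales_data, id_var="shop", value_vars=["2017", "2018"])

for index, row in enumerate(long_data):
    print(index, row)
```

The identifier column (shop) is repeated on every output row, while each column named in value_vars contributes one block of rows.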

HOTSPOT - You collect data from a nearby weather station. You have a pandas dataframe named weather_df that includes the following data: Temp, Obs_time, Humid, Pressure, Vis, DaysSinceLastObs 74, 2019/10/2 00:00, 0.62, 29.87, 3, 0.5 89, 2019/10/2 12:00, 0.70, 28.88, 10, 0.5 72, 2019/10/3 00:00, 0.64, 30.00, 8, 0.5 80, 2019/10/3 12:00, 0.66, 29.75, 7, 0.5 The data is collected every 12 hours: noon and midnight. You plan to use automated machine learning to create a time-series model that predicts temperature over the next seven days. For the initial round of training, you want to train a maximum of 50 different models. You must use the Azure Machine Learning SDK to run an automated machine learning experiment to train these models. You need to configure the automated machine learning run. How should you complete the AutoMLConfig definition? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: automl_config = AutoMLConfig(task=" >>1<< ", training_data=weather_df, label_column_name=" >>2<< ", time_column_name=" >>2<<", max_horizon= >>3<<, iterations= >>3<<, iteration_timeout_minutes=5, primary_metric="r2_score") >>1<<: regression, forecasting, classification, deep learning >>2<<: humidity, pressure, visibility, temp, daysSinceLastObs >>3<<: 2, 6, 7, 12, 14, 50

Correct Answer: forecasting, temp, Obs_time, 7, 50 Alternate Answer: ..., 14, 50 (The documentation notes that max_horizon is deprecated: https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py) Box 1: forecasting - Task: The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve. Box 2: temp - label_column_name is the column to predict, here the temperature. The training data to be used within the experiment should contain both training features and a label column (optionally a sample weights column). Box 3: Obs_time (the observation time) - time_column_name: The name of the time column. This parameter is required when forecasting to specify the datetime column in the input data used for building the time series and inferring its frequency. This setting is being deprecated. Please use forecasting_parameters instead. Box 4: 7 - "predicts temperature over the next seven days" max_horizon: The desired maximum forecast horizon in units of time-series frequency. The default value is 1. Units are based on the time interval of your training data, e.g., monthly, weekly that the forecaster should predict out. When task type is forecasting, this parameter is required. Because the data arrives every 12 hours, the alternate answer counts seven days as 14 periods. Box 5: 50 - "For the initial round of training, you want to train a maximum of 50 different models." Iterations: The total number of different algorithm and parameter combinations to test during an automated ML experiment. Reference: https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig
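The two candidate values for the horizon, 7 and 14, differ only in their unit. The conversion can be sketched as follows, assuming the horizon is counted in observation periods as the max_horizon description above states; the variable names are illustrative.

```python
from datetime import timedelta

# Observations arrive at noon and midnight, i.e. every 12 hours.
observation_interval = timedelta(hours=12)

# The question asks for a forecast over the next seven days.
forecast_window = timedelta(days=7)

# Number of observation periods that fit into the forecast window.
periods = forecast_window // observation_interval

print(periods)
```

Counted in days the horizon is 7; counted in 12-hour observation periods it is the value computed here.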

DRAG DROP - You plan to explore demographic data for home ownership in various cities. The data is in a CSV file with the following format: age,city,income,home_owner 21,Chicago,50000,0 35,Seattle,120000,1 23,Seattle,65000,0 45,Seattle,130000,1 18,Chicago,48000,0 You need to run an experiment in your Azure Machine Learning workspace to explore the data and log the results. The experiment must log the following information: ✑ the number of observations in the dataset ✑ a box plot of income by home_owner ✑ a dictionary containing the city names and the average income for each city You need to use the appropriate logging methods of the experiment's run object to log the required information. How should you complete the code? To answer, drag the appropriate code segments to the correct locations. Each code segment may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Code Segments log log_list log_row log_table log_image from azureml.core import Experiment, Run import numpy as np import pandas as pd import matplotlib.pyplot as plt # Create an Azure ML experiment in workspace experiment = Experiment(workspace = ws, name = "demo-experiment") # Start logging data from the experiment run = experiment.start_logging() # load the dataset data = pd.read_csv('research/demographics.csv') # Log the number of observations row_count = (len(data)) run.>>________<<("observations", row_count) # Log box plot for income by owner fig = plt.figure(figsize=(9, 6)) ax = fig.gca() data.boxplot(column = 'income', by = "home_owner", ax = ax) ax.set_title('income by home_owner') ax.set_ylabel('income') run.>>________<<(name = 'income_by_home_owner', plot = fig) # Create a dataframe of mean income per city mean_inc_df = data.groupby('city')['income'].agg(np.mean).to_frame().reset_index() # Convert to a dictionary mean_inc_dict = mean_inc_df.to_dict('dict') # Log city names and average income dictionary run.>>________<<(name = "mean_income_by_city", value = mean_inc_dict) # Complete tracking and get link to details run.complete()

Correct Answer: log, log_image, log_table Box 1: log - The number of observations in the dataset. run.log(name, value, description='') Scalar values: Log a numerical or string value to the run with the given name. Logging a metric to a run causes that metric to be stored in the run record in the experiment. You can log the same metric multiple times within a run; the result is treated as a vector of that metric. Example: run.log("accuracy", 0.95) Box 2: log_image - A box plot of income by home_owner. log_image: Log an image to the run record. Use log_image to log a .PNG image file or a matplotlib plot to the run. These images will be visible and comparable in the run record. Example: run.log_image("ROC", plot=plt) Box 3: log_table - A dictionary containing the city names and the average income for each city. log_table: Log a dictionary object to the run with the given name.
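The values behind Box 1 and Box 3 can be worked out by hand. The following is a minimal sketch using only the Python standard library, not the Azure ML SDK. It assumes the fused sample rows read "23,Seattle,65000,0" and "45,Seattle,130000,1"; the names CSV_TEXT, incomes_by_city, and mean_income_table are illustrative and do not appear in the question. The box plot for Box 2 is not reproduced here.

```python
import csv
import io
from statistics import mean

# Sample rows from the question; the fused third and fourth rows are assumed
# to read "23,Seattle,65000,0" and "45,Seattle,130000,1".
CSV_TEXT = """age,city,income,home_owner
21,Chicago,50000,0
35,Seattle,120000,1
23,Seattle,65000,0
45,Seattle,130000,1
18,Chicago,48000,0
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Scalar value: in the experiment this would go to
# run.log("observations", row_count).
row_count = len(rows)

# Group incomes by city, then average each group.
incomes_by_city = {}
for row in rows:
    incomes_by_city.setdefault(row["city"], []).append(int(row["income"]))

# Column-oriented dictionary: in the experiment this would go to
# run.log_table(name="mean_income_by_city", value=mean_income_table).
mean_income_table = {
    "city": list(incomes_by_city),
    "income": [mean(values) for values in incomes_by_city.values()],
}

print(row_count)
print(mean_income_table)
```

A single number fits log, while a dictionary of columns fits log_table, which is why the two boxes take different methods.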

HOTSPOT - You are running Python code interactively in a Conda environment. The environment includes all required Azure Machine Learning SDK and MLflow packages. You must use MLflow to log metrics in an Azure Machine Learning experiment named mlflow-experiment. How should you complete the code? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: import mlflow from azureml.core import Workspace ws = Workspace.from_config() # Set the MLflow logging target >>1<< # Configure experiment >>2<< # Begin the experiment run with >>3<< # Log my_metric with value 1.00 >>4<< ('my_metric', 1.00) print("Finished!") >>1<< - mlflow.tracking.client = ws - mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri()) - mlflow.log_param('workspace', ws) >>2<< - mlflow-experiment = Run.get_context() - mlflow.get_run('mlflow-experiment') - mlflow.set_experiment('mlflow-experiment') >>3<< - mlflow.active_run - mlflow.start_run() - Run.get_context() >>4<< - run.log() - mlflow.log_metric - print

Correct Answer: mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri()), mlflow.set_experiment('mlflow-experiment'), mlflow.start_run(), mlflow.log_metric Box 1: mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri()) In the following code, the get_mlflow_tracking_uri() method assigns a unique tracking URI address to the workspace, ws, and set_tracking_uri() points the MLflow tracking URI to that address. mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri()) Box 2: mlflow.set_experiment(experiment_name) Set the MLflow experiment name with set_experiment() and start your training run with start_run(). Box 3: mlflow.start_run() Box 4: mlflow.log_metric - Then use log_metric() to activate the MLflow logging API and begin logging your training run metrics. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow
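Putting the four selections together gives the sketch below. Wrapping the code in a function named log_my_metric is an assumption made here so the snippet can be loaded without the Azure ML SDK or MLflow installed; the question runs the same statements at the top level. Calling the function requires both packages and a workspace configuration file.

```python
def log_my_metric() -> None:
    """Log my_metric = 1.00 to the Azure ML experiment 'mlflow-experiment'."""
    # Imports are kept inside the function so that defining it does not
    # require the Azure ML SDK and MLflow packages to be installed.
    import mlflow
    from azureml.core import Workspace

    ws = Workspace.from_config()

    # Box 1: set the MLflow logging target to the workspace.
    mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

    # Box 2: configure the experiment.
    mlflow.set_experiment("mlflow-experiment")

    # Box 3: begin the experiment run.
    with mlflow.start_run():
        # Box 4: log my_metric with value 1.00.
        mlflow.log_metric("my_metric", 1.00)

    print("Finished!")
```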

HOTSPOT - You are using the Azure Machine Learning Service to automate hyperparameter exploration of your neural network classification model. You must define the hyperparameter space to automatically tune hyperparameters using random sampling according to the following requirements: ✑ The learning rate must be selected from a normal distribution with a mean value of 10 and a standard deviation of 3. ✑ Batch size must be 16, 32, or 64. ✑ Keep probability must be a value selected from a uniform distribution in the range 0.05 to 0.1. You need to use the param_sampling method of the Python API for the Azure Machine Learning Service. How should you complete the code segment? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: from azureml.train.hyperdrive import RandomParameterSampling param_sampling = RandomParameterSampling( { "learning_rate": >1< , "batch_size": >2< , "keep_probability" : >3< } ) >1<: uniform(10, 3), normal(10, 3), choice(10, 3), loguniform(10, 3) >2<: choice(16, 32, 64), choice(range(16, 64)), normal(16, 32, 64), normal(range(16, 64)) >3<: choice(range(0.05, 0.1)), uniform(0.05, 0.1), normal(0.05, 0.1), lognormal(0.05, 0.1)

Correct Answer: normal(10, 3), choice(16, 32, 64), uniform(0.05, 0.1) Box 1: normal(10, 3) Box 2: choice(16, 32, 64) Box 3: uniform(0.05, 0.1) In random sampling, hyperparameter values are randomly selected from the defined search space. Random sampling allows the search space to include both discrete and continuous hyperparameters. Example: from azureml.train.hyperdrive import RandomParameterSampling, normal, uniform, choice param_sampling = RandomParameterSampling( { "learning_rate": normal(10, 3), "keep_probability": uniform(0.05, 0.1), "batch_size": choice(16, 32, 64) } ) Reference: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters
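To see what each expression draws from, the three distributions can be imitated with the standard library's random module. This is only an illustration of the distributions named in the requirements, not HyperDrive's implementation; the function name sample_hyperparameters and the seed value are assumptions.

```python
import random

# Fixed seed so repeated runs give the same draws (an arbitrary choice).
rng = random.Random(0)


def sample_hyperparameters(rng: random.Random) -> dict:
    """Draw one configuration from the search space in the question."""
    return {
        # normal(10, 3): mean 10, standard deviation 3.
        "learning_rate": rng.gauss(10, 3),
        # choice(16, 32, 64): one of the listed discrete values.
        "batch_size": rng.choice([16, 32, 64]),
        # uniform(0.05, 0.1): any value between the two bounds.
        "keep_probability": rng.uniform(0.05, 0.1),
    }


samples = [sample_hyperparameters(rng) for _ in range(5)]
for sample in samples:
    print(sample)
```

The batch size only ever takes one of three values, while the other two parameters are continuous, which matches the note that random sampling supports both discrete and continuous hyperparameters.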

HOTSPOT - Your Azure Machine Learning workspace has a dataset named real_estate_data. A sample of the data in the dataset follows. postal_code, num_bedrooms, sq_feet, garage, price 12345, 3, 1300, 0, 239000 54321, 1, 950, 0, 110000 12346, 2, 1200, 1, 150000 You want to use automated machine learning to find the best regression model for predicting the price column. You need to configure an automated machine learning experiment using the Azure Machine Learning SDK. How should you complete the code? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: from azureml.core import Workspace from azureml.core.compute import ComputeTarget from azureml.core.runconfig import RunConfiguration from azureml.train.automl import AutoMLConfig ws = Workspace.from_config() training_cluster = ComputeTarget(workspace=ws, name='aml-cluster1') real_estate_ds = ws.datasets.get('real_estate_data') split1_ds, split2_ds = real_estate_ds.random_split(percentage=0.7, seed=123) automl_run_config = RunConfiguration(framework="python") automl_config = AutoMLConfig( task='regression', compute_target=training_cluster, run_configuration=automl_run_config, primary_metric='r2_score', >1< = split1_ds, >2< = split2_ds, >3< = 'price') >1<: X, Y, X_valid, Y_valid, training_data >2<: X, Y, X_valid, Y_valid, validation_data >3<: y, y_valid, y_max, label_column_name, exclude_nan_labels

Correct Answer: training_data, validation_data, label_column_name Box 1: training_data - The training data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column). If training_data is specified, then the label_column_name parameter must also be specified. Box 2: validation_data - Provide validation data: In this case, you can either start with a single data file and split it into training and validation sets or you can provide a separate data file for the validation set. Either way, the validation_data parameter in your AutoMLConfig object assigns which data to use as your validation set. For example, the following code explicitly defines which portion of the provided data in dataset to use for training and validation. dataset = Dataset.Tabular.from_delimited_files(data) training_data, validation_data = dataset.random_split(percentage=0.8, seed=1) automl_config = AutoMLConfig(compute_target = aml_remote_compute, task = 'classification', primary_metric = 'AUC_weighted', training_data = training_data, validation_data = validation_data, label_column_name = 'Class') Box 3: label_column_name - The name of the label column. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers. This parameter is applicable to training_data and validation_data parameters. Incorrect Answers: X: The training features to use when fitting pipelines during an experiment. This setting is being deprecated. Please use training_data and label_column_name instead. Y: The training labels to use when fitting pipelines during an experiment. This is the value your model will predict. This setting is being deprecated. Please use training_data and label_column_name instead. X_valid: Validation features to use when fitting pipelines during an experiment. If specified, then y_valid or sample_weight_valid must also be specified. 
Y_valid: Validation labels to use when fitting pipelines during an experiment. Both X_valid and y_valid must be specified together. exclude_nan_labels: Whether to exclude rows with NaN values in the label. The default is True. y_max: (float) Maximum value of y for a regression experiment. The combination of y_min and y_max is used to normalize test set metrics based on the input data range. If not specified, the maximum value is inferred from the data. Reference: https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py
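The idea of a seeded random split into training and validation data can be sketched without the SDK. The helper below is hypothetical and is not how Azure ML implements random_split; it only shows that a percentage controls the approximate share of rows in the first split and that a fixed seed makes the split repeatable.

```python
import random


def random_split(rows, percentage, seed):
    """Hypothetical helper: send each row to the first split with the given
    probability, so the first split holds roughly `percentage` of the rows."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < percentage:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # stand-in for dataset rows
training_rows, validation_rows = random_split(rows, percentage=0.7, seed=123)

print(len(training_rows), len(validation_rows))
```

In the question, the two resulting splits play the roles of training_data and validation_data, and label_column_name tells AutoML which column in both holds the value to predict.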

HOTSPOT - You have a dataset that includes home sales data for a city. The dataset includes the following columns (Name: Description). Price: The sales price for the house. Bedrooms: The number of bedrooms in the house. Size: The size of the house in square feet. HasGarage: A binary value indicating a garage or not. HomeType: The category of the home (apartment, townhouse, single-family home). Each row in the dataset corresponds to an individual home sales transaction. You need to use automated machine learning to generate the best model for predicting the sales price based on the features of the house. Which values should you use? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Prediction task: Classification, Forecasting, Regression, Outlier Target column: Price, Bedrooms, Size, HasGarage, HomeType

Correct Answers: Regression, Price Box 1: Regression - Regression is a supervised machine learning technique used to predict numeric values. Box 2: Price - The model must predict the sales price, so Price is the target column. Reference: https://docs.microsoft.com/en-us/learn/modules/create-regression-model-azure-machine-learning-designer

HOTSPOT - You plan to preprocess text from CSV files. You load the Azure Machine Learning Studio default stop words list. You need to configure the Preprocess Text module to meet the following requirements: ✑ Ensure that multiple related words are converted to a single canonical form. ✑ Remove pipe characters from text. ✑ Remove words to optimize information retrieval. Which three options should you select? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Language: English Remove by part of speech: False Text column to clean: - Remove stop words - Lemmatization - Detect Sentences - Normalize case to lowercase - Remove numbers - Remove special characters - Remove duplicate characters - Remove email addresses - Remove URLs - Expand verb contractions - Normalize backslashes to slashes - Split tokens on special characters

Correct Answers: Remove stop words, Lemmatization, Remove special characters Box 1: Remove stop words - Remove words to optimize information retrieval. Remove stop words: Select this option if you want to apply a predefined stopword list to the text column. Stop word removal is performed before any other processes. Box 2: Lemmatization - Ensure that multiple related words are converted to a single canonical form. Lemmatization converts multiple related words to a single canonical form. Box 3: Remove special characters - Remove special characters: Use this option to replace any non-alphanumeric special characters with the pipe | character. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/preprocess-text
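The three requirements can be illustrated with a toy pipeline in plain Python. This is a rough sketch, not the Preprocess Text module: the stop word set and lemma table are tiny hypothetical stand-ins for the module's built-in resources, and this sketch simply deletes special characters such as the pipe. Stop word removal runs first, as the answer notes.

```python
import re

# Hypothetical, tiny stand-ins for the module's stop word list and lemmatizer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}
LEMMAS = {"running": "run", "ran": "run", "houses": "house", "sold": "sell"}


def preprocess(text: str) -> str:
    tokens = text.split()
    # 1. Remove stop words (performed before the other steps).
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    # 2. Lemmatization: map related word forms to one canonical form.
    tokens = [LEMMAS.get(t.lower(), t) for t in tokens]
    # 3. Remove special characters: drop anything that is not a letter,
    #    a digit, or whitespace (this includes the pipe character).
    cleaned = re.sub(r"[^0-9A-Za-z\s]", "", " ".join(tokens))
    # Collapse the extra whitespace left behind by removed characters.
    return " ".join(cleaned.split())


result = preprocess("The houses | are running | to the market")
print(result)  # house run market
```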

