DP-100 Data Science Questions Topic 1

You plan to provision an Azure Machine Learning Basic edition workspace for a data science project. You need to identify the tasks you will be able to perform in the workspace. Which three tasks will you be able to perform? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. Create a Compute Instance and use it to run code in Jupyter notebooks. B. Create an Azure Kubernetes Service (AKS) inference cluster. C. Use the designer to train a model by dragging and dropping pre-defined modules. D. Create a tabular dataset that supports versioning. E. Use the Automated Machine Learning user interface to train a model.

Correct Answer: ABD Incorrect Answers: C, E: The designer and the Automated Machine Learning user interface are included in the Enterprise edition only. Reference: https://azure.microsoft.com/en-us/pricing/details/machine-learning/

You are creating a new experiment in Azure Machine Learning Studio. You have a small dataset that has missing values in many columns. The data does not require the application of predictors for each column. You plan to use the Clean Missing Data module. You need to select a data cleaning method. Which method should you use? A. Replace using Probabilistic PCA B. Normalization C. Synthetic Minority Oversampling Technique (SMOTE) D. Replace using MICE

Correct Answer: A Replace using Probabilistic PCA: Compared to other options, such as Multiple Imputation using Chained Equations (MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it approximates the covariance for the full dataset. Therefore, it might offer better performance for datasets that have missing values in many columns. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data

DRAG DROP - You are building an intelligent solution using machine learning models. The environment must support the following requirements: ✑ Data scientists must build notebooks in a cloud environment ✑ Data scientists must use automatic feature engineering and model building in machine learning pipelines. ✑ Notebooks must be deployed to retrain using Spark instances with dynamic worker allocation. ✑ Notebooks must be exportable to be version controlled locally. You need to create the environment. Which four actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: - Install the Azure Machine Learning SDK for Python on the cluster - When the cluster is ready, export Zeppelin notebooks to a local environment - Create and execute a Jupyter notebook by using automated machine learning (AutoML) on the cluster - Install Microsoft Machine Learning for Apache Spark - When the cluster is ready and has processed the notebook, export your Jupyter notebook to a local environment - Create an Azure HDInsight cluster to include the Apache Spark MLlib library - Create and execute the Zeppelin notebooks on the cluster - Create an Azure Databricks cluster

Correct Answer: - Create an Azure HDInsight cluster to include the Apache Spark MLlib library - Install Microsoft Machine Learning for Apache Spark - Create and execute the Zeppelin notebooks on the cluster - When the cluster is ready, export Zeppelin notebooks to a local environment Alternate Answer: YES (- Create an Azure Databricks cluster - Install the Azure ML SDK for Python - Create and execute a Jupyter notebook using AutoML - Export the Jupyter notebook to a local environment; https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) Step 1: Create an Azure HDInsight cluster to include the Apache Spark MLlib library Step 2: Install Microsoft Machine Learning for Apache Spark You install AzureML on your Azure HDInsight cluster. Microsoft Machine Learning for Apache Spark (MMLSpark) provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets. Step 3: Create and execute the Zeppelin notebooks on the cluster Step 4: When the cluster is ready, export Zeppelin notebooks to a local environment. Notebooks must be exportable to be version controlled locally. Reference: https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-zeppelin-notebook https://azuremlbuild.blob.core.windows.net/pysparkapi/intro.html
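If you follow the alternate Databricks-based sequence, the AutoML step is driven from the SDK. A minimal sketch, assuming a registered training dataset, an existing compute target, and illustrative names throughout:

from azureml.core import Workspace, Dataset, Experiment
from azureml.core.compute import ComputeTarget
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
train_ds = Dataset.get_by_name(ws, name='sales_training')   # hypothetical dataset
compute = ComputeTarget(workspace=ws, name='aml-cluster')   # hypothetical cluster

# AutoML performs the automatic feature engineering and model building
automl_config = AutoMLConfig(task='classification',
                             training_data=train_ds,
                             label_column_name='label',     # assumed target column
                             primary_metric='AUC_weighted',
                             compute_target=compute)
run = Experiment(ws, 'automl-demo').submit(automl_config)
run.wait_for_completion(show_output=True)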

HOTSPOT - You are developing a deep learning model by using TensorFlow. You plan to run the model training workload on an Azure Machine Learning Compute Instance. You must use CUDA-based model training. You need to provision the Compute Instance. Which two virtual machine sizes can you use? To answer, select the appropriate virtual machine sizes in the answer area. NOTE: Each correct selection is worth one point. Hot Area (virtual machine size: vCPUs, GPUs, memory, disk): BASIC_A0: 1 vCPU, 0 GPUs, 0.75 GB RAM, 20 GB disk; STANDARD_D3_V2: 4 vCPUs, 0 GPUs, 14 GB RAM, 200 GB disk; STANDARD_E64_V3: 64 vCPUs, 0 GPUs, 432 GB RAM, 1,600 GB disk; STANDARD_M64LS: 64 vCPUs, 0 GPUs, 512 GB RAM, 2,000 GB disk; STANDARD_NC12: 12 vCPUs, 2 GPUs, 112 GB RAM, 680 GB disk; STANDARD_NC24: 24 vCPUs, 4 GPUs, 224 GB RAM, 1,440 GB disk

Correct Answer: STANDARD_NC12, 12, 2, 112 GB, 680 GB; STANDARD_NC24, 24, 4, 224 GB, 1440 GB CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation. Reference: https://www.infoworld.com/article/3299703/what-is-cuda-parallel-programming-for-gpus.html
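A minimal SDK sketch of provisioning a CUDA-capable Compute Instance with one of these sizes; the workspace config and instance name are assumptions:

from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

ws = Workspace.from_config()
# NC-series sizes carry NVIDIA GPUs, which CUDA-based training requires
config = ComputeInstance.provisioning_configuration(vm_size='STANDARD_NC12')
instance = ComputeTarget.create(ws, 'gpu-instance', config)
instance.wait_for_completion(show_output=True)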

You plan to build a team data science environment. Data for training models in machine learning pipelines will be over 20 GB in size. You have the following requirements: ✑ Models must be built using Caffe2 or Chainer frameworks. ✑ Data scientists must be able to use a data science environment to build the machine learning pipelines and train models on their personal devices in both connected and disconnected network environments. Personal devices must support updating machine learning pipelines when connected to a network. You need to select a data science environment. Which environment should you use? A. Azure Machine Learning Service B. Azure Machine Learning Studio C. Azure Databricks D. Azure Kubernetes Service (AKS)

Correct Answer: A The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft's Azure cloud built specifically for doing data science. Caffe2 and Chainer are supported by DSVM. DSVM integrates with Azure Machine Learning. Incorrect Answers: B: Use Machine Learning Studio when you want to experiment with machine learning models quickly and easily, and the built-in machine learning algorithms are sufficient for your solutions. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are analyzing a numerical dataset which contains missing values in several columns. You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set. You need to analyze a full dataset to include all values. Solution: Replace each missing value using the Multiple Imputation by Chained Equations (MICE) method. Does the solution meet the goal? A. Yes B. No

Correct Answer: A Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as "Multivariate Imputation using Chained Equations" or "Multiple Imputation by Chained Equations". With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values. Note: Multivariate imputation by chained equations (MICE), sometimes called "fully conditional specification" or "sequential regression multiple imputation", has emerged in the statistical literature as one principled method of addressing missing data. Creating multiple imputations, as opposed to single imputations, accounts for the statistical uncertainty in the imputations. In addition, the chained equations approach is very flexible and can handle variables of varying types (e.g., continuous or binary) as well as complexities such as bounds or survey skip patterns. Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/ https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data
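Outside Studio, the same chained-equations idea is available in scikit-learn as IterativeImputer; this is an analogue of MICE, not the Studio module itself. A minimal sketch:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 4.0]])
# Each column with missing values is modeled on the other columns,
# so the feature set keeps its original dimensionality
X_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)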

You create an Azure Machine Learning workspace. You must create a custom role named DataScientist that meets the following requirements: ✑ Role members must not be able to delete the workspace. ✑ Role members must not be able to create, update, or delete compute resources in the workspace. ✑ Role members must not be able to add new users to the workspace. You need to create a JSON file for the DataScientist role in the Azure Machine Learning workspace. The custom role must enforce the restrictions specified by the IT Operations team. Which JSON code segment should you use? A. ... "Actions":["*"], "NotActions": [ >4 MS statements< ] B. ... "Actions":["*"], "NotActions": [] C. ... "Actions": [ > 4 MS statements< ], "NotActions":["*"] D. ... "Actions": [], "NotActions": ["*"]

Correct Answer: A The following custom role can do everything in the workspace except for the following actions: ✑ It can't create or update a compute resource. ✑ It can't delete a compute resource. ✑ It can't add, delete, or alter role assignments. ✑ It can't delete the workspace. To create a custom role, first construct a role definition JSON file that specifies the permission and scope for the role. The following example defines a custom role named "Data Scientist Custom" scoped at a specific workspace level, in data_scientist_custom_role.json:
{
  "Name": "Data Scientist Custom",
  "IsCustom": true,
  "Description": "Can run experiment but can't create or delete compute.",
  "Actions": ["*"],
  "NotActions": [
    "Microsoft.MachineLearningServices/workspaces/*/delete",
    "Microsoft.MachineLearningServices/workspaces/write",
    "Microsoft.MachineLearningServices/workspaces/computes/*/write",
    "Microsoft.MachineLearningServices/workspaces/computes/*/delete",
    "Microsoft.Authorization/*/write"
  ],
  "AssignableScopes": [
    "/subscriptions/<subscription_id>/resourceGroups/<resource_group_name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace_name>"
  ]
}
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-assign-roles

You create an Azure Machine Learning compute resource to train models. The compute resource is configured as follows: ✑ Minimum nodes: 2 ✑ Maximum nodes: 4 You must decrease the minimum number of nodes and increase the maximum number of nodes to the following values: ✑ Minimum nodes: 0 ✑ Maximum nodes: 8 You need to reconfigure the compute resource. What are three possible ways to achieve this goal? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. Use the Azure Machine Learning studio. B. Run the update method of the AmlCompute class in the Python SDK. C. Use the Azure portal. D. Use the Azure Machine Learning designer. E. Run the refresh_state() method of the BatchCompute class in the Python SDK.

Correct Answer: ABC A: You can manage assets and resources in the Azure Machine Learning studio. B: The update(min_nodes=None, max_nodes=None, idle_seconds_before_scaledown=None) method of the AmlCompute class updates the ScaleSettings for this AmlCompute target. C: To change the nodes in the cluster, use the UI for your cluster in the Azure portal. Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute(class)
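For option B, a minimal SDK sketch; the cluster name is an assumption:

from azureml.core import Workspace
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()
cluster = ComputeTarget(workspace=ws, name='aml-cluster')  # existing AmlCompute cluster
# Apply the new scale settings without recreating the cluster
cluster.update(min_nodes=0, max_nodes=8)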

You are developing deep learning models to analyze semi-structured, unstructured, and structured data types. You have the following data available for model building: ✑ Video recordings of sporting events ✑ Transcripts of radio commentary about events ✑ Logs from related social media feeds captured during sporting events You need to select an environment for creating the model. Which environment should you use? A. Azure Cognitive Services B. Azure Data Lake Analytics C. Azure HDInsight with Spark MLlib D. Azure Machine Learning Studio

Correct Answer: A Alternative Answer: C (disputed) Azure Cognitive Services expand on Microsoft's evolving portfolio of machine learning APIs and enable developers to easily add cognitive features, such as emotion and video detection; facial, speech, and vision recognition; and speech and language understanding, into their applications. The goal of Azure Cognitive Services is to help developers create applications that can see, hear, speak, understand, and even begin to reason. The catalog of services within Azure Cognitive Services can be categorized into five main pillars - Vision, Speech, Language, Search, and Knowledge. Reference: https://docs.microsoft.com/en-us/azure/cognitive-services/welcome https://docs.microsoft.com/en-us/azure/cognitive-services/big-data/cognitive-services-for-big-data

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are using Azure Machine Learning Studio to perform feature engineering on a dataset. You need to normalize values to produce a feature column grouped into bins. Solution: Apply an Entropy Minimum Description Length (MDL) binning mode. Does the solution meet the goal? A. Yes B. No

Correct Answer: A Alternative Answer: B (some argue Quantiles binning with quantile normalization is required instead) Entropy MDL binning mode: This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column. It then returns the bin number associated with each row of your data in a column named <colname>quantized. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/group-data-into-bins

You are implementing a machine learning model to predict stock prices. The model uses a PostgreSQL database and requires GPU processing. You need to create a virtual machine that is pre-configured with the required tools. What should you do? A. Create a Data Science Virtual Machine (DSVM) Windows edition. B. Create a Geo AI Data Science Virtual Machine (Geo-DSVM) Windows edition. C. Create a Deep Learning Virtual Machine (DLVM) Linux edition. D. Create a Deep Learning Virtual Machine (DLVM) Windows edition.

Correct Answer: A In the DSVM, your training models can use deep learning algorithms on hardware that's based on graphics processing units (GPUs). PostgreSQL is available for the following operating systems: Linux (all recent distributions), macOS (64-bit installers available for OS X version 10.6 and newer), and Windows (installers available for the 64-bit version; tested on the latest versions and back to Windows 2012 R2). Incorrect Answers: B: The Azure Geo AI Data Science VM (Geo-DSVM) delivers geospatial analytics capabilities from Microsoft's Data Science VM. Specifically, this VM extends the AI and data science toolkits in the Data Science VM by adding ESRI's market-leading ArcGIS Pro Geographic Information System. C, D: DLVM is a template on top of the DSVM image. In terms of the packages, GPU drivers etc. are all there in the DSVM image. Mostly it is for convenience during creation, where we only allow DLVM to be created on GPU VM instances on Azure. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview

HOTSPOT - You have a dataset that contains 2,000 rows. You are building a machine learning classification model by using Azure Machine Learning Studio. You add a Partition and Sample module to the experiment. You need to configure the module. You must meet the following requirements: ✑ Divide the data into subsets ✑ Assign the rows into folds using a round-robin method ✑ Allow rows in the dataset to be reused How should you configure the module? To answer, select the appropriate options in the dialog box in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Partition or sample mode - Assign to Folds, Pick Fold, Sampling, or Head; Use replacement in the partitioning - Yes or No; Randomized split - Yes or No

Correct Answer: Assign to Folds, YES to Use replacement in the partitioning, NO to Randomized split Use the Split data into partitions option when you want to divide the dataset into subsets of the data. This option is also useful when you want to create a custom number of folds for cross-validation, or to split rows into several groups. 1. Add the Partition and Sample module to your experiment in Studio (classic), and connect the dataset. 2. For Partition or sample mode, select Assign to Folds. 3. Use replacement in the partitioning: Select this option if you want the sampled row to be put back into the pool of rows for potential reuse. As a result, the same row might be assigned to several folds. 4. If you do not use replacement (the default option), the sampled row is not put back into the pool of rows for potential reuse. As a result, each row can be assigned to only one fold. 5. Randomized split: Select this option if you want rows to be randomly assigned to folds. If you do not select this option, rows are assigned to folds using the round-robin method. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/partition-and-sample

Your team is building a data engineering and data science development environment. The environment must support the following requirements: ✑ support Python and Scala ✑ compose data storage, movement, and processing services into automated data pipelines ✑ the same tool should be used for the orchestration of both data engineering and data science ✑ support workload isolation and interactive workloads ✑ enable scaling across a cluster of machines You need to create the environment. What should you do? A. Build the environment in Apache Hive for HDInsight and use Azure Data Factory for orchestration. B. Build the environment in Azure Databricks and use Azure Data Factory for orchestration. C. Build the environment in Apache Spark for HDInsight and use Azure Container Instances for orchestration. D. Build the environment in Azure Databricks and use Azure Container Instances for orchestration.

Correct Answer: B In Azure Databricks, we can create two different types of clusters: ✑ Standard: these are the default clusters and can be used with Python, R, Scala, and SQL ✑ High-concurrency clusters. Azure Databricks is fully integrated with Azure Data Factory. Incorrect Answers: D: Azure Container Instances is good for development or testing. Not suitable for production workloads. Reference: https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are creating a new experiment in Azure Machine Learning Studio. One class has a much smaller number of observations than the other classes in the training set. You need to select an appropriate data sampling strategy to compensate for the class imbalance. Solution: You use the Scale and Reduce sampling mode. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Instead use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode. Note: SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases. Incorrect Answers: Common data tasks for the Scale and Reduce sampling mode include clipping, binning, and normalizing numerical values. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/data-transformation-scale-and-reduce

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are a data scientist using Azure Machine Learning Studio. You need to normalize values to produce an output column into bins to predict a target column. Solution: Apply a Quantiles normalization with a QuantileIndex normalization. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Alternative Answer: A (some argue quantile normalization meets the goal) Use the Entropy MDL binning mode, which has a target column. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/group-data-into-bins

A set of CSV files contains sales records. All the CSV files have the same data schema. Each CSV file contains the sales record for a particular month and has the filename sales.csv. Each file is stored in a folder that indicates the month and year when the data was recorded. The folders are in an Azure blob container for which a datastore has been defined in an Azure Machine Learning workspace. The folders are organized in a parent folder named sales to create the following hierarchical structure: /sales -/01-2019 --/sales.csv -/02-2019 --/sales.csv -/03-2019 --/sales.csv At the end of each month, a new folder with that month's sales file is added to the sales folder. You plan to use the sales data to train a machine learning model based on the following requirements: ✑ You must define a dataset that loads all of the sales data to date into a structure that can be easily converted to a dataframe. ✑ You must be able to create experiments that use only data that was created before a specific previous month, ignoring any data that was added after that month. ✑ You must register the minimum number of datasets possible. You need to register the sales data as a dataset in Azure Machine Learning service workspace. What should you do? A. Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file every month. Register the dataset with the name sales_dataset each month, replacing the existing dataset and specifying a tag named month indicating the month and year it was registered. Use this dataset for all experiments. B. Create a tabular dataset that references the datastore and specifies the path 'sales/*/sales.csv', register the dataset with the name sales_dataset and a tag named month indicating the month and year it was registered, and use this dataset for all experiments. C. Create a new tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file every month. Register the dataset with the name sales_dataset_MM-YYYY each month with appropriate MM and YYYY values for the month and year. Use the appropriate month-specific dataset for experiments. D. Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file. Register the dataset with the name sales_dataset each month as a new version and with a tag named month indicating the month and year it was registered. Use this dataset for all experiments, identifying the version to be used based on the month tag as necessary.

Correct Answer: B Specify the path. Example: The following code gets the existing workspace and the desired datastore by name, and then passes the datastore and file locations to the path parameter to create a new TabularDataset, weather_ds:
from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'
# get existing workspace
workspace = Workspace.from_config()
# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)
# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
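Applied to the sales scenario, a hedged sketch of option B; the datastore name is an assumption. The wildcard picks up each new month's folder automatically, and registering with a month tag records when the data was captured:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, 'sales_datastore')   # assumed datastore name
# 'sales/*/sales.csv' matches every mm-yyyy subfolder, current and future
sales_ds = Dataset.Tabular.from_delimited_files(path=(datastore, 'sales/*/sales.csv'))
sales_ds = sales_ds.register(workspace=ws,
                             name='sales_dataset',
                             tags={'month': '03-2019'},   # illustrative tag value
                             create_new_version=True)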

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are analyzing a numerical dataset which contains missing values in several columns. You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set. You need to analyze a full dataset to include all values. Solution: Use the Last Observation Carried Forward (LOCF) method to impute the missing data points. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Alternative Answer: A Instead use the Multiple Imputation by Chained Equations (MICE) method. Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as "Multivariate Imputation using Chained Equations" or "Multiple Imputation by Chained Equations". With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values. Note: Last observation carried forward (LOCF) is a method of imputing missing data in longitudinal studies. If a person drops out of a study before it ends, then his or her last observed score on the dependent variable is used for all subsequent (i.e., missing) observation points. LOCF is used to maintain the sample size and to reduce the bias caused by the attrition of participants in a study. Reference: https://methods.sagepub.com/reference/encyc-of-research-design/n211.xml https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
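For contrast, LOCF is essentially what a pandas forward fill does; a minimal sketch showing why it suits ordered longitudinal data rather than a general numerical dataset:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
# Every gap takes the last observed value, ignoring any relationship
# between columns: [1.0, 1.0, 1.0, 4.0]
print(s.ffill().tolist())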

You plan to create a speech recognition deep learning model. The model must support the latest version of Python. You need to recommend a deep learning framework for speech recognition to include in the Data Science Virtual Machine (DSVM). What should you recommend? A. Rattle B. TensorFlow C. Weka D. Scikit-learn

Correct Answer: B TensorFlow is an open-source library for numerical computation and large-scale machine learning. It uses Python to provide a convenient front-end API for building applications with the framework. TensorFlow can train and run deep neural networks for handwritten digit classification, image recognition, word embeddings, recurrent neural networks, sequence-to-sequence models for machine translation, natural language processing, and PDE (partial differential equation) based simulations. Incorrect Answers: A: Rattle is the R analytical tool that gets you started with data analytics and machine learning. C: Weka is used for visual data mining and machine learning software in Java. D: Scikit-learn is one of the most useful libraries for machine learning in Python. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction. Reference: https://www.infoworld.com/article/3278008/what-is-tensorflow-the-machine-learning-library-explained.html

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are a data scientist using Azure Machine Learning Studio. You need to normalize values to produce an output column into bins to predict a target column. Solution: Apply a Quantiles binning mode with a PQuantile normalization. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Use the Entropy MDL binning mode which has a target column. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/group-data-into-bins

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are a data scientist using Azure Machine Learning Studio. You need to normalize values to produce an output column into bins to predict a target column. Solution: Apply an Equal Width with Custom Start and Stop binning mode. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Use the Entropy MDL binning mode which has a target column. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/group-data-into-bins

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are analyzing a numerical dataset which contains missing values in several columns. You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set. You need to analyze a full dataset to include all values. Solution: Calculate the column median value and use the median value as the replacement for any missing value in the column. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Use the Multiple Imputation by Chained Equations (MICE) method. Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/ https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are analyzing a numerical dataset which contains missing values in several columns. You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set. You need to analyze a full dataset to include all values. Solution: Remove the entire column that contains the missing data point. Does the solution meet the goal? A. Yes B. No

Correct Answer: B Use the Multiple Imputation by Chained Equations (MICE) method. Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/ https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data

You create an Azure Machine Learning workspace. You are preparing a local Python environment on a laptop computer. You want to use the laptop to connect to the workspace and run experiments. You create the following config.json file: {"workspace_name": "ml-workspace"} You must use the Azure Machine Learning SDK to interact with data and experiments in the workspace. You need to configure the config.json file to connect to the workspace from the Python environment. Which two additional parameters must you add to the config.json file in order to connect to the workspace? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. login B. resource_group C. subscription_id D. key E. region

Correct Answer: BC To use the same workspace in multiple environments, create a JSON configuration file. The configuration file saves your subscription (subscription_id), resource group (resource_group), and workspace name so that it can be easily loaded. The following sample shows how to create a workspace:
from azureml.core import Workspace

ws = Workspace.create(name='myworkspace',
                      subscription_id='<azure-subscription-id>',
                      resource_group='myresourcegroup',
                      create_resource_group=True,
                      location='eastus2')
Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace
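In this scenario the workspace already exists, so only the configuration file needs completing; a minimal sketch with placeholder values:

from azureml.core import Workspace

# config.json must contain all three keys:
# {
#     "subscription_id": "<azure-subscription-id>",
#     "resource_group": "<resource-group-name>",
#     "workspace_name": "ml-workspace"
# }
ws = Workspace.from_config(path='config.json')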

You create a new Azure subscription. No resources are provisioned in the subscription. You need to create an Azure Machine Learning workspace. What are three possible ways to achieve this goal? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. Run Python code that uses the Azure ML SDK library and calls the Workspace.create method with name, subscription_id, resource_group, and location parameters. B. Use an Azure Resource Manager template that includes a Microsoft.MachineLearningServices/workspaces resource and its dependencies. C. Use the Azure Command Line Interface (CLI) with the Azure Machine Learning extension to call the az group create function with -name and -location parameters, and then the az ml workspace create function, specifying -w and -g parameters for the workspace name and resource group. D. Navigate to Azure Machine Learning studio and create a workspace. E. Run Python code that uses the Azure ML SDK library and calls the Workspace.get method with name, subscription_id, and resource_group parameters.

Correct Answer: BCD B: You can create a workspace with an Azure Resource Manager template that declares a Microsoft.MachineLearningServices/workspaces resource. C: You can create a workspace for Azure Machine Learning with the Azure CLI. Install the machine learning extension. Create a resource group: az group create --name <resource-group-name> --location <location> To create a new workspace where the services are automatically created, use the following command: az ml workspace create -w <workspace-name> -g <resource-group-name> D: You can create and manage Azure Machine Learning workspaces from the studio or the Azure portal. 1. Sign in to the Azure portal by using the credentials for your Azure subscription. 2. In the upper-left corner of Azure portal, select + Create a resource. 3. Use the search bar to find Machine Learning. 4. Select Machine Learning. 5. In the Machine Learning pane, select Create to begin. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-workspace-template https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace-cli https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace

You must store data in Azure Blob Storage to support Azure Machine Learning. You need to transfer the data into Azure Blob Storage. What are three possible ways to achieve the goal? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. Bulk Insert SQL Query B. AzCopy C. Python script D. Azure Storage Explorer E. Bulk Copy Program (BCP)

Correct Answer: BCD You can move data to and from Azure Blob storage using different technologies: ✑ Azure Storage Explorer ✑ AzCopy ✑ Python ✑ SSIS Reference: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/move-azure-blob
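For option C, a minimal sketch using the azure-storage-blob package; the connection string, container, and file names are placeholders:

from azure.storage.blob import BlobServiceClient

conn_str = '<storage-account-connection-string>'
service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container='training-data', blob='sales.csv')
# Upload the local file into the blob container
with open('sales.csv', 'rb') as data:
    blob.upload_blob(data, overwrite=True)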

You are analyzing a dataset by using Azure Machine Learning Studio. You need to generate a statistical summary that contains the p-value and the unique count for each feature column. Which two modules can you use? Each correct answer presents a complete solution. NOTE: Each correct selection is worth one point. A. Compute Linear Correlation B. Export Count Table C. Execute Python Script D. Convert to Indicator Values E. Summarize Data

Correct Answer: BE The Export Count Table module is provided for backward compatibility with experiments that use the Build Count Table (deprecated) and Count Featurizer (deprecated) modules. E: Summarize Data statistics are useful when you want to understand the characteristics of the complete dataset. For example, you might need to know: ✑ How many missing values are there in each column? ✑ How many unique values are there in a feature column? ✑ What is the mean and standard deviation for each column? The module calculates the importance scores for each column, and returns a row of summary statistics for each variable (data column) provided as input. Incorrect Answers: A: The Compute Linear Correlation module in Azure Machine Learning Studio is used to compute a set of Pearson correlation coefficients for each possible pair of variables in the input dataset. C: With Python, you can perform tasks that aren't currently supported by existing Studio modules, such as: visualizing data using matplotlib; using Python libraries to enumerate datasets and models in your workspace; reading, loading, and manipulating data from sources not supported by the Import Data module. D: The purpose of the Convert to Indicator Values module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/export-count-table https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/summarize-data

You plan to use a Deep Learning Virtual Machine (DLVM) to train deep learning models using Compute Unified Device Architecture (CUDA) computations. You need to configure the DLVM to support CUDA. What should you implement? A. Solid State Drives (SSD) B. Central Processing Unit (CPU) speed increase by using overclocking C. Graphics Processing Unit (GPU) D. High Random Access Memory (RAM) configuration E. Intel Software Guard Extensions (Intel SGX) technology

Correct Answer: C A Deep Learning Virtual Machine is a pre-configured environment for deep learning using GPU instances. Reference: https://azuremarketplace.microsoft.com/en-au/marketplace/apps/microsoft-ads.dsvm-deep-learning

You are developing a data science workspace that uses an Azure Machine Learning service. You need to select a compute target to deploy the workspace. What should you use? A. Azure Data Lake Analytics B. Azure Databricks C. Azure Container Service D. Apache Spark for HDInsight

Correct Answer: C Alternative Answer: Azure Container Service has since been retired, so arguably B now. Azure Container Instances can be used as a compute target for testing or development. Use it for low-scale CPU-based workloads that require less than 48 GB of RAM. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where

You use Azure Machine Learning Studio to build a machine learning experiment. You need to divide data into two distinct datasets. Which module should you use? A. Assign Data to Clusters B. Load Trained Model C. Partition and Sample D. Tune Model Hyperparameters

Correct Answer: C Partition and Sample with the Stratified split option outputs multiple datasets, partitioned using the rules you specified. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/partition-and-sample

You plan to deliver a hands-on workshop to several students. The workshop will focus on creating data visualizations using Python. Each student will use a device that has internet access. Student devices are not configured for Python development. Students do not have administrator access to install software on their devices. Azure subscriptions are not available for students. You need to ensure that students can run Python-based data visualization code. Which Azure tool should you use? A. Anaconda Data Science Platform B. Azure BatchAI C. Azure Notebooks D. Azure Machine Learning Service

Correct Answer: C Reference: https://notebooks.azure.com/

You are creating a machine learning model. You have a dataset that contains null rows. You need to use the Clean Missing Data module in Azure Machine Learning Studio to identify and resolve the null and missing data in the dataset. Which parameter should you use? A. Replace with mean B. Remove entire column C. Remove entire row D. Hot Deck E. Custom substitution value F. Replace with mode

Correct Answer: C Remove entire row: Completely removes any row in the dataset that has one or more missing values. This is useful if the missing value can be considered randomly missing. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data

You are moving a large dataset from Azure Machine Learning Studio to a Weka environment. You need to format the data for the Weka environment. Which module should you use? A. Convert to CSV B. Convert to Dataset C. Convert to ARFF D. Convert to SVMLight

Correct Answer: C Use the Convert to ARFF module in Azure Machine Learning Studio, to convert datasets and results in Azure Machine Learning to the attribute-relation file format used by the Weka toolset. This format is known as ARFF. The ARFF data specification for Weka supports multiple machine learning tasks, including data preprocessing, classification, and feature selection. In this format, data is organized by entities and their attributes, and is contained in a single text file. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/convert-to-arff

You are working with a time series dataset in Azure Machine Learning Studio. You need to split your dataset into training and testing subsets by using the Split Data module. Which splitting mode should you use? A. Recommender Split B. Regular Expression Split C. Relative Expression Split D. Split Rows with the Randomized split parameter set to true

Correct Answer: D Alternative Answer: C YES (see the Relative Expression Split section of https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/split-data) Split Rows: Use this option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split, but by default, the data is divided 50-50. Incorrect Answers: B: Regular Expression Split: Choose this option when you want to divide your dataset by testing a single column for a value. C: Relative Expression Split: Use this option whenever you want to apply a condition to a number column. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data

You are developing a hands-on workshop to introduce Docker for Windows to attendees. You need to ensure that workshop attendees can install Docker on their devices. Which two prerequisite components should attendees install on the devices? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. Microsoft Hardware-Assisted Virtualization Detection Tool B. Kitematic C. BIOS-enabled virtualization D. VirtualBox E. Windows 10 64-bit Professional

Correct Answer: CE C: Make sure your Windows system supports Hardware Virtualization Technology and that virtualization is enabled. Ensure that hardware virtualization support is turned on in the BIOS settings. E: To run Docker, your machine must have a 64-bit operating system running Windows 7 or higher. Reference: https://docs.docker.com/toolbox/toolbox_install_windows/ https://blogs.technet.microsoft.com/canitpro/2015/09/08/step-by-step-enabling-hyper-v-for-use-on-windows-10/

DRAG DROP - You are analyzing a raw dataset that requires cleaning. You must perform transformations and manipulations by using Azure Machine Learning Studio. You need to identify the correct modules to perform the transformations. Which modules should you choose? To answer, drag the appropriate modules to the correct scenarios. Each module may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: (Choose answer for each scenario) Methods: Clean Missing Data, SMOTE, Convert to Indicator Values, Remove Duplicate Rows, Threshold Filter Scenario: - Replace missing values by removing rows and columns - Increase the number of low-incidence examples in the dataset - Convert a categorical feature into a binary indicator - Remove potential duplicates from a dataset

Correct Answer: Clean Missing Data, SMOTE, Convert to Indicator Values, Remove Duplicate Rows Box 1: Clean Missing Data - Box 2: SMOTE - Use the SMOTE module in Azure Machine Learning Studio to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases. Box 3: Convert to Indicator Values - Use the Convert to Indicator Values module in Azure Machine Learning Studio. The purpose of this module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model. Box 4: Remove Duplicate Rows - Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/convert-to-indicator-values

HOTSPOT - You are performing sentiment analysis using a CSV file that includes 12,000 customer reviews written in a short sentence format. You add the CSV file to Azure Machine Learning Studio and configure it as the starting point dataset of an experiment. You add the Extract N-Gram Features from Text module to the experiment to extract key phrases from the customer review column in the dataset. You must create a new n-gram dictionary from the customer review text and set the maximum n-gram size to trigrams. What should you select? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Extract N-Gram Features from Text Text Column: Column type: String Feature Vocabulary Mode: Create; Read-Only; Update; Merge N-Grams size: 3, 4, 4K, 12K 0 Weighting function: >> statements? << Minimum word length: 3 Maximum word length: 25 Minimum n-gram document frequency: 5 Maximum n-gram document ratio: 1

Correct Answer: Create; 3 Vocabulary mode: Create - For Vocabulary mode, select Create to indicate that you are creating a new list of n-gram features. N-Grams size: 3 - For N-Grams size, type a number that indicates the maximum size of the n-grams to extract and store. For example, if you type 3, unigrams, bigrams, and trigrams will be created. Weighting function: Leave blank - The option, Weighting function, is required only if you merge or update vocabularies. It specifies how terms in the two vocabularies and their scores should be weighted against each other. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/extract-n-gram-features-from-text
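For comparison outside Studio, scikit-learn expresses the same "new vocabulary, up to trigrams" configuration with ngram_range=(1, 3); a minimal sketch with made-up review text:

from sklearn.feature_extraction.text import CountVectorizer

reviews = ['great product fast delivery', 'poor quality would not buy again']
# Building a new vocabulary of unigrams, bigrams and trigrams mirrors
# Vocabulary mode = Create with N-Grams size = 3
vectorizer = CountVectorizer(ngram_range=(1, 3))
features = vectorizer.fit_transform(reviews)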

You use Azure Machine Learning Studio to build a machine learning experiment. You need to divide data into two distinct datasets. Which module should you use? A. Split Data B. Load Trained Model C. Assign Data to Clusters D. Group Data into Bins

Correct Answer: D Alternative Answer: A YES (https://docs.microsoft.com/en-us/azure/machine-learning/component-reference/split-data) The Group Data into Bins module supports multiple options for binning data. You can customize how the bin edges are set and how values are apportioned into the bins. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/group-data-into-bins

You are solving a classification task. The dataset is imbalanced. You need to select an Azure Machine Learning Studio module to improve the classification accuracy. Which module should you use? A. Permutation Feature Importance B. Filter Based Feature Selection C. Fisher Linear Discriminant Analysis D. Synthetic Minority Oversampling Technique (SMOTE)

Correct Answer: D Use the SMOTE module in Azure Machine Learning Studio (classic) to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases. You connect the SMOTE module to a dataset that is imbalanced. There are many reasons why a dataset might be imbalanced: the category you are targeting might be very rare in the population, or the data might simply be difficult to collect. Typically, you use SMOTE when the class you want to analyze is underrepresented. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote
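Outside Studio, the same technique is available in the imbalanced-learn package; a minimal sketch on a synthetic 9:1 imbalanced dataset, assuming imbalanced-learn is installed:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
# The minority class is synthetically upsampled rather than duplicated
print(Counter(y), Counter(y_res))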

You are a lead data scientist for a project that tracks the health and migration of birds. You create a multi-class image classification deep learning model that uses a set of labeled bird photographs collected by experts. You have 100,000 photographs of birds. All photographs use the JPG format and are stored in an Azure blob container in an Azure subscription. You need to access the bird photograph files in the Azure blob container from the Azure Machine Learning service workspace that will be used for deep learning model training. You must minimize data movement. What should you do? A. Create an Azure Data Lake store and move the bird photographs to the store. B. Create an Azure Cosmos DB database and attach the Azure Blob containing bird photographs storage to the database. C. Create and register a dataset by using TabularDataset class that references the Azure blob storage containing bird photographs. D. Register the Azure blob storage containing the bird photographs as a datastore in Azure Machine Learning service. E. Copy the bird photographs to the blob datastore that was created with your Azure Machine Learning service workspace.

Correct Answer: D We recommend creating a datastore for an Azure Blob container. When you create a workspace, an Azure blob container and an Azure file share are automatically registered to the workspace. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data

HOTSPOT - A coworker registers a datastore in a Machine Learning services workspace by using the following code:
Datastore.register_azure_blob_container(workspace=ws,
                                        datastore_name='demo_datastore',
                                        container_name='demo_datacontainer',
                                        account_name='demo_account',
                                        account_key='0A0A0A-0A0A00A-0A00A0A0A0A',
                                        create_if_not_exists=True)
You need to write code to access the datastore from a notebook. How should you complete the code segment? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area:
import azureml.core
from azureml.core import Workspace, Datastore
ws = Workspace.from_config()
datastore = >1< .get( >2<, '>3<')
>1<: Workspace, Datastore, Experiment, Run >2<: ws, run, experiment, log >3<: demo_datastore, demo_datacontainer, demo_account, Datastore

Correct Answer: Datastore, ws, demo_datastore Box 1: Datastore - To get a specific datastore registered in the current workspace, use the get() static method on the Datastore class:
# Get a named datastore from the current workspace
datastore = Datastore.get(ws, datastore_name='your datastore name')
Box 2: ws - Box 3: demo_datastore - Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data

You plan to use a Data Science Virtual Machine (DSVM) with the open source deep learning frameworks Caffe2 and PyTorch. You need to select a pre-configured DSVM to support the frameworks. What should you create? A. Data Science Virtual Machine for Windows 2012 B. Data Science Virtual Machine for Linux (CentOS) C. Geo AI Data Science Virtual Machine with ArcGIS D. Data Science Virtual Machine for Windows 2016 E. Data Science Virtual Machine for Linux (Ubuntu)

Correct Answer: E Caffe2 and PyTorch are supported by the Data Science Virtual Machine for Linux. Microsoft offers Linux editions of the DSVM on Ubuntu 16.04 LTS and CentOS 7.4. Only the DSVM on Ubuntu is preconfigured for Caffe2 and PyTorch. Incorrect Answers: D: Caffe2 and PyTorch are only supported in the Data Science Virtual Machine for Linux. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/tools-included

DRAG DROP - You are creating an experiment by using Azure Machine Learning Studio. You must divide the data into four subsets for evaluation. There is a high degree of missing values in the data. You must prepare the data for analysis. You need to select appropriate methods for producing the experiment. Which three modules should you run in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select. Select and Place: - Build Counting Transform - Missing Values Scrubber - Feature Hashing - Clean Missing Data - Replace Discrete Values - Import Data - Latent Dirichlet Transformation - Partition and Sample

Correct Answer: Import Data; Clean Missing Data; Partition and Sample Use the Clean Missing Data module in Azure Machine Learning Studio to remove, replace, or infer missing values. Incorrect Answers: ✑ Latent Dirichlet Transformation: Use the Latent Dirichlet Allocation module in Azure Machine Learning Studio to group otherwise unclassified text into a number of categories. Latent Dirichlet Allocation (LDA) is often used in natural language processing (NLP) to find texts that are similar. Another common term is topic modeling. ✑ Build Counting Transform: Use the Build Counting Transform module in Azure Machine Learning Studio to analyze training data. From this data, the module builds a count table as well as a set of count-based features that can be used in a predictive model. ✑ Missing Values Scrubber: The Missing Values Scrubber module is deprecated. ✑ Feature Hashing: Feature hashing is used for linguistics, and works by converting unique tokens into integers. ✑ Replace Discrete Values: The Replace Discrete Values module in Azure Machine Learning Studio is used to generate a probability score that can be used to represent a discrete value. This score can be useful for understanding the information value of the discrete values. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data

HOTSPOT - You are preparing to use the Azure ML SDK to run an experiment and need to create compute. You run the following code:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

ws = Workspace.from_config()
cluster_name = 'aml-cluster'
try:
    training_compute = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           vm_priority='lowpriority',
                                                           max_nodes=4)
    training_compute = ComputeTarget.create(ws, cluster_name, compute_config)
training_compute.wait_for_completion(show_output=True)
For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES or NO - If a compute cluster named aml-cluster already exists in the workspace, it will be deleted and replaced - The wait_for_completion() method will not return until the aml-cluster compute has four active nodes - If the code creates a new aml-cluster compute target, it may be preempted due to capacity constraints - The aml-cluster compute target is deleted from the workspace after the training experiment completes

Correct Answer: No, Yes, Yes, No
Box 1: No - If a compute cluster with the given name already exists, it is retrieved and used rather than replaced.
Box 2: Yes - The wait_for_completion method waits for the current provisioning operation to finish on the cluster.
Box 3: Yes - Low-priority VMs use Azure's excess capacity and are therefore cheaper, but they risk your run being preempted.
Box 4: No - You need to call training_compute.delete() to deprovision and delete the AmlCompute target.
Reference: https://notebooks.azure.com/azureml/projects/azureml-getting-started/html/how-to-use-azureml/training/train-on-amlcompute/train-on-amlcompute.ipynb https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.computetarget
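For completeness, a minimal sketch of deprovisioning, assuming training_compute references the AmlCompute target from the question code:

# Deprovision and delete the AmlCompute target; this does not happen
# automatically when a training experiment completes.
training_compute.delete()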

HOTSPOT - You have an Azure Machine Learning workspace named workspace1 that is accessible from a public endpoint. The workspace contains an Azure Blob storage datastore named store1 that represents a blob container in an Azure storage account named account1. You configure workspace1 and account1 to be accessible by using private endpoints in the same virtual network. You must be able to access the contents of store1 by using the Azure Machine Learning SDK for Python. You must be able to preview the contents of store1 by using Azure Machine Learning studio. You need to configure store1. What should you do? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Both tasks use the same list of options. Tasks:
- Access the contents of store1 by using the Azure Machine Learning SDK for Python
- Preview the contents of store1 by using Azure Machine Learning studio
Options:
- Set store1 as the default datastore
- Disable data validation for store1
- Update authentication for store1
- Regenerate the keys of account1

Correct Answer: Regenerate the keys of account1; Update authentication for store1
Box 1: Regenerate the keys of account1. Azure Blob storage supports authentication through an account key or a SAS token. To authenticate your access to the underlying storage service, you can provide either your account key, shared access signature (SAS) tokens, or a service principal.
Box 2: Update the authentication for store1. For Azure Machine Learning studio users, several features rely on the ability to read data from a dataset, such as dataset previews, profiles, and automated machine learning. For these features to work with storage behind virtual networks, use a workspace managed identity in the studio to allow Azure Machine Learning to access the storage account from outside the virtual network. Note: Some of the studio's features are disabled by default in a virtual network. To re-enable them, you must enable managed identity for the storage accounts you intend to use in the studio. The following operations are disabled by default in a virtual network: ✑ Preview data in the studio. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data
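A minimal sketch of how this can be expressed in the SDK, assuming the azureml-core Datastore API; the container name and account key are placeholders, and grant_workspace_access instructs Azure Machine Learning to use the workspace managed identity so studio features such as data preview work behind the virtual network:

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()
store1 = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='store1',
    container_name='<container-name>',   # placeholder
    account_name='account1',
    account_key='<regenerated-key>',     # placeholder for the regenerated key of account1
    grant_workspace_access=True,         # let the workspace managed identity read the storage
    overwrite=True)                      # update the existing store1 registration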

HOTSPOT - You are retrieving data from a large datastore by using Azure Machine Learning Studio. You must create a subset of the data for testing purposes using a random sampling seed based on the system clock. You add the Partition and Sample module to your experiment. You need to select the properties for the module. Which values should you select? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Partition and Sample
- Partition or sample mode: Assign to Folds | Pick a Fold | Sampling | Head
- Rate of sampling: 0.2
- Random seed for sampling: 0 | 1 | time.clock() | utcNow()
- Stratified split for sampling: True | False

Correct Answer: Sampling, 0, False
Box 1: Sampling - Create a sample of data. This option supports simple random sampling or stratified random sampling, and is useful if you want to create a smaller representative sample dataset for testing. 1. Add the Partition and Sample module to your experiment in Studio and connect the dataset. 2. Partition or sample mode: set this to Sampling. 3. Rate of sampling: the fraction of rows to sample.
Box 2: 0 - Random seed for sampling: optionally, type an integer to use as a seed value. This option is important if you want the rows to be divided the same way every time. The default value is 0, meaning that a starting seed is generated based on the system clock. This can lead to slightly different results each time you run the experiment.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/partition-and-sample

HOTSPOT - You are performing a classification task in Azure Machine Learning Studio. You must prepare balanced testing and training samples based on a provided data set. You need to split the data with a 0.75:0.25 ratio. Which value should you use for each parameter? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Splitting mode: Split rows, Recommender Split, Regular Expression Split, Relative Expression Split Fraction of rows in the first output dataset: 0.75, 0.25, 0.5, 1 Randomized split: True, False Stratified split: True, False

Correct Answer: Split rows, 0.75, True, False
Box 1: Split rows - Use the Split Rows option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split; by default, the data is divided 50-50. You can also randomize the selection of rows in each group and use stratified sampling. In stratified sampling, you must select a single column of data for which you want values to be apportioned equally among the two result datasets.
Box 2: 0.75 - If you specify a number as a percentage, or if you use a string that contains the "%" character, the value is interpreted as a percentage. All percentage values must be within the range (0, 100), not including the values 0 and 100.
Box 3: True - Randomize the split to ensure the samples are balanced.
Box 4: False - If you use the option for a stratified split, the output datasets can be further divided by subgroups by selecting a strata column; that is not required here.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data
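For comparison, the same 0.75:0.25 randomized, non-stratified split expressed with scikit-learn rather than the Studio module (the data array here is hypothetical):

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100)  # hypothetical dataset
# 75% of rows go to the training sample, 25% to the test sample, selected at random.
train, test = train_test_split(data, train_size=0.75, shuffle=True, random_state=42)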

HOTSPOT - You create an Azure Machine Learning workspace and set up a development environment. You plan to train a deep neural network (DNN) by using the Tensorflow framework and by using estimators to submit training scripts. You must optimize computation speed for training runs. You need to choose the appropriate estimator to use as well as the appropriate training compute target configuration. Which values should you use? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Estimator: Estimator, SKLearn, PyTorch, Tensorflow, Chainer Training compute:
- 12 vCPU, 48 GB memory, 96 GB SSD
- 12 vCPU, 112 GB memory, 680 GB SSD, 2 GPU, 24 GB GPU memory
- 16 vCPU, 128 GB memory, 160 GB HDD, 80 GB NVME disk (4000 MBps)
- 44 vCPU, 352 GB memory, 3.4 GHz CPU frequency all cores

Correct Answer: Tensorflow; 12 vCPU, 112 GB memory, 680 GB SSD, 2 GPU, 24 GB GPU memory
Box 1: Tensorflow - TensorFlow represents an estimator for training in TensorFlow experiments.
Box 2: 12 vCPU, 112 GB memory, 680 GB SSD, 2 GPU, 24 GB GPU memory - Use GPUs for the deep neural network.
Reference: https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn
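A minimal sketch of submitting a training script with the TensorFlow estimator from azureml.train.dnn, assuming a GPU-backed compute target named 'gpu-cluster' and a training script at scripts/train.py (both names are placeholders):

from azureml.core import Workspace, Experiment
from azureml.train.dnn import TensorFlow

ws = Workspace.from_config()
# The TensorFlow estimator pre-configures the run environment for TensorFlow training.
estimator = TensorFlow(source_directory='./scripts',
                       entry_script='train.py',
                       compute_target='gpu-cluster',  # placeholder GPU compute target
                       use_gpu=True)                  # train on GPU nodes for speed
run = Experiment(ws, 'dnn-training').submit(estimator)
run.wait_for_completion(show_output=True)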

DRAG DROP - You configure a Deep Learning Virtual Machine for Windows. You need to recommend tools and frameworks to perform the following: ✑ Build deep neural network (DNN) models ✑ Perform interactive data exploration and visualization Which tools and frameworks should you recommend? To answer, drag the appropriate tools to the correct tasks. Each tool may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Tools: Vowpal Wabbit, PowerBI Desktop, ADF, Microsoft Cognitive Toolkit Answer Area: Task: Build DNN models: __________ Enable interactive data exploration and visualization: _________

Correct Answer: Vowpal Wabbit, PowerBI Desktop Alternative Answer: Microsoft Cognitive Toolkit?, PowerBI Desktop
Box 1: Vowpal Wabbit - Use the Train Vowpal Wabbit Version 8 module in Azure Machine Learning Studio (classic) to create a machine learning model by using Vowpal Wabbit.
Box 2: PowerBI Desktop - Power BI Desktop is a powerful visual data exploration and interactive reporting tool. BI is a name given to a modern approach to business decision making in which users are empowered to find, explore, and share insights from data across the enterprise.
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/train-vowpal-wabbit-version-8-model https://docs.microsoft.com/en-us/azure/architecture/data-guide/scenarios/interactive-data-exploration

HOTSPOT - You create an Azure Machine Learning compute target named ComputeOne by using the STANDARD_D1 virtual machine image. ComputeOne is currently idle and has zero active nodes. You define a Python variable named ws that references the Azure Machine Learning workspace. You run the following Python code:

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

the_cluster_name = "ComputeOne"
try:
    the_cluster = ComputeTarget(workspace=ws, name=the_cluster_name)
    print('Step1')
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_v2', max_nodes=4)
    the_cluster = ComputeTarget.create(ws, the_cluster_name, config)
    print('Step2')

For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: YES OR NO
- A new machine learning compute resource is created with a virtual machine size of STANDARD_DS12_v2 and a maximum of four nodes
- Any experiments configured to use the_cluster will run on ComputeOne
- The text STEP1 will be printed to the screen

Correct Answer: Yes, Yes, No Alternative Answer: No, Yes, Yes
Box 1: Yes - ComputeTargetException class: an exception related to failures when creating, interacting with, or configuring a compute target. This exception is commonly raised for failures attaching a compute target, missing headers, and unsupported configuration values. Create(workspace, name, provisioning_configuration) provisions a Compute object by specifying a compute type and related configuration. This method creates a new compute target rather than attaching an existing one.
Box 2: Yes -
Box 3: No - The line before print('Step1') will fail.
Note: Because ComputeOne already exists, the alternative answer argues that the ComputeTarget constructor succeeds, Step1 is printed, and no new compute is created.
Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.computetarget

HOTSPOT - You are evaluating a Python NumPy array that contains six data points defined as follows: data = [10, 20, 30, 40, 50, 60] You must generate the following output by using the k-fold algorithm implementation in the Python Scikit-learn machine learning library: train: [10 40 50 60], test: [20 30] train: [20 30 40 60], test: [10 50] train: [10 20 30 50], test: [40 60] You need to implement a cross-validation to generate the output. How should you complete the code segment? To answer, select the appropriate code segment in the dialog box in the answer area. NOTE: Each correct selection is worth one point. Hot Area:

from numpy import array
from sklearn.model_selection import >>1<<
data = array([10, 20, 30, 40, 50, 60])
kfold = KFold(n_splits= >>2<< , shuffle=True, random_state=1)
for train, test in kfold.split( >>3<< ):
    print('train: %s, test: %s' % (data[train], data[test]))

>>1<<: K-Means, k-fold, CrossValidation, ModelSelection
>>2<<: 1, 2, 3, 6
>>3<<: data; k-fold; array; train, test

Correct Answer: k-fold, 3, data
Box 1: k-fold -
Box 2: 3 - The K-Folds cross-validator provides train/test indices to split data into train/test sets. It splits the dataset into k consecutive folds (without shuffling by default). The parameter n_splits (int, default=3) is the number of folds and must be at least 2.
Box 3: data - Example:

from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)
print(kf)  # KFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# Output:
# TRAIN: [2 3] TEST: [0 1]
# TRAIN: [0 1] TEST: [2 3]

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

DRAG DROP - An organization uses Azure Machine Learning service and wants to expand their use of machine learning. You have the following compute environments. The organization does not want to create another compute environment.
Environment name: Compute type
- nb_server: Compute Instance
- aks_cluster: Azure Kubernetes Service
- mlc_cluster: Machine Learning Compute
You need to determine which compute environment to use for the following scenarios. Which compute types should you use? To answer, drag the appropriate compute environments to the correct scenarios. Each compute environment may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Environments: nb_server, aks_cluster, mlc_cluster Answer Area Scenarios (choose environments):
- Run an Azure Machine Learning Designer training pipeline
- Deploying a web service from the Azure Machine Learning designer

Correct Answer: nb_server, mlc_cluster Box 1: nb_server - Box 2: mlc_cluster - With Azure Machine Learning, you can train your model on a variety of resources or environments, collectively referred to as compute targets. A compute target can be a local machine or a cloud resource, such as an Azure Machine Learning Compute, Azure HDInsight or a remote virtual machine. Reference: https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets

HOTSPOT - You are creating a machine learning model in Python. The provided dataset contains several numerical columns and one text column. The text column represents a product's category. The product category will always be one of the following: ✑ Bikes ✑ Cars ✑ Vans ✑ Boats You are building a regression model using the scikit-learn Python package. You need to transform the text data to be compatible with the scikit-learn Python package. How should you complete the code segment? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area:

from sklearn import linear_model
import >>1<<
dataset = df.read_csv("data\\ProductSales.csv")
ProductCategoryMapping = {"Bikes": 1, "Cars": 2, "Boats": 3, "Vans": 4}
dataset['ProductCategoryMapping'] = dataset['ProductCategory']. >>2<<
regr = linear_model.LinearRegression()
X_train = dataset[['ProductCategoryMapping', 'ProductSize', 'ProductCost']]
y_train = dataset[['Sales']]
regr.fit(X_train, y_train)

>>1<<: pandas as df, numpy as df, scipy as df
>>2<<: map[ProductCategoryMapping], reduce[ProductCategoryMapping], transpose[ProductCategoryMapping]

Correct Answer: pandas as df, transpose[ProductCategoryMapping] Alternative Answer: pandas as df, .map(ProductCategoryMapping) (https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html https://www.w3resource.com/pandas/dataframe/dataframe-transpose.php https://www.w3resource.com/pandas/series/series-map.php)
Box 1: pandas as df - Pandas takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame, which looks very similar to a table in statistical software (think Excel or SPSS, for example).
Box 2: transpose[ProductCategoryMapping] - Reshape the data from the pandas Series to columns.
Reference: https://datascienceplus.com/linear-regression-in-python/
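For reference, a minimal sketch of the Series.map approach from the alternative answer, using a small hypothetical frame instead of the ProductSales.csv file:

import pandas as pd

# Hypothetical stand-in for the ProductSales.csv data.
dataset = pd.DataFrame({'ProductCategory': ['Bikes', 'Cars', 'Boats', 'Vans']})
ProductCategoryMapping = {"Bikes": 1, "Cars": 2, "Boats": 3, "Vans": 4}
# Series.map replaces each category string with its numeric code.
dataset['ProductCategoryMapping'] = dataset['ProductCategory'].map(ProductCategoryMapping)
print(dataset)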

HOTSPOT - The finance team asks you to train a model using data in an Azure Storage blob container named finance-data. You need to register the container as a datastore in an Azure Machine Learning workspace and ensure that an error will be raised if the container does not exist. How should you complete the code? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area:

datastore = Datastore. >>1<< (workspace=ws,
                              datastore_name='finance_datastore',
                              container_name='finance-data',
                              account_name='fintrainingdatastorage',
                              account_key='FWUY...',
                              >>2<< )

>>1<<:
- register_azure_blob_container
- register_azure_file_share
- register_azure_data_lake
- register_azure_sql_database
>>2<<:
- create_if_not_exists = True
- create_if_not_exists = False
- overwrite = True
- overwrite = False

Correct Answer: register_azure_blob_container, create_if_not_exists = False
Box 1: register_azure_blob_container - Register an Azure blob container to the datastore.
Box 2: create_if_not_exists = False - This parameter controls whether to create the blob container if it does not exist; it defaults to False, so registration raises an error when the container is missing.
Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore
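A sketch of the completed call with the two answer choices filled in; the account key is a placeholder (it is truncated in the question):

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='finance_datastore',
    container_name='finance-data',
    account_name='fintrainingdatastorage',
    account_key='<account-key>',    # placeholder for the real key
    create_if_not_exists=False)     # do not create the container; error if it is missing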

HOTSPOT - You are preparing to build a deep learning convolutional neural network model for image classification. You create a script to train the model using CUDA devices. You must submit an experiment that runs this script in the Azure Machine Learning workspace. The following compute resources are available: ✑ a Microsoft Surface device on which Microsoft Office has been installed. Corporate IT policies prevent the installation of additional software ✑ a Compute Instance named ds-workstation in the workspace with 2 CPUs and 8 GB of memory ✑ an Azure Machine Learning compute target named cpu-cluster with eight CPU-based nodes ✑ an Azure Machine Learning compute target named gpu-cluster with four CPU and GPU-based nodes You need to specify the compute resources to be used for running the code to submit the experiment, and for running the script in order to minimize model training time. Which resources should the data scientist use? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Run Code to submit the experiment: - the Microsoft Surface device - the ds-workstation computation instance - the cpu-cluster compute target - the gpu-cluster compute target Run the training script: - the ds-workstation compute instance - the cpu-cluster compute target - the gpu-cluster compute target - the Microsoft Surface device

Correct Answer: the ds-workstation compute instance, the gpu-cluster compute target
Box 1: the ds-workstation compute instance - A workstation notebook instance is sufficient for running the code that submits the experiment.
Box 2: the gpu-cluster compute target - Just as GPUs revolutionized deep learning through unprecedented training and inferencing performance, RAPIDS enables traditional machine learning practitioners to unlock game-changing performance with GPUs. With RAPIDS on Azure Machine Learning service, users can accelerate the entire machine learning pipeline, including data processing, training, and inferencing, with GPUs from the NC_v3, NC_v2, ND, or ND_v2 families. Users can unlock performance gains of more than 20x (with 4 GPUs), slashing training times from hours to minutes and dramatically reducing time-to-insight.
Reference: https://azure.microsoft.com/sv-se/blog/azure-machine-learning-service-now-supports-nvidia-s-rapids/
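A minimal sketch of this split of responsibilities, assuming the azureml-core ScriptRunConfig API; the submission code runs on the ds-workstation compute instance while the training script executes on gpu-cluster (the script path and experiment name are placeholders):

from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()
# Submitted from the ds-workstation compute instance; training happens on the GPU cluster.
src = ScriptRunConfig(source_directory='./scripts',
                      script='train.py',             # placeholder training script
                      compute_target='gpu-cluster')  # GPU nodes minimize training time
run = Experiment(ws, 'cnn-image-classification').submit(src)
run.wait_for_completion(show_output=True)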

