DP-100 Microsoft Exam
You are developing a hands-on workshop to introduce Docker for Windows to attendees. You need to ensure that workshop attendees can install Docker on their devices. Which two prerequisite components should attendees install on the devices?
- BIOS-enabled virtualization
- Windows 10 64-bit Professional
How can you create a workspace?
- In the Microsoft Azure portal
- Use the Azure Machine Learning Python SDK
- Use the Azure Command Line Interface (CLI) with the Azure Machine Learning CLI extension
- Create an Azure Resource Manager template
You are building an intelligent solution using machine learning models. The environment must support the following requirements:
- Data scientists must build notebooks in a cloud environment.
- Data scientists must use automatic feature engineering and model building in machine learning pipelines.
- Notebooks must be deployed to retrain using Spark instances with dynamic worker allocation.
- Notebooks must be exportable to be version controlled locally.
You need to create the environment. Which four actions should you perform in sequence?
1. Create an Azure HDInsight cluster that includes the Apache Spark MLlib library. 2. Install Microsoft Machine Learning for Apache Spark (MMLSpark). 3. Create and execute Zeppelin notebooks on the cluster. 4. When the cluster is ready, export the Zeppelin notebooks to a local environment.
You plan to provision an Azure Machine Learning Basic edition workspace for a data science project. You need to identify the tasks you will be able to perform in the workspace. Which three tasks will you be able to perform?
1. Create a Compute Instance and use it to run code in Jupyter notebooks. 2. Create an Azure Kubernetes Service (AKS) inference cluster. 3. Create a tabular dataset that supports versioning.
Which of the following descriptions accurately describes Azure Machine Learning?
A cloud-based platform for operating machine learning solutions at scale.
F1-Score
The harmonic mean of precision and recall; a combined metric that takes both into account.
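As a formula:
    F1 = 2 * (precision * recall) / (precision + recall)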
You must store data in Azure Blob Storage to support Azure Machine Learning. You need to transfer the data into Azure Blob Storage. What are possible ways to achieve the goal?
- AzCopy
- Python script
- Azure Storage Explorer
- SSIS
You are developing a data science workspace that uses an Azure Machine Learning service. You need to select a compute target to deploy the workspace.
Azure Container Service
Compute Clusters
Scalable clusters of virtual machines for on-demand processing of experiment code.
You are solving a classification task. The dataset is imbalanced. You need to select an Azure Machine Learning Studio module to improve the classification accuracy. Which module should you use?
Synthetic Minority Oversampling Technique (SMOTE)
inferencing (machine learning definition)
the use of a trained model to predict labels for new data on which the model has not been trained
batch inferencing
used to apply a predictive model to multiple cases asynchronously - usually writing the results to a file or database.
You plan to use scikit-learn to train a model that predicts credit default risk. The model must predict a value of 0 for loan applications that should be automatically approved, and 1 for applications where there is a risk of default that requires human consideration. What kind of model is required?
A binary classification model
A datastore contains a CSV file of structured data that you want to use as a Pandas dataframe. Which kind of object should you create to make it easy to do this?
A tabular dataset.
Azure Machine Learning workspace
A workspace defines the boundary for a set of related machine learning assets. You can use workspaces to group machine learning assets based on projects, deployment environments (for example, test and production), teams, or some other organizing principle.
You are creating a convolutional neural network. You want to reduce the size of the feature maps that are generated by a convolutional layer. What should you do?
Add a pooling layer after the convolutional layer
You want to use a script-based experiment to train a PyTorch model, setting the batch size and learning rate hyperparameters to specified values each time the experiment runs. What should you do?
Add arguments for batch size and learning rate to the script, and set them in the arguments property of the ScriptRunConfig
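A minimal sketch of this pattern (folder, script, and compute names are hypothetical):

    from azureml.core import Experiment, ScriptRunConfig, Workspace

    ws = Workspace.from_config()

    # hyperparameter values are passed as command-line arguments,
    # which train.py reads with argparse
    src = ScriptRunConfig(source_directory='training',
                          script='train.py',
                          arguments=['--batch-size', 32, '--learning-rate', 0.001],
                          compute_target='cpu-cluster')

    run = Experiment(ws, 'pytorch-training').submit(src)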
You've trained a model using the Python SDK for Azure Machine Learning. You want to deploy the model as a containerized real-time service with high scalability and security. What kind of compute should you create to host the service?
An Azure Kubernetes Service (AKS) inferencing cluster
You use Azure Machine Learning designer to create a training pipeline for a clustering model. Now you want to use the model in an inference pipeline. Which module should you use to infer cluster predictions from the model?
Assign Data to Clusters
Types of compute
- Local compute
- Compute Instances
- Compute Clusters
- Inference Clusters
- Attached compute
You are developing deep learning models to analyze semi-structured, unstructured, and structured data types. You have the following data available for model building:
- Video recordings of sporting events
- Transcripts of radio commentary about events
- Logs from related social media feeds captured during sporting events
You need to select an environment for creating the model. Which environment should you use?
Azure Cognitive Services
You plan to build a team data science environment. Data for training models in machine learning pipelines will be over 20 GB in size. You have the following requirements:
- Models must be built using Caffe2 or Chainer frameworks.
- Data scientists must be able to use a data science environment to build the machine learning pipelines and train models on their personal devices in both connected and disconnected network environments.
- Personal devices must support updating machine learning pipelines when connected to a network.
You need to select a data science environment. Which environment should you use?
Azure Machine Learning Service
Azure data source types
- Azure Storage (blob and file containers)
- Azure Data Lake stores
- Azure SQL Database
- Azure Databricks file system (DBFS)
Your team is building a data engineering and data science development environment. The environment must support the following requirements:
- Support Python and Scala.
- Compose data storage, movement, and processing services into automated data pipelines.
- The same tool should be used for the orchestration of both data engineering and data science.
- Support workload isolation and interactive workloads.
- Enable scaling across a cluster of machines.
You need to create the environment. What should you do?
Build the environment in Azure Databricks and use Azure Data Factory for orchestration.
Splitting mode: Regular Expression Split
Choose this option when you want to divide your dataset by testing a single column for a value. For example, if you're analyzing sentiment, you can check for the presence of a particular product name in a text field. You can then divide the dataset into rows with the target product name and rows without the target product name.
You must create a compute target for training experiments that require a graphical processing unit (GPU). You want to be able to scale the compute so that multiple nodes are started automatically as required. Which kind of compute target should you create?
Compute Cluster
The assets in a workspace include:
- Compute targets for development, training, and deployment.
- Data for experimentation and model training.
- Notebooks containing shared code and documentation.
- Experiments, including run history with logged metrics and outputs.
- Pipelines that define orchestrated multi-step processes.
- Models that you have trained.
You are moving a large dataset from Azure Machine Learning Studio to a Weka environment. You need to format the data for the Weka environment. Which module should you use?
Convert to ARFF
You use Azure Machine Learning designer to create a training pipeline for a classification model. What must you do before deploying the model as a service?
Create an inference pipeline from the training pipeline
Curated environments
Curated environments contain collections of Python packages and are available in your workspace by default. These environments are backed by cached Docker images, which reduces run preparation cost.
You plan to use a Data Science Virtual Machine (DSVM) with the open source deep learning frameworks Caffe2 and PyTorch. You need to select a pre-configured DSVM to support the frameworks. What should you create?
Data Science Virtual Machine for Linux (Ubuntu)
Inference Clusters
Deployment targets for predictive services that use your trained models
Compute Instances
Development workstations that data scientists can use to work with data and models.
You plan to use a Deep Learning Virtual Machine (DLVM) to train deep learning models using Compute Unified Device Architecture (CUDA) computations. You need to configure the DLVM to support CUDA. What should you implement?
Graphics Processing Unit (GPU)
You are creating an experiment by using Azure Machine Learning Studio. You must divide the data into four subsets for evaluation. There is a high degree of missing values in the data. You must prepare the data for analysis. You need to select appropriate methods for producing the experiment. Which three modules should you run in sequence?
1. Import Data 2. Clean Missing Data 3. Partition and Sample
You are creating a deep neural network. You increase the Learning Rate parameter. What effect does this setting have?
Larger adjustments are made to weight values during backpropagation
Attached Compute
Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.
What makes machine learning algorithms different from traditional algorithms?
Machine learning algorithms are shaped directly by data as part of development. Traditional algorithms are based almost entirely on theory or on the opinions of the person writing the code.
Can you use the drag-and-drop user interface in Azure Machine Learning Basic edition?
No, the designer UI is included in the Enterprise edition only.
Can you start your own environment name with the AzureML prefix?
No, this prefix is reserved for curated environments.
You are creating a training pipeline for a regression model, using a dataset that has multiple numeric columns in which the values are on different scales. You want to transform the numeric columns so that the values are all on a similar scale, relative to the minimum and maximum values in each column. Which module should you add to the pipeline?
Normalize Data
Precision
Of the predictions the model made for this class, what proportion were correct?
Recall
Out of all of the instances of this class in the test dataset, how many did the model identify?
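In terms of true positives (TP), false positives (FP), and false negatives (FN): precision = TP / (TP + FP); recall = TP / (TP + FN).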
You're creating a pipeline that includes two steps. Step 1 preprocesses some data, and step 2 uses the preprocessed data to train a model. What type of object should you use to pass data from step 1 to step 2 and create a dependency between these steps?
OutputFileDatasetConfig
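A minimal sketch of the pattern (script and compute names are hypothetical; assumes an existing Workspace object ws):

    from azureml.data import OutputFileDatasetConfig
    from azureml.pipeline.core import Pipeline
    from azureml.pipeline.steps import PythonScriptStep

    # intermediate folder; referencing it in both steps creates the dependency
    prepped = OutputFileDatasetConfig('prepped_data')

    step1 = PythonScriptStep(name='preprocess', source_directory='scripts',
                             script_name='prep.py',
                             arguments=['--out-folder', prepped],
                             compute_target='cpu-cluster')
    step2 = PythonScriptStep(name='train', source_directory='scripts',
                             script_name='train.py',
                             arguments=['--in-folder', prepped.as_input()],
                             compute_target='cpu-cluster')

    pipeline = Pipeline(workspace=ws, steps=[step1, step2])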
You are creating a batch inferencing pipeline that you want to use to predict new values for a large volume of data files. You want the pipeline to run the scoring script on multiple nodes and collate the results. What kind of step should you include in the pipeline?
ParallelRunStep
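A minimal sketch (the entry script, environment object batch_env, dataset batch_dataset, and compute name are hypothetical):

    from azureml.data import OutputFileDatasetConfig
    from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

    parallel_run_config = ParallelRunConfig(
        source_directory='batch_scripts',
        entry_script='batch_score.py',   # must define init() and run(mini_batch)
        mini_batch_size='5',
        error_threshold=10,
        output_action='append_row',      # collate results into parallel_run_step.txt
        environment=batch_env,
        compute_target='cpu-cluster',
        node_count=4)

    batch_step = ParallelRunStep(
        name='batch-score',
        parallel_run_config=parallel_run_config,
        inputs=[batch_dataset.as_named_input('batch_data')],
        output=OutputFileDatasetConfig('results'))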
You use Azure Machine Learning Studio to build a machine learning experiment. You need to divide data into two distinct datasets. Which module should you use?
Partition and Sample
You are retrieving data from a large datastore by using Azure Machine Learning Studio. You must create a subset of the data for testing purposes using a random sampling seed based on the system clock. You add the Partition and Sample module to your experiment. You need to select the properties for the module. Which values should you select?
Partition and sample mode: Sampling. Random seed for sampling: 0 (a seed value of 0 bases the seed on the system clock).
Enable interactive data exploration and visualisation
Power BI
Common kinds of step in an Azure Machine Learning pipeline
- PythonScriptStep: runs a specified Python script.
- DataTransferStep: uses Azure Data Factory to copy data between data stores.
- DatabricksStep: runs a notebook, script, or compiled JAR on a Databricks cluster.
- AdlaStep: runs a U-SQL job in Azure Data Lake Analytics.
- ParallelRunStep: runs a Python script as a distributed task on multiple compute nodes.
Rattle - what is this?
An R analytical tool that gets you started with data analytics and machine learning.
You are a lead data scientist for a project that tracks the health and migration of birds. You create a multi-class image classification deep learning model that uses a set of labeled bird photographs collected by experts. You have 100,000 photographs of birds. All photographs use the JPG format and are stored in an Azure blob container in an Azure subscription. You need to access the bird photograph files in the Azure blob container from the Azure Machine Learning service workspace that will be used for deep learning model training. You must minimize data movement. What should you do?
Register the Azure blob storage containing the bird photographs as a datastore in Azure Machine Learning service.
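A minimal sketch of registering an existing blob container as a datastore (account, container, and key values are hypothetical placeholders):

    from azureml.core import Datastore, Workspace

    ws = Workspace.from_config()
    blob_ds = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name='bird_photos',
        container_name='birds',
        account_name='birdstorageaccount',
        account_key='<storage-account-key>')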
You have run an experiment to train a model. You want the model to be stored in the workspace, and available to other experiments and published services. What should you do?
Register the model in the workspace.
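A minimal sketch (model name and path are hypothetical):

    from azureml.core import Model, Workspace

    ws = Workspace.from_config()
    model = Model.register(workspace=ws,
                           model_name='credit-model',
                           model_path='outputs/model.pkl')
    # or, from a completed run:
    # run.register_model(model_name='credit-model', model_path='outputs/model.pkl')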
You are working with a time series dataset in Azure Machine Learning Studio. You need to split your dataset into training and testing subsets by using the Split Data module. Which splitting mode should you use?
Relative Expression Split
You are creating a machine learning model. You have a dataset that contains null rows. You need to use the Clean Missing Data module in Azure Machine Learning Studio to identify and resolve the null and missing data in the dataset. Which parameter should you use?
Remove entire row
You've published a pipeline that you want to run every week. You plan to use the Schedule.create method to create the schedule. What kind of object must you create first to configure how frequently the pipeline runs?
ScheduleRecurrence
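A minimal sketch (assumes an existing Workspace ws and an already-published pipeline):

    from azureml.pipeline.core import Schedule, ScheduleRecurrence

    recurrence = ScheduleRecurrence(frequency='Week', interval=1)
    schedule = Schedule.create(ws, name='weekly-run',
                               pipeline_id=published_pipeline.id,
                               experiment_name='scheduled-training',
                               recurrence=recurrence)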
You are using an Azure Machine Learning designer pipeline to train and test a K-Means clustering model. You want your model to assign items to one of three clusters. Which configuration property of the K-Means Clustering module should you set to accomplish this?
Set Number of Centroids to 3
You are using Azure Machine Learning designer to create a training pipeline for a binary classification model. You have added a dataset containing features and labels, a Two-Class Decision Forest module, and a Train Model module. You plan to use Score Model and Evaluate Model modules to test the trained model with a subset of the dataset that was not used for training. Which additional kind of module should you add?
Split Data
Why do we split our data into training and validation sets?
Splitting data into two sets enables you to compare the labels that the model predicts with the actual known labels in the original dataset.
You plan to create a speech recognition deep learning model. The model must support the latest version of Python. You need to recommend a deep learning framework for speech recognition to include in the Data Science Virtual Machine (DSVM). What should you recommend?
TensorFlow
After registering a dataset, you can retrieve it by using any of the following techniques:
The datasets dictionary attribute of a Workspace object. The get_by_name or get_by_id method of the Dataset class.
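For example (dataset name is hypothetical):

    from azureml.core import Dataset, Workspace

    ws = Workspace.from_config()
    ds1 = ws.datasets['sales-data']                    # datasets dictionary attribute
    ds2 = Dataset.get_by_name(ws, name='sales-data')   # Dataset class method
    ds3 = Dataset.get_by_id(ws, id=ds2.id)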
You are training a deep neural network. You configure the training process to use 50 epochs. What effect does this configuration have?
The entire training dataset is passed through the network 50 times
You train a binary classification model using scikit-learn. When you evaluate it with test data, you determine that the model achieves an overall Recall metric of 0.81. What does this metric indicate?
The model correctly identified 81% of positive cases as positive
You use an Azure Machine Learning designer pipeline to train and test a binary classification model. You review the model's performance metrics in an Evaluate Model module, and note that it has an AUC score of 0.3. What can you conclude about the model?
The model performs worse than random guessing.
You are creating a deep neural network to train a classification model that predicts to which of three classes an observation belongs based on 10 numeric features. Which of the following statements is true of the network architecture?
The output layer should contain three nodes
Entropy MDL binning mode
This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column. It then returns the bin number associated with each row of your data in a column named <colname>quantized.
You are developing a deep learning model by using TensorFlow. You plan to run the model training workload on an Azure Machine Learning Compute Instance. You must use CUDA-based model training. You need to provision the Compute Instance. Which two virtual machine sizes can you use?
Virtual machine sizes that include a GPU (for example, the NC series).
K-Means clustering is an example of which kind of machine learning?
Unsupervised machine learning
Convert to Indicator Values module
Use the Convert to Indicator Values module in Azure Machine Learning Studio. The purpose of this module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model.
You are a data scientist using Azure Machine Learning Studio. You need to normalize values to produce an output column grouped into bins, in order to predict a target column. Which binning mode should you use?
Use the Entropy MDL binning mode, which requires a target column.
SMOTE module
Use the SMOTE module in Azure Machine Learning Studio to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.
You are using the Azure Machine Learning Python SDK to write code for an experiment. You must log metrics from each run of the experiment, and be able to retrieve them easily from each run. What should you do?
Use the log* methods of the Run class to record named metrics.
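A minimal sketch inside a training script:

    from azureml.core import Run

    run = Run.get_context()            # handle to the current experiment run
    run.log('accuracy', 0.91)          # a single named metric
    run.log_list('loss_history', [0.6, 0.4, 0.3])
    run.complete()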
Splitting mode: Split Rows
Use this option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split. By default, the data is divided 50/50. You can also randomize the selection of rows in each group, and use stratified sampling. In stratified sampling, you must select a single column of data for which you want values to be apportioned equally among the two result datasets.
Splitting mode: Relative Expression Split
Use this option whenever you want to apply a condition to a number column. The number can be a date/time field, a column that contains age or dollar amounts, or even a percentage. For example, you might want to divide your dataset based on the cost of the items, group people by age ranges, or separate data by a calendar date.
Tool to build DNN models
Vowpal Wabbit
Considerations for datastores
- When using Azure blob storage, premium level storage may provide improved I/O performance for large datasets. However, this option will increase cost and may limit replication options for data redundancy.
- When working with data files, although CSV format is very common, Parquet format generally results in better performance.
- You can access any datastore by name, but you may want to consider changing the default datastore (which is initially the built-in workspaceblobstore datastore).
MICE (Multivariate Imputation using Chained Equations)
With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.
Stratified split
With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome or to some other column (such as gender).
Can datasets be versioned?
You can create a new version of a dataset by registering it with the same name as a previously registered dataset and specifying the create_new_version property
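A minimal sketch (datastore path and dataset name are hypothetical):

    from azureml.core import Dataset, Workspace

    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()

    ds = Dataset.Tabular.from_delimited_files(path=(datastore, 'data/sales/*.csv'))
    ds = ds.register(workspace=ws, name='sales-data', create_new_version=True)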
You are analyzing a dataset containing historical data from a local taxi company. You are developing a regression model. You must predict the fare of a taxi trip. You need to select performance metrics to correctly evaluate the regression model. Which two metrics can you use?
- A Root Mean Square Error (RMSE) value that is low
- An R-Squared value close to 1
Clustering
a form of unsupervised machine learning in which observations are grouped into clusters based on similarities in their data values, or features
pipeline (Azure Machine Learning definition)
a workflow of machine learning tasks in which each task is implemented as a step.
Which compute environment should you use when deploying a web service from the Azure Machine Learning designer?
aks_cluster (Azure Kubernetes Service)
as_mount vs as_download when loading large datasets
as_download copies the dataset files to a temporary location on the compute. For large datasets that may not fit in the available space, use as_mount to stream the files directly from the source instead.
You want a script to stream data directly from a file dataset. Which mode should you use?
as_mount()
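For example, passing a mounted file dataset to a training script (dataset, folder, and script names are hypothetical):

    from azureml.core import Dataset, ScriptRunConfig, Workspace

    ws = Workspace.from_config()
    dataset = Dataset.get_by_name(ws, 'image-files')   # a registered file dataset

    src = ScriptRunConfig(source_directory='training', script='train.py',
                          arguments=['--data-folder',
                                     dataset.as_named_input('data').as_mount()])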
datastores
Datastores are abstractions for cloud data sources. They encapsulate the information required to connect to data sources. You can access datastores directly in code by using the Azure Machine Learning SDK, and use them to upload or download data.
You have a reference to a Workspace named ws. Which code retrieves the default datastore for the workspace?
default_ds = ws.get_default_datastore()
You're using the Azure Machine Learning Python SDK to run experiments. You need to create an environment from a Conda configuration (.yml) file. Which method of the Environment class should you use?
from_conda_specification
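For example (environment and file names are hypothetical):

    from azureml.core import Environment

    env = Environment.from_conda_specification(name='training-env',
                                               file_path='environment.yml')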
You're deploying a model as a real-time inferencing service. What functions must the entry script for the service include?
init() and run()
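A minimal sketch of an entry (scoring) script, assuming a registered scikit-learn model named 'my-model':

    import json
    import joblib
    import numpy as np
    from azureml.core.model import Model

    def init():
        # runs once when the service starts: load the registered model
        global model
        model = joblib.load(Model.get_model_path('my-model'))

    def run(raw_data):
        # runs for each request: score and return JSON-serializable results
        data = np.array(json.loads(raw_data)['data'])
        return model.predict(data).tolist()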
Probabilistic PCA
it approximates the covariance for the full dataset.
Which compute environment should you use when running an Azure Machine Learning designer training pipeline?
mlc_cluster (Machine Learning Compute)
You are using scikit-learn to train a K-Means clustering model that groups observations into three clusters. How should you create the KMeans object to accomplish this goal?
model = KMeans(n_clusters=3)
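For context, a runnable end-to-end example with toy data:

    from sklearn.cluster import KMeans
    import numpy as np

    X = np.array([[1, 2], [1, 4], [8, 8], [9, 9], [0, 1], [8, 9]])
    model = KMeans(n_clusters=3)
    model.fit(X)
    print(model.predict([[2, 2], [8, 8]]))   # cluster assignments for new observations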
You have trained a classification model using the scikit-learn LogisticRegression class. You want to use the model to return labels for new data in the array x_new. Which code should you use?
model.predict(x_new)
Scikit-learn - what is this?
One of the most useful libraries for machine learning in Python. Built on NumPy, SciPy, and matplotlib, it contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
CUDA
parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units)
You have configured the step in your batch inferencing pipeline with an output_action="append_row" property. In which file should you look for the batch inferencing results?
parallel_run_step.txt
You create an Azure Machine Learning workspace. You are preparing a local Python environment on a laptop computer. You want to use the laptop to connect to the workspace and run experiments. You create the following config.json file: {"workspace_name": "ml-workspace"} You must use the Azure Machine Learning SDK to interact with data and experiments in the workspace. You need to configure the config.json file to connect to the workspace from the Python environment. Which two additional parameters must you add to the config.json file in order to connect to the workspace?
resource_group and subscription_id
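The completed file would look like this (placeholder values shown); Workspace.from_config() then reads it to connect:

    {
      "workspace_name": "ml-workspace",
      "resource_group": "<your-resource-group>",
      "subscription_id": "<your-subscription-id>"
    }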
Weka - what is this?
Visual data mining and machine learning software written in Java.
Datasets
versioned packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring.
When can grid sampling be employed?
when all hyperparameters are discrete
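A minimal sketch (hyperparameter names and values are hypothetical):

    from azureml.train.hyperdrive import GridParameterSampling, choice

    # grid sampling works only with discrete (choice) hyperparameter values
    param_sampling = GridParameterSampling({
        '--batch-size': choice(16, 32, 64),
        '--learning-rate': choice(0.001, 0.01, 0.1)
    })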