Azure Data Scientist


Accuracy

(TP+TN)/(TP+TN+FP+FN) - Out of all the predictions, how many were correct?
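- For example, a minimal check of this formula with scikit-learn, using small hypothetical label arrays:
from sklearn.metrics import accuracy_score
y_test = [1, 0, 1, 1, 0]   # hypothetical actual labels
y_hat = [1, 0, 0, 1, 0]    # hypothetical predicted labels (2 TP, 2 TN, 0 FP, 1 FN)
print(accuracy_score(y_test, y_hat))  # (2+2)/(2+2+0+1) = 0.8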

Data Frame

- ".df" in python - Pandas package structure of data

Normal distribution

- 68.26% of values within 1 SD of mean - 95.45% within 2 SD of mean - 99.73% within 3 SD of mean

Truncation Policy

- A truncation selection policy cancels the lowest performing X% of runs at each evaluation interval based on the truncation_percentage value you specify for X. EX: from azureml.train.hyperdrive import TruncationSelectionPolicy early_termination_policy = TruncationSelectionPolicy(truncation_percentage=10, evaluation_interval=1, delay_evaluation=5)

early termination policy

- Abandons runs that are unlikely to produce a better result than previously completed runs. - The policy is evaluated at an evaluation_interval you specify, based on each time the target performance metric is logged. - You can also set a delay_evaluation parameter to avoid evaluating the policy until a minimum number of iterations have been completed. - Used because: With a sufficiently large hyperparameter search space, it could take many iterations (child runs) to try every possible combination. Typically, you set a maximum number of iterations, but this could still result in a large number of runs that don't result in a better model than a combination that has already been tried. - Early termination is particularly useful for deep learning scenarios where a deep neural network (DNN) is trained iteratively over a number of epochs. The training script can report the target metric after each epoch, and if the run is significantly underperforming previous runs after the same number of intervals, it can be abandoned.

Deep Learning

- Advanced ML that tries to emulate how the human brain learns - Artificial neural networks that process inputs, numeric inputs = x - DNN = Deep Neural Network

Deploy a model

- After all of the configuration is prepared, you can deploy the model. The easiest way to do this is to call the deploy method of the Model class, like this: from azureml.core.model import Model model = ws.models['classification_model'] service = Model.deploy(workspace=ws, name = 'classifier-service', models = [model], inference_config = classifier_inference_config, deployment_config = classifier_deploy_config, deployment_target = production_cluster) service.wait_for_deployment(show_output = True) - For ACI or local services, you can omit the deployment_target parameter (or set it to None).

Retrieving model files

- After an experiment run has completed, you can use the run object's get_file_names method to list the files generated. - Standard practice is for scripts that train models to save them in the run's outputs folder. - You can also use the run object's download_file and download_files methods to download output files to the local file system.

Combining a script and environment in an InferenceConfig

- After creating the entry script and environment configuration file, you can combine them in an InferenceConfig for the service like this: from azureml.core.model import InferenceConfig classifier_inference_config = InferenceConfig(runtime= "python", source_directory = 'service_files', entry_script="score.py", conda_file="env.yml")

Reuse pipeline steps

- By default, the step output from a previous pipeline run is reused without rerunning the step provided the script, source directory, and other parameters for the step have not changed. Step reuse can reduce the time it takes to run a pipeline, but it can lead to stale results when changes to downstream data sources have not been accounted for. - To control reuse for an individual step, you can set the allow_reuse parameter in the step configuration, like this: step1 = PythonScriptStep(name = 'prepare data', source_directory = 'scripts', script_name = 'data_prep.py', compute_target = 'aml-cluster', runconfig = run_config, inputs=[raw_ds.as_named_input('raw_data')], outputs=[prepped_data], arguments = ['--folder', prepped_data], # Disable step reuse allow_reuse = False) - When you have multiple steps, you can force all of them to run regardless of individual reuse configuration by setting the regenerate_outputs parameter when submitting the pipeline experiment: pipeline_run = experiment.submit(train_pipeline, regenerate_outputs=True)

Training dataset

- Chunk of data split and used to train the model

PyPlot

- Module in Matplotlib that provides functions to visualize data
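- A minimal sketch using hypothetical grade values:
import matplotlib.pyplot as plt
grades = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74]  # hypothetical data
plt.hist(grades, bins=5)   # visualize the distribution
plt.xlabel('Grade')
plt.ylabel('Frequency')
plt.show()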

Consuming a real time inferencing service

- After deploying a real-time service, you can consume it from client applications to predict labels for new data cases. 1. Using Azure SDK - For testing, you can use the Azure Machine Learning SDK to call a web service through the run method of a WebService object that references the deployed service. Typically, you send data to the run method in JSON format with the following structure: { "data":[ [0.1,2.3,4.1,2.0], // 1st case [0.2,1.8,3.9,2.1], // 2nd case, ... ] } - The response from the run method is a JSON collection with a prediction for each case that was submitted in the data. The following code sample calls a service and displays the response: import json # An array of new data cases x_new = [[0.1,2.3,4.1,2.0], [0.2,1.8,3.9,2.1]] # Convert the array to a serializable list in a JSON document json_data = json.dumps({"data": x_new}) # Call the web service, passing the input data response = service.run(input_data = json_data) # Get the predictions predictions = json.loads(response) # Print the predicted class for each case. for i in range(len(x_new)): print(x_new[i], predictions[i]) 2. Using a REST endpoint - In production, most client applications will not include the Azure Machine Learning SDK, and will consume the service through its REST interface. You can determine the endpoint of a deployed service in Azure Machine Learning studio, or by retrieving the scoring_uri property of the Webservice object in the SDK, like this: endpoint = service.scoring_uri print(endpoint) - With the endpoint known, you can use an HTTP POST request with JSON data to call the service. The following example shows how to do this using Python: import requests import json # An array of new data cases x_new = [[0.1,2.3,4.1,2.0], [0.2,1.8,3.9,2.1]] # Convert the array to a serializable list in a JSON document json_data = json.dumps({"data": x_new}) # Set the content type in the request headers request_headers = { 'Content-Type':'application/json' } # Call the service response = requests.post(url = endpoint, data = json_data, headers = request_headers) # Get the predictions from the JSON response predictions = json.loads(response.json()) # Print the predicted class for each case. for i in range(len(x_new)): print(x_new[i], predictions[i]) 3. Authentication - In production, you will likely want to restrict access to your services by applying authentication. There are two kinds of authentication you can use: *Key: Requests are authenticated by specifying the key associated with the service. *Token: Requests are authenticated by providing a JSON Web Token (JWT). - By default, authentication is disabled for ACI services, and set to key-based authentication for AKS services (for which primary and secondary keys are automatically generated). You can optionally configure an AKS service to use token-based authentication (which is not supported for ACI services). - Assuming you have an authenticated session established with the workspace, you can retrieve the keys for a service by using the get_keys method of the WebService object associated with the service: primary_key, secondary_key = service.get_keys() - For token-based authentication, your client application needs to use service-principal authentication to verify its identity through Azure Active Directory (Azure AD) and call the get_token method of the service to retrieve a time-limited token.
- To make an authenticated call to the service's REST endpoint, you must include the key or token in the request header like this: import requests import json # An array of new data cases x_new = [[0.1,2.3,4.1,2.0], [0.2,1.8,3.9,2.1]] # Convert the array to a serializable list in a JSON document json_data = json.dumps({"data": x_new}) # Set the content type in the request headers request_headers = { "Content-Type":"application/json", "Authorization":"Bearer " + key_or_token } # Call the service response = requests.post(url = endpoint, data = json_data, headers = request_headers) # Get the predictions from the JSON response predictions = json.loads(response.json()) # Print the predicted class for each case. for i in range(len(x_new)): print(x_new[i], predictions[i])

Tree-Based Algorithm

- Algorithm that builds a decision tree to reach a prediction

Ensemble Algorithm

- Algorithm that combines outputs of multiple base algorithms to improve generalizability

Tree-based algorithms

- Algorithms that build a decision tree to reach a prediction

Ensemble algorithms

- Algorithms that combine the outputs of several base algorithms to improve generalizability

Support Vector Machine algorithms

- Algorithms that define a hyperplane that separates classes

Pipeline Steps

- An Azure Machine Learning pipeline is made of one or more steps that perform tasks. - There are many kinds of step supported by Azure Machine Learning pipelines, each with its own specialized purpose and configuration options. Common kinds of step in an Azure Machine Learning pipeline include: 1. PythonScriptStep: Runs a specified Python script. 2. DataTransferStep: Uses Azure Data Factory to copy data between data stores. 3. DatabricksStep: Runs a notebook, script, or compiled JAR on a databricks cluster. 4. AdlaStep: Runs a U-SQL job in Azure Data Lake Analytics. 5. ParallelRunStep - Runs a Python script as a distributed task on multiple compute nodes.

Attaching an unmanaged compute target with SDK

- An unmanaged compute target is one that is defined and managed outside of the Azure Machine Learning workspace; for example, an Azure virtual machine or an Azure Databricks cluster. - The code to attach an existing unmanaged compute target is similar to the code used to create a managed compute target, except that you must use the ComputeTarget.attach() method to attach the existing compute based on its target-specific configuration settings. - For example, the following code can be used to attach an existing Azure Databricks cluster: from azureml.core import Workspace from azureml.core.compute import ComputeTarget, DatabricksCompute # Load the workspace from the saved config file ws = Workspace.from_config() # Specify a name for the compute (unique within the workspace) compute_name = 'db_cluster' # Define configuration for existing Azure Databricks cluster db_workspace_name = 'db_workspace' db_resource_group = 'db_resource_group' db_access_token = '1234-abc-5678-defg-90...' db_config = DatabricksCompute.attach_configuration(resource_group=db_resource_group, workspace_name=db_workspace_name, access_token=db_access_token) # Create the compute databricks_compute = ComputeTarget.attach(ws, compute_name, db_config) databricks_compute.wait_for_completion(True)

Model Drift Monitor

- Azure Machine Learning supports data drift monitoring through the use of datasets. You can capture new feature data in a dataset and compare it to the dataset with which the model was trained. - It's common for organizations to continue to collect new data after a model has been trained. For example, a health clinic might use diagnostic measurements from previous patients to train a model that predicts diabetes likelihood, but continue to collect the same diagnostic measurements from all new patients. The clinic's data scientists could then periodically compare the growing collection of new data to the original training data, and identify any changing data trends that might affect model accuracy. - To monitor data drift using registered datasets, you need to register two datasets: 1. A baseline dataset - usually the original training data. 2. A target dataset that will be compared to the baseline based on time intervals. This dataset requires a column for each feature you want to compare, and a timestamp column so the rate of data drift can be measured. - After creating these datasets, you can define a dataset monitor to detect data drift and trigger alerts if the rate of drift exceeds a specified threshold. You can create dataset monitors using the visual interface in Azure Machine Learning studio, or by using the DataDriftDetector class in the Azure Machine Learning SDK as shown in the following example code: from azureml.datadrift import DataDriftDetector monitor = DataDriftDetector.create_from_datasets(workspace=ws, name='dataset-drift-detector', baseline_data_set=train_ds, target_data_set=new_data_ds, compute_target='aml-cluster', frequency='Week', feature_list=['age','height', 'bmi'], latency=24) - After creating the dataset monitor, you can backfill to immediately compare the baseline dataset to existing data in the target dataset, as shown in the following example, which backfills the monitor based on weekly changes in data for the previous six weeks: import datetime as dt backfill = monitor.backfill( dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())

Statistical Machine Learning Algorithms

- Based on probability: P(y), where P = probability and y = the value the model is trying to predict - P(y) = probability the label is true, 1-P(y) = probability it is false - A threshold of 0.5 is used to decide whether a predicted label is a 1 (P(y) > 0.5) or a 0 (P(y) < 0.5) - You can use the "predict_proba" method to see the probability for each case
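- A minimal sketch of applying the threshold, assuming model is an already-fitted scikit-learn classifier and X_test is an existing feature array:
proba = model.predict_proba(X_test)                 # one row per case: [P(y=0), P(y=1)]
predicted_labels = (proba[:, 1] > 0.5).astype(int)  # apply the 0.5 threshold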

Bayesian Sampling

- Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance from the previous selection. The following code example shows how to configure Bayesian sampling: from azureml.train.hyperdrive import BayesianParameterSampling, choice, uniform param_space = { '--batch_size': choice(16, 32, 64), '--learning_rate': uniform(0.5, 0.1) } param_sampling = BayesianParameterSampling(param_space) - You can only use Bayesian sampling with choice, uniform, and quniform parameter expressions, and you can't combine it with an early-termination policy.

Scheduling a pipeline to run on intervals

- Code to schedule a pipeline to run on periodic intervals using a 'Schedule' after defining a 'ScheduleRecurrence': from azureml.pipeline.core import ScheduleRecurrence, Schedule daily = ScheduleRecurrence(frequency='Day', interval=1) pipeline_schedule = Schedule.create(ws, name='Daily Training', description='trains model every day', pipeline_id=published_pipeline.id, experiment_name='Training_Pipeline', recurrence=daily)

thing.isnull().sum()

- Confirm there are no nulls

Creating an entry script

- Create the entry script (sometimes referred to as the scoring script) for the service as a Python (.py) file. It must include two functions: 1. init(): Called when the service is initialized. 2. run(raw_data): Called when new data is submitted to the service. - Typically, you use the init function to load the model from the model registry, and use the run function to generate predictions from the input data. The following example script shows this pattern: import json import joblib import numpy as np from azureml.core.model import Model # Called when the service is loaded def init(): global model # Get the path to the registered model file and load it model_path = Model.get_model_path('classification_model') model = joblib.load(model_path) # Called when a request is received def run(raw_data): # Get the input data as a numpy array data = np.array(json.loads(raw_data)['data']) # Get a prediction from the model predictions = model.predict(data) # Return the predictions as any JSON serializable format return predictions.tolist()

mlflow.create_experiment()

- Creates a new experiment and returns its ID. Runs can be launched under the experiment by passing the experiment ID to mlflow.start_run.

Validation/test dataset

- Data fed through model to verify effectiveness and training done by training dataset

Feature engineering

- Deriving new feature columns by transforming or combining existing features

thing=thing.dropna()

- Drop nulls

mlflow.end_run()

- Ends the currently active run, if any, taking an optional run status.

iloc method

- Finds rows based on ordinal position in the data frame - Ex: df_students.iloc[0:5] will pull index positions [0-4] - loc vs iloc: loc selects by index label and includes the end of the range, while iloc selects by position and excludes it. So loc would pull [0,1,2,3,4,5] but iloc would only pull [0,1,2,3,4]
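- A minimal sketch of the difference, using a hypothetical data frame with a default integer index:
import pandas as pd
df_students = pd.DataFrame({'grade': [50, 60, 70, 80, 90, 100]})
print(df_students.iloc[0:5])  # positions 0-4 (end of range excluded)
print(df_students.loc[0:5])   # index labels 0-5 (end of range included)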

fit() method

- For training model using scikit-learn Linear Regression class
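- A minimal sketch, assuming X_train, y_train and X_test are existing numpy arrays:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)  # train (fit) the model
predictions = model.predict(X_test)               # use the fitted model to predict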

Clustering

- Form of unsupervised ML - Observations are grouped into clusters based on similarities in their data values or features - Unsupervised, so it does not make use of previously known values to train the model - In a cluster model the label is the cluster to which the observation is assigned based purely on its features - Centroids = central point of a cluster - Features = data points of whatever you're clustering - Observation = an instance

Grid Sampling

- Grid sampling can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space. - For example, in the following code example, grid sampling is used to try every possible combination of discrete batch_size and learning_rate value: from azureml.train.hyperdrive import GridParameterSampling, choice param_space = { '--batch_size': choice(16, 32, 64), '--learning_rate': choice(0.01, 0.1, 1.0) } param_sampling = GridParameterSampling(param_space)

Measures of central tendency

- Help understand distribution better - Fancy way to say the stats that represent the "middle" value of the data - Goal is to find the "typical" value in dataset and common ways to find that are using mean, median and mode, min and max values - bimodal or multimodal = a tie for the most common value (mode)

Distribution

- How are all the values spread/distributed across the sample - Often visualized with a histogram - Think bell curve

Hyperparameter tuning experiment

- In Azure Machine Learning, you can tune hyperparameters by running a hyperdrive experiment. - To run a hyperdrive experiment, you need to create a training script just the way you would do for any other training experiment, except that your script must: - Include an argument for each hyperparameter you want to vary. - Log the target performance metric. This enables the hyperdrive run to evaluate the performance of the child runs it initiates, and identify the one that produces the best performing model. - For example, the following script trains a logistic regression model using a --regularization argument to set the regularization rate hyperparameter, and logs the accuracy metric with the name Accuracy: import argparse import os import joblib from azureml.core import Run import pandas as pd import numpy as np from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression # Get regularization hyperparameter parser = argparse.ArgumentParser() parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01) args = parser.parse_args() reg = args.reg_rate # Get the experiment run context run = Run.get_context() # load the training dataset data = run.input_datasets['training_data'].to_pandas_dataframe() # Separate features and labels, and split for training/validation X = data[['feature1','feature2','feature3','feature4']].values y = data['label'].values X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30) # Train a logistic regression model with the reg hyperparameter model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train) # calculate and log accuracy y_hat = model.predict(X_test) acc = np.average(y_hat == y_test) run.log('Accuracy', np.float(acc)) # Save the trained model os.makedirs('outputs', exist_ok=True) joblib.dump(value=model, filename='outputs/model.pkl') run.complete()

Checking for an existing compute target

- In many cases, you will want to check for the existence of a compute target, and only create a new one if there isn't already one with the specified name. - To accomplish this, you can catch the ComputeTargetException exception: from azureml.core.compute import ComputeTarget, AmlCompute from azureml.core.compute_target import ComputeTargetException compute_name = "aml-cluster" # Check if the compute target exists try: aml_cluster = ComputeTarget(workspace=ws, name=compute_name) print('Found existing cluster.') except ComputeTargetException: # If not, create it compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=4) aml_cluster = ComputeTarget.create(ws, compute_name, compute_config) aml_cluster.wait_for_completion(show_output=True)

Matplotlib

- Library for plotting data in python

Scoring AKA Inferencing

- Load model to predict for new data

Types of compute

- Local compute - You can specify a local compute target for most processing tasks in Azure Machine Learning. This runs the experiment on the same compute target as the code used to initiate the experiment, which may be your physical workstation or a virtual machine such as an Azure Machine Learning compute instance on which you are running a notebook. Local compute is generally a great choice during development and testing with low to moderate volumes of data. - Compute clusters - For experiment workloads with high scalability requirements, you can use Azure Machine Learning compute clusters; which are multi-node clusters of Virtual Machines that automatically scale up or down to meet demand. This is a cost-effective way to run experiments that need to handle large volumes of data or use parallel processing to distribute the workload and reduce the time it takes to run. - Attached compute - If you already use an Azure-based compute environment for data science, such as a virtual machine or an Azure Databricks cluster, you can attach it to your Azure Machine Learning workspace and use it as a compute target for certain types of workload.

mlflow.set_tracking_uri()

- Logging function - Connects to a tracking URI. You can also set the MLFLOW_TRACKING_URI environment variable to have MLflow find a URI from there. In both cases, the URI can either be a HTTP/HTTPS URI for a remote server, a database connection string, or a local path to log data to a directory. The URI defaults to mlruns.

mlflow.tracking.get_tracking_uri()

- Logging function - Returns the current tracking URI.

mlflow.log_artifact()

- Logs a local file or directory as an artifact, optionally taking an artifact_path to place it in within the run's artifact URI. Run artifacts can be organized into directories, so you can place the artifact in a directory this way.

mlflow.log_metric()

- Logs a single key-value metric. The value must always be a number. MLflow remembers the history of values for each metric. Use mlflow.log_metrics() to log multiple metrics at once.

mlflow.log_param()

- Logs a single key-value param in the currently active run. The key and value are both strings. Use mlflow.log_params() to log multiple params at once.

mlflow.log_artifacts()

- Logs all the files in a given directory as artifacts, again taking an optional artifact_path.
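- A minimal sketch combining these MLflow logging functions inside a run (the parameter, metric, and file names are hypothetical, and outputs/model.pkl is assumed to exist locally):
import mlflow
with mlflow.start_run():
    mlflow.log_param('regularization', 0.1)    # hypothetical hyperparameter
    mlflow.log_metric('accuracy', 0.89)        # hypothetical metric value
    mlflow.log_artifact('outputs/model.pkl')   # assumes this local file exists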

List vs Array

- Main difference is that an array behaves like a vector: when you multiply it by 2 the numbers in it are multiplied, whereas a list would be twice as long
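- A minimal sketch of the difference:
import numpy as np
data = [1, 2, 3]
print(data * 2)            # list: [1, 2, 3, 1, 2, 3] - twice as long
print(np.array(data) * 2)  # array: [2 4 6] - element-wise multiplication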

Register a model

- Model registration enables you to track multiple versions of a model, and retrieve models for inferencing (predicting label values from new data). - When you register a model, you can specify a name, description, tags, framework (such as Scikit-Learn or PyTorch), framework version, custom properties, and other useful metadata. - Registering a model with the same name as an existing model automatically creates a new version of the model, starting with 1 and increasing in units of 1. - To register a model from a local file, you can use the register method of the Model object: from azureml.core import Model model = Model.register(workspace=ws, model_name='classification_model', model_path='model.pkl', # local path description='A classification model', tags={'data-format': 'CSV'}, model_framework=Model.Framework.SCIKITLEARN, model_framework_version='0.20.3') - Alternatively, if you have a reference to the Run used to train the model, you can use its register_model method, as shown here: run.register_model( model_name='classification_model', model_path='outputs/model.pkl', # run outputs path description='A classification model', tags={'data-format': 'CSV'}, model_framework=Model.Framework.SCIKITLEARN, model_framework_version='0.20.3')

Regression

- Model that predicts a numeric value based on other variables. - Works by establishing relationships between variables in data that represent characteristics (features) of the thing being observed that they're also trying to predict (label). - Ex: taking weather, seasons, day of week into account when predicting ice cream sales. - y = f(x) where y is the label you want to predict and x is the feature/s - Supervised machine learning

Deploying a config

- Now that you have the entry script and environment, you need to configure the compute to which the service will be deployed. If you are deploying to an AKS cluster, you must create the cluster and a compute target for it before deploying: from azureml.core.compute import ComputeTarget, AksCompute cluster_name = 'aks-cluster' compute_config = AksCompute.provisioning_configuration(location='eastus') production_cluster = ComputeTarget.create(ws, cluster_name, compute_config) production_cluster.wait_for_completion(show_output=True) - With the compute target created, you can now define the deployment configuration, which sets the target-specific compute specification for the containerized deployment: from azureml.core.webservice import AksWebservice classifier_deploy_config = AksWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1) - The code to configure an ACI deployment is similar, except that you do not need to explicitly create an ACI compute target, and you must use the deploy_configuration class from the azureml.core.webservice.AciWebservice namespace. Similarly, you can use the azureml.core.webservice.LocalWebservice namespace to configure a local Docker-based service.

NumPy

- Package that includes specific data types and functions for working with numbers in Python

Scaling numeric features

- Prevents features with large values from producing coefficients that disproportionately affect predictions

Random Forest algorithm

- Random forest Classifier - Combines outputs of multiple decision trees

Random Sampling

- Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values as shown in the following code example: from azureml.train.hyperdrive import RandomParameterSampling, choice, normal param_space = { '--batch_size': choice(16, 32, 64), '--learning_rate': normal(10, 3) } param_sampling = RandomParameterSampling(param_space)

Measures of variance

- Range: difference between the min and max values in the dataset - Variance: average of the squared difference from the mean - Standard deviation: square root of the variance. This is generally the most useful piece of info. Higher standard deviation = more variance around the distribution mean (basically the data is more spread out).

Parameters/hyperparameters

- Referred to as hyperparameters in machine learning - Parameters are values in a dataset itself - Hyperparameters are defined externally from the data - A hyperparameter is a parameter that is set before the learning process begins. These parameters are tunable and can directly affect how well a model trains. Some examples of hyperparameters in machine learning: Learning Rate Number of Epochs Momentum Regularization constant Number of branches in a decision tree Number of clusters in a clustering algorithm (like k-means)

Coefficient of Determination (R^2)

- Relative metric in which the higher the value the better the fit of the model. - Represents how much of the variance between predicted and actual label values the model can explain - Represents the proportion of variance for a dependent variable that's explained by an independent variable or variables in a regression model. - If R^2 of a model is 0.50 then about half of the observed variation can be explained by the model's inputs.
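- A minimal sketch, assuming y_test (actual values) and predictions (predicted values) are existing arrays:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, predictions)
print('R2:', r2)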

mlflow.active_run()

- Returns a mlflow.entities.Run object corresponding to the currently active run, if any. - Note: You cannot access currently-active run attributes (parameters, metrics, etc.) through the run returned by mlflow.active_run. In order to access such attributes, use the mlflow.tracking.MlflowClient as follows: client = mlflow.tracking.MlflowClient() data = client.get_run(mlflow.active_run().info.run_id).data

Describe method

- Returns the main descriptive statistics for all numeric columns - EX: df_students.describe()

mlflow.get_artifact_uri()

- Returns the URI that artifacts from the current run should be logged to.

Pipelines

- The term is used by both Scikit-learn and Azure Machine Learning. - In Azure Machine Learning, a pipeline is a workflow of machine learning tasks in which each task is implemented as a step. - Steps can be arranged sequentially or in parallel, allowing you to build sophisticated flow logic to enact machine learning operations. - Each step can be run on a specific compute target, making it possible to combine different types of processing as required to achieve an overall goal. - A pipeline can be executed as a process by running the pipeline as an experiment. Each step in the pipeline runs on its allocated compute target as part of the overall experiment run. - You can publish a pipeline as a REST endpoint, enabling client applications to initiate a pipeline run. - You can also define a schedule for a pipeline, and have it run automatically at periodic intervals. - In Scikit-learn, pipelines enable us to apply a set of pre-processing steps that end with an algorithm. - You can fit the pipeline to the data so the model uses all of the preprocessing steps and the algorithm - Useful because if we want the model to predict values from new data we must apply the same transformations.

train_test_split function

- Scikit-learn function that ensures we get a statistically random split of training and test data
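- A minimal sketch, assuming X (features) and y (labels) are existing arrays and holding out 30% of the data for testing:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)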

classification_report

- Scikit-learn function to produce a classification report with more insight than just raw accuracy - Includes: Precision: Of the predictions made by the model for this class, how many were correct? Use "precision_score" Recall: Out of all instances of this class in the test dataset, how many did the model identify? Use "recall_score" F1-score: Average metric that takes both precision and recall into account Support: How many instances of this class are in the test dataset? - Precision and recall metrics are derived from four possible outcomes (remember the confusion matrix) 1. True positive 2. False positive 3. False negative 4. True negative
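- A minimal sketch, assuming y_test (actual labels) and y_hat (predicted labels) are existing arrays:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_hat))  # precision, recall, F1-score and support per class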

mlflow.set_tag()

- Sets a single key-value tag in the currently active run. The key and value are both strings. Use mlflow.set_tags() to set multiple tags at once.

mlflow.set_experiment()

- Sets an experiment as active. If the experiment does not exist, creates a new experiment. If you do not specify an experiment in mlflow.start_run(), new runs are launched under this experiment.

Confusion Matrix

- Shows number of cases where: - Model predicted 0 and the actual value is 0 (true -) - Model predicted 1 and the actual value is 1 (true +) - Model predicted 0 but the actual value is 1 (false -) - Model predicted 1 but the actual value is 0 (false +)

Root Mean Square Error (RMSE)

- Square root of MSE - Yields absolute metric in the same unit as the label. - Smaller value = better model - Used to measure the difference between values (sample or population) predicted by the model and the values observed.
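- A minimal sketch, assuming y_test (actual values) and predictions (predicted values) are existing arrays:
import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, predictions))  # square root of MSE
print('RMSE:', rmse)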

Correlation

- Statistical value between -1 and 1 that indicates correlation - Values above 0 indicate a positive correlation - Values below 0 indicate a negative correlation

Classification

- Supervised learning - A machine learning model is trained to predict which category or class an item belongs to. - Binary = 2 classes - Models can have more than 2 classes - The total probability for a 2-class model sums to 1. - Predictions are made by determining the probability of each possible class as a value between 0 (impossible) and 1 (certain) - Relies on data that includes known feature and label values - The remaining data is used to evaluate the model by comparing the predictions it generates to known class labels

Classification

- Takes values and classifies them into categories - Ex: Using height, age, weight, etc. to classify whether someone is or is not diabetic - The y value in this model is a vector of probability values between 0 and 1 for each class, showing the probability of the observation belonging to each class.

MinMax scaling

- Technique to normalize data so values you are comparing retain their proportional distribution
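- A minimal sketch with hypothetical values on very different scales:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
scaled = MinMaxScaler().fit_transform(data)  # each column rescaled to the 0-1 range
print(scaled)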

Running an Automated ML Experiment

- To run an automated machine learning experiment, you can either use the user interface in Azure Machine Learning studio, or submit an experiment using the SDK. - The user interface provides an intuitive way to select options for your automated machine learning experiment. - When using the SDK, you have greater flexibility, and you can set experiment options using the AutoMLConfig class, as shown in the following example: from azureml.core.runconfig import RunConfiguration from azureml.train.automl import AutoMLConfig automl_run_config = RunConfiguration(framework='python') automl_config = AutoMLConfig(name='Automated ML Experiment', task='classification', primary_metric = 'AUC_weighted', compute_target=aml_compute, training_data = train_dataset, validation_data = test_dataset, label_column_name='Label', featurization='auto', iterations=12, max_concurrent_iterations=4)

PipelineData

- The PipelineData object is a special kind of DataReference that: References a location in a datastore. Creates a data dependency between pipeline steps. - You can view a PipelineData object as an intermediary store for data that must be passed from a step to a subsequent step. - To use a PipelineData object to pass data between steps, you must: 1. Define a named PipelineData object that references a location in a datastore. 2. Pass the PipelineData object as a script argument in steps that run scripts (and include code in those scripts to read or write data) 3. Specify the PipelineData object as an input or output for the steps as appropriate. - For example, the following code defines a PipelineData object for the preprocessed data that must be passed between the steps: from azureml.pipeline.core import PipelineData from azureml.pipeline.steps import PythonScriptStep, EstimatorStep # Get a dataset for the initial data raw_ds = Dataset.get_by_name(ws, 'raw_dataset') # Define a PipelineData object to pass data between steps data_store = ws.get_default_datastore() prepped_data = PipelineData('prepped', datastore=data_store) # Step to run a Python script step1 = PythonScriptStep(name = 'prepare data', source_directory = 'scripts', script_name = 'data_prep.py', compute_target = 'aml-cluster', # Script arguments include PipelineData arguments = ['--raw-ds', raw_ds.as_named_input('raw_data'), '--out_folder', prepped_data], # Specify PipelineData as output outputs=[prepped_data]) # Step to train a model step2 = PythonScriptStep(name = 'train model', source_directory = 'scripts', script_name = 'train_model.py', compute_target = 'aml-cluster', # Pass as script argument arguments=['--in_folder', prepped_data], # Specify PipelineData as input inputs=[prepped_data]) - In the scripts themselves, you can retrieve a reference to the PipelineData object from the script argument, and use it like a local folder: # code in data_prep.py from azureml.core import Run import argparse import os # Get the experiment run context run = Run.get_context() # Get arguments parser = argparse.ArgumentParser() parser.add_argument('--raw-ds', type=str, dest='raw_dataset_id') parser.add_argument('--out_folder', type=str, dest='folder') args = parser.parse_args() output_folder = args.folder # Get input dataset as dataframe raw_df = run.input_datasets['raw_data'].to_pandas_dataframe() # code to prep data (in this case, just select specific columns) prepped_df = raw_df[['col1', 'col2', 'col3']] # Save prepped data to the PipelineData location os.makedirs(output_folder, exist_ok=True) output_path = os.path.join(output_folder, 'prepped_data.csv') prepped_df.to_csv(output_path)

Mean Square Error (MSE)

- The average of the squared differences between predicted and actual values. - So the squared differences for the individual datapoints are summed and divided by the number of datapoints - When the data is on a graph you find it by summing the squared variations in the vertical axis - The smaller the value the better the fit for the model - MSE vs Variance: Variance measures how far the data points are spread, whereas MSE measures how different the predicted values are from the actual values.

Coding different explainers

- The following code example shows how to create an instance of each of these explainer types for a hypothetical model named loan_model: #MimicExplainer from interpret.ext.blackbox import MimicExplainer from interpret.ext.glassbox import DecisionTreeExplainableModel mim_explainer = MimicExplainer(model=loan_model, initialization_examples=X_test, explainable_model = DecisionTreeExplainableModel, features=['loan_amount','income','age','marital_status'], classes=['reject', 'approve']) # TabularExplainer from interpret.ext.blackbox import TabularExplainer tab_explainer = TabularExplainer(model=loan_model, initialization_examples=X_test, features=['loan_amount','income','age','marital_status'], classes=['reject', 'approve']) # PFIExplainer from interpret.ext.blackbox import PFIExplainer pfi_explainer = PFIExplainer(model = loan_model, features=['loan_amount','income','age','marital_status'], classes=['reject', 'approve'])

Creating a managed compute target with SDK

- The most common ways to create or attach a compute target are to use either the Compute page in Azure Machine Learning studio, or the Azure Machine Learning SDK to provision compute targets in code. - A managed compute target is one that is managed by Azure Machine Learning, such as an Azure Machine Learning compute cluster. - To create an Azure Machine Learning compute cluster, use the azureml.core.compute.ComputeTarget class and the AmlCompute class, like this: from azureml.core import Workspace from azureml.core.compute import ComputeTarget, AmlCompute # Load the workspace from the saved config file ws = Workspace.from_config() # Specify a name for the compute (unique within the workspace) compute_name = 'aml-cluster' # Define compute configuration compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', min_nodes=0, max_nodes=4, vm_priority='dedicated') # Create the compute aml_cluster = ComputeTarget.create(ws, compute_name, compute_config) aml_cluster.wait_for_completion(show_output=True) - In this example, a cluster with up to 4 nodes that's based on the STANDARD_DS11_V2 VM size will be made. The priority for the VMs is set to dedicated, meaning they are reserved for use in this cluster (the alternative is to specify lowpriority, which has a lower cost but means that the VMs can be preempted if a higher-priority workload requires the compute).

Search space

- The set of hyperparameter values tried during hyperparameter tuning is known as the search space. The definition of the range of possible values that can be chosen depends on the type of hyperparameter. - Defining a search space: - To define a search space for hyperparameter tuning, create a dictionary with the appropriate parameter expression for each named hyperparameter. For example, the following search space indicates that the batch_size hyperparameter can have the value 16, 32, or 64, and the learning_rate hyperparameter can have any value from a normal distribution with a mean of 10 and a standard deviation of 3. from azureml.train.hyperdrive import choice, normal param_space = { '--batch_size': choice(16, 32, 64), '--learning_rate': normal(10, 3) } **Discrete Hyperparameters: Some hyperparameters require discrete values - in other words, you must select the value from a particular set of possibilities. You can define a search space for a discrete parameter using a choice from a list of explicit values, which you can define as a Python list (choice([10,20,30])), a range (choice(range(1,10))), or an arbitrary set of comma-separated values (choice(30,50,100)) You can also select discrete values from any of the following discrete distributions: qnormal quniform qlognormal qloguniform **Continuous hyperparameters: Some hyperparameters are continuous - in other words you can use any value along a scale. To define a search space for these kinds of value, you can use any of the following distribution types: normal uniform lognormal loguniform

ROC (receiver operating characteristic) chart

- True positive and false positive rates plotted against all possible thresholds - The area under the curve (AUC) represents the overall performance of the model - The dashed line is the probability of predicting correctly with a 50/50 random prediction - The curved line is the true positive rate - The closer the AUC is to 1 the better - The more the curved line hugs the upper left hand corner of the square, the better the accuracy
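- A minimal sketch of plotting the chart, assuming y_test (actual labels) and y_scores (predicted probabilities for the positive class) are existing arrays:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
plt.plot(fpr, tpr)                        # ROC curve
plt.plot([0, 1], [0, 1], linestyle='--')  # 50/50 random-prediction baseline
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()
print('AUC:', roc_auc_score(y_test, y_scores))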

[thinghere].isnull().sum()

- Use to find null values

thinghere[thinghere.isnull().any(axis=1)]

- Use to find rows containing null values

Joblib

- Use to save trained model for later use
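- A minimal sketch, assuming model is a trained scikit-learn model and the outputs folder already exists:
import joblib
joblib.dump(value=model, filename='outputs/model.pkl')  # save the trained model
model = joblib.load('outputs/model.pkl')                # reload it later for scoring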

download_file and download_files method

- Used after an experiment run has completed to download its output files to the local file system. - EX: # Download a named file run.download_file(name='outputs/model.pkl',output_file_path='model.pkl')

get_file_names

- Used after an experiment run has completed to list the files generated. - EX: # "run" is a reference to a completed experiment run # List the files generated by the experiment for file in run.get_file_names(): print(file)

Troubleshooting a service deployment

- There are a lot of elements to a real-time service deployment, including the trained model, the runtime environment configuration, the scoring script, the container image, and the container host. Troubleshooting a failed deployment, or an error when consuming a deployed service can be complex. - As an initial troubleshooting step, you can check the status of a service by examining its state: from azureml.core.webservice import AksWebservice # Get the deployed service service = AksWebservice(name='classifier-service', workspace=ws) # Check its state print(service.state) - For an operational service, the state should be Healthy. - If a service is not healthy, or you are experiencing errors when using it, you can review its logs: The logs include detailed information about the provisioning of the service, and the requests it has processed; and can often provide an insight into the cause of unexpected errors. print(service.get_logs()) - Deployment and runtime errors can be easier to diagnose by deploying the service as a container in a local Docker instance, like this: from azureml.core.webservice import LocalWebservice deployment_config = LocalWebservice.deploy_configuration(port=8890) service = Model.deploy(ws, 'test-svc', [model], inference_config, deployment_config) - You can then test the locally deployed service using the SDK: print(service.run(input_data = json_data)) - You can then troubleshoot runtime issues by making changes to the scoring file that is referenced in the inference configuration, and reloading the service without redeploying it (something you can only do with a local service): service.reload() print(service.run(input_data = json_data))

Query logs in Application Insights

- To analyze captured log data, you can use the Log Analytics query interface for Application Insights in the Azure portal. This interface supports a SQL-like query syntax that you can use to extract fields from logged data, including custom dimensions created by your Azure Machine Learning service. - For example, the following query returns the timestamp and customDimensions.Content fields from log traces that have a message field value of STDOUT (indicating the data is in the standard output log) and a customDimensions.["Service Name"] field value of my-svc: traces |where message == "STDOUT" and customDimensions.["Service Name"] == "my-svc" | project timestamp, customDimensions.Content - This query returns the logged data as a table

Write Log Data

- To capture telemetry data for Application insights, you can write any values to the standard output log in the scoring script for your service by using a print statement, as shown in the following example: def init(): global model model = joblib.load(Model.get_model_path('my_model')) def run(raw_data): data = json.loads(raw_data)['data'] predictions = model.predict(data) log_txt = 'Data:' + str(data) + ' - Predictions:' + str(predictions) print(log_txt) return predictions.tolist() - Azure Machine Learning creates a custom dimension in the Application Insights data model for the output you write.

Configuring a hyperdrive experiment

- To configure a hyperdrive experiment you must use the HyperDriveConfig object to configure the experiment run: from azureml.core import Experiment from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal # Assumes ws, script_config and param_sampling are already defined hyperdrive = HyperDriveConfig(run_config=script_config, hyperparameter_sampling=param_sampling, policy=None, primary_metric_name='Accuracy', primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, max_total_runs=6, max_concurrent_runs=4) experiment = Experiment(workspace = ws, name = 'hyperdrive_training') hyperdrive_run = experiment.submit(config=hyperdrive)

Creating a pipeline

- To create a pipeline, you have to first define each step, and then create a pipeline that includes the steps. - The specific configuration of each step depends on the step type. - For example the following code defines two PythonScriptStep steps to prepare data, and then train a model: from azureml.pipeline.steps import PythonScriptStep # Step to run a Python script step1 = PythonScriptStep(name = 'prepare data', source_directory = 'scripts', script_name = 'data_prep.py', compute_target = 'aml-cluster') # Step to train a model step2 = PythonScriptStep(name = 'train model', source_directory = 'scripts', script_name = 'train_model.py', compute_target = 'aml-cluster') - After defining the steps you can assign them to a pipeline and run it as an experiment: from azureml.pipeline.core import Pipeline from azureml.core import Experiment # Construct the pipeline train_pipeline = Pipeline(workspace = ws, steps = [step1,step2]) # Create an experiment and run the pipeline experiment = Experiment(workspace = ws, name = 'training-pipeline') pipeline_run = experiment.submit(train_pipeline)

Explaining global feature importance

- To retrieve global importance values for the features in your mode, you call the explain_global() method of your explainer to get a global explanation - And then use the get_feature_importance_dict() method to get a dictionary of the feature importance values. - The following code example shows how to retrieve global feature importance: # MimicExplainer global_mim_explanation = mim_explainer.explain_global(X_train) global_mim_feature_importance = global_mim_explanation.get_feature_importance_dict() # TabularExplainer global_tab_explanation = tab_explainer.explain_global(X_train) global_tab_feature_importance = global_tab_explanation.get_feature_importance_dict() # PFIExplainer global_pfi_explanation = pfi_explainer.explain_global(X_train, y_train) global_pfi_feature_importance = global_pfi_explanation.get_feature_importance_dict()

Explaining local feature importance

- To retrieve local feature importance from a MimicExplainer or a TabularExplainer, you must call the explain_local() method of your explainer, specifying the subset of cases you want to explain. Then you can use the get_ranked_local_names() and get_ranked_local_values() methods to retrieve dictionaries of the feature names and importance values, ranked by importance. - The following code example shows how to retrieve local feature importance: # MimicExplainer local_mim_explanation = mim_explainer.explain_local(X_test[0:5]) local_mim_features = local_mim_explanation.get_ranked_local_names() local_mim_importance = local_mim_explanation.get_ranked_local_values() # TabularExplainer local_tab_explanation = tab_explainer.explain_local(X_test[0:5]) local_tab_features = local_tab_explanation.get_ranked_local_names() local_tab_importance = local_tab_explanation.get_ranked_local_values()

isnull method

- Used to ID null values in a dataset (missing values)

Regularization parameter

- Used to counteract any bias in the sample and help the model generalize well by avoiding overfitting the model to the training data.

loc method

- Used to retrieve data for a specific index value - Ex: df_students.loc[5] retrieves the row with index label 5 from the data frame

ScriptRunConfig

- Used to run a script-based experiment that trains a machine learning model. - Write a script then run it as an experiment by writing a ScriptRunConfig that references the folder and script file - You generally also need to define a Python (Conda) environment that includes any packages required by the script. - In this example, the script uses Scikit-Learn, so you must create an environment that includes that. The script also uses Azure Machine Learning to log metrics, so you need to remember to include the azureml-defaults package in the environment: from azureml.core import Experiment, ScriptRunConfig, Environment from azureml.core.conda_dependencies import CondaDependencies # Create a Python environment for the experiment sklearn_env = Environment("sklearn-env") # Ensure the required packages are installed packages = CondaDependencies.create(conda_packages=['scikit-learn','pip'], pip_packages=['azureml-defaults']) sklearn_env.python.conda_dependencies = packages # Create a script config script_config = ScriptRunConfig(source_directory='training_folder', script='training.py', environment=sklearn_env) # Submit the experiment experiment = Experiment(workspace=ws, name='training-experiment') run = experiment.submit(config=script_config) run.wait_for_completion()

Encoding categorical variables

- Using one hot encoding technique, you can create true/false features for each possible category
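- A minimal sketch with a hypothetical categorical column:
import pandas as pd
df = pd.DataFrame({'size': ['S', 'M', 'L']})
encoded = pd.get_dummies(df, columns=['size'])  # creates size_S, size_M, size_L true/false columns
print(encoded)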

Threshold value

- Usually 0.5 is used to determine the predicted class - If the predicted probability of the positive class is greater than the threshold, the case is assigned to class 1; if it is less than the threshold, it is assigned to class 0.

Encoding categorical features

- Values that represent discrete categories

Normalizing numeric features

- Values you can measure or count

Writing a script to train a model

- When using an experiment to train a model, your script should save the trained model in the outputs folder. - For example, the following script trains a model using Scikit-Learn, and saves it in the outputs folder using the joblib package: from azureml.core import Run import pandas as pd import numpy as np import os import joblib from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression # Get the experiment run context run = Run.get_context() # Prepare the dataset data = pd.read_csv('data.csv') X, y = data[['Feature1','Feature2','Feature3']].values, data['Label'].values X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30) # Train a logistic regression model reg = 0.1 model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train) # calculate accuracy y_hat = model.predict(X_test) acc = np.average(y_hat == y_test) run.log('Accuracy', np.float(acc)) # Save the trained model os.makedirs('outputs', exist_ok=True) joblib.dump(value=model, filename='outputs/model.pkl') run.complete() - To prepare for an experiment that trains a model, a script like this is created and saved in a folder. - For example, you could save this script as training_script.py in a folder named training_folder. Because the script includes code to load training data from data.csv, this file should also be saved in the folder.

Work with file datasets

- When working with a file dataset, you can use the to_path() method to return a list of the file paths encapsulated by the dataset: for file_path in file_ds.to_path(): print(file_path) - Just as with a Tabular dataset, there are two ways you can pass a file dataset to a script. However, there are some key differences in the way that the dataset is passed. - You can pass a file dataset as a script argument. Unlike with a tabular dataset, you must specify a mode for the file dataset argument, which can be as_download or as_mount. This provides an access point that the script can use to read the files in the dataset. In most cases, you should use as_download, which copies the files to a temporary location on the compute where the script is being run. However, if you are working with a large amount of data for which there may not be enough storage space on the experiment compute, use as_mount to stream the files directly from their source. ScriptRunConfig: env = Environment('my_env') packages = CondaDependencies.create(conda_packages=['pip'], pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]']) env.python.conda_dependencies = packages script_config = ScriptRunConfig(source_directory='my_dir', script='script.py', arguments=['--ds', file_ds.as_download()], environment=env) Script: from azureml.core import Run import glob parser.add_argument('--ds', type=str, dest='ds_ref') args = parser.parse_args() run = Run.get_context() imgs = glob.glob(ds_ref + "/*.jpg")

sklearn.metrics.confusion_matrix

- Will make a confusion matrix in Python
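- A minimal sketch, assuming y_test (actual labels) and y_hat (predicted labels) are existing arrays:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_hat)
print(cm)  # rows = actual class, columns = predicted class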

Use a named input file for a dataset

- You can also pass a file dataset as a named input. In this approach, you use the as_named_input method of the dataset to specify a name before specifying the access mode. Then in the script, you can retrieve the dataset by name from the run context's input_datasets collection and read the files from there. As with tabular datasets, if you use a named input, you still need to include a script argument for the dataset, even though you don't actually use it to retrieve the dataset. ScriptRunConfig: env = Environment('my_env') packages = CondaDependencies.create(conda_packages=['pip'], pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]']) env.python.conda_dependencies = packages script_config = ScriptRunConfig(source_directory='my_dir', script='script.py', arguments=['--ds', file_ds.as_named_input('my_ds').as_download()], environment=env) Script: from azureml.core import Run import glob parser.add_argument('--ds', type=str, dest='ds_ref') args = parser.parse_args() run = Run.get_context() dataset = run.input_datasets['my_ds'] imgs= glob.glob(dataset + "/*.jpg")

Monitoring hyperdrive experiments

- You can monitor hyperdrive experiments in Azure Machine Learning studio, or by using the Jupyter Notebooks RunDetails widget. - The experiment will initiate a child run for each hyperparameter combination to be tried, and you can retrieve the logged metrics for these runs using the following code: for child_run in run.get_children(): print(child_run.id, child_run.get_metrics()) - You can also list all runs in descending order of performance: for child_run in hyperdrive_run.get_children_sorted_by_primary_metric(): print(child_run) - To look at the best performing run: best_run = hyperdrive_run.get_best_run_by_primary_metric()

Working with Tabular datasets

- You can read data directly from a tabular dataset by converting it into a Pandas or Spark dataframe: df = tab_ds.to_pandas_dataframe() # code to work with dataframe goes here, for example: print(df.head()) - When you need to access a dataset in an experiment script, you must pass the dataset to the script. There are two ways you can do this. 1. You can pass a tabular dataset as a script argument. When you take this approach, the argument received by the script is the unique ID for the dataset in your workspace. In the script, you can then get the workspace from the run context and use it to retrieve the dataset by it's ID: ScriptRunConfig: env = Environment('my_env') packages = CondaDependencies.create(conda_packages=['pip'], pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]']) env.python.conda_dependencies = packages script_config = ScriptRunConfig(source_directory='my_dir', script='script.py', arguments=['--ds', tab_ds], environment=env) Script: from azureml.core import Run, Dataset parser.add_argument('--ds', type=str, dest='dataset_id') args = parser.parse_args() run = Run.get_context() ws = run.experiment.workspace dataset = Dataset.get_by_id(ws, id=args.dataset_id) data = dataset.to_pandas_dataframe() - Alternatively, you can pass a tabular dataset as a named input. In this approach, you use the as_named_input method of the dataset to specify a name for the dataset. Then in the script, you can retrieve the dataset by name from the run context's input_datasets collection without needing to retrieve it from the workspace. - Note that if you use this approach, you still need to include a script argument for the dataset, even though you don't actually use it to retrieve the dataset. ScriptRunConfig: env = Environment('my_env') packages = CondaDependencies.create(conda_packages=['pip'], pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]']) env.python.conda_dependencies = packages script_config = ScriptRunConfig(source_directory='my_dir', script='script.py', arguments=['--ds', tab_ds.as_named_input('my_dataset')], environment=env) Script: from azureml.core import Run parser.add_argument('--ds', type=str, dest='ds_id') args = parser.parse_args() run = Run.get_context() dataset = run.input_datasets['my_dataset'] data = dataset.to_pandas_dataframe()

Triggering a pipeline to run when data changes

- To set up a pipeline that runs whenever data changes, you must create a Schedule that monitors a specified path on a datastore, like this: from azureml.core import Datastore from azureml.pipeline.core import Schedule training_datastore = Datastore(workspace=ws, name='blob_data') pipeline_schedule = Schedule.create(ws, name='Reactive Training', description='trains model on data change', pipeline_id=published_pipeline_id, experiment_name='Training_Pipeline', datastore=training_datastore, path_on_datastore='data/training')

Explainer

- You can use the Azure Machine Learning SDK to create explainers for models, even if they were not trained using an Azure Machine Learning experiment. - To interpret a local model, you must install the azureml-interpret package and use it to create an explainer. There are multiple kinds of explainer, including: - MimicExplainer - An explainer that creates a global surrogate model that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (for example, linear or tree-based). - TabularExplainer - An explainer that acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture. - PFIExplainer - A Permutation Feature Importance explainer that analyzes feature importance by shuffling feature values and measuring the impact on prediction performance.
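For example, a minimal sketch of creating and using a TabularExplainer with azureml-interpret; the scikit-learn model, synthetic data, and feature/class names here are illustrative placeholders:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from interpret.ext.blackbox import TabularExplainer

# Illustrative model: a simple classifier trained on synthetic data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = DecisionTreeClassifier().fit(X, y)

# Wrap the trained model in a TabularExplainer (a SHAP algorithm is chosen automatically)
tab_explainer = TabularExplainer(model, X,
                                 features=['f0', 'f1', 'f2', 'f3'],
                                 classes=['negative', 'positive'])

# Global feature importance across the data
global_explanation = tab_explainer.explain_global(X)
print(global_explanation.get_feature_importance_dict())

# Local feature importance for the first five rows
local_explanation = tab_explainer.explain_local(X[0:5])
print(local_explanation.get_ranked_local_names())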

Managing datastores

- You can view and manage datastores in Azure Machine Learning studio, or you can use the Azure Machine Learning SDK. - For example, the following code lists the names of each datastore in the workspace: for ds_name in ws.datastores: print(ds_name) - You can get a reference to any datastore by using the Datastore.get() method: blob_store = Datastore.get(ws, datastore_name='blob_data') - The workspace always has a default datastore (initially, this is the built-in workspaceblobstore datastore), which you can retrieve by using the get_default_datastore() method of a Workspace object: default_store = ws.get_default_datastore() - To change the default datastore, use the set_default_datastore() method: ws.set_default_datastore('blob_data')

Viewing registered models

- You can view registered models in Azure Machine Learning studio. - You can also use the Model object to retrieve details of registered models, like this: from azureml.core import Model for model in Model.list(ws): # Get model name and auto-generated version print(model.name, 'version:', model.version)

Creating an environment

- Your service requires a Python environment in which to run the entry script, which you can configure using a Conda configuration file. - An easy way to create this file is to use a CondaDependencies class to create a default environment (which includes the azureml-defaults package and commonly used packages like numpy and pandas), add any other required packages, and then serialize the environment to a string and save it: from azureml.core.conda_dependencies import CondaDependencies # Add the dependencies for your model myenv = CondaDependencies() myenv.add_conda_package("scikit-learn") # Save the environment config as a .yml file env_file = 'service_files/env.yml' with open(env_file,"w") as f: f.write(myenv.serialize_to_string()) print("Saved dependency info in", env_file)

Logistic function

- f(x) = y - Calculates the probability of the y value based on x - Produces a y value between 0 and 1 for any value of x - This function forms a sigmoidal (S-shaped) curve
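For example, a minimal numpy sketch of the logistic (sigmoid) function f(x) = 1 / (1 + e^-x); the function name sigmoid is just an illustrative choice:

import numpy as np

def sigmoid(x):
    # Logistic function: maps any real value to the range (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))                # 0.5
print(sigmoid(-4), sigmoid(4))   # close to 0 and close to 1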

Logistic Regression

- Well-established algorithm for classification - Call model.predict(x_new) to classify new data
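For example, a minimal scikit-learn sketch; the synthetic dataset and parameter values are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data stands in for real training data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

x_new = X_test[:5]
print(model.predict(x_new))        # predicted classes (0 or 1)
print(model.predict_proba(x_new))  # class probabilities from the logistic function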

Using datastores

- To add a datastore to your workspace, you can register it using the graphical interface in Azure Machine Learning studio, or you can use the Azure Machine Learning SDK. - For example, the following code registers an Azure Storage blob container as a datastore named blob_data: from azureml.core import Workspace, Datastore ws = Workspace.from_config() # Register a new datastore blob_ds = Datastore.register_azure_blob_container(workspace=ws, datastore_name='blob_data', container_name='data_container', account_name='az_store_acct', account_key='123456abcde789...')

primary_metric

- This is the target performance metric for which the optimal model will be determined. - Azure Machine Learning supports a set of named metrics for each type of task. - To retrieve the list of metrics available for a particular task type, you can use the get_primary_metrics function as shown here: from azureml.train.automl.utilities import get_primary_metrics get_primary_metrics('classification')

fillna method

Replaces null/missing values with a specified value (for example, the column mean) so that missing datapoints are filled in.
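For example, a minimal pandas sketch using a small hypothetical DataFrame with an 'age' column:

import pandas as pd
import numpy as np

# Hypothetical data with a missing value
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'age': [25, np.nan, 40]})

# Fill missing ages with the column mean
df['age'] = df['age'].fillna(df['age'].mean())
print(df)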

Precision

TP/(TP+FP) - Out of all the cases the model predicted to be positive, how many actually are positive?

Recall

TP/(TP+FN) - Out of all the cases that are actually positive, how many did the model identify?
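For example, a minimal sketch computing precision and recall with scikit-learn; the label arrays are hypothetical:

from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and model predictions (TP=3, FP=1, FN=1)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print('Precision:', precision_score(y_true, y_pred))  # TP/(TP+FP) = 0.75
print('Recall:', recall_score(y_true, y_pred))        # TP/(TP+FN) = 0.75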

Steps to deploy a model as a real-time service

1. Register a trained model: from azureml.core import Model classification_model = Model.register(workspace=ws, model_name='classification_model', model_path='model.pkl', # local path description='A classification model') OR (if you have a reference to the Run) run.register_model( model_name='classification_model', model_path='outputs/model.pkl', # run outputs path description='A classification model')
2. Define an inference configuration - Entry script: import json import joblib import numpy as np from azureml.core.model import Model # Called when the service is loaded def init(): global model # Get the path to the registered model file and load it model_path = Model.get_model_path('classification_model') model = joblib.load(model_path) # Called when a request is received def run(raw_data): # Get the input data as a numpy array data = np.array(json.loads(raw_data)['data']) # Get a prediction from the model predictions = model.predict(data) # Return the predictions as any JSON serializable format return predictions.tolist() - Create an environment: from azureml.core.conda_dependencies import CondaDependencies # Add the dependencies for your model myenv = CondaDependencies() myenv.add_conda_package("scikit-learn") # Save the environment config as a .yml file env_file = 'service_files/env.yml' with open(env_file,"w") as f: f.write(myenv.serialize_to_string()) print("Saved dependency info in", env_file) - Combine the script and environment in an InferenceConfig: from azureml.core.model import InferenceConfig classifier_inference_config = InferenceConfig(runtime= "python", source_directory = 'service_files', entry_script="score.py", conda_file="env.yml")
3. Define a deployment configuration: from azureml.core.compute import ComputeTarget, AksCompute cluster_name = 'aks-cluster' compute_config = AksCompute.provisioning_configuration(location='eastus') production_cluster = ComputeTarget.create(ws, cluster_name, compute_config) production_cluster.wait_for_completion(show_output=True) - With the compute target created, you can now define the deployment configuration, which sets the target-specific compute specification for the containerized deployment: from azureml.core.webservice import AksWebservice classifier_deploy_config = AksWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)
4. Deploy the model: from azureml.core.model import Model model = ws.models['classification_model'] service = Model.deploy(workspace=ws, name = 'classifier-service', models = [model], inference_config = classifier_inference_config, deployment_config = classifier_deploy_config, deployment_target = production_cluster) service.wait_for_deployment(show_output = True)

PFIExplainer

A Permutation Feature Importance explainer that analyzes feature importance by shuffling feature values and measuring the impact on prediction performance.

Managed compute target

A managed compute target is one that is managed by Azure Machine Learning, such as an Azure Machine Learning compute cluster.
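For example, a minimal sketch of provisioning an Azure Machine Learning compute cluster with the SDK; the cluster name 'aml-cluster' and the VM size and node counts are illustrative choices:

from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute

ws = Workspace.from_config()

# Provision a managed compute cluster that scales between 0 and 2 nodes
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2',
                                                       min_nodes=0,
                                                       max_nodes=2)
aml_cluster = ComputeTarget.create(ws, 'aml-cluster', compute_config)
aml_cluster.wait_for_completion(show_output=True)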

Median Stopping Policy

A median stopping policy abandons runs where the target performance metric is worse than the median of the running averages for all runs. EX: from azureml.train.hyperdrive import MedianStoppingPolicy early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

URI

A redirect URI, or reply URL, is the location where the authorization server sends the user once the app has been successfully authorized and granted an authorization code or access token. The authorization server sends the code or token to the redirect URI, so it's important you register the correct location as part of the app registration process.

TabularExplainer

An explainer that acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture.

MimicExplainer

An explainer that creates a global surrogate model that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (for example, linear or tree-based).

Unmanaged compute target

An unmanaged compute target is one that is defined and managed outside of the Azure Machine Learning workspace; for example, an Azure virtual machine or an Azure Databricks cluster.
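For example, a minimal sketch of attaching an existing Azure Databricks workspace as an unmanaged compute target; the resource group, workspace name, access token, and compute name shown here are placeholders:

from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

ws = Workspace.from_config()

# Attach an existing Azure Databricks workspace (managed outside Azure ML)
attach_config = DatabricksCompute.attach_configuration(resource_group='my-resource-group',
                                                       workspace_name='my-databricks-workspace',
                                                       access_token='1234abcd...')
databricks_compute = ComputeTarget.attach(ws, 'databricks-compute', attach_config)
databricks_compute.wait_for_completion(show_output=True)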

dropna method

Drops rows (or, with axis=1, columns) that contain null/missing values.
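For example, a minimal pandas sketch using a small hypothetical DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['x', 'y', None]})

print(df.dropna())        # drop rows containing any null value
print(df.dropna(axis=1))  # drop columns containing any null value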

NaN

Not a number

ParallelRunStep

Using the ParallelRunStep class, you can read batches of files from a File dataset and write the processing output to a PipelineData reference. Additionally, you can set the output_action setting for the step to "append_row", which will ensure that all instances of the step being run in parallel will collate their results to a single output file named parallel_run_step.txt.
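For example, a minimal sketch of configuring a ParallelRunStep for batch scoring; the dataset name, environment name, compute name, script folder, and script file are assumed placeholders:

from azureml.core import Workspace, Environment
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()

# Assumed names: a registered File dataset, an environment, and a compute cluster
batch_data_set = ws.datasets['batch-data']
batch_env = Environment.get(ws, 'batch_environment')
aml_cluster = ws.compute_targets['aml-cluster']

# Output location for the collated results file (parallel_run_step.txt)
output_dir = PipelineData(name='inferences', datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory='batch_scripts',        # assumed folder containing the scoring script
    entry_script='batch_scoring_script.py',  # assumed script defining init() and run(mini_batch)
    mini_batch_size='5',
    error_threshold=10,
    output_action='append_row',              # collate all results into parallel_run_step.txt
    environment=batch_env,
    compute_target=aml_cluster,
    node_count=2)

parallelrun_step = ParallelRunStep(
    name='batch-score',
    parallel_run_config=parallel_run_config,
    inputs=[batch_data_set.as_named_input('batch_data')],
    output=output_dir,
    arguments=[],
    allow_reuse=True)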

Bandit policy

You can use a bandit policy to stop a run if the target performance metric underperforms the best run so far by a specified margin. EX: from azureml.train.hyperdrive import BanditPolicy early_termination_policy = BanditPolicy(slack_amount = 0.2, evaluation_interval=1, delay_evaluation=5) - This example applies the policy for every iteration after the first five, and abandons runs where the reported target metric is 0.2 or more worse than the best performing run after the same number of intervals. - You can also apply a bandit policy using a slack factor, which compares the performance metric as a ratio rather than an absolute value.

Retrieving the best run and its model

best_run, fitted_model = automl_run.get_output() best_run_metrics = best_run.get_metrics() for metric_name in best_run_metrics: metric = best_run_metrics[metric_name] print(metric_name, metric)

Exploring preprocessing steps

for step in fitted_model.named_steps: print(step)

Creating a workspace

from azureml.core import Workspace ws = Workspace.create(name='aml-workspace', subscription_id='123456-abc-123...', resource_group='aml-resources', create_resource_group=True, location='eastus' )

