Certificate: AWS Machine Learning


Accuracy - (TP + TN) / (TP + TN + FP + FN): the fraction of all predictions the model got right.

False Positive Rate - FP / (FP + TN): the fraction of actual negatives that the model incorrectly predicted as positive.

The data science team at a financial services company has created a multi-class classification model to segment the company's customers into three tiers - Platinum, Gold and Silver. The confusion matrix for the underlying model was reported as follows:

Overall Precision = 0.56

Precision = True Positives / (True Positives + False Positives)
Precision for Platinum = 30 / (30 + (20 + 10)) = 30/60 = 0.50
Precision for Gold = 60 / (60 + (50 + 10)) = 60/120 = 0.50
Precision for Silver = 80 / (80 + (20 + 20)) = 80/120 = 0.67
Overall Precision = average of the precision for Platinum, Gold and Silver = (0.50 + 0.50 + 0.67) / 3 = 0.56
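The precision arithmetic above can be reproduced in a short Python sketch. The original confusion-matrix image is not shown here, so the matrix below is a reconstruction consistent with the stated calculations (rows = actual class, columns = predicted class); the off-diagonal row assignment is an assumption, but it does not affect per-class precision, which only depends on column totals.

```python
# Macro-averaged precision from a multi-class confusion matrix.
# Rows are actual classes, columns (inner dicts) are predicted classes.
# The exact matrix is an assumed reconstruction matching the quiz numbers.
confusion = {
    "Platinum": {"Platinum": 30, "Gold": 50, "Silver": 20},
    "Gold":     {"Platinum": 20, "Gold": 60, "Silver": 20},
    "Silver":   {"Platinum": 10, "Gold": 10, "Silver": 80},
}

def precision(matrix, cls):
    """Precision for one class: TP / (TP + FP), where FP counts every
    sample of another actual class that was predicted as `cls`."""
    tp = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix.values())
    return tp / predicted_as_cls

per_class = {c: precision(confusion, c) for c in confusion}
overall = sum(per_class.values()) / len(per_class)  # macro average
print(per_class)
print(round(overall, 2))  # 0.56
```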

A ride-hailing company needs to ingest and store certain attributes of real-time automobile health data, which is in JSON format. The company does not want to manage the underlying infrastructure and it wants the data to be available for visualization on a near real-time basis. As an ML specialist, what is your recommendation so that the solution requires the least development time and infrastructure management?

Ingest the data via Kinesis Data Firehose, transform it with an intermediary Lambda function, store it in S3, and visualize it with Amazon QuickSight

Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. Kinesis Data Firehose manages all underlying infrastructure, storage, networking, and configuration needed to capture and load your data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, or Splunk. You do not have to worry about provisioning, deploying, or maintaining hardware or software, or writing any other application to manage this process.

Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include Machine Learning-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

This is the correct option: the streaming JSON data is processed via Kinesis Data Firehose, which uses a Lambda function to write the selected attributes as JSON data into an S3 location. Note that Firehose offers built-in integration with intermediary Lambda functions to handle any transformations. The transformed data is then consumed in QuickSight for visualizations.

Batch Training

Batch Training - Batch size is a term used in machine learning and refers to the number of training examples utilized in one iteration. The batch size can be one of three options:

- batch mode: the batch size is equal to the total dataset size, making the iteration and epoch values equivalent.
- mini-batch mode: the batch size is greater than one but less than the total dataset size; usually a number that divides evenly into the total dataset size.
- stochastic mode: the batch size is equal to one, so the gradient and the neural network parameters are updated after each sample.
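The relationship between the three modes can be sketched in a few lines; the dataset size below is hypothetical, and the point is just how batch size determines iterations per epoch.

```python
# How batch size determines the number of iterations per epoch.
import math

def iterations_per_epoch(dataset_size, batch_size):
    # One iteration consumes one batch; ceil handles a final partial batch.
    return math.ceil(dataset_size / batch_size)

dataset_size = 1000  # hypothetical

print(iterations_per_epoch(dataset_size, dataset_size))  # batch mode: 1
print(iterations_per_epoch(dataset_size, 100))           # mini-batch mode: 10
print(iterations_per_epoch(dataset_size, 1))             # stochastic mode: 1000
```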

Factorization Machines

Factorization Machines - The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.
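The standard factorization machine scoring function combines a linear term with pairwise feature interactions modeled through low-dimensional factor vectors. A minimal sketch, with hypothetical weights (in SageMaker these are learned during training):

```python
# Factorization Machines scoring:
#   y(x) = w0 + sum_i w_i*x_i + sum_{i<j} <v_i, v_j> * x_i * x_j
# The pairwise term lets the model capture interactions (e.g. ad-category
# x page-category) even in sparse data, via the shared factor vectors V.
def fm_predict(x, w0, w, V):
    """x: feature vector, w0/w: bias and linear weights,
    V: one low-dimensional factor vector per feature."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][k] * V[j][k] for k in range(len(V[i])))
            pairwise += dot * x[i] * x[j]
    return linear + pairwise

# Two active indicator features (hypothetical weights):
x = [1.0, 0.0, 1.0]
w0, w = 0.1, [0.2, 0.3, 0.4]
V = [[0.1, 0.2], [0.5, 0.1], [0.3, 0.4]]
print(fm_predict(x, w0, w, V))  # 0.7 linear + 0.11 interaction = 0.81
```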

Create more samples using algorithms such as SMOTE

In case of a binary classification model with strongly imbalanced classes, we can over-sample from the minority class, collect more training data for the minority class, or create synthetic samples using algorithms such as SMOTE (Synthetic Minority Over-sampling Technique). SMOTE uses a k-nearest neighbours approach: for each minority-class sample, it picks one of that sample's nearest minority-class neighbours and creates a new synthetic minority example by interpolating between the two.
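The interpolation idea behind SMOTE can be sketched in pure Python. This is a simplified toy (single nearest neighbour in 2-D, fixed seed); real implementations sample among k nearest neighbours and handle arbitrary dimensionality.

```python
# A minimal SMOTE-style sketch: synthesize new minority-class points by
# interpolating between a minority sample and its nearest minority neighbour.
import random

def smote(minority, n_new, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # nearest neighbour of `a` among the other minority samples
        b = min((p for p in minority if p is not a),
                key=lambda p: sum((pa - pb) ** 2 for pa, pb in zip(a, p)))
        gap = rng.random()  # random interpolation factor in [0, 1)
        synthetic.append(tuple(pa + gap * (pb - pa) for pa, pb in zip(a, b)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]  # hypothetical minority class
new_points = smote(minority, n_new=5)
print(new_points)  # five points lying between existing minority samples
```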

Which of the following represent the correct statements regarding the Amazon SageMaker logging and monitoring options on CloudWatch and CloudTrail? (Select four)

Incorrect options:

- CloudTrail monitors calls to InvokeEndpoint - Incorrect, because CloudTrail does not capture calls to InvokeEndpoint.
- SageMaker monitoring metrics are available on CloudWatch at a 2-minute frequency - Incorrect, because SageMaker monitoring metrics are available on CloudWatch at a 1-minute frequency.

What technique would you use in SageMaker to train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance?

Incremental Training - Over time, you might find that a model generates inferences that are not as good as they were in the past. With incremental training, you can use the artifacts from an existing model and use an expanded dataset to train a new model. Incremental training saves both time and resources. You can use incremental training to:

- Train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance.
- Use the model artifacts or a portion of the model artifacts from a popular publicly available model in a training job, so you don't need to train a new model from scratch.
- Resume a training job that was stopped.
- Train several variants of a model, either with different hyperparameter settings or using different datasets.

A machine learning engineer is trying to develop a linear regression model and the following represents the residual plot (residuals on the y axis and the independent variable on the x axis) for the model: Given the above residual plot, what would you attribute as the MOST LIKELY reason behind the model's failure?

Linear regression is not the right choice for the underlying model and the residuals do not have constant variance

The marketing team at an Enterprise SaaS company has determined that the cost of customer churn is much greater than the cost of customer retention for its existing customer base. To address this issue, the team worked on a classification model to predict if a customer is likely to churn and boiled it down to two model variants. Model A had 92% accuracy with 40 False Negatives (FN) and 100 False Positives (FP) whereas model B also had 92% accuracy with 100 FN and 40 FP. Which of the two models is more cost effective for the company?

Model A
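Model A is the better choice because a false negative (a churner the model misses, who then leaves) costs more than a false positive (a retention offer wasted on a loyal customer), and Model A has fewer false negatives. A quick sketch with hypothetical dollar figures makes the comparison concrete:

```python
# Expected error cost under the question's premise: churn (FN) is costlier
# than an unnecessary retention action (FP). Dollar values are hypothetical.
COST_FN = 1000  # lost customer (churn)
COST_FP = 100   # wasted retention incentive

def error_cost(fn, fp):
    return fn * COST_FN + fp * COST_FP

model_a = error_cost(fn=40, fp=100)  # 40*1000 + 100*100 = 50,000
model_b = error_cost(fn=100, fp=40)  # 100*1000 + 40*100 = 104,000
print(model_a, model_b)
assert model_a < model_b  # Model A is more cost-effective
```

Both models have identical accuracy, so only the FN/FP trade-off separates them; as long as a false negative is costlier than a false positive, Model A wins.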

The data science team at a leading Questions and Answers website wants to improve the user experience and therefore would like to identify duplicate questions based on similarity of the text found in a given question. As an ML Specialist, which SageMaker algorithm would you recommend to help solve this problem?

Object2Vec

Object2Vec : The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space. You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression.

A medical diagnostics company specializes in cancer detection tests. The data scientists at the company are working on a classification model to detect cancer and they do not want any cases of cancer going undetected. The classification model's predicted value of 1 implies that the patient is predicted to have cancer. Which of the following metrics should the data scientists focus on so that they can achieve the desired objective?

Recall
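Recall = TP / (TP + FN): the share of actual cancer cases the model catches. Maximizing recall minimizes false negatives, i.e. cases of cancer going undetected. A minimal sketch with hypothetical counts:

```python
# Recall (sensitivity) for the cancer-detection use-case: of all patients
# who actually have cancer, what fraction did the model flag?
def recall(tp, fn):
    return tp / (tp + fn)

# Hypothetical counts: 90 cancer cases detected, 10 missed.
print(recall(tp=90, fn=10))  # 0.9 -> 10% of real cases would go undetected
```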

As part of a clinical study, you have processed millions of medical records with hundreds of features and reduced the feature dimensions to just two using a model based on Principal Component Analysis (PCA). The following graph illustrates the distribution from the PCA model output in the form of two distinct classes shown in red and blue dots. Which algorithm can you use to accurately classify the above output from the PCA model?

Support Vector Machine

Principal Component Analysis (PCA) reduces the dimensionality (number of features) within a dataset by projecting data points onto the first few principal components, with the objective of retaining as much information or variation as possible. For the given use-case, the features have been reduced to just two dimensions using a model based on PCA.

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVM can solve linear and non-linear problems and works well for many practical problems. SVM creates a line or a hyperplane that separates the data into classes. SVM is used for text classification tasks such as category assignment, detecting spam, and sentiment analysis. It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification. Since SVM can fit a non-linear decision boundary, it is the right choice for classifying the output from the PCA model.

BlazingText Word2Vec mode -

The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification. The Word2vec algorithm maps words to high-quality distributed vectors. The resulting vector representation of a word is called a word embedding. Words that are semantically similar correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words.

Transfer Learning

Transfer Learning - This is a technique used in image classification algorithms. The image classification algorithm takes an image as input and classifies it into one of the output categories. Image classification in Amazon SageMaker can be run in two modes: full training and transfer learning. In full training mode, the network is initialized with random weights and trained on user data from scratch. In transfer learning mode, the network is initialized with pre-trained weights and just the top fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new data. In this mode, training can be achieved even with a smaller dataset, because the network is already pre-trained and can therefore be used in cases without sufficient training data. Note that Transfer Learning, as a general machine learning technique, is not relevant to the SageMaker-specific use-case described in the question.

You want to secure the API calls made to your published Amazon SageMaker model endpoints from your customer VPC. By default, these API calls traverse the public network to the request router. What measures would you take to address this issue so that the API calls do not use the public internet?

Use Amazon Virtual Private Cloud interface endpoints powered by AWS PrivateLink for private connectivity between the customer's VPC and the request router to access hosted model endpoints

You can connect directly to the SageMaker API or to the SageMaker Runtime through an interface endpoint in your Virtual Private Cloud (VPC) instead of connecting over the internet. When you use a VPC interface endpoint, communication between your VPC and the SageMaker API or Runtime is conducted entirely and securely within the AWS network.

The SageMaker API and Runtime support Amazon Virtual Private Cloud (Amazon VPC) interface endpoints that are powered by AWS PrivateLink. Each VPC endpoint is represented by one or more Elastic Network Interfaces with private IP addresses in your VPC subnets. The VPC interface endpoint connects your VPC directly to the SageMaker API or Runtime without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. The instances in your VPC don't need public IP addresses to communicate with the SageMaker API or Runtime. You can create an interface endpoint to connect to SageMaker or to SageMaker Runtime with either the AWS console or AWS Command Line Interface (AWS CLI) commands.
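As a sketch, an interface endpoint for the SageMaker Runtime can be created with a single AWS CLI call; the VPC, subnet, security-group IDs and region below are placeholders you would replace with your own values.

```shell
# Create an interface VPC endpoint (AWS PrivateLink) for the SageMaker
# Runtime, so InvokeEndpoint calls stay inside the AWS network.
# vpc-0abc123, subnet-0abc123, sg-0abc123 and us-east-1 are placeholders.
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id vpc-0abc123 \
  --subnet-ids subnet-0abc123 \
  --security-group-ids sg-0abc123 \
  --service-name com.amazonaws.us-east-1.sagemaker.runtime \
  --private-dns-enabled
```

Use `com.amazonaws.<region>.sagemaker.api` instead to reach the SageMaker control-plane API privately.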

As an ML Specialist, you observe that one of the features used in a SageMaker Linear Learner model had 30% missing data. You also believe that this specific feature was somehow related to a few other features in the data-set. Which technique would you use to address the missing data?

Use multiple imputations approach via a supervised learning technique that uses other features to figure out the imputed value

For the given use-case, you need to prepare the dataset for a machine learning problem and fix missing values. You can fix missing values by applying machine learning to the dataset itself! In the multiple imputations approach, you generate values for the missing data multiple times. The individual imputed datasets are then pooled together into the final imputed dataset, with the values chosen to replace the missing data being drawn from the combined results in some way.

The multiple imputations approach breaks the imputation process into three steps: imputation (performed multiple times), analysis (staging how the results should be combined), and pooling (integrating the results into the final imputed matrix). There are a variety of multiple imputation algorithms and implementations available; the most popular algorithm is called MICE.
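The impute-several-times-then-pool idea can be sketched in pure Python. This is a deliberately tiny toy: one missing value is predicted from a single correlated feature via least-squares, several noisy draws stand in for the multiple imputations, and averaging stands in for pooling. Real implementations such as MICE iterate this over all features with missing data.

```python
# Toy multiple-imputation sketch: regress feature y on feature x using the
# observed rows, draw several noisy imputations for the missing y, then pool.
import random

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for 1-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # fully observed feature (hypothetical data)
y = [2.1, 3.9, 6.2, None, 9.8]  # correlated feature with one missing value

observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
slope, intercept = fit_line(*zip(*observed))

rng = random.Random(0)
# Imputation step, performed multiple times with noise to reflect uncertainty:
draws = [slope * 4.0 + intercept + rng.gauss(0, 0.1) for _ in range(5)]
pooled = sum(draws) / len(draws)  # pooling step
print(round(pooled, 1))           # close to 8.0 for this toy data
```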

You want to secure the API calls made to your published Amazon SageMaker model endpoints from your customer VPC. By default, these API calls traverse the public network to the request router. What measures would you take to address this issue so that the API calls do not use the public internet?

WRONG: Use SSH for private connectivity between the customer's VPC and the request router to access hosted model endpoints - This option has been added as a distractor as SSH is used for encrypted data communications between two computers connecting over an open network, such as the internet. By itself SSH cannot facilitate API calls that bypass the public internet.

WRONG: Use AWS SSE-KMS for private connectivity between the customer's VPC and the request router to access hosted model endpoints - SSE-KMS is used for the Amazon S3 service. When you create an object in S3, you can specify the use of server-side encryption with AWS Key Management Service (AWS KMS) customer master keys (CMKs) to encrypt your data. This is true when you are either uploading a new object or copying an existing object. This encryption is known as SSE-KMS. This option has been added as a distractor, as SSE-KMS cannot facilitate API calls that bypass the public internet.

XGBoost

XGBoost - The XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the variety of hyperparameters that you can fine-tune. You can use XGBoost for regression, classification (binary and multiclass), and ranking problems.

