AWS Machine Learning Specialist Exam


What does the training job for SageMaker require?

- The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training data.
- The compute resources that you want SageMaker to use for model training. Compute resources are ML compute instances that are managed by SageMaker.
- The URL of the S3 bucket where you want to store the output of the job.
- The Amazon Elastic Container Registry (Amazon ECR) path where the training code is stored.
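As a rough illustration of these required inputs, here is a minimal boto3 sketch; the bucket names, image URI, role ARN, and job name are placeholders, not values from the source:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical names/ARNs for illustration only.
sm.create_training_job(
    TrainingJobName="churn-training-demo",
    AlgorithmSpecification={
        # ECR path where the training code (algorithm image) is stored
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/training-data/",   # S3 location of the training data
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},  # where model artifacts go
    ResourceConfig={   # ML compute instances managed by SageMaker
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```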

NVIDIA container toolkit

Allows users to build and run GPU-accelerated Docker containers. If you plan to use GPU devices, ensure your containers are nvidia-docker compatible.

A Machine Learning Specialist is assisting a large organization in incorporating machine learning into its products. The organization wants to categorize its clients based on their likelihood to churn over the following six months. The firm has identified the data that the Specialist has access to. Which form of machine learning model should the Specialist employ for this task? A. Linear regression B. Classification C. Clustering D. Reinforcement learning

B

A data scientist is developing a bespoke recommendation model in Amazon SageMaker for an online retailer. Customers purchase only 4-5 items every 5-10 years due to the nature of the company's offerings, so the business relies on a continual influx of new customers. When a new client registers, the business gathers information about the customer's preferences. The following is a sample of the data that the data scientist has access to. For this use case, how should the data scientist divide the dataset into a training and test set? A. Shuffle all interaction data. Split off the last 10% of the interaction data for the test set. B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set. C. Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set. D. Randomly select 10% of the users. Split off all interaction data from these users for the test set.

B

SHARDS - Amazon Kinesis Data Streams: the data stream is expected to collect 8 KB of JSON data at up to 1,000 transactions per second. How many shards are needed?

One shard provides a capacity of 1 MB/sec data input and 2 MB/sec data output, and supports up to 1,000 PUT records per second. 8 KB * 1,000 = 8,000 KB/sec, or roughly 8 MB/sec of data input; at 1 MB/sec per shard, 8 shards are required. (Watch the units: KB -> MB -> GB.)
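A small sketch of the shard-sizing arithmetic (the helper function and its name are mine, not part of the Kinesis API):

```python
import math

def required_shards(record_size_kb: float, records_per_sec: int) -> int:
    """Size a Kinesis data stream by ingress limits: 1 MB/s and 1,000 records/s per shard."""
    throughput_mb = record_size_kb * records_per_sec / 1024   # KB/s -> MB/s
    by_throughput = math.ceil(throughput_mb)                   # 1 MB/s per shard
    by_records = math.ceil(records_per_sec / 1000)             # 1,000 PUTs/s per shard
    return max(by_throughput, by_records)

print(required_shards(8, 1000))   # -> 8 shards (8 KB * 1,000 tps is ~8 MB/s)
```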

Which Amazon Services can create streaming ETL jobs?

In AWS Glue you can create streaming extract, transform, and load (ETL) jobs that run continuously and consume data from streaming sources such as Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). The jobs cleanse and transform the data, and then load the results into Amazon S3 data lakes or JDBC data stores.

Amazon SageMaker DeepAR

The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNNs). It can provide better forecast accuracy than classical forecasting techniques such as Autoregressive Integrated Moving Average (ARIMA) or Exponential Smoothing (ES), both of which are implemented in many open-source and commercial forecasting packages.
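For reference, DeepAR expects training data as JSON Lines, one time series per line, with a start timestamp and a target array (cat and dynamic_feat are optional). A minimal sketch of writing such a file; the series values are made up:

```python
import json

series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.2, 6.1, 8.4], "cat": [0]},
    {"start": "2024-01-01 00:00:00", "target": [1.1, 0.9, 1.4, 1.7], "cat": [1]},
]

with open("train.json", "w") as f:
    for ts in series:
        f.write(json.dumps(ts) + "\n")   # one JSON object per line (JSON Lines)
```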

What format do Amazon SageMaker algorithms work fastest with?

protobuf recordIO. Some SageMaker algorithms also support Parquet, but not all.

Business-specific terms/acronyms are not being pronounced correctly in Amazon Polly

Use pronunciation lexicons: a lexicon is uploaded once and applied to every synthesis request that references it, so business-specific terms and acronyms are pronounced correctly without editing each piece of input text (SSML tags such as <sub> or <phoneme> would have to be embedded in every request).
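A minimal boto3 sketch of uploading a pronunciation lexicon and referencing it at synthesis time; the lexicon name, example acronym, and voice are illustrative assumptions:

```python
import boto3

polly = boto3.client("polly")

# PLS lexicon that expands a business-specific acronym (illustrative content).
lexicon = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme><grapheme>W3C</grapheme><alias>World Wide Web Consortium</alias></lexeme>
</lexicon>"""

polly.put_lexicon(Name="acronyms", Content=lexicon)

resp = polly.synthesize_speech(
    Text="The W3C sets web standards.",
    VoiceId="Joanna",
    OutputFormat="mp3",
    LexiconNames=["acronyms"],   # lexicon applies without editing the input text
)
```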

SageMaker input file types

JPG/PNG (application/x-image), JSON Lines, CSV, recordIO-protobuf, libsvm

A large mobile network operator is developing a machine learning model to forecast which customers are likely to cancel their service subscription. The company intends to offer an incentive to retain these clients, since the cost of churn is far greater than the cost of the incentive. After testing on a test dataset of 100 customers, the model generates the following confusion matrix (n = 100):
Predicted churn: Yes / No
Actual Yes: 10 / 4
Actual No: 10 / 76
Why is this a feasible model for production, based on the model assessment results? A. The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives. B. The precision of the model is 86%, which is less than the accuracy of the model. C. The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives. D. The precision of the model is 86%, which is greater than the accuracy of the model.

C. Accuracy = (10 + 76) / 100 = 86%, while precision is only 10 / (10 + 10) = 50%, and the cost of a false positive (an unneeded incentive) is less than the cost of a false negative (a lost customer).

A Machine Learning Specialist is employed by a multinational cybersecurity firm that handles real-time security events for businesses worldwide. The cybersecurity firm wants to develop a system that uses machine learning to classify dangerous events as anomalies in the data as it is ingested. Additionally, the corporation wishes to save the findings in its data lake for subsequent processing and analysis. Which method is the MOST EFFECTIVE for completing these tasks? A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3. B. Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to perform anomaly detection. Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake. C. Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3. D. Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform the new data. Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data.

A

A Machine Learning Specialist must develop a supervised image recognition model to identify cats. The Specialist runs many experiments and collects the following findings for an image classifier powered by a neural network: there are a total of 1,000 photos available, 100 of which form a constant test set. The Specialist observes that in over 75% of the misclassified photographs, the cats were being held upside down by their owners. Which strategy can the Machine Learning Specialist apply to improve this particular test error? A. Increase the training data by adding variation in rotation for training images. B. Increase the number of epochs for model training C. Increase the number of layers for the neural network. D. Increase the dropout rate for the second-to-last layer.

A

A Machine Learning Specialist is provided with a structured dataset describing the buying behaviors of a company's customers. Each customer is represented by thousands of data columns, hundreds of which are numerical. The Specialist's objective is to determine whether these columns naturally group together across all customers and to visualize the findings as quickly as feasible. How should the Specialist tackle these tasks? A. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a scatter plot. B. Run k-means using the Euclidean distance measure for different values of k and create an elbow plot. C. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a line graph. D. Run k-means using the Euclidean distance measure for different values of k and create box plots for each numerical column within each cluster.

A

A Machine Learning Specialist is trying to construct a linear regression model. Given only the residual plot below, what is the MOST LIKELY cause of the model's failure? [IMAGE WITH FANNING RESIDUALS] A. Linear regression is inappropriate. The residuals do not have constant variance. B. Linear regression is inappropriate. The underlying data has outliers. C. Linear regression is appropriate. The residuals have a zero mean. D. Linear regression is appropriate. The residuals have constant variance.

A

On a company's social media page, an employee saw a video clip with audio. The video is in Spanish. The employee's primary language is English, and they do not understand Spanish. The employee requests that a sentiment analysis be performed. Which service combination is the MOST EFFECTIVE for completing the task? A. Amazon Transcribe, Amazon Translate, and Amazon Comprehend B. Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq C. Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM) D. Amazon Transcribe, Amazon Translate and Amazon SageMaker BlazingText

A

A retailer aims to classify new items using machine learning. The Data Science team was given a labeled dataset of current goods containing 1,200 products. Each product in the labeled dataset includes 15 attributes, such as its title, dimensions, weight, and price, and each item is tagged with a category, such as books, games, gadgets, or movies. Which model should be used to classify new items using the training data provided? A. An XGBoost model where the objective parameter is set to multi:softmax B. A deep convolutional neural network (CNN) with a softmax activation function for the last layer C. A regression forest where the number of trees is set equal to the number of product categories D. A DeepAR forecasting model based on a recurrent neural network (RNN)

A. XGBoost with the multi:softmax objective handles multi-class classification on structured, tabular data like this. A convolutional neural network (ConvNet or CNN) is a special type of neural network used effectively for image recognition and classification, and recurrent neural networks (RNNs) are powerful for modeling sequence data such as time series or natural language, so neither fits this tabular problem.

A technology business recommends items to current customers based on their habits and interactions using complicated deep neural networks and GPU processing. Currently, the solution retrieves each dataset from an Amazon S3 bucket and loads it into a TensorFlow model obtained from the company's Git repository. The job then runs for many hours, continuously writing to the same S3 bucket. The job, which is executed from a central queue, may be interrupted, resumed, and continued at any moment in the case of a failure. Senior management is worried about the solution's resource-management complexity and the costs associated with repeating the procedure on a regular basis. They want the job automated so that it runs once a week, beginning Monday and concluding by Friday's close of business. Which architecture should be employed to efficiently scale the solution? A. Implement the solution using AWS Deep Learning Containers and run the container as a job using AWS Batch on a GPU-compatible Spot Instance B. Implement the solution using a low-cost GPU-compatible Amazon EC2 instance and use the AWS Instance Scheduler to schedule the task C. Implement the solution using AWS Deep Learning Containers, run the workload using AWS Fargate running on Spot Instances, and then schedule the task using the built-in task scheduler D. Implement the solution using Amazon ECS running on Spot Instances and schedule the task using the ECS service scheduler

A. D is not correct: ECS is responsible for managing the lifecycle and placement of tasks, but it only provides the control plane; ECS itself does not run or execute your container. A is correct: "You can set up compute environments that use a particular type of EC2 instance, a particular model such as c5.2xlarge or m5.10xlarge, or simply specify that you want to use the newest instance types. You can also specify the minimum, desired, and maximum number of vCPUs for the environment, along with the amount you are willing to pay for a Spot Instance as a percentage of the On-Demand Instance price and a target set of VPC subnets. AWS Batch will efficiently launch, manage, and terminate compute types as needed. You can also manage your own compute environments. In this case you are responsible for setting up and scaling the instances in an Amazon ECS cluster that AWS Batch creates for you." https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html

A Data Scientist wishes to acquire real-time insight into a GZIP file data stream. Which option would allow for the LEAST amount of lag while using SQL to query the stream? A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data. B. AWS Glue with a custom ETL script to transform the data. C. An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster. D. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.

A. Kinesis Data Analytics can use a Lambda function to decompress the GZIP records and can then run SQL on the converted data. https://aws.amazon.com/about-aws/whats-new/2017/10/amazon-kinesis-analytics-can-now-pre-process-data-prior-to-running-sql-queries/
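A hedged sketch of the preprocessing Lambda that Kinesis Data Analytics can invoke before running SQL: it receives base64-encoded records, gunzips them, and returns the decoded payloads. The record contract (recordId/result/data) is my understanding of the Kinesis Analytics preprocessing format:

```python
import base64
import gzip

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])    # GZIP bytes from the stream
        decompressed = gzip.decompress(payload)       # now plain JSON/CSV bytes
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                           # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(decompressed).decode("utf-8"),
        })
    return {"records": output}
```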

A producer of automobile engines gathers data from vehicles as they are driven. The time stamp, engine temperature, rotations per minute (RPM), and other sensor measurements are all captured. The business hopes to forecast when an engine may fail, so it can alert drivers in advance to schedule maintenance. For training purposes, the engine data is loaded into a data lake. Which predictive model is the MOST SUITABLE for production deployment? A. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a recurrent neural network (RNN) to train the model to recognize when an engine might need maintenance for a certain fault. B. This data requires an unsupervised learning algorithm. Use Amazon SageMaker k-means to cluster the data. C. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a convolutional neural network (CNN) to train the model to recognize when an engine might need maintenance for a certain fault. D. This data is already formulated as a time series. Use Amazon SageMaker seq2seq to model the time series.

A. RNNs are suited to time-series/sequence data; CNNs are suited to images.

A Machine Learning Specialist is responsible for developing a procedure for querying a dataset stored on Amazon S3 using Amazon Athena. Over 800,000 records are included in the dataset, which is kept in unencrypted CSV files. Each record is around 1.5 MB in size and comprises 200 columns. The majority of queries will reference no more than five to ten columns. How should the Machine Learning Specialist transform the dataset in order to shorten query runtime? A. Convert the records to Apache Parquet format. B. Convert the records to JSON format. C. Convert the records to GZIP CSV format. D. Convert the records to XML format.

A. Parquet is a columnar, compressed format, so Athena scans only the five to ten columns a query references; this reduces the amount of data scanned (and therefore cost) as well as S3 storage. Supported compression formats: GZIP, LZO, SNAPPY (Parquet) and ZLIB. Reference: https://www.cloudforecast.io/blog/using-parquet-on-athena-to-save-money-on-aws/
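A minimal pandas/pyarrow sketch of the conversion (the file names are placeholders); at scale an AWS Glue job or an Athena CTAS statement would do the same thing:

```python
import pandas as pd

# Read the wide CSV once, then write a columnar, compressed Parquet copy.
df = pd.read_csv("records.csv")
df.to_parquet("records.snappy.parquet", engine="pyarrow", compression="snappy")

# Athena then scans only the 5-10 columns a query actually references.
```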

Amazon Personalize is being used by a retail firm to deliver individualized product suggestions to consumers during a marketing campaign. The firm quickly notices a big rise in sales of suggested goods to current clients after the deployment of a new solution version, but these sales decline shortly thereafter. For training purposes, only historical data from before the marketing campaign is available. What adjustments should a data scientist make to the solution? A. Use the event tracker in Amazon Personalize to include real-time user interactions. B. Add user metadata and use the HRNN-Metadata recipe in Amazon Personalize. C. Implement a new solution using the built-in factorization machines (FM) algorithm in Amazon SageMaker. D. Add event type and event value fields to the interactions dataset in Amazon Personalize.

A. Reference: https://docs.aws.amazon.com/personalize/latest/dg/recording-events.html

A data science team is developing a dataset repository to house a significant volume of training data that is often utilized in machine learning models. Given that data scientists may create an unlimited number of new datasets each day, the solution must be scalable and cost-effective. Additionally, SQL exploration of the data must be possible. Which storage method is the MOST SUITABLE for this scenario? A. Store datasets as files in Amazon S3. B. Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance. C. Store datasets as tables in a multi-node Amazon Redshift cluster. D. Store datasets as global tables in Amazon DynamoDB.

A. S3 is the most scalable and cost-efficient option, and Athena can explore the data in place with SQL.

A Machine Learning Specialist previously trained a logistic regression model on a local machine using scikit-learn and now wishes to deploy it to production for the sole purpose of inference. What actions should be taken to guarantee that a model trained locally can be hosted on Amazon SageMaker? A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR. B. Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the registry hostname and upload it to Amazon S3. C. Serialize the trained model so the format is compressed for deployment. Build the image and upload it to Docker Hub. D. Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon ECR.

A. https://sagemaker-workshop.com/custom/containers.html
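A hedged boto3 sketch of registering the locally trained model once the inference image is in Amazon ECR and the serialized model artifact is in S3; the URIs, names, and role ARN are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_model(
    ModelName="sklearn-logreg",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    PrimaryContainer={
        # Inference image tagged with the ECR registry hostname and pushed to ECR
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sklearn-inference:latest",
        # Serialized (e.g., pickled) model packaged as model.tar.gz in S3
        "ModelDataUrl": "s3://my-bucket/models/model.tar.gz",
    },
)
```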

Which common parameters MUST be given when submitting Amazon SageMaker training tasks that use one of the built-in algorithms? (Select three.) A. The training channel identifying the location of training data on an Amazon S3 bucket. B. The validation channel identifying the location of validation data on an Amazon S3 bucket. C. The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users. D. Hyperparameters in a JSON array as documented for the algorithm used. E. The Amazon EC2 instance class specifying whether training will be run using CPU or GPU. F. The output path specifying where on an Amazon S3 bucket the trained model will persist.

A, E, F

Using a dataset of 100 continuous numerical features, a Data Scientist is developing a model to predict customer attrition. The Marketing department has offered no guidance on which features are significant for churn prediction, but it wants to interpret the model and determine the direct effect of important features on the model's output. While training a logistic regression model, the Data Scientist notices a significant difference between the accuracy on the training set and on the validation set. Which techniques may the Data Scientist use to enhance the model's performance and meet the Marketing team's requirements? (Select two.) A. Add L1 regularization to the classifier B. Add features to the dataset C. Perform recursive feature elimination D. Perform t-distributed stochastic neighbor embedding (t-SNE) E. Perform linear discriminant analysis

A, C. Key words: (1) 100 continuous numerical features - too many features; (2) no feature selection has been done; (3) easy interpretation - a direct relationship between X and Y is preferred; (4) gap between training and validation accuracy - overfitting. A: Correct. L1 regularization acts as feature selection/dimensionality reduction, addresses overfitting, and keeps the model easy to interpret (direct relationships between X and Y). B: Wrong. More features would make overfitting worse. C: Correct. Recursive feature elimination is feature selection/dimensionality reduction, addresses overfitting, and keeps interpretation easy. D: Wrong. t-SNE is a dimensionality reduction technique that shows up frequently in questions, but like PCA it is less interpretable; you cannot see the direct impact of the original features on the model outcome. E: Wrong. Linear discriminant analysis is the preferred linear classification technique when you have more than two classes.
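A small scikit-learn sketch of the two selected techniques, L1-regularized logistic regression and recursive feature elimination; the data here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=100, n_informative=10, random_state=0)

# A: the L1 penalty drives irrelevant coefficients to exactly zero (implicit feature selection).
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients:", int(np.sum(l1_model.coef_ != 0)))

# C: recursive feature elimination keeps only the strongest features for an interpretable model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
print("selected feature indices:", np.where(rfe.support_)[0])
```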

A Machine Learning Specialist is enabling Amazon SageMaker to provide simultaneous access to notebooks, model training, and endpoint deployment by numerous Data Scientists. To guarantee optimal operational performance, the Specialist must be able to monitor how often the Scientists deploy models, the GPU and CPU utilization of deployed SageMaker endpoints, and any errors that occur when an endpoint is invoked. Which services are integrated with Amazon SageMaker for the purpose of tracking this data? (Select two.) A. AWS CloudTrail B. AWS Health C. AWS Trusted Advisor D. Amazon CloudWatch E. AWS Config

A, D. CloudTrail tracks model deployments and other API calls; CloudWatch monitors GPU and CPU utilization and endpoint invocation errors.

A financial services firm wants to make Amazon SageMaker its primary data science environment. The company's data scientists run machine learning (ML) models on sensitive financial data. The organization is concerned about data egress and wants a machine learning engineer to secure the environment. Which methods can the machine learning engineer use to control data egress from SageMaker? (Select three.) A. Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink. B. Use SCPs to restrict access to SageMaker. C. Disable root access on the SageMaker notebook instances. D. Enable network isolation for training jobs and models. E. Restrict notebook presigned URLs to specific IPs used by the company. F. Protect data with encryption at rest and in transit. Use AWS Key Management Service (AWS KMS) to manage encryption keys.

A, D, E. See the section on controlling data egress: https://aws.amazon.com/blogs/machine-learning/millennium-management-secure-machine-learning-using-amazon-sagemaker/

What evaluation metric should be used for a binary classifier?

AUC (area under the ROC curve)

AWS Batch / Spot Instances / Managed Spot Training / Checkpointing

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. There is no need to install and manage batch computing software or server clusters; jobs submitted to AWS Batch are queued and executed based on the assigned order of preference, and AWS Batch dynamically provisions the optimal quantity and type of compute resources based on the requirements of the submitted jobs. It also offers an automated retry mechanism so a job can continue to run even in the event of a failure (e.g., instance termination when Amazon EC2 reclaims Spot Instances, or an internal AWS service error/outage).

Spot Instances on task nodes: when you launch one or more task instance groups as Spot Instances, Amazon EMR provisions as many task nodes as it can at or below your maximum Spot price. If you request a task instance group with six nodes and only five Spot Instances are available at or below your maximum price, Amazon EMR launches the group with five nodes, adding the sixth later if possible.

Managed Spot Training uses Amazon EC2 Spot Instances to run training jobs instead of On-Demand Instances. You specify which training jobs use Spot Instances and a stopping condition that specifies how long SageMaker waits for a job to run on Spot capacity. Managed Spot Training can reduce the cost of training models by up to 90% over On-Demand Instances, and SageMaker manages Spot interruptions on your behalf.

Checkpointing: to avoid restarting a training job from scratch should it be interrupted, it is strongly recommended that you implement checkpointing, a technique that saves the model at periodic intervals during training. You can then resume a training job from a well-defined point in time, continuing from the most recent partially trained model.
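A hedged SageMaker Python SDK sketch combining Managed Spot Training with checkpointing; the image URI, role, instance type, and S3 paths are placeholders:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,                           # train on EC2 Spot capacity
    max_run=3600,                                      # max training time in seconds
    max_wait=7200,                                     # how long to wait for Spot capacity (>= max_run)
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",   # resume here after an interruption
    output_path="s3://my-bucket/output/",
)
estimator.fit({"train": "s3://my-bucket/training-data/"})
```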

Most scalable way to increase the availability of the application's underlying machine learning component (via endpoint)

Add SageMaker instances of the same size and use the existing endpoint to host them. Separately, Amazon SageMaker multi-model endpoints give businesses a scalable yet cost-effective way to deploy multiple ML models: it is cheaper to host multiple models on one endpoint. https://tutorialsdojo.com/amazon-sagemaker/

Semantic Segmentation vs Object Detection

The semantic segmentation algorithm provides a fine-grained, pixel-level approach to computer vision applications: it tags every pixel with a class (e.g., airplane or not, person or not). Object detection provides a bounding box, not the pixel-level segmentation needed to identify more precise boundaries.

ACRONYMS Amazon EMR Amazon EKS Amazon ECR Amazon ECS Amazon EC2 Amazon S3 Amazon RDS VPC NAT Data egress

Amazon EMR - Amazon Elastic MapReduce; Amazon EKS - Amazon Elastic Kubernetes Service; Amazon ECR - Amazon Elastic Container Registry; Amazon ECS - Amazon Elastic Container Service; Amazon EC2 - Amazon Elastic Compute Cloud; Amazon S3 - Amazon Simple Storage Service; Amazon RDS - Amazon Relational Database Service; VPC - Virtual Private Cloud; NAT - Network Address Translation; data egress - when data leaves a network and goes to an external location.

SageMaker Batch Transform vs SageMaker Inference Pipeline

An Amazon SageMaker inference pipeline chains different stages of your inference (for example, preprocessing, prediction, and post-processing) and serves real-time predictions. To get inferences for an entire dataset, use batch transform: you create a batch transform job using a trained model and a dataset stored in Amazon S3, and SageMaker saves the inferences in an S3 bucket that you specify when you create the job. Batch transform can also exclude attributes before running predictions and join the prediction results with partial or entire input data attributes when the data is in CSV, text, or JSON format, which eliminates additional pre-processing or post-processing and accelerates the overall ML process.
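A hedged SageMaker Python SDK sketch of a batch transform job that filters input attributes and joins predictions back onto the input; the model object and S3 paths are assumed to exist already:

```python
# `model` is an already-created sagemaker.model.Model; bucket paths are placeholders.
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",
    accept="text/csv",
    assemble_with="Line",
)

transformer.transform(
    data="s3://my-bucket/batch-input/records.csv",
    content_type="text/csv",
    split_type="Line",
    input_filter="$[1:]",      # drop the first column (e.g., an ID) before inference
    join_source="Input",       # join predictions with the input records
    output_filter="$[0,-1]",   # keep the ID and the prediction in the output
)
transformer.wait()
```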

A machine learning specialist created a deep learning model for image classification. However, the Specialist encountered an overfitting issue, with training and testing accuracies of 99 percent and 75 percent, respectively. How should the Specialist address this situation, and what is the underlying cause? A. The learning rate should be increased because the optimization process was trapped at a local minimum. B. The dropout rate at the flatten layer should be increased because the model is not generalized enough. C. The dimensionality of dense layer next to the flatten layer should be increased because the model is not complex enough. D. The epoch number should be increased because the optimization process was terminated before it reached the global minimum.

B

A machine learning specialist is assisting a media organization in classifying popular articles from the organization's website. The business uses random forests to predict an article's popularity before it is published. Below is a sample of the data that was used. The Specialist wants to convert the Day_Of_Week column in the dataset to binary values. Which approach should be used to convert the values in this column to binary values? A. Binarization B. One-hot encoding C. Tokenization D. Normalization transformation

B

A real estate firm wishes to develop a machine learning model capable of forecasting home prices using a historical dataset that contains 32 features. Which model is most appropriate for the business requirement? A. Logistic regression B. Linear regression C. K-means D. Principal component analysis (PCA)

B

The Machine Learning Specialist at a corporation wants to increase the training speed of a TensorFlow-based time series forecasting model. Currently, training uses a single GPU, takes roughly 23 hours to complete, and must be run daily. Although the model's accuracy is satisfactory, the business believes that the volume of training data will continue to grow and that the model will need to be retrained hourly rather than daily. Additionally, the organization wishes to minimize coding effort and infrastructure changes. What modifications should the Machine Learning Specialist make to the training solution in order for it to scale in the future? A. Do not change the TensorFlow code. Change the machine to one with a more powerful GPU to speed up the training. B. Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. Parallelize the training to as many machines as needed to achieve the business goals. C. Switch to using a built-in AWS SageMaker DeepAR model. Parallelize the training to as many machines as needed to achieve the business goals. D. Move the training to Amazon EMR and distribute the workload to as many machines as needed to achieve the business goals.

B

Using Amazon Athena and Amazon S3, a mobile network operator is developing an analytics platform for analyzing and optimizing a business's operations. The source systems transmit data in real time in .CSV format. Before storing the data on Amazon S3, the Data Engineering team wants to convert it to the Apache Parquet format. Which approach requires the MINIMUM amount of work to implement? A. Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet B. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Glue to convert data into Parquet. C. Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert data into Parquet. D. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert data into Parquet.

B

Although a Machine Learning Specialist developed a regression model, the first iteration requires optimization. The Specialist must determine whether the model over- or underestimates the target more often. Which option can the Specialist use to determine whether the target is being over- or underestimated? A. Root Mean Square Error (RMSE) B. Residual plots C. Area under the curve D. Confusion matrix

B. AUC and confusion matrices are used for classification problems, not regression, and RMSE does not tell us whether the target is being over- or underestimated because the residuals are squared. So we actually have to look at the residuals themselves, which means B (residual plots).

A machine learning professional is running an Amazon SageMaker endpoint on a P3 instance and using the built-in object detection algorithm to make real-time predictions in a production application. When the professional examines the model's resource consumption, they see that the model is only using a fraction of the GPU. Which architectural change would maximize the use of provisioned resources? A. Redeploy the model as a batch transform job on an M5 instance. B. Redeploy the model on an M5 instance. Attach Amazon Elastic Inference to the instance. C. Redeploy the model on a P3dn instance. D. Deploy the model onto an Amazon Elastic Container Service (Amazon ECS) cluster using a P3 instance.

B Amazon Elastic Inference (EI) is a resource you can attach to your Amazon EC2 CPU instances to accelerate your deep learning (DL) inference workloads. Amazon EI accelerators come in multiple sizes and are a cost-effective method to build intelligent capabilities into applications running on Amazon EC2 instances.

Every minute, a monitoring service creates 1 TB of scale-metrics record data. A research team uses Amazon Athena to execute queries on this data. Due to the high volume of data, the queries run slowly, and the team requires improved performance. How should the records be stored in Amazon S3 to optimize query performance? A. CSV files B. Parquet files C. Compressed JSON D. RecordIO

B. Athena works best with columnar formats such as Parquet.

A data scientist is developing a sentiment analysis application. The validation accuracy is low, and the Data Scientist believes this is due to the dataset's large vocabulary and the low average frequency of terms. Which tool should be used to increase the validation accuracy? A. Amazon Comprehend syntax analysis and entity detection B. Amazon SageMaker BlazingText cbow mode C. Natural Language Toolkit (NLTK) stemming and stop word removal D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer

B. BlazingText implements Word2Vec (cbow and skip-gram modes), which produces dense word embeddings that handle a large vocabulary with rare terms better than sparse count-based features.

A Machine Learning Specialist works for an online shop that wants to run analytics on each client visit using a machine learning pipeline. The data must be ingested at a rate of up to 100 transactions per second using Amazon Kinesis Data Streams, and each JSON data blob is 100 KB in size. What is the MINIMUM number of shards that the Specialist should use in Kinesis Data Streams to effectively ingest this data? A. 1 shard B. 10 shards C. 100 shards D. 1,000 shards

B. One shard can ingest 1 MB/sec or 1,000 records/sec. 100 KB * 100 transactions/sec = 10,000 KB/sec = 10 MB/sec; 10 MB/sec divided by the 1 MB/sec per-shard limit = 10 shards.

A data scientist must discover fake user accounts on an ecommerce platform for a business. The organization wants to establish whether a freshly created account is connected to a previously identified fraudulent user. The data scientist is using AWS Glue to cleanse the company's application logs during ingestion. Which technique will enable the data scientist to detect bogus accounts? A. Execute the built-in FindDuplicates Amazon Athena query. B. Create a FindMatches machine learning transform in AWS Glue. C. Create an AWS Glue crawler to infer duplicate accounts in the source data. D. Search for duplicate accounts in the AWS Glue Data Catalog.

B. Reference: https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html

Using Amazon SageMaker, a Machine Learning Specialist is developing a model for time series forecasting. The Specialist has completed the model's training and now intends to load test the endpoint in order to configure Auto Scaling for the model variant. Which technique enables the Specialist to analyze the load test's latency, memory utilization, and CPU utilization? A. Review SageMaker logs that have been written to Amazon S3 by leveraging Amazon Athena and Amazon QuickSight to visualize logs as they are being produced. B. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker. C. Build custom Amazon CloudWatch Logs and then leverage Amazon ES and Kibana to query and visualize the log data as it is generated by Amazon SageMaker. D. Send Amazon CloudWatch Logs that were generated by Amazon SageMaker to Amazon ES and use Kibana to query and visualize the log data.

B. Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html

A business wants to categorize user behavior as fraudulent or normal. A machine learning expert will develop a binary classifier based on two features: the account's age, denoted x, and the month of the transaction, denoted y. The class distributions are shown in the accompanying image [SCATTER PLOT WITH A SQUARE CLUSTER OF RED DOTS]. Positive classes are shown in red, negative classes in black. Which model would be the most accurate? A. Linear support vector machine (SVM) B. Decision tree C. Support vector machine (SVM) with a radial basis function kernel D. Single perceptron with a Tanh activation function

B. An SVM with a radial basis function kernel produces a circular boundary, whereas a decision tree produces axis-aligned square/rectangular boundaries, which matches the square cluster of positive points.

A Data Scientist is developing a linear regression model and evaluating the statistical significance of each coefficient using the derived p-values. The Data Scientist observes that most of the features in the dataset are normally distributed. The image depicts the plot of a single feature from the dataset. Which transformation should the Data Scientist apply to ensure that the statistical assumptions of the linear regression model are met? A. Exponential transformation B. Logarithmic transformation C. Polynomial transformation D. Sinusoidal transformation

B. The log transformation will reduce this feature's skewness.
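A one-line sketch of the fix with numpy (log1p handles zero values; the feature array here is made up):

```python
import numpy as np

income = np.array([20_000, 35_000, 48_000, 62_000, 250_000, 1_200_000], dtype=float)

log_income = np.log1p(income)   # log(1 + x) compresses the long right tail
print(log_income.round(2))
```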

A manufacturing business uses an Amazon S3 bucket to store structured and unstructured data. A Machine Learning Specialist wants to query this data using SQL. Which option requires the LEAST amount of work in order to query this data? A. Use AWS Data Pipeline to transform the data and Amazon RDS to run queries. B. Use AWS Glue to catalogue the data and Amazon Athena to run queries. C. Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries. D. Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

B. Using AWS Glue to catalogue the data and Amazon Athena to run queries against data on S3 is a very typical use case for those services. D is not ideal: Lambda can certainly do many things, but it requires development and testing effort, and Amazon Kinesis Data Analytics is not suited to ad-hoc queries.

A Machine Learning Specialist is packaging a bespoke ResNet model into a Docker container in order to train the model using Amazon SageMaker. The Specialist is training the model on Amazon EC2 P3 instances and wants to configure the Docker container so it makes effective use of the NVIDIA GPUs. What should the Specialist do? A. Bundle the NVIDIA drivers with the Docker image. B. Build the Docker container to be NVIDIA-Docker compatible. C. Organize the Docker container's file structure to execute on GPU instances. D. Set the GPU flag in the Amazon SageMaker CreateTrainingJob request body.

B https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf page 55: If you plan to use GPU devices, make sure that your containers are nvidia-docker compatible. Only the CUDA toolkit should be included on containers. Don't bundle NVIDIA drivers with the image. For more information about nvidia-docker, see NVIDIA/nvidia-docker.

A machine learning specialist is designing a data storage solution for Amazon SageMaker. There is already a TensorFlow-based model implemented as a train.py script that uses static training data saved as TFRecords. Which approach to supplying training data to Amazon SageMaker would satisfy the business needs with the LEAST amount of development time? A. Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training invocation to the local path of the data without reformatting the training data. B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data. C. Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf data instead of TFRecords. D. Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to reformat and store the data in an Amazon S3 bucket.

B https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-horovod-inference-pipeline/train.py
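A hedged sketch of option B with the SageMaker Python SDK: script mode runs train.py unchanged against TFRecord data in S3. The framework/Python versions, role, and paths are assumptions:

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",            # existing script, unchanged
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
)

# The TFRecord files are simply uploaded to S3 and passed as a channel.
estimator.fit({"training": "s3://my-bucket/tfrecords/"})
```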

CloudWatch alarm vs CloudWatch event

A CloudWatch Events rule can trigger a Lambda function; a CloudWatch alarm lets you watch CloudWatch metrics and receive notifications when they fall below or rise above a certain threshold.

A firm operates a vast number of factories and maintains a complicated supply chain in which an unexpected breakdown of a machine might result in the suspension of operations at multiple plants. A data scientist wishes to examine factory sensor data in order to detect equipment in need of preventative maintenance and then dispatch a repair crew to avoid unscheduled downtime. A single machine's sensor data may include up to 200 data points, including temperatures, voltages, vibrations, RPMs, and pressure measurements. The firm installed Wi-Fi and LANs across the plants to capture this sensor data. Even though many industrial sites lack stable or high-speed internet access, the manufacturer wants to retain near-real-time inference capabilities. Which model deployment architecture will satisfy these business requirements? A. Deploy the model in Amazon SageMaker. Run sensor data through this model to predict which machines need maintenance. B. Deploy the model on AWS IoT Greengrass in each factory. Run sensor data through this model to infer which machines need maintenance. C. Deploy the model to an Amazon SageMaker batch transformation job. Generate inferences in a daily batch report to identify machines that need maintenance. D. Deploy the model in Amazon SageMaker and use an IoT rule to write data to an Amazon DynamoDB table. Consume a DynamoDB stream from the table with an AWS Lambda function to invoke the endpoint.

B. For latency-sensitive use cases, and for use cases that require analyzing large amounts of streaming data, it may not be possible to run ML inference in the cloud; besides, cloud connectivity may not be available all the time. For these use cases, you need to deploy the ML model close to the data source (SageMaker Neo + AWS IoT Greengrass). To design and push a model to the edge: (1) build the model, for example in TensorFlow; (2) compile it for the edge device using SageMaker Neo, e.g., for an NVIDIA Jetson; (3) run it on the edge using IoT Greengrass.

A machine learning expert at a fruit processing firm is tasked with developing a system for categorizing apples into three categories. The expert compiled a collection of 150 photos for each type of apple and used transfer learning on a neural network pretrained on ImageNet with this dataset. The firm requires the model to be at least 85 percent accurate before it will use it. After a thorough grid search, the best hyperparameters produced: accuracy of 68 percent on the training set, and accuracy of 67 percent on the validation set. What can the machine learning professional do to increase the accuracy of the system? A. Upload the model to an Amazon SageMaker notebook instance and use the Amazon SageMaker HPO feature to optimize the model's hyperparameters. B. Add more data to the training set and retrain the model using transfer learning to reduce the bias. C. Use a neural network model with more layers that are pretrained on ImageNet and apply transfer learning to increase the variance. D. Train a new model using the current neural network architecture.

B or C?

A firm that encourages good sleep habits via cloud-connected devices currently uses AWS to host a sleep monitoring application. The application gathers device usage information from its users. The company's Data Science team is developing a machine learning model to forecast if and when a user may stop using the company's devices. The model's predictions are used by a downstream application to identify the most effective way of engaging customers. The Data Science team is developing multiple versions of the machine learning model and comparing each against the company's business goals. To measure long-term performance, the team wants to run multiple versions in parallel for extended periods of time, with the ability to adjust the percentage of inferences served by each model. Which method achieves these criteria with the LEAST amount of effort? A. Build and host multiple models in Amazon SageMaker. Create multiple Amazon SageMaker endpoints, one for each model. Programmatically control invoking different models for inference at the application layer. B. Build and host multiple models in Amazon SageMaker. Create an Amazon SageMaker endpoint configuration with multiple production variants. Programmatically control the portion of the inferences served by the multiple models by updating the endpoint configuration. C. Build and host multiple models in Amazon SageMaker Neo to take into account different types of medical devices. Programmatically control which model is invoked for inference based on the medical device type. D. Build and host multiple models in Amazon SageMaker. Create a single endpoint that accesses multiple models. Use Amazon SageMaker batch transform to control invoking the different models through the single endpoint.

B. A/B testing with Amazon SageMaker comes up in the exam. In A/B testing, you test different variants of your models and compare how each variant performs. Amazon SageMaker enables you to test multiple models or model versions behind the same endpoint using production variants. Each production variant identifies a machine learning (ML) model and the resources deployed for hosting the model. To test multiple models by distributing traffic between them, specify the percentage of the traffic that gets routed to each model by specifying the weight for each production variant in the endpoint configuration.
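A hedged boto3 sketch of option B: two production variants behind one endpoint, with the traffic split adjusted later without redeploying. The model names, endpoint names, and instance types are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="sleep-ab-config",
    ProductionVariants=[
        {"VariantName": "model-a", "ModelName": "sleep-model-a",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.9},
        {"VariantName": "model-b", "ModelName": "sleep-model-b",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.1},
    ],
)
sm.create_endpoint(EndpointName="sleep-endpoint", EndpointConfigName="sleep-ab-config")

# Later, shift traffic between variants without touching the application layer.
sm.update_endpoint_weights_and_capacities(
    EndpointName="sleep-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "model-a", "DesiredWeight": 0.5},
        {"VariantName": "model-b", "DesiredWeight": 0.5},
    ],
)
```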

A media corporation with a large collection of unlabeled photographs, text, audio, and video footage seeks to index its assets in order to enable the Research team to quickly identify relevant information. The firm wishes to use machine learning in order to expedite the work of its in-house researchers, who have minimal experience with machine learning. Which approach is the FASTEST for indexing the assets? A. Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct categories/classes. B. Create a set of Amazon Mechanical Turk Human Intelligence Tasks to label all footage. C. Use Amazon Transcribe to convert speech to text. Use the Amazon SageMaker Neural Topic Model (NTM) and Object Detection algorithms to tag data into distinct categories/classes. D. Use the AWS Deep Learning AMI and Amazon EC2 GPU instances to create custom models for audio transcription and topic modeling, and use object detection to tag data into distinct categories/classes.

A. The managed AI services (Amazon Rekognition, Amazon Comprehend, Amazon Transcribe) require no machine learning expertise and are the fastest way to tag the assets into categories.

A Machine Learning Specialist is responsible for preparing data for training by moving and transforming it. Certain data must be handled in near-real time, while other data may be transferred on an hourly basis. There are existing Amazon EMR MapReduce jobs for data cleaning and feature engineering. Which of the following services can supply data to the MapReduce jobs? (Select two.) A. AWS DMS B. Amazon Kinesis C. AWS Data Pipeline D. Amazon Athena E. Amazon ES

B, C. Kinesis data into EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-kinesis.html

A Data Scientist is required to perform an analysis of employment data. The dataset comprises roughly ten million observations of individuals across ten distinct characteristics. During the preliminary analysis, the Data Scientist discovers that the income and age distributions are not normal. While income levels exhibit the anticipated right skew, with fewer people earning more, the age distribution also exhibits a right skew, with fewer older individuals participating in the workforce. Which feature transformations may the Data Scientist apply to fix the incorrectly skewed data? (Select two.) A. Cross-validation B. Numerical value binning C. High-degree polynomial transformation D. Logarithmic transformation E. One hot encoding

B, D. B because binning handles skewed data with few exceptions; D because a log transform can change the distribution of the data. Not C, because there is no indication in the text that the data follows any high-degree polynomial distribution such as x^10.

An insurance firm is creating a new automotive gadget that uses a camera to monitor drivers' behavior and alert them when they appear to be distracted. The organization collected roughly 10,000 training photos in a controlled setting that a Machine Learning Specialist will use to train and evaluate machine learning models. During the model evaluation, the Specialist sees that as the number of epochs grows, the training error rate decreases faster, yet the model is unable to infer effectively on unseen test pictures. Which of the following approaches should be used to remedy this situation? (Select two.) A. Add vanishing gradient to the model. B. Perform data augmentation on the training data. C. Make the neural network architecture complex. D. Use gradient checking in the model. E. Add L2 regularization to the model.

B, E. Data augmentation and L2 regularization both combat overfitting, which is what the widening gap between training and test performance indicates.

A data scientist is constructing a machine learning model to determine the legitimacy of financial transactions. The labeled data provided for training consists of 100,000 observations that are not fraudulent and 1,000 observations that are fraudulent. When the trained model is applied to a previously unseen validation dataset, the Data Scientist obtains the following confusion matrix. Although the model is 99.1 percent accurate, the Data Scientist has been asked to reduce the number of false negatives. Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Select two.) A. Change the XGBoost eval_metric parameter to optimize based on RMSE instead of error. B. Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights. C. Increase the XGBoost max_depth parameter because the model is currently underfitting the data. D. Change the XGBoost eval_metric parameter to optimize based on AUC instead of error. E. Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.

B, E. Here is why: not A (it is a classification problem, not regression); not C (increasing the max_depth parameter will increase overfitting and make generalization worse); not D (AUC is a poor model-quality indicator for imbalanced data). B is valid: the default value for scale_pos_weight is 1, and it should definitely be increased to something like sum(negative instances) / sum(positive instances). E is valid: decreasing max_depth should improve generalization.
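A small XGBoost sketch of options B and E on an imbalanced fraud-style dataset; the data and the exact parameter values are illustrative:

```python
import numpy as np
import xgboost as xgb

# Synthetic, heavily imbalanced labels: ~100 negatives per positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(101_000, 20))
y = np.concatenate([np.zeros(100_000), np.ones(1_000)])

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "scale_pos_weight": 100,   # B: sum(negatives) / sum(positives) up-weights fraud cases
    "max_depth": 3,            # E: shallower trees reduce overfitting
    "eval_metric": "auc",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```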

Full Bayesian network / Naive Bayes model / Correlation / Bayesian optimization for parameter tuning

A Bayesian network is a representation of a joint probability distribution over a set of random variables with possible mutual causal relationships; it is used when there are correlated features or features with varying degrees of correlation. Naive Bayes relies on the predictors being strongly independent, so it is appropriate when correlations are low. Bayesian optimization is a sequential algorithm that learns from past trainings as the tuning job progresses; because it learns from each incremental step, parallelism is highly limited. Running more hyperparameter tuning jobs concurrently gets more work done quickly, but a tuning job improves only through successive rounds of experiments. Typically, running one training job at a time achieves the best results with the least amount of compute time.
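A hedged SageMaker Python SDK sketch of Bayesian hyperparameter tuning with deliberately limited parallelism; the estimator, objective metric name, and ranges are placeholders:

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# `estimator` is an existing SageMaker Estimator configured elsewhere.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",     # learns from previous trials
    max_jobs=20,
    max_parallel_jobs=2,     # low parallelism lets each round inform the next
)
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
```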

A data scientist created a machine learning translation model for English to Japanese by training Amazon SageMaker's built-in seq2seq algorithm on 500,000 aligned sentence pairs. While testing with sample sentences, the data scientist finds that the translation quality is acceptable for a five-word example, but the quality degrades to an unacceptable level when the sentence exceeds 100 words in length. Which course of action will remedy the issue? A. Change preprocessing to use n-grams. B. Add more nodes to the recurrent neural network (RNN) than the largest sentence's word count. C. Adjust hyperparameters related to the attention mechanism. D. Choose a different weight initialization type.

C https://docs.aws.amazon.com/sagemaker/latest/dg/seq-2-seq-howitworks.html

A business wants to categorize user behavior as fraudulent or normal. A Machine Learning Specialist wants to develop a binary classifier based on two features: account age and transaction month. The graphic illustrates the class distribution of these features (description: age of account on the Y-axis, transaction month on the X-axis; points clustered in a circle in the middle are fraudulent, the others are non-fraudulent). Which model would have the HIGHEST degree of accuracy based on this information? A. Long short-term memory (LSTM) model with scaled exponential linear unit (SELU) B. Logistic regression C. Support vector machine (SVM) with non-linear kernel D. Single perceptron with tanh activation function

C

A Machine Learning Specialist collects customer data for an online shopping website. The data includes demographic information, previous visits, and information about the surrounding area. The Specialist is responsible for developing a machine learning approach for identifying customer buying behaviors, preferences, and trends in order to improve the website's service and recommendation capabilities. Which course of action should the Specialist suggest? A. Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database. B. A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database. C. Collaborative filtering based on user interactions and correlations to identify patterns in the customer database. D. Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database.

C. Collaborative filtering is the standard approach for recommendations based on user interactions.

On Amazon SageMaker, a Machine Learning Specialist is preparing data for training with one of SageMaker's built-in algorithms. The dataset is saved in .CSV format and converted to a numpy.array, which appears to be slowing down the training process. What should the Specialist do to optimize the data for SageMaker training? A. Use the SageMaker batch transform feature to transform the training data into a DataFrame. B. Use AWS Glue to compress the data into the Apache Parquet format. C. Transform the dataset into the RecordIO protobuf format. D. Use the SageMaker hyperparameter optimization feature to automatically optimize the data.

C Most Amazon SageMaker algorithms work best when you use the optimized protobuf recordIO format for the training data. https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
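A minimal sketch of converting a numpy array to recordIO-protobuf with the SageMaker Python SDK helper and uploading it to S3; the bucket and key are placeholders:

```python
import io

import boto3
import numpy as np
import sagemaker.amazon.common as smac

X = np.random.rand(1000, 50).astype("float32")
y = np.random.randint(0, 2, size=1000).astype("float32")

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)   # serialize features + labels as recordIO-protobuf
buf.seek(0)

boto3.resource("s3").Object("my-bucket", "train/data.rec").upload_fileobj(buf)
```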

A farming firm is interested in applying machine learning to identify particular weed species in a 100-acre grassland patch. The firm currently employs tractor-mounted cameras to gather several photographs of the field in 10 x 10 grids. Additionally, the organization has a sizable training dataset comprised of annotated photos of common weed classes such as broadleaf and non-broadleaf docks. The organization wishes to develop a weed-detection model capable of identifying specific kinds of weeds and their positions within a field. Once complete, the model will be hosted on Amazon SageMaker endpoints and will perform real-time inference on the camera images. Which strategy should a Machine Learning Specialist use in order to achieve reliable predictions? A. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes. B. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. C. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. D. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes.

C. Note that the question asks for two things: (1) detect specific types of weeds, and (2) detect the location of each type within the field. Image classification can only classify whole images. An object detection algorithm identifies all instances of objects within the image scene and indicates each object's location and scale with a rectangular bounding box. The recommended data format for computer vision algorithms in SageMaker is RecordIO.

A commercial security firm successfully piloted the use of 100 cameras deployed in strategic spots across the main office. The cameras' images were uploaded to Amazon S3, Amazon Rekognition was used to tag them, and the findings were saved in Amazon ES. The firm is now exploring the possibility of expanding the pilot into a complete production system with hundreds of video cameras across its worldwide office sites. The objective is to detect non-employee activity in real time.Which of the following options should the agency consider? A. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection of known employees, and alert when non-employees are detected. B. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Image to detect faces from a collection of known employees and alert when non-employees are detected. C. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon Kinesis Video Streams for each camera. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection on each stream, and alert when non-employees are detected. D. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon Kinesis Video Streams for each camera. On each stream, run an AWS Lambda function to capture image fragments and then call Amazon Rekognition Image to detect faces from a collection of known employees, and alert when non-employees are detected.

C. See https://docs.aws.amazon.com/rekognition/latest/dg/streaming-video.html. A Rekognition Video stream processor consumes the Kinesis video stream directly and searches faces against a collection of known employees, so no Lambda function is needed.
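
A hedged boto3 sketch of option C: create a Rekognition Video stream processor that reads one camera's Kinesis video stream and searches faces against a collection of known employees. All ARNs, the processor name, and the collection ID are placeholders.

import boto3

rekognition = boto3.client("rekognition")

# Create a stream processor that reads a Kinesis Video Stream and writes
# face-search results to a Kinesis Data Stream (ARNs are hypothetical).
rekognition.create_stream_processor(
    Name="office-camera-01",
    Input={"KinesisVideoStream": {"Arn": "arn:aws:kinesisvideo:us-east-1:123456789012:stream/camera-01"}},
    Output={"KinesisDataStream": {"Arn": "arn:aws:kinesis:us-east-1:123456789012:stream/face-matches"}},
    RoleArn="arn:aws:iam::123456789012:role/RekognitionStreamProcessorRole",
    Settings={"FaceSearch": {"CollectionId": "known-employees", "FaceMatchThreshold": 85.0}},
)

# Start processing; a downstream consumer can alert when no employee face matches.
rekognition.start_stream_processor(Name="office-camera-01")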

A business is establishing an Amazon SageMaker environment. Communication through the internet is prohibited under the business data security policy.How can the Amazon SageMaker service be enabled without also authorizing direct internet access to Amazon SageMaker notebook instances? A. Create a NAT gateway within the corporate VPC. B. Route Amazon SageMaker traffic through an on-premises network. C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC. D. Create VPC peering with Amazon VPC hosting Amazon SageMaker.

C. VPC interface endpoints (AWS PrivateLink) let notebook instances reach the SageMaker service without any internet access. https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf (516) https://docs.aws.amazon.com/zh_tw/vpc/latest/userguide/vpc-endpoints.html
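
A sketch of option C with boto3: create SageMaker interface VPC endpoints inside the corporate VPC. The region, VPC, subnet, and security-group IDs are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Interface endpoints for the SageMaker API and runtime so notebook instances
# can reach the service without traversing the public internet.
for service in ["com.amazonaws.us-east-1.sagemaker.api",
                "com.amazonaws.us-east-1.sagemaker.runtime"]:
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",          # hypothetical corporate VPC
        ServiceName=service,
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )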

A corporation populates an Amazon S3 data lake with machine learning (ML) data derived from online advertising clicks. The Kinesis Producer Library (KPL) is used to add click data to an Amazon Kinesis data stream. The data is fed into the S3 data lake using an Amazon Kinesis Data Firehose delivery stream from the data stream. As the amount of data rises, a machine learning professional sees that the pace at which data is fed into Amazon S3 remains rather consistent. Additionally, there is a rising backlog of data to be ingested by Kinesis Data Streams and Kinesis Data Firehose. Which of the following steps is most likely to increase the pace of data intake into Amazon S3? A. Increase the number of S3 prefixes for the delivery stream to write to. B. Decrease the retention period for the data stream. C. Increase the number of shards for the data stream. D. Add more consumers using the Kinesis Client Library (KCL).

C. The shard count is driven by throughput: multiply the number of records per second by the record size (e.g. 100 KB) and compare against the per-shard limits of 1 MB/second of ingest and 1,000 PUT records per second. Adding shards raises the stream's ingest capacity and clears the backlog.
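
A tiny worked calculation of the sizing rule, using a hypothetical producer profile (100 KB records at 50 records per second) rather than any figures from the question.

import math

# Hypothetical producer profile.
record_size_kb = 100
records_per_second = 50

incoming_mb_per_second = record_size_kb * records_per_second / 1024   # ~4.88 MB/s
shards_for_throughput = math.ceil(incoming_mb_per_second / 1.0)        # 1 MB/s ingest per shard
shards_for_record_rate = math.ceil(records_per_second / 1000)          # 1,000 PUT records/s per shard

print(max(shards_for_throughput, shards_for_record_rate))  # -> 5 shards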

A trucking business is gathering real-time visual data from its global fleet of vehicles. The data is expanding at a breakneck pace, with around 100 GB of new data created daily. The organization wishes to investigate possible applications of machine learning while guaranteeing that the data is only available to authorized IAM users.Which storage choice offers the most processing flexibility and supports IAM access control? A. Use a database, such as Amazon DynamoDB, to store the images, and set the IAM policies to restrict access to only the desired IAM users. B. Use an Amazon S3-backed data lake to store the raw images, and set up the permissions using bucket policies. C. Setup up Amazon EMR with Hadoop Distributed File System (HDFS) to store the files, and restrict access to the EMR instances using IAM policies. D. Configure Amazon EFS with IAM policies to make the data available to Amazon EC2 instances owned by the IAM users.

B. An Amazon S3-backed data lake offers the most processing flexibility, since EMR, SageMaker, Athena, and Glue can all read directly from S3, and S3 bucket policies can restrict access to specific IAM users. HDFS on an EMR cluster ties the data to the cluster and scales poorly for roughly 100 GB of new images per day.

Within a nation, an organization gathers census data to ascertain healthcare and social program requirements by province and city. Each person responds to around 500 questions on the census form.Which algorithmic combination would deliver the necessary insights? (Select two.) A. The factorization machines (FM) algorithm B. The Latent Dirichlet Allocation (LDA) algorithm C. The principal component analysis (PCA) algorithm D. The k-means algorithm E. The Random Cut Forest (RCF) algorithm

C: (OK) Use PCA to reduce the number of variables; each citizen's response contains answers to 500 questions, so 500 variables.
D: (OK) Use k-means clustering to segment provinces and cities with similar needs.
A: (Not OK) The factorization machines algorithm targets high-dimensional sparse datasets such as recommendations.
B: (Not OK) Latent Dirichlet Allocation (LDA) is for topic modeling in NLP.
E: (Not OK) Random Cut Forest is for detecting anomalies in data.
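
A minimal scikit-learn sketch of the chosen combination: PCA to compress the ~500 survey answers into a handful of components, then k-means to group respondents. The synthetic data, component count, and cluster count are placeholders.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for census responses: 10,000 respondents x 500 answers.
responses = np.random.rand(10_000, 500)

# Step 1: PCA reduces the 500 correlated answers to a compact representation.
components = PCA(n_components=20).fit_transform(responses)

# Step 2: k-means clusters respondents into segments for program planning.
segments = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(components)
print(np.bincount(segments))  # size of each cluster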

A Data Scientist was given a collection of insurance records, each of which had an ID for the record, the final result from 200 possible outcomes, and the date of the final outcome.Additionally, some limited information on the substance of claims is supplied, although only for a handful of the 200 categories. Hundreds of records have been provided during the last three years for each result category. The Data Scientist want to forecast the number of claims in each category month by month, many months in advance.Which machine learning algorithm should be used? A. Classification month-to-month using supervised learning of the 200 categories based on claim contents. B. Reinforcement learning using claim IDs and timestamps where the agent will identify how many claims in each category to expect from month to month. C. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month. D. Classification with supervised learning of the categories for which partial information on claim contents is provided, and forecasting using claim IDs and timestamps for all other categories.

C. This is a time-series forecasting problem: predict the number of claims per category month by month, many months in advance, so forecasting on the claim timestamps is the appropriate approach.

A gaming business has introduced an online game in which players may sign up for free but must pay to access certain features. The organization must develop an automated system that can forecast if a new user will convert to a premium subscriber within a year. The business has compiled a labeled collection of data from one million consumers. The training dataset contains 1,000 positive samples (from users who paid within a year) and 999,000 negative samples (from users who never paid). Each data sample contains 200 attributes about the user, such as their age, device, location, and play behaviors. The Data Science team constructed a random forest model on this dataset, which converged to above 99 percent accuracy on the training set. However, the prediction accuracy on a test dataset was insufficient. Which of the following strategies should the Data Science team use to address this issue? (Select two.) A. Add more deep trees to the random forest to enable the model to learn more features. B. Include a copy of the samples in the test dataset in the training dataset. C. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. D. Change the cost function so that false negatives have a higher impact on the cost value than false positives. E. Change the cost function so that false positives have a higher impact on the cost value than false negatives.

CD. With 999,000 negatives and only 1,000 positives, the model learns to predict almost everything as negative, which is why training accuracy is high but test performance is poor. Oversampling the positives with a small amount of noise (C) and weighting false negatives more heavily in the cost function (D) both push the model to stop under-predicting the positive class.

A corporation is experiencing poor accuracy while training on Amazon SageMaker's default built-in picture categorization algorithm. The Data Science team want to use an Inception neural network architecture rather than a ResNet one.Which of the following is the most effective way to do this? (Select two.) A. Customize the built-in image classification algorithm to use Inception and use this for model training. B. Create a support case with the SageMaker team to change the default image classification algorithm to Inception. C. Bundle a Docker container with TensorFlow Estimator loaded with an Inception network and use this for model training. D. Use custom code in Amazon SageMaker with TensorFlow Estimator to load the model with an Inception network, and use this for model training. E. Download and apt-get install the inception network code into an Amazon EC2 instance and use this instance as a Jupyter notebook in Amazon SageMaker.

CD. Both options keep the work inside SageMaker: bundle a Docker container that includes a TensorFlow Estimator loaded with an Inception network (C), or write custom code for the SageMaker TensorFlow Estimator that loads an Inception network (D).

On Amazon SageMaker, a Machine Learning team runs its own training algorithm. External assets are required for the training algorithm. The team must submit to Amazon SageMaker both its own algorithm code and algorithm-specific parameters.Which services should the team combine in order to create a bespoke algorithm in Amazon SageMaker? (Select two.) A. AWS Secrets Manager B. AWS CodeStar C. Amazon ECR D. Amazon ECS E. Amazon S3

CE. The custom algorithm's training image is published to Amazon ECR, and the external assets the algorithm needs are stored in Amazon S3.

A major organization has built a business intelligence tool that creates reports and dashboards from data gathered from different operational KPIs. The organization wishes to improve the executive experience by allowing them to get data from reports using natural language. The organization wants executives to be able to interact with the reporting tool using written and spoken interfaces. Which services may be used to provide this conversational interface? (Select three.) A. Alexa for Business B. Amazon Connect C. Amazon Lex D. Amazon Polly E. Amazon Comprehend F. Amazon Transcribe

CDF. Amazon Lex provides the conversational interface, Amazon Transcribe converts the spoken questions to text, and Amazon Polly converts the text responses back to speech.

A Data Scientist observes oscillations in training accuracy while doing mini-batch training on a neural network for a classification task.Which of the following is the MOST LIKELY CAUSE of this problem? A. The class distribution in the dataset is imbalanced. B. Dataset shuffling is disabled. C. The batch size is too big. D. The learning rate is very high.

D

A business analyzes camera photos of the tops of objects placed on shop shelves to identify which things have been taken and which remain. The organization now has a total of 1,000 hand-labeled photos encompassing ten separate things after many hours of data tagging. The training was ineffective.Which machine learning technique best meets the long-term goals of the business? A. Convert the images to grayscale and retrain the model B. Reduce the number of distinct items from 10 to 2, build the model, and iterate C. Attach different colored labels to each item, take the images again, and build the model D. Augment training data for each item using image variants like inversions and translations, build the model, and iterate.

D

A corporation wishes to forecast home selling prices using existing historical sales data. The selling price is the goal variable in the company's dataset. The attributes include the lot size, measures of the living space and non-living area, the number of bedrooms and bathrooms, the year constructed, and the postal code. The organization wishes to forecast home selling prices using multivariable linear regression.Which step should a machine learning professional take to eliminate extraneous information and simplify the model? A. Plot a histogram of the features and compute their standard deviation. Remove features with high variance. B. Plot a histogram of the features and compute their standard deviation. Remove features with low variance. C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores. D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.

D

A manufacturer of airplane engines is compiling a time series of 200 performance indicators. Engineers need near-real-time detection of significant production problems during testing. All data must be retained for offline analysis.Which strategy would be the MOST EFFECTIVE in terms of defect detection in near-real time? A. Use AWS IoT Analytics for ingestion, storage, and further analysis. Use Jupyter notebooks from within AWS IoT Analytics to carry out analysis for anomalies. B. Use Amazon S3 for ingestion, storage, and further analysis. Use an Amazon EMR cluster to carry out Apache Spark ML k-means clustering to determine anomalies. C. Use Amazon S3 for ingestion, storage, and further analysis. Use the Amazon SageMaker Random Cut Forest (RCF) algorithm to determine anomalies. D. Use Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest (RCF) to perform anomaly detection. Use Kinesis Data Firehose to store data in Amazon S3 for further analysis.

D

On a company's website, a Machine Learning Specialist implemented a model that delivers product suggestions. Initially, the model performed admirably and resulted in customers purchasing more items on average. However, the Specialist has noted that the efficacy of product suggestions has waned in recent months, and consumers are reverting to their previous buying patterns. The Specialist is puzzled about what occurred, since the model has remained unchanged since it was deployed over a year ago. Which strategy should the Specialist use in order to enhance the model's performance? A. The model needs to be completely re-engineered because it is unable to handle product inventory changes. B. The model's hyperparameters should be periodically updated to prevent drift. C. The model should be periodically retrained from scratch using the original data while adding a regularization term to handle product inventory changes D. The model should be periodically retrained using the original training data plus new data as product inventory changes.

D

A machine learning expert is now working on a proof of concept for government users who are most concerned about security. The expert is training a convolutional neural network (CNN) model for a picture classification application using Amazon SageMaker. The expert wishes to safeguard the data from inadvertent access and transmission to a distant host by malicious programs put on the training container. Which of the following actions will give the MOST SECURE protection? A. Remove Amazon S3 access permissions from the SageMaker execution role. B. Encrypt the weights of the CNN model. C. Encrypt the training and validation dataset. D. Enable network isolation for training jobs.

D Based on the following link: https://aws.amazon.com/blogs/security/secure-deployment-of-amazon-sagemaker-resources/ "EnableNetworkIsolation - Set this to true when creating training, hyperparameter tuning, and inference jobs to prevent situations like malicious code being accidentally installed and transferring data to a remote host."
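
A hedged SageMaker Python SDK sketch of how option D is expressed in code: enable_network_isolation=True on the estimator blocks all outbound network calls from the training container. The image URI, role, and S3 paths are placeholders.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/cnn-training:latest",  # hypothetical
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-secure-bucket/output/",
    enable_network_isolation=True,   # the container cannot make outbound network calls
)

estimator.fit({"train": "s3://my-secure-bucket/train/"})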

A web-based business wishes to increase conversions on its landing page. The business developed a multi-class deep learning network algorithm in Amazon SageMaker using a big historical dataset of client visits. However, there is an overfitting issue: training data indicates a prediction accuracy of 90%, whereas test data indicates only a prediction accuracy of 70%. The organization has to increase the generalizability of its model prior to putting it in production in order to optimize visit-to-purchase conversions. Which activity is advised to ensure that the company's test and validation data is modeled with the HIGHEST degree of accuracy possible? A. Increase the randomization of training data in the mini-batches used in training B. Allocate a higher proportion of the overall data to the training dataset C. Apply L1 or L2 regularization and dropouts to the training D. Reduce the number of layers and units (or neurons) from the deep learning network

D. The model is a deep neural network, and there are two ways to reduce overfitting in a neural network: 1) change network complexity by changing the network structure (number of layers and weights), or 2) change network complexity by constraining the network parameters (the values of the weights).

A business evaluates the risk variables associated with a specific energy sector using a long short-term memory (LSTM) model. The program analyzes multi-page text documents and categorizes each phrase as either posing a danger or posing no risk. The model is underperforming, despite the Data Scientist's extensive experimentation with several network architectures and tuning of the associated hyperparameters. Which technique will result in the MAXIMUM increase in performance? A. Initialize the words by term frequency-inverse document frequency (TF-IDF) vectors pretrained on a large collection of news articles related to the energy sector. B. Use gated recurrent units (GRUs) instead of LSTM and run the training process until the validation loss stops decreasing. C. Reduce the learning rate and run the training process until the training loss stops decreasing. D. Initialize the words by word2vec embeddings pretrained on a large collection of news articles related to the energy sector.

D. C is not the best answer because the question states that extensive hyperparameter tuning has already been tried. Initializing the words with word2vec embeddings pretrained on in-domain news articles is a form of transfer learning and gives the largest improvement.

A data scientist conducts data exploration and analysis using an Amazon SageMaker notebook instance. This involves installing some Python packages on the notebook instance that are not natively accessible on Amazon SageMaker.How can a machine learning professional guarantee that the data scientist's essential packages are automatically accessible on the notebook instance? A. Install AWS Systems Manager Agent on the underlying Amazon EC2 instance and use Systems Manager Automation to execute the package installation commands. B. Create a Jupyter notebook file (.ipynb) with cells containing the package installation commands to execute and place the file under the /etc/init directory of each Amazon SageMaker notebook instance. C. Use the conda package manager from within the Jupyter notebook console to apply the necessary conda packages to the default kernel of the notebook. D. Create an Amazon SageMaker lifecycle configuration with package installation commands and assign the lifecycle configuration to the notebook instance.

D. See the AWS documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html
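
A boto3 sketch of option D: register a lifecycle configuration whose on-start script installs the extra packages, then attach it to the notebook instance. The script contents, package list, instance name, and role ARN are placeholders.

import base64
import boto3

sm = boto3.client("sagemaker")

# Script run every time the notebook instance starts; installs extra packages
# into the chosen kernel environment (package list is hypothetical).
on_start = """#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
source activate python3
pip install --upgrade lightgbm shap
source deactivate
EOF
"""

sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-ds-packages",
    OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
)

# Attach the lifecycle configuration when creating the notebook instance.
sm.create_notebook_instance(
    NotebookInstanceName="ds-exploration",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    LifecycleConfigName="install-ds-packages",
)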

A retail chain has been utilizing Amazon Kinesis Data Firehose to ingest purchase details from its network of 20,000 outlets into Amazon S3. To facilitate the training of a more advanced machine learning model, training data will need additional but straightforward transformations, and certain characteristics will be merged. Daily retraining of the model is required.Which update will take the LEAST amount of development work, given the vast number of stores and historical data ingestion? A. Require that the stores to switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation. B. Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3. C. Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3. D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.

D. The question asks for "simple transformations, and some attributes will be combined" with the least development effort. Kinesis Data Analytics can read from the Firehose delivery stream, apply SQL transformations, and write the results to Amazon S3. https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-s3.html

A Machine Learning Specialist initiates a hyperparameter tuning project for a tree-based ensemble model using Amazon SageMaker with the target metric Area Under the Receiver Operating Characteristic Curve (AUC). This method will ultimately be integrated into a pipeline that retrains and optimizes hyperparameters each night in order to model click-through on data that goes stale every 24 hours. The Specialist wants to adjust the input hyperparameter range in order to reduce the time required to train these models and, ultimately, to save costs. Which visualization will achieve this goal? A. A histogram showing whether the most important input feature is Gaussian. B. A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension. C. A scatter plot showing the performance of the objective metric over each training iteration. D. A scatter plot showing the correlation between maximum tree depth and the objective metric.

D. This is a tricky question; the goal is to reconfigure the hyperparameter ranges. A refers to a feature, not a hyperparameter, so A is out. C refers to training progress, not to narrowing a hyperparameter range, so C is out. B is tempting because t-SNE visualizes high-dimensional data, but it refers to input variables, not hyperparameters. D shows how the objective metric varies with maximum tree depth, which lets you determine approximately the best tree depth and narrow that range. D is the only option that accomplishes the task of limiting a hyperparameter range, even if it covers only one hyperparameter.

A financial institution is attempting to identify credit card fraud. According to the firm, around 2% of credit card transactions are fraudulent. On the basis of a year's worth of credit card transaction data, a data scientist trained a classifier. The model must distinguish between fraudulent transactions (positives) and legitimate ones (negatives). The company's objective is to catch as many positives as possible correctly.Which metrics should be used to optimize the model by the data scientist? (Select two.) A. Specificity B. False positive rate C. Accuracy D. Area under the precision-recall curve E. True positive rate

DE. The objective is to catch as many fraudulent transactions (positives) as possible, so maximize recall, i.e. the true positive rate (sensitivity), which is equivalent to minimizing false negatives (Type II errors). Because only about 2% of transactions are positive, the area under the precision-recall curve is a far better optimization target than accuracy, specificity, or the false positive rate.

Difference between Kinesis Data Streams and Kinesis Data Firehose

Data Firehose can do both ingestion and data transformation (with Lambda) Kinesis Data Firehose can invoke your Lambda function to transform incoming source data and deliver the transformed data to destinations. You can enable Kinesis Data Firehose data transformation when you create your delivery stream. When you enable Kinesis Data Firehose data transformation, Kinesis Data Firehose buffers incoming data up to 3 MB by default. (To adjust the buffering size, use the ProcessingConfiguration API with the ProcessorParameter called BufferSizeInMBs.) Kinesis Data Firehose then invokes the specified Lambda function asynchronously with each buffered batch using the AWS Lambda synchronous invocation model. The transformed data is sent from Lambda to Kinesis Data Firehose. Kinesis Data Firehose then sends it to the destination when the specified destination buffering size or buffering interval is reached, whichever happens first.
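
A minimal sketch of the Lambda transformation handler that Firehose invokes; it follows the documented record contract (recordId, result, base64-encoded data). The specific transformation (upper-casing one field) is just an illustrative placeholder.

import base64
import json

def lambda_handler(event, context):
    """Transform each buffered Firehose record and return it with a status."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Illustrative transformation: normalize a field before delivery to S3.
        payload["product_category"] = payload.get("product_category", "").upper()

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}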

Amazon Kinesis Data Streams vs Amazon Kinesis Data Firehose
Amazon KCL (Kinesis Client Library)
Amazon Kinesis Video Streams

Kinesis Data Firehose ingests data and delivers it to destinations such as S3, optionally transforming it with Lambda; Kinesis Data Streams ingests data and retains it for custom consumer applications to process. The Kinesis Client Library (KCL) helps you consume and process data from a Kinesis data stream by taking care of the complex tasks associated with distributed computing, such as load balancing, responding to instance failures, checkpointing processed records, and reacting to resharding. These are distinct from the Kinesis Data Streams APIs available in the AWS SDKs. Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), playback, and other processing. Kinesis Video Streams automatically provisions and elastically scales all the infrastructure needed to ingest streaming video data from millions of devices.

AWS ML Confusion Matrix

For multiclass classification models, the confusion matrix shows accuracy for each class: rows are true classes, columns are predicted classes, and the color of each cell indicates the percentage of correct or incorrect predictions. True class frequencies for the evaluation data appear in the second-to-last column, and predicted class frequencies appear in the bottom row.
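
A short scikit-learn sketch of building and reading a multiclass confusion matrix, the same row/column structure described above. The labels and predictions are made up for illustration.

from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted classes for a 3-class problem.
y_true = ["cat", "dog", "dog", "bird", "cat", "bird", "dog", "cat"]
y_pred = ["cat", "dog", "cat", "bird", "cat", "dog", "dog", "cat"]

labels = ["bird", "cat", "dog"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = true classes, columns = predicted classes

# Per-class accuracy: diagonal count divided by the true class frequency.
for i, label in enumerate(labels):
    print(label, cm[i, i] / cm[i].sum())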

NN being trained on batches. Loss function is oscillating. What is issue?

Learning rate too high. A learning rate that is too high prevents the weights from settling into an optimal solution: the loss dips and then increases (or oscillates) as the updates repeatedly overshoot the minimum. A learning rate that is too low causes the loss to decline very slowly.
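
A tiny numpy demonstration of the effect, assuming plain gradient descent on f(w) = w^2: a small learning rate moves the weight smoothly toward the minimum, while a learning rate above 1.0 (for this toy loss) makes the weight overshoot, flip sign on every step, and diverge.

def descend(lr, steps=8, w=5.0):
    """Gradient descent on f(w) = w**2 (gradient 2*w); returns the weight path."""
    path = []
    for _ in range(steps):
        w = w - lr * 2 * w
        path.append(round(w, 3))
    return path

print("lr=0.1 :", descend(0.1))   # weight moves steadily toward the optimum at 0
print("lr=1.05:", descend(1.05))  # weight overshoots, oscillates in sign, and diverges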

The elbow method of choosing # of clusters

A good clustering is characterized by small within-cluster variation and large between-cluster variation. Plot the proportion of variance explained (or the within-cluster sum of squares) against the number of clusters k; when the curve plateaus, you have reached the "elbow", and the corresponding value of k is an appropriate number of clusters for segmenting the data. https://aws.amazon.com/blogs/machine-learning/k-means-clustering-with-amazon-sagemaker/
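
A short scikit-learn sketch of the elbow method: fit k-means for a range of k and watch where the within-cluster variation (inertia) stops dropping sharply. The synthetic blobs are placeholders.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters.
X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia))
# The drop in inertia flattens out ("elbows") around k = 4.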

Preferred storage option
Faster storage options
Optimize performance for Athena and SageMaker by converting data to _________

S3 is Amazon's preferred storage option; Amazon FSx for Lustre and Amazon EFS are faster alternatives when training needs higher throughput. To optimize performance of a dataset stored in S3: 1. Convert it to Parquet for Athena queries. 2. If using SageMaker built-in algorithms, convert it to RecordIO protobuf.
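
A minimal pandas sketch of the Athena-side conversion step: read the raw CSV and write it back out as Parquet. The S3 paths are placeholders, and reading/writing S3 URIs like this assumes the s3fs and pyarrow packages are installed.

import pandas as pd

# Load the raw CSV (path is hypothetical) and write it back out as Parquet.
df = pd.read_csv("s3://my-data-bucket/raw/visits.csv")
df.to_parquet("s3://my-data-bucket/curated/visits.parquet", index=False)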

Amazon Polly is being used by a business to convert plaintext texts to voice for the purpose of automating corporate announcements. However, in modern papers, corporate acronyms are mispronounced.What should a Machine Learning Specialist do in the future with regard to this issue? A. Convert current documents to SSML with pronunciation tags. B. Create an appropriate pronunciation lexicon. C. Output speech marks to guide in pronunciation. D. Use Amazon Lex to preprocess the text files for pronunciation

B. SSML is specific to one particular document: W3C can be pronounced as "World Wide Web Consortium" using <sub alias="World Wide Web Consortium">W3C</sub> in that document, but every new document would need to be tagged again. With a LEXICON, you upload the lexicon file once and ALL FUTURE documents that contain W3C will be pronounced as "World Wide Web Consortium". https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html
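
A hedged boto3 sketch of option B: upload a PLS pronunciation lexicon once, then reference it on every synthesis call. The lexicon name, the W3C alias, and the sample text are placeholders.

import boto3

polly = boto3.client("polly")

# Pronunciation Lexicon Specification (PLS) document; the W3C alias is illustrative.
lexicon = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""

polly.put_lexicon(Name="corpAcronyms", Content=lexicon)

# Every future document can reference the lexicon, with no per-document SSML tags.
polly.synthesize_speech(
    Text="The W3C announcement goes out on Friday.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    LexiconNames=["corpAcronyms"],
)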

Amazon API Gateway vs SageMaker Model with Create Endpoint API

SageMaker endpoint does not have public access. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. API Gateway can be used to present an external-facing, single point of entry for Amazon SageMaker endpoints. API Gateway can be used to front an Amazon SageMaker inference endpoint as (part of) a REST API, by making use of an API Gateway feature called mapping templates. This feature makes it possible for the REST API to be integrated directly with an Amazon SageMaker runtime endpoint, thereby avoiding the use of any intermediate compute resource (such as AWS Lambda or Amazon ECS containers) to invoke the endpoint. The result is a solution that is simpler, faster, and cheaper to run.

Amazon SageMaker Ground Truth Amazon Mechanical Turk

SageMaker Ground Truth uses automated data labeling (active learning), which reduces labeling time and cost. It produces a confidence score for each automatically applied label, so low-confidence items can be routed to human labelers, for example an Amazon Mechanical Turk workforce.

Amazon Comprehend vs Amazon SageMaker BlazingText

Amazon Comprehend is a managed API that you call without training or maintaining a model; with SageMaker BlazingText you must train, deploy, and maintain the model yourself.

How to limit costs using data formats when querying from Amazon s3

Query the data with Athena after converting it to Parquet or ORC. Both are columnar formats that support predicate pushdown (predicate filtering): because column values are stored in blocks with metadata, the query engine can skip blocks and columns that do not match the predicate, and the columnar layout compresses well, so far less data is scanned and the per-query cost drops.

