AWS Certified Machine Learning 1
A Data Scientist wants real-time insight into a data stream of GZIP files. Which option would allow for the LEAST amount of latency while using SQL to query the stream? A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data. B. AWS Glue with a custom ETL script to transform the data. C. An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster. D. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.
A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
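A minimal sketch of the Lambda preprocessing step in answer A, assuming each incoming record carries one base64-encoded GZIP payload. The handler follows the record-in, record-out shape that Kinesis Data Analytics preprocessing functions use (`recordId`, `result`, `data`); the error handling is illustrative only.

```python
import base64
import gzip
import zlib


def lambda_handler(event, context):
    """Decompress each GZIP record so the SQL application receives plain text."""
    output = []
    for record in event["records"]:
        try:
            compressed = base64.b64decode(record["data"])
            plain = gzip.decompress(compressed)
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(plain).decode("utf-8"),
            })
        except (OSError, EOFError, ValueError, zlib.error):
            # Hand the record back unchanged and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records that fail to decompress are marked `ProcessingFailed` rather than dropped silently, so they can be inspected later.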
A Machine Learning Specialist has developed a regression model, but the first iteration requires optimization. The Specialist must determine whether the model over- or underestimates the target more often. Which option can the Specialist use to determine whether the target value is being over- or underestimated? A. Root Mean Square Error (RMSE) B. Residual plots C. Area under the curve D. Confusion matrix
B. Residual plots
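Why residuals and not RMSE: squaring discards the sign of each error, while the residuals keep it. A small sketch with made-up numbers, defining a residual here as predicted minus actual so that a positive value means overestimation:

```python
import math


def residual_summary(actual, predicted):
    """Count over- and underestimates; residual = predicted - actual."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    return {
        "over": sum(1 for r in residuals if r > 0),
        "under": sum(1 for r in residuals if r < 0),
        "mean_residual": sum(residuals) / len(residuals),
        # RMSE is always non-negative, so it cannot show the direction of the bias.
        "rmse": math.sqrt(sum(r * r for r in residuals) / len(residuals)),
    }


# Hypothetical values for illustration only.
actual = [10.0, 12.0, 9.0, 15.0, 11.0]
predicted = [11.0, 13.5, 8.0, 16.0, 12.0]
summary = residual_summary(actual, predicted)
```

A residual plot shows the same information visually: points sitting mostly above or below zero reveal the direction of the bias.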
A Machine Learning Specialist created the graph below to illustrate the k-means results for k = [1..10]. What is a reasonable choice for the optimal value of k, given the graph? A. 1 B. 4 C. 7 D. 10
B. 4
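The graph is not reproduced here, so the following is only a sketch of the elbow heuristic on invented inertia values. It picks the k where the curve bends most sharply, measured as the largest drop in improvement from one k to the next.

```python
def elbow_k(inertias):
    """Return the k with the sharpest bend; inertias[i] belongs to k = i + 1."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        gain_before = inertias[i - 1] - inertias[i]
        gain_after = inertias[i] - inertias[i + 1]
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


# Hypothetical within-cluster sums of squares for k = 1..10.
inertias = [1000, 700, 450, 220, 200, 185, 175, 168, 163, 160]
chosen_k = elbow_k(inertias)
```

With these invented values the improvement flattens after k = 4, which is the pattern the answer describes.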
A manufacturing business uses an Amazon S3 bucket to store structured and unstructured data. A Machine Learning Specialist wants to query this data using SQL. Which option requires the LEAST amount of work in order to query this data? A. Use AWS Data Pipeline to transform the data and Amazon RDS to run queries. B. Use AWS Glue to catalogue the data and Amazon Athena to run queries. C. Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries. D. Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.
B. Use AWS Glue to catalogue the data and Amazon Athena to run queries.
A data scientist created a machine learning translation model for English to Japanese by combining 500,000 aligned sentence pairs with Amazon SageMaker's built-in seq2seq algorithm. While testing with sample sentences, the data scientist discovers that the translation quality is acceptable for a five-word example. However, the quality degrades to an unsatisfactory level when the sentence exceeds 100 words in length. Which course of action will remedy the issue? A. Change preprocessing to use n-grams. B. Add more nodes to the recurrent neural network (RNN) than the largest sentence's word count. C. Adjust hyperparameters related to the attention mechanism. D. Choose a different weight initialization type.
C. Adjust hyperparameters related to the attention mechanism.
On Amazon SageMaker, a Machine Learning team runs its own training algorithm. External assets are required for the training algorithm. The team must submit to Amazon SageMaker both its own algorithm code and algorithm-specific parameters. Which services should the team combine in order to create a bespoke algorithm in Amazon SageMaker? (Select two.) A. AWS Secrets Manager B. AWS CodeStar C. Amazon ECR D. Amazon ECS E. Amazon S3
C. Amazon ECR E. Amazon S3
A business wants to categorize user behavior as fraudulent or normal. A Machine Learning Specialist wants to develop a binary classifier based on two features: account age and transaction month. The graphic shown illustrates the class distribution of these features. Which model would have the HIGHEST degree of accuracy based on this information? A. Long short-term memory (LSTM) model with scaled exponential linear unit (SELU) B. Logistic regression C. Support vector machine (SVM) with non-linear kernel D. Single perceptron with tanh activation function
C. Support vector machine (SVM) with non-linear kernel
Which common parameters MUST be specified when submitting Amazon SageMaker training jobs that use one of the built-in algorithms? (Select three.) A. The training channel identifying the location of training data on an Amazon S3 bucket. B. The validation channel identifying the location of validation data on an Amazon S3 bucket. C. The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users. D. Hyperparameters in a JSON array as documented for the algorithm used. E. The Amazon EC2 instance class specifying whether training will be run using CPU or GPU. F. The output path specifying where on an Amazon S3 bucket the trained model will persist.
C. The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users. E. The Amazon EC2 instance class specifying whether training will be run using CPU or GPU. F. The output path specifying where on an Amazon S3 bucket the trained model will persist.
A Machine Learning Specialist is responsible for developing a procedure for querying a dataset stored on Amazon S3 using Amazon Athena. The dataset contains over 800,000 records, which are kept in unencrypted CSV files. Each record is around 1.5 MB in size and comprises 200 columns. The majority of queries will cover no more than five to ten columns. How should the Machine Learning Specialist change the dataset in order to shorten the time required to perform the query? A. Convert the records to Apache Parquet format. B. Convert the records to JSON format. C. Convert the records to GZIP CSV format. D. Convert the records to XML format.
A. Convert the records to Apache Parquet format.
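A back-of-the-envelope estimate of why a columnar format helps here. It assumes, purely for illustration, that all 200 columns are the same size and ignores compression. A row-oriented CSV scan must read every byte, while a columnar scan reads only the queried columns.

```python
def scan_estimate_gb(records, record_mb, total_columns, queried_columns):
    """Return (full-scan GB, columnar-scan GB) under an equal-column-size assumption."""
    full_gb = records * record_mb / 1000  # decimal units: 1 GB = 1000 MB
    columnar_gb = full_gb * queried_columns / total_columns
    return full_gb, columnar_gb


# Figures from the question; 10 columns is the upper end of "five to ten".
full_gb, columnar_gb = scan_estimate_gb(800_000, 1.5, 200, 10)
```

Under these assumptions a query touching 10 of 200 columns reads about 5% of the data, before any benefit from Parquet's compression.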
A producer of automobile engines gathers data from vehicles as they are driven. The time stamp, engine temperature, rotations per minute (RPM), and other sensor measurements are all captured. The business hopes to forecast when an engine may fail, so it can alert drivers in advance to schedule repair. For training purposes, the engine data is placed into a data lake. Which predictive model is the MOST SUITABLE for production deployment? A. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a recurrent neural network (RNN) to train the model to recognize when an engine might need maintenance for a certain fault. B. This data requires an unsupervised learning algorithm. Use Amazon SageMaker k-means to cluster the data. C. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a convolutional neural network (CNN) to train the model to recognize when an engine might need maintenance for a certain fault. D. This data is already formulated as a time series. Use Amazon SageMaker seq2seq to model the time series.
A. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a recurrent neural network (RNN) to train the model to recognize when an engine might need maintenance for a certain fault.
On a company's social media page, an employee saw a video clip with audio. The video is in Spanish. The employee's primary language is English, and the employee does not understand Spanish. The employee requests that a sentiment analysis be performed. Which service combination is the MOST EFFECTIVE in completing the task? A. Amazon Transcribe, Amazon Translate, and Amazon Comprehend B. Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq C. Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM) D. Amazon Transcribe, Amazon Translate and Amazon SageMaker BlazingText
A. Amazon Transcribe, Amazon Translate, and Amazon Comprehend
A retailer aims to classify new items using machine learning. The Data Science team was presented with a labeled dataset of current goods. There are 1,200 goods in the dataset. Each product in the labeled dataset includes 15 attributes, including its title, dimensions, weight, and price. Each item is tagged with a category, such as books, games, gadgets, or movies. Which model should be used to classify new items using the training data provided? A. An XGBoost model where the objective parameter is set to multi:softmax B. A deep convolutional neural network (CNN) with a softmax activation function for the last layer C. A regression forest where the number of trees is set equal to the number of product categories D. A DeepAR forecasting model based on a recurrent neural network (RNN)
A. An XGBoost model where the objective parameter is set to multi:softmax
A Machine Learning Specialist previously trained a logistic regression model on a local machine using scikit-learn and now wishes to deploy it to production for the sole purpose of inference. What steps should be taken to ensure that Amazon SageMaker can host a model that was trained locally? A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR. B. Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the registry hostname and upload it to Amazon S3. C. Serialize the trained model so the format is compressed for deployment. Build the image and upload it to Docker Hub. D. Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon ECR.
A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR.
A Machine Learning Specialist is provided with a structured dataset including information about the buying behaviors of a company's customers. Each customer is represented by thousands of columns of data, hundreds of which are numerical. The Specialist's objective is to determine whether these columns naturally group together across all customers and to visualize the findings as quickly as possible. How should the Specialist tackle these tasks? A. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a scatter plot. B. Run k-means using the Euclidean distance measure for different values of k and create an elbow plot. C. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a line graph. D. Run k-means using the Euclidean distance measure for different values of k and create box plots for each numerical column within each cluster.
A. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a scatter plot.
A technology business is recommending items to current customers based on their habits and interactions using complicated deep neural networks and GPU processing. Currently, the solution retrieves each dataset from an Amazon S3 bucket and loads it into a TensorFlow model obtained from the company's Git repository. This task is then scheduled to continue for many hours, continuously writing to the same S3 bucket. The task, which is executed from a central queue, may be interrupted, resumed, and continued at any moment in the case of a failure. Senior management is worried about the solution's resource management complexity and the expenses associated with repeating the procedure on a regular basis. They want the task to be automated so that it runs once a week, starting on Monday and finishing by close of business on Friday. Which architecture should be employed to efficiently scale the solution? A. Implement the solution using AWS Deep Learning Containers and run the container as a job using AWS Batch on a GPU-compatible Spot Instance B. Implement the solution using a low-cost GPU-compatible Amazon EC2 instance and use the AWS Instance Scheduler to schedule the task C. Implement the solution using AWS Deep Learning Containers, run the workload using AWS Fargate running on Spot Instances, and then schedule the task using the built-in task scheduler D. Implement the solution using Amazon ECS running on Spot Instances and schedule the task using the ECS service schedule
A. Implement the solution using AWS Deep Learning Containers and run the container as a job using AWS Batch on a GPU-compatible Spot Instance
A Machine Learning Specialist needs to develop a supervised image recognition model for the purpose of identifying a cat. The Machine Learning Specialist runs many tests and records the following for an image classifier powered by a neural network: total number of images available = 1,000; test set images = 100 (constant test set). The ML Specialist observes that, in over 75% of the misclassified images, the cats were held upside down by their owners. Which strategies can the Machine Learning Specialist apply to improve this particular test error? A. Increase the training data by adding variation in rotation for training images. B. Increase the number of epochs for model training C. Increase the number of layers for the neural network. D. Increase the dropout rate for the second-to-last layer.
A. Increase the training data by adding variation in rotation for training images.
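A toy sketch of rotation augmentation, using nested lists as stand-in "images" so that it runs without any imaging library. A real pipeline would rotate pixel arrays through an image or deep learning framework; only the idea of adding rotated copies to the training set is shown.

```python
def rotate_90(image):
    """Rotate a 2D image (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotations(image):
    """Return the image rotated by 0, 90, 180 and 270 degrees."""
    variants = [image]
    for _ in range(3):
        variants.append(rotate_90(variants[-1]))
    return variants


# Hypothetical 2x2 "image" for illustration.
image = [[1, 2],
         [3, 4]]
augmented = rotations(image)
upside_down = augmented[2]  # the 180-degree copy
```

Training on the rotated copies exposes the network to upside-down cats, which is the case it currently misclassifies.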
A data science team is developing a dataset repository to house a significant volume of training data that is often utilized in machine learning models. Given that Data Scientists may create an arbitrary number of new datasets each day, the solution must be scalable and cost-effective. Additionally, SQL exploration of the data must be possible. Which storage method is the MOST SUITABLE for this scenario? A. Store datasets as files in Amazon S3. B. Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance. C. Store datasets as tables in a multi-node Amazon Redshift cluster. D. Store datasets as global tables in Amazon DynamoDB.
A. Store datasets as files in Amazon S3.
A commercial security firm successfully piloted the use of 100 cameras deployed in strategic spots across the main office. The cameras' images were uploaded to Amazon S3, Amazon Rekognition was used to tag them, and the findings were saved in Amazon ES. The firm is now exploring the possibility of expanding the pilot into a complete production system with hundreds of video cameras across its worldwide office sites. The objective is to detect non-employee activity in real time. Which of the following options should the agency consider? A. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection of known employees, and alert when non-employees are detected. B. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Image to detect faces from a collection of known employees and alert when non-employees are detected. C. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon Kinesis Video Streams for each camera. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection on each stream, and alert when non-employees are detected. D. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon Kinesis Video Streams for each camera. On each stream, run an AWS Lambda function to capture image fragments and then call Amazon Rekognition Image to detect faces from a collection of known employees, and alert when non-employees are detected.
A. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection of known employees, and alert when non-employees are detected.
Amazon Personalize is being used by a retail firm to deliver individualized product suggestions to consumers during a marketing campaign. The organization quickly notices a big rise in sales of suggested goods to current clients after the deployment of a new solution version, but these sales decline shortly thereafter. For training purposes, only historical data from prior to the marketing campaign is accessible. What adjustments should a data scientist make to the solution? A. Use the event tracker in Amazon Personalize to include real-time user interactions. B. Add user metadata and use the HRNN-Metadata recipe in Amazon Personalize. C. Implement a new solution using the built-in factorization machines (FM) algorithm in Amazon SageMaker. D. Add event type and event value fields to the interactions dataset in Amazon Personalize.
A. Use the event tracker in Amazon Personalize to include real-time user interactions.
A data scientist must discover fake user accounts on an ecommerce platform for a business. The organization wants to establish whether a newly created account is connected to a previously identified fraudulent user. The data scientist is using AWS Glue to cleanse the company's application logs during ingestion. Which technique will enable the data scientist to detect fake accounts? A. Execute the built-in FindDuplicates Amazon Athena query. B. Create a FindMatches machine learning transform in AWS Glue. C. Create an AWS Glue crawler to infer duplicate accounts in the source data. D. Search for duplicate accounts in the AWS Glue Data Catalog.
B. Create a FindMatches machine learning transform in AWS Glue.
A machine learning expert at a fruit processing firm is tasked with developing a system for categorizing apples into three categories. The expert compiled a collection of 150 photos for each type of apple and used transfer learning to train a neural network pretrained on ImageNet with this dataset. The firm requires a model to be at least 85 percent accurate in order to use it. Following a thorough grid search, the best hyperparameters produced the following results: 68 percent accuracy on the training set and 67 percent accuracy on the validation set. What can the machine learning expert do to increase the accuracy of the system? A. Upload the model to an Amazon SageMaker notebook instance and use the Amazon SageMaker HPO feature to optimize the model's hyperparameters. B. Add more data to the training set and retrain the model using transfer learning to reduce the bias. C. Use a neural network model with more layers that are pretrained on ImageNet and apply transfer learning to increase the variance. D. Train a new model using the current neural network architecture.
B. Add more data to the training set and retrain the model using transfer learning to reduce the bias.
A Machine Learning Specialist is responsible for preparing data for training by moving and transforming it. Certain data must be handled in near-real time, while others may be transferred on an hourly basis. There are already existing Amazon EMR MapReduce operations for data cleaning and feature engineering. Which of the following services are capable of supplying data to MapReduce jobs? (Select two.) A. AWS DMS B. Amazon Kinesis C. AWS Data Pipeline D. Amazon Athena E. Amazon ES
B. Amazon Kinesis C. AWS Data Pipeline
A data scientist is developing a sentiment analysis application. The validation accuracy is low, and the Data Scientist believes that this is due to the dataset's large vocabulary and low average frequency of terms. Which tool should be utilized to increase the validation accuracy? A. Amazon Comprehend syntax analysis and entity detection B. Amazon SageMaker BlazingText cbow mode C. Natural Language Toolkit (NLTK) stemming and stop word removal D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer
B. Amazon SageMaker BlazingText cbow mode
A firm that encourages good sleep habits via the use of cloud-connected devices is now using AWS to host a sleep monitoring application. The program gathers information on the device's use from its users. The company's Data Science team is developing a machine learning model to forecast when and if a user may cease to use the company's gadgets. The model's predictions are utilized by a downstream application to identify the most effective method of engaging consumers. The Data Science team is developing many iterations of the machine learning model and comparing them to the commercial objectives of the organization. To determine the model's long-term performance, the team intends to run numerous versions in parallel for extended periods of time, with the possibility to alter the percentage of inferences supplied by the models. Which method achieves these criteria with the LEAST amount of effort? A. Build and host multiple models in Amazon SageMaker. Create multiple Amazon SageMaker endpoints, one for each model. Programmatically control invoking different models for inference at the application layer. B. Build and host multiple models in Amazon SageMaker. Create an Amazon SageMaker endpoint configuration with multiple production variants. Programmatically control the portion of the inferences served by the multiple models by updating the endpoint configuration. C. Build and host multiple models in Amazon SageMaker Neo to take into account different types of medical devices. Programmatically control which model is invoked for inference based on the medical device type. D. Build and host multiple models in Amazon SageMaker. Create a single endpoint that accesses multiple models. Use Amazon SageMaker batch transform to control invoking the different models through the single endpoint.
B. Build and host multiple models in Amazon SageMaker. Create an Amazon SageMaker endpoint configuration with multiple production variants. Programmatically control the portion of the inferences served by the multiple models by updating the endpoint configuration.
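A sketch of how variant weights translate into traffic shares. The dictionaries use field names from the SageMaker `ProductionVariant` structure; the variant names, model names, and instance type are hypothetical. No AWS call is made: the helper only reproduces the documented rule that each variant receives its weight divided by the sum of all weights.

```python
def traffic_shares(variants):
    """Fraction of inference requests each variant receives, based on its weight."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical variants: the current model keeps 90% of traffic, a challenger gets 10%.
production_variants = [
    {
        "VariantName": "model-a",
        "ModelName": "churn-model-a",
        "InitialInstanceCount": 1,
        "InstanceType": "ml.m5.large",
        "InitialVariantWeight": 9.0,
    },
    {
        "VariantName": "model-b",
        "ModelName": "churn-model-b",
        "InitialInstanceCount": 1,
        "InstanceType": "ml.m5.large",
        "InitialVariantWeight": 1.0,
    },
]
shares = traffic_shares(production_variants)
```

Changing the split later is a matter of changing the weights, which is why this option needs less application-side code than routing between separate endpoints.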
The Machine Learning Specialist at a corporation wants to increase the training speed of a TensorFlow-based time series forecasting model. Currently, the training is conducted using a single GPU and takes roughly 23 hours to complete. Training must be run daily. Although the model's accuracy is satisfactory, the business believes that the amount of training data will continue to grow and that the model will need to be updated hourly rather than daily. Additionally, the organization wishes to minimize coding effort and infrastructure changes. What modifications should the Machine Learning Specialist make to the training solution in order for it to scale in the future? A. Do not change the TensorFlow code. Change the machine to one with a more powerful GPU to speed up the training. B. Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. Parallelize the training to as many machines as needed to achieve the business goals. C. Switch to using a built-in AWS SageMaker DeepAR model. Parallelize the training to as many machines as needed to achieve the business goals. D. Move the training to Amazon EMR and distribute the workload to as many machines as needed to achieve the business goals.
B. Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. Parallelize the training to as many machines as needed to achieve the business goals.
A Machine Learning Specialist is assisting a large organization in incorporating machine learning into its products. The organization wants to categorize its customers based on their likelihood to churn over the following six months. The firm has identified the data that the Specialist has access to. Which form of machine learning model should the Specialist employ for this task? A. Linear regression B. Classification C. Clustering D. Reinforcement learning
B. Classification
A business is using Amazon Polly to convert plain text documents to speech for the purpose of automating corporate announcements. However, corporate acronyms are mispronounced in the current documents. How should a Machine Learning Specialist address this issue for future documents? A. Convert current documents to SSML with pronunciation tags. B. Create an appropriate pronunciation lexicon. C. Output speech marks to guide in pronunciation. D. Use Amazon Lex to preprocess the text files for pronunciation
B. Create an appropriate pronunciation lexicon.
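Amazon Polly lexicons use the W3C Pronunciation Lexicon Specification (PLS) XML format. The sketch below builds such a document with the standard library; the acronym and its expansion are invented examples, and uploading the lexicon to Polly is left out.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(aliases, lang="en-US"):
    """Return a PLS document mapping each acronym (grapheme) to a spoken alias."""
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, alias in aliases.items():
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = alias
    return ET.tostring(root, encoding="unicode")


# Hypothetical corporate acronym and the phrase to speak in its place.
lexicon_xml = build_lexicon({"EOD": "end of day"})
```

Because the lexicon is applied at synthesis time, future plain text documents need no SSML markup, which is what separates answer B from answer A.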
Using Amazon SageMaker, a Machine Learning Specialist is developing a model for time series forecasting. The Specialist has completed the model's training and now plans to load test the endpoint in order to configure Auto Scaling for the model variant. Which technique enables the Specialist to analyze the load test's latency, memory use, and CPU utilization? A. Review SageMaker logs that have been written to Amazon S3 by leveraging Amazon Athena and Amazon QuickSight to visualize logs as they are being produced. B. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker. C. Build custom Amazon CloudWatch Logs and then leverage Amazon ES and Kibana to query and visualize the log data as it is generated by Amazon SageMaker. D. Send Amazon CloudWatch Logs that were generated by Amazon SageMaker to Amazon ES and use Kibana to query and visualize the log data.
B. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker.
A data scientist is tasked with developing a custom recommendation model in Amazon SageMaker for an online retailer. Customers purchase just 4-5 items every 5-10 years due to the nature of the company's offerings. As a result, the business is reliant on a continual influx of new customers. When a new customer registers, the business gathers information about the customer's preferences. The following is a sample of the data that the data scientist has access to. For this use case, how should the data scientist divide the dataset into a training and test set? A. Shuffle all interaction data. Split off the last 10% of the interaction data for the test set. B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set. C. Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set. D. Randomly select 10% of the users. Split off all interaction data from these users for the test set.
B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set.
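A sketch of the per-user, most-recent split described in answer B. The `(user_id, timestamp, item_id)` tuple layout is an assumption, since the sample data is not shown. With only a handful of interactions per user, 10% rounds down to zero, so this sketch holds out at least one interaction for any user who has two or more; that rounding rule is a design choice, not part of the question.

```python
from collections import defaultdict


def split_most_recent(interactions, test_fraction=0.1):
    """Split (user_id, timestamp, item_id) rows into train/test by recency per user."""
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[1])  # oldest first
        wanted = max(1, round(len(rows) * test_fraction))
        n_test = min(len(rows) - 1, wanted)  # always keep one row for training
        cut = len(rows) - n_test
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test


# Hypothetical interactions, deliberately out of order.
interactions = [
    ("u1", 3, "c"), ("u1", 1, "a"), ("u1", 5, "e"), ("u1", 2, "b"), ("u1", 4, "d"),
    ("u2", 7, "x"),
]
train, test = split_most_recent(interactions)
```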
A data scientist is constructing a machine learning model to determine the legitimacy of financial transactions. The labeled data provided for training consists of 100,000 observations that are not fraudulent and 1,000 observations that are fraudulent. When the trained model is applied to a previously unseen validation dataset, the Data Scientist obtains the following confusion matrix. Although the model is 99.1 percent accurate, the Data Scientist has been asked to reduce the number of false negatives. Which combination of steps should the Data Scientist take to reduce the model's false negative predictions? (Select two.) A. Change the XGBoost eval_metric parameter to optimize based on rmse instead of error. B. Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights. C. Increase the XGBoost max_depth parameter because the model is currently underfitting the data. D. Change the XGBoost eval_metric parameter to optimize based on AUC instead of error. E. Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.
B. Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights. D. Change the XGBoost eval_metric parameter to optimize based on AUC instead of error.
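A small worked example for the two chosen settings. The XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value for `scale_pos_weight`. The class counts come from the question, with fraudulent transactions treated as the positive class; the `objective` value is an assumption.

```python
def suggested_scale_pos_weight(n_negative, n_positive):
    """Typical starting value: number of negatives divided by number of positives."""
    return n_negative / n_positive


# 100,000 legitimate (negative) and 1,000 fraudulent (positive) observations.
weight = suggested_scale_pos_weight(100_000, 1_000)

# Hypothetical parameter dictionary combining answers B and D.
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": weight,
}
```

A higher weight makes each missed fraud case cost more during training, which pushes the model toward fewer false negatives at the price of more false positives.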
Using Amazon Athena and Amazon S3, a mobile network operator is developing an analytics platform for analyzing and optimizing a business's operations. The source systems transmit data in real time in the .CSV format. Before storing the data on Amazon S3, the Data Engineering team wants to convert it to the Apache Parquet format. Which approach requires the MINIMUM amount of work to implement? A. Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet B. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Glue to convert data into Parquet. C. Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert data into Parquet. D. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert data into Parquet.
B. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Glue to convert data into Parquet.
A Data Scientist is developing a linear regression model and evaluating the statistical significance of each coefficient using the derived p-values. The Data Scientist observes that the majority of the features in the dataset are normally distributed. The image depicts the plot of a single feature from the dataset. Which transformation should the Data Scientist apply to ensure that the linear regression model's statistical assumptions are met? A. Exponential transformation B. Logarithmic transformation C. Polynomial transformation D. Sinusoidal transformation
B. Logarithmic transformation
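The plot is not reproduced here. The answer implies a right-skewed feature, so the sketch below uses invented right-skewed values to show how a logarithmic transformation pulls in the long tail and lowers the skewness.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature values (all positive, so log is defined).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log(v) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(logged)
```

The transformed values are still not perfectly symmetric, but they sit much closer to the bell shape of the other features.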
A Machine Learning Specialist is assisting a media organization in classifying popular articles from the organization's website. Before a story is published, the business uses random forests to predict its popularity. Below is an example of the data that was utilized. The Specialist wants to convert the Day Of Week column in the dataset to binary values. Which approach should be used to convert the values in this column to binary? A. Binarization B. One-hot encoding C. Tokenization D. Normalization transformation
B. One-hot encoding
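A minimal one-hot encoder for a day-of-week value, written without any library so the mechanics are visible. The Monday-first column order is an arbitrary choice.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def one_hot_day(day):
    """Return seven binary values with a single 1 in the position of the given day."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    return [1 if day == candidate else 0 for candidate in DAYS]


encoded = one_hot_day("Wednesday")
```

Each day becomes its own binary column, so the model does not read a false ordering into the day names.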
Every minute, a monitoring service generates 1 TB of scale metrics record data. A research team uses Amazon Athena to run queries on this data. Because of the large volume of data, the queries run slowly, and the team requires better performance. How should the records be stored in Amazon S3 to optimize query performance? A. CSV files B. Parquet files C. Compressed JSON D. RecordIO
B. Parquet files
An insurance firm is creating a new automotive device that uses a camera to monitor drivers' behavior and alerts them when they appear to be distracted. The organization created roughly 10,000 training images in a controlled setting, which a Machine Learning Specialist will use to train and evaluate machine learning models. During the model evaluation, the Specialist sees that as the number of epochs grows, the training error rate decreases faster and the model is unable to infer accurately on unseen test images. Which of the following approaches should be used to remedy this situation? (Select two.) A. Add vanishing gradient to the model. B. Perform data augmentation on the training data. C. Make the neural network architecture complex. D. Use gradient checking in the model. E. Add L2 regularization to the model.
B. Perform data augmentation on the training data. E. Add L2 regularization to the model.
What is the true class frequency for Romance and the predicted class frequency for Adventure given the following confusion matrix for a movie classification model? A. The true class frequency for Romance is 77.56% and the predicted class frequency for Adventure is 20.85% B. The true class frequency for Romance is 57.92% and the predicted class frequency for Adventure is 13.12% C. The true class frequency for Romance is 0.78 and the predicted class frequency for Adventure is (0.47-0.32) D. The true class frequency for Romance is 77.56% * 0.78 and the predicted class frequency for Adventure is 20.85%*0.32
B. The true class frequency for Romance is 57.92% and the predicted class frequency for Adventure is 13.12%
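The confusion matrix from the question is not reproduced here, so the sketch below uses an invented two-class matrix and assumes rows are true classes and columns are predicted classes. A class's true frequency is its row total divided by the grand total. Its predicted frequency is its column total divided by the grand total.

```python
def class_frequencies(matrix, labels):
    """Return (true, predicted) class frequencies; rows = true, columns = predicted."""
    total = sum(sum(row) for row in matrix)
    true_freq = {label: sum(matrix[i]) / total for i, label in enumerate(labels)}
    pred_freq = {
        label: sum(row[j] for row in matrix) / total for j, label in enumerate(labels)
    }
    return true_freq, pred_freq


# Hypothetical counts for illustration; not the values from the question.
labels = ["Romance", "Adventure"]
matrix = [
    [50, 10],  # true Romance: 50 predicted Romance, 10 predicted Adventure
    [5, 35],   # true Adventure: 5 predicted Romance, 35 predicted Adventure
]
true_freq, pred_freq = class_frequencies(matrix, labels)
```

If the matrix is laid out the other way around, swap the row and column sums.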
A data storage solution for Amazon SageMaker is being developed by a machine learning specialist. There is already a TensorFlow-based model developed as a train.py script that makes use of static training data saved as TFRecords. Which approach of supplying training data to Amazon SageMaker would satisfy business needs with the LEAST amount of development time? A. Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training invocation to the local path of the data without reformatting the training data. B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data. C. Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf data instead of TFRecords. D. Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to reformat and store the data in an Amazon S3 bucket.
B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data.
A trucking business is gathering real-time visual data from its global fleet of vehicles. The data is expanding at a breakneck pace, with around 100 GB of new data created daily. The organization wishes to investigate possible applications of machine learning while guaranteeing that the data is only available to authorized IAM users. Which storage choice offers the most processing flexibility and supports IAM access control? A. Use a database, such as Amazon DynamoDB, to store the images, and set the IAM policies to restrict access to only the desired IAM users. B. Use an Amazon S3-backed data lake to store the raw images, and set up the permissions using bucket policies. C. Set up Amazon EMR with Hadoop Distributed File System (HDFS) to store the files, and restrict access to the EMR instances using IAM policies. D. Configure Amazon EFS with IAM policies to make the data available to Amazon EC2 instances owned by the IAM users.
B. Use an Amazon S3-backed data lake to store the raw images, and set up the permissions using bucket policies.
A major organization has built a business intelligence tool that creates reports and dashboards from data gathered from different operational KPIs. The organization wishes to improve the executive experience by allowing them to get data from reports using natural language. The organization wants executives to be able to communicate with one another using written and spoken interfaces. Which services may be used to provide this conversational interface? (Select three.) A. Alexa for Business B. Amazon Connect C. Amazon Lex D. Amazon Polly E. Amazon Comprehend F. Amazon Transcribe
C. Amazon Lex E. Amazon Comprehend F. Amazon Transcribe
A web-based business wishes to increase conversions on its landing page. The business regularly trains a multi-class deep learning network algorithm on Amazon SageMaker using a large historical dataset of client visits. However, there is an overfitting issue: training data indicates a prediction accuracy of 90%, whereas test data indicates a prediction accuracy of only 70%. The organization has to increase the generalizability of its model prior to putting it in production in order to optimize visit-to-purchase conversions. Which activity is advised to ensure that the company's test and validation data is modelled with the HIGHEST degree of accuracy possible? A. Increase the randomization of training data in the mini-batches used in training B. Allocate a higher proportion of the overall data to the training dataset C. Apply L1 or L2 regularization and dropouts to the training D. Reduce the number of layers and units (or neurons) from the deep learning network
C. Apply L1 or L2 regularization and dropouts to the training
Customer data is collected by a Machine Learning Specialist for an online shopping website. Demographic information, previous visits, and information about the surrounding area are all included in the data. The Specialist is responsible for developing a machine learning strategy for identifying client buying behaviors, preferences, and trends in order to improve the website's service and recommendation capabilities. Which course of action should the Specialist suggest? A. Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database. B. A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database. C. Collaborative filtering based on user interactions and correlations to identify patterns in the customer database. D. Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database.
C. Collaborative filtering based on user interactions and correlations to identify patterns in the customer database.
A business is establishing an Amazon SageMaker environment. Communication through the internet is prohibited under the business data security policy. How can the Amazon SageMaker service be enabled without also authorizing direct internet access to Amazon SageMaker notebook instances? A. Create a NAT gateway within the corporate VPC. B. Route Amazon SageMaker traffic through an on-premises network. C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC. D. Create VPC peering with Amazon VPC hosting Amazon SageMaker.
C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC.
A Data Scientist was given a collection of insurance records, each of which had an ID for the record, the final result from 200 possible outcomes, and the date of the final outcome. Additionally, some limited information on the substance of claims is supplied, although only for a handful of the 200 categories. Hundreds of records have been provided during the last three years for each result category. The Data Scientist wants to forecast the number of claims in each category month by month, many months in advance. Which machine learning algorithm should be used? A. Classification month-to-month using supervised learning of the 200 categories based on claim contents. B. Reinforcement learning using claim IDs and timestamps where the agent will identify how many claims in each category to expect from month to month. C. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month. D. Classification with supervised learning of the categories for which partial information on claim contents is provided, and forecasting using claim IDs and timestamps for all other categories.
C. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month.
A Data Scientist is required to do employment data analysis. The dataset comprises roughly ten million observations of individuals across ten distinct characteristics. The Data Scientist discovers that the income and age distributions are not typical during the preliminary study. While income levels exhibit the anticipated right skew, with fewer persons earning more, the age distribution exhibits the same right skew, with fewer older individuals engaging in the workforce. Which feature transformations may the Data Scientist do to repair the data that has been skewed incorrectly? (Select two.) A. Cross-validation B. Numerical value binning C. High-degree polynomial transformation D. Logarithmic transformation E. One hot encoding
C. High-degree polynomial transformation D. Logarithmic transformation
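To make answer D concrete, the sketch below measures the skewness of a small right-skewed sample before and after a logarithmic transformation. The income figures are invented for illustration, and `skewness` is a hypothetical helper computing the usual moment-based (population) skewness.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed incomes: most are modest, a few are very large.
incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 60_000, 90_000, 250_000, 1_000_000]
logged = [math.log(v) for v in incomes]

print("raw skew:", skewness(incomes))
print("log skew:", skewness(logged))
```

The log compresses the long right tail, so the transformed sample is noticeably less skewed than the raw one.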
A corporation populates an Amazon S3 data lake with machine learning (ML) data derived from online advertising clicks. The Kinesis Producer Library (KPL) is used to add click data to an Amazon Kinesis data stream. The data is fed from the data stream into the S3 data lake using an Amazon Kinesis Data Firehose delivery stream. As the amount of data rises, a machine learning professional sees that the pace at which data is fed into Amazon S3 remains rather consistent. Additionally, there is a rising backlog of data to be ingested by Kinesis Data Streams and Kinesis Data Firehose. Which of the following steps is most likely to increase the pace of data intake into Amazon S3? A. Increase the number of S3 prefixes for the delivery stream to write to. B. Decrease the retention period for the data stream. C. Increase the number of shards for the data stream. D. Add more consumers using the Kinesis Client Library (KCL).
C. Increase the number of shards for the data stream.
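The reasoning behind answer C can be sketched with a small sizing calculation. It assumes the per-shard write limits of a provisioned Kinesis data stream (1 MB per second or 1,000 records per second); the traffic figures and the `shards_needed` helper are hypothetical.

```python
import math

# Per-shard write limits assumed for a provisioned Kinesis data stream.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(mb_per_sec, records_per_sec):
    """Smallest shard count that satisfies both the byte and record limits."""
    by_bytes = math.ceil(mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(1, by_bytes, by_records)


# Hypothetical click traffic at two points in time.
print(shards_needed(mb_per_sec=0.8, records_per_sec=900))   # fits in 1 shard
print(shards_needed(mb_per_sec=4.5, records_per_sec=3000))  # needs 5 shards
```

If producers push more than the stream's shards can accept, throughput into the stream plateaus and a backlog builds, which matches the symptom described in the question.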
A farming firm is interested in applying machine learning to identify particular weed species in a 100-acre grassland patch. The firm now employs tractor-mounted cameras to gather several photographs of the field in 10 × 10 grids. Additionally, the organization has a sizable training dataset comprised of annotated photos of common weed classifications such as broadleaf and non-broadleaf docks. The organization wishes to develop a weed identification model capable of identifying certain kinds of weeds and their position within a field. The model will be hosted on Amazon SageMaker endpoints once it is complete. The model will do real-time inference using the camera pictures. Which strategy should a Machine Learning Specialist use in order to achieve reliable predictions? A. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes. B. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. C. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. D. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes.
C. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm.
A big mobile network operator is developing a machine learning algorithm to forecast which consumers are likely to cancel their service subscription. The corporation intends to give an incentive to retain these clients, since the cost of churn is far more than the incentive's cost. After testing on a test dataset of 100 consumers, the model generates the following confusion matrix: Why is this a viable model for production, based on the model assessment results? A. The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives. B. The precision of the model is 86%, which is less than the accuracy of the model. C. The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives. D. The precision of the model is 86%, which is greater than the accuracy of the model.
C. The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives.
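The arithmetic behind answer C can be sketched as follows. The confusion matrix is not reproduced in this text, so the four counts below are assumed values chosen only to be consistent with 86% accuracy on 100 consumers; the per-customer costs are likewise hypothetical.

```python
# Assumed confusion-matrix counts (not taken from the original figure).
tp, fn, fp, tn = 10, 4, 10, 76
total = tp + fn + fp + tn

accuracy = (tp + tn) / total

# Hypothetical costs: an incentive is cheap, a lost customer is expensive.
incentive_cost = 50   # paid for each false positive (incentive not needed)
churn_cost = 500      # lost for each false negative (churner missed)

false_positive_cost = fp * incentive_cost
false_negative_cost = fn * churn_cost

print(f"accuracy = {accuracy:.2f}")
print(f"cost of false positives = {false_positive_cost}")
print(f"cost of false negatives = {false_negative_cost}")
```

Because each missed churner costs far more than each unnecessary incentive, the total cost of false positives stays below that of false negatives under these assumptions.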
Within a nation, an organization gathers census data to ascertain healthcare and social program requirements by province and city. Each person responds to around 500 questions on the census form. Which algorithmic combination would deliver the necessary insights? (Select two.) A. The factorization machines (FM) algorithm B. The Latent Dirichlet Allocation (LDA) algorithm C. The principal component analysis (PCA) algorithm D. The k-means algorithm E. The Random Cut Forest (RCF) algorithm
C. The principal component analysis (PCA) algorithm D. The k-means algorithm
A Machine Learning Specialist initiates a hyperparameter tuning project for a tree-based ensemble model using Amazon SageMaker with the target metric Area Under the Receiver Operating Characteristic Curve (AUC). This method will ultimately be integrated into a pipeline that retrains and optimizes hyperparameters each night in order to model click-through on data that goes stale every 24 hours. The Specialist wants to adjust the input hyperparameter range in order to reduce the time required to train these models and, eventually, to lower costs. Which visualization will achieve this goal? A. A histogram showing whether the most important input feature is Gaussian. B. A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension. C. A scatter plot showing the performance of the objective metric over each training iteration. D. A scatter plot showing the correlation between maximum tree depth and the objective metric.
D. A scatter plot showing the correlation between maximum tree depth and the objective metric.
A financial institution is attempting to identify credit card fraud. According to the firm, around 2% of credit card transactions are fraudulent. On the basis of a year's worth of credit card transaction data, a data scientist trained a classifier. The model must distinguish between fraudulent transactions (positives) and legitimate ones (negatives). The company's objective is to catch as many positives as possible correctly. Which metrics should the data scientist use to optimize the model? (Select two.) A. Specificity B. False positive rate C. Accuracy D. Area under the precision-recall curve E. True positive rate
D. Area under the precision-recall curve E. True positive rate
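A short sketch shows why accuracy (option C) is misleading at a 2% fraud rate while the true positive rate is not. The labels are synthetic, and the "classifier" is a deliberately useless one that marks every transaction as legitimate.

```python
def accuracy_and_tpr(y_true, y_pred):
    """Return (accuracy, true positive rate) for binary labels, 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, tpr


# Synthetic data: 2 fraudulent transactions out of 100.
labels = [1] * 2 + [0] * 98
always_legitimate = [0] * 100   # predicts "not fraud" for everything

acc, tpr = accuracy_and_tpr(labels, always_legitimate)
print(acc, tpr)
```

The useless classifier scores 98% accuracy yet catches no fraud at all, so its true positive rate is zero. Metrics that focus on the positive class expose this failure.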
A business analyzes camera photos of the tops of objects placed on shop shelves to identify which things have been taken and which remain. The organization now has a total of 1,000 hand-labeled photos encompassing ten separate things after many hours of data tagging. The training was ineffective. Which machine learning technique best meets the long-term goals of the business? A. Convert the images to grayscale and retrain the model B. Reduce the number of distinct items from 10 to 2, build the model, and iterate C. Attach different colored labels to each item, take the images again, and build the model D. Augment training data for each item using image variants like inversions and translations, build the model, and iterate.
D. Augment training data for each item using image variants like inversions and translations, build the model, and iterate.
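The idea in answer D can be sketched on a toy image stored as a nested list of pixel values. The helper names are hypothetical, "inversion" is interpreted here as a horizontal flip, and a real pipeline would normally use an image library rather than plain lists.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left edge with `fill`."""
    width = len(image[0])
    return [([fill] * shift + row)[:width] for row in image]


def augment(image):
    """Return the original image plus two simple variants."""
    return [image, flip_horizontal(image), translate_right(image, 1)]


# A tiny 2 x 3 "image" used purely for illustration.
image = [[1, 2, 3],
         [4, 5, 6]]

for variant in augment(image):
    print(variant)
```

Each labeled photo yields several training examples that keep the same label, which enlarges a small dataset without further manual tagging.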
A data scientist conducts data exploration and analysis using an Amazon SageMaker notebook instance. This involves installing some Python packages on the notebook instance that are not natively accessible on Amazon SageMaker. How can a machine learning professional guarantee that the data scientist's essential packages are automatically accessible on the notebook instance? A. Install AWS Systems Manager Agent on the underlying Amazon EC2 instance and use Systems Manager Automation to execute the package installation commands. B. Create a Jupyter notebook file (.ipynb) with cells containing the package installation commands to execute and place the file under the /etc/init directory of each Amazon SageMaker notebook instance. C. Use the conda package manager from within the Jupyter notebook console to apply the necessary conda packages to the default kernel of the notebook. D. Create an Amazon SageMaker lifecycle configuration with package installation commands and assign the lifecycle configuration to the notebook instance.
D. Create an Amazon SageMaker lifecycle configuration with package installation commands and assign the lifecycle configuration to the notebook instance.
A machine learning expert is now working on a proof of concept for government users who are most concerned about security. The expert is training a convolutional neural network (CNN) model for a picture classification application using Amazon SageMaker. The expert wishes to safeguard the data from inadvertent access and transmission to a distant host by malicious programs put on the training container. Which of the following actions will give the MOST SECURE protection? A. Remove Amazon S3 access permissions from the SageMaker execution role. B. Encrypt the weights of the CNN model. C. Encrypt the training and validation dataset. D. Enable network isolation for training jobs.
D. Enable network isolation for training jobs.
A business evaluates the risk variables associated with a specific energy sector using a long short-term memory (LSTM) model. The program analyzes multi-page text documents and categorizes each phrase as either posing a danger or posing no risk. The model is underperforming, despite the Data Scientist's extensive experimentation with several network architectures and tuning of the associated hyperparameters. Which technique will result in the MAXIMUM increase in performance? A. Initialize the words by term frequency-inverse document frequency (TF-IDF) vectors pretrained on a large collection of news articles related to the energy sector. B. Use gated recurrent units (GRUs) instead of LSTM and run the training process until the validation loss stops decreasing. C. Reduce the learning rate and run the training process until the training loss stops decreasing. D. Initialize the words by word2vec embeddings pretrained on a large collection of news articles related to the energy sector.
D. Initialize the words by word2vec embeddings pretrained on a large collection of news articles related to the energy sector.
A retail chain has been utilizing Amazon Kinesis Data Firehose to ingest purchase details from its network of 20,000 outlets into Amazon S3. To facilitate the training of a more advanced machine learning model, training data will need additional but straightforward transformations, and certain characteristics will be merged. Daily retraining of the model is required. Which update will take the LEAST amount of development work, given the vast number of stores and historical data ingestion? A. Require that the stores switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation. B. Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3. C. Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3. D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.
D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.
A corporation wishes to forecast home selling prices using existing historical sales data. The selling price is the target variable in the company's dataset. The attributes include the lot size, measures of the living space and non-living area, the number of bedrooms and bathrooms, the year constructed, and the postal code. The organization wishes to forecast home selling prices using multivariable linear regression. Which step should a machine learning professional take to eliminate extraneous information and simplify the model? A. Plot a histogram of the features and compute their standard deviation. Remove features with high variance. B. Plot a histogram of the features and compute their standard deviation. Remove features with low variance. C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores. D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.
D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.
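A minimal sketch of the check described in answer D follows. The feature names, values, and the 0.3 threshold are all invented for illustration; `pearson` computes the ordinary Pearson correlation coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_features(features, target, threshold=0.3):
    """Keep features whose absolute correlation with the target meets the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]


# Hypothetical housing data (price in thousands).
price = [200, 250, 300, 350, 400]
features = {
    "living_area": [100, 120, 150, 170, 200],  # tracks price closely
    "listing_code": [3, 1, 2, 1, 3],           # unrelated to price
}

print(select_features(features, price))
```

Only the feature that moves with the selling price survives the filter; the unrelated column is dropped, which simplifies the regression.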
A Data Scientist observes oscillations in training accuracy while doing mini-batch training on a neural network for a classification task. Which of the following is the MOST LIKELY CAUSE of this problem? A. The class distribution in the dataset is imbalanced. B. Dataset shuffling is disabled. C. The batch size is too big. D. The learning rate is very high.
D. The learning rate is very high.
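The effect named in answer D can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2 with two learning rates chosen for illustration; it is a one-dimensional stand-in for a training loss, not a neural network.

```python
def descend(lr, start=1.0, steps=6):
    """Gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - lr * 2 * x
        path.append(x)
    return path


low = descend(lr=0.1)    # each step multiplies x by 0.8
high = descend(lr=0.9)   # each step multiplies x by -0.8

print("lr=0.1:", low)
print("lr=0.9:", high)
```

With the small rate the iterate approaches the minimum from one side. With the large rate every step overshoots the minimum and lands on the opposite side, so the iterate flips sign each step, which is the kind of oscillation the question describes.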
On a company's website, a Machine Learning Specialist implemented a model that delivers product suggestions. Initially, the model performed admirably and resulted in consumers purchasing more items on average. However, the Specialist has noted that the efficacy of product suggestions has waned in recent months, and consumers are reverting to their previous buying patterns. The Specialist is unsure what happened, since the model has not changed since it was deployed over a year ago. Which strategy should the Specialist use in order to enhance the model's performance? A. The model needs to be completely re-engineered because it is unable to handle product inventory changes. B. The model's hyperparameters should be periodically updated to prevent drift. C. The model should be periodically retrained from scratch using the original data while adding a regularization term to handle product inventory changes. D. The model should be periodically retrained using the original training data plus new data as product inventory changes.
D. The model should be periodically retrained using the original training data plus new data as product inventory changes.
A manufacturer of airplane engines is compiling a time series of 200 performance indicators. Engineers need near-real-time detection of significant production problems during testing. All data must be retained for offline analysis. Which strategy would be the MOST EFFECTIVE in terms of defect detection in near-real time? A. Use AWS IoT Analytics for ingestion, storage, and further analysis. Use Jupyter notebooks from within AWS IoT Analytics to carry out analysis for anomalies. B. Use Amazon S3 for ingestion, storage, and further analysis. Use an Amazon EMR cluster to carry out Apache Spark ML k-means clustering to determine anomalies. C. Use Amazon S3 for ingestion, storage, and further analysis. Use the Amazon SageMaker Random Cut Forest (RCF) algorithm to determine anomalies. D. Use Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest (RCF) to perform anomaly detection. Use Kinesis Data Firehose to store data in Amazon S3 for further analysis.
D. Use Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest (RCF) to perform anomaly detection. Use Kinesis Data Firehose to store data in Amazon S3 for further analysis.