AiMLShort

101 A company uses deep neural networks, GPU compute to recommend its products to its customers based on customer's habits. Gets data from S3, loads it to TensorFlow model, pulled from local Git Repo. Job runs for few hours, outputs its progress to S3. Job can be paused and restarted from central queue. Weekly automated workload runs from Mon to Fri. Which of solutions below will incur lowest cost? A. Implement solution using AWS Deep Learning Containers and run container as a job using AWS Batch on a GPU-compatible Spot Instance. B. Implement solution using a low-cost GPU-compatible EC2 instance and use AWS Instance Scheduler to schedule task. C. Implement solution using AWS Deep Learning Containers, run workload using AWS Fargate running on Spot Instances, and then schedule task using built-in task scheduler. D. Implement solution using ECS running on Spot Instances and schedule task using ECS service scheduler.

A

119 A company uses Forecast to build a forecasting model for the inventory demand. The company has provided a dataset of historic inventory demand for its products as a .csv file stored in S3. The table below shows a sample of the dataset. How to transform the data? A. Use ETL jobs in AWS Glue to separate the dataset into a target time series dataset and an item metadata dataset. Upload both datasets as .csv files to S3. B. Use a Jupyter notebook in SageMaker to separate the dataset into a related time series dataset and an item metadata dataset. Upload both datasets as tables in Aurora. C. Use AWS Batch jobs to separate the dataset into a target time series dataset, a related time series dataset, and an item metadata dataset. Upload them directly to Forecast from a local machine. D. Use a Jupyter notebook in SageMaker to transform the data into the optimized protobuf recordIO format. Upload the dataset in this format to S3.

A

134 Code is in an ECR container image. SageMaker notebook's Processing job uses KMS-encrypted data from S3 and writes output to S3. Which of the below? A. Create IAM role/permission to create SageMaker Processing jobs, with read/write access to S3 and KMS/ECR permissions. Attach role to SageMaker notebook. Create SageMaker Processing job from notebook. B. Create IAM role/permission to create SageMaker Processing jobs. Attach role to SageMaker notebook. Create SageMaker Processing job with IAM role having read/write permissions to S3, and KMS/ECR permissions. C. Create IAM role/permission to create SageMaker Processing jobs, to access ECR. Attach role to SageMaker notebook. S3 endpoint and KMS endpoint in VPC. SageMaker Processing jobs from notebook. D. IAM role/permission to create SageMaker Processing jobs. Attach role to SageMaker notebook. S3 endpoint in VPC. SageMaker Processing jobs with access/secret key of IAM user with KMS and ECR permissions.

A

145 A company sells thousands of products on a public website and wants to automatically identify products with potential durability problems. The company has 1,000 reviews with date, star rating, review text, review summary, and customer email fields, but many reviews are incomplete and have empty fields. Each review has already been labeled with the correct durability result. A machine learning specialist must train a model to identify reviews expressing concerns over product durability. The first model needs to be trained and ready to review in 2 days. What is the MOST direct approach to solve this problem within 2 days? A. Train a custom classifier by using Amazon Comprehend. B. Build a recurrent neural network (RNN) in Amazon SageMaker by using Gluon and Apache MXNet. C. Train a built-in BlazingText model using Word2Vec mode in Amazon SageMaker. D. Use a built-in seq2seq model in Amazon SageMaker.

A

15 A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII). The dataset: Must be accessible from a VPC only. Must not traverse the public internet. How can these requirements be satisfied? A. Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC. B. Create a VPC endpoint and apply a bucket access policy that allows access from the given VPC endpoint and an Amazon EC2 instance. C. Create a VPC endpoint and use Network Access Control Lists (NACLs) to allow traffic between only the given VPC endpoint and an Amazon EC2 instance. D. Create a VPC endpoint and use security groups to restrict access to the given VPC endpoint and an Amazon EC2 instance.

A
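
For reference, a minimal sketch of a bucket policy that denies any request not arriving through the given VPC endpoint, applied with boto3 (the bucket name and endpoint ID are hypothetical placeholders):

import json
import boto3

BUCKET = "example-pii-bucket"            # placeholder bucket name
VPCE_ID = "vpce-0123456789abcdef0"       # placeholder VPC endpoint ID

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAccessUnlessFromVpcEndpoint",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        # Deny every request that does not come through the VPC endpoint
        "Condition": {"StringNotEquals": {"aws:SourceVpce": VPCE_ID}},
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))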

167 Build deep learning model using labeling to alert station manager if passenger crosses line without any train in station. Video data must remain confidential. Creates a bounding box to label sample data, uses an object detection model. Which of the below is correct solution? A. Rekognition Custom Labels to label dataset and create a custom Rekognition object detection model. Create a private workforce. Use A2I to review low-confidence predictions and retrain custom Rekognition model. B. SageMaker Ground Truth object detection labeling task. Mechanical Turk as labeling workforce. C. Rekognition Custom Labels to label dataset and create custom Rekognition object detection model. Create workforce with third-party Marketplace vendor. Use A2I to review low-confidence predictions and retrain custom Rekognition model. D. SageMaker Ground Truth semantic segmentation labeling task. Private workforce as labeling workforce.

A

175 A company uses ML to automate loan approval process with categorical field data, like: location by city, housing status, financial data. Uses gradient boosting regression model to infer the credit score. Training accuracy: 99%, testing accuracy: 75%. Which option will improve testing accuracy MOST? A. One-hot encoder for the categorical fields in the dataset. Perform standardization on financial fields in dataset. Apply L1 regularization to data. B. Tokenization of the categorical fields in the dataset. Perform binning on the financial fields in dataset. Remove the outliers in the data by using z-score. C. Label encoder for the categorical fields in the dataset. Perform L1 regularization on financial fields in dataset. Apply L2 regularization to data. D. Logarithm transformation on the categorical fields in the dataset. Perform binning on financial fields in dataset. Use imputation to populate missing values in dataset.

A

19 A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold. What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model's performance? A. Receiver operating characteristic (ROC) curve B. Misclassification rate C. Root Mean Square Error (RMSE) D. L1 norm

A
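
A ROC curve plots the true-positive rate against the false-positive rate at every possible classification threshold, which is exactly what is needed to pick an ideal cutoff. A minimal sketch with scikit-learn (synthetic data; all names are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic example: 1,000 customers, 5 features, binary "ordered pizza" label
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]          # predicted probabilities

fpr, tpr, thresholds = roc_curve(y, scores)    # one (fpr, tpr) point per threshold
print("AUC:", roc_auc_score(y, scores))
print("Sample thresholds:", thresholds[:5])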

209 Train various forecasting models on 80% of historical daily prices dataset, validate the efficacy of models on 20%. Which solution below is BEST to split the dataset to training and test datasets to compare model performance? A. Pick date so that 80% of data points precede the date. Assign that group of data points as training dataset. Assign all remaining data points to validation dataset. B. Pick date so that 80% of data points occur after the date. Assign that group of data points as training dataset. Assign all remaining data points to the validation dataset. C. Starting from earliest date in dataset, pick eight data points for training dataset and two data points for validation dataset. Repeat this stratified sampling until no data points remain. D. Sample data points randomly without replacement so that 80% of the data points are in training dataset. Assign all remaining data points to validation dataset.

A
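
Time series data must be split chronologically so the model never trains on the future. A minimal sketch with pandas (the file name and column names are assumptions for illustration):

import pandas as pd

# Assume a CSV with a "date" column and a daily "price" column
df = pd.read_csv("daily_prices.csv", parse_dates=["date"]).sort_values("date")

split_idx = int(len(df) * 0.8)        # date such that 80% of points precede it
train = df.iloc[:split_idx]           # earliest 80% -> training dataset
validation = df.iloc[split_idx:]      # remaining 20% -> validation dataset

print(train["date"].max(), "<", validation["date"].min())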

213 With high throughput, durable storage, scalability, up to 5 minutes of latency for streaming data ingestion. MOST operationally efficient? A. Configure devices to send streaming data to Kinesis data stream. Configure Kinesis Data Firehose delivery stream to auto consume Kinesis data stream, transform data with Lambda, save output in S3. B. Configure devices to send streaming data to S3. Configure S3 event notifications to invoke Lambda to read, transform and load data in Kinesis data stream. Configure Kinesis Data Firehose delivery stream to auto consume Kinesis data stream and load output back in S3. C. Configure devices to send streaming data to S3. Configure S3 event notifications to invoke Glue job to read, transform and load data in new S3. D. Configure devices to send streaming data to Kinesis Data Firehose delivery stream. Configure Glue job that connects to delivery stream to transform and load output in S3.

A

220 A manufacturing company wants to monitor its devices for anomalous behavior. A data scientist has trained an Amazon SageMaker scikit-learn model that classifies a device as normal or anomalous based on its 4-day telemetry. The 4-day telemetry of each device is collected in a separate file and is placed in an Amazon S3 bucket once every hour. The total time to run the model across the telemetry for all devices is 5 minutes. What is the MOST cost-effective solution for the company to use to run the model across the telemetry for all the devices? A. SageMaker Batch Transform B. SageMaker Asynchronous Inference C. SageMaker Processing D. A SageMaker multi-container endpoint

A

224 A data scientist at a food production company wants to use an Amazon SageMaker built-in model to classify different vegetables. The current dataset has many features. The company wants to save on memory costs when the data scientist trains and deploys the model. The company also wants to be able to find similar data points for each test data point. Which algorithm will meet these requirements? A. K-nearest neighbors (k-NN) with dimension reduction B. Linear learner with early stopping C. K-means D. Principal component analysis (PCA) with the algorithm mode set to random

A

241 Creates insights each morning about previous day's rental reservations. Auto-stream data to S3 in near real time. Detect high-demand rental cars at each location. Create a visualization dashboard that auto refreshes with latest data. LEAST development time? A. Use Kinesis Data Firehose to stream reservation data directly to S3. Detect high-demand outliers by using QuickSight ML Insights. Visualize data in QuickSight. B. Use Kinesis Data Streams to stream reservation data directly to S3. Detect high-demand outliers by using RCF trained model in SageMaker. Visualize data in QuickSight. C. Use Kinesis Data Firehose to stream reservation data directly to S3. Detect high-demand outliers by using RCF trained model in SageMaker. Visualize data in QuickSight. D. Use Kinesis Data Streams to stream reservation data directly to S3. Detect high-demand outliers by using QuickSight ML Insights. Visualize data in QuickSight.

A

265 Wants to use ML to identify houses with solar panels. Has 8,000 satellite images as training data. Will use SageMaker Ground Truth to label data. Has no ML expertise. Which one has LEAST effort? A. Set up private workforce with internal team. Use private workforce and SageMaker Ground Truth active learning feature to label data. Use Rekognition Custom Labels for model training and hosting. B. Set up private workforce with internal team. Use private workforce to label data. Use Rekognition Custom Labels for model training and hosting. C. Set up private workforce with internal team. Use private workforce and SageMaker Ground Truth active learning feature to label data. Use SageMaker Object Detection algorithm to train a model. Use SageMaker batch transform for inference. D. Set up public workforce to label data. Use SageMaker Object Detection algorithm to train a model. Use SageMaker batch transform for inference.

A

277 Promotes new product to existing customers. Has data for past promotions that are similar. Decides to try an experiment to send a more expensive marketing package to a few customers. Marketing campaign to customers who are most likely to buy new product. At least 90% of customers who are likely to purchase new product must receive marketing materials. Uses linear learner algo in SageMaker for model training. Has recall of 80%, precision of 75%. BEST way to retrain model? A. Set the target_recall hyperparameter to 90%. Set the binary_classifier_model_selection_criteria hyperparameter to recall_at_target_precision. B. Set the target_precision hyperparameter to 90%. Set the binary_classifier_model_selection_criteria hyperparameter to precision_at_target_recall. C. Use 90% of the historical data for training. Set the number of epochs to 20. D. Set the normalize_label hyperparameter to true. Set the number of classes to 2.

A

37 Which 1 has Real-time analytics, Interactive analytics of historical data, Clickstream analytics, Product recommendations? A. Glue as data catalog; Kinesis Data Streams and Kinesis Data Analytics for real-time data insights; Kinesis Data Firehose for delivery to ES for clickstream analytics; EMR for personalized product recommendation B. Athena as data catalog; Kinesis Data Streams, Data Analytics for near-real-time data insights; Kinesis Data Firehose for clickstream analytics; Glue for personalized product recommendation C. Glue as data catalog; Kinesis Data Streams and Kinesis Data Analytics for historical data insights; Kinesis Data Firehose for delivery to ES for clickstream analytics; EMR for personalized product recommendation D. Athena as data catalog; Kinesis Data Streams, Data Analytics for historical data insights; DynamoDB streams for clickstream analytics; Glue for personalized product recommendation

A

49 Camera images upload to S3. Rekognition tags images, stores result in Amazon ES. Which one can identify activities by non-employees in real time? A. Use proxy server for cameras, stream RTSP feed to unique Kinesis Video Streams video stream. Use Rekognition Video, create a stream processor to detect faces, alert for non-employees. B. Use proxy server for cameras, stream RTSP feed to unique Kinesis Video Streams video stream. Use Rekognition Image to detect faces, alert for non-employees. C. Install DeepLens cameras, DeepLens_Kinesis_Video to stream video to Kinesis Video Streams for cameras. On streams, use Rekognition Video and create a stream processor to detect faces, alert for non-employees. D. Install DeepLens cameras, DeepLens_Kinesis_Video to stream video to Kinesis Video Streams for cameras. On streams, run Lambda to capture image fragments, call Rekognition Image to detect faces, alert for non-employees.

A

52 Building a flexible and robust serverless data lake on Amazon S3, following below requirements: Support querying old and new data on S3 through Athena and Redshift Spectrum. Support event-driven ETL pipelines Provide a quick and easy way to understand metadata Which one below? A. AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data catalog to search and discover metadata. B. AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an external Apache Hive metastore to search and discover metadata. C. AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job, and an AWS Glue Data Catalog to search and discover metadata. D. AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL job, and an external Apache Hive metastore to search and discover metadata.

A

57 A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier: Total number of images available = 1,000 Test set images = 100 (constant test set) The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners. Which technique can be used by the ML Specialist to improve this specific test error? A. Increase the training data by adding variation in rotation for training images. B. Increase the number of epochs for model training. C. Increase the number of layers for the neural network. D. Increase the dropout rate for the second-to-last layer.

A

61 The company is designing a solution that will allow it to use ML to score malicious security events as anomalies on the data being ingested and to be able to save the results in its data lake for later processing and analysis. What is the MOST efficient way to accomplish these requirements? A. Ingest data using Kinesis Data Firehose, use Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Use Kinesis Data Firehose to stream the results to S3. B. Ingest data into Apache Spark Streaming using EMR, use Spark MLlib with k-means for anomaly detection. Store results in Apache Hadoop Distributed File System (HDFS) using EMR with a replication factor of 3 as data lake. C. Ingest data, store it in S3. Use AWS Batch with AWS Deep Learning AMIs to train k-means model using TensorFlow on data in S3. D. Ingest data, store it in S3. Run on-demand AWS Glue job to transform data. Use SageMaker's RCF model to detect anomaly.

A

62 A Data Scientist wants to gain real-time insights into a data stream of GZIP files. Which solution would allow the use of SQL to query the stream with the LEAST latency? A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data. B. AWS Glue with a custom ETL script to transform the data. C. An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster. D. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.

A

78 A Machine Learning Specialist previously trained a logistic regression model using scikit-learn on a local machine, and the Specialist now wants to deploy it to production for inference only. What steps should be taken to ensure Amazon SageMaker can host a model that was trained locally? A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR. B. Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the registry hostname and upload it to Amazon S3. C. Serialize the trained model so the format is compressed for deployment. Build the image and upload it to Docker Hub. D. Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon ECR.

A

89 A ML Specialist is given a structured dataset on shopping habits of a company's customer base. The dataset contains thousands of columns of data and hundreds of numerical columns for each customer. The Specialist wants to identify whether there are natural groupings for these columns across all customers and visualize the results fast. What approach should the Specialist take to accomplish these tasks? A. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a scatter plot. B. Run k-means using the Euclidean distance measure for different values of k and create an elbow plot. C. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a line graph. D. Run k-means using the Euclidean distance measure for different values of k and create box plots for each numerical column within each cluster.

A
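
t-SNE embeds the hundreds of numerical columns into two dimensions so that natural groupings become visible in a scatter plot. A minimal sketch with scikit-learn and matplotlib (synthetic stand-in data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Synthetic stand-in for hundreds of numerical shopping-habit columns
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 200))

# Embed into 2 dimensions and visualize as a scatter plot
embedding = TSNE(n_components=2, random_state=42).fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], s=5)
plt.title("t-SNE embedding of customer features")
plt.show()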

96 A Machine Learning Specialist is attempting to build a linear regression model. Given the displayed residual plot only, what is the MOST likely problem with the model? A. Linear regression is inappropriate. The residuals do not have constant variance. B. Linear regression is inappropriate. The underlying data has outliers. C. Linear regression is appropriate. The residuals have a zero mean. D. Linear regression is appropriate. The residuals have constant variance.

A

130 A data scientist must build a custom recommendation model in SageMaker for an online retail company. Due to nature of the company's products, customers buy only 4-5 products every 5-10 years. So, company relies on steady stream of new customers. When new customer signs up, company collects data on customer's preferences. Below is sample of the data available to the data scientist. How should the data scientist split the dataset into a training and test set for this use case? A. Shuffle all interaction data. Split off the last 10% of the interaction data for the test set. B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set. C. Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set. D. Randomly select 10% of the users. Split off all interaction data from these users for the test set.

B

142 SageMaker notebook instances for ML team, creates VPC interface endpoints for communication between VPC and notebook instances. All connections to SageMaker API are contained entirely and securely using AWS network. ML team realizes that people outside VPC can still connect to notebook instances across internet. Which 1? A. Change notebook instances' SG to only allow CIDR ranges of VPC. Apply this SG to all of notebook instances' VPC interfaces. B. Create an IAM policy that allows sagemaker:CreatePresignedNotebookInstanceUrl and sagemaker:DescribeNotebookInstance actions from only VPC endpoints. Apply this policy to all IAM users, groups, and roles used to access notebook instances. C. Add NAT gateway to VPC. Convert SageMaker notebook instance's subnets to private. Reassign only private IPs to all notebook instances. D. Change NACL of the subnet the notebook is hosted in, to restrict access to anyone outside VPC.

B

150 MOST efficient? A. SageMaker Studio to rebuild model. Create notebook using XGBoost training container for model training. Deploy model at endpoint. Enable SageMaker Model Monitor to store inferences. Use inferences to create Shapley values to explain model behavior. Create chart showing features and SHAP values to explain how features affect model outcomes. B. SageMaker Studio to rebuild model. Create notebook that uses XGBoost training container for model training. Activate and configure SageMaker Debugger to calculate and collect Shapley values. Create a chart that shows features and SHAP values to explain to credit team how features affect model outcomes. C. Create and use the SageMaker notebook instance and XGBoost library. Use Python XGBoost's plot_importance(). D. Use SageMaker Studio to rebuild the model. Create a notebook with XGBoost training container. Use SageMaker Processing to post-analyze the model.

B

155 A ML specialist is administering a production Amazon SageMaker endpoint with model monitoring configured. Amazon SageMaker Model Monitor detects violations on the SageMaker endpoint, so the ML specialist retrains the model with the latest dataset. This dataset is statistically representative of the current production traffic. The ML specialist notices that even after deploying the new SageMaker model and running the first monitoring job, the SageMaker endpoint still has violations. Which solution below can meet the requirement? A. Manually trigger the monitoring job to re-evaluate the SageMaker endpoint traffic sample. B. Run the Model Monitor baseline job again on the new training set. Configure Model Monitor to use the new baseline. C. Delete the endpoint and recreate it with the original configuration. D. Retrain the model again by using a combination of the original training set and the new training set.

B

18 A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs. What does the Specialist need to do? A. Bundle the NVIDIA drivers with the Docker image. B. Build the Docker container to be NVIDIA-Docker compatible. C. Organize the Docker container's file structure to execute on GPU instances. D. Set the GPU flag in the Amazon SageMaker CreateTrainingJob request body.

B

188 A machine learning (ML) specialist wants to bring a custom training algorithm to Amazon SageMaker. The ML specialist implements the algorithm in a Docker container that is supported by SageMaker. How should the ML specialist package the Docker container so that SageMaker can launch the training correctly? A. Specify the server argument in the ENTRYPOINT instruction in the Dockerfile. B. Specify the training program in the ENTRYPOINT instruction in the Dockerfile. C. Include the path to the training data in the docker build command when packaging the container. D. Use a COPY instruction in the Dockerfile to copy the training program to the /opt/ml/train directory.

B

192 A newspaper publisher has a table of customer data that consists of several numerical and categorical features, such as age and education history, as well as subscription status. The company wants to build a targeted marketing model for predicting the subscription status based on the table data. Which Amazon SageMaker built-in algorithm should be used to model the targeted marketing? A. Random Cut Forest (RCF) B. XGBoost C. Neural Topic Model (NTM) D. DeepAR forecasting

B

212 A data scientist has 20 TB of data in CSV format in an Amazon S3 bucket. The data scientist needs to convert the data to Apache Parquet format. How can the data scientist convert the file format with the LEAST amount of effort? A. Use an AWS Glue crawler to convert the file format. B. Write a script to convert the file format. Run the script as an AWS Glue job. C. Write a script to convert the file format. Run the script on an Amazon EMR cluster. D. Write a script to convert the file format. Run the script in an Amazon SageMaker notebook.

B
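
A Glue job is essentially a managed PySpark script, so the conversion needs only a read and a write. A minimal sketch against the awsglue libraries available inside a Glue job (the S3 paths are hypothetical):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the CSV data from S3 (placeholder bucket/prefix)
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/csv-data/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the same records back to S3 in Parquet format
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/parquet-data/"},
    format="parquet",
)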

238 A company wants to collect, monitor, store real-time traffic data at its several amusement park entrances by using cameras. The security team should be able to immediately access the data for viewing. The data must be indexed and must be accessible to ML team. MOST cost-effectively? A. Use Amazon Kinesis Video Streams to ingest, index, and store the data. Use the built-in integration with Amazon Rekognition for viewing by the security team. B. Use Amazon Kinesis Video Streams to ingest, index, and store the data. Use the built-in HTTP live streaming (HLS) capability for viewing by the security team. C. Use Amazon Rekognition Video and the GStreamer plugin to ingest the data for viewing by the security team. Use Amazon Kinesis Data Streams to index and store the data. D. Use Amazon Kinesis Data Firehose to ingest, index, and store the data. Use the built-in HTTP live streaming (HLS) capability for viewing by the security team.

B

24 A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target. What option can the Specialist use to determine whether it is overestimating or underestimating the target value? A. Root Mean Square Error (RMSE) B. Residual plots C. Area under the curve D. Confusion matrix

B
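
Residuals (actual minus predicted) that are mostly negative indicate overestimation, and mostly positive indicate underestimation. A minimal sketch of the plot (the arrays are illustrative stand-ins for the model's outputs):

import numpy as np
import matplotlib.pyplot as plt

# y_true and y_pred would come from the trained regression model
y_true = np.array([10.0, 12.5, 9.0, 14.0, 11.0])
y_pred = np.array([11.0, 13.0, 10.5, 14.5, 12.0])

residuals = y_true - y_pred          # negative -> model overestimates
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red")
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.show()

print("Mean residual:", residuals.mean())   # below zero here, so the model overestimates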

249 A manufacturing company has a production line with sensors that collect hundreds of quality metrics. The company has stored sensor data and manual inspection results in a data lake for several months. To automate quality control, the machine learning team must build an automated mechanism that determines whether the produced goods are good quality, replacement market quality, or scrap quality based on the manual inspection results. Which modeling approach will deliver the MOST accurate prediction of product quality? A. Amazon SageMaker DeepAR forecasting algorithm B. Amazon SageMaker XGBoost algorithm C. Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm D. A convolutional neural network (CNN) and ResNet

B
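
The three inspection outcomes (good, replacement market, scrap) make this a multi-class classification problem on tabular sensor data, which XGBoost handles directly. A minimal sketch with the open-source xgboost library (synthetic data; not the SageMaker built-in container):

import numpy as np
import xgboost as xgb

# Synthetic stand-in for hundreds of sensor quality metrics and 3 inspection labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = rng.integers(0, 3, size=1000)    # 0=good, 1=replacement market, 2=scrap

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "multi:softmax", "num_class": 3, "max_depth": 5}
model = xgb.train(params, dtrain, num_boost_round=100)

print(model.predict(xgb.DMatrix(X[:5])))   # predicted class per device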

268 System to audit ML systems, perform metadata analysis on features used by ML models, generate report to analyze metadata, set data sensitivity, authorship of features. LEAST effort? A. Use SageMaker Feature Store to select features. Create data flow to perform feature-level metadata analysis. Create DynamoDB table to store feature-level metadata. QuickSight to analyze metadata. B. Use SageMaker Feature Store to set feature groups for current features that ML models use. Assign required metadata for each feature. SageMaker Studio to analyze metadata. C. Use SageMaker Feature Store to apply custom algorithms to analyze feature-level metadata that company requires. Create DynamoDB table to store feature-level metadata. QuickSight to analyze metadata. D. Use SageMaker Feature Store to set feature groups for current features that ML models use. Assign required metadata for each feature. QuickSight to analyze metadata.

B

274 A machine learning (ML) specialist is training a linear regression model. The specialist notices that the model is overfitting. The specialist applies an L1 regularization parameter and runs the model again. This change results in all features having zero weights. What should the ML specialist do to improve the model results? A. Increase the L1 regularization parameter. Do not change any other training parameters. B. Decrease the L1 regularization parameter. Do not change any other training parameters. C. Introduce a large L2 regularization parameter. Do not change the current L1 regularization value. D. Introduce a small L2 regularization parameter. Do not change the current L1 regularization value.

B
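
When the L1 penalty is strong enough to drive every coefficient to zero, lowering it lets informative features keep non-zero weights. A sketch of the effect using scikit-learn's Lasso, where alpha plays the role of the L1 parameter (synthetic data):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

for alpha in [10.0, 1.0, 0.1, 0.01]:        # decreasing L1 strength
    model = Lasso(alpha=alpha).fit(X, y)
    n_nonzero = np.sum(model.coef_ != 0)
    print(f"alpha={alpha}: {n_nonzero} non-zero weights")
# A large alpha zeroes out everything; smaller alpha recovers the useful features.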

276 A wildlife research company has a set of images of lions and cheetahs. The company created a dataset of the images. The company labeled each image with a binary label that indicates whether an image contains a lion or cheetah. The company wants to train a model to identify whether new images contain a lion or cheetah. Which Amazon SageMaker algorithm will meet this requirement? A. XGBoost B. Image Classification - TensorFlow C. Object Detection - TensorFlow D. Semantic segmentation - MXNet

B

3 A Mobile Network Operator is building an analytics platform to analyze and optimize a company's operations using Amazon Athena and Amazon S3. The source systems send data in .CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3. Which solution takes the LEAST effort to implement? A. Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet. B. Ingest .CSV data from Amazon Kinesis Data Streams and use AWS Glue to convert data into Parquet. C. Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert data into Parquet. D. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert data into Parquet.

B

40 Receive a notification when the model is overfitting. View SageMaker log to ensure there are no unauthorized API calls. Which needs the LEAST amount of code and fewest steps? A. Implement an AWS Lambda function to log SageMaker API calls to S3. Add code to push a custom metric to CloudWatch. Create an alarm in CloudWatch with SNS to receive a notification when the model is overfitting. B. Use AWS CloudTrail to log SageMaker API calls to S3. Add code to push a custom metric to CloudWatch. Create an alarm in CloudWatch with SNS to receive a notification when the model is overfitting. C. Implement an AWS Lambda function to log SageMaker API calls to AWS CloudTrail. Add code to push a custom metric to CloudWatch. Create an alarm in CloudWatch with SNS to receive a notification when the model is overfitting. D. Use AWS CloudTrail to log SageMaker API calls to S3. Set up SNS to receive a notification when the model is overfitting.

B

59 A data scientist has explored and sanitized a dataset in preparation for the modeling phase of a supervised learning task. The statistical dispersion can vary widely between features, sometimes by several orders of magnitude. Before moving on to the modeling phase, the data scientist wants to ensure that the prediction performance on the production data is as accurate as possible. Which sequence of steps should the data scientist take to meet these requirements? A. Apply random sampling to the dataset. Then split the dataset into training, validation, and test sets. B. Split the dataset into training, validation, and test sets. Then rescale the training set and apply the same scaling to the validation and test sets. C. Rescale the dataset. Then split the dataset into training, validation, and test sets. D. Split the dataset into training, validation, and test sets. Then rescale the training set, the validation set, and the test set.

B
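
The key point is that scaling statistics must be learned from the training set only and then reused, so no information leaks from the validation or test sets. A minimal sketch with scikit-learn (synthetic arrays):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(loc=100, scale=50, size=(1000, 8))   # features with widely varying dispersion
y = rng.integers(0, 2, size=1000)

# Split first...
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=7)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=7)

# ...then fit the scaler on the training set only and apply the same scaling everywhere
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)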

71 On-premises: runs ETL process at fixed intervals, uses PySpark to combine multiple data sources into one. It is moving to cloud: Combine multiple data sources, Reuse existing PySpark logic, Run solution on existing schedule, Minimize number of servers to be managed. BEST? A. Write raw data to S3. Schedule Lambda to submit Spark step to persistent EMR cluster on a schedule. Use existing PySpark logic to run ETL job on EMR cluster. Output results to S3. B. Write raw data to S3, Glue ETL job for input processing, PySpark ETL job using existing logic, new Glue trigger to run ETL job on existing schedule, send output of ETL job to S3. C. Write raw data to S3. Schedule Lambda to run on schedule, process input from S3. Write Lambda logic in Python. Use existing PySpark logic for ETL process. Send Lambda output to S3. D. Use Kinesis Data Analytics to stream input, run real-time SQL on stream to transform it. Output results to S3.

B

79 A trucking company is collecting live image data from its fleet of trucks across the globe. The data is growing rapidly and approximately 100 GB of new data is generated every day. The company wants to explore machine learning uses cases while ensuring the data is only accessible to specific IAM users. Which storage option provides the most processing flexibility and will allow access control with IAM? A. Use a database, such as Amazon DynamoDB, to store the images, and set the IAM policies to restrict access to only the desired IAM users. B. Use an Amazon S3-backed data lake to store the raw images, and set up the permissions using bucket policies. C. Setup up Amazon EMR with Hadoop Distributed File System (HDFS) to store the files, and restrict access to the EMR instances using IAM policies. D. Configure Amazon EFS with IAM policies to make the data available to Amazon EC2 instances owned by the IAM users.

B

93 A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a machine learning specialist will build a binary classifier based on two features: age of account, denoted by x, and transaction month, denoted by y. The class distributions are illustrated in the provided figure. The positive class is portrayed in red, while the negative class is portrayed in black. Which model would have the HIGHEST accuracy? A. Linear support vector machine (SVM) B. Decision tree C. Support vector machine (SVM) with a radial basis function kernel D. Single perceptron with a Tanh activation function

B

1 A large mobile network operating company is building a machine learning model to predict customers who are likely to unsubscribe from the service. The company plans to offer an incentive for these customers as the cost of churn is far greater than the cost of the incentive. The model produces the following confusion matrix after evaluating on a test dataset of 100 customers: Based on the model evaluation results, why is this a viable model for production? A. The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives. B. The precision of the model is 86%, which is less than the accuracy of the model. C. The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives. D. The precision of the model is 86%, which is greater than the accuracy of the model.

C

124 A data scientist has developed a machine learning translation model for English to Japanese by using Amazon SageMaker's built-in seq2seq algorithm with 500,000 aligned sentence pairs. While testing with sample sentences, the data scientist finds that the translation quality is reasonable for an example as short as five words. However, the quality becomes unacceptable if the sentence is 100 words long. Which action will resolve the problem? A. Change preprocessing to use n-grams. B. Add more nodes to the recurrent neural network (RNN) than the largest sentence's word count. C. Adjust hyperparameters related to the attention mechanism. D. Choose a different weight initialization type.

C

127 A company trains a computer vision model to detect areas of concern on patients' CT scans. It has a large collection of unlabeled CT scans of each patient, stored in an S3 bucket. The scans must be accessible to authorized users only. How to build labeling pipeline with LEAST effort? A. Create workforce with AWS IAM. Build labeling tool on EC2. Queue images for labeling by using AWS SQS. Write labeling instructions. B. Create a Mechanical Turk workforce and manifest file. Create labeling job by using built-in image classification task type in SageMaker Ground Truth. Write labeling instructions. C. Create private workforce and manifest file. Create labeling job by using built-in bounding box task type in SageMaker Ground Truth. Write labeling instructions. D. Create workforce with Cognito. Build labeling web application with AWS Amplify. Build labeling workflow backend using AWS Lambda. Write labeling instructions.

C

128 A company uses Textract to extract textual data from thousands of scanned text-heavy legal documents daily. Uses this info to process loan applications automatically. Some of documents fail business validation and are returned to human reviewers to investigate, causing delay in loan applications. What should company do to reduce processing time of loan applications? A. Configure Textract to route low-confidence predictions to SageMaker Ground Truth. Perform a manual review on those words before performing a business validation. B. Use an Textract synchronous operation instead of an asynchronous operation. C. Configure Textract to route low-confidence predictions to Augmented AI (A2I). Perform a manual review on those words before performing a business validation. D. Use Rekognition's feature to detect text in an image to extract the data from scanned images. Use this information to process the loan applications.

C

133 Converts unstructured paper receipts into images. Create NLP model to find relevant entities like date, location, notes, receipt no., OCR to extract text for labeling. Docs are in diff formats, difficult to set up manual workflows for them. Trained NER model for custom entity detection using small sample size with low confidence score. Retrain with large dataset by LEAST effort? A. Extract text using Textract. Use the SageMaker BlazingText algorithm to train on text for entities and custom entities. B. Extract text using a deep learning OCR model from Marketplace. Use NER deep learning model to extract entities. C. Extract text using Textract. Use Comprehend for entity detection, Comprehend custom entity recognition for custom entity detection. D. Extract text using a deep learning OCR model from Marketplace. Use Comprehend for entity detection, Comprehend custom entity recognition for custom entity detection.

C

14 A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST accuracy? A. Long short-term memory (LSTM) model with scaled exponential linear unit (SELU) B. Logistic regression C. Support vector machine (SVM) with non-linear kernel D. Single perceptron with tanh activation function

C

161 Which? A. Use voice-driven Lex for ASR. Create custom slots within bot that identify required product names. Lex synonym mechanism to provide additional variations of each product name as bad transcriptions are identified. B. Use Transcribe for ASR. Analyze word confidence scores in transcript, auto update a custom vocab file with any word with confidence score < min. Use this custom vocab file in all future transcription tasks. C. Create custom vocab file having each product name with phonetic pronunciations, use it with AWS Transcribe for ASR customization. Analyze transcripts, manually update custom vocab file for updated or additional entries for wrongly identified names. D. Use audio transcripts to create training dataset, build Transcribe custom language model. Analyze transcripts, update training dataset with manually corrected transcripts which had wrongly transcribed product names. Create updated custom model.

C

162 Which 1 has MOST savings? A. Change notebook instance type to a memory optimized instance with same vCPU number as ml.m5.4xlarge instance has. Stop notebook when it is not in use. Run both data preprocessing and feature engineering development on that instance B. Keep notebook instance type and size same. Stop notebook when it is not in use. Run data preprocessing on a P3 instance type with same memory as ml.m5.4xlarge instance by using SageMaker Processing C. Change notebook instance type to a smaller general purpose instance. Stop notebook when it is not in use. Run data preprocessing on an ml.r5 instance with same memory size as ml.m5.4xlarge instance by using SageMaker Processing D. Change notebook instance type to a smaller general purpose instance. Stop notebook when it is not in use. Run data preprocessing on an R5 instance with same memory size as ml.m5.4xlarge instance by using Reserved Instance option.

C

173 Uses ML model for daily sales forecasting. Model is producing inaccurate results. Each day, AWS Glue job consolidates input data (used for forecasting) with actual daily sales data and the predictions of the model, stores output data in S3. ML team uses SageMaker Studio notebook to know why model is inaccurate. What solution from below can be done to visualize model's degradation MOST accurately? A. Create a histogram of the daily sales over the last 3 weeks. In addition, create a histogram of the daily sales from before that period. B. Create a histogram of the model errors over the last 3 weeks. In addition, create a histogram of the model errors from before that period. C. Create a line chart with the weekly mean absolute error (MAE) of the model. D. Create a scatter plot of daily sales versus model error for the last 3 weeks. In addition, create a scatter plot of daily sales versus model error from before that period.

C

205 All ML experiments and HPO jobs must be invoked from scripts inside SageMaker Studio notebooks. Which 1? A. Prepare custom HPO script running multiple training jobs in SageMaker Studio in local mode to tune custom container image model. Use auto model tuning feature of SageMaker with early stopping enabled to tune model of built-in image classification algo. Select best model. B. SageMaker Autopilot to tune custom container image model. Use auto model tuning feature of SageMaker with early stopping enabled to tune model of built-in image classification algo. Select best model. C. SageMaker Experiments to run and manage multiple training jobs to tune custom container image model. Use auto model tuning feature of SageMaker to tune model of built-in image classification algo. Select best model. D. Auto model tuning feature of SageMaker to tune custom container image model and built-in image classification algo model at same time. Select best model.

C

218 A company stores its documents in Amazon S3 with no predefined product categories. A data scientist needs to build a machine learning model to categorize the documents for all the company's products. Which solution will meet these requirements with the MOST operational efficiency? A. Build a custom clustering model. Create a Dockerfile and build a Docker image. Register the Docker image in Amazon Elastic Container Registry (Amazon ECR). Use the custom image in Amazon SageMaker to generate a trained model. B. Tokenize the data and transform the data into tabular data. Train an Amazon SageMaker k-means model to generate the product categories. C. Train an Amazon SageMaker Neural Topic Model (NTM) model to generate the product categories. D. Train an Amazon SageMaker Blazing Text model to generate the product categories.

C

225 A data scientist is training a large PyTorch model by using Amazon SageMaker. It takes 10 hours on average to train the model on GPU instances. The data scientist suspects that training is not converging and that resource utilization is not optimal. What should the data scientist do to identify and address training issues with the LEAST development effort? A. Use CPU utilization metrics that are captured in Amazon CloudWatch. Configure a CloudWatch alarm to stop the training job early if low CPU utilization occurs. B. Use high-resolution custom metrics that are captured in Amazon CloudWatch. Configure an AWS Lambda function to analyze the metrics and to stop the training job early if issues are detected. C. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected. D. Use the SageMaker Debugger confusion and feature_importance_overweight built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.

C
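
A sketch of attaching the two built-in rules to a SageMaker estimator with the sagemaker Python SDK (the training script, role ARN, and bucket are placeholders; the PyTorch estimator is used only because the question mentions a PyTorch model):

from sagemaker.debugger import Rule, ProfilerRule, rule_configs
from sagemaker.pytorch import PyTorch

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),          # flags non-converging gradients
    ProfilerRule.sagemaker(rule_configs.low_gpu_utilization()), # flags wasted GPU capacity
]

estimator = PyTorch(
    entry_point="train.py",                                         # placeholder training script
    role="arn:aws:iam::111122223333:role/example-sagemaker-role",   # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13.1",
    py_version="py39",
    rules=rules,
)
# estimator.fit({"training": "s3://example-bucket/train/"})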

228 A retail company wants to use Amazon Forecast to predict daily stock levels of inventory. The ML specialist wants to use item-related features such as "category", "brand", and "safety stock count" and also a binary time series feature that has "promotion applied" as its name. Future promotion info is available only for next 5 days. The ML specialist must choose an algorithm and an evaluation metric to produce prediction results to maximize profit. Which solution meets these requirements? A. Use ARIMA training algorithm. Evaluate model by Weighted Quantile Loss (wQL) metric at 0.75 (P75). B. Use ARIMA training algorithm. Evaluate model by Weighted Absolute Percentage Error (WAPE) metric. C. Use CNN-QR training algorithm. Evaluate model by Weighted Quantile Loss (wQL) metric at 0.75 (P75). D. Use CNN-QR training algorithm. Evaluate model by Weighted Absolute Percentage Error (WAPE) metric.

C

236 An online store is predicting future book sales by using a linear regression model that is based on past sales data. The data includes duration, a numerical feature that represents the number of days that a book has been listed in the online store. A data scientist performs an exploratory data analysis and discovers that the relationship between book sales and duration is skewed and non-linear. Which data transformation step should the data scientist take to improve the predictions of the model? A. One-hot encoding B. Cartesian product transformation C. Quantile binning D. Normalization

C
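
Quantile binning turns the skewed, non-linear duration feature into ordinal buckets holding roughly equal numbers of rows. A minimal sketch with pandas (synthetic data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"duration": rng.exponential(scale=60, size=1000)})  # skewed listing ages

# 10 quantile-based bins, each holding about 10% of the books
df["duration_bin"] = pd.qcut(df["duration"], q=10, labels=False)
print(df.groupby("duration_bin")["duration"].agg(["min", "max", "count"]))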

237 Which 1? A. Create S3 bucket and its ACL for each dataset. For each sensitive bucket, set ACL to allow access only from Fin Dept users. Allow all 3 Dept users to access non-sensitive buckets. B. Create S3 bucket for each dataset. For each sensitive bucket, set policy to allow access only from Fin Dept users. Allow all 3 Dept users to access S3 buckets with non-sensitive data. C. Create 1 S3 bucket with 1st folder for sensitive, 2nd for non-sensitive data. For Fin Dept user groups, attach an IAM policy that provides access to both folders. For Marketing, HR Dept user groups, attach IAM policy with access to only folder with non-sensitive data. D. Create 1 S3 bucket with 2 folders to separate sensitive datasets from non-sensitive datasets. Set policy for S3 bucket to allow only Fin Dept user group to access folder that contains sensitive datasets. Allow all 3 Dept user groups to access folder that contains non-sensitive data.

C

240 Computes rolling averages of ingested data from Kinesis data stream. Which 1 stores result in SageMaker Feature Store in near real time? A. Load data in S3 using Kinesis Data Firehose. Use SageMaker Processing job to aggregate, load the data to SageMaker Feature Store as online feature group B. Write data directly from data stream to SageMaker Feature Store as online feature group. Calculate rolling averages in place in SageMaker Feature Store by using SageMaker GetRecord API operation. C. Consume data stream by using Kinesis Data Analytics SQL application that calculates rolling averages. Generate result stream and consume it by using a custom Lambda that publishes results to SageMaker Feature Store as online feature group. D. Load data in S3 using Kinesis Data Firehose. Use SageMaker Processing job to load the data to SageMaker Feature Store as offline feature group. Compute the rolling averages at query time.

C

25 A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class? A. Decision tree B. Linear support vector machine (SVM) C. Naive Bayesian classifier D. Single Perceptron with sigmoidal activation function

C

253 Which 1? A. Use CloudWatch metrics to find SageMaker training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training info. Prune and remove low-ranking filters. Set new weights based on the pruned filters. Run a new training job with pruned model. B. Use SageMaker Ground Truth to build and run data labeling workflows. Collect a larger labeled dataset with the labelling workflows. Run a new training job that uses new labeled data with previous training data. C. Use SageMaker Debugger to find training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training info. Prune and remove low-ranking filters. Set the new weights based on the pruned set of filters. Run a new training job with the pruned model. D. Use SageMaker Model Monitor to find ModelLatency and OverheadLatency metric. Increase the learning rate. Run a new training job.

C

258 Understand some characteristics of visitors to the store. Has security video recordings from the past several years. Generate a report of hourly visitors from the recordings. Group visitors by hair style and hair color. LEAST amount of effort? A. Use an object detection algo to identify a visitor's hair in video frames. Pass the identified hair to a ResNet-50 algorithm to determine hair style and hair color. B. Use an object detection algo to identify a visitor's hair in video frames. Pass the identified hair to an XGBoost algorithm to determine hair style and hair color. C. Use a semantic segmentation algo to identify a visitor's hair in video frames. Pass the identified hair to a ResNet-50 algorithm to determine hair style and hair color. D. Use a semantic segmentation algo to identify a visitor's hair in video frames. Pass the identified hair to an XGBoost algorithm to determine hair style and hair color.

C

43 In a company, the Amazon SageMaker notebooks must access data in S3. For data privacy, all Amazon notebooks instances must stay in a secured VPC with no internet access and the data traffic must stay within AWS network. How should the Data Science team configure the notebook instance placement to meet these requirements? A. Associate Amazon SageMaker notebook with a private subnet in a VPC. Place the SageMaker endpoint and S3 buckets within the same VPC. B. Associate Amazon SageMaker notebook with a private subnet in a VPC. Use IAM policies to grant access to S3 and SageMaker. C. Associate Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has S3 VPC endpoints and SageMaker VPC endpoints attached to it. D. Associate Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has a NAT gateway and an associated security group allowing only outbound connections to S3 and SageMaker.

C

47 A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet. How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances? A. Create a NAT gateway within the corporate VPC. B. Route Amazon SageMaker traffic through an on-premises network. C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC. D. Create VPC peering with Amazon VPC hosting Amazon SageMaker.

C

55 A company is running a machine learning prediction service that generates 100 TB of predictions every day. A Machine Learning Specialist must generate a visualization of the daily precision-recall curve from the predictions, and forward a read-only version to the Business team. LEAST coding effort below? A. Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3. Give the Business team read-only access to S3. B. Generate daily precision-recall data in Amazon QuickSight, and publish the results in a dashboard shared with the Business team. C. Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3. Visualize the arrays in Amazon QuickSight, and publish them in a dashboard shared with the Business team. D. Generate daily precision-recall data in Amazon ES, and publish the results in a dashboard shared with the Business team.

C

56 A Machine Learning Specialist is preparing data for training on Amazon SageMaker. The Specialist is using one of the SageMaker built-in algorithms for the training. The dataset is stored in .CSV format and is transformed into a numpy.array, which appears to be negatively affecting the speed of the training. What should the Specialist do to optimize the data for training on SageMaker? A. Use the SageMaker batch transform feature to transform the training data into a DataFrame. B. Use AWS Glue to compress the data into the Apache Parquet format. C. Transform the dataset into the RecordIO protobuf format. D. Use the SageMaker hyperparameter optimization feature to automatically optimize the data.

C
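
The SageMaker Python SDK ships a helper for serializing numpy arrays as protobuf recordIO; a minimal sketch (array shapes, bucket, and key are illustrative):

import io
import numpy as np
import boto3
from sagemaker.amazon.common import write_numpy_to_dense_tensor

features = np.random.rand(1000, 20).astype("float32")
labels = np.random.randint(0, 2, size=1000).astype("float32")

buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)   # serialize to RecordIO-protobuf
buf.seek(0)

# Upload the serialized training data to S3 (placeholder bucket/key)
boto3.resource("s3").Object("example-bucket", "train/data.rec").upload_fileobj(buf)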

58 A Machine Learning Specialist needs to be able to ingest streaming data and store it in Apache Parquet files for exploration and analysis. Which of the following services would both ingest and store this data in the correct format? A. AWS DMS B. Amazon Kinesis Data Streams C. Amazon Kinesis Data Firehose D. Amazon Kinesis Data Analytics

C

75 A Machine Learning Specialist wants to determine the appropriate SageMakerVariantInvocationsPerInstance setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5. Based on the stated parameters and given that the invocations per instance setting is measured on a per-minute basis, what should the Specialist set as the SageMakerVariantInvocationsPerInstance setting? A. 10 B. 30 C. 600 D. 2,400

C
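
The setting is measured per minute, so the worked arithmetic behind answer C is 20 RPS x 60 seconds x 0.5 safety factor = 600:

peak_rps = 20            # peak requests per second from the load test
safety_factor = 0.5      # first-deployment safety factor

# SageMakerVariantInvocationsPerInstance is measured per minute
invocations_per_instance = peak_rps * 60 * safety_factor
print(invocations_per_instance)   # 600.0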

105 A Machine Learning Specialist is deciding between building a naive Bayesian model or a full Bayesian network for a classification problem. The Specialist computes the Pearson correlation coefficients between each feature and finds that their absolute values range between 0.1 to 0.95. Which model describes the underlying data in this situation? A. A naive Bayesian model, since the features are all conditionally independent. B. A full Bayesian network, since the features are all conditionally independent. C. A naive Bayesian model, since some of the features are statistically dependent. D. A full Bayesian network, since some of the features are statistically dependent.

D
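
Correlation coefficients near 0.95 mean some features are strongly dependent, which violates the naive Bayes conditional-independence assumption. A sketch of computing the matrix with pandas (synthetic data built so two features are dependent):

import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
f1 = rng.normal(size=500)
df = pd.DataFrame({
    "feature_1": f1,
    "feature_2": f1 * 0.9 + rng.normal(scale=0.3, size=500),  # strongly dependent on feature_1
    "feature_3": rng.normal(size=500),                         # roughly independent
})

print(df.corr(method="pearson").round(2))   # off-diagonal values near 0.95 flag dependence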

108 Reqs: Ingest streaming data to alert for unusual web traffic patterns, Calculate anomaly score, Adapt unusual event identification to changing web patterns. Which 1? A. Historic data to train anomaly detection model using SageMaker RCF. Use Kinesis Data Streams for incoming data. Lambda to enrich data. RCF model to calculate anomaly score. B. Historic data to train anomaly detection model using SageMaker XGBoost. Use Kinesis Data Streams for incoming data. Lambda to enrich data. XGBoost model to calculate anomaly score. C. Send streaming data to Kinesis Data Firehose. Map delivery stream to Kinesis Data Analytics. Run real-time SQL query against streaming data with kNN SQL extension to calculate anomaly scores using tumbling window. D. Send streaming data to Kinesis Data Firehose. Map delivery stream to Kinesis Data Analytics. Run real-time SQL query against streaming data with RCF SQL extension to calculate anomaly scores using sliding window.

D

121 A company uses a SageMaker notebook instance for data exploration and analysis using Python packages that are not natively available on SageMaker. How can an ML specialist ensure that the required packages are automatically available on the notebook instance for the data scientist to use? A. Install AWS Systems Manager Agent on the underlying EC2 instance and use Systems Manager Automation to execute the package installation commands. B. Create a Jupyter notebook file (.ipynb) with cells containing the package installation commands to execute and place the file under the /etc/init directory of each SageMaker notebook instance. C. Use the conda package manager from within the Jupyter notebook console to apply the necessary conda packages to the default kernel of the notebook. D. Create a SageMaker lifecycle configuration with package installation commands and assign the lifecycle configuration to the notebook instance.

D
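
A sketch of option D using boto3: create a lifecycle configuration whose OnStart script installs the extra packages, then attach it to the notebook instance. All names, the example package, and the conda environment are hypothetical.

```python
# Sketch: lifecycle configuration that installs extra packages (option D).
import base64
import boto3

on_start_script = """#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
source activate python3
pip install lightgbm==3.3.5   # example package not natively available
EOF
"""

sm = boto3.client("sagemaker")
sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-extra-packages",
    OnStart=[{"Content": base64.b64encode(on_start_script.encode()).decode()}],
)

# An existing instance can be pointed at the config (it must be stopped first).
sm.update_notebook_instance(
    NotebookInstanceName="data-science-notebook",
    LifecycleConfigName="install-extra-packages",
)
```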

137 Video cameras in low-bandwidth stores to notify staff of long lines at the cash registers? A. Install cameras compatible with KinesisVideoStreams to send to AWS over store's internet. Lambda to take image, send it to Rekognition to count number of faces in image. Send SNS notification for long lines. B. Deploy DeepLens cameras. Enable Rekognition on DeepLens device, use it to trigger local Lambda when person is recognized. Lambda to send SNS notification for long lines. C. Install cameras compatible with KinesisVideoStreams in stores. Build custom model in SageMaker to recognize number of people in image. Lambda to take image, SageMaker endpoint to call model to count people. Send SNS notification for long lines. D. Deploy DeepLens cameras. Build a custom model in SageMaker to recognize number of people in image. Deploy model to cameras. Deploy Lambda to cameras to use model to count people, send SNS notification for long lines.

D

149 Monitor comments in social media, evaluate sentiment, visualize trends, configure alarms based on thresholds. LEAST resources? A. Train a model in SageMaker by using BlazingText to detect sentiments. Trigger a Lambda when posts are added to S3 to invoke endpoint and record sentiment in DynamoDB and in custom CloudWatch metric. CloudWatch alarms to notify analysts. B. Train a model in SageMaker by using semantic segmentation algo to model semantic contents. Lambda when posts are added to S3, record sentiment in DynamoDB. Second Lambda to query recently added records, send SNS to notify analysts. C. Lambda when social media is posted to S3. Comprehend to capture sentiment, record sentiment in DynamoDB. Second Lambda to send SNS to notify analysts. D. Lambda when social media is posted to S3. Comprehend to capture sentiment, record sentiment in custom CloudWatch metric and in S3. CloudWatch alarms to notify analysts.

D
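
A sketch of the Lambda in option D: score sentiment with Comprehend and publish it as a custom CloudWatch metric that alarms can watch. The namespace, metric name, and the choice to track the negative-sentiment score are assumptions for illustration.

```python
# Sketch of option D's Lambda: Comprehend sentiment -> custom CloudWatch metric.
import boto3

comprehend = boto3.client("comprehend")
cloudwatch = boto3.client("cloudwatch")
s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Comprehend limits the size of a single detect_sentiment request,
        # so only the beginning of long posts is scored in this sketch.
        result = comprehend.detect_sentiment(Text=text[:4000], LanguageCode="en")
        negative_score = result["SentimentScore"]["Negative"]

        cloudwatch.put_metric_data(
            Namespace="SocialMedia/Sentiment",
            MetricData=[{"MetricName": "NegativeSentiment", "Value": negative_score}],
        )
```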

151 The model's text preprocessing stage will include part-of-speech tagging and key phrase extraction. The preprocessed text will be input to a custom classification algo that the ML team has written and trained using Apache MXNet. Which approach builds the NLP model quickest? A. Use Amazon Comprehend for the part-of-speech tagging, key phrase extraction, and classification tasks. B. Use an NLP library in Amazon SageMaker for the part-of-speech tagging. Use Amazon Comprehend for the key phrase extraction. Use AWS Deep Learning Containers with Amazon SageMaker to build the custom classifier. C. Use Amazon Comprehend for the part-of-speech tagging and key phrase extraction tasks. Use Amazon SageMaker built-in Latent Dirichlet Allocation (LDA) algorithm to build the custom classifier. D. Use Amazon Comprehend for the part-of-speech tagging and key phrase extraction tasks. Use AWS Deep Learning Containers with Amazon SageMaker to build the custom classifier.

D
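
The preprocessing half of option D maps to two Comprehend API calls: detect_syntax for part-of-speech tags and detect_key_phrases for key phrases. A minimal sketch with a made-up sentence; how the outputs are encoded for the custom MXNet classifier is left to the team's own code.

```python
# Sketch of option D's preprocessing: Comprehend POS tags + key phrases.
import boto3

comprehend = boto3.client("comprehend")
text = "The new turbine design reduced maintenance costs significantly."

pos_tokens = comprehend.detect_syntax(Text=text, LanguageCode="en")["SyntaxTokens"]
key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")["KeyPhrases"]

tokens = [(t["Text"], t["PartOfSpeech"]["Tag"]) for t in pos_tokens]
phrases = [p["Text"] for p in key_phrases]
# 'tokens' and 'phrases' would then be fed to the custom classifier running
# in an AWS Deep Learning Container on SageMaker.
```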

171 A machine learning (ML) specialist wants to create a data preparation job that uses a PySpark script with complex window aggregation operations to create data for training and testing. The ML specialist needs to evaluate the impact of the number of features and the sample count on model performance. Which approach should the ML specialist use to determine the ideal data transformations for the model? A. Add an Amazon SageMaker Debugger hook to the script to capture key metrics. Run the script as an AWS Glue job. B. Add an Amazon SageMaker Experiments tracker to the script to capture key metrics. Run the script as an AWS Glue job. C. Add an Amazon SageMaker Debugger hook to the script to capture key parameters. Run the script as a SageMaker processing job. D. Add an Amazon SageMaker Experiments tracker to the script to capture key parameters. Run the script as a SageMaker processing job.

D
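
Option D means instrumenting the PySpark script with a SageMaker Experiments tracker so each processing run records the knobs being varied. A minimal sketch using the sagemaker-experiments package, assuming the processing job was started with an experiment config; the parameter values shown are placeholders.

```python
# Sketch of option D: log data-prep parameters from inside the processing script.
from smexperiments.tracker import Tracker

# Tracker.load() picks up the experiment/trial component that SageMaker
# associates with the job (assumes the job was launched with an experiment config).
with Tracker.load() as tracker:
    tracker.log_parameters({
        "num_features": 120,       # example values set by the script's own logic
        "sample_count": 500_000,
    })
    tracker.log_metric(metric_name="train_rows_written", value=480_000)
```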

174 A company sends a weekly email newsletter after identifying five customer segments based on age, income, and location. The customers' current segmentation is unknown. A data scientist previously built an XGBoost model to predict the likelihood of a customer responding to an email based on age, income, and location. Which one of the options below applies? A. The XGBoost model provides a true/false binary output. Apply principal component analysis (PCA) with five feature dimensions to predict a segment. B. The XGBoost model provides a true/false binary output. Increase the number of classes the XGBoost model predicts to five classes to predict a segment. C. The XGBoost model is a supervised machine learning algorithm. Train a k-Nearest-Neighbors (kNN) model with K = 5 on the same dataset to predict a segment. D. The XGBoost model is a supervised machine learning algorithm. Train a k-means model with K = 5 on the same dataset to predict a segment.

D
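
Option D is unsupervised clustering with K = 5 over the same age/income/location features. A compact sketch with scikit-learn for brevity (the SageMaker built-in k-means algorithm plays the same role at scale); the file and the numeric location encoding are assumptions.

```python
# Sketch of option D: k-means with K = 5 to assign each customer a segment.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("customers.csv")  # hypothetical export of the customer data
# 'location_code' assumes location has been numerically encoded beforehand.
X = StandardScaler().fit_transform(customers[["age", "income", "location_code"]])

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
customers["segment"] = kmeans.fit_predict(X)
```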

177 Route incoming claims to the correct queue using a database of claim_label and claim_text, with no in-house ML expert? A. Export 2 DB columns: claim_label, claim_text to CSV. SageMaker Object2Vec algo, CSV file to train model. SageMaker to deploy model to inference endpoint. Service on inference endpoint to process incoming claims, predict labels. B. Export 1 DB column: claim_text to CSV. Use SageMaker LDA algo and CSV file to train model, to auto detect labels. SageMaker to deploy model to inference endpoint. Develop service to use inference endpoint to process incoming claims, predict labels, route claims to queue. C. Textract to process DB, auto detect claim_label, claim_text. Comprehend custom classification, Comprehend API to process incoming claims, predict labels, route claims to queue. D. Export 2 DB columns: claim_label, claim_text to CSV. Comprehend custom classification, CSV file to train custom classifier. Develop service to use Comprehend API to process incoming claims, predict labels, route claims to queue.

D
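
The training step of option D is a single Comprehend API call over the exported two-column CSV (label first, text second, no header). A sketch with hypothetical ARNs and S3 URIs; the trained classifier is then used by the routing service to predict a label for each incoming claim.

```python
# Sketch of option D: train a Comprehend custom classifier from the exported CSV.
import boto3

comprehend = boto3.client("comprehend")
response = comprehend.create_document_classifier(
    DocumentClassifierName="claims-router",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",  # hypothetical
    InputDataConfig={"S3Uri": "s3://example-bucket/claims_training.csv"},      # hypothetical
    LanguageCode="en",
)
classifier_arn = response["DocumentClassifierArn"]
# Once training finishes, a real-time endpoint or classification job built on
# this classifier predicts claim labels so the service can route each claim.
```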

183 A company is evaluating a GluonTS DeepAR model on Amazon SageMaker. The evaluation metrics on the test set indicate that the coverage score is 0.489 and 0.889 at the 0.5 and 0.9 quantiles, respectively. What can the data scientist reasonably conclude about the distributional forecast related to the test set? A. The coverage scores indicate that the distributional forecast is poorly calibrated. These scores should be approximately equal to each other at all quantiles. B. The coverage scores indicate that the distributional forecast is poorly calibrated. These scores should peak at the median and be lower at the tails. C. The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should always fall below the quantile itself. D. The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should be approximately equal to the quantile itself.

D
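
Why option D holds: coverage at quantile q is the fraction of actual values that fall at or below the forecast's q-quantile, so a well-calibrated forecast yields coverage close to q itself (0.489 vs 0.5, 0.889 vs 0.9). A tiny illustration with made-up arrays.

```python
# Empirical coverage check for a distributional forecast (illustrative data).
import numpy as np

actuals = np.array([10.0, 12.0, 9.5, 14.0, 11.0])
forecast_p50 = np.array([10.5, 11.0, 10.0, 13.0, 11.5])  # hypothetical 0.5-quantile forecasts
forecast_p90 = np.array([13.0, 14.5, 12.0, 16.0, 14.0])  # hypothetical 0.9-quantile forecasts

coverage_50 = np.mean(actuals <= forecast_p50)
coverage_90 = np.mean(actuals <= forecast_p90)
print(coverage_50, coverage_90)  # should land near 0.5 and 0.9 if well calibrated
```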

20 An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget. What should the Specialist do to meet these requirements? A. Create one-hot word encoding vectors. B. Produce a set of synonyms for every word using Amazon Mechanical Turk. C. Create word embedding vectors that store edit distance with every other word. D. Download word embeddings pre-trained on a large corpus.

D

22 Ingest purchase orders (POs) from 20,000 stores to S3 using Kinesis Data Firehose. Apply simple transforms to training records, combine some attributes, and retrain a new model daily. LEAST development effort? A. Require the stores to switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation. B. Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3. C. Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3. D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.

D

221 MOST accurate SageMaker built-in k-means clustering algo for optimal number of subgroups (k)? A. Calculate PCA components. Run k-means clustering algo for a range of k by using only first two PCA components. Create scatter plot for each cluster. The optimal value of k is value where clusters start to look reasonably separated. B. Calculate PCA components. Create line plot of number of components against explained variance. The optimal value of k is number of PCA components after which curve starts decreasing in a linear fashion. C. Create a t-SNE plot for range of perplexity values. The optimal value of k is value of perplexity, where clusters start to look reasonably separated. D. Run k-means clustering algo for a range of k. For each value of k, calculate sum of squared errors (SSE). Plot line chart of SSE for each value of k. The optimal value of k is point after which curve starts decreasing in linear fashion

D
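
Option D is the elbow method: run k-means over a range of k, plot the sum of squared errors (SSE), and pick the k at the elbow. A compact sketch with scikit-learn on synthetic stand-in data.

```python
# Sketch of the elbow method (option D) for choosing k in k-means.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)  # synthetic stand-in data

ks = range(1, 11)
sse = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("k")
plt.ylabel("Sum of squared errors (SSE)")
plt.title("Elbow plot for choosing k")
plt.show()
```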

26 A ML Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using SageMaker with AUC as the objective metric. This workflow retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hrs. To decrease the amount of time it takes to train these models, and to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s). Which visualization will accomplish this? A. A histogram showing whether the most important input feature is Gaussian. B. A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension. C. A scatter plot showing the performance of the objective metric over each training iteration. D. A scatter plot showing the correlation between maximum tree depth and the objective metric.

D

275 A data scientist for a medical diagnostic testing company has developed a machine learning (ML) model to identify patients who have a specific disease. The dataset that the scientist used to train the model is imbalanced. The dataset contains a large number of healthy patients and only a small number of patients who have the disease. The model should consider that patients who are incorrectly identified as positive for the disease will increase costs for the company. Which metric will MOST accurately evaluate the performance of this model? A. Recall B. F1 score C. Accuracy D. Precision

D
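
Why precision (option D): false positives are the costly error here, and precision = TP / (TP + FP) directly penalizes them, while recall ignores them. A tiny illustration with made-up labels.

```python
# Precision vs recall on a small imbalanced example (illustrative data).
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # mostly healthy patients
y_pred = [0, 0, 0, 1, 1, 0, 0, 1, 1, 0]   # two healthy patients wrongly flagged positive

print(precision_score(y_true, y_pred))  # 0.5   -> pulled down by the false positives
print(recall_score(y_true, y_pred))     # 0.667 -> unaffected by false positives
```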

34 A Data Scientist develops an ML model to predict future patient outcomes based on information collected about patients and treatment plans. The model should output a continuous value as its prediction. The data has labeled outcomes for 4,000 patients. The study was conducted on patients aged 65 and older with a disease that worsens with age. Initial models performed poorly. The Data Scientist notices that, out of the 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population. How should the Data Scientist correct this issue? A. Drop all records from the dataset where age has been set to 0. B. Replace the age field value for records with a value of 0 with the mean or median value from the dataset. C. Drop the age feature from the dataset and train the model using the rest of the features. D. Use k-means clustering to handle missing features.

D

54 Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other? A. Recall B. Misclassification rate C. Mean absolute percentage error (MAPE) D. Area Under the ROC Curve (AUC)

D
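
Why AUC (option D): it compares classifiers across all decision thresholds, whereas recall or misclassification rate depend on a single chosen threshold. A minimal sketch with made-up scores for two models.

```python
# Comparing two classifiers by AUC (illustrative scores).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores_model_a = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
scores_model_b = [0.5, 0.6, 0.55, 0.7, 0.65, 0.6]

print(roc_auc_score(y_true, scores_model_a))
print(roc_auc_score(y_true, scores_model_b))
# The higher AUC belongs to the model that ranks positives above negatives
# more consistently, independent of any single threshold.
```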

76 Uses long short-term memory (LSTM) to evaluate the risk factors of energy sector. The model uses text documents to analyze each sentence of text and categorizes it as risk or no risk. The model is not performing well, even though the Data Scientist has experimented with many different network structures and tuned the corresponding hyperparameters. Which approach will provide the MAXIMUM performance boost? A. Initialize the words by term frequency-inverse document frequency (TF-IDF) vectors pretrained on a large collection of news articles related to the energy sector. B. Use gated recurrent units (GRUs) instead of LSTM and run the training process until the validation loss stops decreasing. C. Reduce the learning rate and run the training process until the training loss stops decreasing. D. Initialize the words by word2vec embeddings pretrained on a large collection of news articles related to the energy sector.

D

81 A Data Scientist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes within the dataset, but it does not achieve an acceptable recall metric. The Data Scientist has already tried varying the number and size of the MLP's hidden layers, which has not significantly improved the results. A solution to improve recall must be implemented as quickly as possible. Which technique should be used to meet these requirements? A. Gather more data using Amazon Mechanical Turk and then retrain B. Train an anomaly detection model instead of an MLP C. Train an XGBoost model instead of an MLP D. Add class weights to the MLP's loss function and then retrain

D
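
A sketch of option D: re-weight the rare class in the loss so the MLP pays more attention to it. Shown with Keras purely for illustration; the layer sizes, input width, and the class weight of 10 are hypothetical starting points, not values from the question.

```python
# Sketch of option D: class-weighted loss for an MLP (Keras, illustrative values).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(30,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Recall()])

# class_weight scales each sample's loss by its class weight; boosting the
# minority class (weight ~10 here, a hypothetical choice) raises its recall.
# model.fit(X_train, y_train, epochs=10, class_weight={0: 1.0, 1: 10.0})
```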

92 A company wants to predict sale prices of houses based on historical sales data. The target variable in the company's dataset is the sale price. The features are: lot size, living area measurements, non-living area measurements, number of bedrooms, number of bathrooms, year built, and postal code. The company wants to use multi-variable linear regression to predict house sale prices. Which step should the ML specialist take to remove irrelevant features and reduce the model's complexity? A. Plot a histogram of the features and compute their standard deviation. Remove features with high variance. B. Plot a histogram of the features and compute their standard deviation. Remove features with low variance. C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores. D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.

D
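
Option D in practice: compute each feature's correlation with the target and drop the weakly correlated ones. A short sketch with pandas; the file name, column names, and the 0.1 cutoff are assumptions for illustration.

```python
# Sketch of option D: feature-vs-target correlation check.
import pandas as pd

housing = pd.read_csv("sales_history.csv")  # hypothetical export of the sales data
corr_with_target = housing.corr(numeric_only=True)["sale_price"].drop("sale_price")

weak_features = corr_with_target[corr_with_target.abs() < 0.1].index.tolist()
print("Candidates to remove:", weak_features)
```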

