AWS Machine Learning 02
A Machine Learning Specialist previously trained a logistic regression model on a local machine using scikit-learn and now wishes to deploy it to production for the sole purpose of inference. What steps should be taken to ensure that Amazon SageMaker can host the locally trained model? A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR. B. Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the registry hostname and upload it to Amazon S3. C. Serialize the trained model so the format is compressed for deployment. Build the image and upload it to Docker Hub. D. Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon ECR.
A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR.
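Once the inference image is in Amazon ECR, the model can be registered for hosting with a single API call. A minimal boto3 sketch, assuming placeholder account, repository, bucket, and role names:

```python
import boto3

sm = boto3.client("sagemaker")

# Register the model with the inference image that was pushed to Amazon ECR
# (image URI, model artifact path, and role ARN are placeholders).
sm.create_model(
    ModelName="sklearn-logreg",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sklearn-inference:latest",
        "ModelDataUrl": "s3://my-bucket/models/logreg/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)
```

An endpoint configuration and endpoint can then be created from this model for real-time inference.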
A library is creating an Amazon Rekognition-based automated book-borrowing system. Amazon S3 buckets are used to store images of library users' faces. When members borrow books, the Amazon Rekognition CompareFaces API matches their actual faces to those stored in Amazon S3. To strengthen security, the library requires that photos be encrypted at rest and protected in transit whenever they are used with Amazon Rekognition. Additionally, the library must ensure that no photos are used to improve Amazon Rekognition as a service. How should a machine learning specialist design a system that satisfies these requirements? A. Enable server-side encryption on the S3 bucket. Submit an AWS Support ticket to opt out of allowing images to be used for improving the service, and follow the process provided by AWS Support. B. Switch to using an Amazon Rekognition collection to store the images. Use the IndexFaces and SearchFacesByImage API operations instead of the CompareFaces API operation. C. Switch to using the AWS GovCloud (US) Region for Amazon S3 to store images and for Amazon Rekognition to compare faces. Set up a VPN connection and only call the Amazon Rekognition API operations through the VPN. D. Enable client-side encryption on the S3 bucket. Set up a VPN connection and only call the Amazon Rekognition API operations through the VPN.
A. Enable server-side encryption on the S3 bucket. Submit an AWS Support ticket to opt out of allowing images to be used for improving the service, and follow the process provided by AWS Support. Rekognition API endpoints only support secure connections over HTTPS and all communication is encrypted in transit with TLS
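The encryption-at-rest piece maps to a single S3 bucket setting; the opt-out itself goes through AWS Support rather than an API. A minimal boto3 sketch, assuming a hypothetical bucket name and KMS key alias:

```python
import boto3

s3 = boto3.client("s3")

# Turn on default server-side encryption (SSE-KMS) for the face-image bucket
# (bucket name and KMS key alias are placeholders).
s3.put_bucket_encryption(
    Bucket="library-member-faces",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/library-faces-key",
                }
            }
        ]
    },
)
```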
A Machine Learning Specialist is employed by a multinational cybersecurity firm that handles real-time security events for businesses worldwide. The cybersecurity firm wants to develop a system that would enable it to employ machine learning to classify dangerous events as anomalies in data as it is consumed. Additionally, the corporation wishes to save the findings in its data lake for subsequent processing and analysis. Which method is the MOST EFFECTIVE for completing these tasks? A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3. B. Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to perform anomaly detection. Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake. C. Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3. D. Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform the new data. Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data.
A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3.
A specialist in machine learning is trying to construct a linear regression model. Given only the residual plot provided, what is the MOST LIKELY cause of the model's failure? A. Linear regression is inappropriate. The residuals do not have constant variance. B. Linear regression is inappropriate. The underlying data has outliers. C. Linear regression is appropriate. The residuals have a zero mean. D. Linear regression is appropriate. The residuals have constant variance.
A. Linear regression is inappropriate. The residuals do not have constant variance.
A data science team is developing a dataset repository to house a significant volume of training data that is often utilized in machine learning models. Given that Data Scientists may develop an infinite amount of new datasets each day, the solution must be scalable and cost-effective. Additionally, SQL exploration of the data must be possible. Which storage method is the MOST SUITABLE for this scenario? A. Store datasets as files in Amazon S3. B. Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance. C. Store datasets as tables in a multi-node Amazon Redshift cluster. D. Store datasets as global tables in Amazon DynamoDB.
A. Store datasets as files in Amazon S3.
Amazon Personalize is being used by a retail firm to deliver individualized product suggestions to consumers during a marketing campaign. The organization quickly notices a big rise in sales of suggested goods to current clients after the deployment of a new solution version, but these sales decline shortly thereafter. For training purposes, only historical data from prior to the marketing campaign is accessible. What adjustments should a data scientist make to the solution? A. Use the event tracker in Amazon Personalize to include real-time user interactions. B. Add user metadata and use the HRNN-Metadata recipe in Amazon Personalize. C. Implement a new solution using the built-in factorization machines (FM) algorithm in Amazon SageMaker. D. Add event type and event value fields to the interactions dataset in Amazon Personalize.
A. Use the event tracker in Amazon Personalize to include real-time user interactions.
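Streaming post-campaign interactions through an event tracker is what lets Amazon Personalize adapt beyond the pre-campaign training data. A minimal boto3 sketch, assuming placeholder tracking, user, session, and item IDs:

```python
from datetime import datetime

import boto3

events = boto3.client("personalize-events")

# Send a real-time interaction to the event tracker so behavior observed during
# the campaign reaches the solution (all IDs below are placeholders).
events.put_events(
    trackingId="example-tracking-id",
    userId="user-123",
    sessionId="session-456",
    eventList=[
        {
            "eventType": "purchase",
            "itemId": "item-789",
            "sentAt": datetime.now(),
        }
    ],
)
```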
A Machine Learning Specialist is packaging a custom ResNet model into a Docker container in order to train the model using Amazon SageMaker. The Specialist is training the model on Amazon EC2 P3 instances and wants to set up the Docker container to take advantage of the NVIDIA GPUs. What should the Specialist do? A. Bundle the NVIDIA drivers with the Docker image. B. Build the Docker container to be NVIDIA-Docker compatible. C. Organize the Docker container's file structure to execute on GPU instances. D. Set the GPU flag in the Amazon SageMaker CreateTrainingJob request body.
B. Build the Docker container to be NVIDIA-Docker compatible. If you plan to use GPU devices for model training, make sure that your containers are nvidia-docker compatible. Only the CUDA toolkit should be included on containers; don't bundle NVIDIA drivers with the image.
A data scientist is developing a custom recommendation model in Amazon SageMaker for an online retailer. Because of the nature of the company's products, customers purchase only 4-5 items every 5-10 years, so the business relies on a steady stream of new customers. When a new customer registers, the company collects data on the customer's preferences. Below is a sample of the data available to the data scientist. For this use case, how should the data scientist split the dataset into a training and test set? A. Shuffle all interaction data. Split off the last 10% of the interaction data for the test set. B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set. C. Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set. D. Randomly select 10% of the users. Split off all interaction data from these users for the test set.
B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set.
A real estate firm wishes to develop a machine learning model capable of forecasting home values using a historical dataset. 32 features are included in the dataset. Which model is most appropriate for the business requirement? A. Logistic regression B. Linear regression C. K-means D. Principal component analysis (PCA)
B. Linear regression
A machine learning (ML) specialist is configuring model monitoring on a production Amazon SageMaker endpoint. When Amazon SageMaker Model Monitor detects violations on the SageMaker endpoint, the ML specialist retrains the model with the latest dataset, which is statistically representative of the current production traffic. The ML specialist notices that the SageMaker endpoint continues to report violations even after deploying the new SageMaker model and running the first monitoring job. What should the ML specialist do to rectify the violations? A. Manually trigger the monitoring job to re-evaluate the SageMaker endpoint traffic sample. B. Run the Model Monitor baseline job again on the new training set. Configure Model Monitor to use the new baseline. C. Delete the endpoint and recreate it with the original configuration. D. Retrain the model again by using a combination of the original training set and the new training set.
B. Run the Model Monitor baseline job again on the new training set. Configure Model Monitor to use the new baseline.
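A minimal sketch of re-running the baseline job with the SageMaker Python SDK so that the constraints and statistics reflect the retrained model; the role and S3 paths are placeholders, and the monitoring schedule would then be pointed at the new baseline output:

```python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Suggest a new baseline from the dataset the model was retrained on.
monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/monitoring/new-training-set.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitoring/baseline-v2/",
    wait=True,
)
```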
A machine learning specialist created a deep learning model for picture categorization. The Specialist, on the other hand, encountered an overfitting issue, with training and testing accuracies of 99 percent and 75%, respectively. How should the Specialist approach this situation and what is the underlying cause? A. The learning rate should be increased because the optimization process was trapped at a local minimum. B. The dropout rate at the flatten layer should be increased because the model is not generalized enough. C. The dimensionality of dense layer next to the flatten layer should be increased because the model is not complex enough. D. The epoch number should be increased because the optimization process was terminated before it reached the global minimum.
B. The dropout rate at the flatten layer should be increased because the model is not generalized enough.
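A minimal Keras sketch showing where the increased dropout sits relative to the flatten layer; the architecture and rates are illustrative, not the Specialist's actual network:

```python
import tensorflow as tf

# Raising the dropout rate after the flatten layer forces the classifier to
# generalize instead of memorizing the training images.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),   # increase the dropout rate here
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```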
On Amazon SageMaker, a Machine Learning team runs its own training algorithm. External assets are required for the training algorithm. The team must submit to Amazon SageMaker both its own algorithm code and algorithm-specific parameters. Which services should the team combine in order to create a bespoke algorithm in Amazon SageMaker? (Select two.) A. AWS Secrets Manager B. AWS CodeStar C. Amazon ECR D. Amazon ECS E. Amazon S3
C. Amazon ECR E. Amazon S3
A Machine Learning Specialist is enabling Amazon SageMaker to provide simultaneous access to notebooks, model training, and endpoint deployment by numerous Data Scientists. To guarantee optimal operational performance, the Specialist must be able to monitor the frequency with which the Scientists deploy models, the GPU and CPU use of deployed SageMaker endpoints, and any issues that occur when an endpoint is called. Which services are linked with Amazon SageMaker for the purpose of tracking this data? (Select two.) A. AWS CloudTrail B. AWS Health C. AWS Trusted Advisor D. Amazon CloudWatch E. AWS Config
A. AWS CloudTrail D. Amazon CloudWatch
Using a dataset of 100 continuous numerical features, a Data Scientist is developing a model to predict customer attrition. The Marketing department has offered no guidance on which features are significant for churn prediction. The Marketing department wants to interpret the model and determine the direct effect of significant features on the model's output. While training a logistic regression model, the Data Scientist observes a significant gap between the accuracy of the training and validation sets. Which techniques can the Data Scientist employ to optimize the model's performance and meet the Marketing team's needs? (Select two.) A. Add L1 regularization to the classifier B. Add features to the dataset C. Perform recursive feature elimination D. Perform t-distributed stochastic neighbor embedding (t-SNE) E. Perform linear discriminant analysis
A. Add L1 regularization to the classifier C. Perform recursive feature elimination. L1 regularization drives the coefficients of uninformative features toward zero, and recursive feature elimination iteratively removes the weakest features, so both reduce overfitting while keeping the model interpretable for the Marketing team; see the sketch below.
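A minimal scikit-learn sketch of both techniques on synthetic data standing in for the 100 churn features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 100-feature churn dataset.
X, y = make_classification(n_samples=5000, n_features=100, n_informative=15, random_state=0)

# L1-regularized logistic regression: uninformative coefficients shrink to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# Recursive feature elimination keeps only the strongest features.
selector = RFE(estimator=l1_model, n_features_to_select=20)
selector.fit(X, y)

print("Selected feature indices:", selector.support_.nonzero()[0])
```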
A machine learning (ML) specialist must extract embedding vectors from a text sequence. The objective is to create a feature space that is ready to be ingested by a data scientist for developing downstream ML prediction models. The content is composed of curated English sentences. Many sentences contain identical words but in different contexts. Questions and answers are interspersed throughout the sentences, and the embedding space must distinguish between them. Which solutions can generate the embedding vectors necessary to capture word context and sequential question-and-answer information? (Select two.) A. Amazon SageMaker seq2seq algorithm B. Amazon SageMaker BlazingText algorithm in Skip-gram mode C. Amazon SageMaker Object2Vec algorithm D. Amazon SageMaker BlazingText algorithm in continuous bag-of-words (CBOW) mode E. Combination of the Amazon SageMaker BlazingText algorithm in Batch Skip-gram mode with a custom recurrent neural network (RNN)
A. Amazon SageMaker seq2seq algorithm C. Amazon SageMaker Object2Vec algorithm
A retailer aims to classify new items using machine learning. The Data Science team was presented with a labeled dataset of current goods. There are 1,200 goods in the dataset. Each product in the labeled dataset includes 15 attributes, including its title, dimensions, weight, and price. Each item is tagged with a category, such as books, games, gadgets, or movies. Which model should be used to classify new items using the training data provided? A. An XGBoost model where the objective parameter is set to multi:softmax B. A deep convolutional neural network (CNN) with a softmax activation function for the last layer C. A regression forest where the number of trees is set equal to the number of product categories D. A DeepAR forecasting model based on a recurrent neural network (RNN)
A. An XGBoost model where the objective parameter is set to multi:softmax
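A minimal XGBoost sketch with the multi:softmax objective on random data shaped like the product catalog; the real 15 features and 4 category labels would replace the random arrays:

```python
import numpy as np
import xgboost as xgb

# Toy stand-in: 1,200 products, 15 features, 4 categories (books, games, gadgets, movies).
rng = np.random.default_rng(0)
X = rng.random((1200, 15))
y = rng.integers(0, 4, size=1200)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "multi:softmax",   # multiclass classification, predicts the class label
    "num_class": 4,
    "max_depth": 6,
    "eta": 0.2,
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```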
A machine learning (ML) expert is tasked with the responsibility of creating a categorization model for a financial services organization. The dataset, which is tabular in nature and has 10,000 rows and 1,020 characteristics, is provided by a domain expert. The expert discovers no missing values and just a tiny fraction of duplicate rows during exploratory data analysis. Correlation coefficients greater than 0.9 exist for 200 feature pairings. Each feature's mean value is about equal to its 50th percentile. Which feature engineering technique should the machine learning expert use while working with Amazon SageMaker? A. Apply dimensionality reduction by using the principal component analysis (PCA) algorithm. B. Drop the features with low correlation scores by using a Jupyter notebook. C. Apply anomaly detection by using the Random Cut Forest (RCF) algorithm. D. Concatenate the features with high correlation scores by using a Jupyter notebook.
A. Apply dimensionality reduction by using the principal component analysis (PCA) algorithm.
A financial services firm wants to make Amazon SageMaker its primary data science environment. On sensitive financial data, the company's data scientists run machine learning (ML) models. The organization is concerned about data egress and desires the services of a machine learning engineer to safeguard the environment. Which methods does the machine learning engineer have at his disposal to manage data egress from SageMaker? (Select three.) A. Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink. B. Use SCPs to restrict access to SageMaker. C. Disable root access on the SageMaker notebook instances. D. Enable network isolation for training jobs and models. E. Restrict notebook presigned URLs to specific IPs used by the company. F. Protect data with encryption at rest and in transit. Use AWS Key Management Service (AWS KMS) to manage encryption keys.
A. Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink. D. Enable network isolation for training jobs and models. E. Restrict notebook presigned URLs to specific IPs used by the company.
A business distributes wholesale apparel to thousands of retail locations. A data scientist must develop a model that forecasts each item's daily sales volume in each retailer. The data scientist learns that over half of the shops are less than six months old. Weekly sales data is quite stable. Weekly aggregates of daily data from the database were created, and weeks with no sales were excluded from the present dataset. Amazon S3 stores five years' worth of sales data (100 MB). Which variables will have a negative influence on the performance of the forecast model being constructed, and what mitigation measures can the data scientist take? (Select two.) A. Detecting seasonality for the majority of stores will be an issue. Request categorical data to relate new stores with similar stores that have more historical data. B. The sales data does not have enough variance. Request external sales data from other industries to improve the model's ability to generalize. C. Sales data is aggregated by week. Request daily sales data from the source database to enable building a daily model. D. The sales data is missing zero entries for item sales. Request that item sales data from the source database include zero entries to enable building the model. E. Only 100 MB of sales data is available in Amazon S3. Request 10 years of sales data, which would provide 200 MB of training data for the model.
A. Detecting seasonality for the majority of stores will be an issue. Request categorical data to relate new stores with similar stores that have more historical data. C. Sales data is aggregated by week. Request daily sales data from the source database to enable building a daily model.
A retailer sells its items through a worldwide online marketplace. The organization wants to evaluate client feedback and find specific areas for improvement using machine learning (ML). A developer has created a program that scrapes consumer evaluations from online marketplaces and saves them to an Amazon S3 bucket. This procedure generates a dataset of forty reviews. A data scientist developing machine learning models must locate extra data sources to augment the dataset. Which data sources should the data scientist use in order to supplement the review dataset? (Select three.) A. Emails exchanged by customers and the company's customer service agents B. Social media posts containing the name of the company or its products C. A publicly available collection of news articles D. A publicly available collection of customer reviews E. Product sales revenue figures for the company F. Instruction manuals for the company's products
A. Emails exchanged by customers and the company's customer service agents B. Social media posts containing the name of the company or its products D. A publicly available collection of customer reviews
A business offers hundreds of items on a public website and wishes to detect products with possible durability issues automatically. The firm has 1,000 reviews with fields for date, star rating, review text, review summary, and customer email; however, many reviews are incomplete and have blank fields. Each review has been pre-labeled with the appropriate durability result. A machine learning expert must develop a model to recognize reviews indicating concerns about a product's durability. The first model must be trained and available for evaluation within two days. What is the MOST DIRECT method for resolving this issue in two days? A. Train a custom classifier by using Amazon Comprehend. B. Build a recurrent neural network (RNN) in Amazon SageMaker by using Gluon and Apache MXNet. C. Train a built-in BlazingText model using Word2Vec mode in Amazon SageMaker. D. Use a built-in seq2seq model in Amazon SageMaker.
A. Train a custom classifier by using Amazon Comprehend.
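A minimal boto3 sketch of starting the custom classifier training on the labeled reviews; the bucket, role, and classifier name are placeholders:

```python
import boto3

comprehend = boto3.client("comprehend")

# Kick off training on a CSV of label,review_text rows stored in S3.
comprehend.create_document_classifier(
    DocumentClassifierName="durability-review-classifier",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={"S3Uri": "s3://my-bucket/reviews/labeled_reviews.csv"},
    LanguageCode="en",
)
```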
A business is in possession of video feeds and photographs from a metro train stop. The business intends to develop a deep learning model that will notify the station management if any passenger crosses the yellow safety line when no train is present. The notification will be based on video feeds. The corporation wants the model to recognize the yellow line, the people who cross it, and the trains in the video feeds. This task requires labeling. The video data must be kept private. A data scientist labels the sample data with bounding boxes and then applies an object detection model. However, the object detection model cannot distinguish effectively between the yellow line, the people who cross it, and the trains. Which labeling strategy will help the firm improve this model? A. Use Amazon Rekognition Custom Labels to label the dataset and create a custom Amazon Rekognition object detection model. Create a private workforce. Use Amazon Augmented AI (Amazon A2I) to review the low-confidence predictions and retrain the custom Amazon Rekognition model. B. Use an Amazon SageMaker Ground Truth object detection labeling task. Use Amazon Mechanical Turk as the labeling workforce. C. Use Amazon Rekognition Custom Labels to label the dataset and create a custom Amazon Rekognition object detection model. Create a workforce with a third-party AWS Marketplace vendor. Use Amazon Augmented AI (Amazon A2I) to review the low-confidence predictions and retrain the custom Amazon Rekognition model. D. Use an Amazon SageMaker Ground Truth semantic segmentation labeling task. Use a private workforce as the labeling workforce.
A. Use Amazon Rekognition Custom Labels to label the dataset and create a custom Amazon Rekognition object detection model. Create a private workforce. Use Amazon Augmented AI (Amazon A2I) to review the low-confidence predictions and retrain the custom Amazon Rekognition model.
A media corporation with a large collection of unlabeled photographs, text, audio, and video footage seeks to index its assets in order to enable the Research team to quickly identify relevant information. The firm wishes to use machine learning in order to expedite the work of its in-house researchers, who have minimal experience with machine learning. Which approach is the FASTEST for indexing the assets? A. Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct categories/classes. B. Create a set of Amazon Mechanical Turk Human Intelligence Tasks to label all footage. C. Use Amazon Transcribe to convert speech to text. Use the Amazon SageMaker Neural Topic Model (NTM) and Object Detection algorithms to tag data into distinct categories/classes. D. Use the AWS Deep Learning AMI and Amazon EC2 GPU instances to create custom models for audio transcription and topic modeling, and use object detection to tag data into distinct categories/classes.
A. Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct categories/classes.
A large financial institution is automating its loan approval process via the use of machine learning. The business maintains a dataset of customer information. The dataset includes categorical fields such as the customer's city of residence and housing status. Additionally, the dataset contains financial fields expressed in different units, such as account balances in US dollars and monthly interest in US cents. The company's data scientists infer each customer's credit score using a gradient boosting regression model. The model has a 99 percent training accuracy and a 75 percent testing accuracy. The data scientists want to increase the model's testing accuracy. Which process will have the greatest impact on testing accuracy? A. Use a one-hot encoder for the categorical fields in the dataset. Perform standardization on the financial fields in the dataset. Apply L1 regularization to the data. B. Use tokenization of the categorical fields in the dataset. Perform binning on the financial fields in the dataset. Remove the outliers in the data by using the z-score. C. Use a label encoder for the categorical fields in the dataset. Perform L1 regularization on the financial fields in the dataset. Apply L2 regularization to the data. D. Use a logarithm transformation on the categorical fields in the dataset. Perform binning on the financial fields in the dataset. Use imputation to populate missing values in the dataset.
A. Use a one-hot encoder for the categorical fields in the dataset. Perform standardization on the financial fields in the dataset. Apply L1 regularization to the data.
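A scikit-learn sketch of the same recipe (one-hot encoding, standardization, L1 regularization) using hypothetical column names; the company's actual model is a gradient boosting regressor, so this only illustrates the preprocessing steps:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names standing in for the bank's dataset.
categorical_cols = ["city", "housing_status"]
numeric_cols = ["balance_usd", "monthly_interest_cents"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),   # puts dollars and cents on a common scale
])

# L1 penalty adds the regularization step named in the answer.
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", C=0.5)),
])
```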
A Machine Learning Specialist works for an online retailer that wants to run analytics on every customer visit using a machine learning pipeline. The data must be ingested at a rate of up to 100 transactions per second using Amazon Kinesis Data Streams, and the JSON data blob is 100 KB in size. What is the MINIMUM number of shards the Specialist should use in Kinesis Data Streams to ingest this data effectively? A. 1 shard B. 10 shards C. 100 shards D. 1,000 shards
B. 10 shards. 100 KB per record × 100 transactions per second = 10,000 KB/s = 10 MB/s of incoming data; each shard supports up to 1 MB/s of writes, so 10 MB/s ÷ 1 MB/s per shard = 10 shards.
A Machine Learning Specialist is responsible for preparing data for training by moving and transforming it. Certain data must be handled in near-real time, while others may be transferred on an hourly basis. There are already existing Amazon EMR MapReduce operations for data cleaning and feature engineering. Which of the following services are capable of supplying data to MapReduce jobs? (Select two.) A. AWS DMS B. Amazon Kinesis C. AWS Data Pipeline D. Amazon Athena E. Amazon ES
B. Amazon Kinesis C. AWS Data Pipeline
At a bank, a data engineer is assessing a new tabular dataset that contains customer data. The data engineer will use customer data to develop a new model for forecasting consumer behavior. The data engineer observes that several of the 100 characteristics are significantly associated with one another after constructing a correlation matrix for the variables. Which procedures should the data engineer use in order to resolve this issue? (Select two.) A. Use a linear-based algorithm to train the model. B. Apply principal component analysis (PCA). C. Remove a portion of highly correlated features from the dataset. D. Apply min-max feature scaling to the dataset. E. Apply one-hot encoding category-based variables.
B. Apply principal component analysis (PCA). C. Remove a portion of highly correlated features from the dataset.
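A minimal pandas/scikit-learn sketch of both selected procedures on synthetic data; the column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy frame standing in for the 100-feature customer dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 100)), columns=[f"f{i}" for i in range(100)])

# Option C: drop one feature from every highly correlated pair (|r| > 0.9).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

# Option B: project the remaining features onto uncorrelated principal components.
components = PCA(n_components=0.95).fit_transform(reduced)
```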
A firm that encourages good sleep habits through cloud-connected devices currently hosts a sleep monitoring application on AWS. The application gathers device usage information from its users. The company's Data Science team is developing a machine learning model to predict if and when a user will stop using the company's devices. The model's predictions are used by a downstream application to determine the best way to engage users. The Data Science team is developing multiple versions of the machine learning model and evaluating each version against the company's business goals. To determine long-term model performance, the team wants to run multiple versions in parallel for extended periods of time, with the ability to control the portion of inferences served by each model. Which solution satisfies these requirements with the LEAST effort? A. Build and host multiple models in Amazon SageMaker. Create multiple Amazon SageMaker endpoints, one for each model. Programmatically control invoking different models for inference at the application layer. B. Build and host multiple models in Amazon SageMaker. Create an Amazon SageMaker endpoint configuration with multiple production variants. Programmatically control the portion of the inferences served by the multiple models by updating the endpoint configuration. C. Build and host multiple models in Amazon SageMaker Neo to take into account different types of medical devices. Programmatically control which model is invoked for inference based on the medical device type. D. Build and host multiple models in Amazon SageMaker. Create a single endpoint that accesses multiple models. Use Amazon SageMaker batch transform to control invoking the different models through the single endpoint.
B. Build and host multiple models in Amazon SageMaker. Create an Amazon SageMaker endpoint configuration with multiple production variants. Programmatically control the portion of the inferences served by the multiple models by updating the endpoint configuration.
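A minimal boto3 sketch of shifting the share of traffic served by each production variant on an existing endpoint; the endpoint name, variant names, and weights are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

# Route 20% of live inference traffic to the challenger variant without
# touching the application layer.
sm.update_endpoint_weights_and_capacities(
    EndpointName="sleep-churn-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "ModelA", "DesiredWeight": 0.8},
        {"VariantName": "ModelB", "DesiredWeight": 0.2},
    ],
)
```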
A data scientist must discover fake user accounts on an ecommerce platform for a business. The organization wants to establish whether a freshly created account is connected to a previously identified fraudulent user. The data scientist is using AWS Glue to cleanse the company's application logs during ingestion. Which technique will enable the data scientist to detect bogus accounts? A. Execute the built-in FindDuplicates Amazon Athena query. B. Create a FindMatches machine learning transform in AWS Glue. C. Create an AWS Glue crawler to infer duplicate accounts in the source data. D. Search for duplicate accounts in the AWS Glue Data Catalog.
B. Create a FindMatches machine learning transform in AWS Glue.
A data engineer is using AWS Glue to improve and protect Amazon S3 datasets. The data science team needs direct access to the ETL scripts from Amazon SageMaker notebooks contained within a VPC. Once this configuration is complete, the data science team wants to be able to execute the AWS Glue job and trigger the SageMaker training process. Which actions should the data engineer perform in combination to achieve these requirements? (Select three.) A. Create a SageMaker development endpoint in the data science team's VPC. B. Create an AWS Glue development endpoint in the data science team's VPC. C. Create SageMaker notebooks by using the AWS Glue development endpoint. D. Create SageMaker notebooks by using the SageMaker console. E. Attach a decryption policy to the SageMaker notebooks. F. Create an IAM policy and an IAM role for the SageMaker notebooks.
B. Create an AWS Glue development endpoint in the data science team's VPC. C. Create SageMaker notebooks by using the AWS Glue development endpoint. F. Create an IAM policy and an IAM role for the SageMaker notebooks.
A firm operates a vast number of factories and maintains a complicated supply chain in which an unexpected breakdown of a machine might result in the suspension of operations at multiple plants. A data scientist wishes to examine factory sensor data in order to detect equipment in need of preventative maintenance and then deploy a repair crew to avoid unscheduled downtime. A single machine's sensor data may include up to 200 data points, including temperatures, voltages, vibrations, RPMs, and pressure measurements. The firm installed Wi-Fi and LANs across the plants to capture this sensor data. Although many industrial sites lack stable or high-speed internet access, the manufacturer wants to retain near-real-time inference capabilities. Which model deployment architecture will satisfy these business requirements? A. Deploy the model in Amazon SageMaker. Run sensor data through this model to predict which machines need maintenance. B. Deploy the model on AWS IoT Greengrass in each factory. Run sensor data through this model to infer which machines need maintenance. C. Deploy the model to an Amazon SageMaker batch transformation job. Generate inferences in a daily batch report to identify machines that need maintenance. D. Deploy the model in Amazon SageMaker and use an IoT rule to write data to an Amazon DynamoDB table. Consume a DynamoDB stream from the table with an AWS Lambda function to invoke the endpoint.
B. Deploy the model on AWS IoT Greengrass in each factory. Run sensor data through this model to infer which machines need maintenance. For latency-sensitive use cases and for use cases that require analyzing large amounts of streaming data, it may not be possible to run ML inference in the cloud, and cloud connectivity may not always be available. For these use cases, deploy the ML model close to the data source using SageMaker Neo and AWS IoT Greengrass: build the model (for example, in TensorFlow), compile it for the edge device (for example, an NVIDIA Jetson) with SageMaker Neo, and run it at the edge with AWS IoT Greengrass.
A Data Scientist is developing a linear regression model and evaluating the statistical significance of each coefficient using the derived p-values. The Data Scientist observes that the majority of the features in the dataset are normally distributed. The image depicts the plot of a single feature from the dataset. Which transformation should the Data Scientist apply to satisfy the statistical assumptions of the linear regression model? A. Exponential transformation B. Logarithmic transformation C. Polynomial transformation D. Sinusoidal transformation
B. Logarithmic transformation
A Data Scientist is required to analyze employment data. The dataset comprises roughly ten million observations of individuals across ten distinct features. During the preliminary analysis, the Data Scientist discovers that the income and age distributions are not normal. While income levels exhibit the anticipated right skew, with fewer individuals earning more, the age distribution also exhibits a right skew, with fewer older individuals participating in the workforce. Which feature transformations can the Data Scientist apply to correct the skewed data? (Select two.) A. Cross-validation B. Numerical value binning C. High-degree polynomial transformation D. Logarithmic transformation E. One hot encoding
B. Numerical value binning D. Logarithmic transformation
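A minimal pandas/NumPy sketch of both transformations on synthetic right-skewed columns standing in for the real income and age data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed income and age columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=10_000),
    "age": rng.gamma(shape=2.0, scale=12.0, size=10_000) + 18,
})

# Logarithmic transformation compresses the long right tail.
df["income_log"] = np.log1p(df["income"])

# Numerical value binning groups ages into equal-frequency buckets.
df["age_bin"] = pd.qcut(df["age"], q=10, labels=False)
```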
Every minute, a monitoring service creates 1 TB of scale metrics record data. A research team uses Amazon Athena to execute queries on this data. Due to the high volume of data, the queries execute slowly, and the team requires improved performance. How should the records be stored in Amazon S3 to optimize query performance? A. CSV files B. Parquet files C. Compressed JSON D. RecordIO
B. Parquet files
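A minimal pandas sketch of converting a batch of the records to columnar, compressed Parquet before landing it in S3; the paths are placeholders, and the pyarrow and s3fs packages are assumed to be installed:

```python
import pandas as pd

# Read one batch of raw JSON-lines metric records and rewrite it as Parquet,
# which Athena can scan column-by-column instead of row-by-row.
df = pd.read_json("s3://metrics-raw/2024/05/01/records.json", lines=True)
df.to_parquet(
    "s3://metrics-curated/dt=2024-05-01/records.parquet",
    engine="pyarrow",
    compression="snappy",
)
```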
An insurance firm is creating a new automobile device that uses a camera to monitor drivers' behavior and alerts them when they appear to be distracted. The organization created roughly 10,000 training photos in a controlled setting that a Machine Learning Specialist will use to train and assess machine learning models. During the model assessment, the Specialist sees that as the number of epochs grows, the training error rate decreases faster, and the model is unable to effectively infer on unseen test pictures. Which of the following approaches should be used to remedy this situation? (Select two.) A. Add vanishing gradient to the model. B. Perform data augmentation on the training data. C. Make the neural network architecture complex. D. Use gradient checking in the model. E. Add L2 regularization to the model
B. Perform data augmentation on the training data. E. Add L2 regularization to the model
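A minimal Keras sketch combining both remedies; the layer sizes and augmentation settings are illustrative only:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    # Data augmentation: random flips, rotations, and zooms expand the
    # 10,000 controlled images during training.
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    # L2 (weight decay) regularization discourages overfitting.
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(2, activation="softmax"),   # distracted vs. attentive
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```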
A machine learning professional is running an Amazon SageMaker endpoint on a P3 instance and using the built-in object detection algorithm to make real-time predictions in a production application. When the professional examines the model's resource consumption, he or she sees that the model is using only a fraction of the GPU. Which architectural change would maximize the use of provisioned resources? A. Redeploy the model as a batch transform job on an M5 instance. B. Redeploy the model on an M5 instance. Attach Amazon Elastic Inference to the instance. C. Redeploy the model on a P3dn instance. D. Deploy the model onto an Amazon Elastic Container Service (Amazon ECS) cluster using a P3 instance.
B. Redeploy the model on an M5 instance. Attach Amazon Elastic Inference to the instance. Amazon Elastic Inference (EI) is a resource you can attach to your Amazon EC2 CPU instances to accelerate your deep learning (DL) inference workloads. Amazon EI accelerators come in multiple sizes and are a cost-effective method to build intelligent capabilities into applications running on Amazon EC2 instances.
Although a Machine Learning Specialist developed a regression model, the first iteration requires optimization. The Specialist must determine whether the model over- or underestimates the target more often. Which option can the Specialist use to determine whether the target value is being over- or underestimated? A. Root Mean Square Error (RMSE) B. Residual plots C. Area under the curve D. Confusion matrix
B. Residual plots
A manufacturing business uses an Amazon S3 bucket to store structured and unstructured data. A Machine Learning Specialist wants to query this data using SQL. Which option requires the LEAST amount of effort to query this data? A. Use AWS Data Pipeline to transform the data and Amazon RDS to run queries. B. Use AWS Glue to catalogue the data and Amazon Athena to run queries. C. Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries. D. Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.
B. Use AWS Glue to catalogue the data and Amazon Athena to run queries.
A data storage solution for Amazon SageMaker is being developed by a machine learning specialist. There is already a TensorFlow-based model developed as a train.py script that makes use of static training data saved as TFRecords. Which approach of supplying training data to Amazon SageMaker would satisfy business needs with the LEAST amount of development time? A. Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training invocation to the local path of the data without reformatting the training data. B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data. C. Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf data instead of TFRecords. D. Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to reformat and store the data in an Amazon S3 bucket.
B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data.
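A minimal SageMaker Python SDK sketch of launching train.py unchanged in script mode with the TFRecords in S3; the role ARN, bucket path, and framework version are placeholders:

```python
from sagemaker.tensorflow import TensorFlow

# Script mode runs the existing train.py as-is inside the managed TensorFlow container.
estimator = TensorFlow(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
)

# The training channel points at the unmodified TFRecord files in S3.
estimator.fit({"training": "s3://my-bucket/tfrecords/"})
```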
A data scientist created a machine learning translation model for English to Japanese by combining 500,000 aligned phrase pairs with Amazon SageMaker's built-in seq2seq method. The data scientist discovers that the translation quality is acceptable for a five-word example while testing with sample sentences. However, the quality degrades to an unsatisfactory level when the statement exceeds 100 words in length. Which course of action will remedy the issue? A. Change preprocessing to use n-grams. B. Add more nodes to the recurrent neural network (RNN) than the largest sentence's word count. C. Adjust hyperparameters related to the attention mechanism. D. Choose a different weight initialization type.
C. Adjust hyperparameters related to the attention mechanism.
A data scientist is working with an Amazon SageMaker notebook and requires safe access to data stored in an Amazon S3 bucket. How is this to be accomplished by the data scientist? A. Add an S3 bucket policy allowing GetObject, PutObject, and ListBucket permissions to the Amazon SageMaker notebook ARN as principal. B. Encrypt the objects in the S3 bucket with a custom AWS Key Management Service (AWS KMS) key that only the notebook owner has access to. C. Attach the policy to the IAM role associated with the notebook that allows GetObject, PutObject, and ListBucket operations to the specific S3 bucket. D. Use a script in a lifecycle configuration to configure the AWS CLI on the instance with an access key ID and secret.
C. Attach the policy to the IAM role associated with the notebook that allows GetObject, PutObject, and ListBucket operations to the specific S3 bucket.
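A minimal boto3 sketch of attaching a scoped S3 policy to the notebook's execution role; the role name and bucket are placeholders:

```python
import json

import boto3

iam = boto3.client("iam")

# Inline policy granting the notebook role access to one specific bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-ml-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my-ml-bucket",
        },
    ],
}

iam.put_role_policy(
    RoleName="SageMakerNotebookExecutionRole",
    PolicyName="NotebookS3Access",
    PolicyDocument=json.dumps(policy),
)
```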
A business is establishing an Amazon SageMaker environment. Communication through the internet is prohibited under the business data security policy. How can the Amazon SageMaker service be enabled without also authorizing direct internet access to Amazon SageMaker notebook instances? A. Create a NAT gateway within the corporate VPC. B. Route Amazon SageMaker traffic through an on-premises network. C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC. D. Create VPC peering with Amazon VPC hosting Amazon SageMaker.
C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC.
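A minimal boto3 sketch of creating the interface endpoints for the SageMaker API and runtime inside the corporate VPC; the VPC, subnet, security group IDs, and Region are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# One interface endpoint each for the SageMaker API and the runtime (inference) API.
for service in ("com.amazonaws.us-east-1.sagemaker.api",
                "com.amazonaws.us-east-1.sagemaker.runtime"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName=service,
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )
```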
A business is developing a new recommendation engine. To develop tailored suggestions, machine learning (ML) professionals must constantly integrate fresh data from consumers. The ML professionals collect data from users' interactions with the platform and from other sources such as websites and social media. On a daily basis, the pipeline cleans, converts, enriches, and compresses terabytes of data, which is then stored in Amazon S3. To accomplish the task, a series of Python scripts was written and currently runs on a large Amazon EC2 instance. The whole procedure takes over 20 hours to complete, with each script needing at least an hour to complete. The organization wants to migrate the scripts away from Amazon EC2 and onto a better managed solution that eliminates the need for server maintenance. Which approach will satisfy all of these needs with the LEAST amount of development work? A. Load the data into an Amazon Redshift cluster. Execute the pipeline by using SQL. Store the results in Amazon S3. B. Load the data into Amazon DynamoDB. Convert the scripts to an AWS Lambda function. Execute the pipeline by triggering Lambda executions. Store the results in Amazon S3. C. Create an AWS Glue job. Convert the scripts to PySpark. Execute the pipeline. Store the results in Amazon S3. D. Create a set of individual AWS Lambda functions to execute each of the scripts. Build a step function by using the AWS Step Functions Data Science SDK. Store the results in Amazon S3.
C. Create an AWS Glue job. Convert the scripts to PySpark. Execute the pipeline. Store the results in Amazon S3.
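A skeleton of one Python script converted to a PySpark Glue job; the catalog database, table, field names, and output path are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw interaction data registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="recommendation_raw", table_name="interactions"
)

# Clean and transform: drop unneeded fields and fix an ambiguous column type.
cleaned = raw.drop_fields(["debug_payload"]).resolveChoice(specs=[("rating", "cast:double")])

# Write the enriched, compressed output back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/interactions/"},
    format="parquet",
)
job.commit()
```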
A business is digitizing a significant amount of unstructured paper receipts. The organization wants to develop a model using natural language processing (NLP) in order to discover relevant entities such as date, location, and notes, as well as certain custom entities such as receipt numbers. The firm extracts text for data labeling using optical character recognition (OCR). However, since documents are structured and formatted differently, the organization is having difficulty setting up manual workflows for each document type. Additionally, the organization trained a named entity recognition (NER) model for custom entity identification using a small sample size. This model has a low confidence score and will need to be retrained using a larger dataset. Which text extraction and entity identification solution will require the LEAST effort? A. Extract text from receipt images by using Amazon Textract. Use the Amazon SageMaker BlazingText algorithm to train on the text for entities and custom entities. B. Extract text from receipt images by using a deep learning OCR model from the AWS Marketplace. Use the NER deep learning model to extract entities. C. Extract text from receipt images by using Amazon Textract. Use Amazon Comprehend for entity detection, and use Amazon Comprehend custom entity recognition for custom entity detection. D. Extract text from receipt images by using a deep learning OCR model from the AWS Marketplace. Use Amazon Comprehend for entity detection, and use Amazon Comprehend custom entity recognition for custom entity detection.
C. Extract text from receipt images by using Amazon Textract. Use Amazon Comprehend for entity detection, and use Amazon Comprehend custom entity recognition for custom entity detection.
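A minimal boto3 sketch of the Textract-plus-Comprehend flow; the bucket and key are placeholders, and a trained custom entity recognizer endpoint would be called the same way for receipt numbers:

```python
import boto3

textract = boto3.client("textract")
comprehend = boto3.client("comprehend")

# OCR the receipt image stored in S3.
ocr = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "receipt-images", "Name": "receipts/0001.png"}}
)
text = " ".join(b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE")

# Built-in entity detection (DATE, LOCATION, and so on).
entities = comprehend.detect_entities(Text=text, LanguageCode="en")
print([(e["Type"], e["Text"]) for e in entities["Entities"]])
```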
A Data Scientist was given a collection of insurance records, each of which had an ID for the record, the final result from 200 possible outcomes, and the date of the final outcome. Additionally, some limited information on the substance of claims is supplied, although only for a handful of the 200 categories. Hundreds of records have been provided during the last three years for each result category. The Data Scientist wants to forecast the number of claims in each category month by month, many months in advance. Which machine learning approach should be used? A. Classification month-to-month using supervised learning of the 200 categories based on claim contents. B. Reinforcement learning using claim IDs and timestamps where the agent will identify how many claims in each category to expect from month to month. C. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month. D. Classification with supervised learning of the categories for which partial information on claim contents is provided, and forecasting using claim IDs and timestamps for all other categories.
C. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month.
A machine learning (ML) expert is using Amazon SageMaker hyperparameter optimization (HPO) to enhance the accuracy of a model. The following HPO setup specifies the learning rate parameter: {"Name":"learning_rate", "MaxValue":"0.0001", "MinValue":"0.1"} During the results analysis, the ML expert concludes that the majority of training jobs had a learning rate between 0.01 and 0.1, while the best result had a learning rate of less than 0.01. Training jobs must be run on a consistent basis against a changing dataset. The ML expert must devise a tuning mechanism that distributes the selected learning rates more evenly over the range between MinValue and MaxValue. Which of the following solutions produces the MOST accurate result? A. Modify the HPO configuration as follows: {"Name":"learning_rate", "MaxValue":"0.0001", "MinValue":"0.1", "ScalingType":"ReverseLogarithmic"} Select the most accurate hyperparameter configuration from this HPO job. B. Run three different HPO jobs that use different learning rates from the following intervals for MinValue and MaxValue while using the same number of training jobs for each HPO job: [0.01, 0.1], [0.001, 0.01], and [0.0001, 0.001]. Select the most accurate hyperparameter configuration from these three HPO jobs. C. Modify the HPO configuration as follows: {"Name":"learning_rate", "MaxValue":"0.0001", "MinValue":"0.1", "ScalingType":"Logarithmic"} Select the most accurate hyperparameter configuration from this HPO job. D. Run three different HPO jobs that use different learning rates from the following intervals for MinValue and MaxValue: [0.01, 0.1], [0.001, 0.01], and [0.0001, 0.001]. Divide the number of training jobs for each HPO job by three. Select the most accurate hyperparameter configuration from these three HPO jobs.
C. Modify the HPO configuration as follows: {"Name":"learning_rate", "MaxValue":"0.0001", "MinValue":"0.1", "ScalingType":"Logarithmic"} Select the most accurate hyperparameter configuration from this HPO job.
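The same setting in the SageMaker Python SDK, as a minimal sketch:

```python
from sagemaker.tuner import ContinuousParameter

# Logarithmic scaling samples the learning rate evenly across orders of
# magnitude (0.0001-0.001, 0.001-0.01, and 0.01-0.1 each get comparable
# coverage) instead of clustering draws near the top of a linear range.
learning_rate_range = ContinuousParameter(0.0001, 0.1, scaling_type="Logarithmic")

# This range would be passed to a HyperparameterTuner as
# hyperparameter_ranges={"learning_rate": learning_rate_range}.
```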
A farming firm is interested in applying machine learning to identify particular weed species in a 100-acre grassland patch. The firm currently employs tractor-mounted cameras to capture multiple photographs of the field in 10 × 10 grids. Additionally, the organization has a sizable training dataset comprised of annotated photos of common weed classes such as broadleaf and non-broadleaf docks. The organization wishes to develop a weed identification model capable of identifying specific types of weeds and their location within a field. Once it is complete, the model will be hosted on Amazon SageMaker endpoints. The model will perform real-time inference using the camera images. Which strategy should a Machine Learning Specialist use in order to achieve reliable predictions? A. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes. B. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. C. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. D. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes.
C. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. Note that the question asks for two things: detecting specific types of weeds, and detecting the location of each weed within the field. Image classification can only classify whole images, whereas the object detection algorithm identifies all instances of objects within the image scene and indicates each object's location and scale with a rectangular bounding box. RecordIO is the recommended data format for the computer vision algorithms in SageMaker.
A business wants to categorize user behavior as fraudulent or normal. A machine learning expert will develop a binary classifier based on two features: the account's age, denoted by x, and the month of the transaction, denoted by y. The class distributions are shown in the accompanying image. Positive classes are shown in red, and negative classes are depicted in black. Which model would be the most accurate? A. Linear support vector machine (SVM) B. Decision tree C. Support vector machine (SVM) with a radial basis function kernel D. Single perceptron with a Tanh activation function
C. Support vector machine (SVM) with a radial basis function kernel
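A minimal scikit-learn sketch on synthetic non-linearly separable data (one class surrounding the other), standing in for the plotted account-age versus transaction-month classes:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes that no straight line can separate.
X, y = make_circles(n_samples=500, factor=0.4, noise=0.1, random_state=0)

# The RBF kernel lets the SVM learn a non-linear decision boundary.
clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
print("Training accuracy:", clf.score(X, y))
```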
On Amazon SageMaker, a Machine Learning Specialist is preparing data for training. The Specialist is training using one of SageMaker's built-in algorithms. The dataset is saved in .CSV format and converted to a numpy.array, which appears to be slowing down the training process. What actions should the Specialist take to optimize the data for SageMaker training? A. Use the SageMaker batch transform feature to transform the training data into a DataFrame. B. Use AWS Glue to compress the data into the Apache Parquet format. C. Transform the dataset into the RecordIO protobuf format. D. Use the SageMaker hyperparameter optimization feature to automatically optimize the data.
C. Transform the dataset into the RecordIO protobuf format. Parquet is great for analytics data due to its small file size and allows you to scan only the columns of interest. RecordIO format is typically used for training machine learning models so that the data that the model needs is presented only when needed.
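A minimal sketch of converting the CSV-derived arrays to RecordIO protobuf and uploading the result to S3; the file name, bucket, and key are placeholders:

```python
import io

import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor

# Load the CSV (label in the first column) and split features from labels.
data = np.loadtxt("train.csv", delimiter=",", dtype=np.float32)
features, labels = data[:, 1:], data[:, 0]

# Serialize to RecordIO protobuf in memory and upload to S3 for training.
buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)
boto3.client("s3").upload_fileobj(buf, "my-bucket", "train/recordio-pb-data")
```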
A business must swiftly analyze and acquire insight from a big volume of data. The data is in a variety of formats, the schemas change frequently, and new data sources are introduced on a regular basis. The organization wants to use AWS services to do exploratory analysis of numerous data sources, suggest schemas, and enrich and transform the data. The solution should need as little code as feasible for the data flows and as little infrastructure administration as possible. Which AWS service combination will suit these requirements? A. Amazon EMR for data discovery, enrichment, and transformation. Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL. Amazon QuickSight for reporting and getting insights. B. Amazon Kinesis Data Analytics for data ingestion. Amazon EMR for data discovery, enrichment, and transformation. Amazon Redshift for querying and analyzing the results in Amazon S3. C. AWS Glue for data discovery, enrichment, and transformation. Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL. Amazon QuickSight for reporting and getting insights. D. AWS Data Pipeline for data transfer. AWS Step Functions for orchestrating AWS Lambda jobs for data discovery, enrichment, and transformation. Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL. Amazon QuickSight for reporting and getting insights.
C. AWS Glue for data discovery, enrichment, and transformation. Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL. Amazon QuickSight for reporting and getting insights.
A Machine Learning Specialist initiates a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This method will ultimately be integrated into a pipeline that retrains and tunes hyperparameters each night in order to model click-through on data that goes stale every 24 hours. The Specialist wants to adjust the input hyperparameter range(s) in order to reduce the time required to train these models and, ultimately, to reduce costs. Which visualization will achieve this goal? A. A histogram showing whether the most important input feature is Gaussian. B. A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension. C. A scatter plot showing the performance of the objective metric over each training iteration. D. A scatter plot showing the correlation between maximum tree depth and the objective metric.
D. A scatter plot showing the correlation between maximum tree depth and the objective metric.
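A minimal SageMaker Python SDK sketch of building that scatter plot from a completed tuning job; the job name is a placeholder, and the hyperparameter column is assumed here to be named max_depth to match the tuned parameter:

```python
import matplotlib.pyplot as plt
from sagemaker.analytics import HyperparameterTuningJobAnalytics

# Pull every trial from the finished tuning job into a DataFrame.
df = HyperparameterTuningJobAnalytics("nightly-ctr-tuning-job").dataframe()

# Plot maximum tree depth against the final objective metric (AUC).
plt.scatter(df["max_depth"].astype(float), df["FinalObjectiveValue"])
plt.xlabel("max_depth")
plt.ylabel("Objective metric (AUC)")
plt.show()
```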
A firm that operates an online library is using a chatbot powered by Amazon Lex to deliver category-based book suggestions. This aim is accomplished via the use of an AWS Lambda function that searches an Amazon DynamoDB table for a list of book titles matching a specified category. For testing purposes, only three values for the custom slot type have been implemented: "comedy," "adventure," and "documentary." A machine learning (ML) expert observes that sometimes the request cannot be fulfilled because Amazon Lex cannot understand the category communicated by users who use phrases such as "funny," "fun," and "humor." The ML expert must resolve the issue without modifying the Lambda code or DynamoDB data. How should the ML expert resolve the issue? A. Add the unrecognized words in the enumeration values list as new values in the slot type. B. Create a new custom slot type, add the unrecognized words to this slot type as enumeration values, and use this slot type for the slot. C. Use the AMAZON.SearchQuery built-in slot types for custom searches in the database. D. Add the unrecognized words as synonyms in the custom slot type.
D. Add the unrecognized words as synonyms in the custom slot type.
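A minimal boto3 sketch of adding the synonyms to the existing slot type; the slot type name is a placeholder, TOP_RESOLUTION is required so synonyms resolve to their parent value, and updating an existing slot type also requires passing the checksum of its $LATEST revision:

```python
import boto3

lex = boto3.client("lex-models")

# Attach synonyms to the existing enumeration values.
lex.put_slot_type(
    name="BookCategory",
    valueSelectionStrategy="TOP_RESOLUTION",
    enumerationValues=[
        {"value": "comedy", "synonyms": ["funny", "fun", "humor"]},
        {"value": "adventure", "synonyms": []},
        {"value": "documentary", "synonyms": []},
    ],
    # checksum="<checksum of the $LATEST slot type version>",  # required when updating
)
```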
A financial institution is attempting to identify credit card fraud. According to the firm, around 2% of credit card transactions are fraudulent. On the basis of a year's worth of credit card transaction data, a data scientist trained a classifier. The model must distinguish between fraudulent transactions (positives) and legitimate ones (negatives). The company's objective is to catch as many positives as possible correctly. Which metrics should be used to optimize the model by the data scientist? (Select two.) A. Specificity B. False positive rate C. Accuracy D. Area under the precision-recall curve E. True positive rate
D. Area under the precision-recall curve E. True positive rate
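For reference, a minimal scikit-learn sketch of the two selected metrics on toy labels and scores (the data below is made up for illustration):

```python
import numpy as np
from sklearn.metrics import recall_score, average_precision_score

# Toy ground-truth labels and model scores (1 = fraudulent, 0 = legitimate).
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_scores = np.array([0.1, 0.3, 0.8, 0.2, 0.65, 0.05, 0.4, 0.9, 0.15, 0.25])
y_pred = (y_scores >= 0.5).astype(int)

# True positive rate (recall): fraction of fraudulent transactions caught.
print("Recall / TPR:", recall_score(y_true, y_pred))

# Area under the precision-recall curve, which remains informative
# under the ~2% class imbalance described in the question.
print("PR AUC:", average_precision_score(y_true, y_scores))
```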
A retail corporation wishes to combine customer orders with the product descriptions in its product catalog. Each dataset contains records with a different structure and format. A data analyst attempted to integrate the datasets in a spreadsheet, but the effort produced duplicate entries and records that were not merged correctly. The business needs a solution for combining related records from the two datasets and removing duplicates. Which solution will satisfy these criteria? A. Use an AWS Lambda function to process the data. Use two arrays to compare equal strings in the fields from the two datasets and remove any duplicates. B. Create AWS Glue crawlers for reading and populating the AWS Glue Data Catalog. Call the AWS Glue SearchTables API operation to perform a fuzzy-matching search on the two datasets, and cleanse the data accordingly. C. Create AWS Glue crawlers for reading and populating the AWS Glue Data Catalog. Use the FindMatches transform to cleanse the data. D. Create an AWS Lake Formation custom transform. Run a transformation for matching products from the Lake Formation console to cleanse the data automatically.
D. Create an AWS Lake Formation custom transform. Run a transformation for matching products from the Lake Formation console to cleanse the data automatically.
A machine learning expert is working on a proof of concept for government users who are highly concerned about security. The expert is training a convolutional neural network (CNN) model for an image classification application on Amazon SageMaker. The expert wants to protect the data from inadvertent access and from transmission to a remote host by malicious code installed in the training container. Which of the following actions will provide the MOST secure protection? A. Remove Amazon S3 access permissions from the SageMaker execution role. B. Encrypt the weights of the CNN model. C. Encrypt the training and validation dataset. D. Enable network isolation for training jobs.
D. Enable network isolation for training jobs.
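For reference, a minimal sketch of enabling network isolation on a SageMaker training job with the Python SDK; the image URI, role ARN, and bucket names are hypothetical placeholders:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# The key setting is enable_network_isolation=True, which blocks all outbound
# network calls from the training container during the job.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/cnn-image-classifier:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    enable_network_isolation=True,
    output_path="s3://my-training-bucket/output/",
    sagemaker_session=session,
)

estimator.fit({"train": "s3://my-training-bucket/train/"})
```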
A data scientist is developing a model that suggests tags for blog articles using the Amazon SageMaker Neural Topic Model (NTM) algorithm. The raw blog post data is kept in JSON format in an Amazon S3 bucket. During model evaluation, the data scientist observed that the model proposes certain stopwords such as "a," "an," and "the" as tags for some blog posts, as well as a few uncommon words that appear in only particular posts. After several cycles of tag review with the content team, the data scientist concludes that the rare terms, while uncommon, are valid tags. The data scientist must also ensure that the resulting model's tag suggestions do not include stopwords. What should the data scientist do to meet these requirements? A. Use the Amazon Comprehend entity recognition API operations. Remove the detected words from the blog post data. Replace the blog post data source in the S3 bucket. B. Run the SageMaker built-in principal component analysis (PCA) algorithm with the blog post data from the S3 bucket as the data source. Replace the blog post data in the S3 bucket with the results of the training job. C. Use the SageMaker built-in Object Detection algorithm instead of the NTM algorithm for the training job to process the blog post data. D. Remove the stopwords from the blog post data by using the CountVectorizer function in the scikit-learn library. Replace the blog post data in the S3 bucket with the results of the vectorizer.
D. Remove the stopwords from the blog post data by using the CountVectorizer function in the scikit-learn library. Replace the blog post data in the S3 bucket with the results of the vectorizer.
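For reference, a minimal scikit-learn sketch of stopword removal with CountVectorizer on two made-up blog snippets:

```python
from sklearn.feature_extraction.text import CountVectorizer

blog_posts = [
    "A quick look at the new robotics exhibit in the library",
    "The rare manuscript collection is an archivist's dream",
]

# stop_words='english' drops common words such as "a", "an", and "the";
# min_df=1 keeps the rare-but-valid terms the content team wants to preserve.
vectorizer = CountVectorizer(stop_words="english", min_df=1)
bag_of_words = vectorizer.fit_transform(blog_posts)

print(vectorizer.get_feature_names_out())  # vocabulary without stopwords
print(bag_of_words.toarray())              # token counts to feed the NTM training job
```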
A data scientist has a dataset of machine component images stored in Amazon Elastic File System (Amazon EFS). The data scientist must use Amazon SageMaker to build and train an image classification model on this dataset. Because of budget and time constraints, management expects the data scientist to build and train the model with the fewest possible steps and the least integration effort. How should the data scientist meet these requirements? A. Mount the EFS file system to a SageMaker notebook and run a script that copies the data to an Amazon FSx for Lustre file system. Run the SageMaker training job with the FSx for Lustre file system as the data source. B. Launch a transient Amazon EMR cluster. Configure steps to mount the EFS file system and copy the data to an Amazon S3 bucket by using S3DistCp. Run the SageMaker training job with Amazon S3 as the data source. C. Mount the EFS file system to an Amazon EC2 instance and use the AWS CLI to copy the data to an Amazon S3 bucket. Run the SageMaker training job with Amazon S3 as the data source. D. Run a SageMaker training job with an EFS file system as the data source.
D. Run a SageMaker training job with an EFS file system as the data source.
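For reference, a minimal sketch of passing an EFS file system directly to a SageMaker training job; the file system ID, directory path, image URI, role, subnet, and security group are hypothetical placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

# Hypothetical EFS file system ID and directory containing the component images.
efs_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="EFS",
    directory_path="/machine-component-images",
    file_system_access_mode="ro",
)

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/image-classification:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    # The training instances must run in a VPC with access to the EFS mount targets.
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)

estimator.fit({"train": efs_input})
```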
A data scientist is developing a sentiment analysis application. Validation accuracy is low, and the data scientist believes the cause is the dataset's large vocabulary and the low average frequency of terms. Which tool should be used to improve validation accuracy? A. Amazon Comprehend syntax analysis and entity detection B. Amazon SageMaker BlazingText cbow mode C. Natural Language Toolkit (NLTK) stemming and stop word removal D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer
D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer. Note: TF-IDF weighting compensates for the low average term frequency across a large vocabulary, whereas BlazingText is a highly optimized algorithm for word embeddings and text classification tasks such as sentiment analysis.
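For reference, a minimal scikit-learn TF-IDF sketch on two made-up reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "The plot was gripping and the acting superb",
    "Dull pacing and a forgettable script",
]

# TF-IDF up-weights terms that are distinctive to a document and down-weights
# terms spread thinly across a large vocabulary, which helps when the average
# term frequency is low.
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=1)
features = vectorizer.fit_transform(reviews)

print(features.shape)                          # documents x vocabulary size
print(vectorizer.get_feature_names_out()[:10]) # sample of the learned vocabulary
```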
A data scientist is evaluating a GluonTS DeepAR model on Amazon SageMaker. According to the evaluation metrics for the test set, the coverage score is 0.489 at the 0.5 quantile and 0.889 at the 0.9 quantile. What can the data scientist correctly conclude about the test set's distributional forecast? A. The coverage scores indicate that the distributional forecast is poorly calibrated. These scores should be approximately equal to each other at all quantiles. B. The coverage scores indicate that the distributional forecast is poorly calibrated. These scores should peak at the median and be lower at the tails. C. The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should always fall below the quantile itself. D. The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should be approximately equal to the quantile itself.
D. The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should be approximately equal to the quantile itself. By default, DeepAR outputs three quantile forecasts: P10, P50, and P90 (the 0.9 quantile). P10 describes the probability distribution: there is a 10% chance that the actual value will fall below the P10 value.
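For reference, a toy calculation of quantile coverage, assuming normally distributed actuals and a forecast that predicts the true quantiles exactly (all values below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: observed values and the forecast's predicted quantiles per time step.
actuals = rng.normal(loc=100.0, scale=10.0, size=1000)
p50_forecast = np.full_like(actuals, 100.0)                  # predicted median
p90_forecast = np.full_like(actuals, 100.0 + 1.2816 * 10.0)  # predicted 0.9 quantile

# Coverage[q] = share of actual values that fall at or below the predicted q-quantile.
coverage_p50 = np.mean(actuals <= p50_forecast)
coverage_p90 = np.mean(actuals <= p90_forecast)

print(f"Coverage at 0.5 quantile: {coverage_p50:.3f}")  # ~0.5 when well calibrated
print(f"Coverage at 0.9 quantile: {coverage_p90:.3f}")  # ~0.9 when well calibrated
```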
A bank wants to launch a low-interest credit campaign. The bank is located in a community that has recently faced economic hardship. Because only a portion of the bank's customers were affected by the downturn, the credit team must decide which customers to target with the campaign. However, the credit team wants to ensure that the full credit history of loyal customers is considered when a decision is made. The bank's data science team built a model for categorizing account activity and determining credit eligibility. The team trained the model with the XGBoost algorithm, training and tuning hyperparameters over several days using seven years of historical bank transaction data. Although the model is sufficiently accurate, the credit team finds it difficult to explain precisely why the model denies credit to some customers. The credit team has practically no data science expertise. How should the data science team handle this problem in the most efficient way possible? A. Use Amazon SageMaker Studio to rebuild the model. Create a notebook that uses the XGBoost training container to perform model training. Deploy the model at an endpoint. Enable Amazon SageMaker Model Monitor to store inferences. Use the inferences to create Shapley values that help explain model behavior. Create a chart that shows features and SHapley Additive exPlanations (SHAP) values to explain to the credit team how the features affect the model outcomes. B. Use Amazon SageMaker Studio to rebuild the model. Create a notebook that uses the XGBoost training container to perform model training. Activate Amazon SageMaker Debugger, and configure it to calculate and collect Shapley values. Create a chart that shows features and SHapley Additive exPlanations (SHAP) values to explain to the credit team how the features affect the model outcomes. C. Create an Amazon SageMaker notebook instance. Use the notebook instance and the XGBoost library to locally retrain the model. Use the plot_importance() method in the Python XGBoost interface to create a feature importance chart. Use that chart to explain to the credit team how the features affect the model outcomes. D. Use Amazon SageMaker Studio to rebuild the model. Create a notebook that uses the XGBoost training container to perform model training. Deploy the model at an endpoint. Use Amazon SageMaker Processing to post-analyze the model and create a feature importance explainability chart automatically for the credit team.
D. Use Amazon SageMaker Studio to rebuild the model. Create a notebook that uses the XGBoost training container to perform model training. Deploy the model at an endpoint. Use Amazon SageMaker Processing to post-analyze the model and create a feature importance explainability chart automatically for the credit team.
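For reference, a minimal sketch of computing SHAP values for an XGBoost model with the open-source shap library; the data, labels, and feature names below are synthetic stand-ins for illustration only:

```python
import numpy as np
import shap
import xgboost as xgb

# Synthetic stand-in for the bank's transaction features; the real pipeline
# would use the seven years of historical data referenced in the question.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.3, size=500) > 0).astype(int)
feature_names = [f"feature_{i}" for i in range(5)]

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

# TreeExplainer computes SHAP values for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Summary plot: per-feature contribution to each prediction, readable by non-experts.
shap.summary_plot(shap_values, X, feature_names=feature_names)
```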