AWS Machine Learning Qs

You are using SageMaker Automatic Hyperparameter tuning to search for optimal parameters for a learning algorithm. What are the best practices when running a hyperparameter tuning job?

"you can simultaneously use up to 20 variables in a hyperparameter tuning job, limiting your search to a much smaller number is likely to give better results" "a tuning job improves only through successive rounds of experiments. Typically, running one training job at a time achieves the best results with the least amount of compute time" "Choose logarithmic scaling when you are searching a range that spans several orders of magnitude" "you specify a range of values between .0001 and 1.0 for the learning_rate hyperparameter, searching uniformly on a logarithmic scale gives you a better sample of the entire range than searching on a linear scale would, because searching on a linear scale would, on average, devote 90 percent of your training budget to only the values between .1 and 1.0, leaving only 10 percent of your training budget for the values between .0001 and .1"

A term frequency-inverse document frequency (tf-idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences: 1. Please call the number below. 2. Please do not call us. What are the dimensions of the tf-idf matrix? (Choose one)

(2, 16): 2 sentences (2 rows); 8 unique unigrams + 8 unique bigrams = 16 columns
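
You can check this with scikit-learn (not part of the question; just a quick sanity check):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["Please call the number below.", "Please do not call us."]
    # ngram_range=(1, 2) builds both unigrams and bigrams
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(corpus)
    print(matrix.shape)  # (2, 16)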

An organization has human experts who perform manual classification of products by visual inspection. A Machine Learning Specialist is building a classification system to match human-level performance. When reviewing the error rate of humans, the Specialist observes the following: newly trained employees had a misclassification error rate of 5%, experienced employees had an error rate of 2.5%, and a team of experienced employees working together had a misclassification rate of 1%. What should be considered as human-level performance?

1% should be used as the human-level performance; it is a good proxy for the Bayes optimal error (the theoretical best possible error rate).

A Kinesis Data Stream's capacity is provisioned by shards. What is the maximum throughput of a single shard?

1 MB/sec or 1,000 records/sec for writes (each shard also supports 2 MB/sec of read throughput)

What happens when you select server-side encryption?

Amazon S3 encrypts an object before saving it to disk and decrypts it when you download it.

You are working for a data analytics company whose data is stored in an S3 bucket. You want to analyze the data; which AWS tool can be used? You also want the solution to be serverless.

Athena can be used to analyze data stored in S3, and Athena is serverless, which makes it extremely easy to use.

You have setup a group of SageMaker Notebook instances for your company's data scientists. You wanted to uphold your company's philosophy on least privilege and disabled Internet access for the notebooks. However, the data scientists report that they are unable to import certain key libraries from the Internet into their notebooks. What is the most efficient path?

Notebook instances are Internet-enabled by default, but you can disable this internet access. If you do so and still need to access the Internet from the notebook instances, you must create a NAT gateway, an appropriate route, and security groups that allow outbound connections to the Internet.

A company has access to semi-structured data in JSON format. If the company wants to perform analytics on the data and visualize it, which combination of AWS services would achieve this goal?

Use an AWS Glue crawler to catalog the JSON data and an AWS Glue ETL job to convert it to Parquet format, then use Athena to analyze the data and finally QuickSight to visualize it.
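
A minimal AWS Glue job sketch for the JSON-to-Parquet conversion (the database, table, and bucket names are placeholder assumptions):

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the JSON table that a crawler registered in the Data Catalog
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="analytics_db", table_name="json_events"  # hypothetical names
    )

    # Write it back to S3 as Parquet for efficient querying with Athena
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/parquet/"},  # hypothetical path
        format="parquet",
    )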

A Machine Learning Specialist has been using Amazon EC2 for quite some time to train classification and regression models. The Specialist wants to simplify the training job by leveraging Amazon SageMaker's built-in algorithms. However, he is unsure if SageMaker can support the format of his training data. Which data formats can the Specialist use for model training?

x-image, x-recordio, x-recordio-protobuf, jsonlines, jpeg, png, csv, libsvm

You want to perform data analysis and gain business insights by running analytics on data gathered from multiple sources. Which of the following storage options can be used?

A data warehouse sits on top of several databases and is used for business intelligence. The data warehouse consumes data from all these databases and creates a layer optimized for data analytics.

You have stored your data in S3 and would like to analyze the data. Which of the following AWS services could help you analyze the data?

Amazon Redshift Spectrum allows analysts to run SQL queries directly on data stored in Amazon S3 buckets.

You want to automate the analysis of daily transaction costs and market performance. Which AWS service would help automate this?

AWS Batch allows you to run batch computing jobs on AWS. AWS Batch optimizes the type and number of compute resources based on the volume of jobs, and performs all the scheduling and execution of batch jobs using Amazon EC2 and Spot Instances.

A Machine Learning Specialist has an on-premises MySQL database that needs to be replicated in Amazon S3 as CSV files. Once the data has been fully copied, on-going changes to the database should be continually streamed into the S3 bucket. The Amazon SageMaker's Pipe input mode will be used to fetch datasets from Amazon S3 for training the ML models. Which service will help the ML Specialist automate these tasks efficiently?

AWS Database Migration Service (DMS)

What AWS service helps users transfer data from DynamoDB to S3?

AWS Data Pipeline is an orchestration service that allows AWS users to transfer data reliably and securely between various AWS compute and storage services.

You need to implement transformations for data that is hosted in Amazon S3 and an Amazon RDS MySQL instance. Which of the following needs to occur to achieve this? (Choose four)

AWS Glue uses a crawler to populate the AWS Glue Data Catalog. Once the Data Catalog is populated, you can run your AWS Glue job to transform the data. To access S3 data, you must ensure the role passed to the crawler has permission to access the Amazon S3 paths. For the JDBC connection, a username and password are used to connect to the RDS instance.

You have data stored in S3 in the Parquet format. You want to analyze and visualize the data. How can you achieve this?

AWS Glue, Athena and QuickSight work seamlessly together for ETL jobs and data visualization.

Your company is currently working on a project which tracks the movement of autonomous delivery bots. These bots are fitted with a tracker that updates every second. The company wants to use this stream of data to track the location of the bots. Which AWS service would help here?

Amazon Kinesis. Kinesis Data Streams can ingest the per-second tracker updates and make them available for tracking the bots' locations.

A healthcare company wants to deploy an ensemble of models behind a single endpoint with minimal management. The models include an XGBoost model trained on one of its structured datasets and a CNN model trained on an image dataset. Which solution can the company use to reach this objective?

AWS Lambda with Amazon API Gateway provides serverless compute that can be used, with minimal management, alongside an Amazon SageMaker endpoint to preprocess the data.

You have been tasked with creating a labeled dataset by classifying text data into different categories depending on the summary of the corpus. You plan to use this data with a particular machine learning algorithm within AWS. Your goal is to make this as streamlined as possible with minimal amount of setup from you and your team. What tool can be used to help label your dataset with the minimum amount of setup?

Amazon SageMaker Ground Truth text classification job. You can use SageMaker Ground Truth to create ground truth datasets by creating labeling jobs. When you create a text classification job, workers group text into the categories that you define. You can define multiple categories, but a worker can apply only one category to each piece of text. Use the instructions to guide your workers to make the correct choice, and always define a generic class in addition to your specific classes: giving your workers a generic option helps minimize inaccurately classified text.

Which AWS service converts a workflow into a state machine diagram so that it becomes easy to debug?

AWS Step Functions allows you to create serverless workflows. Output from one step is fed as input to the next step. Step Functions converts a workflow into a state machine diagram that is easy to debug and understand, enables resilient workflow automation without writing code, and provides advanced error handling and retry mechanisms.

You want to combine various AWS Lambda functions into responsive serverless applications, without having to write code for workflow logic, parallel processes, error handling, timeouts or retries. Which AWS tool would help you with this?

AWS Step Functions lets you combine AWS Lambda functions into serverless workflows without writing code for workflow logic. Output from one step is fed as input to the next, workflows are rendered as state machine diagrams that are easy to debug and understand, and advanced error handling, timeouts, and retries are built in.

You are helping a client design a landscape for their mission critical ML model based on DeepAR deployed using SageMaker Hosting Services. Which of the following would you recommend they do to ensure high availability?

AWS recommends deploying a minimum of two instances for mission-critical workloads when using SageMaker Hosting Services. SageMaker automatically spreads multiple instances across different Availability Zones within a region.

A data scientist has a large dataset that needs to be trained on the AWS SageMaker service. The training algorithm is optimized for GPU processing and can benefit from substantial speed-up when trained on instances with GPUs. Which instance family can you use for a training job for the best performance?

The accelerated computing family (P and G instance types) comes with GPUs, which are ideal for algorithms optimized for GPU processing.

You want to allow your team members to access files stored in an S3 bucket. How can you do that?

Access control lists (ACLs) belong to the resource-based access policies. An ACL can be used to grant other AWS accounts permission to read/write objects.

You are given access to a stream of data from a trading platform. You are tasked with detecting any anomalies in the streaming data. How can you detect these anomalies?

Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. Amazon Kinesis Data Analytics is the easiest way to analyze streaming data; its RANDOM_CUT_FOREST function detects anomalies in your data stream.

Which AWS service helps you to stream video from various devices and feed them into computer vision algorithms?

Amazon Kinesis Video Streams allows for seamless video streaming from millions of devices. This video stream can be fed to computer vision/deep learning algorithms (e.g., face detection).

How is a Kinesis video stream encrypted at rest?

Amazon Kinesis Video Streams automatically encrypts data both at rest and in transit: (1) at rest, using AWS Key Management Service (KMS); (2) in transit, using the industry-standard Transport Layer Security (TLS) protocol.

You have been asked to help design a customer service bot that can help answer the most common customer service questions posed on a public chat service. Which of the following might meet the need and do so with the minimum overhead?

Amazon Lex can be used to create a chatbot that understands natural language. As a managed service, it does not require any EC2 instances or models to be deployed before use, and therefore has less overhead than a customized model using SageMaker.

An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs to be used instead of Word2Vec.

A Machine Learning Specialist is using an Amazon SageMaker notebook instance in a private subnet of a corporate VPC. The ML Specialist has important data stored on the Amazon SageMaker notebook instance's Amazon EBS volume, and needs to take a snapshot of that EBS volume. However, the ML Specialist cannot find the Amazon SageMaker notebook instance's EBS volume or Amazon EC2 instance within the VPC. Why is the ML Specialist not seeing the instance visible in the VPC?

Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts

Which Apache application helps you perform interactive data analysis using SQL?

Apache Zeppelin is a web-based notebook that allows for interactive data analytics and collaborative documents with SQL and Scala. Apache Zeppelin can be used for data exploration by creating interactive notebooks using Apache Spark.

A Data Science team within a large company uses Amazon SageMaker notebooks to access data stored in Amazon S3 buckets. The IT Security team is concerned that internet-enabled notebook instances create a security vulnerability where malicious code running on the instances could compromise data privacy. The company mandates that all instances stay within a secured VPC with no internet access, and data communication traffic must stay within the AWS network. How should the Data Science team configure the notebook instance placement to meet these requirements?

Associate the Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has S3 VPC endpoints and Amazon SageMaker VPC endpoints attached to it.

You and your team are working on an analytics project, and you want the team members to perform analysis on the data stored in your S3 bucket. How can you do that?

Athena can be used to analyze the data and you can give access to your team members using access control lists.

You are trying to classify a number of items into one of 6 groups (books, electronics, movies, etc.) based on their features. Which algorithm would be best suited for this type of problem?

Both XGBoost and Linear Learner are good choices for multiclass classification problems. When solving a multiclass classification problem with XGBoost, set the objective hyperparameter to multi:softmax; when using the Linear Learner algorithm, set the predictor_type hyperparameter to multiclass_classifier.
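
A minimal sketch of both configurations with the SageMaker Python SDK (the role, region, and instance settings are placeholder assumptions):

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role
    region = session.boto_region_name

    # XGBoost: objective=multi:softmax with 6 classes
    xgb = Estimator(
        image_uris.retrieve("xgboost", region, version="1.5-1"),
        role, instance_count=1, instance_type="ml.m5.xlarge",
    )
    xgb.set_hyperparameters(objective="multi:softmax", num_class=6, num_round=100)

    # Linear Learner: predictor_type=multiclass_classifier with 6 classes
    ll = Estimator(
        image_uris.retrieve("linear-learner", region),
        role, instance_count=1, instance_type="ml.m5.xlarge",
    )
    ll.set_hyperparameters(predictor_type="multiclass_classifier", num_classes=6)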

A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs. What does the Specialist need to do?

Build the Docker container to be NVIDIA-Docker compatible

Network isolation is NOT supported by the following managed Amazon SageMaker containers as they require access to Amazon S3:

Chainer, PyTorch, Scikit-learn, and Amazon SageMaker Reinforcement Learning

Your company has just discovered a security breach occurred in a division separate from yours but has ordered a full review of all access logs. You have been asked to provide the last 180 days of access to the three SageMaker Hosted Service models that you manage. When you set up these deployments, you left everything default. How will you be able to respond?

CloudTrail is the proper service if you want to see who has sent API calls to your SageMaker hosted models, but by default it stores only the last 90 days of events. You can configure CloudTrail to store an unlimited amount of logs on S3, but this is not turned on by default, so only the last 90 days will be available. While CloudTrail is not strictly an access log, it performs the same auditing functions you might expect, and an auditor may not be familiar with the nuances of AWS.

You are designing an image classification model that will detect objects in provided pictures. Which neural network approach would be most likely in this use case?

Convolutional Neural Networks are most commonly associated with image and signal processing. Recurrent Neural Networks are most commonly used with text or speech use-cases where sequence prediction is key.

A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet. How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances?

Create Amazon SageMaker VPC interface endpoints within the corporate VPC

A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII). The dataset: - Must be accessible from a VPC only. - Must not traverse the public internet. How can these requirements be satisfied?

Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC

You have been tasked with using Polly to translate text to speech for the company announcements that launch weekly. The problem you are encountering is that Polly is incorrectly pronouncing the company's acronyms. What can be done for future tasks to help prevent this?

Create a dictionary lexicon and use SSML tags in documents. Using SSML-enhanced input text gives you additional control over how Amazon Polly generates speech from the text you provide. These tags allow you to substitute a different word (or pronunciation) for selected text, such as an acronym or abbreviation. You can also create a dictionary lexicon to apply to any future tasks instead of applying SSML to each individual document.
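
A minimal boto3 sketch using the SSML sub tag (the acronym, its expansion, and the voice are illustrative assumptions):

    import boto3

    polly = boto3.client("polly")

    # <sub> substitutes a spoken expansion for the acronym "AWS" (illustrative)
    ssml = (
        '<speak>Welcome to the weekly <sub alias="Amazon Web Services">AWS</sub> '
        "announcement.</speak>"
    )
    response = polly.synthesize_speech(
        Text=ssml, TextType="ssml", OutputFormat="mp3", VoiceId="Joanna"
    )
    with open("announcement.mp3", "wb") as f:
        f.write(response["AudioStream"].read())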

Which AWS service helps with data replication into Amazon Redshift?

DMS allows for database consolidation and data replication into a data warehouse such as Amazon Redshift, or into S3.

A data scientist is exploring the use of the XGBoost algorithm for a regression problem. The dataset consists of numeric features. Some of the features are highly correlated, and almost all the features are on different orders of magnitude. What data transformation is required to train with XGBoost?

Decision-tree-based algorithms like XGBoost automatically handle correlated features, numeric features on different scales, and numeric categorical variables, so no transformation is required. Other algorithms, like neural networks and linear models, require features on a similar scale and range; for those, you would keep only one feature of each highly correlated pair and one-hot encode categorical features.

A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitively help the model detect more than 10% of fraud cases?

Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision.
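
A small illustration of lowering the class probability threshold (the probabilities and threshold values are made up):

    import numpy as np

    # Predicted fraud probabilities from a fitted model (illustrative values)
    probs = np.array([0.05, 0.30, 0.45, 0.60, 0.92])

    default_preds = (probs >= 0.5).astype(int)    # default threshold: 2 flagged
    sensitive_preds = (probs >= 0.2).astype(int)  # lowered threshold: 4 flagged

    print(default_preds)    # [0 0 0 1 1]
    print(sensitive_preds)  # [0 1 1 1 1]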

A labeled dataset contains a lot of duplicate examples. How should you handle duplicate data?

Duplicates can accidentally leak into validation and test sets when you split your data. This can cause artificially better performance on validation and test sets. You should clean up the data so that all examples are distinct.
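
A minimal sketch with pandas, deduplicating before the train/validation/test split (the file name and split ratio are hypothetical):

    import pandas as pd

    df = pd.read_csv("labeled_data.csv")  # hypothetical file

    # Drop exact duplicate rows BEFORE splitting, so no example can
    # appear in both the training set and the validation/test sets.
    df = df.drop_duplicates()

    train = df.sample(frac=0.8, random_state=42)
    holdout = df.drop(train.index)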

Which of the following helps you to directly deal with S3 for reading and writing data from Amazon EMR?

EMRFS is an implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3.

You want to perform analytics on clickstream data from a website. Which AWS service can help you with this?

Elasticsearch is an open source distributed search and analytics engine. Elasticsearch works with various types of data such as numerical, text, structured and unstructured. Elasticsearch performs data ingestion, enrichment, storage, analysis, and visualization.

You are working for a company with strict compliance and data security requirements that requires that data is encrypted at all times, including at rest and in transit within the AWS cloud. You have been tasked with setting up a streaming data pipeline to move their data into the AWS cloud. What combination of tools allows this to be implemented with minimum amount of custom code? (Choose one)

Encrypt data with the Amazon Kinesis Producer Library(KPL), decrypt data with Amazon Kinesis Consumer Library(KCL), and use AWS KMS to manage keys

What is the difference between AWS Glue and Data Pipeline?

Glue is an ETL service and Data Pipeline is an orchestration service: AWS Glue is an ETL service that runs on a serverless Apache Spark environment, while AWS Data Pipeline is a managed orchestration service.

Within an inference model pipeline, Amazon Sagemaker handles invocations as a sequence of ...

HTTP requests. An invocation is an execution of a function or program.

Which feature of Hadoop allows it to divide the given job into smaller ones and then distribute them to perform parallel processing?

The Hadoop MapReduce execution engine divides a job into many smaller tasks and distributes them across multiple nodes in the Amazon EMR cluster to perform parallel processing.

Your company is working with petabytes of data, and you want a very durable and scalable processing framework that can process the data in a time-efficient manner. Which of the following frameworks is suitable?

Hadoop is extremely scalable and processes data in parallel, which makes it perfectly suited for big data processing. Hadoop is very durable and available; scalability is achieved by adding more servers to the Hadoop cluster.

A financial institution is seeking a way to improve security by implementing two-factor authentication (2FA). However, management is concerned about customer satisfaction by being forced to authenticate via 2FA for every login. The company is seeking your advice. What is your recommendation?

IP Insights is a built-in SageMaker algorithm that can detect anomalies as it relates to IP addresses. In this case, only enforcing 2FA where unusual activity is detected might be a good compromise between security and ease-of-use. While using facial recognition might be a tempting alternative, it can easily be bypassed by holding up a picture of some customer and it would not be true multi-factor authentication.

You are helping a client troubleshoot a security configuration issue in an IAM policy for a group of users. The users assigned to the policy cannot create models. Which of the following might indicate a problem in the IAM policy?

In SageMaker, the iam:PassRole action is needed for the Amazon SageMaker action sagemaker:CreateModel. This allows the user to pass authorization to SageMaker to actually create models.

A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data. Which of the following methods should the Specialist consider using to correct this?

Increase regularization and dropout, and reduce feature combinations (the model is overfitting, so its complexity should be reduced).

A sports and betting company uses machine learning to predict the odds of winning during sporting events. It uses an Amazon SageMaker endpoint to serve its production model. The endpoint is on an m5.8xlarge instance. What can the company do to ensure that this endpoint is highly available, while using the most cost-effective and easily managed solution?

Increase the number of instances associated with the endpoint to more than one instance.

A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum.

You're a machine learning specialist working for an automobile broker who is looking to use machine learning to identify different models of muscle cars. You have been tasked with preparing a machine learning model to classify the different models of cars. The current implementation uses a neural network trained to classify other objects. What changes can be applied to help classify different models of muscle cars?

Keep the initial weights and remove the last layer. When you're re-purposing a pre-trained model for your own needs, you start by removing the original classifier (the last layer), then you add a new classifier that fits your purposes, and finally you can train the entire model on the new dataset.

You want to create a face detection app based on the video stream coming from various sources like a cell phone camera or security camera. How can you achieve this?

Kinesis Video Streams and Rekognition are seamlessly integrated, making it possible to build sophisticated computer vision systems.

A Machine Learning Specialist is developing a custom video recommendation model for an application. The dataset used to train this model is very large with millions of data points and is hosted in an Amazon S3 bucket. The Specialist wants to avoid loading all of this data onto an Amazon SageMaker notebook instance because it would take hours to move and will exceed the attached 5 GB Amazon EBS volume on the notebook instance. Which approach allows the Specialist to use all the data to train the model?

Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode
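
A minimal sketch of a training job using Pipe input mode (the algorithm image, role, and S3 path are illustrative assumptions):

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    estimator = Estimator(
        image_uris.retrieve("image-classification", session.boto_region_name),
        "arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        input_mode="Pipe",  # stream data from S3 instead of copying it to the instance
    )
    estimator.fit({"train": TrainingInput("s3://my-bucket/video-data/",  # hypothetical
                                          content_type="application/x-recordio")})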

An EMR cluster consists of groups of EC2 instances, known as nodes. Which node manages the entire cluster?

The master node manages all the other nodes in the cluster, tracks the status of tasks, and monitors the health of the cluster.

A Machine Learning Specialist has various CSV training datasets stored in an S3 bucket. Previous models trained with similar training data sizes using the Amazon SageMaker Linear Learner algorithm have had a slow training process. The Specialist wants to decrease the amount of time spent training the model. What should the Specialist do?

Most Amazon SageMaker algorithms work best when you use the optimized protobuf recordIO data format for training. Using this format allows you to take advantage of Pipe mode. In Pipe mode, your training job streams data directly from Amazon Simple Storage Service (Amazon S3).
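
A minimal sketch converting NumPy data to protobuf recordIO for Pipe-mode training (the array contents and upload target are illustrative assumptions):

    import io
    import boto3
    import numpy as np
    import sagemaker.amazon.common as smac

    # Illustrative training data
    features = np.random.rand(1000, 10).astype("float32")
    labels = np.random.randint(0, 2, 1000).astype("float32")

    # Serialize to the optimized protobuf recordIO format
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, features, labels)
    buf.seek(0)

    # Upload to S3 for a Pipe-mode training job (hypothetical bucket/key)
    boto3.resource("s3").Object("my-bucket", "train/data.rec").upload_fileobj(buf)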

A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions. Here is an example from the dataset: "The quck BROWN FOX jumps over the lazy dog." Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner?

- Normalize all words by making the sentence lowercase
- Remove stop words using an English stopword dictionary
- Tokenize the sentence into words
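
A minimal sketch of these steps with NLTK (NLTK is an assumption; any tokenizer and stopword list would do):

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")
    nltk.download("stopwords")

    sentence = "The quck BROWN FOX jumps over the lazy dog."
    stop_words = set(stopwords.words("english"))

    tokens = word_tokenize(sentence.lower())           # normalize case, tokenize
    words = [t for t in tokens if t.isalpha()]         # drop punctuation
    words = [w for w in words if w not in stop_words]  # remove stop words
    print(words)  # ['quck', 'brown', 'fox', 'jumps', 'lazy', 'dog']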

A Data Scientist is training a convolutional neural network model to detect incoming employees at the company's front gate using a camera so that the system opens for them automatically. However, the model is taking too long to converge and the error oscillates for more than 10 epochs. What should the Data Scientist do to improve upon this situation?

Normalize the images before training and add batch normalization.

You have launched a new Jupyter Notebook instance and you want to make sure that you don't lose any files and data when the notebook instance restarts. Where should you save your files and data so that they are not overwritten when the instance restarts?

Only files and data saved within the /home/ec2-user/SageMaker folder persist between notebook instance sessions. Files and data that are saved outside this directory are overwritten when the notebook instance stops and restarts.

A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance. How should the records be stored in Amazon S3 to improve query performance?

Parquet files

A startup is analyzing social media trends with data stored in S3. For analysis, it is common to access a subset of attributes across a large number of records. Which of these formats can lower the cost of storage while improving query performance?

Parquet is a columnar storage format that transparently compresses data. It is a very efficient format for querying a subset of columns across a large number of records. Avro is a binary format that uses row storage and is optimized for use cases that need to access entire rows. JSON and CSV are text formats that use row storage.

What is the programming language supported by AWS Glue ETL?

Python and Scala

By default, who can access an S3 bucket?

S3 buckets are private by default; only the owner can access the bucket. The owner can allow access to others by creating an access policy.

Which feature of QuickSight helps it perform better under heavy load and also makes it possible to scale to many users?

SPICE is a fast, optimized, in-memory calculation engine for Amazon QuickSight. SPICE is highly available and durable, and can be scaled to hundreds of thousands of users. SPICE can be used for fast, ad hoc data visualization.

Which type of encryption is used for most of the machine learning services?

SSE-S3 and AWS KMS

You need to configure the SageMaker Endpoint to Scale on demand. Based on load testing, you have determined that one instance can handle 150 requests per second. Assume a safety factor of 0.5. What value do you need to set for SageMakerVariantInvocationsPerInstance to trigger auto-scaling action? Note: SageMakerVariantInvocationsPerInstance is a per minute metric.

SageMakerVariantInvocationsPerInstance is a per minute metric that you can monitor with CloudWatch to trigger Auto Scaling actions. When this value exceeds 4500, Autoscaling needs to add a server to handle the increased workload. SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60 = 150 * 0.5 * 60 = 4500
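
A minimal sketch registering that target value with Application Auto Scaling via boto3 (the endpoint name, variant name, and capacity limits are hypothetical):

    import boto3

    autoscaling = boto3.client("application-autoscaling")
    resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # hypothetical

    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=4,
    )
    autoscaling.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 4500.0,  # (150 RPS * 0.5 safety factor) * 60
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )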

Only three of the built-in SageMaker algorithms support incremental training. Can you identify those three algorithms?

Semantic Segmentation, Image Classification, and Object Detection

A machine learning specialist needs to come up with an approach to automatically summarize the content of large text documents. Which algorithm can be used for this use case?

The Seq2Seq algorithm is used for text summarization: it accepts a series of tokens as input and outputs another sequence of tokens. LDA is an unsupervised algorithm for topic modeling; it can generate probabilities of a document belonging to a number of specified topics. K-Means is a clustering algorithm used for identifying groupings within data. Random Cut Forest is used for detecting anomalous data points.

John is working for an analytics company that uses a Hadoop cluster on Amazon EMR equipped with Spark. John is tasked with building an ML model on SageMaker and would like to use this data. How can he go about it?

Since Spark is supported on the Hadoop cluster, you can use the SageMaker Spark library. Amazon EMR installs and manages Apache Spark on Hadoop YARN. Using the Spark library, you can perform analysis on the data.

Which activation function would you use in the output layer for a Multi-class Classification neural network that predicts a single label from a set of possible labels?

Softmax activation is used for predicting a single label from a set of possible labels. Softmax returns the probability for each label, and the sum of all probabilities adds up to 1. The class with the highest probability is used as the final class for the example.
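
A minimal NumPy illustration of a softmax output layer (the logits are made up):

    import numpy as np

    def softmax(logits):
        # Subtract the max for numerical stability; the result is unchanged
        exps = np.exp(logits - np.max(logits))
        return exps / exps.sum()

    logits = np.array([2.0, 1.0, 0.1])  # raw scores for 3 classes
    probs = softmax(logits)
    print(probs, probs.sum())           # per-class probabilities summing to 1
    print(np.argmax(probs))             # index of the predicted class: 0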

You are preparing plain text corpus data to use in an NLP process. Which of the following is/are important step(s) to pre-process the text in NLP-based projects?

Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word. Stop words are those words which have no relevance to the context of the data, for example is/am/are. Object standardization is another good way to pre-process text: it removes things like acronyms, hashtags with attached words, and colloquial slang that typically are not recognized by search engines and models.

Paul Shmidt is a Machine Learning Specialist working with data currently stored in Amazon Elastic Map Reduce (EMR). His company has asked him to look for ways to reduce costs, so he is looking into using spot instances for some of the EMR nodes. Data loss needs to be prevented, and sudden termination is not acceptable for his applications. Which node types should Paul put on spot instances?

Task nodes process data but do not hold persistent data in HDFS. Terminating a task node does not result in data loss or cause the application to terminate.

When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified?

- The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users
- The Amazon EC2 instance class, specifying whether training will be run using CPU or GPU
- The output path, specifying where on an Amazon S3 bucket the trained model will persist

During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue?

The learning rate is very high

Training data has values for all features. In the test data, some of the features have missing values. If you build a neural network with the training data and use the test data to verify performance, how would the neural network behave?

The network would not have learned anything about missing values, since none appeared during training. You would need to create new examples in the training data with missing values so that the model can learn to ignore them.

Which encryption is used for encrypting data in transit between Athena and S3?

Transport Layer Security (TLS) encrypts data in transit between Athena and S3.

For a regression problem, which of these algorithms cap the output to a range of values seen in the training set? (Choose two)

Tree-based algorithms like decision trees, random forests, and XGBoost have lower and upper bounds on what they can predict for regression. These bounds are determined by the range of values seen during training.

You have been given access to streaming data from a game application. What are the required tools to analyze the stream data effectively?

Use Kinesis Data Firehose to stream the data and Kinesis Data Analytics to perform analysis.

You want to analyze clickstream data using Athena. How would you stream the clickstream data in a format that makes it easy for Athena to perform queries?

Use Kinesis Data Firehose to stream the data and convert it into Parquet format. Firehose can convert the incoming data into Parquet or ORC format, which makes it easy to analyze.

A machine learning specialist is running a training job on a single EC2 instance using their own Tensorflow code on a Deep Learning AMI. The specialist wants to run distributed training and inference using SageMaker. What should the machine learning specialist do?

Use TensorFlow in SageMaker and edit your code to run using the SageMaker Python SDK. When using custom TensorFlow code, the Amazon SageMaker Python SDK supports script mode training scripts. Script mode has the following advantages:
- Script mode training scripts are more similar to training scripts you write for TensorFlow in general, so it is easier to modify your existing TensorFlow training scripts to work with Amazon SageMaker.
- Script mode supports both Python 2.7- and Python 3.6-compatible source files.
- Script mode supports Horovod for distributed training.
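
A minimal script mode sketch (the entry point, framework versions, role, instance settings, and S3 path are illustrative assumptions for a current SDK):

    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(
        entry_point="train.py",  # your existing TensorFlow script (assumed)
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
        instance_count=2,        # distributed training across 2 instances
        instance_type="ml.p3.2xlarge",
        framework_version="2.11",
        py_version="py39",
        distribution={"mpi": {"enabled": True}},  # Horovod via MPI
    )
    estimator.fit({"train": "s3://my-bucket/train/"})  # hypothetical S3 path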

How can you change the storage tiers throughout the data lifecycle?

Users can configure a lifecycle policy that S3 applies to a group of objects.

You have data stored in Amazon S3 and Amazon Redshift. You would like to work with data on both these AWS services. Which platform allows you to use the data from these?

Using Jupyter-based EMR Notebooks, developers can work with data anywhere in AWS, such as Amazon S3, Amazon DynamoDB, and Amazon Redshift.

Creating an S3 VPC Endpoint in your VPC will have which of the following impacts?

Using a VPC Endpoint will redirect the S3 traffic through the AWS private network rather than egressing to the public internet. Both of these attributes will reduce egress costs and increase security.

You want to store a stream of data coming from a game application to S3. What is the most efficient way to achieve that?

Kinesis Data Firehose can be used to transform, encrypt, and load data into S3, Redshift, and the Elasticsearch Service.

You have setup autoscaling for your deployed model using SageMaker Hosting Services. You notice that in times of heavy load spikes, it takes a long time for the hosted model to scale out in response to the load. How might you increase the reaction time of auto-scaling?

When scaling responsiveness is not as fast as you would like, look at the cooldown period. The cooldown period is a duration during which scale events are ignored, allowing new instances to become established and take on load. Decreasing this value will launch new variant instances faster.

When you use automatic model tuning for Linear Learner, the internal tuning is ... and the number of parallel models is set to ...

When you use automatic model tuning, the linear learner internal tuning mechanism is turned off automatically. This sets the number of parallel models, num_models, to 1. The algorithm ignores any value that you set for num_models.

A Business Process Outsourcing (BPO) company uses Amazon Polly to translate plaintext documents to speech for its voice response system. After testing, some acronyms and business-specific terms are being pronounced incorrectly. Which approach will fix this issue?

With Amazon Polly's custom lexicons or vocabularies, you can modify the pronunciation of particular words, such as company names, acronyms, foreign words, and neologisms (e.g., "ROTFL", "C'est la vie" when spoken in a non-French voice). To customize these pronunciations, you upload an XML file with lexical entries. For example, you can customize the pronunciation of the Filipino word: "Pilipinas" by using the phoneme element in your input XML.

You are building out a machine learning model using multiple algorithms. You are at the point where you feel one of the models is ready for production, but you want to test different variants of the model and compare the inference results in a testing environment before launching into production. What is the simplest way for you to test different model variants before launching into production?

You can deploy multiple variants of a model to the same Amazon SageMaker HTTPS endpoint. This is useful for testing variations of a model in production. For example, suppose that you've deployed a model into production. You want to test a variation of the model by directing a small amount of traffic, say 5%, to the new model. To do this, create an endpoint configuration that describes both variants of the model.
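
A minimal boto3 sketch of an endpoint configuration with two weighted variants (all names, instance types, and weights are hypothetical):

    import boto3

    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(
        EndpointConfigName="ab-test-config",  # hypothetical
        ProductionVariants=[
            {
                "VariantName": "production",
                "ModelName": "model-current",  # existing model (assumed)
                "InstanceType": "ml.m5.large",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 0.95,  # 95% of traffic
            },
            {
                "VariantName": "challenger",
                "ModelName": "model-new",      # variant under test (assumed)
                "InstanceType": "ml.m5.large",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 0.05,  # 5% of traffic
            },
        ],
    )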

Your company uses S3 for storing data collected from a variety of sources. The users are asking for a feature similar to a trash can or recycle bin. Deleted files should be available for restore for up to 30 days. How would you implement this?

You can enable S3 versioning to keep the older versions of objects, and create lifecycle policies to remove noncurrent versions after 30 days.
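
A minimal boto3 sketch of both settings (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-data-bucket"  # hypothetical

    # Keep old versions of objects when they are overwritten or deleted
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
    )

    # Permanently remove noncurrent ("deleted") versions after 30 days
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-noncurrent-after-30-days",
                    "Status": "Enabled",
                    "Filter": {},
                    "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
                }
            ]
        },
    )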

You have been tasked with transforming highly sensitive data using AWS Glue. Which of the following AWS Glue settings allow you to control encryption for your transformation process? (Choose three)

- Data Catalog encryption: you can encrypt the metadata objects in your AWS Glue Data Catalog, in addition to the data written to Amazon Simple Storage Service (Amazon S3) and Amazon CloudWatch Logs by jobs, crawlers, and development endpoints. You can enable encryption of the entire Data Catalog in your account.
- Security configurations: when you create jobs, crawlers, and development endpoints in AWS Glue, you can provide encryption settings, such as a security configuration, to configure encryption for that process. With AWS Glue, you can encrypt data using keys that you manage with AWS Key Management Service (AWS KMS); with encryption enabled, AWS KMS keys are used to write data at rest.
- SSL for JDBC connections: you can configure AWS Glue to access Java Database Connectivity (JDBC) data stores only through a trusted Secure Sockets Layer (SSL) protocol.

You are streaming clickstream data. You want to find the areas of a given website that most customers are interested in. How can you find that?

You can use HOTSPOTS from Kinesis Data Analytics to identify dense areas in the data (activity in some regions that might be higher than the norm).

The performance of your Linear Learner training process is way too slow. What might you do to speed up the training process but not sacrifice accuracy?

You can usually realize an increase in training performance by moving from CSV to recordIO-protobuf format, as this allows for the data to be streamed or piped from S3, rather than fully downloaded to the training instance.

For SageMaker instances with direct internet access, Amazon SageMaker provides ...

a network interface that allows the notebook to talk to the internet through a VPC managed by the service

For SageMaker instances where internet access is disabled, how does Amazon SageMaker connect externally?

You won't be able to train or host models unless your VPC has an interface endpoint (PrivateLink) or a NAT gateway and your security groups allow outbound connections.

