ACloudGuru Practice Questions


schema

S3 has no fixed ____________________.

1

10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000 KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefore, ___ shard is enough to handle the streaming data.

Java Lambda

The KPL must be installed as a _____ application before it can be used with your Kinesis Data Streams. That said, there are ways to process KPL-serialized data within AWS _____ in Java, Node.js, and Python.

c

A ML specialist is working for a bank and trying to determine if credit card transactions are fraudulent or non-fraudulent. The features of the data collected include things like customer name, customer type, transaction amount, length of time as a customer, and transaction type. The transaction type is classified as 'normal' and 'abnormal'. What data preparation action should the ML specialist take? a. Drop the length of time as a customer and perform label encoding on the transaction type before training the model. b. Drop the transaction type and perform label encoding on the customer type before training the model. c. Drop the customer name and perform label encoding on the transaction type before training the model. d. Drop both the customer type and the transaction type before training the model.

rule of thumb

A heuristic is a mental "______" that provides guidance but does not always ensure a consistent outcome.

c

A term frequency-inverse document frequency (tf-idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences: { Hello world } and { Hello how are you }. What are the dimensions of the tf-idf vector/matrix? a. (5, 9) b. (2, 5) c. (2, 9) d. (2, 6) e. (2, 10)
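A quick way to sanity-check the answer with scikit-learn (the library choice is an assumption, not part of the original question):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["Hello world", "Hello how are you"]
    # unigrams + bigrams; the vectorizer lowercases text by default
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(corpus)
    print(matrix.shape)                        # (2, 9): 2 sentences x (5 unigrams + 4 bigrams)
    print(vectorizer.get_feature_names_out())  # the 9 unique unigrams/bigrams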

format

AWS Glue makes it super simple to transform data from one _______ to another. You can simply create a Job that takes in data defined within the Data Catalog and outputs it in any of the following formats: avro, csv, ion, grokLog, json, orc, parquet, glueparquet, xml.
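A minimal AWS Glue (PySpark) sketch of such a conversion job; the database, table, and S3 path names are hypothetical placeholders:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    # read a table already defined in the Data Catalog (names are placeholders)
    frame = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
    # write it back out to S3 in a different format, e.g. Parquet
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/parquet-output/"},
        format="parquet",
    )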

custom classifier

AWS Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many database systems. If AWS Glue cannot determine the format of your input data, you will need to set up a _______________________________________________ that helps the AWS Glue crawler determine the schema of your input data.

SQL

Although Amazon DynamoDB provides key-value access and consistent reads, it does not support ____ based queries.

conda pip

Amazon SageMaker notebook instances come with multiple environments already installed. These environments contain Jupyter kernels and Python packages including: scikit-learn, Pandas, NumPy, TensorFlow, and MXNet. You can also install your own environments that contain your choice of packages and kernels. This is typically done using _____ install or ____ install.

steps

An algorithm is a specific set of ____ intended to solve a specific problem with consistency.

d

An organization needs to store a mass amount of data in AWS. The data has a key-value access pattern, developers need to run complex SQL queries and transactions, and the data has a fixed schema. Which type of data store meets all of their needs? a. Athena b. DynamoDB c. S3 d. RDS

error

Stochastic Gradient Descent (SGD) is an optimization technique that seeks to find the minimal ______ of a cost function. This can be analogous to trying to find the lowest point on a landscape.
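A toy illustration of the idea in Python (plain gradient descent on a one-dimensional error surface, not SGD proper):

    # minimize the error f(w) = (w - 3)^2 by repeatedly stepping downhill
    w = 0.0
    learning_rate = 0.1
    for _ in range(100):
        gradient = 2 * (w - 3)      # slope of the error surface at w
        w -= learning_rate * gradient
    print(round(w, 4))              # converges toward 3, the lowest point on the "landscape"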

Spark

Apache ____ can be used as an ETL tool to preprocess data and then integrate it directly with Amazon SageMaker for model training and hosting.

a d e

Choose the scenarios in which one-hot encoding techniques are NOT a good idea. (Choose 3) a. When our algorithm expects numeric input and we have thousands of nominal categorical values. b. When our algorithm expects numeric input and we have few nominal categorical values. c. When our values cannot be ordered in any meaningful way, there are only a few to choose from, and our algorithm expects numeric input. d. When our algorithm expects numeric input and we have ordinal categorical values. e. When our algorithm accepts numeric input and we have continuous values.
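A small pandas sketch of one-hot encoding a nominal feature with only a few values (the scenario where it works well); the column and values are made up:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
    # each nominal value becomes its own 0/1 column; fine for a few values,
    # but thousands of distinct values would explode the feature count
    encoded = pd.get_dummies(df, columns=["color"])
    print(encoded)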

DocumentDB

Currently, _________________________ is not supported as an input data store for AWS Glue Crawlers.

time-series, multiple

DeepAR is a supervised forecasting algorithm used with _______ data. DeepAR seeks to be better than traditional time-series algorithms by accommodating _______ cross-sectional datasets.

outliers

Standardize your data to deal with ______

Bar Charts, Line Charts

For comparisons, use ______ and ______

normalize

If you don't think you have outliers, do you standardize or normalize your data?

1

If you have multiple sentences but remove punctuation before applying n-grams, they are treated as ___ sentence (so the first tf-idf dimension will be 1).

Ordinal

If your data is ______ (order does matter) we cannot use one-hot encoding techniques. We need to map these values to some values that have scale or we just train our model with different encodings and see which encoding works best.

S3

If your on-prem Hadoop cluster has Spark, you can use the SageMaker Spark Library to convert Spark DataFrame format into protobuf and load onto ____. From there, you can use SageMaker as normal.

d

In general within your dataset, what is the minimum number of observations you should have compared to the number of features? a. 10,000 times as many observations as features. b. 100 times as many observations as features. c. 1000 times as many observations as features. d. 10 times as many observations as features.

D

In what scenario is the DeepAR algorithm best suited? A. Decide whether to extend a credit card offer to a potential customer. B. Determine the correlation between a person's diet and energy levels. C. Predict whether a football team will score a certain number of points in a match. D. Predict future sales of a new product based on historic sales of similar products. E. Provide the certainty that a given picture includes a human face.

SQL

Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time ______ queries on your data.

DeliveryStreamName, Record

Kinesis Data Firehose is used as a delivery stream. We do not have to worry about shards, partition keys, etc. All we need is the Firehose _______________________ and the _______ object (which contains the data).
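A minimal boto3 sketch of the call, assuming a hypothetical delivery stream name:

    import json
    import boto3

    firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName="clickstream-delivery",   # hypothetical stream name
        Record={"Data": json.dumps({"event": "click"}).encode("utf-8")},
    )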

S3

Kinesis Data Streams and Kinesis Data Analytics cannot write data directly to ____.

Firehose

Kinesis Data ______ is used as the main delivery mechanism for outputting data into S3.

asynchronous

Kinesis Producer Library (KPL) implements an __________________________ send function, so it works for less critical events

Producer Library

Kinesis ________________ (KPL) makes it easy to integrate a retry mechanism, aggregate records to improve throughput, and automatically submit CloudWatch metrics. Technically, the Kinesis API (AWS SDK) can do this too, but the KPL is easier.

streams

Kinesis ____________________ allows you to stream data into AWS and build custom applications around that streaming data.

less, EMR, more

Since AWS Glue is fully managed, it requires ____ configuration and setup than would have to be done on EMR. If we have mass amounts of data that need processing and AWS Glue is too slow or too expensive, an alternative would be to use an ____ cluster with the appropriate frameworks installed. Depending on your workload size and needs, EMR can be cheaper, but it requires much ____ configuration and setup than the fully managed AWS Glue service.

unstructured

Since PDFs have no real structure to them, like key-value pairs or column names, they are considered ____________________________ data.

Box plot, scatter plot, histogram

The 3 charts that show distribution are _____, ____ and ____

KPL

The ____ can incur an additional processing delay over the Kinesis API of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiency and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly to call the Kinesis API.

sources, formats

The benefit of using the Data Catalog (over the Hive Metastore) is that it provides a unified metadata repository across a variety of data ___________ and data ____________________, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore. We can simply run a Hive script to query tables and output that data in CSV (or other formats) into S3. Once that data is on S3, we can crawl it to create a Data Catalog of the Hive Metastore or import the data directly from S3.

D

To remove inconsistency in a process, you have created a very specific step-by-step process for patching servers. How would this best be described? A. The procedure is a heuristic as it will yield a consistent output. B. The procedure is an algorithm as it will allow the person performing the upgrades to choose the best path. C. The procedure is a heuristic as it prevents the person performing the upgrades from making unintended mistakes. D. The procedure is an algorithm as it will yield a consistent output. E. The procedure is a work-around until a supervised learning process can be implemented which will remove the human from the process.

target

To run unsupervised learning algorithms that don't have a ____, specify the number of label columns in the content type, e.g., label_size=0.
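For example, with the SageMaker Python SDK this can be expressed on the training channel (the S3 path below is hypothetical):

    from sagemaker.inputs import TrainingInput

    train_input = TrainingInput(
        s3_data="s3://my-bucket/train/",        # hypothetical path
        content_type="text/csv;label_size=0",   # CSV with no target/label column
    )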

True

True or False. If you have mission critical data that must be processed with as minimal delay as possible, you should use the Kinesis API (AWS SDK) over the Kinesis Producer Library.

A

We are analyzing the following text { Hello cloud gurus! Keep being awesome! }. We apply lowercase transformation, remove punctuation and n-gram with a sliding window of 3. What are the unique trigrams produced? What are the dimensions of the tf-idf vector/matrix? a. ['cloud gurus keep', 'gurus keep being', 'hello cloud gurus', 'keep being awesome'] and (1, 4) b. ['hello cloud gurus', 'cloud gurus keep', 'keep being awesome'] and (1, 3) c. ['hello cloud gurus', 'cloud gurus keep', 'gurus keep being', 'keep being awesome'] and (2, 4) d. ['hello cloud gurus!', 'cloud gurus keep', 'gurus keep being', 'keep being awesome.'] and (1, 4)
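The same result can be reproduced with scikit-learn (a tooling assumption, not part of the original question):

    from sklearn.feature_extraction.text import CountVectorizer

    text = ["Hello cloud gurus! Keep being awesome!"]
    # the vectorizer lowercases and drops punctuation by default; (3, 3) keeps only trigrams
    vectorizer = CountVectorizer(ngram_range=(3, 3))
    matrix = vectorizer.fit_transform(text)
    print(vectorizer.get_feature_names_out())  # the 4 unique trigrams
    print(matrix.shape)                        # (1, 4)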

A B

We are running a training job over and over again using slightly different, very large datasets as an experiment. Training is taking a very long time with your I/O-bound training algorithm and you want to improve training performance. What might you consider? (Choose 2) A. Make use of pipe mode to stream data directly from S3. B. Convert the data format to protobuf recordIO format. C. Use the SageMaker console to change your training job instance type from an ml.c5.xlarge to a r5.xlarge. D. Convert the data format to an Integer32 tensor. E. Make use of file mode to stream data directly from S3.

e

We are using a CSV dataset for unsupervised learning that does not include a target value. How should we indicate this for training data as it sits on S3? A. SageMaker will automatically detect the data format for supervised learning algorithms. B. Enable pipe mode when we initiate the training run. C. Include a reserved word metadata key of "ColumnCount" for the S3 file and set it to the number of columns. D. CSV data format should not be used for unsupervised learning algorithms. E. Include label_size=0 appended to the Content-Type key.

C

We are using a k-fold method of cross-validation for our linear regression model. What outcome will indicate that our training data is not biased? A. K-fold is not appropriate for use with linear regression problems. B. Each subsequent k-fold validation round has an increasing accuracy rate over the one prior. C. All k-fold validation rounds have roughly the same error rate. D. Bias is not a concern with linear regression problems as the error function resolves this. E. Each subsequent k-fold validation round has a decreasing error rate over the one prior.

scatter plot, bubble chart

What 2 visualizations show relationships?

B C

What are the programming languages offered in AWS Glue for Spark job types? (Choose 2) A. Java B. Python C. Scala D. R E. C#

b c d f

What are your options for storing data into S3? (Choose 4) a UPLOAD command b AWS CLI c The AWS console d AWS SDK e PutRecords API call f UNLOAD command
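As an illustration of the AWS SDK option, a minimal boto3 upload (bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-ml-data",            # hypothetical bucket
        Key="raw/records.csv",          # hypothetical key
        Body=b"col1,col2\n1,2\n",
    )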

rewrite

When creating AWS Glue jobs, if you select the Spark job type you would have to _____ your code in PySpark or Scala instead of copying and pasting it into a Python shell job.

Python

When creating AWS Glue jobs you can select _______ shell as the job type, which allows you to use several built-in Python libraries that most Data Scientists and ML Specialists are used to using. If you choose the Spark job type, you would have to rewrite your code in PySpark or Scala instead of copying and pasting it into a Python shell job.
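A hedged boto3 sketch of creating such a Python shell job (the job name, role ARN, and script location are placeholders):

    import boto3

    glue = boto3.client("glue")
    glue.create_job(
        Name="transform-s3-data",                                # hypothetical job name
        Role="arn:aws:iam::123456789012:role/GlueJobRole",       # placeholder role
        Command={
            "Name": "pythonshell",                               # Python shell job type
            "ScriptLocation": "s3://my-bucket/scripts/transform.py",
            "PythonVersion": "3",
        },
    )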

C

When you issue a CreateModel API call using a built-in algorithm, which of the following actions would be next? A. SageMaker launches an appropriate inference container for the algorithm selected from the global container repository. B. SageMaker provisions an EC2 instance using the appropriate AMI for the algorithm selected from the regional container registry. C. SageMaker launches an appropriate inference container for the algorithm selected from the regional container repository. D. SageMaker provisions an EMR cluster and prepares a Spark script for the training job. E. SageMaker launches an appropriate training container for the algorithm selected from the regional container repository. F. SageMaker provisions an EC2 instance using the appropriate AMI for the algorithm selected from the global container registry.

Stacked Area Chart, Stacked Bar Chart, Pie Chart

Which 3 visualizations help show composition?

C

Which best describes SGD in common terms? A. Calculate the linear distance between arrows shot into a target to determine accuracy. B. Ensure that our sample size in a traffic study has at least 30 drivers. C. Seek to find the lowest point in elevation on a landscape. D. Attempt to find the most efficient path to deliver packages to multiple destinations.

B D E

Which of the following are good candidate problems for using XGBoost? (Choose 3) A. Create a policy that will guide an autonomous robot through an unknown maze. B. Evaluate handwritten numbers on a warranty card to detect what number they represent. C. Map a text string to an n-gram vector. D. Providing a ranking of search results on an e-commerce site customized to a customer's past purchases. E. Deciding whether a transaction is fraudulent or not based on various details about the transaction.

C D

Which of the following is an example of unsupervised learning? (Choose 2) A. Using Seq2Seq to extract a text string from a segment of a recorded speech. B. Using XGBoost to predict the selling price of a house in a particular market. C. Using K-Means to cluster customers into demographic segments. D. Using NTM to extract topics from a set of scientific journal articles. E. Using a Factorization Machine to provide book recommendations.

A

Which of the following might be used to focus a model on most relevant features? A. PCA B. XGB C. NLM D. LDA E. AQS F. K-NN

C D

Which of these examples would be considered as introducing bias into a problem space? (Choose 2) A. Filtering out outliers in a dataset which are greater than 4 standard deviations outside the mean. B. Deciding to use a supervised learning method to estimate missing values in a dataset. C. Removing records from a set of customer reviews that were not fully complete. D. Failing to randomize a dataset even though you were told it was already random. E. Omitting records before a certain date in a forecasting problem.

A

Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submit CloudWatch metrics? a Kinesis Producer Library (KPL) b Kinesis Consumer Library c Kinesis API (AWS SDK) d Kinesis Client Library (KCL)

b

Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs? a Kinesis Firehose b Kinesis Streams c Kinesis Data Analytics d Kinesis Video Streams

b

Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools? a. Kinesis Video Streams b. Kinesis Firehose c. Kinesis Streams d. Kinesis Data Analytics

F G

Which visualizations help show comparisons? (Choose 2) A. Stacked bar chart B. Scatter plot C. Stacked area chart D. Histogram E. Bubble chart F. Bar chart G. Line chart

A D G

Which visualizations help show distribution? (Choose 3) A Box plot B Stacked area chart C Line chart D Scatter chart E Stacked bar chart F Bubble chart G Histogram

A C F

Which visualizations help show relationships? (Choose 2) A. Bar chart B. Pie chart C. Stacked area chart D. Scatter plot E. Stacked bar chart F. Histogram G. Bubble chart

D F

Which visualizations help show relationships? (Choose 2) a. Bar chart b. Pie chart c. Stacked area chart d. Scatter plot e. Stacked bar chart f. Histogram g. Bubble chart

B

While using K-Means, what does it mean if we pass in k=4 as a hyperparameter? A. We want the algorithm to return the top 4 results. B. We want the algorithm to group into 4 clusters. C. We want the algorithm to group into clusters of no more than 4 samples each. D. We want the algorithm to use 4 as the cutoff value for classification purposes. E. We want the algorithm to classify into 4 groups.
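Illustrated here with scikit-learn's KMeans (SageMaker's built-in K-Means takes the equivalent hyperparameter k):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 2)
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)  # 4 clusters
    print(sorted(set(kmeans.labels_)))  # [0, 1, 2, 3] -- each sample assigned to one of 4 groups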

D

You are a ML specialist building a regression model to predict the amount of rainfall for the upcoming year. The data you have contains 18,000 observations collected over the last 50 years. Each observation contains the date, amount of rainfall (in cm), humidity, city, and state. You plot the values in a scatter plot for a given day and amount of rainfall. After plotting points, you find a large grouping of values around 0 cm and 0.2 cm. There is a small grouping of values around 500 cm. What are the reasons for each of these groupings? What should you do to correct these values? A. The groupings around 0 cm are days that had no rainfall, the groupings around 0.2 cm are days where it rained, the groupings around 500 cm are days where it snowed. The values should be used as is. B. The groupings around 0 cm and 0.2 cm are extremes and should be removed. The values around 500 cm should be normalized and used once normalized. C. The groupings around 0 cm are days that had no rainfall, the groupings around 0.2 cm are days where it rained, the groupings around 500 cm are outliers. The values around 500 cm should be normalized so they are on the same scale as the other values. D. The groupings around 0 cm are days that had no rainfall, the groupings around 0.2 cm are days where it rained, the groupings around 500 cm are outliers. The values around 500 cm should be dropped and the other values should be used as is.

D

You are a ML specialist designing a regression model to predict the sales for an upcoming festival. The data from the past consists of 1,000 records containing 20 numeric attributes. As you start to analyze the data, you discovered that 30 records have values that are in the far left of a box plot's lower quartile. The festival manager confirmed that those values are unusual, but plausible. There are also 65 records where another numerical value is blank. What should you do to correct these problems? A. Drop the unusual records and replace the blank values with separate Boolean values. B. Use the unusual data and replace the missing values with a separate Boolean variable. C. Drop the unusual records and fill in the blank values with 0. D. Drop the unusual records and replace the blank values with the mean value.

a

You are collecting clickstream data from an e-commerce website using Kinesis Data Firehose. You are using the PutRecord API from the AWS SDK to send the data to the stream. What are the required parameters when sending data to Kinesis Data Firehose using the API PutRecord call? a. DeliveryStreamName and Record (containing the data) b. Data, PartitionKey, StreamName c. DataStreamName, PartitionKey, and Record (containing the data) d. Data, PartitionKey, StreamName, ShardId

a

You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort? a. Set up a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, set up a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB. b. Create a Kinesis Data Stream to ingest the data. Next, set up a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose. c. Set up a Kinesis Data Stream for data ingestion, and set up EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB. d. Set up Kinesis Data Streams for data ingestion. Next, set up Kinesis Data Firehose to load that data into Redshift. Next, set up a Lambda function to query data using Redshift Spectrum and store the results onto DynamoDB.

C

You are a ML specialist preparing a dataset for a supervised learning problem. You are using the Amazon SageMaker Linear Learner algorithm. You notice the target label attributes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire dataset is less than 5%. What should you do to minimize bias due to missing values? a First normalize the non-missing values then replace the missing values with the normalized values. b Replace the missing values with mean or median values from the other values of the same feature. c For each feature that is missing values, use supervised learning to approximate the values based on the other features. d Drop all of the rows that contain missing values because they represent less than 5% of the data.
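One simple way to approximate missing values from the other features is a nearest-neighbor imputer, sketched here with scikit-learn (the columns and values are made up, and KNN imputation is just a stand-in for the "learn from other features" idea):

    import pandas as pd
    from sklearn.impute import KNNImputer

    df = pd.DataFrame({"age": [25, 32, None, 41], "income": [40, 55, 50, None]})
    imputer = KNNImputer(n_neighbors=2)   # estimates each missing value from similar rows
    filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    print(filled)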

C

You are a ML specialist preparing some labeled data to help determine whether a given leaf originates from a poisonous plant. The target attribute is poisonous and is classified as 0 or 1. The data that you have been analyzing has the following features: leaf height (cm), leaf length (cm), number of cells (trillions), poisonous (binary). After initial analysis you do not suspect any outliers in any of the attributes. After using the data given to train your model, you are getting extremely skewed results. What technique can you apply to possibly help solve this issue? A. Standardize the number of cells attribute. B. Drop the number of cells attribute. C. Normalize the number of cells attribute. D. Apply one-hot encoding to each of the attributes, except for the poisonous attribute (since it is already encoded).
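A minimal normalization sketch with scikit-learn (the values are made up):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # hypothetical "number of cells" values, orders of magnitude larger than the other features
    cells = np.array([[9.9e11], [1.2e12], [3.4e12]])
    scaled = MinMaxScaler().fit_transform(cells)  # rescales the column into the 0-1 range
    print(scaled)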

a

You are a ML specialist that has been tasked with setting up a transformation job for 900 TB of data. You have set up several ETL jobs written in PySpark on AWS Glue to transform your data, but the ETL jobs are taking a very long time to process and it is extremely expensive. What are your other options for processing the data? a Create an EMR cluster with Spark, Hive, and Flink to perform the ETL jobs. Tweak cluster size, instance types, and data partitioning until performance and cost satisfaction is met. b Offload the data to Redshift and perform transformation from Redshift rather than S3. Set up AWS Glue jobs to use Redshift as the input data store, then run ETL jobs on batches of Redshift data. Adjust the batch size until performance and cost satisfaction is met. c Change the job type to Python shell and use built-in libraries to perform the ETL jobs. The built-in libraries perform better than Spark jobs and are a fraction of the cost. d Create a Kinesis Data Stream to stream the data to multiple EC2 instances, each performing partitioned workloads and ETL jobs. Tweak cluster size, instance types, and data partitioning until performance and cost satisfaction is met.

C

You are a ML specialist that has been tasked with setting up an ETL pipeline for your organization. The team already has an EMR cluster that will be used for ETL tasks and needs to be directly integrated with Amazon SageMaker without writing any specific code to connect EMR to SageMaker. Which framework allows you to achieve this? a Apache Hive b Apache Pig c Apache Spark d Apache Mahout e Apache Flink

D

You are a ML specialist who has a Python script using libraries like Boto3, Pandas, NumPy, and sklearn to help transform data that is in S3. On your local machine the data transformation is working as expected. You need to find a way to schedule this job to run periodically and store the transformed data back into S3. What is the best option to use to achieve this? A. Create an EMR cluster that runs Apache Spark code to transform and store data in S3. Then set up this job to run on some schedule. B. Create an AWS Glue job that uses Spark as the job type to create PySpark code to transform and store data in S3. Then set up this job to run on some schedule. C. Create an AWS Glue job that uses Spark as the job type to create Scala code to transform and store data in S3. Then set up this job to run on some schedule. D. Create an AWS Glue job that uses Python shell as the job type and executes the code written to transform and store data in S3. Then set up this job to run on some schedule.

d

You are a ML specialist who is setting up a ML pipeline. The amount of data you have is massive and needs to be set up and managed on a distributed system to efficiently run processing and analytics on. You also plan to use tools like Apache Spark to process your data to get it ready for your ML pipeline. Which setup and services can most easily help you achieve this? a. Multi AZ RDS Read Replicas with Apache Spark installed. b. Redshift out-performs Apache Spark and should be used instead. c. Self-managed cluster of EC2 instances with Apache Spark installed. d. Elastic Map Reduce (EMR) with Apache Spark installed.

b

You are a ML specialist who is working within SageMaker analyzing a dataset in a Jupyter notebook. On your local machine you have several open-source Python libraries that you have downloaded from the internet using a typical package manager. You want to download and use these same libraries on your dataset in SageMaker within your Jupyter notebook. What options allow you to use these libraries? a. Upload the library in .zip format into S3 and use the Jupyter notebook in SageMaker to reference S3 bucket with Python libraries. b. Use the integrated terminals in SageMaker to install libraries. This is typically done using conda install or pip install. c. SSH into the Jupyter notebook instance and install needed libraries. This is typically done using conda install or pip install. d. SageMaker offers a wide variety of built-in libraries. If the library you need is not included, contact AWS support with details on libraries needed for distribution.

b

You are a ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache logs files stored in S3. Which set of tools can help you achieve this with the LEAST amount of effort? a. Data Pipeline and RDS b. AWS Glue Data Catalog and Athena c. Data Pipeline and Athena d. Redshift and Redshift Spectrum

C

You are consulting for a mountain climbing gear manufacturer and have been asked to design a machine learning approach for predicting the strength of a new line of climbing ropes. Which approach might you choose? A. You would choose a multi-class classification approach to classify the rope into an appropriate price range. B. You would approach the problem as a linear regression problem to predict the tensile strength of the rope based on other ropes. C. You would recommend they do not use a machine learning model. D. You would choose a binary classification approach to determine if the rope will fail or not. E. You would choose a simulation-based reinforcement learning approach.

c

You are trying to set up a crawler within AWS Glue that crawls your input data in S3. For some reason after the crawler finishes executing, it cannot determine the schema from your data and no tables are created within your AWS Glue Data Catalog. What is the reason for these results? a. The checkbox for 'Do not create tables' was checked when setting up the crawler in AWS Glue. b. The bucket path for the input data store in S3 is specified incorrectly. c. AWS Glue built-in classifiers could not find the input data format. You need to create a custom classifier. d. The crawler does not have correct IAM permissions to access the input data in the S3 bucket.

E

You are working for a major research university analyzing data about the professors who teach there. The features within the data contain information like employee id, position, department, job description, salary, and tenure. The tenure attribute is binary 0 or 1, whether the professor has tenure or does not have tenure. You need to find the distribution of professors' salaries. What is the best visualization to use to achieve this? A Line chart B Pie chart C Bubble chart D Scatter chart E Histogram

c

You are working for an organization that takes different metrics about its customers and classifies them with one of the following statuses: bronze, silver, and gold. Depending on their status they get more/less discounts and are placed as a higher/lower priority for customer support. The algorithm you have chosen expects all numerical inputs. What can be done to handle these status values? a. Use one-hot encoding techniques to map values for each status. b. Apply random numbers to each status value and apply gradient descent until the values converge to expected results. c. Experiment with mapping different values for each status and see which works best. d. Use one-hot encoding techniques to map values for each status, dropping the original status feature.
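A simple pandas sketch of one candidate mapping (the numeric scale itself is an assumption you would experiment with):

    import pandas as pd

    df = pd.DataFrame({"status": ["bronze", "silver", "gold", "silver"]})
    # encode the ordered statuses numerically; try different scales and keep what works best
    df["status_encoded"] = df["status"].map({"bronze": 1, "silver": 2, "gold": 3})
    print(df)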

B

You are working on a model that tries to predict the future revenue of select companies based on 50 years of historic data from public financial filings. What might be a strategy to determine if the model is reasonably accurate? A. Use a softmax function to invert the historical data then run the validation job from most recent to earliest history. B. Use a set of the historic data as testing data to back-test the model and compare results to actual historical results. C. Randomize the training data and reserve 20% as a validation set after the training process is completed. D. Use Random Cut Forest to remove any outliers and rerun the algorithm on the last 20% of the data.

D

You have been asked to help develop a vision system for a manufacturing line that will reorient parts to a specific position using a robotic arm. What algorithm might you choose for the vision part of this problem? A. Object Detection B. Image Analysis C. AWS Comprehend D. Semantic Segmentation E. Seq2Seq F. Object2Vec

d

You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is each player's controller inputs every 1 second (up to 10 players in a game) in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data? a 100 shards b Greater than 500 shards, so you'll need to request more shards from AWS c 10 shards d 1 shard

a

You have been tasked with capturing two different types of streaming events. The first event type includes mission critical data that needs to be processed immediately before operations can continue. The second event type includes data of less importance, and operations can continue without it being processed immediately. What is the most appropriate solution to record these different types of events? a. Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL). b. Capture both events with the PutRecords API call. c. Capture both event types using the Kinesis Producer Library (KPL). d. Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the PutRecords API call.

c

You have been tasked with collecting thousands of PDFs for building a large corpus dataset. The data within this dataset would be considered what type of data? a. Relational b. Semi-structured c. Unstructured d. Structured

c

You have been tasked with converting multiple JSON files within a S3 bucket to Apache Parquet format. Which AWS service can you use to achieve this with the LEAST amount of effort? a. Create an EMR cluster to run an Apache Spark job to process the data into Apache Parquet and output the newly formatted files into S3. b. Create a Lambda function that reads all of the objects in the S3 bucket. Loop through each of the objects and convert from JSON to Apache Parquet. Once the conversion is complete, output the newly formatted files into S3. c. Create an AWS Glue Job to convert the S3 objects from JSON to Apache Parquet, then output the newly formatted files into S3. d. Create a Data Pipeline job that reads from your S3 bucket and sends the data to EMR. Create an Apache Spark job to process the data into Apache Parquet and output the newly formatted files into S3.

a

You have been tasked with setting up crawlers in AWS Glue to crawl different data stores to populate your organization's AWS Glue Data Catalogs. Which of the following input data stores is NOT an option when creating a crawler? a. DocumentDB b. JDBC Connections c. RDS d. Redshift e. S3 f. DynamoDB

B

You have launched a training job but it fails after a few minutes. What is the first thing you should do for troubleshooting? a Go to CloudTrail logs and try to identify the error in the logs for your job. b Go to CloudWatch logs and try to identify the error in the logs for your job. c Check to see that your Notebook instance has the proper permissions to access the input files on S3. d Submit the job with AWS X-Ray enabled for additional debug information. e Ensure that your instance type is large enough and resubmit the job in a different region.

C

You want to be sure to use the most stable version of a training container. How do you ensure this? A. Use the path to the global container repository. B. Use the ECR repository located in US-EAST-2. C. Use the :1 tag when specifying the ECR container path. D. Use the :latest tag when specifying the ECR container path.

b c d

You work for a farming company that has dozens of tractors with built-in IoT devices. These devices stream data into AWS using Kinesis Data Streams. The features associated with the data are tractor ID, latitude, longitude, inside temp, outside temp, and fuel level. As a ML specialist you need to transform the data and store it in a data store. Which combination of services can you use to achieve this? (Choose 3) a. Use Kinesis Data Streams to immediately write the data into S3. Next, set up a Lambda function that fires any time an object is PUT onto S3. Transform the data from the Lambda function, then write the transformed data into S3. b. Set up Kinesis Firehose to ingest data from Kinesis Data Streams, then send the data to Lambda. Transform the data in Lambda and write the transformed data into S3. c. Immediately send the data to Lambda from Kinesis Data Streams. Transform the data in Lambda and write the transformed data into S3. d. Set up Kinesis Data Analytics to ingest the data from Kinesis Data Streams, then run real-time SQL queries on the data to transform it. After the data is transformed, ingest the data with Kinesis Data Firehose and write the data into S3. e. Use Kinesis Data Analytics to run real-time SQL queries to transform the data and immediately write the transformed data into S3.

a

You work for an organization that wants to manage all of the data stored in S3. The organization wants to automate the transformation jobs on the S3 data and maintain a data catalog of the metadata concerning the datasets. The solution that you choose should require the least amount of setup and maintenance. Which solution will allow the organization to achieve these goals? a. Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, create an AWS Glue job, and set up a schedule for data transformation jobs. b. Create a cluster in EMR that uses Apache Hive. Then, create a simple Hive script that runs transformation jobs on a schedule. c. Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script that runs transformation jobs on a schedule. d. Create a cluster in EMR that uses Apache Spark. Then, create an Apache Hive metastore and a script that runs transformation jobs on a schedule.

C

Your company currently has a large on-prem Hadoop cluster that contains data you would like to use for a training job. Your cluster is equipped with Mahout, Flume, Hive, Spark, and Ganglia. How might you most efficiently use this data? A. Use Data Pipeline to make a copy of the data in Spark DataFrame format. Upload the data to S3 where it can be accessed by the SageMaker training jobs. B. Using EMR, create a Scala script to export the data to an HDFS volume. Copy that data over to an EBS volume where it can be read by the SageMaker training containers. C. Ensure that Spark is supported on your Hadoop cluster and leverage the SageMaker Spark library. D. Use Mahout on the Hadoop Cluster to preprocess the data into a format that is compatible with SageMaker. Export the data with Flume to the local storage of the training container and launch the training job.

c

Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this? a. The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library. b. The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture. c. The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams. d. The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.

b

Your organization has given you several different sets of key-value pair JSON files that need to be used for a machine learning project within AWS. What type of data is this classified as and where is the best place to load this data into? a. Structured data, stored in RDS. b. Semi-structured data, stored in S3. c. Unstructured data, stored in S3. d. Semi-structured data, stored in DynamoDB.

a

Your organization needs to find a way to capture streaming data from certain events customers are performing. These events are a crucial part of the organization's business development and cannot afford to be lost. You've already set up a Kinesis Data Stream and a consumer EC2 instance to process and deliver the data into S3. You've noticed that the last few days of events are not showing up in S3 and your EC2 instance has been shut down. What combination of steps can you take to ensure this does not happen again? a. Set up CloudWatch monitoring for your EC2 instance as well as Auto Scaling in case your consumer EC2 instance is shut down. Next, ensure that the maximum number of hours (168 hours) is selected for data retention when creating your Kinesis Data Stream. Finally, write logic on the consumer EC2 instance that handles unprocessed data in the Kinesis Data Stream and failed writes to S3. b. Set up CloudWatch monitoring for your EC2 instance as well as Auto Scaling in case your consumer EC2 instance is shut down. Next, set up multiple Kinesis Data Streams to process the data on the EC2 instance. c. Set up CloudWatch monitoring for your EC2 instance as well as Auto Scaling in case your consumer EC2 instance is shut down. Next, send the data to Kinesis Data Firehose before writing the data into S3. Since Kinesis Data Firehose has a retry mechanism built in, the chances of data being lost are extremely low. d. Set up CloudWatch monitoring for your EC2 instance as well as Auto Scaling in case your consumer EC2 instance is shut down. Next, set up a Lambda function to poll the Kinesis Data Stream for failed delivered records and then send those requests back into the consumer EC2 instance.
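A boto3 sketch of extending the stream's retention period as described in option a (the stream name is hypothetical):

    import boto3

    kinesis = boto3.client("kinesis")
    # keep unprocessed records for the maximum window so a consumer outage doesn't lose data
    kinesis.increase_stream_retention_period(
        StreamName="customer-events",       # hypothetical stream name
        RetentionPeriodHours=168,
    )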

CreateModel

The _____________ API call is used to launch an inference container. When using the built-in algorithms, SageMaker will automatically reference the current stable version of the container.
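A hedged boto3 sketch of the call (the image URI, model artifact path, and role ARN are placeholders):

    import boto3

    sagemaker = boto3.client("sagemaker")
    sagemaker.create_model(
        ModelName="my-builtin-algo-model",
        PrimaryContainer={
            "Image": "<account>.dkr.ecr.us-east-1.amazonaws.com/xgboost:1",  # placeholder ECR path
            "ModelDataUrl": "s3://my-bucket/output/model.tar.gz",            # placeholder artifact
        },
        ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",     # placeholder role
    )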

PutRecords

_____________________ is a synchronous send function, so it must be used for critical events

sentences, unique

The tf-idf matrix dimensions are (# of _____, # of _____ words/bigrams/trigrams, etc. across all sentences).

missing

Using supervised learning to predict missing values based on the other feature values is the best way to minimize bias when replacing _____ data.

10

You should have ____ times as many observations as features.

