WhizLabs Practice Questions

binary

A ______ data type for an attribute only has 2 possible values, yes and no.

Categorical

A ___________ data type for an attribute holds a limited set of unique strings (user name, region, etc)

multidimensional regression

A ______________ is used to predict more than one real-valued output (e.g. what are the height and width of the animal in the image?)

multiclass classification

A ______________ solves a classification problem where you have more than one class for your answer (e.g. what type of animal is in the image?)

PR Curve

Another name for a Precision Recall curve is a _______

longer

Bayesian search uses regression to iteratively choose sets of hyperparameters to test. Because of this iterative approach, it cannot run the maximum number of concurrent training jobs without degrading the performance of the search, so it takes _______ than random search.

Rekognition

Face collection contents can cause __________ to fail to recognize a face (e.g. if it only has one face per person)

more

Feature selection to eliminate irrelevant features requires significantly ______ effort than using the random forest algorithm approach to address overfitting

C

For binary classification (complete y/n), a model will produce a score denoting the strength of the prediction and a predicted_label denoting complete or not complete. You are using a "complete?" feature as your prediction response feature and are making predictions on new data. When you interrogate the response of your model, which of the following do you expect to find? A. score: the prediction produced by the model B. score: the prediction produced by the model AND predicted_class, which is an integer from 0 to num_classes-1 C. score: a single floating point number measuring the strength of the prediction AND predicted_label, which is 0 or 1 D. score: the prediction produced by the model OR predicted_label, which is 0 or 1

better

GAN is ________ than SMOTE when it comes to oversampling

inefficient

Using IAM resource-based policies as a security measure for a data lake is an ______ approach because data lakes tend to contain large numbers of buckets and objects

ordinal

If you are validating a time series model, you can't use K-Fold because it randomizes the data (and you lose the _________ nature of it).

regression

If you are trying to solve for a numeric result (e.g. the number of purchases customers will make for each next potential product), you should use a ________ model

No

Is there such a thing as an ImplicitHashKey request parameter?

No Yes

Is there such a thing as a PartitionKeys request parameter? Is there such a thing as a PartitionKey request parameter?

No Yes

Is there such a thing as a ShardID request parameter? Is there such a thing as a ShardID response parameter?

documents

Neural Topic Model is used to organize _________ into topics

speech

Neural Topic Model is used to group documents. It does not work on __________.

Hive

ORC file format provides a highly efficient way to store _____ data

cancer

PR Curve is best used to evaluate models on data sets where most of the cases are negative (such as _________ screening). The true negative cases are not weighted heavily in the equation, thus reducing the impact of the imbalance.
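
To see why true negatives drop out of the picture, here is a minimal plain-Python sketch that computes precision and recall at a couple of thresholds. The labels and scores are made-up illustration data, and `precision_recall_points` is a hypothetical helper, not a library function.

```python
def precision_recall_points(labels, scores, thresholds):
    """Return (threshold, precision, recall) tuples for binary labels (1 = positive)."""
    points = []
    for t in thresholds:
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
        # True negatives appear in neither formula, so a large negative
        # class does not inflate either metric.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((t, precision, recall))
    return points


# Made-up, heavily imbalanced data: 2 positives among 10 cases.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.7, 0.9]
curve = precision_recall_points(labels, scores, [0.5, 0.8])
```

Plotting recall against precision for many thresholds gives the PR curve; the two points here are only enough to show the trade-off.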

tabular

Pandas is the best choice for data wrangling and manipulation of _________ data (such as CSV formatted data)

concurrent

Random Search technique allows you to run the max number of ___________ training jobs without impacting the performance of the search.

format

Scikit-learn is the best Python package to transform raw feature vectors into a _______ suited to downstream estimators

unique

The Generative Adversarial Networks (GAN) technique generates ____________ observations that more closely resemble the real minority observations without being so similar they are almost identical

3

The default Lambda timeout value is ___ seconds. For many Kinesis Data Firehose implementations, 3 seconds is not enough time to execute the transformation function

identical

The SMOTE technique creates new observations from underrepresented classes, and will be mostly _________ to what exists already.
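
The core interpolation step behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the full algorithm: real SMOTE picks the neighbor from the k nearest minority-class neighbors, whereas here the neighbor is simply passed in. `smote_like_sample` and the sample points are hypothetical.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Create one synthetic observation on the segment between two minority points."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)     # seeded so the sketch is repeatable
minority_a = [1.0, 2.0]    # made-up minority-class observations
minority_b = [3.0, 6.0]
synthetic = smote_like_sample(minority_a, minority_b, rng)
```

Because every synthetic point lies on a line segment between two existing observations, the new data stays very close to what is already there.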

Spark

The SageMaker Spark library makes it so you can easily train models using data frames in your _____ clusters

D

Using your company's cameras, you and your team of ML specialists have been contracted by the Wolf Conservation Center of North America to build an ML model to identify and count a specific species of wolf in remote areas of the Arctic Circle. What type of ML problem are you trying to solve? A. Linear regression B. Binary classification C. Multidimensional regression D. Multiclass classification

D

Workers have an image of their face stored in the HR database. You have decided to use AWS Rekognition for your facial recognition solution. On occasion the Rekognition model fails to recognize visitors to the buildings. What could be the source of the problem? A. Face landmarks filters set to max sharpness B. Bounding box and confidence score for face comparison threshold tolerances set to max values C. Confidence threshold tolerance set to the default D. Face collection contents

D

You are a data scientist working for a cancer screening center. The center has gathered data on many patients that have been screened over the years. The data is obviously skewed toward negative results, as most screened patients don't have cancer. You are evaluating several machine learning models to decide which model best predicts true positives when using your cancer screening data. You have split your data into a 70/30 ratio of training set to test set. You now need to decide which metric to use to evaluate your models. Which metric will most accurately determine the model best suited to solve your classification problem? A. ROC Curve B. Precision C. Recall D. PR Curve

A

You are building an ML model for your user behavior prediction problem using your company's user interaction data stored in DynamoDB. You want to get your data into CSV format and load it into an S3 bucket so you can use it for your ML algorithm. Your data needs to be updated automatically in order to produce real-time recommendations. Your business analysts also want to have the ability to run ad hoc queries on your data. Which of the following architectures will be most efficient to do this? A. Use AWS Data Pipeline to coordinate the following set of tasks: export DynamoDB to S3 as JSON; convert JSON to CSV; SageMaker model uses the data to produce real-time predictions; analysts use Athena to perform ad hoc queries against the CSV data in S3 B. Create a custom classifier in an AWS Glue ETL job that extracts the DynamoDB data to CSV format on your S3 bucket; run your SageMaker model on the new data to produce real-time recommendations; analysts use Athena to perform ad hoc queries against the CSV data in S3 C. Use AWS DMS to connect your DynamoDB database and export the data to S3 in CSV format; run your SageMaker model on the new data to produce real-time recommendations; analysts use Athena to perform ad hoc queries against the CSV data in S3 D. Use Kinesis Data Streams to receive the data from DynamoDB; use an ETL job running on an EC2 instance to consume the data and produce the CSV representation; run your SageMaker model on the new data to produce real-time recommendations; analysts use Athena to perform ad hoc queries against the CSV data in S3

A C E

You are creating the data producer application code to take trade data from your trade system and send the trade records to your Kinesis Data Stream. Which of the following are valid put_record request parameters? Select 3. A. Data B. Implicit C. ExplicitHashKey D. PartitionKeys E. SequenceNumberForOrdering F. ShardID
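
As an illustration, the request parameters for a single Kinesis `put_record` call can be laid out as a plain dictionary. The stream name and all values below are made up, and nothing is sent to AWS; the commented-out line shows how the dictionary would be passed to boto3, assuming it is installed and credentials are configured.

```python
# Keyword arguments for one Kinesis Data Streams put_record request.
# All values are hypothetical illustration data.
put_record_kwargs = {
    "StreamName": "trades-stream",             # hypothetical stream name
    "Data": b'{"symbol": "ABC", "qty": 100}',  # the record payload (bytes)
    "PartitionKey": "ABC",                     # singular, and required
    "ExplicitHashKey": "0",                    # optional: overrides the partition key hash
    "SequenceNumberForOrdering": "0",          # optional: orders puts from one client
}

# With boto3 available, the request would be sent like this:
# import boto3
# response = boto3.client("kinesis").put_record(**put_record_kwargs)
# The shard ID comes back in the response; it is not a request parameter.
```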

D

You are deploying your data streaming pipeline for your machine learning environment. Your CloudFormation stack has a Kinesis Data Firehose using the Data Transformation feature where you have configured Firehose to write to your S3 data lake. When you stream data through your Kinesis Firehose, you notice that no data is arriving on your S3 bucket. What might be the problem that is causing the failure? A. Your Lambda memory setting is set to the maximum value allowed B. Your S3 bucket is in the same region as your Kinesis Firehose C. Your Kinesis Data Firehose buffer setting is set to the default value D. Your Lambda timeout value is set to the default value

B

You are looking for a distribution of consultants and their billing hours for the given period. What visualization best describes this relationship? A. Scatter Plot B. Histogram C. Line chart D. Box plot E. Bubble chart

E

You are on the team that is using SageMaker Image Classification machine learning to read and classify license plates by state, and then identify the actual license plate number. Very rarely, cars pass through the toll gates with foreign plates. The outliers must not adversely affect your model's predictions. Which hyperparameter should you set and to what value to ensure your model is not impacted by outliers? A. feature_dim set to 5 B. feature_dim set to 1 C. sample_size set to 10 D. sample_size set to 100 E. learning_rate set to .1 F. learning_rate set to .75

C D F G

You are working on a property foreclosure model to predict potential price drops. You have decided to use the SageMaker Linear Learner algorithm. Data is as follows: Type: condo or house; Bedrooms: 1-4; Area: number of square feet (has missing data); Solar Rating: categories (has missing data); Price: dollar amount; Foreclosed: Y/N. Which of the following SageMaker built-in scikit-learn library transformers would you use to clean and format your data? Select 4. A. StandardScaler to encode the Solar_Rating feature B. OneHotEncoder to encode the Area feature C. SimpleImputer to complete the missing values in the Solar_Rating and Area features D. OneHotEncoder to encode the Type feature E. OrdinalEncoder to complete the missing values in the Solar_Rating and Area features F. OrdinalEncoder to encode the Solar_Rating feature G. LabelBinarizer to encode the Foreclosed feature H. MinMaxScaler to encode the Foreclosed feature

TargetAttributeName

You can assign a _________________ field to the name of the attribute that you are trying to predict (e.g. "Will this user subscribe to my campaign?")

D

You have been tasked to write an AWS Glue job to convert files from JSON to a format that will store Hive data. Which data format is most efficient to convert the data for use with Hive? A. ion B. grokLog C. xml D. orc

C

You have built a data streaming pipeline using Kinesis Data Firehose and S3. Due to the personally identifiable information contained in your data stream, your data must be encrypted in flight and at rest. How should you configure your solution to achieve encryption at rest? A. Encrypt the data at the data consumer application level B. Encrypt the data by configuring Firehose to use S3-managed encryption keys (SSE-S3) C. Encrypt the data by configuring Firehose to use S3 server-side encryption with AWS Key Management Service (SSE-KMS) D. Encrypt the data by configuring Firehose to use S3 server-side encryption with 256-bit AES-GCM with HKDF

B

You have chosen to use S3 to house your data lake. How will you most efficiently protect the data lake, your machine learning data source, against internal threats to data confidentiality and security? A. Create IAM resource-based policies for each data lake S3 bucket resource. Use bucket policies and ACLs to control the resources at the bucket level and at the object level B. Create IAM user policies so that the permissions to access your S3 data lake assets are linked to user roles and permissions. Place your data scientists into IAM groups and assign the user policies to those groups. These policies and permissions will define access to the data processing and analytics services which your data scientists will use. C. Create an access key ID and a secret access key for each internal user of your S3 data lake. Your internal users will then only be able to gain access to your data lake using these keys D. Use the AWS CloudHSM cloud-based hardware security module (HSM) to secure your S3 data lake. Internal users of your data lake will use the encryption keys generated by the CloudHSM module to gain access to the data needed for the machine learning models.

B C E

You have chosen to use SageMaker's automatic model tuning and you have set your objective to validation:precision in your hyperparameter tuning job. How do you pass your tuning job settings into your hyperparameter tuning job? Select 3. A. Define a JSON object and pass it as the value of the HyperParameterConfig to the HyperParameterTuningJob B. Define a JSON object and pass it as the value of the HyperParameterTuningJobConfig to the CreateHyperParameterTuningJob C. In the JSON object, specify the ranges of the hyperparameters you want to tune D. In the JSON object, specify the limits of the hyperparameters you want to tune E. In the JSON object, specify the objective metric for the hyperparameter tuning job F. In the JSON object, specify the MaxSequentialTrainingJobs parameter in the ResourceLimits section
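
For reference, a tuning job configuration of the kind this question describes might look like the sketch below. The top-level field names follow the `HyperParameterTuningJobConfig` structure accepted by `CreateHyperParameterTuningJob`; the hyperparameter names, ranges, and job limits are illustrative assumptions, not recommended values.

```python
import json

# Sketch of a HyperParameterTuningJobConfig payload; all values are illustrative.
tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:precision",  # the objective metric
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 2,
    },
    "ParameterRanges": {                       # the ranges of the hyperparameters to tune
        "ContinuousParameterRanges": [
            {"Name": "learning_rate", "MinValue": "0.001", "MaxValue": "0.1"},
        ],
        "IntegerParameterRanges": [
            {"Name": "mini_batch_size", "MinValue": "100", "MaxValue": "1000"},
        ],
        "CategoricalParameterRanges": [],
    },
}

# Serialized form, e.g. for inspection or for passing to a CLI call.
config_json = json.dumps(tuning_job_config, indent=2)
```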

C

You have created a Glue Crawler that you have configured to crawl the data on S3 and you have written a custom classifier. Unfortunately, the crawler failed to create a schema. Why might the Glue crawler fail in this way? A. You did not add an exclude pattern when you configured the data source B. The IAM role you assigned the crawler to has the AWSGlueServiceRole managed policy attached plus an inline policy that allows read access to the S3 bucket C. All the classifiers returned a certainty of 0.0 D. You chose to create a single schema for each S3 path

B

You have decided to drop one of the categories per feature because you suspect you may have perfectly collinear features. Which of the following is NOT a drop methodology used in the OneHotEncoder transformer? A. None B. Last C. Array D. First
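
The effect of dropping one category per feature can be shown with a plain-Python sketch. This is not scikit-learn's `OneHotEncoder`; `one_hot` is a hypothetical helper that mimics the "drop the first category" behavior so the collinearity point is visible.

```python
def one_hot(values, drop_first=False):
    """One-hot encode category strings; optionally drop the first (sorted) category."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Without dropping, the two columns always sum to 1 (perfectly collinear).
all_cats, full_rows = one_hot(["condo", "house", "condo"])

# Dropping the first category leaves one column that carries the same information.
kept, rows = one_hot(["condo", "house", "condo"], drop_first=True)
```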

D

You have set the predictor_type hyperparameter to binary_classifier. Which loss function hyperparameter is NOT one of your options? A. auto B. logistic C. hinge_loss D. softmax_loss

A E

You have set up a data pipeline delivery stream using Kinesis Data Firehose as your data streaming service and AWS Redshift as your data warehouse. Your researchers have set up the S3 bucket, which you have used for your Kinesis Data Firehose, in their own account. Your researchers need to access the data using BI tools such as QuickSight to build dashboards and use metrics in their research. However, when you implement your solution you notice that your streaming data does not load into your Redshift data warehouse. What could be a reason why this is happening? Choose 2. A. You have not created an IAM role for your Kinesis Firehose to access the S3 bucket B. You defined a cluster security group and associated it with your Redshift cluster C. The access policy associated with your Kinesis Firehose does not have lambda:InvokeFunction specified in the Allow Action section of the Lambda actions D. The access policy associated with your Kinesis Firehose does not have kms:GenerateDataKey specified in the Allow Action section of the KMS actions E. The access policy associated with your Kinesis Firehose does not have s3:PutObjectAcl specified in the Allow Action section of the S3 actions

A

Your management team has leveraged off-shore call center services to reduce costs, but they now want to take advantage of voice recognition to automate many of the most frequent support call types such as "I forgot my password" or "my internet is down". Which SageMaker built-in algorithm is the best choice for this problem? A. Seq2Seq B. K-Means C. Semantic Segmentation D. Neural Topic Model

B

You need to analyze streamed text to find important or relevant repeated common words and phrases and correlate this data to client products. You'll then use these topics in your client product marketing material. Which of the following text feature engineering techniques is the best solution for this task? A. Orthogonal Sparse Bigram (OSB) B. tf-idf C. Bag-of-words D. N-Gram

D F

You plan to use the XGBoost algorithm on the binary classification problem. Which of the following hyperparameters must you use in your tuning jobs if your objective is set to multi:softprob? Select 2 A. alpha B. base_score C. eta D. num_round E. gamma F. num_class

D

You want to use your company's data to predict the ratings distribution of a movie based on the genre of the movie. Your training data consists of a genre feature with a set of categories such as documentary, romance, etc. You have sorted your data by the genre feature and used the ML sequential split option to split your data into training and test datasets. When using your test dataset to verify your genre-prediction model, you discover that the accuracy rate is very low. What could be the underlying problem? A. You should have sorted by a different feature before you used the sequential split option B. You should have split your data categorically by genre C. You should have split your data sequentially by year D. You should not have used the sequential split option

C F G

You work as an ML specialist for an eyewear manufacturing plant where you have used XGBoost to train a model that uses assembly line image data to categorize contact lenses as malformed or correctly formed. You have engineered your data and used CSV as your Training ContentType. You are now ready to deploy your model using SageMaker hosting services. Assuming you used the default configuration settings, which of the following are true statements about your hosted model? (Select 3) A. The training instance class is GPU B. The algorithm is not parallelizable for distributed training C. The training target data should be in the first column with no header D. The training target data should be in the last column of the CSV with no header E. The inference data target should be in the first column of the CSV with no header F. The inference CSV data has no label column G. The training instance class is CPU

D

You work as a machine learning specialist for a financial services company. You are building a ML model to perform futures price prediction. You have trained your model and you now want to evaluate it to make sure it is not overtrained (and can generalize). Which of the following techniques is the most appropriate method to cross validate your machine learning model? A. Leave One Out Cross Validation (LOOCV) B. K-Fold Cross validation C. Stratified cross validation D. Time Series cross validation

D

You work as a machine learning specialist for a healthcare insurance company. Your company wishes to determine which registered plan participants will choose a new health care option your company plans to release. The roll-out plan for the new option is compressed, so you need to produce results quickly. You plan to use a binary classification algorithm on the problem. In order to find the optimal model quickly, you plan to run the max number of concurrent hyperparameter training jobs to reach the best hyperparameter values. Which of the following types of hyperparameter tuning techniques will best suit your needs? A. Bayesian Search B. Hidden Markov Models C. Conditional Random Fields D. Random Search

B

You work at a software company that has developed a popular mobile gaming app. You want to run a predictive model on real-time data generated by the users of the app to see how to structure an upcoming marketing campaign. The data you need for the model is the age of the user, their location, and their activity level (measured in playing time). You need to filter the data for users who are not yet signed up for the premium service. You'll also need to deliver your data in JSON format and convert the playing time into a string format and finally put the data into an S3 bucket. What is the simplest, most cost effective, performant and scalable way to architect this pipeline? A. Create a Kinesis Data Streams application running on an EC2 instance that gathers the mobile user data from its log files; use Kinesis Analytics to transform the log data into the subset you need; connect the Kinesis Data Stream to a Kinesis Firehose which puts the data onto your S3 bucket B. Create a Kinesis Data Streams application running on EC2 instances in an Auto Scaling Group that gathers the mobile user data from its log files; use Kinesis Analytics to transform the log data into the subset you need; connect the Kinesis data stream to a Kinesis Firehose which uses a Lambda function to convert the playing time; Kinesis Firehose then puts the data onto your S3 bucket C. Create a Kinesis Firehose which gathers the data and puts it onto your S3 bucket D. Create a Kinesis Data Streams application running on EC2 instances in an Auto Scaling Group that gathers the mobile user data from its log files and puts the data onto your S3 bucket

C

You work for a manufacturing company that produces retail apparel. You need to determine which product, among a list of potential next products, your company should invest its resources to produce. You decide you need to predict the sales levels of each of the potential next products and select the one with the highest predicted purchase rate. What type of ML approach should you use? A. You are solving a multiclass classification problem and you should use a multinomial logistic regression model B. You are solving a classification problem and you should use a random cut forest model C. You are solving a regression problem and should use a linear regression model D. You are solving a binary classification problem and should use a logistic regression model

A C

You work for a manufacturing plant where you are attempting to use supervised learning to train assembly line image recognition to categorize malformed parts. You have engineered your data and produced a CSV file and placed it on S3. Which of the following input channel specifications are correct for your data? Select 2. A. Metadata Content Type is defined as text/csv B. Metadata Content-Type is identified as text/csv; label_size=0 C. Target value should be in the first column with no header D. Target value should be in the last column with no header E. Target value should be in the first column with a header F. Target value should be in the last column with a header
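
A training CSV in the layout this question describes (target value in the first column, no header row) can be produced with the standard `csv` module. The column names and records below are invented for illustration, and `to_training_csv` is a hypothetical helper.

```python
import csv
import io


def to_training_csv(rows, target_key, feature_keys):
    """Write rows as CSV text with the target first and no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[target_key]] + [row[k] for k in feature_keys])
    return buffer.getvalue()


# Made-up records: 'malformed' is the label, the rest are features.
records = [
    {"malformed": 1, "width": 4.2, "height": 1.1},
    {"malformed": 0, "width": 4.0, "height": 1.0},
]
csv_text = to_training_csv(records, "malformed", ["width", "height"])
```

The resulting text would then be uploaded to S3 and referenced by a training channel whose content type is `text/csv`.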

A

You work for a real estate company where you are building a ML model to predict the prices of houses. You are using a regression decision tree. As you train your model, you see that it is overfitted to your training data and that it doesn't generalize well to unseen data. How can you improve your situation and get better training results in the most efficient way? A. Use a random forest by building multiple randomized decision trees and averaging their outputs to get predictions of the housing prices B. Gather additional training data that gives a more diverse representation of the housing price data C. Use the "dropout" technique to penalize large weights and prevent overfitting D. Use feature selection to eliminate irrelevant features and iteratively train your model until you eliminate the overfitting

B

You work for a retail firm that wishes to conduct a direct mail campaign to attract new customers. Your marketing manager wishes to get answers to questions that can be put into discrete categories such as "using historical customer email campaign responses, should this customer receive an email from our current campaign?" You decide to use the SageMaker Linear Learner algorithm to build your model. Which hyperparameter setting would you use to get the algorithm to produce discrete results? A. set the objective hyperparameter to reg:logistic B. set the predictor_type hyperparameter to binary_classifier C. set the predictor_type hyperparameter to regressor D. set the objective hyperparameter to reg:linear

D

You work for a scientific research company where you need to gather data on tree specimens. You have scientist peers who go out in the field across the globe and photograph tree species. The images that they gather need to be classified and labeled so you can use them in your training datasets in your machine learning models. What is the best way to label your image data more accurately and in the most cost effective manner? A. Hire a human image labeler to process all your images B. Use AWS Rekognition to analyze all your images. For the ones that Rekognition cannot label, have human labelers that you hire attempt to label them C. Use an open source labeling tool like BBox-Label-Tool to process. For the ones the tool cannot label, hire a human. D. Use AWS SageMaker Ground Truth to automatically label your images and use the AWS Ground Truth human labelers to label the images that automatic labeling cannot label

D

Your company has a contract to produce real-time prediction capabilities for fighter jet flight assist software. You are in the development stages and have chosen to use the DeepAR SageMaker built-in deep learning model. You are setting up your Jupyter notebook instance in SageMaker. Which of the following Jupyter notebook settings will allow you to test and evaluate production performance when you are building your models? A. Notebook instance type B. Lifecycle configuration C. Volume Size D. Elastic Inference E. Primary container

B

Your company is trying to use ML to help determine the breed of dogs in the photos your customers tag on Instagram and Twitter. You need to build a ML model to accomplish this problem. Which SageMaker model would you use to best fit your ML problem? A. K-Means B. Linear Learner C. Seq2Seq D. Neural Topic Model

A D F

Your company needs to do complex analysis on its crude and oil chemical compound structures. You have selected an algorithm for your ML model that is not one of the SageMaker built-in algorithms. You have created your model using CreateModel and you have created your HTTPS endpoint. Your Docker container running your model is now ready to receive inference requests for real-time inferences. When SageMaker returns the inference result from a client's request, which of the following are true? (Select 3) A. To receive inference requests, your inference container must have a web server running on port 8080 B. Your inference container must accept GET requests to the /invocations endpoint C. Your container must accept PUT requests to the /inferences endpoint D. AWS SageMaker strips all POST headers except those supported by InvokeEndpoint. SageMaker might add additional headers. Your inference container must be able to safely ignore those additional headers E. Your inference container must accept all POST requests to the /inference endpoint F. Your inference container must accept POST requests to the /invocations endpoint

C

Your company needs to move from traditional translation software to an ML model based approach that produces the translations accurately. One of your first tasks is to take text given in the form of a document and use histograms to measure the occurrence of individual words in the document for use in document classification. Which of the following text feature engineering techniques is the best solution for this task? A. Orthogonal Sparse Bigram (OSB) B. tf-idf C. Bag-of-words D. N-Gram

B C E

Your company wants to build an election prediction model that uses multiple independent variables such as age of voter, religion, sex, etc. to predict the candidate each observed voter will vote for in the upcoming election. Which types of algorithm are NOT a good choice for your prediction? Select 3. A. Ordinary Least Squares Regression (OLSR) B. Local Outlier Factor (LOF) C. Naive Bayes D. Least-Angle Regression (LARS) E. K-Means

A

Your data origins (Canada - 1210, Mexico - 120, US - 68) are imbalanced. In order to address the imbalance in your training data you will need to use a preprocessing step before you create your SageMaker training job. Which technique should you use to address the imbalance? A. Run your training data through a preprocessing script that uses the SMOTE approach B. Run your training data through a Spark pipeline in AWS Glue to one-hot encode the features C. Run your training data through a preprocessing script that uses the feature-split technique D. Run your training data through a preprocessing script that uses the min-max normalization technique

B

Your firm is working on a new quant algorithm to predict when to enter and exit holdings in their portfolio. You are building a ML model to predict these entry and exit points in time. You have cleaned your data and you are now ready to split the data into training and test datasets. Which splitting technique is best suited for your model's requirements? A. Use k-fold cross validation to split the data B. Sequentially split the data C. Randomly split the data D. Categorically split the data

C

Your investment management operations team wants to use data from a data lake to build financial prediction models. You want to use data from the Hadoop cluster in your ML training jobs. Your Hadoop cluster has Hive, Spark, Sqoop, and Flume installed. How can you most effectively load data from your Hadoop cluster into your SageMaker model for training? A. Use the distcp utility to copy your dataset from your Hadoop platform to the S3 bucket where your SageMaker training job can use it B. Use the HadoopActivity command with AWS Data Pipeline to move your dataset from your Hadoop platform to the S3 bucket where your SageMaker training job can use it C. Use the SageMaker Spark library using the data frames in your Spark clusters to train your model D. Use the Sqoop export command to export your dataset from your Hadoop cluster to the S3 bucket where your SageMaker training job can use it

N-Gram

________ is used to find multi-word phrases in the text, but does not weight common words or phrases
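
A minimal sketch of extracting multi-word phrases with a sliding window over whitespace-separated tokens. `ngrams` is a hypothetical helper; note that every phrase is returned with equal standing, with no weighting of common words.

```python
def ngrams(text, n):
    """Return all contiguous n-word phrases in the text, in order."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("my internet is down", 2)
# -> ['my internet', 'internet is', 'is down']
```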

bag-of-words

_________ creates tokens of the input document text and outputs a statistical depiction (like a count) of the text, e.g. a histogram with a count by word.
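
A word-count histogram of this kind can be sketched with the standard library. The tokenization rule (lowercase runs of letters, digits, and apostrophes) is an assumption made for the example, and `word_histogram` is a hypothetical helper.

```python
import re
from collections import Counter


def word_histogram(text):
    """Tokenize the text and count how often each token occurs."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


counts = word_histogram("The cat sat. The cat ran.")
# counts["the"] == 2, counts["cat"] == 2, counts["sat"] == 1, counts["ran"] == 1
```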

tf-idf

_________ determines how important a word is in a document by giving weights to words that are common and less common in a document.
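
The weighting idea can be illustrated with one common, unsmoothed formulation: term frequency within a document multiplied by the log of the inverse document frequency. Libraries typically use smoothed variants, so treat this as a sketch under that assumption. `term_weights` and the sample documents are hypothetical.

```python
import math


def term_weights(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = {}
    for tokens in tokenized:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for tokens in tokenized:
        scores = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)       # frequency within this document
            idf = math.log(n_docs / doc_freq[term])     # rarity across all documents
            scores[term] = tf * idf
        weights.append(scores)
    return weights


docs = ["the router is down", "the password is forgotten"]
weights = term_weights(docs)
```

Words that appear in every document ("the", "is") get a weight of zero, while words specific to one document ("router", "password") get a positive weight.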

sample_size

_________ hyperparameters are applicable to KNN algorithms

Seq2Seq

__________ takes audio as input data and generates a sequence of tokens (such as words in the audio) that can be used to provide automated responses to user requests.

OrdinalEncoder

___________ transformer encodes categorical features as an integer array, but does not complete missing values
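
A plain-Python sketch of the integer-encoding idea (not scikit-learn's `OrdinalEncoder` itself). `ordinal_encode` and the category order are hypothetical; the point is that a missing value is not filled in, it simply fails to encode.

```python
def ordinal_encode(values, ordered_categories):
    """Map each category to its position in the given ordering."""
    index = {category: i for i, category in enumerate(ordered_categories)}
    # A value that is missing or unknown raises KeyError; nothing is imputed.
    return [index[v] for v in values]


encoded = ordinal_encode(["low", "high", "medium"], ["low", "medium", "high"])
# -> [0, 2, 1]
```

Missing values therefore need a separate imputation step before encoding.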

Time Series Cross Validation

____________ technique uses forward chaining where the origin of the forecast moves forward in time.
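
Forward chaining can be sketched as a series of splits in which the training window always ends just before the test window, and both move forward in time. `forward_chaining_splits` is a hypothetical helper with an expanding training window; other variants (for example a fixed-size sliding window) exist.

```python
def forward_chaining_splits(n_samples, n_splits, test_size):
    """Return (train_indices, test_indices) pairs; training data always precedes test data."""
    splits = []
    for k in range(n_splits):
        test_end = n_samples - (n_splits - 1 - k) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train = list(range(0, test_start))
        test = list(range(test_start, test_end))
        splits.append((train, test))
    return splits


splits = forward_chaining_splits(n_samples=6, n_splits=3, test_size=1)
# -> [([0, 1, 2], [3]), ([0, 1, 2, 3], [4]), ([0, 1, 2, 3, 4], [5])]
```

Unlike a shuffled K-Fold split, no fold ever trains on observations that come after the ones it is tested on.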

StandardScalerTransformer

_____________ is used to standardize features by removing the mean and scaling to unit variance.
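
The standardization itself is simple arithmetic: subtract the mean, then divide by the standard deviation. The sketch below uses the population standard deviation and a hypothetical `standardize` helper on made-up values.

```python
import statistics


def standardize(values):
    """Remove the mean and scale to unit variance (population standard deviation)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scaled = standardize([1.0, 2.0, 3.0])
```

After scaling, the values have a mean of 0 and a standard deviation of 1, which keeps features with large raw ranges from dominating.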

Random forest algorithm

________________ can increase the prediction accuracy and prevent overfitting that occurs with a single decision tree.

feature_dim

feature_dim hyperparameters are applicable to the K-Means and KNN algorithms

outliers

The learning_rate hyperparameter governs how quickly the model adapts to new or changing data and ranges from 0 to 1. Setting a low value (like .1) will make it learn more slowly and be less sensitive to __________.

