AWS Certified Machine Learning Specialty A Cloud Guru Quiz Questions (Level 2)
You are consulting for a shipping company who wants to implement a very specific algorithm for shipping container optimization. The algorithm is not part of the currently available SageMaker built-in algorithms. What are your options? (Choose 2) 1-Search the AWS Marketplace for the training and inference algorithm. If it exists, use it to create training jobs and a deployable model in SageMaker. 2-Wait until the algorithm is available in SageMaker before further work. 3-Use a series of existing algorithms to simulate the actions of the unavailable algorithm. 4-Build the algorithm in a docker container and use that custom algorithm for training and inference in SageMaker. 5- Post an incendiary message to Twitter hoping to shame AWS into adopting the specialized algorithm.
1 & 4 If SageMaker does not support a desired algorithm, you can either bring your own or buy/subscribe to an algorithm from the AWS Marketplace. SageMaker Algorithms.
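As a rough illustration of the bring-your-own-algorithm option, the sketch below (using the SageMaker Python SDK) trains and deploys from a custom container image; the ECR image URI, IAM role ARN, and S3 paths are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

# Hypothetical ECR image containing the custom container-optimization algorithm
# (account ID, repository name, role ARN, and bucket are placeholders).
custom_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/container-optimizer:1"

estimator = Estimator(
    image_uri=custom_image,                      # bring-your-own algorithm container
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/container-optimizer/output",
    sagemaker_session=sagemaker.Session(),
)

# Train against data staged in S3, then deploy the resulting model for inference.
estimator.fit({"train": "s3://my-bucket/container-optimizer/train"})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```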
Which Amazon service allows you to create interactive graphs and charts, and acts as Business Intelligence (BI) tool? 1-Quicksight 2-Tableau 3-Matplotlib 4-Athena
1 - Quicksight QuickSight is a fully managed service in AWS that lets you easily create and publish interactive dashboards. Dashboards can then be accessed from any device, and embedded into your applications, portals, and websites.
Which visualizations help show composition? (select 3) 1- stacked area chart 2-stacked bar chart 3-bar chart 4-bubble chart 5-box plot 6-histogram 7-pie chart
1- stacked area chart 2-stacked bar chart 7-pie chart Visualizing the composition of our data is a great way to show what our data is made of.
Which of the following are good candidate problems for using XGBoost? (Choose 3). 1-Deciding whether a transaction is fraudulent or not based on various details about the transaction. 2-Providing a ranking of search results on an e-commerce site customized to a customer's past purchases. 3-Create a policy that will guide an autonomous robot through an unknown maze. 4-Map a text string to an n-gram vector. 5-Evaluate handwritten numbers on a warranty card to detect what number they represent.
1-Deciding whether a transaction is fraudulent or not based on various details about the transaction. 2-Providing a ranking of search results on an e-commerce site customized to a customer's past purchases. 5-Evaluate handwritten numbers on a warranty card to detect what number they represent. XGBoost is an extremely flexible algorithm which can be used in regression, binary classification, and multi-class classification.
When analyzing a set of one-hot encoded data you realize that, while there is a massive amount of data, most of the values are absent. This is expected given the type of data, but what built-in SageMaker algorithm might you choose to work with this data? 1-Factorization Machines 2-Linear Learner 3-K-Nearest Neighbor 4-Semantic Segmentation 5-Object2Vec
1-Factorization Machines. From the description, it seems that we have a very sparse dataset. Factorization Machines are most often associated with the high-dimensional sparse datasets that we are likely to see with one-hot encoded data.
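A minimal sketch of preparing sparse one-hot data for Factorization Machines, assuming the SageMaker Python SDK's protobuf helpers; the bucket and key are placeholders.

```python
import io
import boto3
import numpy as np
import scipy.sparse as sp
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

# Toy one-hot encoded features (mostly zeros) and binary labels.
X = sp.csr_matrix(np.eye(4, dtype=np.float32))
y = np.array([1, 0, 1, 0], dtype=np.float32)

# Serialize as protobuf recordIO sparse tensors, the format Factorization Machines expects.
buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, X, y)
buf.seek(0)

# Upload to S3 (bucket/key are placeholders) for use as a SageMaker training channel.
boto3.resource("s3").Object("my-bucket", "fm/train/data.rec").upload_fileobj(buf)
```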
Which type of color-coded visualization shows the intersection of two dimensions where values fall in a range? 1-Heatmap 2-Bubble plot 3-Scatter chart 4-Histogram
1-Heatmap Heatmaps use color to show values increasing or decreasing. They are used in many different domains and can help show distribution, correlation, relationships and much more insightful information.
In a regression problem, if we plot the residuals in a histogram and observe a distribution heavily skewed to the right of zero indicating mostly positive residuals, what does this mean? 1-Our model is consistently underestimating. 2-Our model is consistently overestimating. 3-Our model is sufficient with regard to RMSE. 4-Our model is sufficient with regard to aggregate residual.
1-Our model is consistently underestimating. Residual is the actual value minus the predicted value. If most of our residuals are positive numbers, that means that our predicted values are mostly less than the actual values. This means that our model is consistently underestimating.
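A quick illustration with made-up numbers: computing residuals as actual minus predicted and plotting their distribution.

```python
import numpy as np
import matplotlib.pyplot as plt

actual = np.array([10.0, 12.0, 9.5, 14.0, 11.0])
predicted = np.array([8.5, 11.0, 9.0, 12.5, 10.0])

# Residual = actual - predicted; mostly positive residuals mean the model underestimates.
residuals = actual - predicted
print(residuals.mean())          # > 0 here, i.e. consistent underestimation

plt.hist(residuals, bins=10)
plt.axvline(0, color="red")      # values to the right of zero are underestimates
plt.xlabel("Residual (actual - predicted)")
plt.show()
```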
After training and validation sessions, we notice that the accuracy rate for training is acceptable but the accuracy rate for validation is very poor. What might we do? Choose 3 1-Reduce dimensionality. 2-Add an early stop. 3-Encode the data using Laminar Flow Step-up. 4-Gather more data for our training process. 5-Increase the learning rate. 6-Run training for a longer period of time.
1-Reduce dimensionality. 2-Add an early stop. 4-Gather more data for our training process. A high error rate in validation but not in training usually indicates overfitting to the training data. We can introduce more data, add early stopping to the training job, and reduce features, among other things, to help return the model to a generalizer.
After training and validation sessions, we notice that the error rate is higher than we want for both sessions. Visualization of the data indicates that we don't seem to have any outliers. What else might we do? (Choose 4). 1-Reduce the dimensions of the data. 2-Encode the data using Laminar Flow Step-up. 3-Run training for a longer period of time. 4-Gather more data for our training process. 5-Add more variables to the dataset. 6-Run a random cut forest algorithm on the data.
1-Reduce the dimensions of the data. 3-Run training for a longer period of time. 4-Gather more data for our training process. 5-Add more variables to the dataset. When both training and testing error are high, it indicates that our model is underfitting the data. We can try to add more details to the dataset, gather more data for training, and/or run the training session longer. We might also need to identify a better algorithm. Be mindful, however, that during data analysis and feature engineering the data scientist should have done their due diligence to trim down the features and determine which ones are relevant to the problem they are trying to solve (i.e., what they are trying to predict).
You are designing a testing plan for an update release of your company's mission critical loan approval model. Due to regulatory compliance, it is critical that the updates are not used in production until regression testing has shown that the updates perform as good as the existing model. Which validation strategy would you choose? Choose 2 1-Use a K-Fold validation method. 2-Use an A/B test to expose the updates to real-world traffic. 3-Use a rolling upgrade to determine if the model is ready for production. 4-Make use of backtesting with historic data. 5-Use a canary deployment to collect data on whether the model is ready for production.
1-Use a K-Fold validation method. 4-Make use of backtesting with historic data. Because we must demonstrate that the updates perform as well as the existing model before we can use it in production, we would be seeking an offline validation method. Both k-fold and backtesting with historic data are offline validation methods and will allow us to evaluate the model performance without having to use live production traffic.
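A minimal offline k-fold sketch with scikit-learn and synthetic data; no live production traffic is involved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 5-fold cross-validation evaluates the candidate model entirely offline,
# without exposing it to any live production traffic.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores, scores.mean())
```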
You want to be sure to use the most stable version of a training container. How do you ensure this? 1-Use the :1 tag when specifying the ECR container path. 2-Use the ECR repository located in US-EAST-2. 3-Use the :latest tag when specifying the ECR container path. 4-Use the path to the global container repository.
1-Use the :1 tag when specifying the ECR container path. When specifying a training or inference container, use the :1 tag at the end of the path to use the stable version. If you want the latest version, use :latest but that might not be backward compatible.
Which of the following is an example of unsupervised learning? 1-Using NTM to extract topics from a set of scientific journal articles. 2-Using Seq2Seq to extract a text string from a segment of a recorded speech. 3-Using XGBoost to predict the selling price of a house in a particular market. 4-Using K-Means to cluster customers into demographic segments. 5-Using a Factorization Machine to provide book recommendations.
1-Using NTM to extract topics from a set of scientific journal articles. 4-Using K-Means to cluster customers into demographic segments. Both K-Means and Neural Topic Modelling are unsupervised learning methods. XGBoost, Seq2Seq, and Factorization Machines are supervised learning methods.
Which of the following metrics are recommended for tuning a Linear Learner model so that we can help avoid overfitting? (Choose 3). 1-validation:objective_loss 2-validation:precision 3-test:precision 4-validation:recall 5-test:objective_loss 6-test:recall
1-validation:objective_loss 2-validation:precision 4-validation:recall To avoid overfitting, AWS recommends tuning the model against a validation metric instead of a training metric.
You have launched a training job but it fails after a few minutes. What is the first thing you should do for troubleshooting? 1-Go to CloudTrail logs and try to identify the error in the logs for your job. 2-Go to CloudWatch logs and try to identify the error in the logs for your job. 3-Ensure that your instance type is large enough and resubmit the job in a different region. 4-Submit the job with AWS X-Ray enabled for additional debug information. 5-Check to see that your Notebook instance has the proper permissions to access the input files on S3.
2-Go to CloudWatch logs and try to identify the error in the logs for your job. All errors in a training job are logged in CloudWatch, so that should be your first stop to determine the cause of the failure. CloudTrail is a service for governance, compliance, and operational and risk auditing.
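A hedged sketch of pulling a failed job's logs with boto3; the job name is a placeholder, and SageMaker training jobs normally log to the /aws/sagemaker/TrainingJobs log group.

```python
import boto3

logs = boto3.client("logs")
job_name = "my-failed-training-job"   # placeholder training job name

# Training jobs write to the /aws/sagemaker/TrainingJobs log group,
# with one or more log streams prefixed by the job name.
streams = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix=job_name,
)["logStreams"]

for stream in streams:
    events = logs.get_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamName=stream["logStreamName"],
    )["events"]
    for event in events:
        print(event["message"])       # look here for the error that failed the job
```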
Which of these examples would be considered as introducing bias into a problem space? (Choose 2) 1-Omitting records before a certain date in a forecasting problem. 2-Not randomizing a dataset even though you were told the data should be random. 3-Deciding to use a supervised learning method to estimate missing values in a dataset. 4-Filtering out outliers in a dataset which are greater than 4 standard deviations outside the mean. 5-Removing records from a set of customer reviews that were not fully complete.
2 & 5- Failing to randomize records even when you think they are already random can lead to introducing bias in the model when splitting. Additionally, removing customer review records could unintentionally filter out statistically significant parts of the dataset, biasing towards those customers who have the time or energy to fully complete the reviews. The other answers are reasonable and do not explicitly introduce bias.
You are a ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache logs files stored in S3. Which set of tools can help you achieve this with the LEAST amount of effort? 1- Redshift and Redshift Spectrum 2- AWS Glue Data Catalog and Athena 3-Data Pipeline and Athena 4-Data Pipeline and RDS
2-AWS Glue Data Catalog and Athena. Using Redshift/Redshift Spectrum or Data Pipeline/RDS could work, but both require much more effort in setting up and provisioning resources. Using AWS Glue, you can use a crawler to crawl the log files in S3. This will create structured tables within your AWS Glue database. These tables can then be queried using Athena. This solution requires the least amount of effort.
You are working for a major research university analyzing data about the professors who teach there. The features within the data contain information like employee id, position, department, job description, salary, and tenure. The tenure attribute is binary 0 or 1, whether the professor has tenure or does not have tenure. You need to find the distribution of salaries for professors in general. What is the best visualization to use to achieve this? 1-Pie chart 2-Histogram 3-Line chart 4-Scatter chart 5-Bubble chart
2-Histogram. Since we are looking for a distribution, a histogram is the best visualization to use from these answers. We are really only looking at one variable, salary; we don't really care about how many professors there are. They are going to fall into salary (probably range) buckets, which makes a histogram a natural visualization for this data. If we wanted to look at, say, salary vs. area of expertise, a scatter plot would be a good visualization for that.
Which text analysis algorithm is recommended when trying to compare documents based on the presence of words in each document? 1-Neural Topic Model (NTM) 2-Latent Dirichlet Allocation (LDA) 3-Sequence to Sequence (seq2seq) 4-BlazingText
2-Latent Dirichlet Allocation (LDA) LDA is most commonly used to identify topics in common between documents in a text corpus.
In what scenario is the DeepAR algorithm best suited? 1-Decide whether to extend a credit card offer to a potential customer. 2-Predict future sales of a new product based on historic sales of similar products. 3-Predict whether a football team will score a certain number of points in a match. 4-Determine the correlation between a person's diet and energy levels. 5-Provide the certainty that a given picture includes a human face.
2-Predict future sales of a new product based on historic sales of similar products. DeepAR is a supervised forecasting algorithm used with time-series data. DeepAR seeks to be better than traditional time-series algorithms by accommodating multiple cross-sectional datasets.
When you issue a CreateModel API call using a built-in algorithm, which of the following actions would be next? 1-SageMaker provisions an EC2 instance using the appropriate AMI for the algorithm selected from the global container registry. 2-SageMaker launches an appropriate inference container for the algorithm selected from the regional container repository. 3-SageMaker launches an appropriate training container for the algorithm selected from the regional container repository. 4-SageMaker provisions an EC2 instance using the appropriate AMI for the algorithm selected from the regional container registry. 5-SageMaker launches an appropriate inference container for the algorithm selected from the global container repository. 6-SageMaker provisions an EMR cluster and prepares a Spark script for the training job.
2-SageMaker launches an appropriate inference container for the algorithm selected from the regional container repository. CreateModel API call is used to launch an inference container. When using the built-in algorithms, SageMaker will automatically reference the current stable version of the container.
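A minimal boto3 sketch of the CreateModel call itself; the image URI, model artifact path, and role ARN are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# CreateModel points SageMaker at an inference container image (regional ECR repository)
# and the trained model artifacts in S3; the values below are placeholders.
sm.create_model(
    ModelName="my-xgboost-model",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:1",
        "ModelDataUrl": "s3://my-bucket/output/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
)
```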
Which best describes SGD in common terms? 1-Attempt to find the most efficient path to deliver packages to multiple destinations. 2-Seek to find the lowest point in elevation on a landscape. 3-Calculate the linear distance between arrows shot into a target to determine accuracy. 4-Ensure that our sample size in a traffic study has at least 30 drivers.
2-Seek to find the lowest point in elevation on a landscape. Stochastic Gradient Descent (SGD) is an optimization method that seeks to minimize the error (cost) function. This can be analogous to trying to find the lowest point on a landscape.
You are working on a model that tries to predict the future revenue of select companies based on 50 years of their historical data (from public financial filings). What might be a strategy to determine if the model is reasonably accurate? 1-Randomize the training data and reserve 20% as a validation set after the training process is completed. 2-Use a set of the historic data as testing data to back-test the model, then compare the results to the actual historical results. 3-Use a softmax function to invert the historical data, then run the validation job from most recent to earliest history. 4-Use Random Cut Forest to remove any outliers, then rerun the algorithm on the last 20% of the data.
2-Use a set of the historic data as testing data to back-test the model, then compare the results to the actual historical results. Time-series data should typically be trained and validated in its existing order. A common method to validate time-series models is backtesting. Backtesting replays the historical data as if it were new data, then evaluates how successfully the model predicted the historic values.
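A toy backtesting sketch with pandas: keep the chronological order, train on the earlier history, and replay the held-out tail against the actuals. The column name and the naive "model" are purely illustrative.

```python
import pandas as pd

# Assume a 'revenue' time series indexed by date (columns are illustrative).
df = pd.DataFrame(
    {"revenue": range(100)},
    index=pd.date_range("1975-01-01", periods=100, freq="MS"),
)

# Keep the chronological order: train on the first 80%, back-test on the last 20%.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

# Naive "model" for illustration: predict the last observed training value.
predictions = pd.Series(train["revenue"].iloc[-1], index=test.index)
errors = test["revenue"] - predictions
print(errors.abs().mean())   # compare replayed predictions against actual history
```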
You are a ML specialist designing a regression model to predict the sales for an upcoming festival. The data from the past consists of 1,000 records containing 20 numeric attributes. As you start to analyze the data, you discovered that 30 records have values that are in the far left of a box plot's lower quartile. The festival manager confirmed that those values are unusual, and he is not sure if they are right. There are also 65 records where another numerical value is blank. What should you do to correct these problems? 1-Drop the unusual records and fill in the blank values with 0. 2-Use the unusual data and replace the missing values with a separate Boolean variable. 3-Drop the unusual records and replace the blank values with the mean value. 4-Drop the unusual records and replace the blank values with separate Boolean values.
3-Drop the unusual records and replace the blank values with the mean value. There are many different ways to handle this scenario. We can eliminate the answers that involve creating a separate Boolean variable. This leaves the two answers that fill in the missing values with either 0 or the mean. The mean is going to give us much better results than using 0. We should drop the unusual values and replace the missing values with the mean.
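A small pandas sketch of that cleanup, assuming an illustrative 'sales' column: drop the confirmed-unusual rows, then impute the remaining blanks with the column mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [200.0, 210.0, np.nan, 5.0, 230.0, np.nan, 225.0]})

# Drop the confirmed-unusual records (here, anything implausibly low), then
# replace the remaining missing values with the mean of the column.
df = df[(df["sales"] > 100) | (df["sales"].isna())]
df["sales"] = df["sales"].fillna(df["sales"].mean())
print(df)
```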
Which of the following mean that our algorithm predicted false but the real outcome was true? 1-True Positive 2-False Positive 3- False Negative 4-False Affirmative 5-True Negative
3- False Negative A false negative is when the model predicts a false result but the real outcome was true.
You are a ML specialist working for a retail organization. You are analyzing data that has different items at different costs. You decide to choose the top 5 most expensive items and visually compare their prices. Which visualization can help you achieve this? 1-Scatter chart 2-Pie chart 3-Bar chart 4-Histogram
3-Bar chart Bar charts can show single values really well. Each product is represented by a bar extending up to the price of the item.
Your company currently has a large on-prem Hadoop cluster that contains data you would like to use for a SageMaker training job. Your cluster is equipped with Mahout, Flume, Hive, Spark and Ganglia. How might you most efficiently use this data?" 1-Use Mahout on the Hadoop Cluster to preprocess the data into a format that is compatible with SageMaker. Export the data with Flume to the local storage of the training container and launch the training job. 2-Use Data Pipeline to make a copy of the data in Spark DataFrame format. Upload the data to S3 where it can be accessed by the SageMaker training jobs. 3-Leverage the SageMaker Spark Library with S3. 4-Using EMR, create a Scala script to export the data to an HDFS volume. Copy that data over to an EBS volume where it can be read by the SageMaker training containers.
3-Leverage the SageMaker Spark Library with S3. Since the Hadoop cluster has Spark, you can use the SageMaker Spark Library to convert Spark DataFrame format into protobuf and load onto S3. From there, you can use SageMaker as normal.
You are a ML specialist working for a retail organization. You are analyzing customer spending data for particular locations and comparing how it changes over time. You want to visualize the monthly total amount spent at each location over the last 5 years. Which visualization can you use to help you see this? 1-Scatter chart 2-Bar chart 3-Line chart 4-Histogram
3-Line chart The key words in this question are how the data changes over time. We can sum up the total amount spent by all customers for each month. Place the months on the x axis and the dollar amount on the y axis. Plot a point for each month and connect each point, creating a line chart where each line represents a different location.
We have just completed a validation job for a multi-class classification model that attempts to classify books into one of five genres. In reviewing the validation metrics, we observe a Macro Average F1 score of 0.28 (average of all F1 scores). The F1 score for a single genre (historic fiction) is 0.9, though. What can we conclude from this? 1-Our model is very poor at predicting historic fiction but quite good at the other genres given the Macro F1 Score. 2-We might try a linear regression model instead of a multi-class classification. 3-Our training data might be biased toward historic fiction and lacking in examples of other genres. 4-We must have a very high Type II error rate. 5-We cannot conclude anything for certain with just an F1 score.
3-Our training data might be biased toward historic fiction and lacking in examples of other genres. For multi-class classification problems, the Macro F1 Score is an average of all F1 scores and a higher F1 score indicates more accuracy. If the average F1 score is 0.28 and one genre has 0.9, then this indicates that the model has a much greater accuracy with that single genre. That could mean that we have bias in our training or testing data toward that specific genre or that our data was not sufficiently randomized.
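A quick scikit-learn illustration with made-up labels showing how one strong class can sit alongside a low macro average:

```python
from sklearn.metrics import f1_score

# Toy 5-genre classification results (labels 0-4; 0 = "historic fiction").
y_true = [0, 0, 0, 0, 1, 2, 3, 4, 1, 2]
y_pred = [0, 0, 0, 0, 2, 3, 4, 1, 2, 3]

print(f1_score(y_true, y_pred, average=None))     # per-class F1 scores (class 0 is 1.0)
print(f1_score(y_true, y_pred, average="macro"))  # macro average across all classes
```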
You have been provided with a cleansed CSV dataset you will be using for a linear regression model. Of these tasks, which might you do next? (Choose 2) 1-Run a Peterman distribution on the data to sort it properly for linear regression. 2-Perform one-hot encoding on the softmax results. 3-Run a randomization process on the data. 4-Split the data into testing and training datasets.
3-Run a randomization process on the data. 4-Split the data into testing and training datasets. When given a dataset, we should randomize it before separating it into testing and training sets. However, if it is a time-series dataset, you could just split the data into testing and training datasets without randomization. The one-hot encoding and Peterman distribution are nonsense answers.
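A minimal sketch of randomize-then-split with scikit-learn; the CSV path is a placeholder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")          # placeholder path to the cleansed CSV

# Shuffle (randomize) the rows, then hold out 20% for testing.
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)
print(len(train_df), len(test_df))
```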
A colleague is preparing for their very first training job using the XGBoost algorithm on an Amazon SageMaker notebook instance, which is a machine learning (ML) compute instance running the Jupyter Notebook App. They ask you how they can ensure that training metrics are captured during the training job. How do you direct them? 1-Do nothing. Use SageMaker's built-in logging to DynamoDB Streams. 2-Enable CloudTrail logging for the SageMaker API service. 3-Do nothing. Sagemaker's built-in algorithms are already configured to send training metrics to CloudTrail. 4-Do nothing. Sagemaker's built-in algorithms are already configured to send training metrics to CloudWatch. 5-Enable CloudWatch logging for Jupyter Notebook and the IAM user. 6-Do nothing. Use SageMaker's built-in logging feature and view the logs using Quicksight.
4-Do nothing. Sagemaker's built-in algorithms are already configured to send training metrics to CloudWatch. SageMaker's built-in algorithms and supporting containers are already configured to send metrics to CloudWatch.
You are a ML specialist building a regression model to predict the amount of rainfall for the upcoming year. The data you have contains 18,000 observations collected over the last 50 years. Each observation contains the date, amount of rainfall (in cm), humidity, city, and state. You plot the values in a scatter plot for a given day and amount of rainfall. After plotting points, you find a large grouping of values around 0 cm and 0.2 cm. There is a small grouping of values around 500 cm. What are the reasons for each of these groupings? What should you do to correct these values? 1-The groupings around 0 cm are days that had no rainfall, the groupings around 0.2 cm are days where it rained, the groupings around 500 cm are days where it snowed. The values should be used as is. 2-The groupings around 0 cm and 0.2 cm are extremes and should be removed. The values around 500 cm should be normalized and used once normalized. 3-The groupings around 0 cm are days that had no rainfall, the groupings around 0.2 cm are days where it rained, the groupings around 500 cm are outliers. The values around 500 cm should be dropped and the other values should be used as is. 4-The groupings around 0 cm are days that had no rainfall, the groupings around 0.2 cm are days where it rained, the groupings around 500 cm are outliers. The values around 500 cm should be normalized so they are on the same scale as the other values.
3-The groupings around 0 cm are days that had no rainfall, the groupings around 0.2 cm are days where it rained, the groupings around 500 cm are outliers. The values around 500 cm should be dropped and the other values should be used as is. Normalizing the values will not help since the values around 500 cm are outliers; there must have been some mistake when the data was created. The groupings around 0 cm are days when it did not rain, and the groupings around 0.2 cm are days when it rained. The groupings around 500 cm are extreme values, so they should be dropped and the other values used as is.
What does the box in a box plot represent? 1-The maximum values. 2-The minimum values. 3-The middle 50% of the values. 4-The median value.
3-The middle 50% of the values. The box in a box plot represents the middle 50% of the data, also known as the interquartile range. The line in the box represents the median. The upper/far right of the box plot represents the upper quartile, and the lower/left of the box represents the lower quartile.
In a binary classification problem, you observe that precision is poor. Which of the following most contribute to poor precision? 1-Type V Error 2-Type II Error 3-Type I Error 4-Type IV Error 5-Type III Error
3-Type I Error Precision is defined as the ratio of True Positives over the sum of all Predicted Positives, which includes correctly labeled trues and those that we predicted as true but were really false (false positives). Another term for False Positives is Type I error.
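A small worked example of precision from confusion-matrix counts (made-up labels):

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision = TP / (TP + FP); false positives (Type I errors) drag it down.
print(tp / (tp + fp))
print(precision_score(y_true, y_pred))   # same value via scikit-learn
```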
You are consulting with a retailer who wants to evaluate the sentiment of social media posts to determine if they are positive or negative. Which approach is best for this analysis? 1-Use BlazingText in Word2Vec mode for skip-gram. 2-Use Amazon Macie. 3-Use Amazon Comprehend. 4-Use BlazingText in Text Classification mode. 5-Use Object2Vec in sentiment detection mode.
3-Use Amazon Comprehend. The most direct method of sentiment analysis is using the Amazon Comprehend service. Word2Vec can be used as a pre-processing step, but it alone is not sufficient to detect sentiment.
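A minimal boto3 sketch of calling Comprehend's sentiment detection; the sample text is made up.

```python
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.detect_sentiment(
    Text="Absolutely love the new winter jackets from this store!",
    LanguageCode="en",
)

# Returns POSITIVE / NEGATIVE / NEUTRAL / MIXED plus confidence scores.
print(response["Sentiment"])
print(response["SentimentScore"])
```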
You are on a personal quest to design the best chess playing model in the world. What might be a good strategy for this objective? 1-Use an unsupervised learning strategy to analyze similarities across thousands of the best chess matches. 2-Use a supervised learning strategy that is trained by feeding in the chess moves of thousands of famous chess experts. 3-Use a reinforcement learning strategy to let the model learn itself. 4-Use Mechanical Turk to crowdsource the best chess moves in a variety of scenarios then use those for a supervised learning session. 5-Use a Factorization Machine approach to analyze the winning series of moves across thousands of chess matches to find the best series of moves.
3-Use a reinforcement learning strategy to let the model learn itself. Chess is a complex game and would require some training to develop a good model. Supervised learning, by definition, can only be as good as the training data that it is supplied with so we won't be able to meet our goal of the best chess model using supervised learning. Rather, reinforcement learning would have the best chance of becoming better than any other player if given enough iterations and a proper reward function.
We are using a k-fold method of cross-validation for our linear regression model. What outcome will indicate that our training data is not biased? 1-Each subsequent k-fold validation round has an increasing accuracy rate over the one prior. 2-Each subsequent k-fold validation round has a decreasing error rate over the one prior. 3-K-fold is not appropriate for use with linear regression problems. 4-All k-fold validation rounds have roughly the same error rate. 5-Bias is not a concern with linear regression problems as the error function resolves this.
4-All k-fold validation rounds have roughly the same error rate. When using a k-fold cross validation method, we want to see that all k-groups have close to the same error rate. Otherwise, this may indicate that the data was not properly randomized before the training process.
We are running a training job over and over again using slightly different, very large datasets as an experiment. Training is taking a very long time with your I/O-bound training algorithm and you want to improve training performance. What might you consider? (Choose 2) 1-Make use of file mode to stream data directly from S3. 2-Use the SageMaker console to change your training job instance type from an ml.c5.xlarge to a r5.xlarge. 3-Convert the data format to an Integer32 tensor. 4-Convert the data format to protobuf recordIO format. 5-Make use of pipe mode to stream data directly from S3.
4-Convert the data format to protobuf recordIO format. 5-Make use of pipe mode to stream data directly from S3. The combination of using the protobuf recordIO format and pipe mode will result in improved performance for I/O-bound algorithms because the data can be streamed directly from S3 versus having to be first copied to the instance locally.
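A hedged SageMaker Python SDK sketch combining Pipe mode with the protobuf recordIO content type; the image URI, role, and bucket are placeholders.

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algorithm:1",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",   # stream training data directly from S3 instead of copying it first
)

train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",
    content_type="application/x-recordio-protobuf",  # protobuf recordIO format
)
estimator.fit({"train": train_input})
```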
After multiple training runs, you notice that the the loss function settles on different but similar values. You believe that there is potential to improve the model through adjusting hyperparameters. What might you try next? 1-Increase the learning rate. 2-Decrease the objective rate. 3-Change to another algorithm. 4-Decrease the learning rate. 5-Change from a CPU instances to a GPU instance.
4-Decrease the learning rate. Learning rate can be thought of as the "step length" of the training process. If the learning rate is too large, the training process may overshoot and never find the true global minimum. Decreasing the learning rate allows the training process to find lower loss function floors, but it can also increase the time needed for convergence.
We are designing a binary classification model that tries to predict whether a customer is likely to respond to a direct mailing of our catalog. Because it is expensive to print and mail our catalog, we want to only send to customers where we have a high degree of certainty they will buy something. When considering if the customer will buy something, what outcome would we want to minimize in a confusion matrix? 1-False Negative 2-False Affirmative 3-True Negative 4-False Positive 5-True Positive
4-False Positive We would want to minimize the occurrence of False Positives. This would mean that our model predicted that the customer would buy something but the actual outcome was that the customer did not buy anything.
We are using a CSV dataset for unsupervised learning that does not include a target value. How should we indicate this for training data as it sits on S3? 1-CSV data format should not be used for unsupervised learning algorithms. 2-Include a reserved word metadata key of "ColumnCount" for the S3 file and set it to the number of columns. 3-SageMaker will automatically detect the data format for supervised learning algorithms. 4-Include label_size=0 appended to the Content-Type key. 5-Enable pipe mode when we initiate the training run.
4-Include label_size=0 appended to the Content-Type key. To run unsupervised learning algorithms that don't have a target, specify the number of label columns in the content type. For example, in this case 'text/csv;label_size=0'
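A minimal sketch of setting that content type on a training channel with the SageMaker Python SDK; the bucket path is a placeholder.

```python
from sagemaker.inputs import TrainingInput

# For an unsupervised algorithm, tell SageMaker the CSV has no label column
# by appending label_size=0 to the content type (bucket path is a placeholder).
train_input = TrainingInput(
    s3_data="s3://my-bucket/unsupervised/train/",
    content_type="text/csv;label_size=0",
)
```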
In your first training job of a binary classification problem, you observe an F1 score of 0.996. You make some adjustments and rerun the training job again, which results in an F1 score of 0.034. What can you conclude from this? Choose 2 1-The adjustments drastically worsened our model. 2-Our RMSE has improved greatly. 3-The adjustments drastically improved our model. 4-Our accuracy has decreased. 5-Nothing can be concluded from an F1 score by itself.
4-Our accuracy has decreased. 1-The adjustments drastically worsened our model. The F1 score is a measure of accuracy for classification models ranging from 0 to 1. An F1 score of 1 indicates perfect precision and recall, so a larger F1 score is better. In our case, our F1 score dropped significantly so we conclude that our adjustments dramatically decreased the accuracy of our model.
In your first training job of a regression problem, you observe an RMSE of 3.4. You make some adjustments and run the training job again, which results in an RMSE of 2.2. What can you conclude from this? 1-The adjustments made your model recall worse. 2-The adjustments improved your model recall. 3-The adjustments made your model accuracy worse. 4-The adjustments improved your model accuracy. 5-The adjustments had no effect on your model accuracy.
4-The adjustments improved your model accuracy. Root Mean Square Error (RMSE) is a common way of measuring regression accuracy. A lower RMSE is better so the adjustments improved our model.
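A quick worked example of RMSE from its definition, with made-up values:

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.5, 5.5, 7.0, 11.0])

# RMSE = sqrt(mean((actual - predicted)^2)); lower is better.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)
```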
You are preparing for a first training run using a custom algorithm that you have prepared in a docker container. What should you do to ensure that the training metrics are visible to CloudWatch? 1-Do nothing. SageMaker will automatically parse training logs for custom algorithms and carry those over to CloudWatch. 2-Enable Kinesis Streams to capture the log stream emitting from the custom algorithm containers. 3-Create a Lambda function to scrape the logs in the custom algorithm container and deposit them into CloudWatch via API. 4-When defining the training job, ensure that the metric_definitions section is populated with relevant metrics from the stdout and stderr streams in the container. 5-Enable CloudTrail for the respective container to capture the relevant training metrics from the custom algorithm.
4-When defining the training job, ensure that the metric_definitions section is populated with relevant metrics from the stdout and stderr streams in the container. When using a custom algorithm, you need to ensure that the desired metrics are emitted to stdout output. You also need to include the metric definition and regex expression for the metric in the stdout output when defining the training job.
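A hedged sketch of wiring up metric_definitions on an estimator for a custom container; the image URI, role, metric name, and the log line format the regex matches are all assumptions about what the container emits.

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-custom-algo:1",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # The container is assumed to print lines like "validation-error: 0.123" to stdout/stderr;
    # the regex below tells SageMaker how to scrape that value into CloudWatch.
    metric_definitions=[
        {"Name": "validation:error", "Regex": "validation-error: ([0-9\\.]+)"},
    ],
)
```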
Which visualization types are recommended for displaying the distribution of data? (choose 3) 1- stacked area chart 2-stacked bar chart 3-line chart 4-box plot 5-scatter plot 6-histogram
4-box plot 5-scatter plot 6-histogram A scatter plot is a good visualization type for displaying multi-distribution data, as it easily shows data clusters, minimum and maximum values, and outliers. A histogram is a good visualization type for displaying the single distribution of data. A box plot is a good visualization type for displaying multi-distribution data, as it easily shows the minimum, maximum, and median values of data, as well as outliers.
Which visualizations help show relationships? (select 2) 1- stacked area chart 2-stacked bar chart 3-bar chart 4-bubble chart 5-scatter plot 6-histogram 7-pie chart
4-bubble chart 5-scatter plot Visualizing relationships in data is important because it shows how different attributes can affect one another. They can also show trends and outliers within our data.
You are being asked to develop a model to predict the likelihood that a student will pass a certification exam based on hours of study. Of the options given, what would be the best approach to this problem? 1-Build a model using LDA based on the text of the questions on the exam and predict student outcome. 2-Build a model using NLP to classify students into passing and failing groups. 3-Build a clustering model with K-Means to group students who pass in a cluster. 4-Build a simulation-based model which will analyze past student performance at varying levels of study. 5-Build a Logistic Regression model using the hours of study as a feature.
5-Build a Logistic Regression model using the hours of study as a feature. This problem is best described as a binary classification problem, as we are trying to predict whether a student will pass or fail. The option that most directly provides for a binary classification problem is logistic regression.
While using K-Means, what does it mean if we pass in k=4 as a hyperparameter? 1-We want the algorithm to use 4 as the cutoff value for classification purposes. 2-We want the algorithm to return the top 4 results. 3-We want the algorithm to use 4 hidden layers. 4-We want the algorithm to group into clusters of no more than 4 samples each. 5-We want the algorithm to group into 4 clusters.
5-We want the algorithm to group into 4 clusters. K-Means is a clustering algorithm and the K represents the number of clusters the algorithm should seek. Unsupervised learning algorithms such as K-Means clumps similar data points together. Through trial and error, the optimal number of clumps for the dataset can be found and then interpreted by a domain expert.
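A minimal sketch of passing k=4 to SageMaker's built-in K-Means estimator (assuming the SageMaker Python SDK v2); the role and output path are placeholders.

```python
from sagemaker import KMeans

kmeans = KMeans(
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/kmeans/output",             # placeholder bucket
    k=4,   # ask the algorithm to group the data into 4 clusters
)
# kmeans.fit(kmeans.record_set(numpy_training_data))  # train on a NumPy array of features
```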
You are consulting for a mountain climbing gear manufacturer and have been asked to design a machine learning approach for predicting the strength of a new line of climbing ropes. Which approach might you choose? 1-You would choose a multi-class classification approach to classify the rope into an appropriate price range. 2-You would choose a binary classification approach to determine if the rope will fail or not. 3-You would approach the problem as a linear regression problem to predict the tensile strength of the rope based on other ropes. 4-You would choose a simulation-based reinforcement learning approach. 5-You would recommend they do not use a machine learning model.
5-You would recommend they do not use a machine learning model. We take care not to assume every problem is a machine learning problem. In this case we can test the strength of a rope through physical tests, so creating a machine learning problem does not make sense.
You have been asked to help develop a vision system for a manufacturing line that will reorient parts to a specific position using a robotic arm. What algorithm might you choose for the vision part of this problem? 1-Seq2Seq 2-Object2Vec 3-Object Detection 4-AWS Comprehend 5-Image Analysis 6-Semantic Segmentation
6-Semantic Segmentation Semantic Segmentation can perform edge detection, which could be used to identify orientation. Object Detection and Image Analysis are better used for full images, whereas Semantic Segmentation can create a segmentation mask or outline of a part of the image.
You are a ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache logs files stored in S3. Your organization already uses Redshift as their data warehousing solution. Which tool can help you achieve this with the LEAST amount of effort? 1-Redshift Spectrum 2-Apache Hive 3-Athena 4-S3 Analytics
Answer-1-Redshift Spectrum. Since the organization already uses Redshift as its data warehousing solution, Redshift Spectrum would require less effort than using AWS Glue and Athena.
Choose the scenarios in which one-hot encoding techniques are NOT a good idea. 1-When our algorithm expects numeric input and we have ordinal categorical values 2-When our algorithm expects numeric input and we have few nominal categorical values 3-When our values cannot be ordered in any meaningful way, there are only a few to choose from, and our algorithm expects numeric input 4-When our algorithm accepts numeric input and we have continuous values 5-When our algorithm expects numeric input and we have thousands of nominal categorical values
Answer-1, 4, 5. We need to apply one-hot encoding techniques only when our algorithm is expecting numeric inputs and the values are nominal (order does not matter). If the number of different values is extremely high, then one-hot encoding might not be a good idea. Remember, each category creates a new feature, and this can exponentially grow your dataset.
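A small pandas illustration of when one-hot encoding does and does not fit, using made-up nominal and ordinal columns:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],    # nominal: order doesn't matter
    "size": ["small", "medium", "large", "small"]  # ordinal: order does matter
})

# One-hot encode the nominal column (each category becomes its own 0/1 feature).
encoded = pd.get_dummies(df, columns=["color"])

# For the ordinal column, a simple mapping preserves the order instead.
encoded["size"] = encoded["size"].map({"small": 0, "medium": 1, "large": 2})
print(encoded)
```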
You are a ML specialist working with data that is stored in a distributed EMR cluster on AWS. Currently, your machine learning applications are compatible with the Apache Hive Metastore tables on EMR. You have been tasked with configuring Hive to use the AWS Glue Data Catalog as its metastore. Before you can do this you need to transfer the Apache Hive metastore tables into an AWS Glue Data Catalog. What two answer option workflows can accomplish the requirements with the LEAST amount of effort? 1-Create a Data Pipeline job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog. 2-Create a second EMR cluster that runs an Apache Spark script to copy the Hive metastore tables from the original EMR cluster into AWS Glue. 3-Run a Hive script on EMR that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog. 4-Setup your Apache Hive application with JDBC driver connections, then create a crawler that crawls the Apache Hive Metastore using the JDBC connection and creates an AWS Glue Data Catalog. 5-Create DMS endpoints for both the input Apache Hive Metastore and the output data store S3 bucket, run a DMS migration to transfer the data, then create a crawler that creates an AWS Glue Data Catalog.
Answer-3 & 4. Apache Hive supports JDBC connections that can easily be used with a crawler to create an AWS Glue Data Catalog. The benefit of using the Data Catalog (over the Hive Metastore) is that it provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore. We can simply run a Hive script to query tables and output that data in CSV (or other formats) into S3. Once that data is on S3, we can crawl it to create a Data Catalog of the Hive Metastore or import the data directly from S3.
You are a ML specialist that has been tasked with setting up an ETL pipeline for your organization. The team already has a EMR cluster that will be used for ETL tasks and needs to be directly integrated with Amazon SageMaker without writing any specific code to connect EMR to SageMaker. Which framework allows you to achieve this? 1-Apache Mahout 2-Apache Pig 3-Apache Flink 4-Apache Hive 5-Apache Spark
Answer-5-Apache Spark. Apache Spark can connect directly to SageMaker with the SageMaker Spark SDK.
You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events? 1- Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL). 2- Capture both event types using the Kinesis Producer Library (KPL). 3-Capture both events with the PutRecords API call. 4-Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
Answer-1. The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the KPL is acceptable for the second event type but not the first because the KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.
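A minimal boto3 sketch of the synchronous PutRecords path for the critical events; the stream name and payloads are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

critical_events = [{"order_id": 1, "status": "confirmed"}]   # illustrative payloads

# PutRecords is a synchronous call: the response tells us immediately whether
# each record was accepted, so critical events aren't buffered inside a library.
response = kinesis.put_records(
    StreamName="mission-critical-events",   # placeholder stream name
    Records=[
        {"Data": json.dumps(evt).encode("utf-8"), "PartitionKey": str(evt["order_id"])}
        for evt in critical_events
    ],
)
print(response["FailedRecordCount"])   # retry any failed records before continuing
```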
True or False. If you have mission critical data that must be processed with as minimal delay as possible, you should use the Kinesis API (AWS SDK) over the Kinesis Producer Library.
Answer- True. The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly.
An organization needs to store a mass amount of data in AWS. The data has a key-value access pattern, developers need to run complex SQL queries and transactions, and the data has a fixed schema. Which type of data store meets all of their needs? 1-S3 2-DynamoDB 3-RDS 4-Athena
Answer-3-RDS. Amazon RDS handles all these requirements. Transactional and SQL queries are the important terms here. Although RDS is not typically thought of as optimized for key-value based access, using a schema with a primary key can solve this. S3 has no fixed schema. Although Amazon DynamoDB provides key-value access and consistent reads, it does not support complex SQL based queries (simple SQL queries are supported for DynamoDB via PartiQL). Finally, Athena is used to query data on S3, so it is not a data store on AWS.
You have been tasked with converting multiple JSON files within a S3 bucket to Apache Parquet format. Which AWS service can you use to achieve this with the LEAST amount of effort? 1-Create a Data Pipeline job that reads from your S3 bucket and sends the data to EMR. In EMR, create an Apache Spark job to process the data as Apache Parquet and output the newly formatted files into S3. 2-Create an AWS Glue job to convert the S3 objects from JSON to Apache Parquet. Output the newly formatted files into S3. 3-Create a Lambda function that reads all of the objects in the S3 bucket. Loop through each of the objects and convert from JSON to Apache Parquet. Once the conversion is complete, output the newly formatted files into S3. 4-Create an EMR cluster to run an Apache Spark job that processes the data as Apache Parquet. Output the newly formatted files into S3.
Answer-AWS Glue makes it super simple to transform data from one format to another. You can simply create a job that takes in data defined within the Data Catalog and outputs in any of the following formats: avro, csv, ion, grokLog, json, orc, parquet, glueparquet, or xml.
You are trying to set up a crawler within AWS Glue that crawls your input data in S3. For some reason after the crawler finishes executing, it cannot determine the schema from your data and no tables are created within your AWS Glue Data Catalog. What is the reason for these results? 1-The crawler does not have correct IAM permissions to access the input data in the S3 bucket. 2-The checkbox for 'Do not create tables' was checked when setting up the crawler in AWS Glue. 3-The bucket path for the input data store in S3 is specified incorrectly. 4-AWS Glue built-in classifiers could not find the input data format. You need to create a custom classifier.
Answer-AWS Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many database systems. If AWS Glue cannot determine the format of your input data, you will need to set up a custom classifier that helps AWS Glue crawler determine the schema of your input data.
You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort? 1-Setup Kinesis Data Streams for data ingestion. Next, setup Kinesis Data Firehose to load that data into Redshift. Next, setup a Lambda function to query data using Redshift spectrum and store the results onto DynamoDB. 2-Setup a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, setup a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB. 3-Setup A Kinesis Data Stream for data ingestion, setup EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB. 4-Create a Kinesis Data Stream to ingest the data. Next, setup a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
Answer-2. All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the least amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, having it transformed by a Lambda function with an S3 trigger, and then writing it to DynamoDB.
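A hedged sketch of the S3-triggered Lambda transform step; the DynamoDB table name, its key, and the tweet field names are assumptions for illustration.

```python
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("tweets")   # placeholder table name/key schema

def handler(event, context):
    # Triggered when Firehose delivers a new object to S3.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        for line in body.splitlines():
            tweet = json.loads(line)
            # Keep only the fields we care about (id and tweet body; field names assumed).
            table.put_item(Item={"id": tweet["id_str"], "text": tweet["text"]})
```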
Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics? 1-Kinesis Producer Library (KPL) 2-Kinesis API (AWS SDK) 3-Kinesis Consumer Library 4-Kinesis Client Library (KCL)
Answer-Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.
You work for a farming company that has dozens of tractors with built-in IoT devices. These devices stream data into AWS using Kinesis Data Streams. The features associated with the data are tractor ID, latitude, longitude, inside temp, outside temp, and fuel level. As a ML specialist you need to transform the data and store it in a data store. Which combination of services can you use to achieve this? (Select 3) 1-Set up Kinesis Firehose to ingest data from Kinesis Data Streams, then send data to Lambda. Transform the data in Lambda and write the transformed data into S3. 2-Set up Kinesis Data Analytics to ingest the data from Kinesis Data Streams, then run real-time SQL queries on the data to transform it. After the data is transformed, ingest the data with Kinesis Data Firehose and write the data into S3. 3-Immediately send the data to Lambda from Kinesis Data Streams. Transform the data in Lambda and write the transformed data into S3. 4-Use Kinesis Data Streams to immediately write the data into S3. Next, set up a Lambda function that fires any time an object is PUT onto S3. Transform the data from the Lambda function, then write the transformed data into S3. 5-Use Kinesis Data Firehose to run real-time SQL queries to transform the data and immediately write the transformed data into S3.
Answer-Amazon Kinesis Data Firehose can ingest streaming data from Amazon Kinesis Data Streams, which can leverage Lambda to transform the data and load into Amazon S3. Amazon Kinesis Data Analytics can query, analyze and transform streaming data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose as a destination for loading data into Amazon S3. Amazon Kinesis Data Streams can ingest and store data streams for Lambda processing, which can transform and load the data into Amazon S3.
Which service in the Kinesis family can continuously capture gigabytes of data per second and make the collected data available in milliseconds to enable real-time analytics use cases? 1-Kinesis Data Analytics 2-Kinesis Video Streams 3-Kinesis Data Firehose 4-Kinesis Data Streams
Answer-Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.
You are a ML specialist who is working within SageMaker analyzing a dataset in a Jupyter notebook. On your local machine you have several open-source Python libraries that you have downloaded from the internet using a typical package manager. You want to download and use these same libraries on your dataset in SageMaker within your Jupyter notebook. What options allow you to use these libraries? 1-SageMaker offers a wide variety of built-in libraries. If the library you need is not included, contact AWS support with details on libraries needed for distribution. 2-SSH into the Jupyter notebook instance and install needed libraries. This is typically done using conda install or pip install. 3-Upload the library in .zip format into S3 and use the Jupyter notebook in SageMaker to reference S3 bucket with Python libraries. 4-Use the integrated terminals in SageMaker to install libraries. This is typically done using conda install or pip install.
Answer-Amazon SageMaker notebook instances come with multiple environments already installed. These environments contain Jupyter kernels and Python packages including: scikit, Pandas, NumPy, TensorFlow, and MXNet. You can also install your own environments that contain your choice of packages and kernels. This is typically done using conda install or pip install.
You are a ML specialist who is setting up a ML pipeline. The amount of data you have is massive and needs to be set up and managed on a distributed system so you can efficiently run processing and analytics on it. You also plan to use tools like Apache Spark to process your data to get it ready for your ML pipeline. Which setup and services can most easily help you achieve this? 1-Redshift out-performs Apache Spark and should be used instead. 2-Multi AZ RDS Read Replicas with Apache Spark installed. 3-Self-managed cluster of EC2 instances with Apache Spark installed. 4-Elastic Map Reduce (EMR) with Apache Spark installed.
Answer-Amazon's EMR allows you to set up a distributed Hadoop cluster to process, transform, and analyze large amounts of data. Apache Spark is a processing framework and programming model that helps you do machine learning, stream processing, or graph analytics using Amazon EMR clusters.
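As a rough sketch of what such a job looks like (the S3 paths, column names, and the transformation itself are assumptions), a PySpark script like the following could be submitted to an EMR cluster with Spark installed via spark-submit:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Runs on an EMR cluster with Spark installed (e.g. spark-submit job.py).
spark = SparkSession.builder.appName("example-etl").getOrCreate()

# Hypothetical S3 locations and schema.
raw = spark.read.json("s3://example-bucket/raw/telemetry/")

transformed = (
    raw.withColumn("temp_delta", F.col("inside_temp") - F.col("outside_temp"))
       .filter(F.col("fuel_level") >= 0)
)

# Write the curated output back to S3 in a columnar format.
transformed.write.mode("overwrite").parquet("s3://example-bucket/curated/telemetry/")
```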
What is the most common data source you can use to pull training datasets into Amazon SageMaker? 1-RDS 2-S3 3-DynamoDB 4-Redshift
Answer-Generally, we store our training data in S3 to use for training our model.
You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players' controller inputs every second (up to 10 players in a game) in JSON format. The data needs to be ingested through Kinesis Data Streams and each JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data? 1- 10 shards 2- 100 shards 3- 1 shard 4- Greater than 500 shards, so you'll need to request more shards from AWS
Answer-In this scenario, there will be a maximum of 10 records per second with a maximum aggregate payload of 1,000 KB per second (10 records x 100 KB = 1,000 KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1,000 KB from the streaming game play. Therefore 1 shard is enough to handle the streaming data.
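The shard arithmetic can be sanity-checked in a few lines of Python, using the standard per-shard ingest limits of 1 MB/s and 1,000 records/s:

```python
# Back-of-the-envelope shard sizing for the scenario above.
records_per_second = 10          # up to 10 players, 1 input per second each
record_size_kb = 100             # each JSON blob is ~100 KB

throughput_kb_per_sec = records_per_second * record_size_kb   # 1,000 KB/s

# A single shard accepts up to ~1,000 KB/s and 1,000 records/s.
shard_ingest_limit_kb = 1000
shard_record_limit = 1000

shards_needed = max(
    -(-throughput_kb_per_sec // shard_ingest_limit_kb),   # ceiling division
    -(-records_per_second // shard_record_limit),
)
print(shards_needed)   # 1
```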
Your organization needs to find a way to capture streaming data from certain events customers are performing. These events are a crucial part of the organization's business development and cannot afford to be lost. You've already set up a Kinesis Data Stream and a consumer EC2 instance to process and deliver the data into S3. You've noticed that the last few days of events are not showing up in S3 and your EC2 instance has been shut down. What combination of steps can you take to ensure this does not happen again? 1-Set up CloudWatch monitoring for your EC2 instance as well as AutoScaling in case your consumer EC2 instance is shut down. Next, set up multiple Kinesis Data Streams to process the data on the EC2 instance. 2-Set up CloudWatch monitoring for your EC2 instance as well as AutoScaling in case your consumer EC2 instance is shut down. Next, send the data to Kinesis Data Firehose before writing the data into S3. Since Kinesis Data Firehose has a retry mechanism built in, the chance of data being lost is extremely low. 3-Set up CloudWatch monitoring for your EC2 instance as well as AutoScaling in case your consumer EC2 instance is shut down. Next, ensure that the maximum number of hours (168 hours) is selected for data retention when creating your Kinesis Data Stream. Finally, write logic on the consumer EC2 instance that handles unprocessed data in the Kinesis Data Stream and failed writes to S3. 4-Set up CloudWatch monitoring for your EC2 instance as well as AutoScaling in case your consumer EC2 instance is shut down. Next, set up a Lambda function to poll the Kinesis Data Stream for records that failed delivery and then send those requests back into the consumer EC2 instance.
Answer-In this setup, the data is being ingested by Kinesis Data Streams and processed and delivered using an EC2 instance. It's best practice to always set up CloudWatch monitoring for your EC2 instance as well as AutoScaling in case your consumer EC2 instance is shut down. Since this data is critical and we cannot afford to lose it, we should set the retention period to the maximum number of hours (168 hours or 7 days). Finally, we need logic on the consumer EC2 instance to reprocess failed records that are still in the data stream and to retry failed writes to S3.
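Assuming the stream already exists, the retention period can be raised to the 168 hours (7 days) discussed here with a single boto3 call (the stream name is a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Raise the stream's retention period to 168 hours (7 days) so
# unprocessed records survive a prolonged consumer outage.
kinesis.increase_stream_retention_period(
    StreamName="critical-events",      # hypothetical stream name
    RetentionPeriodHours=168,
)
```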
You are a ML specialist within a large organization who helps job seekers find both technical and non-technical jobs. You've collected data from a data warehouse from an engineering company to determine which skills qualify job seekers for different positions. After reviewing the data you realize the data is biased. Why? 1-The data collected has missing values for different skills for job seekers. 2-The data collected only has a few attributes. Attributes like skills and job title are not included in the data. 3-The data collected needs to be from the general population of job seekers, not just from a technical engineering company. 4-The data collected is only a few hundred observations, making it biased toward a small subset of job types.
Answer-It's important to know what type of questions we are trying to solve. Since our organization helps both technical and non-technical job seekers, only gathering data from an engineering company is biased to those looking for technical jobs. We need to gather data from many different repositories, both technical and non-technical.
Your organization has given you several different sets of key-value pair JSON files that need to be used for a machine learning project within AWS. What type of data is this classified as and where is the best place to load this data into? 1-Semi-structured data, stored in DynamoDB. 2-Structured data, stored in RDS. 3-Unstructured data, stored in S3. 4-Semi-structured data, stored in S3.
Answer-Key-value pair JSON data is considered Semi-structured data because it doesn't have a defined structure, but has some structural properties. If our data is going to be used for a machine learning project in AWS, we need to find a way to get that data into S3.
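A minimal sketch of landing a key-value JSON record in S3 with boto3 (the bucket, key, and record contents are made up):

```python
import json
import boto3

s3 = boto3.client("s3")

record = {"user_id": "42", "skills": ["python", "sql"]}   # sample key-value JSON

# Land the semi-structured JSON in S3 so downstream AWS ML services can read it.
s3.put_object(
    Bucket="example-ml-bucket",                 # hypothetical bucket
    Key="raw/users/42.json",
    Body=json.dumps(record).encode("utf-8"),
)
```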
Which service in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time? 1- Kinesis Streams 2- Kinesis Video Streams 3- Kinesis Data Analytics 4- Kinesis Firehose
Answer-Kinesis Data Analytics. It lets you run SQL queries (or Apache Flink applications) against streaming data so you can gain actionable insights and respond to your business and customer needs in real time.
You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements? 1- Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3. 2- Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions. 3- Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users. 4- Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
Answer-Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.
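As a hedged sketch of the Lambda side of this pipeline: when Kinesis Data Analytics invokes a Lambda function as its output destination, the function receives base64-encoded query results and must acknowledge each record. The event/response field names below follow that Lambda-as-output contract as I understand it (verify against current documentation), and push_recommendation is purely hypothetical.

```python
import base64
import json


def handler(event, context):
    """Invoked by a Kinesis Data Analytics SQL application as its output
    destination. Each record is the base64-encoded result of the SQL query;
    we act on it (e.g. push a product suggestion) and acknowledge it.
    """
    results = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Hypothetical downstream call that surfaces the recommendation
        # to the active user session.
        push_recommendation(payload)

        results.append({"recordId": record["recordId"], "result": "Ok"})

    return {"records": results}


def push_recommendation(payload):
    # Placeholder for the real delivery mechanism (WebSocket push, SNS, etc.).
    print("recommend:", payload)
```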
You are collecting clickstream data from an e-commerce website using Kinesis Data Firehose. You are using the PutRecord API from the AWS SDK to send the data to the stream. What are the required parameters when sending data to Kinesis Data Firehose using the API PutRecord call? 1- Data, PartitionKey, StreamName 2- DeliveryStreamName and Record (containing the data) 3- DataStreamName, PartitionKey, and Record (containing the data) 4- Data, PartitionKey, StreamName, ShardId
Answer-Kinesis Data Firehose is used as a delivery stream. We do not have to worry about shards, partition keys, etc. All we need is the Firehose DeliveryStreamName and the Record object (which contains the data).
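A minimal boto3 sketch of the call (the delivery stream name and payload are placeholders):

```python
import json
import boto3

firehose = boto3.client("firehose")

# PutRecord to a delivery stream needs only the stream name and a Record
# containing the data -- no partition key or shard handling.
firehose.put_record(
    DeliveryStreamName="clickstream-delivery",   # hypothetical name
    Record={"Data": json.dumps({"page": "/cart", "user": "42"}).encode("utf-8")},
)
```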
Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools? 1-Kinesis Video Streams 2-Kinesis Streams 3-Kinesis Data Analytics 4-Kinesis Firehose
Answer-Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination - places like S3, Redshift, Elasticsearch, and Splunk instances.
Which service in the Kinesis family allows you to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing? 1-Kinesis Firehose 2-Kinesis Streams 3-Kinesis Data Analytics 4-Kinesis Video Streams
Answer-Kinesis Video Streams allows you to stream video, images, audio, and radar data into AWS to further analyze, build custom applications around, or store in S3.
A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university wants to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort? 1-Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB. 2-Use Amazon Kinesis Firehose to ingest the video in near-real time and output results to S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB. 3-Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB. 4-Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
Answer-Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.
You are a ML specialist who has been tasked with setting up a transformation job for 900 TB of data. You have set up several ETL jobs written in PySpark on AWS Glue to transform your data, but the ETL jobs are taking a very long time to process and are extremely expensive. What are your other options for processing the data? 1-Offload the data to Redshift and perform transformation from Redshift rather than S3. Set up AWS Glue jobs to use Redshift as the input data store, then run ETL jobs on batches of Redshift data. Adjust the batch size until performance and cost satisfaction is met. 2-Create an EMR cluster with Spark, Hive, and Flink to perform the ETL jobs. Tweak cluster size, instance types, and data partitioning until performance and cost satisfaction is met. 3-Create a Kinesis Data Stream to stream the data to multiple EC2 instances, each performing partition workloads and ETL jobs. Tweak cluster size, instance types, and data partitioning until performance and cost satisfaction is met. 4-Change the job type to Python shell and use built-in libraries to perform the ETL jobs. The built-in libraries perform better than Spark jobs and are a fraction of the cost.
Answer-Since AWS Glue is fully managed, it requires less configuration and setup than would have to be done on EMR. If we have mass amounts of data that need processing and AWS Glue is too slow or too expensive, an alternative would be to use an EMR cluster with appropriate frameworks installed. Depending on your workload size and needs, EMR can be cheaper but requires much more configuration and setup over the fully managed AWS Glue service.
You have been tasked with collecting thousands of PDFs for building a large corpus dataset. The data within this dataset would be considered what type of data? 1-Unstructured 2-Relational 3-Semi-structured 4-Structured
Answer-Since PDFs have no real structure to them, like key-value pairs or column names, they are considered unstructured data.
You are a ML specialist preparing some labeled data to help determine whether a given leaf originates from a poisonous plant. The target attribute is poisonous and is classified as 0 or 1. The data that you have been analyzing has the following features: leaf height (cm), leaf length (cm), number of cells (trillions), poisonous (binary). After initial analysis, you do not suspect any outliers in any of the attributes. After using the data given to train your model, you are getting extremely skewed results. What technique can you apply to possibly help solve this issue? 1-Drop the "number of cells" attribute. 2-Apply one-hot encoding to each of the attributes, except for the "poisonous" attribute (since it is already encoded). 3-Standardize the "number of cells" attribute. 4-Normalize the "number of cells" attribute.
Answer-Since the "number of cells" attribute is on a scale of trillions and we do not suspect any outliers, we can normalize the values within the "number of cells" features so all of our values are between 0 and 1.
A ML specialist is working for a bank and trying to determine if credit card transactions are fraudulent or non-fraudulent. The features of the data collected include things like customer name, customer type, transaction amount, length of time as a customer, and transaction type. The transaction type is classified as 'normal' and 'abnormal'. What data preparation action should the ML specialist take? 1-Drop both the customer type and the transaction type before training the model. 2-Drop the transaction type and perform label encoding on the customer type before training the model. 3-Drop the length of time as a customer and perform label encoding on the transaction type before training the model. 4-Drop the customer name and perform label encoding on the transaction type before training the model.
Answer-Since the customer name has nothing to do with whether a transaction was fraudulent or non-fraudulent, we can safely drop this attribute. The other attributes are important to us as our ML algorithm can use these to help determine a prediction. We also need to encode the target label attribute of transaction type.
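A quick pandas/scikit-learn sketch of this preparation step (the sample rows are invented):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the features described in the question.
df = pd.DataFrame({
    "customer_name": ["A. Smith", "B. Jones"],
    "customer_type": ["retail", "business"],
    "transaction_amount": [120.0, 87.5],
    "length_of_time_as_customer": [34, 7],
    "transaction_type": ["normal", "abnormal"],
})

df = df.drop(columns=["customer_name"])                    # name carries no signal
df["transaction_type"] = LabelEncoder().fit_transform(df["transaction_type"])
print(df.head())
```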
You are working for an organization that takes different metrics about its customers and classifies them with one of the following statuses: bronze, silver, and gold. Depending on their status they get more/less discounts and are placed as a higher/lower priority for customer support. The algorithm you have chosen expects all numerical inputs. What can be done to handle these status values? 1-Use one-hot encoding techniques to map values for each status dropping the original status feature. 2-Apply random numbers to each status value and apply gradient descent until the values converge to expected results. 3-Use one-hot encoding techniques to map values for each status. 4-Map bronze, silver and gold to some respective graduated value to respect the ordinal nature.
Answer-Since these values are ordinal (order does matter) we cannot use one-hot encoding techniques. We need to map these values to numeric values that preserve their order, or train our model with different encodings and see which encoding works best.
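One simple way to encode an ordinal feature, sketched with pandas (the 1/2/3 spacing is an assumption you may want to experiment with):

```python
import pandas as pd

df = pd.DataFrame({"status": ["bronze", "gold", "silver", "bronze"]})

# Preserve the ordering bronze < silver < gold with a graduated mapping;
# the exact numeric spacing is a modelling choice worth tuning.
status_rank = {"bronze": 1, "silver": 2, "gold": 3}
df["status_rank"] = df["status"].map(status_rank)
print(df)
```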
You are a ML specialist preparing a dataset for a supervised learning problem. You are using the Amazon SageMaker Linear Learner algorithm. You notice the target label attributes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire dataset is less than 5%. You have plenty of time to implement a solution. What should you do to have the least amount of bias due to missing values? 1-For each feature that is missing, use supervised learning to approximate the values based on other features. 2-First, normalize the non-missing values. Then, replace the missing values with the normalized values. 3-Replace the missing values with mean or median values from the other values of the same feature. 4-Drop all of the rows that contain missing values because they represent less than 5% of the data.
Answer-Since we have the time and want to create the least amount of bias, using supervised learning to predict missing values based on the values of other features is the answer. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, or dropping the rows altogether.
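One concrete way to do this is scikit-learn's IterativeImputer, which regresses each feature with missing values on the other features; a minimal sketch with toy data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the class)
from sklearn.impute import IterativeImputer

# Each feature with missing values is modelled as a function of the other
# features (regression) -- a richer alternative to mean/median imputation.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 8.0, 9.0],
    [7.0, 11.0, 12.0],
])

X_filled = IterativeImputer(random_state=0).fit_transform(X)
print(X_filled)
```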
Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) instead of Kinesis Producer Library (KPL). What might be the reasoning behind this? 1-The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library. 2-The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture. 3-The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library. 4-The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
Answer-The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but none of these answers mentions Lambda.
You work for an organization that wants to manage all of the data stored in S3. The organization wants to automate the transformation jobs on the S3 data and maintain a data catalog of the metadata concerning the datasets. The solution that you choose should require the least amount of setup and maintenance. Which solution will allow you to achieve these goals? 1-Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, create an AWS Glue job, and set up a schedule for data transformation jobs. 2-Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script that runs transformation jobs on a schedule. 3-Create a cluster in EMR that uses Apache Hive. Then, create a simple Hive script that runs transformation jobs on a schedule. 4-Create a cluster in EMR that uses Apache Spark. Then, create an Apache Hive metastore and a script that runs transformation jobs on a schedule.
Answer-The solution that requires the least amount of setup and maintenance is to set up an AWS Glue crawler to populate the AWS Glue Data Catalog with metadata about your data and an AWS Glue job to transform that data on a schedule you choose.
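A hedged boto3 sketch of that setup (the names, role ARN, S3 path, schedules, and the referenced Glue job are all placeholders):

```python
import boto3

glue = boto3.client("glue")

# Crawler populates the Glue Data Catalog with table metadata for the S3 data.
glue.create_crawler(
    Name="s3-datalake-crawler",                       # hypothetical names throughout
    Role="arn:aws:iam::123456789012:role/GlueRole",
    DatabaseName="datalake_catalog",
    Targets={"S3Targets": [{"Path": "s3://example-ml-bucket/raw/"}]},
    Schedule="cron(0 1 * * ? *)",                     # crawl nightly
)

# Scheduled trigger runs the transformation job against the catalogued data.
glue.create_trigger(
    Name="nightly-transform",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "transform-s3-data"}],       # existing Glue job
    StartOnCreation=True,
)
```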
A term frequency-inverse document frequency (tf-idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences: { Hello world } and { Hello how are you }. What are the dimensions of the tf-idf vector/matrix? 1-(2, 6) 2-(5, 9) 3-(2, 10) 4-(2, 9) 5-(2, 5)
Answer-There are 2 sentences (or corpus data we are vectorizing) with 5 unique unigrams ('are', 'hello', 'how', 'world', 'you') and there are 4 unique bigrams ('are you', 'hello how', 'hello world', 'how are'). Thus, the vectorized matrix would be (2, 9).
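You can verify this with scikit-learn's TfidfVectorizer (get_feature_names_out requires a reasonably recent scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Hello world", "Hello how are you"]

# Unigrams + bigrams: 5 unique unigrams and 4 unique bigrams = 9 columns.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)                       # (2, 9)
print(vectorizer.get_feature_names_out())
```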
We are analyzing the following text { Hello cloud gurus! Keep being awesome! }. We apply lowercase transformation, remove punctuation and n-gram with a sliding window of 3. What are the unique trigrams produced? What are the dimensions of the tf-idf vector/matrix? 1-['hello cloud gurus', 'cloud gurus keep', 'gurus keep being', 'keep being awesome'] and (1, 4) 2-['hello cloud gurus', 'cloud gurus keep', 'gurus keep being', 'keep being awesome'] and (2, 4) 3-['hello cloud gurus', 'cloud gurus keep', 'keep being awesome'] and (1, 3) 4-['hello cloud gurus!', 'cloud gurus keep', 'gurus keep being', 'keep being awesome.'] and (1, 4)
Answer-There is only 1 sentence (or corpus data we are vectorizing) with 4 unique trigrams ('hello cloud gurus', 'cloud gurus keep', 'gurus keep being', 'keep being awesome'). So the vectorized matrix would be (1, 4). Also, remember, since we removed punctuation and performed lowercase transformation, those cannot be part of the unique trigrams.
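The same check for the trigram case:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Hello cloud gurus! Keep being awesome!"]

# Default tokenization lowercases and drops punctuation; ngram_range=(3, 3)
# keeps only trigrams, giving the 4 unique trigrams listed above.
vectorizer = TfidfVectorizer(ngram_range=(3, 3))
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)                       # (1, 4)
print(vectorizer.get_feature_names_out())
```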
You are a ML specialist who has 780 GB of files in a data lake hosted on S3. The metadata about these files is stored in the S3 bucket as well. You need to search through the data lake to get a better understanding of what the data consists of. You will most likely do multiple searches depending on results found throughout your research. Which solution meets the requirements with the LEAST amount of effort? 1-Use Amazon Athena to analyze and query your S3 data. 2-First, enable S3 analytics then use the metastore files to analyze your data. 3-Create a Redshift cluster that uses S3 as the input data source, and use Redshift Spectrum to analyze and query your S3 data. 4-Create an EMR cluster with Apache Hive to analyze and query your data.
Answer-We can use Amazon Athena to query our S3 data with the least amount of effort. S3 analytics is used for storage class analysis, and the other answers require much more effort and setup.
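A minimal boto3 sketch of kicking off an Athena query (this assumes a table has already been defined over the S3 data, for example via a Glue crawler; the database, table, and output location are placeholders):

```python
import boto3

athena = boto3.client("athena")

# Ad-hoc exploration of the data lake.
response = athena.start_query_execution(
    QueryString="SELECT * FROM datalake_catalog.events LIMIT 10",
    QueryExecutionContext={"Database": "datalake_catalog"},
    ResultConfiguration={"OutputLocation": "s3://example-ml-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```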
In general within your dataset, what is the minimum number of observations you should have compared to the number of features? 1-10,000 times as many observations as features. 2-100 times as many observations as features. 3-10 times as many observations as features. 4-1000 times as many observations as features.
Answer-We need a large, robust, feature-rich dataset. In general, having AT LEAST 10 times as many observations as features is a good place to start. So for example, we have a dataset with the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. Since id is just an identifier, we have 4 features (date, full review, full review summary, and a binary safe/unsafe tag). This means we need AT LEAST 40 rows/observations.
What are the programming languages offered in AWS Glue for Spark job types? 1-Java 2-R 3-Python 4-C# 5-Scala
Answer-When choosing Spark as the job type for AWS Glue jobs, you can write code in Scala or Python (PySpark). You can have the code generated for you by AWS or you can provide your own scripts.
You are a ML specialist who has a Python script using libraries like Boto3, Pandas, NumPy, and sklearn to help transform data that is in S3. On your local machine, the data transformation is working as expected. You need to find a way to schedule this job to run periodically and store the transformed data back into S3. What is the best option to achieve this? 1-Create an AWS Glue job that uses Python shell as the job type and executes the code written to transform and store data in S3. Then, set up this job to run on some schedule. 2-Create an AWS Glue job that uses Spark as the job type to create Scala code to transform and store data in S3. Then, set up this job to run on some schedule. 3-Create an EMR cluster that runs Apache Spark code to transform and store data in S3. Then set up this job to run on some schedule. 4-Create an AWS Glue job that uses Spark as the job type to create PySpark code to transform and store data in S3. Then, set up this job to run on some schedule.
Answer-When creating AWS Glue jobs, you can select Python shell as the job type, which allows you to use several built-in Python libraries that most data scientists and ML specialists are used to using. If you chose the Spark job type, you would have to rewrite your code in PySpark or Scala instead of reusing your existing Python code with the Python shell job type.
What are your options for storing data into S3? 1-The AWS console 2-AWS CLI 3-AWS SDK 4-PutRecords API call 5-UPLOAD command 6-UNLOAD command
Answer-You can use the AWS console, the AWS command line interface (CLI), or the AWS SDK.
Which Amazon service allows you to build a high-quality training labeled dataset for your machine learning models? This includes human workers, vendor companies that you choose, or an internal, private workforce. 1-S3 2-Lambda 3-SageMaker Ground Truth 4-Jupyter Notebooks
Answer-You could use Jupyter Notebooks or Lambda to help automate the labeling process, but SageMaker Ground Truth is specifically used for building high-quality training datasets.
You are on a personal quest to design the best chess playing model in the world. What might be a good strategy for this objective? 1-Use Mechanical Turk to crowdsource the best chess moves in a variety of scenarios then use those for a supervised learning session. 2-Use a supervised learning strategy that is trained by feeding in the chess moves of thousands of famous chess experts. 3-Use an unsupervised learning strategy to analyze similarities across thousands of the best chess matches. 4-Use a reinforcement learning strategy to let the model learn itself. 5-Use a Factorization Machine approach to analyze the winning series of moves across thousands of chess matches to find the best series of moves.
Answer-Chess is a complex game and would require some training to develop a good model. Supervised learning, by definition, can only be as good as the training data that it is supplied with, so we won't be able to meet our goal of the best chess model using supervised learning. Rather, reinforcement learning would have the best chance of becoming better than any other player if given enough iterations and a proper reward function.
We want to perform automatic model tuning on our linear learner model using the built-in algorithm from SageMaker. We have chosen the tunable hyperparameter we want to use. What is our next step? (Choose 2) 1-Choose the SageMaker Notebook instance where the input data is stored. 2-Decide what hyperparameter we want SageMaker to tune in the tuning process. 3-Choose an algorithm container from ECR ensuring it's tagged with :1 4-Choose a range of values which SageMaker will sweep through for the selected tunable hyperparameter and target objective metric we want to use in the tuning process. 5-Submit the tuning job via the console or CLI.
Answer-For automatic tuning, we first must choose the tunable hyperparameter, then choose a range of values SageMaker can use on that tunable hyperparameter. We then have to choose the objective metric we want SageMaker to watch as it adjusts the tunable hyperparameter. The final step is to submit the tuning job, and this can be done using the console or CLI.
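A hedged boto3 sketch of submitting such a tuning job for Linear Learner (the job name, image URI, role ARN, S3 paths, objective metric, and parameter ranges are placeholders to adapt to your own setup):

```python
import boto3

sm = boto3.client("sagemaker")

# All names, the ECR image URI, and S3 paths below are placeholders.
sm.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="linear-learner-tuning",
    HyperParameterTuningJobConfig={
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Minimize",
            "MetricName": "validation:objective_loss",   # example objective metric
        },
        "ResourceLimits": {"MaxNumberOfTrainingJobs": 20, "MaxParallelTrainingJobs": 2},
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                # Range of values SageMaker will sweep for the tunable hyperparameter.
                {"Name": "learning_rate", "MinValue": "0.0001", "MaxValue": "0.1"}
            ]
        },
    },
    TrainingJobDefinition={
        "AlgorithmSpecification": {
            "TrainingImage": "<linear-learner-image-uri>",
            "TrainingInputMode": "File",
        },
        "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://example-ml-bucket/train/",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": "s3://example-ml-bucket/output/"},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    },
)
```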