AWS Machine Learning 2022

Precision metric

Amazon ML provides several metrics to measure the predictive accuracy of a machine learning model, namely Accuracy, Precision, Recall, and False Positive Rate. The Precision metric measures the fraction of actual positives among those examples that are predicted as positive. The range is 0 to 1. A larger value indicates better predictive accuracy. In this scenario, the false positives are movies recommended to users that they don't actually like. Since the company is most concerned about customer satisfaction, the model should have as few false positives as possible. Because Precision = TP / (TP + FP), fewer false positives mean higher Precision. Recall is incorrect because this metric does not consider the false positives in your data. False Negative Rate is incorrect because this metric is concerned with false negatives rather than false positives. Root Mean Square Error (RMSE) is incorrect because it is an evaluation metric for regression models and is not relevant for this scenario.
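As a minimal sketch of the formula behind this card, Precision can be computed from the counts of true positives and false positives. The counts below are hypothetical and only illustrate that fewer false positives yield higher Precision.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are actually positive."""
    if tp + fp == 0:
        return 0.0  # no positive predictions were made
    return tp / (tp + fp)

# Hypothetical counts: 80 liked recommendations in both cases,
# with 20 versus 5 disliked recommendations (false positives).
many_false_positives = precision(tp=80, fp=20)
few_false_positives = precision(tp=80, fp=5)
print(many_false_positives, few_false_positives)
```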

K-nearest neighbor (K-NN)

An algorithm that classifies a data point based on its "k" nearest neighbors, where "k" is the number of neighbors considered

transfer learning

In transfer learning, the network is initialized with pre-trained weights and just the top fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new data. In this mode, training can be achieved even with a smaller dataset. This is because the network is already trained and therefore can be used in cases without sufficient training data.

Binary Classification Model

ML models for binary classification problems predict a binary outcome (one of two possible classes). To train binary classification models, Amazon ML uses the industry-standard learning algorithm known as logistic regression.

Regression Model

ML models for regression problems predict a numeric value. For training regression models, Amazon ML uses the industry-standard learning algorithm known as linear regression.

Random cropping

Random cropping is incorrect because this is just a data augmentation technique and will not decrease dimensionality.

Blazing Text

The Amazon SageMaker BlazingText algorithm provides highly-optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification. Word embedding is a vector representation of a word. Words that are semantically similar correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words.

Amazon SageMaker Object2Vec algorithm

The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space.

Correlation matrix

This visualization technique shows the pairwise correlation coefficients between the variables in a dataset. It does not provide information about how close the predicted values are to the true values.

Amazon Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find meaning and insights in text. In this scenario, you can use these three services to build the ML-pipeline needed to satisfy the requirements. First, you'd have to create a transcription job using Amazon Transcribe to transform the recordings to text. Then, translate non-English calls to English using Amazon Translate. Finally, use Amazon Comprehend for sentiment analysis. There's no need to deploy or train your own model as all of these services are fully managed and are readily-available through APIs.

Amazon Translate

Amazon Translate is a Neural Machine Translation (MT) service for translating text between supported languages.

k-NN binary classification model

Normalization of numeric variables can help the learning process if there are very large range differences between numeric variables because variables with the highest magnitude could dominate the ML model, no matter if the feature is informative with respect to the target or not. Most k-NN models use the Euclidean distance to measure how similar the target data is to a specific class. Since the Euclidean distance is a function of data points in a graph, the position of our data points greatly influences the resulting Euclidean distance. Consider a dataset with a feature called 'age' that ranges between 18-35 and a 'product price' that ranges between $50 - $5,000. Since the product price has a significantly larger value than the age, the model will treat the product price with "more importance". This would have a negative impact on the model's ability to classify data correctly.
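The effect described above can be sketched with plain Python. The three customers below are hypothetical; the ranges (age 18 to 35, price $50 to $5,000) come from the example in this card, and min-max scaling is used here as one possible normalization.

```python
import math

def min_max_scale(value, lo, hi):
    """Map value from the range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scale(point):
    """Scale an (age, price) pair using the ranges from the card."""
    age, price = point
    return (min_max_scale(age, 18, 35), min_max_scale(price, 50, 5000))

# Hypothetical customers as (age, product price).
a = (18, 1000.0)
b = (35, 1000.0)  # same price as a, opposite end of the age range
c = (18, 1100.0)  # same age as a, slightly different price

# Unscaled: the small price difference outweighs the full age range.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Scaled: the age difference now dominates, as one would expect.
scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))
print(raw_ab, raw_ac, scaled_ab, scaled_ac)
```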

Amazon SageMaker supported data formats

The following table lists the supported data formats for Amazon SageMaker built-in algorithms:

Min-max scaling

normalizes data within a fixed range

image classification

process of extracting differentiated classes or themes from raw remotely sensed satellite data

Pearson correlation coefficient

this is used for measuring the statistical relationship, or association, between two continuous variables
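A minimal sketch of the coefficient, using the standard definition (covariance divided by the product of the standard deviations). The sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: a perfectly increasing and a perfectly decreasing relationship.
r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_pos, r_neg)
```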

Target encoding

this type of encoding is achieved by replacing categorical variables with just one new numerical variable and replacing each category of the categorical variable with its corresponding probability of the target. This won't convert categorical variables into binary values.

Label encoding

this type of encoding will only convert categorical data into integer labels (e.g. 0,1,2,3,4) and not into a vector of binary values (e.g. [1,0,0], [0,1,0]).

Multi Classification model to predict movie genres

A confusion matrix is a tool for visualizing the performance of a multiclass model. It has entries for all possible combinations of correct and incorrect predictions, and shows how often each one was made by our model. Typical metrics used in multiclass are the same as the metrics used in the binary classification case. The metric is calculated for each class by treating it as a binary classification problem after grouping all the other classes as belonging to the second class. Then the binary metric is averaged over all the classes to get either a macro average (treat each class equally) or weighted average (weighted by class frequency) metric. In Amazon ML, the macro average F1-measure is used to evaluate the predictive success of a multiclass classifier.
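The one-vs-rest averaging described above can be sketched as follows. This is an illustrative implementation, not Amazon ML's code, and the genre labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, treating each class equally."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        # Treat class c as "positive" and every other class as "negative".
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical true and predicted movie genres.
y_true = ["action", "action", "comedy", "drama", "drama", "drama"]
y_pred = ["action", "comedy", "comedy", "drama", "drama", "action"]
score = macro_f1(y_true, y_pred)
print(score)
```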

L2 regularization

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. (Contrast with L1 regularization.) L2 regularization always improves generalization in linear models.

Area Under the Curve (AUC)

AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold. The Receiver Operating Characteristic (ROC) curve is a graphical plot that shows the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate an ML model that is highly accurate. Values near 0.5 indicate an ML model that is no better than guessing at random. Values near 0 are unusual to see, and typically indicate a problem with the data. Essentially, an AUC near 0 says that the ML model has learned the correct patterns, but is using them to make predictions that are flipped from reality ('0's are predicted as '1's and vice versa).
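One way to see why AUC needs no threshold is to compute it as the fraction of positive/negative pairs in which the positive example receives the higher score. The sketch below uses that pairwise definition with hypothetical scores; it is meant for small examples, not as an efficient implementation.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
good = auc(labels, [0.9, 0.8, 0.3, 0.1])     # positives always score higher
flipped = auc(labels, [0.1, 0.3, 0.8, 0.9])  # predictions reversed from reality
mixed = auc(labels, [0.9, 0.2, 0.8, 0.1])    # one pair out of four is misordered
print(good, flipped, mixed)
```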

Increase the availability of the application's underlying machine learning component

Add SageMaker instances of the same size and use the existing endpoint to host them. While the benefit of building custom ML models for each use case is higher inference accuracy, the downside is that the cost of deploying models increases significantly, and it becomes difficult to manage so many models in production. These challenges become more pronounced when you don't access all models at the same time but still need them to be available at all times. Amazon SageMaker multi-model endpoints address these pain points and provide a scalable and cost-effective solution for deploying large numbers of models. They use a shared serving container that is enabled to host multiple models. This reduces hosting costs by improving endpoint utilization compared with using single-model endpoints. It also reduces deployment overhead because Amazon SageMaker manages loading models in memory and scaling them based on the traffic patterns to them.

Amazon Augmented AI (Amazon A2I)

Amazon Augmented AI (Amazon A2I) enables you to build the workflows that are required for human review of machine learning predictions. Amazon Textract is directly integrated with Amazon A2I so that you can easily get low-confidence results from Amazon Textract's AnalyzeDocument API operation reviewed by humans. You can use Amazon Textract's AnalyzeDocument API for form data extraction and the Amazon A2I console to specify the conditions under which Amazon A2I routes predictions to reviewers. The conditions are set based on the confidence threshold of important form keys. For example, you can send a document to a human to review if the key "Name" or its associated value "Jane Doe" was detected with low confidence.

Amazon Forecast

Amazon Forecast provides a number of filling methods to handle missing values in your target time series and related time series datasets. Filling is the process of adding standardized values to missing entries in your dataset. Amazon Forecast supports the following filling methods:
- Middle filling: fills any missing values between the item start and item end date of a dataset.
- Back filling: fills any missing values between the last recorded data point and the global end date of a dataset.
- Future filling (related time series only): fills any missing values between the global end date and the end of the forecast horizon.

Amazon Lex

Amazon Lex is a service for building conversational interfaces into any application using voice and text. Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text and natural language understanding (NLU) to recognize the intent of the text, which enables you to build applications with highly engaging user experiences and lifelike conversational interactions.

Amazon Transcribe

Amazon Transcribe is an AWS service that makes it easy for customers to convert speech to text. Using Automatic Speech Recognition (ASR) technology, customers can use Amazon Transcribe for a variety of business applications, including transcription of voice-based customer service calls, generation of subtitles on audio/video content, and (text-based) content analysis of audio/video content.

Keras Convolutional Neural Network (CNN)

CNN models are mainly used for problems that deal with image data.

Classification model type

Classification is a task that requires the utilization of machine learning algorithms that learn how to assign a particular class label to examples from the problem domain. An easy to understand example for this particular model is classifying emails as "spam" or "not spam." Examples of classification problems include: Given an example, classify if it is spam or not. Given a handwritten character, classify it as one of the known characters. Given recent user behavior, classify as churn or not. Classification is the most suitable model type for the given task since you have to categorize employees into two groups — "stay" or "leave" which is actually a type of Binary Classification Model as it predicts a binary outcome (one of two possible classes).

k-fold validation

Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. Use cross-validation to detect overfitting, i.e. failing to generalize a pattern. In k-fold cross-validation, you split the input data into k subsets of data (also known as folds). You train your models on all but one (k-1) of the subsets and then evaluate them on the subset that was not used for training. This process is repeated k times, with a different subset reserved for evaluation (and excluded from training) each time. The diagram above shows an example of the training subsets and complementary evaluation subsets generated for each of the four models that are created and trained during 4-fold cross-validation. Model one uses the first 25 percent of data for evaluation, and the remaining 75 percent for training. Model two uses the second subset of 25 percent (25 percent to 50 percent) for evaluation, and the remaining three subsets of the data for training, and so on. Performing 4-fold cross-validation generates four models, four data sources to train the models, four data sources to evaluate the models, and four evaluations, one for each model. Amazon ML generates a model performance metric for each evaluation. For example, in 4-fold cross-validation for a binary classification problem, each of the evaluations reports an area under curve (AUC) metric. You can get the overall performance measure by computing the average of the four AUC metrics.
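The splitting scheme can be sketched as below, assuming contiguous folds over a dataset of 8 records. The per-fold AUC values are hypothetical placeholders, since no model is actually trained here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, eval_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        eval_idx = list(range(start, stop))
        # Everything outside the evaluation fold is used for training.
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, eval_idx
        start = stop

# 4-fold cross-validation: each model evaluates on a different 25 percent.
folds = list(k_fold_indices(8, 4))
for train_idx, eval_idx in folds:
    print("train:", train_idx, "eval:", eval_idx)

# Hypothetical AUC reported by each of the four evaluations.
fold_aucs = [0.90, 0.86, 0.88, 0.92]
mean_auc = sum(fold_aucs) / len(fold_aucs)
print(mean_auc)
```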

Custom entity recognition

Custom entity recognition extends the capability of Amazon Comprehend by enabling you to identify new entity types not supported as one of the preset generic entity types. This means that in addition to identifying entity types such as LOCATION, DATE, PERSON, and so on, you can analyze documents and extract entities like product codes or business-specific entities that fit your particular needs. Creating a custom entity recognition model is a more effective approach, compared to using string matching or regular expressions to identify entities. For example, to extract product codes, it would be difficult to enumerate all possible patterns to apply string matching. But a custom entity recognition model can learn the context where those product codes are most likely to appear and then make such inferences even though it has never previously seen the exact product codes. As well, typos in product codes and the addition of new product codes can still be expected to be caught by Amazon Comprehend's custom entity recognition model but would be missed when using string matches against a static list.

Data augmentation

Data augmentation is a technique to artificially create new training data from existing training data. This is done by applying domain-specific techniques to examples from the training data that create new and different training examples. Image data augmentation is perhaps the most well-known type of data augmentation and involves creating transformed versions of images in the training dataset that belong to the same class as the original image. Training deep learning neural network models on more data can result in more skillful models, and the augmentation techniques can create variations of the images that can improve the ability of the fit models to generalize what they have learned to new images. The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. It takes an image as input and outputs one or more labels assigned to that image. It uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available. The recommended input format for the Amazon SageMaker image classification algorithms is Apache MXNet RecordIO. However, you can also use raw images in .jpg or .png format. You can also use a data augmentation type (augmentation_type) hyperparameter to configure your input images to be augmented in multiple ways. You can randomly crop the image and flip the image horizontally and alter the color using the Hue-Saturation-Lightness channels.

AWS Glue DataBrew

In AWS Glue DataBrew, you can use recipe actions to tabulate and summarize data from different perspectives, or to perform advanced transformations. One hot encoding is a process by which categorical variables are converted into numerical values that could be provided to ML algorithms to do a better job in prediction. It creates a number (n) of numerical columns, where n is the number of unique values in a selected categorical variable. For example, consider a column named shirt_size. Shirts are available in small, medium, large, or extra-large. In this scenario, there are four distinct values for shirt_size. Therefore, ONE_HOT_ENCODING generates four new columns. Each new column is named shirt_size_x, where x represents a distinct shirt_size value.
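The shirt_size example can be reproduced in plain Python. This sketch only mimics the result of the transformation; it does not call the DataBrew API, and the helper name one_hot_encode is hypothetical.

```python
def one_hot_encode(rows, column):
    """Replace `column` with one 0/1 column per distinct value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            # New columns are named <column>_<value>, e.g. shirt_size_small.
            new_row[f"{column}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded

rows = [
    {"shirt_size": "small"},
    {"shirt_size": "medium"},
    {"shirt_size": "large"},
    {"shirt_size": "extra-large"},
]
encoded = one_hot_encode(rows, "shirt_size")
print(encoded[0])
```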

Early Stopping option

In Amazon SageMaker, you can opt to stop training jobs early if the results are not improving significantly based on a particular metric. Stopping training jobs early can help reduce compute time and helps you avoid overfitting your model. When you enable early stopping for a hyperparameter tuning job, SageMaker evaluates each training job that the hyperparameter tuning job launches as follows:
- After each epoch of training, get the value of the objective metric.
- Compute the running average of the objective metric for all previous training jobs up to the same epoch, and then compute the median of all of the running averages.
- If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, SageMaker stops the current training job.
To use early stopping with your own algorithm, you must write your algorithm so that it emits the value of the objective metric after each epoch. You can use the following frameworks for this task: TensorFlow, MXNet, Chainer, PyTorch, and Spark.
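The stopping rule described in this card can be sketched as follows. This is an illustration of the rule as described, not SageMaker's implementation; the function name should_stop and the metric histories are hypothetical.

```python
from statistics import median

def should_stop(current_value, previous_jobs, epoch, minimize=True):
    """Apply the median rule at a given epoch (1-based).

    previous_jobs holds one list of per-epoch objective values per earlier job.
    """
    running_averages = [
        sum(history[:epoch]) / epoch
        for history in previous_jobs
        if len(history) >= epoch
    ]
    if not running_averages:
        return False  # nothing to compare against yet
    threshold = median(running_averages)
    if minimize:
        return current_value > threshold
    return current_value < threshold

# Hypothetical validation loss per epoch for three earlier training jobs.
previous = [[0.9, 0.7, 0.5], [0.8, 0.6, 0.4], [1.0, 0.9, 0.8]]

stop_bad_job = should_stop(0.95, previous, epoch=3)   # worse than the median
stop_good_job = should_stop(0.45, previous, epoch=3)  # better than the median
print(stop_bad_job, stop_good_job)
```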

Matrix Multiplication

In Linear Algebra, matrix multiplication is a binary operation that produces a matrix from two matrices or in other words, it multiplies two matrices that are usually in array form. In Machine Learning, matrix multiplication is a compute-intensive operation used to process sparse or scattered data produced by the training model.

Residual plot

It is common practice to review the residuals for regression problems. A residual for an observation in the evaluation data is the difference between the true target and the predicted target. Residuals represent the portion of the target that the model is unable to predict. A positive residual indicates that the model is underestimating the target (the actual target is larger than the predicted target). A negative residual indicates an overestimation (the actual target is smaller than the predicted target).
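A short sketch of the sign convention, using hypothetical target values:

```python
# Hypothetical true and predicted targets from an evaluation set.
actual = [10.0, 20.0, 30.0]
predicted = [8.0, 20.0, 33.0]

# Residual = true target - predicted target.
residuals = [a - p for a, p in zip(actual, predicted)]

for r in residuals:
    if r > 0:
        print(r, "model underestimated the target")
    elif r < 0:
        print(r, "model overestimated the target")
    else:
        print(r, "exact prediction")
```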

Amazon Kinesis Client Library (KCL)

KCL helps you consume and process data from a Kinesis data stream by taking care of many of the complex tasks associated with distributed computing. These include load balancing across multiple consumer application instances, responding to consumer application instance failures, checkpointing processed records, and reacting to resharding. The KCL takes care of all of these subtasks so that you can focus your efforts on writing your custom record-processing logic. The KCL is different from the Kinesis Data Streams APIs that are available in the AWS SDKs. The Kinesis Data Streams APIs help you manage many aspects of Kinesis Data Streams, including creating streams, resharding, and putting and getting records. The KCL provides a layer of abstraction around all these subtasks, specifically so that you can focus on your consumer application's custom data processing logic.

Latent Dirichlet Allocation (LDA)

LDA is an unsupervised algorithm mainly used to discover a user-specified number of topics shared by documents within a text corpus. This algorithm is mainly used for topic modeling. For example, you may discover new topics/categories from a document by determining the occurrence of each word.

Multiclass Classification Model

ML models for multiclass classification problems allow you to generate predictions for multiple classes (predict one of more than two outcomes). For training multiclass models, Amazon ML uses the industry-standard learning algorithm known as multinomial logistic regression.

Linear regression

ML models for regression problems predict a numeric value. For training regression models, Amazon ML uses the industry-standard learning algorithm known as linear regression. Examples of Regression Problems - "What will the temperature be in Seattle tomorrow?" - "For this product, how many units will sell?" - "What price will this house sell for?"

PCA

Principal Component Analysis (PCA) is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. The k-means algorithm attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups (see the following figure). You define the attributes that you want the algorithm to use to determine similarity. Another way you can define k-means is that it is a clustering problem that finds k cluster centroids for a given set of records, such that all points within a cluster are closer in distance to their centroid than they are to any other centroid.

RandomCutForest

Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations that diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model. With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
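The three-standard-deviation rule of thumb can be sketched as below. The anomaly scores are hypothetical and would normally come from an RCF model; only the high side is flagged here, since high scores indicate anomalies.

```python
from statistics import mean, pstdev

def flag_anomalies(scores, n_std=3.0):
    """Return the indices of scores more than n_std standard deviations above the mean."""
    cutoff = mean(scores) + n_std * pstdev(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]

# Hypothetical anomaly scores: 20 "normal" points and one clear outlier.
scores = [1.0] * 20 + [10.0]
anomalies = flag_anomalies(scores)
print(anomalies)
```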

Term Frequency - Inverse Document Frequency (TfIdf)

Term Frequency - Inverse Document Frequency (TfIdf) is an algorithm used to convert text data into a numerical representation that can be passed into a machine learning model. The first function (Term Frequency) counts how frequently a word appears in a sentence belonging to a corpus. The second function (Inverse Document Frequency) measures how rare a word is across the whole corpus: the more sentences a word appears in, the lower its value. Tf-Idf is a great way of giving weights to words as it penalizes generic words that commonly appear across all sentences. Consider the example below: In the example, we have 4 sentences inside a document. The word "blue" is only present in the first sentence, while the word "horizon" is present in 3 sentences. Notice how the words "blue" and "horizon" each appear 3 times in the whole document, but "blue" has more weight than "horizon". Even if a word has a low frequency across all documents, the Tf-Idf vectorizer is still able to capture how special that word is by giving it a higher score.
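A minimal sketch of the weighting, assuming the plain variant idf = log(N / df); library vectorizers often use smoothed variants. The four sentences are hypothetical stand-ins in which "blue" appears 3 times in one sentence and "horizon" appears once in each of 3 sentences.

```python
import math

def tf_idf(term, sentence, corpus):
    """Tf-Idf of `term` in one tokenized sentence of a tokenized corpus."""
    tf = sentence.count(term) / len(sentence)
    df = sum(1 for s in corpus if term in s)  # sentences containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical corpus of 4 tokenized sentences.
corpus = [
    "blue blue blue sky over the horizon".split(),
    "the horizon at dawn".split(),
    "ships on the horizon".split(),
    "a quiet morning".split(),
]

blue = tf_idf("blue", corpus[0], corpus)
horizon = tf_idf("horizon", corpus[0], corpus)
print(blue, horizon)
```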

DeepAR Forecasting

The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). It can provide better forecast accuracies compared to classical forecasting techniques such as Autoregressive Integrated Moving Average (ARIMA) or Exponential Smoothing (ES), both of which are implemented in many open-source and commercial software packages for forecasting. The scenario is an example of a cold start forecasting where we want to produce forecasts for a time series with little or no existing historical data (e.g., new shoe releases). Traditional methods such as ARIMA or ES rely solely on the historical data of an individual time series, and as such, they are typically less accurate in the cold start case. A neural network-based algorithm such as DeepAR can learn the typical behavior of new sneaker sales based on the sales patterns of other types of sneakers when they were first released. By learning relationships from multiple related time series within the training data, DeepAR can provide more accurate forecasts than the existing alternatives.

Linear Learner

The Amazon SageMaker linear learner algorithm supports three data channels: train, validation, and test. The algorithm logs validation loss at every epoch, and uses a sample of the validation data to calibrate and select the best model. If you don't provide validation data, the algorithm uses a sample of the training data to calibrate and select the model. If you provide test data, the algorithm logs include the test score for the final model. To set up regression, you can set the predictor_type hyperparameter to regressor where the score is the prediction produced by the model. For classification, you can set the predictor_type to either binary_classifier or multiclass_classifier where the model returns a score and also a predicted_label.

Factorization Machines

The Factorization Machine algorithm is a supervised algorithm used for solving regression and classification tasks. In a nutshell, supervised algorithms are useful in solving problems where you know what output values are to expect based on previous results. For example, classifying whether an email is spam or not.

The Multiple Imputations by Chained Equations (MICE)

The Multiple Imputations by Chained Equations (MICE) algorithm is a robust, informative method of dealing with missing data in your datasets. This procedure imputes or 'fills in' the missing data in a dataset through an iterative series of predictive models. Each specified variable in the dataset is imputed in each iteration using the other variables in the dataset. These iterations are run until convergence has been met. In general, MICE is a better imputation method than naive approaches (filling missing values with 0, dropping columns).

XGBoost

XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. You can use XGBoost for regression, classification (binary and multiclass), and ranking problems. To enable XGBoost to perform multiclass classification tasks, set the objective parameter to multi:softmax and specify the number of classes in the num_class parameter.

The confusion matrix

The confusion matrix is a result of the model evaluation that consists of the distribution of model predictions over a testing dataset. It summarizes the evaluation with four important statistics relative to the total number of predictions: the percentage of true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP). These statistics are often presented in the form of a table. Accuracy measures the fraction of correct predictions. The range is 0 to 1. A larger value indicates better predictive accuracy. It is computed by dividing the sum of True Positives and True Negatives by the total number of predictions. To solve for Accuracy, first identify the four statistics of the model's confusion matrix:
TP = 8, TN = 81, FP = 8, FN = 3
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy = (8 + 81) / 100 = 0.89 = 89%
Hence, the accuracy of this model is 89%. For this type of business problem, minimizing the False Negatives should be the top priority because they incorrectly predict that a churning customer will stay. In simple terms, this is the costliest among the four as you'd lose $15 for every erroneous prediction. The False Positives represent happy customers that the model mistakenly predicted to churn. This means that you're wasting a retention incentive of $8 on customers that don't have the tendency to cancel their subscription. Let's calculate the cost for the False Positives and False Negatives:
Cost for FN = 3 * $15 = $45
Cost for FP = 8 * $8 = $64
Since the cost incurred by the FNs ($45) is less than the cost incurred by the FPs ($64), the overall spending will be lower.
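The arithmetic in this card can be checked with a few lines of Python. The counts and per-error costs are the ones given in the card; the variable names are only illustrative.

```python
# Confusion matrix counts from the card.
tp, tn, fp, fn = 8, 81, 8, 3

# Accuracy = correct predictions / all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Per-error costs from the card: $15 per false negative, $8 per false positive.
cost_fn = fn * 15
cost_fp = fp * 8

print(accuracy, cost_fn, cost_fp)
```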

loss function is oscillating

The learning rate is too high. The learning rate is a constant value used in the Stochastic Gradient Descent (SGD) algorithm. Learning rate affects the speed at which the algorithm reaches (converges to) the optimal weights. The SGD algorithm makes updates to the weights of the linear model for every data example it sees. The size of these updates is controlled by the learning rate. Too large a learning rate might prevent the weights from approaching the optimal solution. Too small a value results in the algorithm requiring many passes to approach the optimal weights. In Amazon ML, the learning rate is auto-selected based on your data. A loss function evaluates how effective an algorithm is at modeling the data. For example, if a model consistently predicts values that are very different from the actual values, it returns a large loss. Depending on the training algorithm, more than one loss function might be used.
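The effect of the learning rate can be illustrated with plain gradient descent on a toy loss, loss(w) = w**2, whose optimal weight is 0. This is a simplified stand-in for SGD on a real model, and the three learning rates are hypothetical.

```python
def descend(learning_rate, steps=5, w=1.0):
    """Run gradient descent on loss(w) = w**2 (gradient 2*w) and return the weight path."""
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * 2 * w
        path.append(w)
    return path

too_high = descend(1.1)    # overshoots: the weight flips sign and grows each step
reasonable = descend(0.1)  # moves steadily toward the optimum at 0
too_low = descend(0.001)   # barely moves, so many passes would be needed

print(too_high)
print(reasonable)
print(too_low)
```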

One-hot encoding

The process by which categorical variables are converted into binary form (0 or 1) for machine reading. It is one of the most common methods for handling categorical features in text data. One-hot encoding increases your data's dimensionality instead of reducing it.

Tokenization

In data security, tokenization is the process of replacing sensitive data with unique identification symbols that retain all the essential information about the data without compromising its security. In Natural Language Processing (NLP), tokenization means splitting a string into a list of words that have a semantic meaning.

Semantic segmentation

The semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an increasing number of computer vision applications such as self-driving vehicles, medical imaging diagnostics, and robot sensing. Because the semantic segmentation algorithm classifies every pixel in an image, it also provides information about the shapes of the objects contained in the image. The segmentation output is represented as a grayscale image called a segmentation mask. A segmentation mask is a grayscale image with the same shape as the input image.

t-distributed stochastic neighbor embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. PCA and t-SNE are both valid dimensionality reduction techniques that you can use.

Amazon VPC endpoint

Use an Amazon VPC endpoint to establish a private connection between the VPC and Amazon S3. In the bucket policy, restrict access to the source VPC endpoint or the VPC ID. A VPC endpoint enables private connections between your VPC and supported AWS services and VPC endpoint services powered by AWS PrivateLink. AWS PrivateLink is a technology that enables you to privately access services by using private IP addresses. Traffic between your VPC and the other service does not leave the Amazon network. A VPC endpoint does not require an Internet gateway, virtual private gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC do not require public IP addresses to communicate with resources in the service.
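
One way such a bucket policy might look is sketched below as a Python dictionary. The bucket name and endpoint ID are placeholders, and the helper function is hypothetical. The aws:SourceVpce condition key matches a VPC endpoint ID; aws:SourceVpc can be used instead to match a VPC ID. Note that a Deny statement like this also blocks access from outside the endpoint, including the console, so treat it as a starting point rather than a finished policy.

```python
import json


def vpc_endpoint_only_policy(bucket_name, vpc_endpoint_id):
    """Build a bucket policy that denies S3 access unless it arrives via the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}
                },
            }
        ],
    }


# Placeholder values for illustration only.
policy = vpc_endpoint_only_policy("example-training-bucket", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```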

Amazon Kinesis Data Analytics for SQL Applications

With Amazon Kinesis Data Analytics for SQL Applications, you can process and analyze streaming data using standard SQL. The service enables you to quickly author and run powerful SQL code against streaming sources to perform time-series analytics, feed real-time dashboards, and create real-time metrics. The Amazon Kinesis Data Analytics RANDOM_CUT_FOREST function detects anomalies in your data stream. A record is an anomaly if it is distant from other records. The algorithm starts developing the machine learning model using current records in the stream when you start the application. The algorithm does not use older records in the stream for machine learning, nor does it use statistics from previous executions of the application. Kinesis Data Streams can't be used to transform data on the fly and store the output data to Amazon S3.

Amazon CloudWatch API

You can use Amazon CloudWatch API operations to send the training metrics to CloudWatch, and create a dashboard of those metrics. Lastly, use Amazon Simple Notification Service (Amazon SNS) to send a notification when the model is overfitting. You can optionally add a Lambda function to the architecture if you wish to run some mitigation plan (e.g., stop the training when overfitting occurs). In the scenario, Amazon SNS would suffice as we only need to send notifications.

Underfitting Model

Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y). In the scenario, the trained model was described as a linear function trying to fit a parabolic function. This is a characteristic of an underfitting model. It means that the model is too simple to recognize the variations in the target function.
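
The linear-versus-parabolic case can be reproduced with a small least-squares fit. The data points below are an assumption made for the example, and fit_line is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]   # parabolic target

slope, intercept = fit_line(xs, ys)
training_mse = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print("slope:", slope, "intercept:", intercept)
print("mean squared error on the training data:", training_mse)
```

The best possible line is flat, so the error stays large even on the data the model was trained on, which is the signature of underfitting.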

Collaborative filtering

a process that automatically groups people with similar buying intentions, preferences, and behaviors and predicts future purchases. Collaborative filtering is based on (user, item, rating) tuples. So, unlike content-based filtering, it leverages other users' experiences. The main concept behind collaborative filtering is that users with similar tastes (based on observed user-item interactions) are more likely to have similar interactions with items they haven't seen before. Compared to content-based filtering, collaborative filtering provides better results for diversity (how dissimilar recommended items are), serendipity (a measure of how surprising the successful or relevant recommendations are), and novelty (how unknown recommended items are to a user). However, collaborative filtering is more computationally expensive and more complex and costly to implement and manage, though some algorithms used for collaborative filtering, such as factorization machines, are more lightweight than others. Collaborative filtering also has a cold start problem, since it has difficulty recommending new items without a large amount of interaction data to train a model.

K-means

an algorithm in which "k" indicates the number of clusters and "means" represents the clusters' centroids. K-means is an unsupervised algorithm that finds discrete groupings within a dataset, where the number of groupings is specified by the k parameter. This algorithm is not suitable for the problem because we're using a labeled dataset to train a model.
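
A minimal one-dimensional sketch of the K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its cluster). The points and the starting centroids are assumptions chosen for the example, and k_means_1d is a hypothetical helper, not the SageMaker implementation.

```python
def k_means_1d(points, centroids, iterations=10):
    """Run the assign/update loop on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster; keep it in place if the cluster is empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])   # k = 2
print(centroids)
print(clusters)
```

No labels are involved at any step, which is what makes the algorithm unsupervised.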

Root Mean Square Error (RMSE)

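an error measure used to evaluate regression models: the square root of the average squared difference between the predicted values and the actual values. A smaller RMSE indicates better predictive accuracy.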
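
A minimal sketch of RMSE as used to evaluate a regression model, computed as the square root of the mean squared difference between actual and predicted values. The sample numbers are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(error)
```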

Scatterplot and Box plot

both provide visualization for predicted and true values but will not give any insights into the model's performance. Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose. Scatter plots show how much one variable is affected by another. The relationship between two variables is called their correlation. Scatter plots usually consist of a large body of data. The closer the data points come when plotted to make a straight line, the higher the correlation between the two variables, or the stronger the relationship.

Amazon FSx For Lustre

is a fully managed file system that is optimized for compute-intensive workloads, such as high performance computing, machine learning, and media data processing workflows. A distributed file system such as Amazon FSx for Lustre or EFS can speed up machine learning training by eliminating the need to first download the training data to the training instance.

Amazon Kinesis Data Firehose

is the easiest way to reliably load streaming data into data stores and analytics tools. Kinesis Data Firehose can invoke your Lambda function to transform incoming source data and deliver the transformed data to destinations. You can enable Kinesis Data Firehose data transformation when you create your delivery stream. When you enable data transformation, Kinesis Data Firehose buffers incoming data up to 3 MB by default. (To adjust the buffering size, use the ProcessingConfiguration API with the ProcessorParameter called BufferSizeInMBs.) Kinesis Data Firehose then invokes the specified Lambda function with each buffered batch using the AWS Lambda synchronous invocation model. The transformed data is sent from Lambda back to Kinesis Data Firehose, which then sends it to the destination when the specified destination buffering size or buffering interval is reached, whichever happens first.
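
A transformation Lambda function for Kinesis Data Firehose could be sketched as follows. Firehose passes base64-encoded records and expects each one back with the same recordId, a result status, and base64-encoded data. The JSON payload and its "message" field are assumptions made for the example.

```python
import base64
import json


def lambda_handler(event, context):
    """Upper-case a hypothetical "message" field in each record and return it to Firehose."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["message"] = payload.get("message", "").upper()
        transformed = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],   # must match the incoming record
            "result": "Ok",                   # other statuses: "Dropped", "ProcessingFailed"
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}


# Local usage example with a hand-built event in the shape Firehose sends.
sample_event = {
    "records": [
        {
            "recordId": "1",
            "data": base64.b64encode(json.dumps({"message": "hello"}).encode("utf-8")).decode("utf-8"),
        }
    ]
}
result = lambda_handler(sample_event, None)
print(base64.b64decode(result["records"][0]["data"]).decode("utf-8"))
```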

