AWS ML Specialty
What enables you to build the workflows required for human review of machine learning predictions?
Amazon Augmented AI
What is SageMaker Debugger?
Amazon SageMaker Debugger comes with built-in analytics that automatically analyze data emitted during training, such as inputs, outputs, and transformations known as tensors.
Accuracy
(TP+TN)/(TP+TN+FP+FN)
What's the capacity of 1 shard in Kinesis data stream?
1 MB/sec data input, 2 MB/sec data output, and up to 1,000 PUT records per second
F1 Score
2 * precision * recall / (precision + recall)
What is a heuristic technique?
A heuristic is an educated guess or intuition that usually does not ensure an optimal outcome.
In AWS SageMaker automatic hyperparameter tuning, which of the following two methods are used?
Bayesian Optimization and Random Search
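A minimal sketch of launching a tuning job with the SageMaker Python SDK; `estimator`, `train_input`, and `val_input` are assumed to be an already configured Estimator and S3 training inputs, and the metric name and hyperparameter range are illustrative:

```python
# Hedged sketch: SageMaker automatic model tuning with the Bayesian strategy.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,                    # an already configured Estimator
    objective_metric_name="validation:auc", # metric the tuner optimizes
    hyperparameter_ranges={"eta": ContinuousParameter(0.01, 0.3)},
    strategy="Bayesian",                    # "Random" selects random search
    max_jobs=20,                            # total training jobs to run
    max_parallel_jobs=2,                    # jobs run concurrently
)
tuner.fit({"train": train_input, "validation": val_input})
```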
What to use in order to test updates to a model in production?
Because we must demonstrate that the updates perform as well as the existing model before we can use it in production, we would be seeking an offline validation method. Both k-fold and backtesting with historic data are offline validation methods and will allow us to evaluate the model performance without having to use live production traffic.
Which algorithms are good for a multiclass classification problem?
Both XGBoost and Linear Learner are good choices for multiclass classification problems. When solving a multiclass problem with XGBoost, we set the objective hyperparameter to multi:softmax; when using the Linear Learner algorithm, we set the predictor_type hyperparameter to multiclass_classifier.
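As a hedged illustration, those hyperparameters might be set like this with the SageMaker Python SDK, assuming `xgb_estimator` (and the commented `ll_estimator`) are already constructed Estimator objects:

```python
# Multiclass setup for the built-in XGBoost algorithm (3 classes assumed).
xgb_estimator.set_hyperparameters(
    objective="multi:softmax",  # emit the predicted class label directly
    num_class=3,                # required when using multi:softmax
    num_round=100,
)
# The Linear Learner equivalent:
# ll_estimator.set_hyperparameters(predictor_type="multiclass_classifier",
#                                  num_classes=3)
```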
What is the best model for giving movie recommendations as well as identifying customer shopping patterns, trends, and preferences?
Collaborative filtering algorithm
what can you do to speed up a slow linear learner training process without sacrificing accuracy?
Convert the training data from CSV to recordIO-protobuf format as this allows for the data to be streamed or piped from S3, rather than fully downloaded to the training instance
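A minimal sketch of the conversion using the SageMaker SDK's helper; the arrays are illustrative stand-ins for real training data:

```python
# Serialize a NumPy feature matrix and labels to recordIO-protobuf,
# the format Linear Learner can stream from S3 in Pipe mode.
import io
import numpy as np
import sagemaker.amazon.common as smac

X = np.random.rand(1000, 10).astype("float32")       # features
y = np.random.randint(0, 2, 1000).astype("float32")  # binary labels

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)  # write protobuf records
buf.seek(0)
# `buf` can now be uploaded to S3, e.g. with boto3's upload_fileobj.
```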
What does SageMaker do when we issue the CreateEndpoint API call?
The CreateEndpoint API call launches an inference container. When using the built-in algorithms, SageMaker automatically references the current stable version of the container.
What's data augmentation?
Data augmentation is a technique to artificially create new training data from existing training data. Image data augmentation is perhaps the most well-known type of data augmentation and involves creating transformed versions of images in the training dataset that belong to the same class as the original image.
Which built-in algorithm in SageMaker is the MOST suitable for Click predictions and item recommendations?
Factorization Machines
What is a good target metric to use in general when comparing different binary classification models?
For binary classification problems, the AUC or Area Under the Curve is an industry-standard metric to evaluate the quality of a classification machine learning model. AUC measures the ability of the model to predict a higher score for positive examples, those that are "correct," than for negative examples, those that are "incorrect." The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate an ML model that is highly accurate.
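A quick sketch of computing AUC with scikit-learn; the labels and scores are illustrative:

```python
# AUC rewards models that score positive examples above negative ones.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8]  # model scores for the positive class
print(roc_auc_score(y_true, y_score))  # 0.75
```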
What is a good target metric to use generally when comparing different regression models?
For regression tasks, the industry standard is the Root Mean Square Error (RMSE) metric. It is a distance measure between the predicted numeric target and the actual numeric answer (ground truth). The smaller the value of the RMSE, the better the predictive accuracy of the model. A model with perfectly correct predictions would have an RMSE of 0.
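A minimal sketch of RMSE in plain NumPy, with illustrative values:

```python
# RMSE is the square root of the mean squared residual.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # ground truth
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # predictions
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)  # ~0.94
```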
What visualization can be used to show features and the correlation they have with all other features?
Heatmaps are a great way to show correlation and compare values to other values
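A small sketch using pandas and seaborn on made-up data:

```python
# Compute pairwise feature correlations and render them as a heatmap.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 2]})
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # annotate coefficients
plt.show()
```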
Which metric should the ML engineer track to implement auto-scaling for the SageMaker endpoint and ensure the service can keep up with demand?
InvocationsPerInstance
Which algorithms can be used to extract keywords from a collection of texts such as news stories?
Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM) algorithms can both perform topic extraction from bodies of text, but each uses a slightly different method
examples of built-in algorithms that can solve regression problems:
Linear Learner, XGBoost
What transformation is preferable for fixing positively skewed data?
Logarithmic transformation
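For example, NumPy's log1p (log(1 + x)) compresses the long right tail while handling zeros safely; the values below are illustrative:

```python
# Log transform to reduce positive (right) skew.
import numpy as np

skewed = np.array([1, 2, 2, 3, 10, 100, 1000])
print(np.log1p(skewed))  # tail values are pulled in toward the bulk
```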
What's model pruning?
Model pruning aims to remove weights that don't contribute much to the training process.
How do you configure a container so that it can be run as an executable by Amazon SageMaker?
Modify the Dockerfile by adding the training script as an ENTRYPOINT
If LDA is used for topic modeling, what's used for categorical placement or sentiment analysis?
Multinomial logistic regression is used to predict categorical placement in, or the probability of membership in, a category (labeling)
when a feature is missing data, use
Multiple Imputation to fill in the missing data with the least bias
when there is missing data in a feature but the missing values can be inferred from other features in the dataset, what can you do to replace the values with the least amount of bias?
Multiple imputation makes it possible for the researcher to obtain approximately unbiased estimates of all the parameters despite the random error. This result cannot be achieved with deterministic (single) imputation.
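One hedged stand-in is scikit-learn's IterativeImputer, a MICE-style imputer that models each feature with missing values as a function of the other features:

```python
# Estimate a missing value from the other features rather than a constant.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, np.nan], [5.0, 10.0]])
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))  # the NaN is inferred from column 1 (~8)
```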
One hot encoding?
One-hot encoding is a process by which categorical variables are converted into numerical values that can be provided to ML algorithms to improve prediction. These numerical values are binary digits composed of 0s and 1s
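A one-liner sketch with pandas, using a made-up column:

```python
# Each category becomes its own 0/1 indicator column.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))
```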
What is principal component analysis PCA?
PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible.
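A short sketch with scikit-learn, reducing the 4-feature iris dataset to 2 components:

```python
# Project the data onto the directions of greatest variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples x 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance retained per component
```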
For anomaly detection use,
RCF
What do you call a problem where there is no training data?
A Reinforcement Learning problem, since the agent learns by interacting with an environment rather than from a labeled training dataset
What is the residual?
Residual = Actual value - Predicted Value
What is Sagemaker ground truth?
SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning.
What is SageMaker Neo?
SageMaker Neo provides a way to compile trained models (such as XGBoost models) so they are optimized for specific target hardware, such as the ARM processor in a Raspberry Pi
What is the best algorithm for language translation?
Seq2Seq as it involves converting a sequence of tokens (or words in this case) to another set of tokens
Specificity
Specificity = TN/(TN + FP)
You are preparing plain text corpus data to use in an NLP process. Which of the following are important steps to pre-process the text in NLP-based projects?
Stemming, Stop word removal and Word standardization (object standardization)
to use your custom python training script on Amazon SageMaker,
Store the training script inside the /opt/ml/code directory and define it as the script entry point in the SAGEMAKER_PROGRAM environment variable.
Recall
TP/(TP+FN)
Precision
TP/(TP+FP)
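The confusion-matrix formulas above (accuracy, F1, specificity, recall, precision) computed together from illustrative raw counts:

```python
# Derive the standard classification metrics from TP/TN/FP/FN counts.
TP, TN, FP, FN = 80, 90, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)              # a.k.a. sensitivity
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, specificity, f1)
```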
What transformation is preferable for fixing negatively skewed data?
Third-order polynomial transformation
What's tf-idf?
Term Frequency-Inverse Document Frequency (TF-IDF) is an algorithm used to convert text data into a numerical representation that can be passed into a machine learning model. TF-IDF is a great way of weighting words, as it penalizes generic words that commonly appear across all documents.
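A minimal sketch with scikit-learn on three toy documents:

```python
# Words shared by every document ("the") receive lower weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran fast"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)           # sparse document-term matrix
print(vec.get_feature_names_out())
print(X.toarray().round(2))
```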
What is Macro F1 Score?
The Macro F1 Score is an unweighted average of the F1 score across all classes and is typically used to evaluate the accuracy of multi-class models. A number closer to 1 indicates higher accuracy
Which algorithm is a good fit for anomaly detection in a dataset?
The Random Cut Forest algorithm is an unsupervised learning algorithm for detecting anomalous data points. The hyperparameter, num_trees, sets the number of trees used in the RCF model.
What is the softmax function?
The Softmax activation function is a mathematical equation that converts a vector of real numbers into a vector of a probability distribution whose sum is equal to 1.
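A minimal NumPy sketch, with the usual max-subtraction for numerical stability:

```python
# Softmax: exponentiate, then normalize so the outputs sum to 1.
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtracting the max avoids overflow
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```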
What is a Support Vector Machine (SVM)?
A Support Vector Machine (SVM) is a supervised algorithm mainly used for classification tasks. It uses decision boundaries to separate groups of data. The SVM with a Radial Basis Function (RBF) kernel is a variation of the linear SVM used to separate data that is not linearly separable.
Hyperbolic Tangent (tanh) function
This activation function is similar to the Sigmoid function, but it is centered around 0 instead of Sigmoid's 0.5. Most of the time, the tanh function is preferable because it allows the model to converge to a minimum faster, since its derivatives are larger/steeper than the Sigmoid's.
What is a False Positive also known as?
Type I Error (a False Negative is known as a Type II Error)
What is normalization?
Rescaling feature values to a common range, typically from 0 to 1
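A one-line sketch with scikit-learn's MinMaxScaler on an illustrative column:

```python
# Min-max normalization rescales each feature to [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])
print(MinMaxScaler().fit_transform(X).ravel())  # [0.0, 0.33, 0.67, 1.0]
```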
What can be an easier and faster way for a team to perform data preprocessing tasks, such as normalization and filling in missing values?
Use AWS Glue DataBrew to visualize and clean the data
In Amazon Polly how can you modify the pronunciation of particular words, such as company names, acronyms, foreign words, and neologisms?
Use Amazon Polly's custom pronunciation lexicons
How can you automate installing libraries in SageMaker notebook instances especially after you restart them?
Use a lifecycle configuration script feature to bootstrap the package installation
What can we do when the error rate is high in both the training and evaluation sessions?
When both training and testing errors are high, it indicates that our model is underfitting the data. We can try to add more features to the dataset, gather more data for training, and/or run the training session longer. We might also need to identify a better algorithm.
In RCF (Random Cut Forest), how do you determine how accurate the model is, along with other metrics like precision, recall, and F1-score, on labeled data?
When using RCF the optional test channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and test data content types can be either application/x-recordio-protobuf or text/csv and AWS recommends using ml.m4, ml.c4, and ml.c5 instance families for training.
For Binary classification use,
XGBoost
what service can be used, with the least amount of setup, to label a dataset by classifying text data into different categories based on a summary of the corpus?
You can use SageMaker Ground Truth to create ground truth datasets by creating labeling jobs. When you create a text classification job, workers group text into the categories that you define. You can define multiple categories but the worker can apply only one category to the text. Use the instructions to guide your workers to make the correct choice. Always define a generic class in addition to your specific classes. Giving your workers a generic option helps to minimize inaccurately classified text.
What technique can you use when you do not have enough samples of certain labels such as fraudulent transactions?
You can use techniques like SMOTE (Synthetic Minority Oversampling Technique) to create more samples of the fraudulent transactions. You can also request more data
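A hedged sketch with the imbalanced-learn library on a synthetic, heavily imbalanced dataset:

```python
# SMOTE interpolates new minority samples between existing minority neighbors.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
print(Counter(y))                              # ~95:5 class imbalance
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                          # classes now balanced
```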
The Factorization Machines algorithm can be run in
either binary classification mode or regression mode. In regression mode, the testing dataset is scored using Root Mean Square Error (RMSE). In binary classification mode, the test dataset is scored using Binary Cross Entropy (Log Loss), Accuracy (at threshold=0.5), and F1 Score (at threshold=0.5). For training, the Factorization Machines algorithm currently supports only the recordIO-protobuf format with Float32 tensors.
to reduce query execution time and cost
partition the data, only use needed columns, and convert the files to Apache Parquet or Apache ORC
what can you provide so that words that are semantically similar correspond to vectors that are close together?
provide a pre-trained word embedding to capture the semantic relationships between words.
to correct imbalances in data
resample the dataset
The BlazingText algorithm expects a
single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.
What is the Sigmoid Function?
It squashes inputs toward 0 (as a number gets more negative) or 1 (as a number gets more positive). This activation function is desirable for binary classification problems.
What is t-SNE?
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data
what is a good forecasting algorithm when working with multiple sets of historic data?
the SageMaker DeepAR Forecasting algorithm
when the accuracy is high in training but low in testing,
the model is either overfitting or the data was not sufficiently randomized. To resolve this issue, you can randomize the training and testing data, and you can reduce the number of features to address the overfitting.
when residual value is negative, this means
the predicted value is higher than the actual value, which means the model is overestimating
Target encoding
This type of encoding is achieved by replacing a categorical variable with a single new numerical variable, replacing each category of the categorical variable with its corresponding probability (mean) of the target.
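A small pandas sketch with a made-up binary target:

```python
# Replace each category with the mean of the target for that category.
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B", "B", "B"],
                   "churned": [1, 0, 1, 1, 0]})
df["city_encoded"] = df["city"].map(df.groupby("city")["churned"].mean())
print(df)  # A -> 0.5, B -> 0.67
```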
when the error rate during training is high but low during testing,
this usually indicates either a systemic or programmatic issue in the algorithm, or an issue in the data
Label Encoding
This type of encoding only converts categorical data into integer labels (e.g. 0,1,2,3,4), not into a vector of binary values (e.g. [1,0,0], [0,1,0]).
What type of learning is clustering?
unsupervised learning problem
to get inferences for the entire dataset
use a batch transform job with the trained model
to classify data into similar groups,
use clustering, an unsupervised method that uses algorithms such as K-Means
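A minimal scikit-learn sketch on six unlabeled points:

```python
# K-Means partitions unlabeled points into k clusters by distance to centroids.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # learned centroids
```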
when there are only two complete possible outcomes for an inference
use discrete classification rather than continuous classification, which is used when the outcome can be partial (fall anywhere within a range)
How to avoid overfitting in neural networks?
use dropout. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently. The term dropout refers to dropping out units (hidden and visible) in a neural network during training.
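A hedged Keras sketch; the layer sizes and input shape are illustrative:

```python
# Dropout randomly zeroes 50% of the previous layer's units during training.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # active only during training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```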
What's a Viseme?
A viseme Speech Mark (in Amazon Polly) is used to synchronize speech with facial animation (lip-syncing) or to highlight written words as they're spoken.