CMSC 471
How many neurons are needed in the output layer if the ML task is binary classification (two classes/labels)?
1
How many neurons are needed in the output layer if the ML task is regression (predicting one value)?
1
For a Multi-class classifier with 4 labels/classes, what is the dimensionality of the confusion matrix?
4 x 4
Which of the following loss functions can be used for an MLP which is used for Regression assuming there are some outliers in the training data? Select all that apply.
MAE, Huber
Normalization
Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min: x' = (x - min) / (max - min).
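As a rough illustration of the formula above, here is a minimal pure-Python sketch. The function name `min_max_scale` and the handling of a constant column are assumptions made for this example, not part of the course material.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant column has zero range, so map it to 0.0.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


scaled = min_max_scale([10, 20, 30, 50])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, which is why a single extreme outlier can squash all the other values into a narrow band.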
It is common to use 80% of the data for training and hold out 20% for testing
True
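A minimal sketch of an 80/20 hold-out split, assuming the samples fit in a list and a fixed seed is acceptable. The helper name `train_test_split` is chosen for this example only and is not a reference to any particular library function.

```python
import random


def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle a copy of the samples, then hold out test_ratio of them."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = train_test_split(range(10))
print(len(train_set), len(test_set))  # 8 2
```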
Regression is predicting a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors.
True
Which of the following ML tasks can be handled by Deep Neural Networks? Select all that apply.
binary classification, multi-class classification, multi-label classification, regression
Which of the following loss functions is NOT used for regression?
binary_crossentropy
Which of the following ML methods is NOT supervised learning?
Clustering
A regression model can predict categorical values.
False
What activation function should be used for the neuron(s) of the output layer if the ML task is regression with no restriction on the predicted value?
None
Which of the following statements are true? Select all that apply.
One way to reduce the probability of overfitting is to gather more training data. Fine-tuning the model hyperparameters may improve the results of the ML model. The model is evaluated on the test set.
What is RMSE used for?
Regression error
Which statement is true about Regularization? Select all that apply.
Regularization helps with overfitting. Regularization reduces variance.
Which of the following ML applications and examples is NOT classification?
Temperature forecast
What are the other names for recall? Select all that apply.
Sensitivity, TPR
Clustering is an example method of unsupervised learning.
True
Standardization
Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance.
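A small sketch of standardization in pure Python. It divides by the standard deviation (the square root of the variance), which is what gives the result unit variance. The function name and the use of the population variance (dividing by n) are assumptions for this example.

```python
import math


def standardize(values):
    """Return z-scores: (x - mean) / std, using the population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        # Assumed convention: a constant column is mapped to all zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Unlike min-max scaling, the output is not bounded to a fixed range, but it is much less affected by outliers.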
A confusion matrix with high scores on the main diagonal indicates a good model performance.
True
Multi-class Classification
Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes. Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling multiple classes directly.
Data Cleaning
Options for handling missing values in an attribute: 1. Get rid of the corresponding districts (rows). 2. Get rid of the whole attribute. 3. Set the missing values to some value (zero, the mean, the median, etc.).
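The three options can be sketched in plain Python as follows. The tiny `rows` table, its column names, and the choice of the median for option 3 are all hypothetical and only serve the illustration; `None` stands in for a missing value.

```python
from statistics import median

# Hypothetical data: one row per district, None marks a missing value.
rows = [
    {"rooms": 5.0, "bedrooms": 2.0},
    {"rooms": 7.0, "bedrooms": None},
    {"rooms": 4.0, "bedrooms": 1.0},
    {"rooms": 6.0, "bedrooms": 3.0},
]

# Option 1: drop the rows (districts) that have a missing value.
dropped_rows = [r for r in rows if r["bedrooms"] is not None]

# Option 2: drop the whole attribute.
dropped_attribute = [{k: v for k, v in r.items() if k != "bedrooms"} for r in rows]

# Option 3: fill the missing values with some value, here the median.
fill = median(r["bedrooms"] for r in rows if r["bedrooms"] is not None)
imputed = [
    {**r, "bedrooms": fill if r["bedrooms"] is None else r["bedrooms"]}
    for r in rows
]

print(len(dropped_rows), fill)  # 3 2.0
```

If option 3 is used, the fill value should be computed on the training set and reused to fill missing values in the test set and in new data.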
How many neurons are needed in the output layer if the ML task is 3-class classification (three classes/labels)?
3
Which statement about precision vs recall is true?
The lower number of False Negatives the higher recall.
What is the major role of activation functions like tanh or ReLU in neural networks?
They add non-linearity to the model.
Which of the following statements about cross validation is true? Select all that apply.
Cross validation splits the data into k different folds, runs for k iterations, and in each iteration reserves 1 fold for testing. Cross validation splits the data into k different folds, runs for k iterations, and in each iteration reserves k-1 folds for training. Cross validation is a performance measure for ML models.
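To make the fold bookkeeping concrete, here is a minimal sketch that only produces index splits. The generator name `k_fold_indices` is an assumption for this example, and the indices are deliberately not shuffled to keep the output easy to read; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]  # 1 fold reserved for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]  # k-1 folds
        yield train, test


splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(len(train), test)
```

Each sample lands in the test fold exactly once, so averaging the k scores uses every instance for evaluation without ever testing on data the model was trained on in that iteration.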
About handwritten digit classification on MNIST dataset (0-9) , which statement is true? Select all that apply.
The features of each data sample are pixel intensities and there are 10 different labels. The features are numbers from 0-255 and the labels are from 0-9. It is often called "Hello World" of Machine Learning.
Unsupervised Learning
The training data is unlabeled; the system tries to learn without a teacher. Clustering: 1. k-Means 2. Hierarchical Cluster Analysis (HCA) 3. Expectation Maximization. Visualization and dimensionality reduction: 1. Principal Component Analysis (PCA) 2. Kernel PCA 3. Locally-Linear Embedding (LLE) 4. t-distributed Stochastic Neighbor Embedding (t-SNE). Association rule learning: 1. Apriori 2. Eclat.
When a Neural Network contains a deep stack of hidden layers, it is called a deep neural network (DNN).
True
Which statement accurately describes the difference between supervised learning and unsupervised learning?
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels, whereas in unsupervised learning the data is not labeled.
Which of the following ML methods is NOT unsupervised?
Logistic regression
Which statement is false?
Normalization is random shuffling and always hurts the results.
Overfitting
The model performs well on training data, but it does not generalize well. This happens when the training data is noisy, or the sample is too small. The possible solutions are: 1. To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model. 2. To gather more training data. 3. To reduce the noise in the training data (e.g., fix data errors and remove outliers).
Underfitting
The opposite of overfitting; it occurs when your model is too simple to learn the underlying structure of the data. The possible solutions are: 1. Selecting a more powerful model, with more parameters. 2. Feeding better features to the learning algorithm (feature engineering). 3. Reducing the constraints on the model (e.g., reducing the regularization hyperparameter).
What statement is true about softmax function? Select all that apply.
The softmax function computes the exponential of every class score, then normalizes them, which makes it a good fit for direct multi-class classification. The outputs of the softmax function are class probabilities that sum to 1. The softmax function can also be used for binary classification.
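A minimal softmax sketch in pure Python. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result. The last lines illustrate the binary case: softmax over the two scores (z, 0) gives the same probability as the sigmoid of z.

```python
import math


def softmax(scores):
    """Exponentiate every class score, then normalize so the outputs sum to 1."""
    m = max(scores)  # shift for numerical stability; the ratios are unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Binary case: two-class softmax on (z, 0) matches sigmoid(z).
print(softmax([1.5, 0.0])[0], sigmoid(1.5))
```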
Batch Learning
The system is incapable of learning incrementally; it must be trained using all available data. Takes a lot of time and computing resources, usually done offline.
Binary classification
The task of classifying the members of a given set of objects into two groups on the basis of whether they have some property or not. Support Vector Machine classifiers or Linear classifiers are strictly binary classifiers.
Supervised Learning
The training data you feed to the algorithm includes the desired solutions. Examples of algorithms: 1. k-Nearest Neighbors 2. Linear Regression 3. Logistic Regression 4. Support Vector Machines (SVM) 5. Decision Trees and Random Forests 6. Neural Networks.
Online Learning
Train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly as it arrives. Great for systems that receive data continuously.
A typical supervised learning task is classification.
True
Comparing AI, Machine Learning, and Deep Learning, one can argue that there is a superset-subset relationship between them such that Deep Learning is a subset of Machine Learning approaches, and Machine Learning is a subset of the broad field of approaches, algorithms and techniques in AI.
True
Connection weights of hidden layers and bias terms (connection weights of bias neurons) are both trainable parameters.
True
Gradient Descent is GUARANTEED to find the global minimum of logistic regression cost function (log loss) if the learning rate is not too large and the training session is long enough because the cost function is convex.
True
Machine Learning is great for problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
True
Preprocessing the data is a critical step in preparing the data for the ML model and may include cleaning the data by dropping NA values.
True
ROC curve plots "true positive rate" TPR on y-axis against "false positive rate" FPR on x-axis, and its "area under curve" AUC is a performance measure of ML models.
True
Regularization is usually done by adding constraints on model parameters.
True
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning.
True
Sometimes scaling large values improves the results of ML models.
True
There is a trade-off between precision and recall such that any attempt to increase precision will decrease recall and vice-versa.
True
matplotlib is a Python module that has a wide variety of plotting features and functions and can be used for data visualization.
True
Which of the following ML applications and examples is a regression problem? Select all that apply.
Weather forecast, Predicting stock market (stock value)
In non-regularized linear regression, the cost function differs from the performance measure.
False
Shuffling data randomly before or after splitting to train/test sets would significantly reduce the model performance.
False
For image classification of CIFAR-10 dataset which has 10 distinct image labels, how many classifiers are required to be trained using One-vs-Rest (OvR) aka One-vs-All (OvA) strategy?
10
For image classification of CalTech-101 dataset with pictures of objects belonging to 101 categories, how many classifiers are required using One-vs-One (OvO) strategy?
5050
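The two counts above follow from simple formulas: One-vs-Rest trains one classifier per class (N in total), while One-vs-One trains one classifier per pair of classes, N × (N - 1) / 2. A short check, with function names chosen only for this example:

```python
def ovr_classifiers(n_classes):
    """One-vs-Rest: one binary classifier per class."""
    return n_classes


def ovo_classifiers(n_classes):
    """One-vs-One: one binary classifier per unordered pair of classes."""
    return n_classes * (n_classes - 1) // 2


print(ovr_classifiers(10))   # 10   (CIFAR-10)
print(ovo_classifiers(101))  # 5050 (CalTech-101)
```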
Testing and Validating
A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimation of this error. This value tells you how well your model will perform on instances it has never seen before.
Performance Measures
Confusion Matrix: A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the 5th row and 3rd column of the confusion matrix.
Precision/Recall/F1 score: It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high. Unfortunately, you can't have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff.
ROC Curve and AUC: The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate. The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus 1 - specificity. Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier produces.
Accuracy: The fraction of instances that the classifier labels correctly.
Cross Validation: K-fold cross-validation (here with K = 10) randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the model (a Decision Tree in this example) 10 times, picking a different fold for evaluation every time and training on the other 9 folds. Cross-validation allows you to get not only an estimate of the performance of your model, but also a measure of how precise this estimate is (i.e., its standard deviation).
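As a worked example of the precision, recall, and F1 definitions, the sketch below computes all three from binary confusion-matrix counts. The counts are made up for illustration, and the function assumes the denominators are nonzero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero-division handling).
    """
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

The F1 value (about 0.727) sits below the arithmetic mean of 0.8 and 0.667, which shows how the harmonic mean is pulled toward the lower of the two scores.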
Which statement is true about Logistic Regression cost function (log loss)? Assume y=1 is positive class and y=0 is negative class. Select all that apply.
Considering cost function (log loss), the cost will be large if the model estimates a probability close to 0 for a positive instance. Considering cost function (log loss), the cost will be close to 0 if the estimated probability is close to 0 for a negative instance. Considering cost function (log loss), the cost will be large if the model estimates a probability close to 1 for a negative instance. Considering cost function (log loss), the cost will be close to 0 if the model estimates a probability close to 1 for a positive instance.
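The four statements can be checked numerically with the per-instance log loss, -[y log(p) + (1 - y) log(1 - p)]. This is a minimal sketch that assumes 0 < p < 1; the function name is chosen for this example.

```python
import math


def log_loss_single(y, p):
    """Log loss for one instance with true label y (0 or 1) and estimated
    probability p of the positive class. Assumes 0 < p < 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


print(log_loss_single(1, 0.01))  # positive instance, p close to 0: large cost
print(log_loss_single(0, 0.01))  # negative instance, p close to 0: cost near 0
print(log_loss_single(0, 0.99))  # negative instance, p close to 1: large cost
print(log_loss_single(1, 0.99))  # positive instance, p close to 1: cost near 0
```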
Data Handling
Convert text labels to numbers when necessary. One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This is not necessarily the case (for example, categories 0 and 4 may be more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category (one-hot encoding).
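A small sketch of creating one binary attribute per category. The category names are hypothetical, and the column order (alphabetical) is an arbitrary choice made for this example.

```python
def one_hot(labels):
    """Return (categories, encoded) with one binary column per category."""
    categories = sorted(set(labels))  # assumed column order: alphabetical
    index = {c: i for i, c in enumerate(categories)}
    encoded = [
        [1 if index[label] == i else 0 for i in range(len(categories))]
        for label in labels
    ]
    return categories, encoded


# Hypothetical text labels.
categories, encoded = one_hot(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])
print(categories)  # ['INLAND', 'ISLAND', 'NEAR BAY']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Every pair of distinct categories is now equally far apart, so the model no longer reads an ordering into the category codes.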
What are the advantages and implications of using callbacks and EarlyStopping? Select all that apply.
EarlyStopping callback will interrupt training when it measures no progress on the validation set for a number of epochs defined by the patience argument. EarlyStopping callback will roll back to the best model because it will keep track of the best weights and restore them for you at the end of training. The number of epochs can be set to a large value since training will stop automatically when there is no more progress. By setting a validation set ratio, EarlyStopping callback would save your model when its performance on the validation set is the best. This way, you do not need to worry about training for too long and overfitting the training set.
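The patience logic can be sketched without any deep learning library. This is not the Keras implementation, only an illustration of the idea: the function `early_stopping_epoch` and the list of validation losses are hypothetical, and a real callback would also store and restore the model weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stopped_at, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs bring no improvement
    over the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a real callback would save weights here
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch             # stop and roll back to best_epoch
    return len(val_losses) - 1, best_epoch       # ran out of epochs without stopping early


stopped_at, best_epoch = early_stopping_epoch(
    [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7], patience=3
)
print(stopped_at, best_epoch)  # 5 2
```

Here the best loss occurs at epoch 2; epochs 3, 4, and 5 bring no improvement, so training stops at epoch 5 and the model from epoch 2 is kept.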
All the hidden layers' connection weights are initialized by 1 before the training starts.
False
Which statement is true about Gradient Descent? Select all that apply.
Gradient descent attempts to find the global minimum. Gradient descent may get stuck in local optima.
Which of the following statements are true about hyperparameters? Select all that apply.
Hyperparameters do not change during each training session. Number of neurons and number of layers in a deep neural network are hyperparameters that need fine-tuning. Parameters of a neural network are connection weights and are adjusted during training. Hyperparameters are set to control and optimize model performance.
Which of the following ML examples and applications is NOT supervised learning?
Image clustering
Which of the following statements best describes the difference between online learning and batch learning?
In batch learning, the system is incapable of learning incrementally; it must be trained using all the available data, whereas in online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches.
Which statement(s) about backpropagation algorithm is true? Select all that apply.
In just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter. It can find out how each connection weight and each bias term should be tweaked in order to reduce the error. In short, Backprop is Gradient Descent using an efficient technique for computing the gradients automatically. Backprop handles one mini-batch at a time.
What is overfitting? Select all that may apply.
Overfitting means the model performs well on the training data, but it does not generalize well. Overfitting means the model performs well on the training data, but it does not perform well on the testing data. Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself.
Which statement about Gradient Descent is true?
Stochastic Gradient Descent uses only one random instance to compute the gradients at every step (iteration), whereas Batch Gradient Descent uses the whole training set.
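The difference shows up in which instances are passed to the gradient computation. The sketch below fits a one-parameter model y = w·x on a tiny noise-free dataset; the data, learning rate, step count, and seed are all assumptions chosen so that both variants converge.

```python
import random


def gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2, no noise
learning_rate = 0.05

# Batch Gradient Descent: every step uses the whole training set.
w_batch = 0.0
for _ in range(200):
    w_batch -= learning_rate * gradient(w_batch, xs, ys)

# Stochastic Gradient Descent: every step uses one randomly picked instance.
rng = random.Random(0)
w_sgd = 0.0
for _ in range(200):
    i = rng.randrange(len(xs))
    w_sgd -= learning_rate * gradient(w_sgd, [xs[i]], [ys[i]])

print(w_batch, w_sgd)  # both approach 2.0
```

On noisy data the stochastic version would keep bouncing around the minimum rather than settling, which is the usual price for its much cheaper steps.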
Which statement is true about regularized linear regression? Select all that apply.
The impact of adding a regularization term is that it constrains the model parameters and prevents them from getting too large.
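One way to see the effect is to write the regularized cost out explicitly. The sketch below assumes an L2 (ridge-style) penalty of the form alpha × sum of squared weights added to the mean squared error, with the bias left unregularized; the exact scaling of the penalty varies between textbooks, and the data is made up.

```python
def ridge_cost(weights, bias, X, y, alpha):
    """Mean squared error plus an L2 penalty on the weights (bias not penalized)."""
    n = len(X)
    mse = sum(
        (sum(w * xi for w, xi in zip(weights, row)) + bias - target) ** 2
        for row, target in zip(X, y)
    ) / n
    penalty = alpha * sum(w ** 2 for w in weights)
    return mse + penalty


# Hypothetical data that the weights [1, 2] fit exactly.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
y = [5.0, 4.0, 9.0]

unregularized = ridge_cost([1.0, 2.0], 0.0, X, y, alpha=0.0)
regularized = ridge_cost([1.0, 2.0], 0.0, X, y, alpha=0.1)
print(unregularized, regularized)
```

Even a perfect fit pays a cost proportional to the size of its weights, so the optimizer is pushed toward smaller weights whenever alpha is greater than zero.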
Which statement is true about Bias Variance Trade-off? Select all that apply.
The lower the variance, the lower likelihood of overfitting.
Hyperparameter
Which model to choose? One option is to train both and compare how well they generalize using the test set. How to choose the value for your hyperparameter: one option is to train 100 different models using 100 different values for this hyperparameter. So you launch this model into production, but unfortunately it does not perform as well as expected and produces 15% errors. What just happened? The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that set. This means that the model is unlikely to perform as well on new data. A common solution to this problem is to have a second holdout set called the validation set. You train multiple models with various hyperparameters using the training set, you select the model and hyperparameters that perform best on the validation set, and when you're happy with your model you run a single final test against the test set to get an estimate of the generalization error. To avoid "wasting" too much training data in validation sets, a common technique is to use cross-validation: the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts.
What activation function should be used for the neuron(s) of the output layer if the ML task is binary classification?
sigmoid
What activation function should be used for the neuron(s) of the output layer if the ML task is multi-class classification?
softmax