Machine Learning Notecards
State and explain the five phases of the supervised learning framework?
1) Development/test split 2) Hyperparameter tuning: how to choose hyperparameters 3) Optimal model training / model selection: evaluating the performance of models at different hyperparameter settings 4) Model evaluation 5) Model deployment
State two hyperparameters that can be tuned for the following models: (1) K-Nearest Neighbors (2) Decision Trees (3) Ridge Regression (4) Lasso Regression (5) Random Forest (6) Gradient Boosted Trees (7) Elastic-net Regression (8) Soft-margin Support Vector Machines (9) Kernel Support Vector Machines
(1) Number of neighbors (k), weights (uniform vs. distance) (2) Max depth, minimum samples per split (3) Regularization parameter (alpha), fit intercept (4) Regularization parameter (alpha), fit intercept (5) Number of estimators, max depth (6) Number of estimators, learning rate (7) Regularization parameter (alpha), fit intercept (8) Regularization parameter (C), loss, dual (9) Regularization parameter (C), gamma, kernel
Can you briefly explain how the K-Means algorithm works?
(1) Choose a number of clusters, K (2) Initialize the K cluster centers at random starting locations (3) Keep iterating the steps below until the cluster centers stop shifting: (a) assign every data point to the closest cluster center (b) shift each cluster center to the average location of all observations assigned to that cluster
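A minimal NumPy sketch of these steps (the kmeans helper below is hypothetical, not from the course; it assumes X is an (n, d) array and that no cluster ends up empty):
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # (2) initialize the k cluster centers at randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (3a) assign every data point to its closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3b) shift each center to the average of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # (3) stop once the centers stop shifting
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels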
State and explain two ways to estimate the principal components in PCA?
(1) Eigendecomposition of the data covariance matrix: standardize the dataset and calculate the covariance matrix; find the eigenvalues and eigenvectors of that matrix; the eigenvector corresponding to the largest eigenvalue is the first principal component (and so on for the remaining components). (2) Singular Value Decomposition: write X = UDV^T; then X^T X = V D^2 V^T, so the diagonal entries of D^2 are the eigenvalues and the columns of V are the eigenvectors.
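A short sketch of both routes, assuming X is an already-centered data matrix (the random X here is just a stand-in):
import numpy as np

X = np.random.default_rng(0).normal(size=(200, 5))
X = X - X.mean(axis=0)                      # center (or standardize) the data first

# (1) eigendecomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
first_pc_eig = eigvecs[:, -1]               # eigenvector of the largest eigenvalue

# (2) SVD of the data matrix: X = U D V^T
U, D, Vt = np.linalg.svd(X, full_matrices=False)
first_pc_svd = Vt[0]                        # rows of V^T are the principal directions

# the two approaches give the same direction (up to sign)
assert np.allclose(np.abs(first_pc_eig), np.abs(first_pc_svd))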
Can you explain how permutation feature importances work for estimating feature importances for a ML model?
(1) For every feature, a new dataset is created by randomly shuffling the values of that feature. (2) If the feature is important, performance will drop significantly; if it is just noise, shuffling it further will have little impact on the model. (3) The feature importance is measured as the difference in model performance between the original dataset and the permuted dataset.
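A hedged sketch of that shuffle-and-score loop for a fitted classifier (scikit-learn's sklearn.inspection.permutation_importance provides a ready-made version; the helper name and accuracy metric here are illustrative):
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_importances(model, X_val, y_val, seed=0):
    rng = np.random.default_rng(seed)
    # baseline performance on the untouched validation data
    baseline = accuracy_score(y_val, model.predict(X_val))
    importances = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle one feature only
        permuted = accuracy_score(y_val, model.predict(X_perm))
        importances.append(baseline - permuted)        # large drop => important feature
    return np.array(importances)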
Describe the scenarios when you would choose the following model selection strategies: (1) Stratified k-fold cross-validation (2) Leave-one-out cross-validation
(1) Imbalanced dataset (2) Small dataset
State one example when using classification accuracy as a performance metric is a bad idea? What would you use instead?
(1) In the case of imbalanced datasets, like credit card fraud detection. (2) Instead, we would want to use precision/recall/f1 score.
Can you state two techniques for performing local interpretability of ML models? How about two techniques for performing global interpretability?
(1) Local - SHAP Values & LIME (2) Global - SHAP & Permutation Feature Importance
State and explain two methods to measure impurity of a node in a decision tree when applying it for a regression task?
(1) Mean squared error = (1/n) * sum_i (y_i - yhat_i)^2 (2) Mean absolute error = (1/n) * sum_i |y_i - yhat_i|
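A small sketch of both measures for the target values y that fall in a node (NumPy arrays assumed; the node's prediction is typically the mean for MSE and the median for MAE):
import numpy as np

def mse_impurity(y):
    return np.mean((y - y.mean()) ** 2)        # squared deviation from the node mean

def mae_impurity(y):
    return np.mean(np.abs(y - np.median(y)))   # absolute deviation from the node median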
(1) State an example of when choosing precision over recall is a better idea and why? (2) How about another example when choosing recall is better than precision and why?
(1) Precision is better than recall when the cost of a false positive (Type I error) is high. For example, email spam filtering, where incorrectly classifying an important message as spam is more costly to the user than letting some spam through. (2) Recall is better than precision when the cost of a false negative (Type II error) is high. For example, cancer screening, where a false negative would lead to someone not pursuing treatment.
State and explain two differences between random forests and bagging of trees?
(1) Random Forest uses a random subset of features in training each decision tree, whereas bagging allows all features in all trees (2) Random Forest uses OOB error for model selection, whereas bagging of trees uses general hyperparameter tuning methods
Can you explain how target encoding works for categorical variables? When would you use it and why?
(1) Take a categorical variable and count the number of times each distinct value occurs (2) For each distinct value, count the number of observations where target = 1 (3) Replace each observation of the categorical variable with (2) / (1) (e.g. x = blue 4 times, x = blue when y = 1 twice, so replace blue with 2/4 = 1/2). This is the "mean" of the target given x = blue. You use this when a categorical variable has a large number of distinct values and you want to avoid one-hot encoding inflating the number of dimensions in the model. In all, it provides the mean value of the target variable for a given categorical feature, P(Y | X = value)
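A minimal pandas sketch of the idea (column names are made up; in practice the encoding should be fit on the dev set only to limit target leakage):
import pandas as pd

df = pd.DataFrame({
    "color": ["blue", "blue", "blue", "blue", "red", "red", "red"],
    "y":     [1, 1, 0, 0, 1, 1, 0],
})

# mean of the target per category = count(y == 1) / count(category)
encoding = df.groupby("color")["y"].mean()       # blue -> 0.5, red -> 0.667
df["color_encoded"] = df["color"].map(encoding)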
(1) How is feature importance determined in random forests? (2) How do you perform model selection in random forests? (3) How do you predict using a random forest?
(1) The decrease in node impurity weighted by the probability of reaching that node, where node probability is calculated as the number of samples that reach the node divided by the total number of samples. (2) For model selection, we use out-of-bag (OOB) error. OOB error is the average error of a data point calculated using predictions from the trees that do not contain that data point in their respective bootstrap sample. (3) The output from each tree is averaged for regression (numeric output) and majority-voted for classification (categorical output)
Can you explain one way to choose the number of principal components while performing PCA?
(1) Use a set threshold of explained variance, such as 80%, and then select the smallest number of components whose cumulative explained variance reaches that threshold.
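A short scikit-learn sketch of that threshold rule (the random X is a stand-in for a standardized feature matrix; PCA(n_components=0.80) achieves the same thing directly):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 10))

pca = PCA().fit(X)                                    # keep all components first
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.80)) + 1    # smallest k reaching 80% explained variance
X_reduced = PCA(n_components=n_components).fit_transform(X)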
(1) How would you determine if your trained decision tree is overfitting? (2) State and explain two ways to prevent this overfitting?
(1) When training accuracy is high but test accuracy is low, in other words the model doesn't generalize well; or if changing the training data changes the model significantly. (2) Reduce the max depth of the tree; increase the minimum samples per split; reduce the number of iterations (early stopping); cost-complexity pruning
GIVEN the Primal and Dual Formulation(s) (1) Would training the soft-margin SVMs using either the primal or the dual formulation yield the same results on a test dataset? Why? (2) Which formulation supports extension to kernel-SVMs? Rewrite the formulation to represent kernel-SVMs?
(1) Yes, since this is a convex optimization problem both formulations reach the same optimum (2) The DUAL supports the kernel extension: replace the inner product (x_i . x_j) with K(x_i, x_j)
State and explain two differences between random forests and gradient boosted trees?
(a) In GB, trees are trained sequentially, whereas in RF trees can be trained in parallel (b) In RF, all trees are trained on bootstrap samples of the data, whereas GB trees (after the first) are trained on the residuals of the previous iteration (c) In RF, trees are trained on random subsets of features, whereas GB does not restrict the features used in training trees (d) RF uses OOB error for model selection
State and explain two strategies for multi-class classification tasks?
1) One vs. Rest (OvR): if there are N classes, solve N binary classification problems, each classifying one class vs. all other classes (e.g. red vs. yellow & blue) 2) One vs. One (OvO): solve a binary classification problem for every pair of classes -> n(n-1)/2 classifiers; when you make a prediction, run all of the pairwise classifiers and predict the class that wins the majority vote
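A brief scikit-learn sketch of both strategies (the iris data and logistic regression base model are just stand-ins):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # N binary models
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # n(n-1)/2 binary models

print(len(ovr.estimators_), len(ovo.estimators_))   # 3 and 3 for the 3-class iris dataset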
List two problems when training Linear Regression models?
1) Too Many Predictors: leads to overfitting 2) Multi-collinearity: predictors are correlated 3) Non-linear relationship between predictors and target
EF: When would you use Gradient Descent vs. Gradient Ascent?
Gradient descent is used to minimize some loss function, whereas gradient ascent is used when you want to reach some sort of maximum, for instance maximizing the margin of separation of the hyperplane in an SVM model.
Can you explain the bias-variance tradeoff for a ML model?
Bias - the inherent error in your predictions; the difference between the actual values and the values predicted by the model. Variance - how much noise/variability there is in your predictions; refers to how much the model's predictions on data it has not been trained on change when the training data changes.
Assume you choose AUROC as a performance metric for your classification task. Now you train two different ML models and obtain probabilities on a test dataset. You observe that one model produces probabilities that are exactly one-tenth of the probabilities that the other model produces. How does the AUROC compare for both the models on the test dataset? Do you expect AUROC values to be the same or would they differ and why?
Both the models will have the same AUROC value since the ranking of the samples did not change
State two differences between Adaboost and Gradient Boosting?
Both train models sequentially, focusing on the incorrect predictions from the previous iteration. AdaBoost: (A) trains a series of decision stumps, which have a depth of 1 (B) samples which are misclassified are up-weighted in the next iteration, but each round still trains on the original samples (C) each classifier is assigned a different weight based on its performance when making the final prediction, so higher-performing stumps get higher weights. Gradient Boosting: (A) trees tend to be grown to greater depths (B) after the first tree, each tree is trained on the residuals from the previous iteration rather than the original samples (C) all classifiers are weighted equally, and their contribution is shrunk by the learning rate to improve accuracy
What is Calibration?
Calibrated means the predicted probabilities reflect the true likelihood of the event; that is, the proportion of positives within each predicted-probability bin matches the probabilities in that bin. For example, among predictions with probabilities between 0.0 and 0.1, roughly 95% should be negatives; among predictions between 0.1 and 0.2, roughly 85% should be negatives; in other words, the proportion of true positives within each interval increases consistently across the range of predicted values.
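A short sketch of checking this with scikit-learn's calibration_curve (the dataset and model are stand-ins):
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_dev, y_dev).predict_proba(X_test)[:, 1]

# observed fraction of positives vs. mean predicted probability in each bin;
# for a well-calibrated model the two track each other (the diagonal of a reliability diagram)
frac_positives, mean_predicted = calibration_curve(y_test, probs, n_bins=10)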
Given a confusion matrix, how would you calculate: (1) Accuracy (2) Precision (3) Recall (4) F1-Score
Classification accuracy = (TN + TP) / (TN + TP + FP + FN). Precision - % of your positive predictions which were correct: Precision = TP / (TP + FP). Recall - % of all actual positives you captured: Recall = TP / (TP + FN). F1-score = 2 * (Precision * Recall) / (Precision + Recall)
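A tiny sketch computing these from the four confusion-matrix cells (the counts are illustrative):
tp, fp, tn, fn = 40, 10, 45, 5   # illustrative counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)   # 0.85, 0.8, 0.889, 0.842 (rounded)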
Describe one way to find the optimal split in a decision tree when training on data that has only one categorical variable with a lot of categories (O(10,000))?
Convert the feature to its target encoding and order the categories by their target-encoded value (mean response rate). Starting with the highest value, evaluate splits in descending order of the encoded value. The optimal split can then be found in O(L), where L is the number of categories.
Assume you are given an India housing dataset that lists the characteristics of a house and its sale price. Also, assume one of the features given is the DMA (Designated Market Area) for the house. A DMA can be thought of as a group of zip codes (on the order of 10s to 100s). Assume we want to train a regression model and would like to use DMA as one of the features. How would you want to encode DMA and why?
Encode DMA using target encoding, as this allows us to incorporate the categorical feature without blowing up the number of features or creating a sparse matrix. The downside is that you are generating a variable which makes use of the target, which could lead to target leakage.
State and explain two methods to measure impurity of a node in a decision tree when applying it for a classification task?
Entropy: a measure of information/disorder in the node, -sum_k p_k log(p_k), where K is the number of classes and p_k is the proportion of class k in the node. Gini: the probability of misclassifying a randomly chosen observation in the node, 1 - sum_k p_k^2. For both, 0 indicates a perfectly pure node.
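A small sketch of both measures, assuming p is an array of class proportions within a node:
import numpy as np

def entropy(p):
    p = p[p > 0]                      # avoid log(0)
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

print(entropy(np.array([0.5, 0.5])), gini(np.array([0.5, 0.5])))   # 1.0, 0.5 (maximally mixed, K = 2)
print(entropy(np.array([1.0, 0.0])), gini(np.array([1.0, 0.0])))   # both zero for a pure node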
Name all the components of a boxplot
Outliers above the upper fence; maximum within the upper fence (Q3 + 1.5*IQR); third quartile (Q3); median; first quartile (Q1); minimum within the lower fence (Q1 - 1.5*IQR); outliers below the lower fence. IQR = interquartile range = Q3 - Q1
Can you state two uninformed search strategies for defining the possible values for hyper-parameters in a ML model?
Random search - you provide a range (or distribution) for each hyperparameter and values are sampled at random. Typically does better than grid search in higher dimensions. Grid search - you manually provide multiple values for each hyperparameter and try all combinations of values across hyperparameters. The number of combinations to test grows exponentially with the number of hyperparameters.
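A brief scikit-learn sketch of both strategies (the SVC estimator and parameter ranges are illustrative; both objects are then fit on the dev set):
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# grid search: enumerate every combination of the listed values (3 x 3 = 9 combinations)
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)

# random search: sample 20 combinations from the given distributions
rand = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)},
    n_iter=20, cv=5, random_state=0,
)
# grid.fit(X_dev, y_dev); rand.fit(X_dev, y_dev)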
In a development-test splitting, explain the difference between stratified splitting and random splitting? When would you prefer a stratified splitting as compared to random splitting?
In stratified splitting, you add a constraint to the random sampling between your training, development, and test sets, ensuring that each set has the same proportion of target to non-target observations. This technique is preferred when the data is highly imbalanced.
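A one-line scikit-learn sketch of the stratified version (X and y are assumed to be your features and target):
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y the same in both splits
X_dev, X_test, y_dev, y_test = train_test_split(X, y, stratify=y, random_state=42)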
During training, if you observe that your model is underfitting, state one way to improve the performance of the model?
Increase model complexity, as you are failing to capture all of the variance in your data.
What are the differences between L1, L2 and elastic-net regularizers that are generally applied in the context of machine learning models? Which of these regularizers do not promote feature selection?
L1 norm (Lasso): adds the sum of the absolute values of the model weights to the loss function. Promotes feature selection by setting some weights exactly to 0. The reduction in penalty does not depend on the magnitude of the weight, so shrinking a weight from 1 to 0 saves as much as shrinking it from 12 to 11. L2 norm (Ridge): adds the sum of the squared weights to the loss function; does not promote feature selection. Has a disproportionately greater effect as weights get larger (the penalty difference between 2 and 3 is far smaller than between 5 and 6). Elastic-net: a convex combination of the L1 and L2 penalties, k*||w||_1 + (1-k)*||w||_2^2; does promote feature selection. All penalization terms reduce variance but introduce bias compared to the least-squares solution.
How do you measure feature importances in linear models and tree-based models?
Linear Model: (1) Absolute value of feature coefficient, assuming x variables are scaled (2) The higher the value, the more important the feature Tree Based Models (Decision Tree & RF): (1) Calculated as the decrease in node impurity weighted by the probability of reaching that node (2) Node probability can be calculated by the number of samples that reach the node, divided by the total number of samples (3) The higher the value, the more important the feature
State and explain two assumptions of Linear Regression?
Linearity: linear relationship between X & y Independence: no relation between samples, samples are iid Homoscedasticity: variance of the residuals is constant across all values, evenly distributed around 0 Normality: residuals follow a normal distribution
When would you consider using a log scale to plot a feature during the exploratory data analysis phase?
Logarithmic scales are useful for representing: (1) features which are ratios (2) features with very high variance or that span several orders of magnitude
Consider the following code:
X_dev, X_test, y_dev, y_test = train_test_split(num_df, target, random_state=42)
scaler = StandardScaler()
X_dev_scaled = scaler.fit_transform(X_dev)
X_test_scaled = scaler.fit_transform(X_test)
lr_scaled = Ridge().fit(X_dev_scaled, y_dev)
lr_not_scaled = Ridge().fit(X_dev, y_dev)
What is wrong with the above code and how would you fix it?
We are applying fit_transform on the test set, when we should use just transform. In other words, we want to apply the same standardization (subtracting the dev-set mean and dividing by the dev-set standard deviation) to both sets; as written, different standardization values are used for the dev and test sets.
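A corrected sketch of the scaling step, reusing the variable names from the question (only the test-set line changes):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_dev_scaled = scaler.fit_transform(X_dev)   # learn mean and std on the dev set only
X_test_scaled = scaler.transform(X_test)     # apply the same dev-set mean and std to the test set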
When we plot the residuals of a linear regression, what assumption are we checking? Which model violates this assumption?
We are checking homoscedasticity: we want the residuals to appear randomly scattered around 0; if there is a pattern, the assumption of constant error variance is violated.
Given two reliability diagrams, which one favors the majority class? Which favors the minority?
The diagonal line indicates perfect calibration. A curve below the diagonal means the predicted probabilities are higher than the observed frequency of positives; a curve above it means they are lower. With the minority (positive) class coded as 1, the model that favors the majority class concentrates its predicted probabilities closer to 0, while the one favoring the minority class pushes them closer to 1.
How do you calculate the brier score?
Brier score = (1/N) * sum_i (p_i - y_i)^2, where p_i is the predicted probability for observation i and y_i is the true label (0 or 1). Lower is better; it measures how close the predicted probabilities are to the actual outcomes in your dataset.
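A short sketch of the calculation (the labels and probabilities are illustrative; scikit-learn's brier_score_loss gives the same value):
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0, 1])
p_pred = np.array([0.1, 0.9, 0.7, 0.3, 0.6])

manual = np.mean((p_pred - y_true) ** 2)           # (1/N) * sum of (p_i - y_i)^2
print(manual, brier_score_loss(y_true, p_pred))    # both 0.072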