Machine Learning Must Know
Problems with a sparse matrix
1. Storing the many zero values takes space even though they carry no useful information. 2. A large matrix increases computation time.
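A small illustration of the storage point (assuming scipy is available; the matrix size is arbitrary): a mostly-zero matrix stored densely wastes space, while a sparse format keeps only the non-zero entries.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0          # only a single non-zero value
sparse = csr_matrix(dense) # CSR stores just the non-zero entries and their indices
print(dense.nbytes)        # ~8 MB for the dense array
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)  # only a few KB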
What is a ROC curve?
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True Positive Rate = TP / (TP + FN) and False Positive Rate = FP / (FP + TN)
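A brief sketch of computing the curve with scikit-learn (the labels and scores are made-up toy values):
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1]            # actual labels (hypothetical)
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FPR/TPR at each threshold
print(fpr, tpr)
print(roc_auc_score(y_true, y_score))  # area under the ROC curve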
Given this equation for polynomial regression, y = B0 + B1x + B2x^2 + B3x^3, what will be the extracted features?
The extracted features are x, x^2 and x^3 (B0, B1, B2, B3 are the coefficients learned for those features).
Explain bagging.
Bagging, or Bootstrap Aggregating, is an ensemble method in which the dataset is first divided into multiple subsets through bootstrap resampling (sampling with replacement). Then, each subset is used to train a model, and the final predictions are made by voting or averaging over the component models. Bagging is performed in parallel.
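A minimal sketch of bagging by hand with numpy and scikit-learn decision trees (the synthetic data and the choice of 10 bootstrap rounds are illustrative):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

models = []
for _ in range(10):                                  # 10 bootstrap subsets
    idx = rng.integers(0, len(X), size=len(X))       # resample with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# final prediction = average of the component models (voting for classification)
y_hat = np.mean([m.predict(X) for m in models], axis=0)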
Neural Network - In a neural network architecture, what should be the dimensions of the input layer X, the output layer y, and the hidden layer?
Dimension of X = number of features. Dimension of y = number of classes. The dimension of the hidden layer (i.e. number of hidden units) should generally be larger than the dimension of X, perhaps 2 times dim X. Usually more is better, but it can be computationally expensive.
Kernel Regression
It is based on weighted local averaging and fits a simple model separately at each query point x0. It is an alternative to KNN and requires little training.
Lasso Regression
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters)
Why is "Naive" Bayes naive?
Naive Bayes (NB) is 'naive' because it makes the assumption that features of a measurement are independent of each other. This is naive because it is (almost) never true
F1 Score
The harmonic mean of precision and recall: 2 x (Precision x Recall) / (Precision + Recall)
Write 2 layer NN flow
See image
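Since the image is not reproduced here, a minimal numpy sketch of the forward flow of a 2-layer network (one hidden layer plus an output layer); the dimensions (3 inputs, 4 hidden units, 1 output) and the sigmoid activations are illustrative assumptions:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X  = np.random.randn(3, 1)                     # input column vector (3 features)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4), np.zeros((1, 1))

Z1 = W1 @ X + b1      # layer 1 pre-activation
A1 = sigmoid(Z1)      # hidden layer activation
Z2 = W2 @ A1 + b2     # layer 2 pre-activation
A2 = sigmoid(Z2)      # output (prediction)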
How do you ensure you're not overfitting with a model?
TBD
What are common failure modes of Gradient Descent?
Vanishing gradients, exploding gradients, and ReLU layers can die.
Why overfitting happens?
Overfitting is possible because the criterion used to train the model (fit to the training data) is not the same as the criterion used to judge its efficacy (generalization to unseen data).
How does softmax work?
The softmax function, also known as softargmax or the normalized exponential function, converts a vector of K real numbers into a probability distribution over K possible outcomes.
>>> import numpy as np
>>> a = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
>>> np.exp(a) / np.sum(np.exp(a))
array([0.02364054, 0.06426166, 0.1746813, 0.474833, 0.02364054, 0.06426166, 0.1746813])
Neural Net - Let's assume we have a neural net with one hidden layer, and X is (2,1). How many layers do we have? If Z = Wx + b and the hidden layer is (3,1), what are the dimensions of Z and W?
We have 2 layers. The dimension of Z is (3,1) and the dimension of W is (3,2). See pic for explanation; notice the general form underlined in red in the pic.
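A quick numpy shape check of the same setup (the values are placeholders; only the shapes matter):
import numpy as np

x = np.ones((2, 1))    # input X is (2, 1)
W = np.ones((3, 2))    # W maps 2 inputs to 3 hidden units -> (3, 2)
b = np.zeros((3, 1))
Z = W @ x + b
print(Z.shape)         # (3, 1), matching the hidden layer dimension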
What is a low rank matrix?
A matrix is low rank when its rank is much smaller than its dimensions, i.e. it can be factored into the product of a few thin factors (say 2).
When is Ridge regression favorable over Lasso regression?
You can quote ISLR's authors Hastie and Tibshirani, who asserted that in the presence of a few variables with medium/large effects you should use lasso regression, and in the presence of many variables with small/medium effects you should use ridge regression. Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least squares estimates have high variance. Therefore, it depends on our model objective.
Principal Component Analysis (PCA)
a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set (find principal patterns)
logistic regression derive log likelihood function
imp tbd
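While the card is still marked TBD, a hedged sketch of where the derivation lands, assuming the usual sigmoid hypothesis h(x) = sigmoid(theta^T x): the Bernoulli likelihood prod_i h(x_i)^y_i (1 - h(x_i))^(1 - y_i) becomes, after taking logs, the sum computed below.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))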
What are 3 ways of reducing dimensionality?
1. Removing collinear features. 2. Performing PCA, ICA, or other forms of algorithmic dimensionality reduction. 3. Combining features with feature engineering.
We know that one-hot encoding increases the dimensionality of a data set, but label encoding doesn't. How?
Don't get baffled by this question. It's a simple question asking the difference between the two. With one-hot encoding, the dimensionality (a.k.a. number of features) of a data set increases because it creates a new variable for each level present in a categorical variable. For example: say we have a variable 'color' with 3 levels, namely Red, Blue and Green. One-hot encoding the 'color' variable will generate three new variables, Color.Red, Color.Blue and Color.Green, containing 0/1 values. In label encoding, the levels of a categorical variable get encoded as integers (e.g. 0 and 1 for a binary variable), so no new variable is created. Label encoding is mostly used for binary variables.
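A small pandas illustration of the difference (the 'color' example mirrors the card; the DataFrame is made up):
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

one_hot = pd.get_dummies(df["color"], prefix="Color")   # 3 new indicator columns, one per level
labels, levels = pd.factorize(df["color"])              # single integer column, no new columns
print(one_hot.columns.tolist())   # ['Color_Blue', 'Color_Green', 'Color_Red']
print(labels)                     # e.g. [0 1 2 0]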
While working on a data set, how do you select important variables? Explain your methods.
Following are the methods of variable selection you can use: 1. Remove correlated variables prior to selecting important variables. 2. Use linear regression and select variables based on p-values. 3. Use forward selection, backward selection, or stepwise selection. 4. Use Random Forest or XGBoost and plot the variable importance chart. 5. Use Lasso regression. 6. Measure information gain for the available set of features and select the top n features accordingly.
L1 regularization or Lasso and its limitations
Lasso shrinks large coefficients and truncates small coefficients to zero. This leads to a sparse solution where the majority of the input features have zero weights and very few features have non-zero weights. The difference between ridge and lasso regression is that lasso tends to drive coefficients to exactly zero, whereas ridge never sets the value of a coefficient to exactly zero. Limitation of Lasso regression: if the number of predictors (p) is greater than the number of observations (n), Lasso will pick at most n predictors as non-zero, even if all predictors are relevant (or may be useful on the test set).
Why do we need regularization and how does it work?
Regularization helps identify the less informative features and remove the noise. Regularization works on the assumption that smaller weights generate simpler models and thus help avoid overfitting.
What are the disadvantages of polynomial regression?
1. Remote parts of the fitted function are sensitive to outliers (see image). 2. Less flexibility due to the global function structure.
Neural Network - how can we use x1 AND x2, NOT x1 and NOT x2, x1 OR x2 to create a network for x1 XNOR x2
This is a good example of how neural networks can be used in layers to create complicated functions.
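A small numpy sketch of the idea: AND, (NOT x1) AND (NOT x2), and OR units combine into XNOR. The specific weight values follow the well-known textbook construction and are illustrative, not the only possible choice.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xnor(x1, x2):
    a_and      = sigmoid(-30 + 20 * x1 + 20 * x2)   # x1 AND x2
    a_not_both = sigmoid(10 - 20 * x1 - 20 * x2)    # (NOT x1) AND (NOT x2)
    return sigmoid(-10 + 20 * a_and + 20 * a_not_both)  # OR of the two hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xnor(x1, x2))   # close to 1 when x1 == x2, else close to 0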
Word Embeddings
Word tokens in a dense vector space (~few hundred real numbers), where the location and distance between words indicates how similar they are semantically.
Standard Deviation and Variance
Variance is the average squared distance of each score from the mean; standard deviation is its square root, a measure of variability that describes the average distance of every score from the mean.
dropout layer
Adding a dropout layer has the effect of regularization (it discourages learning a more complex model, to avoid the risk of overfitting). You can add dropout to all layers or just to the more complex layers.
Convex function vs nonconvex function
A convex function has one optimal minimum, which is also the global minimum, whereas a nonconvex function can have many local minima (and 1 global minimum).
in classification micro average is
a globally computed mean: total TP, FP and FN are counted across all classes, which effectively weights each class by its support (a support-weighted mean)
What is a disadvantage of PCA? What is the solution?
Like regression PCA is sensitive to outliers. Robust PCA can retrieve correct low-rank structures.
How are eigenvectors computed and used for clustering?
1) Eigenvectors are computed from Laplacian of a graph. 2) Similar eigenvectors are used to cluster nodes together. 3) Threshold eigenvector values are used to separate the clusters.
Problems with high dimension data
1. If we have more features than observations, then we run the risk of massively overfitting our model. 2. When we have too many features, observations become harder to cluster; believe it or not, too many dimensions cause every observation in your dataset to appear equidistant from all the others. If the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can be formed.
Neural Network - Steps to train a neural network
1. Randomly initialize weights 2. Implement forward propagation 3. Compute cost function 4. Implement backpropagation to compute partial derivatives 5. Use gradient descent with backpropagation to minimize cost function
What are the advantages and disadvantages of k-nearest neighbors?
Advantages: K-Nearest Neighbors have a nice intuitive explanation, and they tend to work very well for problems where comparables are inherently indicative. For example, you could build a kNN housing price model by modeling on other houses in the area with a similar number of bedrooms, floor space, etc. Disadvantages: they are memory-intensive. They also do not have built-in feature selection or regularization, so they do not handle high dimensionality well.
Precision
How many of those who we predicted as diabetic are actually diabetic? Precision = TP/(TP+FP)
What is a neural network? What is deep learning?
Neural networks are a beautiful biologically-inspired programming paradigm that enables a computer to learn from observational data. Deep learning is a powerful set of techniques for learning in neural networks.
Entropy
A measure of randomness or impurity; for a distribution with probabilities p_i, entropy H = -sum_i p_i log p_i.
What are vanishing gradients? And how do you solve them?
When training reaches a saturation point, each layer attenuates its signal relative to noise, especially when sigmoid or tanh functions are used in the hidden layers. As a result, during backpropagation the gradients become smaller and smaller until they vanish. When this happens your weights are no longer updating and training comes to a halt. Fix = use non-saturating, non-linear activation functions such as ReLU, ELU, etc.
How is KNN different from k-means clustering?
KNN is a supervised learning algorithm: it finds the k nearest neighbours (labeled data) and uses voting to decide which class a point belongs to; k should not be a multiple of the number of classes (and not even for binary problems) to avoid ties. KNN can be slow on large datasets because of the search complexity. K-means clustering is unsupervised learning: we choose k random centroids and assign each element to its nearest centroid. After each iteration we take the mean of all points assigned to each centroid, obtain a new centroid, and cluster the elements around that centroid again. We repeat this until none of the cluster assignments change.
Type 1 error
Rejecting the null hypothesis when it is true (a false positive, e.g. rejecting a new item-add request when it should be accepted).
Neural Network - Cost function for logistic regression (think about the cost function but summed over all activation layers)
see pic
Specificity (true negative rate)
Calculated as the number of correct negative predictions divided by the total number of negatives => TN / (TN + FP)
continuous vs discrete vs categorical data
Categorical (eye color, or type of something), Discrete (whole number data that is counted e.g. number of sisters, can't be 1.5 sisters), Continuous ( measured data on a scale e.g. weight 1.23333333 , it can be almost any numeric value)
k-fold cross validation
Data is divided into k subsets. Now the holdout (train/test) method is repeated k times, such that each time, one of the k subsets is used as the test set/ validation set and the other k-1 subsets are put together to form a training set. This significantly reduces bias as we are using most of the data for fitting, and also significantly reduces variance as most of the data is also being used in validation set. Usually K is 5 or 10.
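A minimal sklearn sketch of 5-fold cross validation (the iris data and logistic regression model are just illustrative choices):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # 5 folds
print(scores.mean(), scores.std())   # average held-out accuracy and its spread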
Difference between dot product and matrix multiplication? Formula for dot product of vector? Constraint of matrix multiplication? What will be the end dimension of resulting matrix after multiplying two matrices?
The dot product is between two vectors (a . b = sum_i a_i * b_i), whereas matrix multiplication is between two matrices. The constraint: the number of columns of the first matrix must equal the number of rows of the second. If we multiply (L by M) x (M by N), the resulting matrix will be L by N. (See image)
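A short numpy illustration of both points (the vectors and shapes are arbitrary):
import numpy as np

a, b = np.array([1, 2, 3]), np.array([4, 5, 6])
print(np.dot(a, b))            # 1*4 + 2*5 + 3*6 = 32

A = np.ones((2, 3))            # L x M
B = np.ones((3, 4))            # M x N (cols of A == rows of B)
print((A @ B).shape)           # (2, 4) -> L x N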
Bias-variance tradeoff
Bias is error due to overly simplistic assumptions (underfitting); variance is error due to excessive model complexity (overfitting, fitting too much noise). The bias-variance decomposition splits generalization error into these components; the goal is the optimally reduced amount of error, with neither high bias nor high variance.
Robust PCA Applications?
Face Recognition, Anomaly Detection, Text Mining, Web Mining, Image and Video repair (removes occlusion) and video surveillance. See image for low rank and sparse matrix (outlier).
GD vs SGD
GD computes the gradient over all observations, sums it, and then updates the coefficients, whereas SGD randomly selects a training sample and updates the coefficients right away instead of going through all the training samples before updating.
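A hedged numpy sketch contrasting the two update rules on a toy linear-regression problem (the data, learning rate, and iteration counts are illustrative):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)
lr = 0.1

# batch gradient descent: gradient over ALL observations, then one update
w = np.zeros(2)
for _ in range(100):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= lr * grad

# stochastic gradient descent: update after each randomly chosen sample
w_sgd = np.zeros(2)
for _ in range(100):
    i = rng.integers(len(X))
    grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
    w_sgd -= lr * grad_i

print(w, w_sgd)   # both move toward [2, -1]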
How does gradient descent work?
Gradient descent starts from an initial point and moves in the direction of the negative gradient, which decreases the value of the function. In the image, theta is the coefficient; we try to find the coefficients that minimize the value of the cost function.
p-value
It tells us how likely it is to observe a result at least as extreme as this one if the null hypothesis is true. If the p-value is very small, we reject the null hypothesis.
What are different types of gradient descent algorithms?
Normal (batch) gradient descent, GD with backtracking, accelerated GD (Nesterov's algorithm), stochastic GD (preferred when N is large), and mini-batch stochastic GD.
Recall/Sensitivity/True positive rate
Of all the people who are diabetic, how many did we correctly predict? Recall = TP/(TP+FN)
Specificity
Of all the people who are healthy (not diabetic), how many of those did we correctly predict? Specificity = TN/(TN+FP)
Neural net (forward propagation) - if the input is x1, x2, x3, write the formula for the hidden layer activations a1, a2, a3, and the formula for h_theta(x).
see pic
in classification macro average is
simple mean (this can be misleading if there is a class imbalance)
softmax vs sigmoid
Softmax = in the case of multiple classes, converts all score values into a normalized probability distribution. Sigmoid = in the case of binary classification, tells which of the two classes the outcome belongs to.
TensorFlow: define a constant tensor 3, and initialize a variable with the list [1,2,3]. What is the main difference between a tensor and a variable?
tf.constant(3), tf.Variable([1,2,3]). Tensors are not mutable but variables are.
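A short illustrative snippet (assuming TensorFlow 2.x) of the mutability difference:
import tensorflow as tf

c = tf.constant(3)            # immutable tensor
v = tf.Variable([1, 2, 3])    # mutable; holds state that can be updated
v.assign([4, 5, 6])           # allowed for a Variable
# c cannot be modified in place; a new tensor must be created instead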
How to figure out if your time series data is random (random walk)?
1. The time series shows a strong temporal dependence (autocorrelation) that decays linearly or in a similar pattern. (see image t+1 predicts t) 2. The time series is non-stationary and making it stationary (mean and covariance are static over time ) shows no obviously learnable structure in the data.
normal distribution
1. Symmetric bell shape. 2. Mean and median are equal; both located at the center of the distribution. 3. Approximately 68 percent of the data falls within 1 standard deviation of the mean. 4. Approximately 95 percent falls within 2 standard deviations. 5. Approximately 99.7 percent falls within 3 standard deviations.
R squared
A statistical measure of how close the data are to the fitted regression line. It cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.
L1 Norm and L2 Norm
L1 norm: the sum of the absolute values of a vector's components (Manhattan length). L2 norm: the square root of the sum of the squared components, i.e. Euclidean length (e.g. for a 2D vector).
What are the advantages and disadvantages of neural networks?
Advantages: Neural networks (specifically deep NNs) have led to performance breakthroughs for unstructured datasets such as images, audio, and video. Their incredible flexibility allows them to learn patterns that no other ML algorithm can learn. Disadvantages: However, they require a large amount of training data to converge. It's also difficult to pick the right architecture, and the internal "hidden" layers are incomprehensible.
Neural Network - Before we start gradient descent, why can't we initialize theta to 0 like we did in linear/logistic regression? How do we initialize theta?
Because of the way neural networks are connected between layers, initializing theta to 0 makes the parameters feeding each unit identical after every update, so activation a1 = activation a2 (see pic). This results in fewer interesting features being generated, since effectively only 1 feature is learned. To solve this we randomly initialize theta.
How can you avoid overfitting ?
Overfitting can be avoided by using a lot of data; overfitting tends to happen when you have a small dataset and try to learn from it. If you have a small dataset and are forced to build a model from it, you can use a technique known as cross validation. In this method the dataset is split into two sections, a training set and a testing set: the model is built from the training data points and only evaluated on the testing data. In other words, the model is trained on a dataset of known data (the training set) and tested against a dataset of unseen data. The idea of cross validation is to define a held-out dataset to "test" the model during the training phase.
Write flow of hypothesis testing? When to do A/B testing, back-testing, long running A/B testing.
Compute statistical significance: if the p-value is less than .05 (alpha), you can reject the null hypothesis and go with the alternative hypothesis. Back-testing: in some cases we need to be more confident about the result of an A/B experiment when it is overly optimistic, so we back-test against historical data. Long-running A/B tests: in a few experiments, one key concern could be that the experiment has a negative long-term impact, since we do A/B testing for only a short period of time. Say by showing more ads we increase revenue in the short term but lose customers in the long term. The long-running experiment, which measures long-term behaviors, can also be done via a backtest. We can launch the experiment based on initial positive results while continuing to run a long-running backtest to measure any potential long-term effects. If we notice any significant negative behavior, we can revert the changes from the launched experiment.
Coordinate Descent
Coordinate descent updates one parameter at a time, while gradient descent attempts to update all parameters at once. Coordinate descent is not that commonly used, but it is optimal for Lasso regression.
What is the difference between covariance and correlation?
Correlation is the standardized form of covariance. Covariances are difficult to compare. For example: if we calculate the covariances of salary ($) and age (years), we'll get different covariances which can't be compared because of having unequal scales. To combat such situation, we calculate correlation to get a value between -1 and 1, irrespective of their respective scale.
How is principal component ( linearly uncorrelated variables ) calculated?
It is given by eigenvectors (with largest eigenvalues) of covariance matrix.
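A hedged numpy sketch of that computation on synthetic data (the random data and the choice to keep 2 components are illustrative):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, eigenvalues ascending
order = np.argsort(eigvals)[::-1]        # sort by largest eigenvalue first
components = eigvecs[:, order]           # columns = principal components
projected = Xc @ components[:, :2]       # project onto the top 2 components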
Neural Network - how is the gradient computed using forward propagation for a 3-layer neural network?
see pic
Which is more important to you- model accuracy, or model performance?
High accuracy does not always mean the best model performance. For an imbalanced dataset, accuracy is not a valid measure of model performance. For a dataset where the default rate is 5%, even if all the records are predicted as 0, the model will still have an accuracy of 95%. But this model will ignore all the defaults and can be very detrimental to the business, so accuracy is not the right measure of model performance in this scenario. You can choose what to optimize for by looking at the cost of TN, TP, FP, FN. Let's say you are trying to predict whether someone has cancer so that they can get further, more invasive tests done; even a small number of false negatives can be very bad, and you would want to optimize for the model with the least number of FNs.
What is the IQR? How do you calculate it? And how do you use it to find outliers?
IQR is used to measure variability by dividing a data set into quartiles. The data is sorted in ascending order and split into 4 equal parts; Q1, Q2, Q3, called the first, second and third quartiles, are the values which separate the 4 equal parts. For 1, 2, 3, 4, 5, 6, 50: Q1 (25th percentile) = 2.5, Q2 (50th percentile) = 4.0, Q3 (75th percentile) = 5.5, so the interquartile range is Q3 - Q1 = 3.0. Finding outliers: lower limit = Q1 - (1.5 x IQR) = -2.0, upper limit = Q3 + (1.5 x IQR) = 10. Any value below the lower limit or above the upper limit is considered an outlier.
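A quick numpy check of the same numbers (np.percentile's default interpolation reproduces the 2.5 / 5.5 quartiles used above):
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 50])
q1, q3 = np.percentile(data, [25, 75])          # 2.5 and 5.5 for this data
iqr = q3 - q1                                   # 3.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -2.0 and 10.0
outliers = data[(data < lower) | (data > upper)]
print(outliers)                                 # [50]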
How can you tell if a function is convex?
If it is univariate, the 2nd derivative should be >= 0; if it is multivariate, the Hessian should be positive semidefinite.
How can you choose a classifier based on training set size?
If training set is small, high bias / low variance models (e.g. Naive Bayes) tend to perform better because they are less likely to be overfit. If training set is large, low bias / high variance models (e.g. Logistic Regression) tend to perform better because they can reflect more complex relationships.
L2 regularization or ridge regularization and its limitations
In Ridge regression, we add a penalty term equal to the square of the coefficients, along with a constant to control that penalty. Increasing this penalty causes the coefficient values to tend towards zero. This leads to both low variance (since some coefficients end up having a negligible effect on the prediction) and low bias (minimizing the coefficients reduces the dependency of the prediction on any particular variable). Limitation of Ridge regression: it decreases the complexity of a model but does not reduce the number of variables, since it never drives a coefficient to zero, only minimizes it. Hence, this model is not good for feature reduction.
stratified k-fold cross validation
In some cases, there may be a large imbalance in the response variables. For example, in dataset concerning price of houses, there might be large number of houses having high price. Or in case of classification, there might be several times more negative samples than positive samples. For such problems, a slight variation in the K Fold cross validation technique is made, such that each fold contains approximately the same percentage of samples of each target class as the complete set, or in case of prediction problems, the mean response value is approximately equal in all the folds.
Elastic Net Regression
It is a variable selection method that enjoys the benefits of both lasso and ridge, including both the L1 and L2 norms. L2 penalty: useful in high-dimensional cases where the number of important variables exceeds the number of observations, and helps with issues of multicollinearity, like ridge regression. L1 penalty: encourages sparsity, i.e. the majority of x's components (weights) are zero and only a few are non-zero.
What are exploding gradients? And how do you solve them?
The opposite of vanishing gradients: gradients get bigger and bigger until the weights become so large that we overflow. Even if we start with small gradients, e.g. 2, they can compound and get big; this is especially true for sequence models with long sequence lengths. Fix = smaller batch size, batch normalization, gradient clipping.
Accuracy
Percentage of predictions that were correct: (TP+TN)/Total. Accuracy answers the following question: how many students did we correctly label out of all the students?
What's the difference between probability and likelihood?
Probability = area under fixed distribution. Likelihood = y-axis values for fixed data points with distributions that can be moved.
Probability vs Likelihood. What is the maximum likelihood function of logistic regression?
Probability is used to find the chance of occurrence of a particular outcome without changing the distribution of the data (e.g. its mean and standard deviation), whereas likelihood, in very simple terms, means making the observed data more probable by varying the parameters of the distribution. For logistic regression, the likelihood is L(theta) = prod_i h(x_i)^y_i * (1 - h(x_i))^(1 - y_i), where h(x) = sigmoid(theta^T x), and maximum likelihood chooses theta to maximize it.
Why do ReLU layers die? And how do you solve it?
ReLUs stop working when their inputs keep them in the negative domain, causing their output to be 0. This can be seen and monitored in TensorBoard. Fix: use leaky ReLU or slower ELUs, or lower your learning rate.
L1 regularization
Removes unimportant features and reduces overfitting. A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization.
Standardization (Z-score normalization)
Rescaling technique that centers the distribution of the data on the value 0 (zero mean) and scales the standard deviation to the value 1. It is important when we compare measurements that have different units; variables measured at different scales do not contribute equally to the analysis and might end up creating a bias. standardized_value_i = (value_i - mean of the column) / standard deviation of the column. Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardization is useful when your data has varying scales and the algorithm you are using makes assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis. Normalization (min-max scaler) = bring values between 0 and 1 via (x - X_min) / (X_max - X_min); we need domain knowledge to know the min and max values. It is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
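A brief sklearn sketch contrasting the two rescalings on a toy column (the data values are made up):
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

z = StandardScaler().fit_transform(X)   # zero mean, unit standard deviation
m = MinMaxScaler().fit_transform(X)     # rescaled into [0, 1]
print(z.ravel())
print(m.ravel())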
How would you evaluate a logistic regression model?
TBD
How would you handle an imbalanced dataset?
TBD
L1 and L2 regularization
TBD - https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261#f810
Explain how a ROC (receiver operating characteristic) curve works.
TBD
TensorFlow Graph (static) mode vs Eager mode - what does @tf.function do, which mode should we run in production, and can we declare variables multiple times in graph mode?
TensorFlow supports graph and eager modes. Functions decorated with @tf.function run as static computation graphs. Graph/static mode - first define operations, then execute (as in Java); it should be used in production since it is faster, because the computation graph is built before execution. Eager/dynamic mode - execution is performed as operations are defined (as in ordinary Python); good for debugging but slower in production. In graph mode, the variable you declare the first time should be reused on subsequent calls, so you should create variables exactly once; this makes sense if you think of static variables.
Both being tree based algorithm, how is random forest different from Gradient boosting algorithm (GBM)?
The fundamental difference is that random forest uses the bagging technique to make predictions, while GBM uses boosting. In bagging, a data set is divided into n samples using randomized sampling; then, using a single learning algorithm, a model is built on each sample, and the resulting predictions are combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher so that they can be corrected in the succeeding round; this sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached. Random forest improves model accuracy mainly by reducing variance (the trees grown are decorrelated to maximize the decrease in variance), whereas GBM improves accuracy by reducing both bias and variance.
Why are ensemble methods superior to individual models?
They average out biases, reduce variance, and are less likely to overfit. There's a common line in machine learning which is: "ensemble and get 2%." This implies that you can build your models as usual and typically expect a small performance boost from ensembling.
How can Regularization be used for variable selection
We need to figure out the important variables, and regularization can help. However, the L2 norm and ridge regression are not that useful here since they can't produce coefficients of exactly 0. The L0 and L1 norms can help, providing 0 coefficients for features that are non-informative; this is also known as encouraging sparsity, which prevents overfitting.
RMSE
Root mean square error: the square root of the average squared difference between actual and predicted values (the standard deviation of the residuals). Preferred over MAE (mean absolute error) since it penalizes large differences more heavily.
Type 2 error
Failing to reject a false null hypothesis (a false negative, e.g. accepting a new item-add request when it should be rejected).