Machine Learning Interview Questions, Machine Learning General

Entropy

measure of randomness

Betweenness Centrality

How often a node is on the shortest path between two other nodes that are not directly connected (e.g. to figure out bottlenecks, people with most power in social network, or airports that need to be protected in case of attack)

Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?

For better predictions, categorical variable can be considered as a continuous variable only when the variable is ordinal in nature.

R squared

A statistical measure of how close the data are to the fitted regression line. It cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

Coordinate Descent

Coordinate descent updates one parameter at a time, while gradient descent attempts to update all parameters at once. Coordinate descent is not that commonly used, but it is well suited to Lasso regression because the non-smooth L1 penalty is separable across coordinates, so each one-parameter update has a simple closed form.

Robust PCA Applications?

Face Recognition, Anomaly Detection, Text Mining, Web Mining, Image and Video repair (removes occlusion) and video surveillance. See image for low rank and sparse matrix (outlier).

Kernel Regression

It is based on weighted local averaging and fits a simple model separately at each target point x0. It is an alternative to KNN and requires little training.

What is Kalman filter? What are its applications?

Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe. A common application is for guidance, navigation, and control of vehicles, particularly aircraft, spacecraft and dynamically positioned ships.[1] Furthermore, the Kalman filter is a widely applied concept in time series analysis used in fields such as signal processing and econometrics.

Why do ReLU layers die? And how do you solve it?

ReLUs stop working when their inputs keep them in the negative domain, causing their output to be 0. This can be seen and monitored in TensorBoard. Fix: use leaky ReLU, slower ELUs, or lower your learning rate.
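
A minimal tf.keras sketch of the leaky-ReLU fix (layer sizes and the 0.1 slope are arbitrary choices, not from the card):

```python
import tensorflow as tf

# Dense layer followed by LeakyReLU instead of a hard ReLU: the small negative
# slope lets units whose pre-activations go negative keep receiving gradient.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),      # 10 input features (arbitrary)
    tf.keras.layers.Dense(64),
    tf.keras.layers.LeakyReLU(0.1),   # slope of 0.1 for negative inputs
    tf.keras.layers.Dense(1),
])
model.summary()
```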

How can we use b-splines to reduce dimension?

TBD

Name an example where ensemble techniques might be useful.

TBD

L1 and L2 regularization

TBD - https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261#f810

Tensorflow Graph (static) Mode vs Eager Mode - what does @tf.function do, which mode do we run in production, can we declare variables multiple times in graph mode?

TensorFlow supports graph and eager modes. Functions decorated with @tf.function run as static computation graphs. Graph/static mode - operations are first defined and then executed (like Java); it should be used in production since it is faster, because the computation graph is evaluated and optimized before execution. Eager/dynamic mode - execution is performed as operations are defined (like Python); it is good for debugging but slower in production. In graph mode, the variable you declare the first time should be reused on subsequent calls, so you should create variables exactly once. This makes sense if you think of them as static variables.
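
A minimal sketch of the two modes (shapes and names here are illustrative, not from the card):

```python
import tensorflow as tf

# Variables are created once, outside the traced function, and reused on every call.
w = tf.Variable(tf.random.normal([2, 3]))
b = tf.Variable(tf.zeros([3]))

@tf.function          # traces this Python function into a static graph on first call
def affine(x):
    return tf.matmul(x, w) + b

x = tf.constant([[1.0, 2.0]])
print(affine(x))      # runs as a compiled graph; the same ops written outside run eagerly
```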

in classification macro average is

simple mean (this can be misleading if there is a class imbalance)

Neural Net - Let's assume we have a neural net with one hidden layer, and X is (2,1). How many layers do we have? If Z = Wx + b and the hidden layer is (3,1), what is the dimension of Z and W?

We have 2 layers. Dimension of Z is (3,1) and dimension of W is (3,2). See pic for explanation. Notice the general form underlined in red in the pic.

ARIMA (Autoregressive Integrated Moving Average)

Autoregressive = regressing a variable on past values of itself; it uses values from previous time steps to predict the next time step. Integrated = if a trend exists then the time series is considered non-stationary and shows seasonality; integration (differencing) is the property that removes this seasonality from the series. Moving average = error terms of previous time points are used to predict the current and future observations; MA removes non-determinism or random movements from a time series.

Neural Network - Before we start gradient descent why can't we initialize theta to 0 like we did in linear/logistic regression? How do we initialize theta?

Because of the way neural networks are connected between layers, initializing theta with 0 makes the parameters feeding each layer identical after each update, so activation a1 = activation a2 (see pic). This results in fewer interesting features being generated, since effectively only 1 feature is learned. To solve this we randomly initialize theta.

How does gradient descent work?

Gradient descent starts from an initial point and moves in the negative gradient direction, which decreases the value of the function. In the image, theta is the coefficient. We try to find the best coefficient, the one that minimizes the value of the cost function.
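
A toy sketch of the update rule on a one-dimensional cost function (the function and the learning rate are made up for illustration):

```python
# Minimize f(theta) = (theta - 3)^2, whose gradient is f'(theta) = 2 * (theta - 3).
theta = 0.0              # initial point
learning_rate = 0.1

for _ in range(100):
    grad = 2 * (theta - 3)           # gradient at the current point
    theta -= learning_rate * grad    # step in the negative gradient direction

print(theta)  # ends up very close to the minimizer theta = 3
```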

L2 regularization or ridge regularization and its limitations

In ridge regression, we add a penalty term which is equal to the square of the coefficients, together with a constant that controls the strength of that penalty. Increasing the value of this constraint causes the coefficient values to tend towards zero. This leads to both low variance (as some coefficients have a negligible effect on prediction) and low bias (shrinking coefficients reduces the dependency of the prediction on any particular variable). Limitation of ridge regression: it decreases the complexity of a model but does not reduce the number of variables, since it never drives a coefficient to exactly zero, only shrinks it. Hence, this model is not good for feature reduction.

What is stationary data? How can we find out if time series data is stationary? How can we make data stationary? Why is stationarity important in a time-series model?

It means that the statistical structure of the series is independent of time, i.e. the mean and standard deviation don't change over time. You can plot the data or check statistics over time to see if the data is stationary. We can make data stationary by identifying and removing seasonal effects and trends, or by differencing (taking the difference between two time periods) until the data appears stationary. You can test stationarity with the augmented Dickey-Fuller test: a small p-value (e.g. < 0.05) rejects the null hypothesis of a unit root, suggesting the mean and variance do not depend on time. Stationarity is important because it provides a framework in which averaging (used in AR and MA processes) can properly describe time series behaviour over time.
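
A sketch of such a check with the augmented Dickey-Fuller test from statsmodels (the random-walk series below is synthetic):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))   # random walk: non-stationary
diffed = np.diff(series)                   # first difference: stationary

for name, data in [("level", series), ("first difference", diffed)]:
    stat, p_value = adfuller(data)[:2]     # test statistic and p-value
    # a small p-value rejects the unit-root null, i.e. the series looks stationary
    print(name, "ADF p-value:", round(p_value, 4))
```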

What are exploding gradients? And how do you solve them?

Opposite of vanishing gradients. Gradients get bigger and bigger until the weights get so large that we overflow. Even if we start with small gradients, e.g. 2, they can compound and get big; this is especially true for sequence models with long sequence lengths. Fix = smaller batch size, batch normalization, gradient clipping.
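
A minimal sketch of the gradient clipping fix in tf.keras (the clipnorm value of 1.0 is an arbitrary choice):

```python
import tensorflow as tf

# Clip the gradient norm so that a single huge gradient cannot blow up the weights.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Inside a custom training loop the same idea looks roughly like:
#   grads = tape.gradient(loss, model.trainable_variables)
#   grads = [tf.clip_by_norm(g, 1.0) for g in grads]
#   optimizer.apply_gradients(zip(grads, model.trainable_variables))
```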

Histogram of Oriented Gradients (HoG)

The histogram of oriented gradients is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientations in localized portions of an image. We feed HoG features into ML classifiers to classify images.

Do eigenvectors change direction?

They are only scaled by a magnitude (eigenvalue), they don't change direction.

Neural Network - how can we use x1 AND x2, NOT x1 and NOT x2, x1 OR x2 to create a network for x1 XNOR x2

This is a good example of how neural networks can be composed in layers to create complicated functions: x1 XNOR x2 = (x1 AND x2) OR ((NOT x1) AND (NOT x2)), so a first layer computes the two AND terms and a second layer ORs them together.

What is tucker decomposition?

Tucker decomposition decomposes a tensor into a core tensor, G, multiplied by set of factorizing tensors. The core tensor, G, is what we use instead of high dimension tensor X. We can recreate an approximation of original tensor X by combining G and the factorizing tensors. Tucker decomposition is more stable and accurate than CP decomposition.

TensorFlow - define a constant tensor 3, and initialize a variable with the list [1,2,3]. What is the main difference between a tensor and a variable?

tf.constant(3), tf.Variable([1,2,3]). Tensors are not mutable, but variables are.
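
In code, a minimal sketch:

```python
import tensorflow as tf

c = tf.constant(3)           # immutable tensor
v = tf.Variable([1, 2, 3])   # mutable: holds state that can be updated

v.assign([4, 5, 6])          # variables support in-place updates
# c has no assign method; to "change" a constant you must build a new tensor
print(c.numpy(), v.numpy())
```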

L1 regularization

Removes unimportant features and reduces overfitting. A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization.

Standardization (Z-score normalization)

Rescaling technique that centers the distribution of the data on the value 0 (zero mean) and sets the standard deviation to 1. It is important when we compare measurements that have different units: variables measured at different scales do not contribute equally to the analysis and might end up creating a bias. standardized_value_i = (value_i - mean of the column) / standard deviation of the column. Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardization is useful when your data has varying scales and the algorithm you are using makes assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis. Normalization (min-max scaler) = bring values between 0 and 1 via (x - X_min) / (X_max - X_min); we need the domain knowledge to know the min and max values. It is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
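
A small sketch with scikit-learn's scalers (the toy matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

z = StandardScaler().fit_transform(X)   # standardization: zero mean, unit variance per column
m = MinMaxScaler().fit_transform(X)     # normalization: each column rescaled to [0, 1]

print(z.mean(axis=0), z.std(axis=0))    # ~[0, 0] and [1, 1]
print(m.min(axis=0), m.max(axis=0))     # [0, 0] and [1, 1]
```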

Neural Network - Steps to train a neural network

1. Randomly initialize weights 2. Implement forward propagation 3. Compute cost function 4. Implement backpropagation to compute partial derivatives 5. Use gradient descent with backpropagation to minimize cost function

Directed Network

A graph with directed edges ( e.g. world wide web - a web site can link to another website that does not link back)

How to figure out if your time series data is random (random walk)?

1. The time series shows a strong temporal dependence (autocorrelation) that decays linearly or in a similar pattern. (see image t+1 predicts t) 2. The time series is non-stationary and making it stationary (mean and covariance are static over time ) shows no obviously learnable structure in the data.

Convex function vs nonconvex function

A convex function has one optimal minimum, which is also the global minimum, whereas a nonconvex function can have many minima (and 1 global minimum).

How can Regularization be used for variable selection

We need to figure out which variables are important, and regularization can help. However, the L2 norm and ridge regression are not that useful here since they can't produce coefficients that are exactly 0. The L0 and L1 norms can help and give 0 coefficients for features that are non-informative; this is also known as encouraging sparsity, which prevents overfitting.

What is a low rank matrix?

A matrix is low rank when it can be factored into a small number of factors, say 2 thin matrices; its rank is small relative to its dimensions.

Neural net (forward propagation) - if input it x1,x2,x3 then write formula for hidden layer a1,a2,a3? and formula for h_theta(x)

see pic

a sparse matrix is mostly comprised of

zeros

Undirected Network

A graph with bi-directional edges (e.g. the internet, which is a form of communication channel; other examples are power grids and Facebook friendships).

How to compute degree (of node) from adjacency matrix

Add all the entries in the Nth row (this can also be used to find how many friends node N has).
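
For example, with a made-up 4-node undirected graph:

```python
import numpy as np

# Adjacency matrix: entry [i][j] is 1 if nodes i and j share an edge.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])

degrees = A.sum(axis=1)   # row sums = degree of each node
print(degrees)            # [2 3 2 1]
```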

How are eigenvectors computed and used for clustering?

1) Eigenvectors are computed from Laplacian of a graph. 2) Similar eigenvectors are used to cluster nodes together. 3) Threshold eigenvector values are used to separate the clusters.

L1 Norm and L2 Norm

L1 norm: the sum of the absolute values of a vector's components (the absolute length of a 1D vector). L2 norm: the square root of the sum of the squared components, i.e. the Euclidean length, e.g. of a 2D vector.

What is CP decomposition?

CP decomposition factorizes a tensor into a sum of component rank-one tensors.

Specificity (true negative rate)

Calculated as the number of correct negative predictions divided by the total number of negatives => TN / (TN + FP)

Neural Network - In neural network architecture what should be the dimension of input layer X, output layer y and dimension of hidden layer?

Dimension of X = number of features. Dimension of y = number of classes. Dimension of the hidden layer (i.e. number of hidden units) should generally be more than X, maybe 2 times dim X. Usually the more the better, but it can be computationally expensive.

How can you tell if a function is convex?

If it is univariate then the 2nd derivative is >= 0, and if it is multivariate then the Hessian should be positive semidefinite.

L1 regularization or Lasso and its limitations

Lasso shrinks large coefficients and truncates small coefficients to zero. This leads to a sparse solution where the majority of the input features have zero weights and very few features have non-zero weights. The difference between ridge and lasso regression is that lasso tends to push coefficients to exactly zero, whereas ridge never sets the value of a coefficient to exactly zero. Limitation of lasso regression: if the number of predictors (p) is greater than the number of observations (n), lasso will pick at most n predictors as non-zero, even if all predictors are relevant (or may be needed in the test set).

What is Particle Filter?

Particle filters are non-parametric (they do not assume underlying statistical distributions in the data), recursive Bayes filters. The posterior (the probability that an event will happen) is represented by a set of weighted samples, so they are not limited to Gaussians. A proposal distribution is used to draw new samples, and each sample's weight accounts for the difference between the proposal and the target. They work well in low-dimensional spaces.

What are 3 data preprocessing techniques to handle outliers?

1. Winsorize (cap at threshold). 2. Transform to reduce skew (using Box-Cox or similar). 3. Remove outliers if you're certain they are anomalies or measurement errors.

normal distribution

1. Symmetric bell shape. 2. Mean and median are equal; both located at the center of the distribution. 3. ≈68 percent of the data falls within 1 standard deviation of the mean. 4. ≈95 percent of the data falls within 2 standard deviations of the mean. 5. ≈99.7 percent of the data falls within 3 standard deviations of the mean.

Simple Network

An undirected network where there is at most one edge between any two nodes.

Lasso Regression

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters)

What is a disadvantage of PCA? What is the solution?

Like regression PCA is sensitive to outliers. Robust PCA can retrieve correct low-rank structures.

What is a neural network? What is deep learning?

Neural networks are a beautiful, biologically-inspired programming paradigm that enables a computer to learn from observational data. Deep learning is a powerful set of techniques for learning in neural networks.

What are different types of gradient descent algorithms?

Normal gradient descent, GD with backtracking, accelerated GD (Nesterov's algorithm), stochastic GD (preferred when N is large), mini-batch stochastic GD.

Clustering Coefficient

Number of connections that exist between nearest neighbors of a node (as a proportion of the maximum number of possible connections)

Probability vs likelihood. What is the maximum likelihood function of logistic regression?

Probability is used to find the chance of occurrence of a particular outcome under a fixed data distribution (fixed mean and standard deviation), whereas likelihood, in very simple terms, is about varying the characteristics of the dataset's distribution to make a particular observed situation as plausible as possible.

Why do we need regularization and how does it work?

Regularization helps identify the less informative features and remove the noise. Regularization works on the assumption that smaller weights generate a simpler model and thus helps avoid overfitting.

What are the disadvantages of polynomial regression?

Remote parts of the function are sensitive to outliers (see image). Less flexibility due to the global function structure.

Recall/Sensitivity/True positive rate

True positives / number of actual positives = TP / (TP + FN)

What are common failure modes of Gradient Descent?

Vanishing Gradients, Exploding Gradients, ReLu layers can die

I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?

We can use the following methods: since logistic regression is used to predict probabilities, we can use the AUC-ROC curve along with a confusion matrix to determine its performance. Also, the analogous metric to adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients; therefore, we always prefer the model with the minimum AIC value. Null deviance indicates the response predicted by a model with nothing but an intercept; the lower the value, the better the model. Residual deviance indicates the response predicted by a model after adding independent variables; the lower the value, the better the model.

Sparse Network

a network which has many nodes compared to number of edges e.g. social network

dropout layer

Adding a dropout layer has the effect of regularization (it discourages learning a more complex model, to avoid the risk of overfitting). You can add dropout to all layers or just the more complex layers.
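
A minimal tf.keras sketch (the layer sizes and the 0.5 rate are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),    # randomly zeroes 50% of activations, during training only
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.summary()
```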

Type 2 error

Failing to reject a false null hypothesis (false negative, e.g. accepting a new item-add request when it should be rejected).

What are the advantages of Naive Bayes?

A Naïve Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. Its main disadvantage is that it can't learn interactions between features.

What does it mean to "fit" a model? How do hyperparameters relate?

The process of learning the parameters of a model using training data. Hyperparameters, in contrast, are set before training (e.g. learning rate or tree depth) and control how that fitting is done; they are not learned from the data.

Neural Network - how is the gradient computed using forward propagation for a 3-layer neural network

see pic

in classification micro average is

computed globally by pooling every class's true positives, false positives and false negatives, so larger classes contribute more (it behaves like a support-weighted mean, unlike the simple macro mean)

Steps and tech stack for streaming data

1. Ingest variable amounts of streaming data using message queue (cloud pub/sub) 2. Deal with latency and unordered data using cloud data flow (apache beam) with triggers, windowing operations, transformations and aggregations 3. Real time insights - do continuous querying using BigQuery and provide visualizations and analytics

Given this equation for polynomial regression, y = β0 + β1·x + β2·x² + β3·x³, what will be the extracted features?

The polynomial terms x, x² and x³ (plus the intercept column); β0, β1, β2 and β3 are the coefficients learned for those features.

GD vs SGD

GD computes the gradient over all observations, sums it, and then updates the coefficients, whereas SGD randomly selects a training sample and updates the coefficients right away instead of going through all the training samples before updating them.
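
A rough sketch of the two update styles on a one-feature least-squares problem (synthetic data; the learning rate is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + rng.normal(scale=0.1, size=100)   # true coefficient is 3

w_gd, w_sgd, lr = 0.0, 0.0, 0.1
for _ in range(50):
    # Batch GD: gradient averaged over ALL observations, one update per pass.
    grad = -2 * np.mean((y - w_gd * x) * x)
    w_gd -= lr * grad

    # SGD: gradient from ONE randomly chosen sample, coefficients updated immediately.
    i = rng.integers(len(y))
    grad_i = -2 * (y[i] - w_sgd * x[i]) * x[i]
    w_sgd -= lr * grad_i

print(round(w_gd, 3), round(w_sgd, 3))        # both end up near 3, SGD a bit noisier
```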

p-value

It tells us how likely it is to get a value like this if the null hypothesis is true. If the p-value is very small then we reject the null hypothesis.

What are splines?

Linear combinations of piecewise polynomial functions under continuity assumptions. Partition the domain of x into continuous intervals and fit polynomials in each interval separately. Provides flexibility and local fitting, e.g. B-splines and smoothing splines (which do not have the boundary issue of B-splines).

'People who bought this, also bought...' recommendations seen on amazon is a result of which algorithm?

The basic idea for this kind of recommendation engine comes from collaborative filtering. Collaborative Filtering algorithm considers "User Behavior" for recommending items. They exploit behavior of other users and items in terms of transaction history, ratings, selection and purchase information. Other users behaviour and preferences over the items are used to recommend items to the new users. In this case, features of the items are not known.

When can a linear model learn a non-linear decision boundary?

When the features are non-linear, e.g. x².

What are vanishing gradients? And how do you solve them?

When training reaches a saturation point, each layer reduces its signal relative to noise, especially when using sigmoid or tanh functions in the hidden layers. As a result, during backpropagation the gradients become smaller and smaller until they vanish. When this happens your weights are no longer updating and training comes to a halt. Fix = use non-saturating, non-linear activation functions such as ReLU, ELU, etc.

OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement

OLS and maximum likelihood are the methods used by the respective regression models to estimate the unknown parameter (coefficient) values. In simple words, ordinary least squares (OLS) is a method used in linear regression that chooses the parameters minimizing the distance between the actual and predicted values. Maximum likelihood helps in choosing the values of the parameters that are most likely to have produced the observed data.

RMSE

root mean square error; standard deviation of the difference between actual values and predicted values. Preferred over MAE (mean absolute error) since it penalizes large differences

Neural Network - Cost function for logistic regression (think about the cost function but summed over all activation layers)

see pic

Difference between dot product and matrix multiplication? Formula for dot product of vector? Constraint of matrix multiplication? What will be the end dimension of resulting matrix after multiplying two matrices?

Dot product is between two vectors, whereas matrix multiplication is between two matrices. Dot product of vectors a and b = Σ aᵢbᵢ (multiply elementwise and sum). Constraint: the number of columns of the first matrix must equal the number of rows of the second matrix. If we multiply (L by M) × (M by N), the end matrix will be L by N. (See image)
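
A quick numpy illustration of both points:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(np.dot(a, b))      # dot product: 1*4 + 2*5 + 3*6 = 32.0

L, M, N = 2, 3, 4
A = np.ones((L, M))
B = np.ones((M, N))
C = A @ B                # allowed because A has M columns and B has M rows
print(C.shape)           # (2, 4), i.e. L by N
```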

Diameter of Network

It is the maximum length a shortest path can have (the largest geodesic distance in the connected network). Six degrees of separation: everyone in the world is connected by at most 6 acquaintances, so the diameter of the global social network is 6.

Elastic Net Regression

It is a variable selection method that enjoys the benefits of both lasso and ridge, including both the L1 and L2 norms. The L2 penalty is useful in high-dimensional cases where the number of important variables exceeds the number of observations, and it helps with issues of multicollinearity like ridge regression. The L1 penalty encourages sparsity, i.e. the majority of x's components (weights) are zero and only a few are non-zero.

What are 3 ways of reducing dimensionality?

1. Removing collinear features. 2. Performing PCA, ICA, or other forms of algorithmic dimensionality reduction. 3. Combining features with feature engineering.

Running a binary classification tree algorithm is the easy part. Do you know how tree splitting takes place, i.e. how does the tree decide which variable to split at the root node and succeeding nodes?

A classification tree makes decisions based on the Gini index and node entropy. In simple words, the tree algorithm finds the best possible feature that can divide the data set into the purest possible child nodes. The Gini index says: if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure. We can calculate Gini as follows: calculate Gini for sub-nodes using the formula "sum of squares of the probabilities of success and failure" (p² + q²), then calculate Gini for the split using the weighted Gini score of each node of that split. Entropy is the measure of impurity, given (for a binary class) by Entropy = -p·log₂(p) - q·log₂(q), where p and q are the probabilities of success and failure in that node. Entropy is zero when a node is homogeneous and is maximum when both classes are present in a node at 50%-50%. Lower entropy is desirable.
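
A small sketch of the two node-purity measures described above, for a binary node with success probability p:

```python
import numpy as np

def gini_score(p):
    """p^2 + q^2 as in the card: 1 for a pure node, 0.5 at a 50/50 split."""
    return p**2 + (1 - p)**2

def entropy(p):
    """-p*log2(p) - q*log2(q): 0 for a pure node, 1 at a 50/50 split."""
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(gini_score(0.5), entropy(0.5))   # 0.5 1.0  (most impure node)
print(gini_score(1.0), entropy(1.0))   # 1.0 0.0  (pure node)
```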

Monte Carlo Simulation

A process which generates hundreds or thousands of probable performance outcomes based on probability distributions for cost and schedule on individual tasks. The outcomes are then used to generate a probability distribution for the project as a whole.

What are the advantages and disadvantages of neural networks?

Advantages: Neural networks (specifically deep NNs) have led to performance breakthroughs for unstructured datasets such as images, audio, and video. Their incredible flexibility allows them to learn patterns that no other ML algorithm can learn. Disadvantages: However, they require a large amount of training data to converge. It's also difficult to pick the right architecture, and the internal "hidden" layers are incomprehensible.

Adjacency Matrix

An N×N matrix (N is the number of nodes) where an edge between two nodes is represented by 1, and no edge by 0.

After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, neither of models could perform better than benchmark score. Finally, you decided to combine those models. Though, ensembled models are known to return high accuracy, but you are unfortunate. Where did you miss?

As we know, ensemble learners are based on the idea of combining weak learners to create strong learners. But, these learners provide superior result when the combined models are uncorrelated. Since, we have used 5 GBM models and got no accuracy improvement, suggests that the models are correlated. The problem with correlated models is, all the models provide same information. For example: If model 1 has classified User1122 as 1, there are high chances model 2 and model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built on the premise of combining weak uncorrelated models to obtain better predictions.

Explain bagging.

Bagging, or Bootstrap Aggregating, is an ensemble method in which the dataset is first divided into multiple subsets through resampling. Then, each subset is used to train a model, and the final predictions are made through voting or averaging the component models. Bagging is performed in parallel.

How can you avoid overfitting ?

Overfitting can be avoided by using a lot of data; overfitting tends to happen when you have a small dataset and try to learn from it. But if you have a small dataset and are forced to build a model from it, you can use a technique known as cross validation. In this method the dataset is split into two sections, testing and training datasets: the testing dataset only tests the model, while the training dataset is used to fit it. A model is given a dataset of known data on which training is run (the training data set) and a dataset of unknown data against which the model is tested. The idea of cross validation is to define a dataset to "test" the model during the training phase.

continuous vs discrete vs categorical data

Categorical (eye color, or type of something), Discrete (whole number data that is counted e.g. number of sisters, can't be 1.5 sisters), Continuous ( measured data on a scale e.g. weight 1.23333333 , it can be almost any numeric value)

You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

Chances are, you might be tempted to say no, but that would be incorrect. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated. For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance that it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variables, which is misleading.

What is the difference between covariance and correlation?

Correlation is the standardized form of covariance. Covariances are difficult to compare. For example: if we calculate the covariances of salary ($) and age (years), we'll get different covariances which can't be compared because of their unequal scales. To combat this, we calculate correlation to get a value between -1 and 1, irrespective of the respective scales.

k-fold cross validation

The data is divided into k subsets. The holdout (train/test) method is then repeated k times, such that each time one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form the training set. This significantly reduces bias, as we use most of the data for fitting, and also significantly reduces variance, as most of the data is also used in the validation sets. Usually k is 5 or 10.
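
A minimal scikit-learn sketch (the dataset and model are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold serves once as the validation set, the other 4 as training data.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```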

Bias-variance tradeoff

Bias is error due to overly simplistic assumptions (underfitting); variance is error due to excess complexity (overfitting, fitting too much noise). The bias-variance decomposition splits the error into these components, and the goal is an optimally reduced amount of error with neither high bias nor high variance.

While working on a data set, how do you select important variables? Explain your methods.

Following are the methods of variable selection you can use: 1. Remove the correlated variables prior to selecting important variables. 2. Use linear regression and select variables based on p-values. 3. Use forward selection, backward selection, or stepwise selection. 4. Use random forest or XGBoost and plot the variable importance chart. 5. Use lasso regression. 6. Measure information gain for the available set of features and select the top n features accordingly.

Which is more important to you- model accuracy, or model performance?

High accuracy does not always mean the best model performance. For an imbalanced dataset, accuracy is not a valid measure of model performance. For a dataset where the default rate is 5%, even if all the records are predicted as 0, the model will still have an accuracy of 95%. But this model will ignore all the defaults and can be very detrimental to the business, so accuracy is not a right measure for model performance in this scenario. You can choose what to optimize for by looking at the cost of TN, TP, FP, FN. Let's say you are trying to predict whether someone has cancer or not so that they can get further, more invasive tests done: even a small number of FNs can be very bad, and you would want to optimize for a model with the least number of FNs.

What is 'Overfitting' in Machine learning?

In machine learning, 'overfitting' occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting is normally observed when a model is excessively complex, because it has too many parameters with respect to the number of training data points. A model which has been overfit exhibits poor predictive performance.

You've got a data set to work having p (no. of variable) > n (no. of observation). Why is OLS as bad option to work with? Which techniques would be best to use? Why?

In such high dimensional data sets, we can't use classical regression techniques, since their assumptions tend to fail. When p > n, we can no longer calculate a unique least squares coefficient estimate; the variances become infinite, so OLS cannot be used at all. To combat this situation, we can use penalized regression methods like lasso, LARS, or ridge, which can shrink the coefficients to reduce variance. Precisely, ridge regression works best in situations where the least squares estimates have higher variance. Other methods include subset regression and forward stepwise regression.

How is principal component ( linearly uncorrelated variables ) calculated?

It is given by eigenvectors (with largest eigenvalues) of covariance matrix.

You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

Low bias occurs when the model's predicted values are near the actual values; in other words, the model becomes flexible enough to mimic the training data distribution. While it sounds like a great achievement, do not forget that a flexible model has no generalization capabilities: when this model is tested on unseen data, it gives disappointing results. In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance problem. Bagging algorithms divide a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression). Also, to combat high variance, we can: use a regularization technique, where higher model coefficients get penalized, hence lowering model complexity; or use the top n features from the variable importance chart. Maybe, with all the variables in the data set, the algorithm is having difficulty in finding the meaningful signal.

Why is "Naive" Bayes naive?

Naive Bayes (NB) is 'naive' because it makes the assumption that features of a measurement are independent of each other. This is naive because it is (almost) never true

What cross validation technique would you use on time series data set? Is it k-fold or LOOCV?

Neither. In time series problem, k fold can be troublesome because there might be some pattern in year 4 or 5 which is not in year 3. Resampling the data set will separate these trends, and we might end up validation on past years, which is incorrect. Instead, we can use forward chaining strategy with 5 fold as shown below: fold 1 : training [1], test [2] fold 2 : training [1 2], test [3] fold 3 : training [1 2 3], test [4] fold 4 : training [1 2 3 4], test [5] fold 5 : training [1 2 3 4 5], test [6] where 1,2,3,4,5,6 represents "year".
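
scikit-learn's TimeSeriesSplit implements this forward-chaining scheme; a sketch with 6 dummy "years":

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)        # pretend each row is one "year"
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {fold}: training {train_idx + 1}, test {test_idx + 1}")
# fold 1: training [1], test [2] ... fold 5: training [1 2 3 4 5], test [6]
```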

Degree of a Node

Number of edges connected (adjacent) to a node

What's the difference between probability and likelihood?

Probability = area under fixed distribution. Likelihood = y-axis values for fixed data points with distributions that can be moved.

You've built a random forest model with 10000 trees. You got delighted after getting training error as 0.00. But, the validation error is 34.23. What is going on? Haven't you trained your model perfectly?

The model has overfitted. A training error of 0.00 means the classifier has mimicked the training data patterns to such an extent that they are not present in the unseen data. Hence, when this classifier was run on an unseen sample, it couldn't find those patterns and returned predictions with higher error. In random forest, this happens when we use a larger number of trees than necessary. Hence, to avoid this situation, we should tune the number of trees using cross validation.

You are working on a time series data set. You manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you tried a time series regression model and got higher accuracy than decision tree model. Can this happen? Why?

Time series data is known to possess linearity. On the other hand, a decision tree algorithm is known to work best at detecting non-linear interactions. The reason the decision tree failed to provide robust predictions is that it couldn't map the linear relationship as well as a regression model did. Therefore, we learned that a linear regression model can provide robust predictions if the data set satisfies its linearity assumptions.

After analyzing the model, your manager has informed that your regression model is suffering from multicollinearity. How would you check if he's true? Without losing any information, can you still build a better model?

To check multicollinearity, we can create a correlation matrix to identify and remove variables having correlation above 75% (deciding a threshold is subjective). In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of multicollinearity. But removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Also, we can add some random noise to the correlated variables so that the variables become different from each other. But adding noise might affect the prediction accuracy, hence this approach should be used carefully.
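
A sketch of the VIF check with statsmodels (the data is synthetic; x1 and x2 are deliberately near-duplicates):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),   # nearly collinear with x1
    "x3": rng.normal(size=200),
})

vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, np.round(vifs, 1))))    # x1 and x2 get large VIFs; x3 stays near 1
```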

You are given a data set consisting of variables having more than 30% missing values? Let's say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

We can deal with them in the following ways: assign a unique category to the missing values (who knows, the missing values might decipher some trend); remove them blatantly; or sensibly check their distribution with the target variable, and if we find any pattern, keep those missing values and assign them a new category while removing the others.

You have built a multiple regression model. Your model R² isn't as good as you wanted. For improvement, you remove the intercept term and your model R² becomes 0.8 from 0.3. Is it possible? How?

Yes, it is possible. We need to understand the significance of the intercept term in a regression model. The intercept term represents the model prediction without any independent variable, i.e. the mean prediction. The formula is R² = 1 - ∑(y - ŷ)² / ∑(y - ȳ)², where ŷ is the predicted value and ȳ is the mean. When the intercept term is present, R² evaluates your model with respect to the mean model. In the absence of the intercept term, the denominator becomes ∑y² instead of ∑(y - ȳ)²; with this larger denominator, the ratio ∑(y - ŷ)² / ∑y² becomes smaller than it should be, resulting in a higher R².

Type 1 error

Rejecting the null hypothesis when it is true (false positive, e.g. rejecting a new item-add request when it should be accepted).

What are the advantages and disadvantages of k-nearest neighbors?

Advantages: k-nearest neighbors has a nice intuitive explanation and tends to work very well for problems where comparables are inherently indicative. For example, you could build a kNN housing price model using other houses in the area with a similar number of bedrooms, floor space, etc. Disadvantages: it is memory-intensive. It also does not have built-in feature selection or regularization, so it does not handle high dimensionality well.

How can you choose a classifier based on training set size?

If training set is small, high bias / low variance models (e.g. Naive Bayes) tend to perform better because they are less likely to be overfit. If training set is large, low bias / high variance models (e.g. Logistic Regression) tend to perform better because they can reflect more complex relationships.

stratified k-fold cross validation

In some cases, there may be a large imbalance in the response variables. For example, in dataset concerning price of houses, there might be large number of houses having high price. Or in case of classification, there might be several times more negative samples than positive samples. For such problems, a slight variation in the K Fold cross validation technique is made, such that each fold contains approximately the same percentage of samples of each target class as the complete set, or in case of prediction problems, the mean response value is approximately equal in all the folds.

What is the difference between heuristic for rule learning and heuristics for decision trees?

The difference is that the heuristics for decision trees evaluate the average quality of a number of disjointed sets while rule learners only evaluate the quality of the set of instances that is covered with the candidate rule.

You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

This question has enough hints for you to start thinking! Since, the data is spread across median, let's assume it's a normal distribution. We know, in a normal distribution, ~68% of the data lies in 1 standard deviation from mean (or mode, median), which leaves ~32% of the data unaffected. Therefore, ~32% of the data would remain unaffected by missing values.

Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how?

Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables.

When is Ridge regression favorable over Lasso regression?

You can quote ISLR's authors Hastie, Tibshirani who asserted that, in presence of few variables with medium / large sized effect, use lasso regression. In presence of many variables with small / medium sized effect, use ridge regression. Conceptually, we can say, lasso regression (L1) does both variable selection and parameter shrinkage, whereas Ridge regression only does parameter shrinkage and end up including all the coefficients in the model. In presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least square estimates have higher variance. Therefore, it depends on our model objective.

We know that one hot encoding increases the dimensionality of a data set, but label encoding doesn't. How?

Don't get baffled by this question. It's a simple question asking the difference between the two. With one hot encoding, the dimensionality (a.k.a. the number of features) in a data set increases because it creates a new variable for each level present in a categorical variable. For example: let's say we have a variable 'color' with 3 levels, namely Red, Blue and Green. One hot encoding the 'color' variable will generate three new variables, Color.Red, Color.Blue and Color.Green, containing 0 and 1 values. In label encoding, the levels of a categorical variable get encoded as integers (e.g. 0 and 1), so no new variable is created. Label encoding is mostly used for binary variables.
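
A quick pandas sketch of the difference (the toy 'color' column mirrors the example above):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

one_hot = pd.get_dummies(df["color"], prefix="Color")   # 3 new 0/1 columns -> wider data set
label = df["color"].astype("category").cat.codes        # one integer column -> same width

print(one_hot)
print(label)
```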

You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?

If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance because 96% (as given) might only be predicting the majority class correctly, but our class of interest is the minority class (4%), which is the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use sensitivity (true positive rate), specificity (true negative rate), and the F measure to determine the class-wise performance of the classifier. If the minority class performance is found to be poor, we can undertake the following steps: we can use undersampling, oversampling or SMOTE to make the data balanced; we can alter the prediction threshold value by doing probability calibration and finding an optimal threshold using the AUC-ROC curve; we can assign weights to classes such that the minority classes get larger weights; we can also use anomaly detection.

Explain likelihood in context of naiveBayes algorithm?

Likelihood is the probability of classifying a given observation as 1 in presence of some other variable. For example: The probability that the word 'FREE' is used in previous spam message is likelihood.

What is deep learning, and how does it contrast with other machine learning algorithms?

ML algorithms learn from data to make intelligent decisions, whereas deep learning structures algorithms in layers to create an "artificial neural network" that can learn and make intelligent decisions on its own.

In what areas Pattern Recognition is used?

Pattern recognition can be used in a) computer vision, b) speech recognition, c) data mining, d) statistics, e) information retrieval, f) bio-informatics.

Accuracy

Percentage of predictions that were correct: (TP + TN) / Total

Explain prior probability in context of naiveBayes algorithm?

Prior probability is nothing but, the proportion of dependent (binary) variable in the data set. It is the closest guess you can make about a class, without any further information. For example: In a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0 (not spam) is 30%. Hence, we can estimate that there are 70% chances that any new email would be classified as spam.

What is the ROC Curve and what is AUC (a.k.a. AUROC)?

The ROC (receiver operating characteristic) curve is the performance plot for binary classifiers, plotting True Positive Rate (y-axis) vs. False Positive Rate (x-axis). AUC is the area under the ROC curve, and it's a common performance metric for evaluating binary classification models. It's equivalent to the expected probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
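
A minimal scikit-learn sketch (the labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # predicted probabilities of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points of the ROC curve
print(roc_auc_score(y_true, y_score))                  # area under that curve
```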

Both being tree based algorithm, how is random forest different from Gradient boosting algorithm (GBM)?

The fundamental difference is that random forest uses the bagging technique to make predictions, while GBM uses boosting techniques. In bagging, a data set is divided into n samples using randomized sampling. Then, using a single learning algorithm, a model is built on each sample. Later, the resulting predictions are combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, such that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached. Random forest improves model accuracy by reducing variance (mainly); the trees grown are uncorrelated to maximize the decrease in variance. On the other hand, GBM improves accuracy by reducing both bias and variance in a model.

How much data should you allocate for your training, validation, and test sets?

You have to find a balance, and there's no right answer for every problem. If your test set is too small, you'll have an unreliable estimation of model performance (performance statistic will have high variance). If your training set is too small, your actual model parameters will have high variance. A good rule of thumb is to use an 80/20 train/test split. Then your train set can be further split into train/validation or into partitions for cross-validation.

Problems with high dimension data

1. If we have more features than observations than we run the risk of massively overfitting our model 2. When we have too many features, observations become harder to cluster — believe it or not, too many dimensions cause every observation in your dataset to appear equidistant from all the others. If the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can be formed.

problem with sparsity of matrix

1. takes space to store values(zeros) which do not have any valuable information 2. large matrix will increase computation time

Tree

A connected graph with no cycles.

What's the difference between a generative and discriminative model?

A generative model will learn categories of data while a discriminative model will simply learn the distinction between different categories of data. Discriminative models will generally outperform generative models on classification tasks. Both are used in supervised learning, where you want to learn a rule that maps input x to output y, given a number of training examples of the form {(xᵢ, yᵢ)}. A generative model (e.g., naive Bayes) explicitly models the joint probability distribution p(x, y) and then uses Bayes' rule to compute p(y|x). On the other hand, a discriminative model (e.g., logistic regression) directly models p(y|x).

Why is Area Under ROC Curve (AUROC) better than raw accuracy as an out-of- sample evaluation metric?

AUROC is robust to class imbalance, unlike raw accuracy. For example, if you want to detect a type of cancer that's prevalent in only 1% of the population, you can build a model that achieves 99% accuracy by simply classifying everyone as cancer-free.

Rise in global average temperature led to decrease in number of pirates around the world. Does that mean that decrease in number of pirates caused the climate change?

After reading this question, you should have understood that this is a classic case of "causation vs. correlation". No, we can't conclude that the decrease in the number of pirates caused the climate change, because there might be other factors (lurking or confounding variables) influencing this phenomenon. Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can't say that pirates died because of the rise in global average temperature.

What is the difference between artificial intelligence and machine learning?

Designing and developing algorithms that learn behaviours from empirical data is known as machine learning. Artificial intelligence, in addition to machine learning, also covers other aspects like knowledge representation, natural language processing, planning, robotics, etc.

You are working on a classification problem. For validation purposes, you've randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you get shocked after getting poor test accuracy. What went wrong?

In the case of a classification problem, we should always use stratified sampling instead of random sampling. Random sampling doesn't take into consideration the proportion of the target classes. On the contrary, stratified sampling helps maintain the distribution of the target variable in the resulting samples as well.

Closeness Centrality

It indicates how close a node is to all other nodes in the network (use case - best city to build airport)

Why overfitting happens?

The possibility of overfitting exists because the criterion used for training the model is not the same as the criterion used to judge the efficacy of the model.

Why are ensemble methods superior to individual models?

They average out biases, reduce variance, and are less likely to overfit. There's a common line in machine learning which is: "ensemble and get 2%." This implies that you can build your models as usual and typically expect a small performance boost from ensembling.

You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?

Tolerance (1 / VIF) is used as an indicator of multicollinearity. It indicates the percentage of variance in a predictor which cannot be accounted for by the other predictors. Large values of tolerance are desirable. We will consider adjusted R² as opposed to R² to evaluate model fit, because R² increases irrespective of improvement in prediction accuracy as we add more variables, whereas adjusted R² only increases if an additional variable improves the accuracy of the model, and otherwise stays the same. It is difficult to commit to a general threshold value for adjusted R² because it varies between data sets. For example, a gene mutation data set might result in a lower adjusted R² and still provide fairly good predictions, as compared to stock market data where a lower adjusted R² implies that the model is not good.

Precision

True positives / number of predicted positives = TP / (TP + FP)

How can we check if a network is a DAG?

We can check if a network is a DAG by computing the eigenvalues of its adjacency matrix: a directed graph is acyclic exactly when all of those eigenvalues are zero (the adjacency matrix is nilpotent).

In k-means or kNN, we use euclidean distance to calculate the distance between nearest neighbors. Why not manhattan distance ?

We don't use manhattan distance because it calculates distance horizontally or vertically only. It has dimension restrictions. On the other hand, euclidean metric can be used in any space to calculate distance. Since, the data points can be present in any dimension, euclidean distance is a more viable option. Example: Think of a chess board, the movement made by a bishop or a rook is calculated by manhattan distance because of their respective vertical & horizontal movements.

Word Embeddings

Word tokens in a dense vector space (~few hundred real numbers), where the location and distance between words indicates how similar they are semantically.

If you split your data into train/test splits, is it still possible to overfit your model?

Yes, it's definitely possible. One common beginner mistake is re-tuning a model or training new models with different parameters after seeing its performance on the test set. In this case, it's the model selection process that causes the overfitting. The test set should not be tainted until you're ready to make your final selection.

You are assigned a new project which involves helping a food delivery company save more money. The problem is, company's delivery team aren't able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, they end up delivering food for free. Which machine learning algorithm can save them?

You might have started hopping through the list of ML algorithms in your mind. But wait! Such questions are asked to test your machine learning fundamentals. This is not a machine learning problem; it is a route optimization problem. A machine learning problem consists of three things: there exists a pattern; you cannot solve it mathematically (even by writing exponential equations); and you have data on it. Always look for these three factors to decide if machine learning is a tool to solve a particular problem.

Considering the long list of machine learning algorithm, given a data set, how do you decide which one to use?

You should say that the choice of machine learning algorithm solely depends on the type of data. If you are given a data set which exhibits linearity, then linear regression would be the best algorithm to use. If you are given images or audio, then a neural network would help you build a robust model. If the data comprises non-linear interactions, then a boosting or bagging algorithm should be the choice. If the business requirement is to build a model which can be deployed, then we'll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM etc. In short, there is no one master algorithm for all situations. We must be scrupulous enough to understand which algorithm to use.

Principal Component Analysis (PCA)

a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set (find principal patterns)

Directed Acyclic Graph (DAG)

a directed graph with no directed cycles.

F1 Score

harmonic mean of precision and recall: 2 × (Precision × Recall) / (Precision + Recall)
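
Precision, recall and F1 can be checked together with scikit-learn (the labels below are toy values):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f = f1_score(y_true, y_pred)          # 2 * p * r / (p + r)
print(p, r, f)                        # 0.75 0.75 0.75 for these labels
```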

What is Model Selection in Machine Learning?

The process of selecting among different mathematical models that are used to describe the same data set is known as model selection. Model selection is applied in the fields of statistics, machine learning and data mining.

Eigenvector Centrality

how well-connected are those I'm connected to (gives each node a score proportional to the sum of the score of its neighbors; start calculations with a node by guessing a score say 1)

How is KNN different from k-means clustering?

kNN is a supervised learning algorithm: it finds the k nearest neighbours (labeled data) and uses voting to decide which class a point belongs to. k shouldn't be even or a multiple of the number of classes, to avoid ties. kNN can be slow on large datasets because of the search complexity. K-means clustering is unsupervised learning: we choose k random centroids and find the nearest centroid for each element. After each iteration we take the mean of all points assigned to each centroid, obtain a new centroid, and cluster elements around that centroid again. We do this until none of the cluster assignments change.

