DS Interview Questions
10. What's the difference between a generative and discriminative model?
A generative model learns the distribution that the data of each class comes from, then, given a new data point, decides which class is most probable. Examples: Naive Bayes and hidden Markov models. A discriminative model learns rules (a decision boundary) that separate the data into class predictions, and uses these rules to predict the class of a new data point. Examples: KNN, SVM, decision trees, and logistic regression.
73. Can you explain the difference between a test set and a validation set?
A validation set is held out from training and used to tune hyperparameters and choose between models. A test set is used only at the end, to measure how well your final model reacts to new data.
66. What are categorical variables?
Categorical variables are ones which can only take on a finite set of values. If there is an ordering to the values it is ordinal data, and if there isn't it is called nominal data.
You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Would you remove correlated variables before using PCA? Why? hint: duplicate variables?
Chances are, you might be tempted to say no, but that would be incorrect. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.
24. When is Ridge regression favorable over Lasso regression?
Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least squares estimates have higher variance. Therefore, it depends on our model objective.
55. What is the difference between squared error and absolute error? Which one is better?
Squared error squares each residual, so it is much more sensitive to outliers than absolute error. Neither is better in general: it depends on whether you want to weight outliers heavily.
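To make the outlier sensitivity concrete, here is a minimal sketch with made-up residuals (the list `errors` and the helper names are assumptions for illustration only). It compares how much of the total squared error and the total absolute error comes from a single large residual.

```python
# Hypothetical residuals: four small errors and one outlier (20.0).
errors = [1.0, -1.0, 2.0, -2.0, 20.0]

def mean_squared_error(errs):
    """Average of the squared residuals."""
    return sum(e ** 2 for e in errs) / len(errs)

def mean_absolute_error(errs):
    """Average of the absolute residuals."""
    return sum(abs(e) for e in errs) / len(errs)

mse = mean_squared_error(errors)   # 82.0
mae = mean_absolute_error(errors)  # 5.2

# Share of each total that is contributed by the outlier alone.
outlier_share_squared = 20.0 ** 2 / sum(e ** 2 for e in errors)   # about 0.98
outlier_share_absolute = 20.0 / sum(abs(e) for e in errors)       # about 0.77
```

The outlier accounts for almost all of the squared error but a smaller share of the absolute error, which is why a model fit with squared error bends further toward outliers.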
46. Why does overfitting happen?
The model has been trained on variance (noise) in the training sample that is not seen in the population.
57. Can you write the formula to calculate R-squared?
R-squared = 1 - (sum of squared errors) / (sum of squared deviations from the average)
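As a worked example, the sketch below computes R-squared with the standard definition 1 - SS_res / SS_tot. The function name `r_squared` and the sample values are assumptions chosen for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared errors of the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Sum of squared errors of always predicting the average.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
score = r_squared(y_true, y_pred)  # 1 - 0.5 / 5.0 = 0.9
```

A perfect model scores 1, and a model that always predicts the average scores 0.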
50. What are the advantages of Naive Bayes?
Advantages of Naive Bayes: Super simple, you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data.
1. What's the trade-off between bias and variance?
All training will decrease bias. When training a model, you learn from the training set. The part of what the model learns that generalizes to the population decreases variance, and the part that does not generalize to the population increases variance. At first, most of what the model learns from the training set generalizes, so bias and variance both go down. At some point, though, most of what the model learns no longer generalizes to the population, so the bias still goes down but the variance goes up!
17. What are your favorite use cases of machine learning models?
Answer this one yourself. The key is enthusiasm, and having an answer prepared. Preferably the answer will be related to their business but that is not necessary.
79. Give us a common application of machine learning that you get to see on a daily basis?
Any recommendation on the internet, such as YouTube recommendations or Amazon item recommendations.
58. Suppose that we want to estimate the uncertainty in the prediction of a linear regression. Let's say we use bootstrapping to estimate the uncertainty (possible values) of β based on the available data. For example, we estimate that it is likely that β0 ∈ [1.2, 1.4], β1 ∈ [2.3, 2.4]. Say you are given a data point xi = 0. Why is it not sufficient to expect that yi ∈ [1.2, 1.4], and why can the variability of the target be greater?
The interval for β0 only describes the uncertainty in the estimated mean response at xi = 0. An individual observation also carries the noise term, yi = β0 + β1 xi + εi, so the range of plausible yi values is wider than the range of plausible β0 values. In addition, the confidence bands of the beta values are usually at the 95% confidence level, so we should expect the true β0 to fall outside of its bootstrapped range approximately 5% of the time.
13. How do you ensure you're not overfitting with a model?
By gathering new data or keeping hold-out data and evaluating the model on data that it has not seen before.
43. Explain what resampling methods are and why they are useful. Also explain their limitations.
Classical parametric statistical tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology which is based upon repeated sampling within the same sample. Resampling refers to methods for doing one of these: (1) estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping); (2) exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests); (3) validating models by using random subsets (bootstrapping, cross-validation).
25. What is the difference between covariance and correlation?
Covariances are difficult to compare. For example, if we calculate the covariance of salary ($) and age (years), the value depends on the units of both variables, so it can't be compared with covariances measured on other scales. To handle this, we calculate the correlation, which gives a value between -1 and 1 irrespective of the scales of the variables.
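The scale dependence can be checked with a small sketch. The ages and salaries below are invented, and `covariance` and `correlation` are hypothetical helper names; the same salaries are expressed once in dollars and once in thousands of dollars.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up data for illustration.
age_years = [25, 32, 41, 50, 58]
salary_usd = [40000, 52000, 61000, 80000, 83000]
salary_kusd = [s / 1000 for s in salary_usd]  # same salaries, different unit

cov_usd = covariance(age_years, salary_usd)
cov_kusd = covariance(age_years, salary_kusd)   # 1000 times smaller
corr_usd = correlation(age_years, salary_usd)
corr_kusd = correlation(age_years, salary_kusd)  # unchanged by the unit
```

Changing the unit rescales the covariance by the same factor, while the correlation stays the same.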
63. What are the basic assumptions to be made for linear regression?
E[epsilon] = 0. The variance of the errors is not a function of the explanatory variables. The error terms are not correlated with each other.
26. While working on a data set, how do you select important variables?
Following are the methods of variable selection you can use: remove the correlated variables prior to selecting important variables; use linear regression and select variables based on p-values; use forward selection, backward selection, or stepwise selection; use random forest or XGBoost and plot a variable importance chart; use lasso regression; measure information gain for the available set of features and select the top n features accordingly.
Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?
For better predictions, a categorical variable can be treated as a continuous variable only when the variable is ordinal in nature.
78. Discuss how you go about feature engineering (look for both intuition and specific evaluation techniques).
I like to encode categorical variables to get meaning out of them, using methods such as one-hot encoding, hash encoding, and frequency encoding. For continuous variables I like to group the data by a categorical variable and then compute statistics of the aggregated groups.
80. How to assess the quality of clustering, especially to know when you have the right number of clusters?
I graph the data and see if there is any obvious pattern being ignored.
69. During analysis, how do you treat missing values?
I would first consider if I think that there is a reason why certain values are missing. If I think there is a pattern to the missing values I may try
How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
I would look for problems with correlation between variables, and make sure that the standard errors of the coefficients are low enough, to see if there is instability in the model. After this I would look at metrics such as MSE and MAE to evaluate the predictive power.
60. How do data management procedures like missing data handling make selection bias worse?
If the fact that the data is missing tells you something, this information will be lost if you impute. For example, individuals who know or suspect that they are HIV positive are less likely to participate in HIV surveys. Missing data handling will increase this effect, because the imputed values are based on the observed survey responses, most of which are negative.
12. When should you use classification over regression?
If the value you are trying to predict can only take on certain values.
16. How do you handle missing or corrupted data in a dataset?
If you do not think the fact that the data is missing is important and you have enough data, it may be fine to just remove the rows with missing values. If you think you can accurately guess what the missing value would be based on the values you have, you can impute the values. You can also go and collect the information you are lacking.
18. Is rotation necessary in PCA? If yes, Why? What will happen if you don't rotate the components?
If you mean rotation to find the next axis, as the eigenvectors are orthogonal, then yes. If you mean rotation of the eigenvectors in space to simplify the model, then no, and it is not PCA: the rotated principal axes would not be orthogonal anymore, and orthogonal projections on them do not make sense.
4. Explain how a ROC curve works.
A ROC curve illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is obtained by plotting the True Positive Rate against the False Positive Rate at each threshold.
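The threshold sweep can be sketched in a few lines. This is a minimal illustration rather than a library implementation: `roc_points`, `auc`, and the labels and scores are all assumptions made up for the example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, highest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)  # 8/9 for this data
```

Lowering the threshold moves along the curve from (0, 0) to (1, 1); the area summarizes how well the scores rank positives above negatives.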
64. What do you understand by conjugate-prior with respect to Naive Bayes?
In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.
You are working on a classification problem. For validation purposes, you've randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you get shocked after getting poor test accuracy. What went wrong?
In a classification problem, we should use stratified sampling instead of random sampling. Random sampling doesn't take into consideration the proportion of the target classes. Stratified sampling, on the contrary, maintains the distribution of the target variable in the resulting samples.
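A stratified split can be sketched by splitting each class separately. The helper `stratified_split`, its parameters, and the 80/20 label mix below are assumptions for illustration, not a specific library's API.

```python
import random
from collections import defaultdict

def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Split indices so that each class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random choice within the class
        cut = round(len(indices) * validation_fraction)
        validation.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(validation)

# Hypothetical imbalanced target: 80 negatives, 20 positives.
labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(labels)

val_positive_share = sum(labels[i] for i in val_idx) / len(val_idx)      # 0.2
train_positive_share = sum(labels[i] for i in train_idx) / len(train_idx)  # 0.2
```

Both parts keep the original 20% positive rate, which a purely random split does not guarantee.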
38. What is the EM algorithm? Give a couple of applications
In statistics, an expectation-maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
28. You've got a data set to work with having p (no. of variables) > n (no. of observations). Why is OLS a bad option to work with? Which techniques would be best to use? Why?
In such high dimensional data sets, we can't use classical regression techniques, since their assumptions tend to fail. When p > n, we can no longer calculate a unique least squares coefficient estimate, and the variances become infinite, so OLS cannot be used at all. To combat this situation, we can use penalized regression methods like lasso, LARS, and ridge, which can shrink the coefficients to reduce variance.
You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?
In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance problem. Bagging algorithms build many samples of the data set by repeated randomized sampling with replacement. These samples are then used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression).
49. What is a classifier in machine learning?
A classifier is an algorithm that assigns an observation to one of a set of categories. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available.
44. Is it better to have too many false positives, or too many false negatives? Explain.
It depends on the question as well as on the domain for which we are trying to solve the question.
81. Explain what regularization is and why it is useful
It is a change we make to the loss function to penalize an overly complicated model. This stops us from adding variables which do not generalize to the population, which reduces variance.
9. What's the F1 score? How would you use it?
It is an evaluation metric which balances the importance of recall and precision (it is the harmonic mean of recall and precision). The F1 score reaches its best value at 1 and its worst at 0. You would use it when you care about both of these metrics.
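A short sketch of the harmonic mean shows why F1 rewards balance. The function name `f1_score` and the precision and recall values are assumptions picked for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.8, 0.8)   # 0.8
lopsided = f1_score(1.0, 0.1)   # about 0.18
arithmetic_mean = (1.0 + 0.1) / 2  # 0.55, for comparison
```

Perfect precision cannot compensate for very low recall: the lopsided case scores far below the plain average of the two numbers.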
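82. We know that one hot encoding increases the dimensionality of a data set. But, label encoding does not. How?
Label encoding maps the values to integers within a single column, so the number of columns stays the same. One hot encoding creates a new binary column for each distinct value.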
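The difference in dimensionality is easy to see on a toy column. The `colors` list and variable names below are made up for illustration.

```python
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: one integer per value, still a single column.
label_index = {category: i for i, category in enumerate(categories)}
label_encoded = [label_index[c] for c in colors]  # [2, 1, 0, 1]

# One-hot encoding: one binary column per distinct value.
one_hot_encoded = [[1 if c == category else 0 for category in categories]
                   for c in colors]
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Label encoding keeps one column but implies an order between the integers, while one-hot encoding avoids the implied order at the cost of one column per category.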
74. What kind of properties are ideal for a machine learning data set?
It should be representative of the population you are selecting data from.
6. What is Bayes' Theorem? How is it useful in a machine learning context?
It tells us that if there is a conditional dependence between two events, let's call them A and B, then we can learn something about the probability of A happening if we also know that B happened. This is done using the information we have about the dependence relationship: P(A|B) = P(B|A) P(A) / P(B).
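A worked example helps here, using the usual statement of the theorem, P(A|B) = P(B|A) P(A) / P(B). The numbers are hypothetical (a condition with 1% prevalence, a test with 95% sensitivity and a 5% false positive rate), and `posterior` is an illustrative helper name.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) by Bayes' theorem, with P(B) expanded over A and not-A."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

# A = has the condition, B = test is positive (all numbers are assumptions).
p_condition_given_positive = posterior(
    prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
# 0.0095 / (0.0095 + 0.0495), roughly 0.16
```

Even with an accurate test, a positive result leaves the probability of the condition at about 16%, because the prior is so low. Updating a prior with observed evidence in this way is what classifiers such as Naive Bayes do.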
77. Pick an algorithm you like and walk me through the math and then the implementation of it, in pseudo-code.
K-means: initialize k means in random positions. For each data point, find the closest mean. Group the data points by their closest mean. Move each mean to the centroid of the points associated with it. Repeat until no point changes the mean it is associated with during an iteration.
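The steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the names `k_means`, `initial_means`, `seed`, and `max_iter` are assumptions, and the starting means are either supplied by the caller or drawn at random from the data points.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, initial_means=None, seed=0, max_iter=100):
    # Start from the supplied means, or from k randomly chosen data points.
    if initial_means:
        means = list(initial_means)
    else:
        means = random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its closest mean.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster during this iteration
        assignment = new_assignment
        # Move each mean to the centroid of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = centroid(members)
    return means, assignment

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
means, assignment = k_means(points, k=2, initial_means=[(0.0, 0.0), (10.0, 10.0)])
# assignment == [0, 0, 0, 1, 1, 1]
```

With random starting positions the result can depend on the initialization, so in practice the algorithm is often run several times and the best clustering kept.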
76. What is the difference between k-means and EM for Gaussian Mixtures?
K-means makes a hard assignment: each data item belongs entirely to the cluster whose mean is nearest in Euclidean distance. EM for Gaussian mixtures uses statistical methods instead: it computes the probability that each data item belongs to each Gaussian component (a soft assignment) and updates the component means, covariances, and weights to maximize the likelihood.
3. How is KNN different from k-means clustering?
K-means is an unsupervised learning algorithm used for clustering problem whereas KNN is a supervised learning algorithm used for classification and regression problem. Also, although both happen to have a "K" in their name, they are different K's. The "K" in K-means refers to the number of clusters, and the output will have exactly K clusters; while the "K" in KNN refers to the number of nearest points, not the number of classes.
8. Explain the difference between L1 and L2 regularization.
L1 penalizes a model for having a feature at all if it is not very useful. Whereas L2 penalizes a model for having a feature with a large magnitude. So L1 regularization performs variable selection and yields sparse models, but L2 does not.
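56. Why does L1 regularization cause parameter sparsity whereas L2 regularization does not?
L1 penalizes a model for having a parameter at all: the penalty grows at the same rate however small the parameter is, so unhelpful parameters are pushed exactly to zero. L2 penalizes a model for having a parameter with a large magnitude: the penalty on a small parameter is tiny, so parameters shrink toward zero but rarely reach it.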
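The effect can be seen in the simplest possible setting, a single coefficient with unpenalized estimate z. This sketch assumes the one-dimensional objectives 0.5*(w - z)^2 + lam*|w| for L1 and 0.5*(w - z)^2 + 0.5*lam*w^2 for L2, whose minimizers have closed forms; the function names and numbers are illustrative.

```python
def l1_solution(z, lam):
    """argmin over w of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero

def l2_solution(z, lam):
    """argmin over w of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)

unpenalized = [3.0, 0.4, -0.2, -2.5]  # hypothetical unpenalized coefficients
lam = 0.5

l1_weights = [l1_solution(z, lam) for z in unpenalized]  # [2.5, 0.0, 0.0, -2.0]
l2_weights = [l2_solution(z, lam) for z in unpenalized]  # all shrunk, none zero
```

L1 subtracts a fixed amount and zeroes anything smaller than the penalty, while L2 scales every coefficient by the same factor and leaves all of them non-zero.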
2. What is the difference between supervised and unsupervised machine learning?
Labels. In supervised ML we have labeled examples and try to use these examples to improve our predictions. In unsupervised ML we have NO labeled examples, so we try to make predictions by modeling patterns in the data.
45. Lasso regression uses the L1-norm of coefficients as a penalty term, while ridge regression uses the L2-norm. Which of these regularization methods is more likely to result in sparse solutions?
Lasso, as it removes variables completely if they are not important enough.
15. How would you evaluate a logistic regression model?
Logistic regression aims to either classify or predict the value. You can look at goodness-of-fit metrics such as the likelihood ratio test (how much better this model does than one with fewer predictors), or you can look at prediction scores such as AUC or true positive rate. The important thing is that our choice matches the goal we have.
65. What is the difference between Bayesian Inference and Maximum Likelihood Estimation (MLE)?
Maximum likelihood estimate: with MLE, we seek a point value for θ which maximizes the likelihood, p(D|θ). We can denote this value as θ̂. In MLE, θ̂ is a point estimate, not a random variable. In other words, in Bayes' rule, p(θ|D) = p(D|θ) p(θ) / p(D), MLE treats the term p(θ) / p(D) as a constant and does NOT allow us to inject our prior beliefs, p(θ), about the likely values for θ into the estimation calculations. Bayesian estimate: Bayesian estimation, by contrast, fully calculates (or at times approximates) the posterior distribution p(θ|D). Bayesian inference treats θ as a random variable. In Bayesian estimation, we put in probability density functions and get out probability density functions, rather than a single point as in MLE.
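A coin-flip example makes the contrast concrete. This sketch assumes Bernoulli data with a Beta prior, where the posterior is again a Beta distribution; the helper names, the Beta(2, 2) prior, and the 7-out-of-10 data are all made up for illustration.

```python
def bernoulli_mle(successes, trials):
    """MLE of the success probability: a single point estimate."""
    return successes / trials

def beta_posterior(alpha, beta, successes, trials):
    """Beta(alpha, beta) prior plus Bernoulli data gives a Beta posterior."""
    return alpha + successes, beta + (trials - successes)

successes, trials = 7, 10  # hypothetical data: 7 heads in 10 flips

theta_hat = bernoulli_mle(successes, trials)  # 0.7

# Prior belief that the coin is roughly fair, encoded as Beta(2, 2).
post_alpha, post_beta = beta_posterior(2, 2, successes, trials)  # Beta(9, 5)
posterior_mean = post_alpha / (post_alpha + post_beta)           # 9/14, about 0.64
```

MLE returns one number, while the Bayesian update returns a whole distribution, Beta(9, 5), whose mean is pulled from 0.7 toward the prior mean of 0.5.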
71. Can you cite some examples where a false negative is more important than a false positive?
Medical tests, we want to be sure to catch everyone that may have the disease.
11. Which is more important to you- model accuracy, or model performance?
Model performance, as it is customized to fit a business use case. For example we may be willing to sacrifice model accuracy to reduce false negatives in a medical test.
52. What is multicollinearity and how you can overcome it?
Multicollinearity occurs when independent variables in a regression model are correlated. If you have only moderate multicollinearity, you may not need to resolve it. Otherwise you can: center the independent variables to reduce structural multicollinearity; remove some of the highly correlated independent variables; linearly combine the independent variables, such as adding them together; or perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.
70. Can you cite some examples where a false positive is more important than a false negative?
My favorite example is in recommendation systems, we want to be sure those we suggest are a good fit for the user
7. Why is "Naive" Bayes naive?
Naive Bayes assumes that the features are conditionally independent of one another given the class. It is naive because this may or may not be true, but it is assumed to be true.
27. Rise in global average temperature led to decrease in number of pirates around the world. Does that mean that decrease in number of pirates caused the climate change?
No, correlation does not imply causation
36. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement.
OLS and maximum likelihood are the methods used by the respective regression methods to approximate the unknown parameter (coefficient) values. In simple words, ordinary least squares (OLS) is a method used in linear regression which chooses the parameters that give the minimum distance between actual and predicted values. Maximum likelihood chooses the values of the parameters under which the observed data are most likely.
42. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
One common way to do this is through A/B testing, where both versions of the algorithm are kept running in similar environments for a considerably long time and real-life input data is randomly split between the two. This approach is particularly common in web analytics.
62. What do you understand by outliers and inliers? What would you do if you find them in your dataset?
An outlier is defined as an observation that deviates so much from other observations that it arouses suspicion that it was generated by a different mechanism. An inlier, on the other hand, is defined as an observation that is explained by the underlying probability density function.
5. Define precision and recall.
Precision = (true positives) / (true positives + false positives). Intuition: if you say something is positive, precision is the probability that it really is positive. Recall = (true positives) / (true positives + false negatives). Intuition: if something is positive, recall is the probability that you say it is positive.
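Both definitions can be computed directly from labels and predictions. The function name `precision_recall` and the sample vectors are assumptions for illustration.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
# precision = 2/3 (2 of the 3 positive predictions are right)
# recall    = 1/2 (2 of the 4 actual positives were caught)
```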
39. What methods for dimensionality reduction do you know and how do they compare with each other?
Principal component analysis (PCA): the main linear technique for dimensionality reduction. Non-negative matrix factorization (NMF): decomposes a non-negative matrix into the product of two non-negative ones, which has been a promising tool in fields where only non-negative signals exist, such as astronomy. Kernel PCA: principal component analysis employed in a nonlinear way by means of the kernel trick. Linear discriminant analysis (LDA): a generalization of Fisher's linear discriminant. I would choose a method based on whether I think the dependence between variables is nonlinear.
19. Explain prior probability, likelihood and marginal likelihood in context of naive Bayes algorithm?
Prior - probability of a class being chosen. Likelihood - probability of the predictor given the class. Marginal likelihood - probability of the predictor over all classes.
35. When does regularization become necessary in machine learning?
Regularization becomes necessary when the model begins to overfit. This technique introduces a cost term in the objective function for bringing in more features. Hence, it tries to push the coefficients for many variables to zero and so reduce the cost term. This helps to reduce model complexity so that the model can become better at predicting (generalizing).
61. What are the advantages and disadvantages of using regularization methods like Ridge Regression?
Ridge regression trades variance for bias. That is, the output from ridge regression is not unbiased. That's a bad thing: ordinarily, you would not want biased estimators. However, when there is collinearity in the data, you may judge that it is worth it in order to lower the variance of those estimators.
53. What is the curse of dimensionality?
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
75. Will the solution of OLS (ordinary least squares = linear regression) on log odds be the same as logistic regression? Which one is better and why?
Logistic regression will be better. The errors of OLS are guaranteed to fail the assumption of normality, so the predictions of OLS will be biased.
47. What is "training set" and "test set"?
The model is initially fit on a training dataset, that is, a set of examples used to fit the parameters. Finally, the test dataset is a dataset used to provide an unbiased evaluation of a final model fit on the training dataset.
48. What is the difference between "unsupervised learning" and "supervised learning"?
The primary difference between supervised learning and unsupervised learning is the data used in either method of machine learning. It is worth noting that both methods of machine learning require data, which they will analyze to produce certain functions or data groups. However, the input data used in supervised learning is well known and is labeled. This means that the machine is only tasked with the role of determining the hidden patterns from already labeled data. However, the data used in unsupervised learning is not known nor labeled. It is the work of the machine to categorize and label the raw data before determining the hidden patterns and functions of the input data.
59. What do you understand by feature vectors?
They are vectors of values that describe an object's important characteristics.
67. How can outlier values be treated?
They can be treated in different ways: thrown out, left in, or modified in some way. The important thing is that there is a reason you take a certain action regarding outliers. For example, if you think the data point is due to a mistyped number, you may change it to the number you think it was supposed to be.
You have been asked to evaluate a regression model based on R-squared, adjusted R-squared. What will be your criteria?
Tolerance (1 / VIF) is used as an indicator of multicollinearity. It indicates the percentage of variance in a predictor which cannot be accounted for by the other predictors. Large values of tolerance are desirable. We will consider adjusted R² as opposed to R² to evaluate model fit, because R² increases as we add more variables irrespective of any improvement in prediction accuracy. Adjusted R², by contrast, only increases if an additional variable improves the accuracy of the model; otherwise it stays the same. It is difficult to commit to a general threshold value for adjusted R² because it varies between data sets. For example, a gene mutation data set might result in a lower adjusted R² and still provide fairly good predictions, whereas in stock market data a lower adjusted R² implies that the model is not good.
22. How are True Positive Rate and Recall related? Write the equation.
True Positive Rate = Recall. Yes, they are equal, both having the formula TP / (TP + FN).
Linear regression model is generally evaluated using adjusted R-squared or F-value. How would you evaluate a logistic regression model?
We can use the following methods: (1) Since logistic regression is used to predict probabilities, we can use the AUC-ROC curve along with the confusion matrix to determine its performance. (2) The analogous metric to adjusted R² in logistic regression is AIC. AIC is a measure of fit which penalizes the model for the number of model coefficients, so we always prefer the model with the minimum AIC value. (3) Null deviance indicates the response predicted by a model with nothing but an intercept; the lower the value, the better the model. Residual deviance indicates the response predicted by a model on adding independent variables; the lower the value, the better the model.
In k-means or kNN, we use euclidean distance to calculate the distance between nearest neighbors. Why not manhattan distance ?
Manhattan distance measures distance along the axes only (horizontally or vertically), by summing the absolute differences of the coordinates. Euclidean distance measures the straight-line distance in any direction. Since the data points can differ along every dimension at once, Euclidean distance is usually the more natural default, although Manhattan distance is also a valid choice for kNN. Example: think of a chess board. The movement made by a rook is measured by Manhattan distance because it moves only vertically and horizontally.
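For reference, both metrics can be computed from the coordinate differences of two points. The function names and sample points below are assumptions for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
    """Axis-aligned distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

origin, corner = (0, 0), (3, 4)
straight_line = euclidean_distance(origin, corner)  # 5.0
along_axes = manhattan_distance(origin, corner)     # 7

# The same functions accept points with more coordinates.
euclid_3d = euclidean_distance((0, 0, 0), (1, 2, 2))     # 3.0
manhattan_3d = manhattan_distance((0, 0, 0), (1, 2, 2))  # 5
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, and the two agree when the points differ along a single axis.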
41. Explain what precision and recall are. How do they relate to the ROC curve?
What percent of the positive cases did you catch? That is the recall. What percent of positive predictions were correct? That is the precision. The ROC curve represents the relation between sensitivity (recall) and specificity (not precision), and is commonly used to measure the performance of binary classifiers.
72. Can you cite some examples where both false positive and false negatives are equally important?
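When you want to classify an object into one of many classes, false positives and false negatives matter equally: every misclassification is a false negative for the true class and a false positive for the predicted class.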
68. How can you assess a good logistic model?
You could do a likelihood ratio test with all subsets of your variables. Or you can look at a pseudo R^2, or the ROC curve.
54. How do you decide whether your linear regression model fits the data?
You make sure that all of the OLS or multivariate assumptions are met: residuals have constant variance and mean 0, and error terms are uncorrelated. If these are met, you can look at goodness-of-fit statistics such as R squared, RMSE, MSE, and MAE.
Considering the long list of machine learning algorithms, given a data set, how do you decide which one to use?
You should say that the choice of machine learning algorithm depends on the type of data. If you are given a data set which exhibits linearity, then linear regression would be the best algorithm to use. If you are given images or audio to work on, then a neural network would help you to build a robust model. If the data comprises nonlinear interactions, then a boosting or bagging algorithm should be the choice. If the business requirement is to build a model which can be deployed, then we'll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM, etc. In short, there is no one master algorithm for all situations. We must be scrupulous enough to understand which algorithm to use.
14. What evaluation approaches would you use to gauge the effectiveness of a machine learning model?
You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data. You should then implement a choice selection of performance metrics and evaluate the model based on these performance metrics.
51. What is K-means? How can you select K for K-means?
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. To select k, the elbow method looks at the percentage of variance explained as a function of the number of clusters, and picks the point after which adding another cluster gives little further improvement.
