General MLE
32. What is a Recommendation System?
Anyone who has used Spotify or shopped at Amazon will recognize a recommendation system: It's an information filtering system that predicts what a user might want to hear or see based on choice patterns provided by the user.
10. Compare KNN, SVM and Naive Bayes.
11. Decision Tree
Define the terms: GINI Index, Entropy and Information Gain. How will you calculate these terms from a given dataset to select the nodes of the tree?
What is Pruning in a Decision Tree? Define the terms: Bottom-Up Pruning, Top-Down Pruning, Reduced Error Pruning and Cost Complexity Pruning.
What are the advantages and disadvantages of a Decision Tree? Answer
How is a Decision Tree used to solve regression problems?
12. Random Forest
What is Random Forest? How does it reduce the over-fitting problem in decision trees? Answer
What are the advantages and disadvantages of the Random Forest algorithm? Answer
How to choose the optimal number of trees in a Random Forest? Answer
5. What is SVD (Singular Value Decomposition)?
6. Linear Discriminant Analysis
What is LDA (Linear Discriminant Analysis)? How does LDA create a new axis by maximizing the distance between means and minimizing the scatter? What is the formula? What are the similarities and differences between LDA and PCA (Principal Component Analysis)?
7. Multi-Dimensional Scaling
What is Multi-Dimensional Scaling? What is the difference between "Metric" and "Non-metric" MDS? What is PCoA (Principal Coordinate Analysis)? Why should we not use Euclidean Distance in MDS to calculate the distance between variables? How is Log Fold Change used to calculate the distance between two variables in MDS? What are the similarities and differences between MDS and PCA (Principal Component Analysis)? How is it helpful in Dimensionality Reduction?
8. t-SNE (t-Distributed Stochastic Neighbor Embedding)
What is t-SNE (t-Distributed Stochastic Neighbor Embedding)? Answer
Define the terms: Normal Distribution, t-Distribution, Similarity Score, Perplexity
Why is it called t-SNE instead of simple SNE? Why is the t-Distribution used instead of the normal distribution in the lower dimension?
Why should t-SNE not be used on larger datasets containing thousands of features? When should we use a combination of both PCA and t-SNE?
What are the advantages and disadvantages of t-SNE over PCA? Answer
6. What is the difference between Inductive and Deductive Machine Learning?
7. What are the various steps involved in a Machine Learning Process?
Data Exploration and Visualization (4 Questions)
6. Explain the Confusion Matrix with Respect to Machine Learning Algorithms.
A confusion matrix (or error matrix) is a table used to measure the performance of a classification algorithm. It is mostly used in supervised learning; in unsupervised learning, its analogue is called the matching matrix. The confusion matrix has two dimensions, Actual and Predicted, and both dimensions contain the same set of classes.
Consider the binary confusion matrix shown below:
Confusion Matrix
Here, for actual values:
Total Yes = 12 + 1 = 13
Total No = 3 + 9 = 12
Similarly, for predicted values:
Total Yes = 12 + 3 = 15
Total No = 1 + 9 = 10
For a model to be accurate, the values along the diagonal should be high. The total sum of all the values in the matrix equals the total number of observations in the test data set. For the matrix above, total observations = 12 + 3 + 1 + 9 = 25.
Now, accuracy = sum of the values along the diagonal / total observations = (12 + 9) / 25 = 0.84
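As a rough illustration, the same numbers can be reproduced with scikit-learn; the label arrays below are invented purely to match the 12/3/1/9 counts above.

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# 1 = Yes, 0 = No; toy arrays chosen to reproduce the matrix above
y_true = np.array([1] * 13 + [0] * 12)                       # 13 actual Yes, 12 actual No
y_pred = np.array([1] * 12 + [0] * 1 + [1] * 3 + [0] * 9)    # predictions

print(confusion_matrix(y_true, y_pred, labels=[1, 0]))  # [[12  1] [ 3  9]]
print(accuracy_score(y_true, y_pred))                   # (12 + 9) / 25 = 0.84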
11. What Are the Applications of Supervised Machine Learning in Modern Businesses?
Applications of supervised machine learning include:
Email Spam Detection: Here we train the model using historical data that consists of emails categorized as spam or not spam. This labeled information is fed as input to the model.
Healthcare Diagnosis: By providing images regarding a disease, a model can be trained to detect whether a person is suffering from the disease or not.
Sentiment Analysis: This refers to the process of using algorithms to mine documents and determine whether they're positive, neutral, or negative in sentiment.
Fraud Detection: By training the model to identify suspicious patterns, we can detect instances of possible fraud.
13) What is not Machine Learning?
Artificial Intelligence
Rule-based inference
25. What is Bias and Variance in a Machine Learning Model?
Bias
Bias in a machine learning model occurs when the predicted values are far from the actual values. Low bias indicates a model whose predictions are very close to the actual ones.
Underfitting: High bias can cause an algorithm to miss the relevant relations between features and target outputs.
Variance
Variance refers to the amount by which the target model will change when trained with different training data. For a good model, the variance should be minimized.
Overfitting: High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs.
Q19. What is the difference between covariance and correlation?
Answer: Correlation is the standardized form of covariance.
Covariances are difficult to compare. For example, if we calculate the covariance of salary ($) and age (years), we'll get a value that cannot be compared with covariances computed on other scales. To combat this, we calculate correlation, which always lies between -1 and 1, irrespective of the variables' respective scales.
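A quick numpy sketch of the idea; the salary and age values are made up for illustration.

import numpy as np

salary = np.array([40000, 52000, 61000, 75000, 90000])   # dollars
age = np.array([25, 30, 35, 40, 45])                      # years

# covariance is scale-dependent, so its magnitude is hard to interpret
print(np.cov(salary, age)[0, 1])
# correlation standardizes it to a value between -1 and 1
print(np.corrcoef(salary, age)[0, 1])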
Q26. We know that one hot encoding increases the dimensionality of a data set, but label encoding doesn't. How?
Answer: Don't get baffled by this question. It's simply asking for the difference between the two.
Using one hot encoding, the dimensionality (i.e., the number of features) of a data set increases because it creates a new variable for each level present in a categorical variable. For example, say we have a variable 'color' with 3 levels: Red, Blue and Green. One hot encoding the 'color' variable will generate three new variables, Color.Red, Color.Blue and Color.Green, containing 0 and 1 values.
In label encoding, the levels of a categorical variable are encoded as integers (e.g., 0 and 1), so no new variable is created. Label encoding is mostly used for binary variables.
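A minimal sketch of the two encodings using pandas and scikit-learn; the toy 'color' column is hypothetical.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# one hot encoding: one new 0/1 column per level, so dimensionality grows
print(pd.get_dummies(df["color"], prefix="Color"))

# label encoding: levels mapped to integers in the same single column
print(LabelEncoder().fit_transform(df["color"]))  # e.g. [2 0 1 0]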
Q18. While working on a data set, how do you select important variables? Explain your methods.
Answer: Following are the methods of variable selection you can use:
1. Remove the correlated variables prior to selecting important variables
2. Use linear regression and select variables based on p values
3. Use Forward Selection, Backward Selection, Stepwise Selection
4. Use Random Forest, Xgboost and plot variable importance chart
5. Use Lasso Regression
6. Measure information gain for the available set of features and select the top n features accordingly.
Q37. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
Answer: For better predictions, a categorical variable should be treated as a continuous variable only when the variable is ordinal in nature.
Q31. You are working on a classification problem. For validation purposes, you've randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you get shocked after getting poor test accuracy. What went wrong?
Answer: In a classification problem, we should always use stratified sampling instead of random sampling. Random sampling doesn't take into consideration the proportion of the target classes. Stratified sampling, on the contrary, helps maintain the distribution of the target variable in the resulting samples as well.
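For illustration, scikit-learn's train_test_split can preserve class proportions through its stratify argument; the imbalanced synthetic dataset below is only an assumption for the sketch.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the roughly 90/10 class ratio in both train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)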
Q24. You've got a data set to work with having p (no. of variables) > n (no. of observations). Why is OLS a bad option to work with? Which techniques would be best to use? Why?
Answer: In such high dimensional data sets, we can't use classical regression techniques, since their assumptions tend to fail. When p > n, we can no longer calculate a unique least squares coefficient estimate: the variances become infinite, so OLS cannot be used at all.
To combat this situation, we can use penalized regression methods like lasso, LARS and ridge, which shrink the coefficients to reduce variance. Precisely, ridge regression works best in situations where the least squares estimates have higher variance. Other options include subset regression and forward stepwise regression.
Q25. What is a convex hull? (Hint: Think SVM)
Answer: In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points. Once the convex hulls are created, the maximum margin hyperplane (MMH) is obtained as a perpendicular bisector between the two convex hulls. The MMH is the line which attempts to create the greatest separation between the two groups.
Q34. Explain machine learning to me like a 5 year old.
Answer: It's simple. It's just like how babies learn to walk. Every time they fall down, they learn (unconsciously) and realize that their legs should be straight and not bent. The next time they fall down, they feel pain. They cry. But they learn not to stand like that again. In order to avoid that pain, they try harder. To succeed, they even seek support from the door or wall or anything near them, which helps them stand firm.
This is how a machine works and develops intuition from its environment.
Note: The interviewer is only trying to test whether you can explain complex concepts in simple terms.
Q28. You are given a data set consisting of variables having more than 30% missing values. Let's say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?
Answer: We can deal with them in the following ways:
1. Assign a unique category to the missing values; they might reveal some trend.
2. We can simply remove them.
3. Or, we can sensibly check their distribution against the target variable, and if we find any pattern we'll keep those missing values and assign them a new category while removing the others.
Q29. "People who bought this also bought..." recommendations seen on Amazon are a result of which algorithm?
Answer: The basic idea for this kind of recommendation engine comes from collaborative filtering.
Collaborative filtering algorithms consider user behavior for recommending items. They exploit the behavior of other users and items in terms of transaction history, ratings, selection and purchase information. Other users' behavior and preferences over the items are used to recommend items to new users. In this case, features of the items are not known.
Know more: Recommender System
Q27. What cross validation technique would you use on a time series data set? Is it k-fold or LOOCV?
Answer: Neither.
In a time series problem, k-fold can be troublesome because there might be a pattern in year 4 or 5 which is not present in year 3. Resampling the data set would separate these trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward chaining strategy with 5 folds, as shown below:
fold 1: training [1], test [2]
fold 2: training [1 2], test [3]
fold 3: training [1 2 3], test [4]
fold 4: training [1 2 3 4], test [5]
fold 5: training [1 2 3 4 5], test [6]
where 1, 2, 3, 4, 5, 6 represent years.
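scikit-learn's TimeSeriesSplit implements this forward chaining idea; a small sketch (the six yearly observations are stand-ins).

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

years = np.arange(1, 7)  # six years of observations
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(years):
    # each fold trains on all earlier years and tests on the next one
    print("training", years[train_idx], "test", years[test_idx])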
Q40. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement.
Answer: OLS and maximum likelihood are the methods used by the respective regression techniques to estimate the unknown parameter (coefficient) values. In simple words:
Ordinary least squares (OLS) is a method used in linear regression which estimates the parameters by minimizing the distance between the actual and predicted values.
Maximum likelihood helps in choosing the values of the parameters which maximize the likelihood of producing the observed data.
Q39. What do you understand by Bias Variance trade off?
Answer: The error emerging from any model can be mathematically broken down into three components: bias error, variance error and irreducible error.
Bias error is useful to quantify how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model which keeps missing important trends.
Variance, on the other side, quantifies how much the predictions made on the same observation differ from each other. A high variance model will over-fit on your training population and perform badly on any observation beyond training.
Q35. I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?
Answer: We can use the following methods:
1. Since logistic regression is used to predict probabilities, we can use the AUC-ROC curve along with the confusion matrix to determine its performance.
2. Also, the metric analogous to adjusted R² in logistic regression is AIC. AIC is a measure of fit which penalizes the model for the number of coefficients. Therefore, we always prefer the model with the minimum AIC value.
3. Null deviance indicates the response predicted by a model with nothing but an intercept. The lower the value, the better the model. Residual deviance indicates the response predicted by a model after adding independent variables. The lower the value, the better the model.
Know more: Logistic Regression
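A minimal sketch of point 1 with scikit-learn; the synthetic dataset is only for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))  # AUC from predicted probabilities
print(confusion_matrix(y_te, model.predict(X_te)))           # counts at the default 0.5 threshold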
Q33. In k-means or kNN, we use Euclidean distance to calculate the distance between nearest neighbors. Why not Manhattan distance?
Answer: We don't use Manhattan distance because it measures distance only along axis-aligned (horizontal or vertical) directions. The Euclidean metric, on the other hand, measures the straight-line distance between points in any direction. Since data points can lie anywhere in the feature space, Euclidean distance is the more natural choice.
Example: Think of a chess board. The movement of a rook is measured with Manhattan distance because it moves only horizontally and vertically.
Q21. Both being tree-based algorithms, how is random forest different from the gradient boosting algorithm (GBM)?
Answer: The fundamental difference is that random forest uses the bagging technique to make predictions, while GBM uses boosting.
In bagging, a data set is divided into n samples using randomized sampling. Then, using a single learning algorithm, a model is built on each sample. Later, the resultant predictions are combined using voting or averaging. Bagging is done in parallel.
In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, so that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached.
Random forest improves model accuracy mainly by reducing variance. The trees grown are uncorrelated to maximize the decrease in variance. On the other hand, GBM improves accuracy by reducing both bias and variance in a model.
Know more: Tree based modeling
Q23. You've built a random forest model with 10000 trees. You got delighted after getting a training error of 0.00. But the validation error is 34.23. What is going on? Haven't you trained your model perfectly?
Answer: The model has overfitted. A training error of 0.00 means the classifier has mimicked the training data patterns to such an extent that they are not available in unseen data. Hence, when this classifier was run on an unseen sample, it couldn't find those patterns and returned predictions with higher error. In random forest, this happens when we use a larger number of trees than necessary. Hence, to avoid this situation, we should tune the number of trees using cross validation.
Q20. Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how?
Answer: Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables.
Q16. When is Ridge regression favorable over Lasso regression?
Answer: You can quote ISLR's authors Hastie and Tibshirani, who assert that in the presence of a few variables with medium/large sized effects, lasso regression should be used, while in the presence of many variables with small/medium sized effects, ridge regression should be used.
Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least squares estimates have higher variance. Therefore, it depends on our model objective.
Know more: Ridge and Lasso Regression
9. What is ICA (Independent Component Analysis)?
Algorithms (27 Questions)
1. Types of ML Algorithms
What are the various types of Machine Learning Algorithms? Answer
Name various algorithms for Supervised Learning, Unsupervised Learning and Reinforcement Learning.
2. Supervised Learning
What are the various Supervised Learning techniques? What is the difference between Classification and Regression algorithms? Name various Classification and Regression algorithms.
3. Unsupervised Learning
What are the various Unsupervised Learning techniques? What is the difference between Clustering and Association algorithms? Name various Clustering and Association algorithms.
4. Linear Regression
How do we draw the line of linear regression using the Least Square Method? What is the equation of the line? How do we calculate the slope and coefficient of a line using the Least Square Method?
Explain Gradient Descent. How does it optimize the Line of Linear Regression? Answer
What are the various types of Linear Regression? What is the difference between Simple, Multiple and Polynomial Linear Regression?
What are the various metrics used to check the accuracy of Linear Regression? Answer
What are the advantages and disadvantages of Linear Regression? Answer
5. Logistic Regression
What is the equation of Logistic Regression? How will you derive this equation from Linear Regression (Equation of a Straight Line)?
How do we calculate the optimal Threshold value in Logistic Regression?
What are the advantages and disadvantages of Logistic Regression? Answer
6. What is the difference between Linear Regression and Logistic Regression? Answer
7. KNN
What is K in the KNN algorithm? How to choose the optimal value of K? Answer
Why is an odd value of K preferable in the KNN algorithm? Answer
Why is the KNN algorithm called a Lazy Learner? Answer
Why should we not use the KNN algorithm for large datasets? Answer
What are the advantages and disadvantages of the KNN algorithm? Answer
What is the difference between Euclidean Distance and Manhattan Distance? What is the formula of Euclidean distance and Manhattan distance? Answer
8. SVM
Define the terms: Support Vectors and Hyperplanes
What are Kernel Functions and Tricks in SVM? What are the various types of Kernels in SVM? What is the difference between Linear, Polynomial, Gaussian and Sigmoid Kernels? How are these used for transformation of non-linear data into linear data?
Can SVM be used to solve regression problems? What is SVR (Support Vector Regression)?
What are the advantages and disadvantages of SVM? Answer
9. Naive Bayes
What is the difference between Conditional Probability and Joint Probability? What is the formula of the "Naive Bayes" theorem? How will you derive it?
Why is the word Naïve used in the Naïve Bayes algorithm?
What is the difference between Probability and Likelihood?
How do we calculate Frequency and Likelihood tables for a given dataset in the Naïve Bayes algorithm?
What are the various types of models used in the "Naïve Bayes" algorithm? Explain the difference between Gaussian, Multinomial and Bernoulli models.
What are the advantages and disadvantages of the "Naive Bayes" algorithm? Answer
What's the difference between Generative and Discriminative models? What is the difference between Joint Probability Distribution and Conditional Probability Distribution? Name some Generative and Discriminative models. Why is the Naive Bayes Algorithm considered a Generative Model although it appears that it calculates a Conditional Probability Distribution?
Q17. Rise in global average temperature led to decrease in number of pirates around the world. Does that mean that decrease in number of pirates caused the climate change?
Answer: After reading this question, you should have understood that this is a classic case of causation versus correlation. No, we can't conclude that the decrease in the number of pirates caused the climate change, because there might be other factors (lurking or confounding variables) influencing this phenomenon.
Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can't say that pirates died because of the rise in global average temperature.
Know more: Causation and Correlation
Q36. Considering the long list of machine learning algorithms, given a data set, how do you decide which one to use?
Answer: You should say that the choice of machine learning algorithm depends on the type of data. If you are given a data set which exhibits linearity, then linear regression would be the best algorithm to use. If you are given images or audio to work with, then a neural network would help you build a robust model.
If the data comprises non-linear interactions, then a boosting or bagging algorithm should be the choice. If the business requirement is a model which can be deployed and explained, then we'll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM, etc.
In short, there is no one master algorithm for all situations. We must be scrupulous enough to understand which algorithm to use.
Q32. You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?
Answer: Tolerance (1 / VIF) is used as an indicator of multicollinearity. It indicates the percentage of variance in a predictor which cannot be accounted for by the other predictors. Large values of tolerance are desirable.
We will consider adjusted R² as opposed to R² to evaluate model fit, because R² increases irrespective of improvement in prediction accuracy as we add more variables, whereas adjusted R² only increases if an additional variable improves the accuracy of the model and otherwise stays the same. It is difficult to commit to a general threshold value for adjusted R² because it varies between data sets. For example, a gene mutation data set might result in a lower adjusted R² and still provide fairly good predictions, compared to a stock market data set where a lower adjusted R² implies that the model is not good.
Q30. What do you understand by Type I vs Type II error ?
Answer: A Type I error is committed when the null hypothesis is true and we reject it; it is also known as a false positive. A Type II error is committed when the null hypothesis is false and we fail to reject it; it is also known as a false negative.
In the context of the confusion matrix, we can say a Type I error occurs when we classify a value as positive (1) when it is actually negative (0). A Type II error occurs when we classify a value as negative (0) when it is actually positive (1).
31) What are the two classification methods that SVM ( Support Vector Machine) can handle?
Combining binary classifiers
Modifying binary to incorporate multiclass learning
21) What is Genetic Programming?
Genetic programming is an evolutionary technique used in machine learning. The model is based on testing candidate solutions and selecting the best choice among a set of results, evolving better candidates over successive generations.
7. What are the basic steps to implement any Machine Learning algorithm using Cross Validation (cross_val_score) in Python?
Implement KNN using Cross Validation in Python
Implement Naive Bayes using Cross Validation in Python
Implement XGBoost using Cross Validation in Python
8. Feature Scaling in Python
Implement Standardization in Python
Implement Normalization in Python
9. Encoding in Python
Implement LabelEncoder in Python
Implement OneHotEncoder in Python
Implement get_dummies in Python
10. Imputing in Python
Implement Imputer in Python
11. Binning in Python
Implement Binning in Python using Cut Function
12. Dimensionality Reduction in Python
Implement PCA in Python
13. What is the random_state (seed) parameter in train_test_split?
40) What is dimension reduction in Machine Learning?
In Machine Learning and statistics, dimension reduction is the process of reducing the number of random variables under consideration, and it can be divided into feature selection and feature extraction.
27) What is Perceptron in Machine Learning?
In Machine Learning, the Perceptron is an algorithm for the supervised classification of an input into one of two possible classes; it is a binary linear classifier, and multiple perceptrons can be combined to handle multi-class outputs.
19) What are the advantages of Naive Bayes?
A Naïve Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. The main disadvantage, on the other hand, is that it cannot learn interactions between features.
15. What is the Difference Between Inductive Machine Learning and Deductive Machine Learning?
Inductive Learning
It observes instances based on defined principles to draw a conclusion
Example: Explaining to a child to keep away from the fire by showing a video where fire causes damage
22) What is Inductive Logic Programming in Machine Learning?
Inductive Logic Programming (ILP) is a subfield of machine learning which uses logic programming to represent background knowledge and examples.
31. Explain the K Nearest Neighbor Algorithm.
The K nearest neighbor algorithm is a classification algorithm that assigns a new data point to the neighboring group to which it is most similar. In K nearest neighbors, K is an integer greater than 1. So, for every new data point we want to classify, we compute to which neighboring group it is closest.
Let us classify an object using the following example. Consider there are three clusters: football, basketball and tennis ball.
Let the new data point to be classified be a black ball. We use KNN to classify it. Assume K = 5 (initially). Next, we find the K (five) nearest data points.
Observe that all five selected points do not belong to the same cluster: there are three tennis balls and one each of basketball and football. When multiple classes are involved, we go with the majority. Here the majority is with the tennis ball, so the new data point is assigned to that cluster.
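A minimal scikit-learn sketch of the same idea; the three-cluster toy data stands in for the football, basketball and tennis ball groups.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# three labeled clusters standing in for football, basketball and tennis ball
X, y = make_blobs(n_samples=90, centers=3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
# the new point gets the majority class among its 5 nearest neighbors
print(knn.predict([[0.0, 2.0]]))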
16. Compare K-means and KNN Algorithms.
K-means
K-Means is unsupervised
K-Means is a clustering algorithm
The points in each cluster are similar to each other, and each cluster is different from its neighboring clusters
KNN
KNN is supervised
KNN is a classification (or regression) algorithm
A new point is classified based on the classes of its K nearest labeled neighbors
30. Briefly Explain Logistic Regression.
Logistic regression is a classification algorithm used to predict a binary outcome for a given set of independent variables. The output of logistic regression is either a 0 or 1, with a threshold value of generally 0.5. Any value above 0.5 is considered as 1, and any point below 0.5 is considered as 0.
10. What Are the Differences Between Machine Learning and Deep Learning?
Machine Learning
Enables machines to take decisions on their own, based on past data
It needs only a small amount of data for training
Works well on low-end systems, so you don't need large machines
Most features need to be identified in advance and manually coded
The problem is divided into parts, solved individually, and then combined
Deep Learning
Enables machines to take decisions with the help of artificial neural networks
It needs a large amount of training data
Needs high-end machines because it requires a lot of computing power
The machine learns features from the data it is provided
The problem is solved in an end-to-end manner
2) Mention the difference between Data Mining and Machine learning?
Machine learning relates to the study, design and development of algorithms that give computers the capability to learn without being explicitly programmed. Data mining, on the other hand, can be defined as the process of extracting knowledge or unknown interesting patterns from unstructured data. During this process, machine learning algorithms are used.
9) What are the three stages to build the hypotheses or model in machine learning?
Model building
Model testing
Applying the model
46) What is PAC Learning?
PAC (Probably Approximately Correct) learning is a learning framework that has been introduced to analyze learning algorithms and their statistical efficiency.
39) What is PCA, KPCA and ICA used for?
PCA (Principal Component Analysis), KPCA (Kernel-based Principal Component Analysis) and ICA (Independent Component Analysis) are important feature extraction techniques used for dimensionality reduction.
27. Define Precision and Recall.
Precision
Precision is the ratio of events you correctly recall to the total number of events you recall (a mix of correct and wrong recalls).
Precision = (True Positive) / (True Positive + False Positive)
Recall
Recall is the ratio of events you correctly recall to the total number of actual events.
Recall = (True Positive) / (True Positive + False Negative)
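Using the 12/3/1/9 confusion matrix from the earlier question as a worked illustration (the arrays are invented to reproduce those counts):

import numpy as np
from sklearn.metrics import precision_score, recall_score

# 1 = Yes, 0 = No; toy arrays giving TP=12, FP=3, FN=1, TN=9
y_true = np.array([1] * 12 + [0] * 3 + [1] * 1 + [0] * 9)
y_pred = np.array([1] * 12 + [1] * 3 + [0] * 1 + [0] * 9)

print(precision_score(y_true, y_pred))  # 12 / (12 + 3) = 0.8
print(recall_score(y_true, y_pred))     # 12 / (12 + 1) ≈ 0.923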
29. What is Pruning in Decision Trees, and How Is It Done?
Pruning is a technique in machine learning that reduces the size of decision trees. It reduces the complexity of the final classifier, and hence improves predictive accuracy by reducing overfitting. Pruning can occur in a:
Top-down fashion: it traverses nodes and trims subtrees starting at the root
Bottom-up fashion: it begins at the leaf nodes
A popular pruning algorithm is reduced error pruning, in which:
Starting at the leaves, each node is replaced with its most popular class
If the prediction accuracy is not affected, the change is kept
It has the advantage of simplicity and speed
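scikit-learn exposes cost complexity pruning through the ccp_alpha parameter of its tree models; a small sketch on a built-in dataset, purely for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

# a larger ccp_alpha prunes more aggressively, giving a smaller tree
print(unpruned.tree_.node_count, pruned.tree_.node_count)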
2. What is Overfitting, and How Can You Avoid It?
Overfitting is a situation that occurs when a model learns the training set too well, taking up random fluctuations in the training data as concepts. These impact the model's ability to generalize and don't apply to new data. When a model is given the training data, it shows close to 100 percent accuracy. But when we use it on test data, there may be high error and low efficiency. This condition is known as overfitting.
There are multiple ways of avoiding overfitting, such as:
Regularization: it involves adding a cost term for the features in the objective function
Making a simpler model: with fewer variables and parameters, the variance can be reduced
Cross-validation methods like k-fold can also be used
If some model parameters are likely to cause overfitting, regularization techniques like LASSO can be used to penalize these parameters
44) What are the areas in robotics and information processing where sequential prediction problem arises?
The areas in robotics and information processing where the sequential prediction problem arises are:
Imitation Learning
Structured prediction
Model-based reinforcement learning
26. What is the Trade-off Between Bias and Variance?
The bias-variance decomposition essentially decomposes the learning error from any algorithm into the sum of bias, variance, and a bit of irreducible error due to noise in the underlying dataset. Necessarily, if you make the model more complex and add more variables, you'll lose bias but gain variance. To get the optimally-reduced amount of error, you'll have to trade off bias and variance. Neither high bias nor high variance is desired.
High bias and low variance algorithms train models that are consistent, but inaccurate on average.
High variance and low bias algorithms train models that are accurate but inconsistent.
26) What is the difference between heuristic for rule learning and heuristics for decision trees?
The difference is that heuristics for decision trees evaluate the average quality of a number of disjoint sets, while rule learners only evaluate the quality of the set of instances that is covered by the candidate rule.
12) List down various approaches for machine learning?
The different approaches in Machine Learning are:
Concept Vs Classification Learning
Symbolic Vs Statistical Learning
Inductive Vs Analytical Learning
43) What are the different methods for Sequential Supervised Learning?
The different methods to solve Sequential Supervised Learning problems are:
Sliding-window methods
Recurrent sliding windows
Hidden Markov models
Maximum entropy Markov models
Conditional random fields
Graph transformer networks
8) What are the different Algorithm techniques in Machine Learning?
The different types of techniques in Machine Learning are:
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning
Transduction
Learning to Learn
4) Why overfitting happens?
The possibility of overfitting exists as the criteria used for training the model is not the same as the criteria used to judge the efficacy of a model.
23) What is Model Selection in Machine Learning?
The process of selecting models among different mathematical models, which are used to describe the same data set is known as Model Selection. Model selection is applied to the fields of statistics, machine learning and data mining.
13. What Are Unsupervised Machine Learning Techniques?
There are two techniques used in unsupervised learning: clustering and association.
Clustering
Clustering problems involve dividing the data into subsets. These subsets, also called clusters, contain data points that are similar to each other. Different clusters reveal different details about the objects, unlike classification or regression.
Association
In an association problem, we identify patterns of association between different variables or items.
For example, an e-commerce website can suggest other items for you to buy based on the prior purchases you have made, your spending habits, items in your wishlist, other customers' purchase habits, and so on.
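A minimal clustering sketch with scikit-learn's KMeans; the blob data is synthetic.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# group the unlabeled points into 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one center per cluster
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points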
5. How Can You Choose a Classifier Based on a Training Set Data Size?
When the training set is small, a model with high bias and low variance (for example, Naive Bayes) tends to work better because it is less likely to overfit. When the training set is large, models with low bias and high variance tend to perform better, as they can capture more complex relationships.
25) Which method is frequently used to prevent overfitting?
Cross validation and regularization are the methods most frequently used to prevent overfitting. In probability calibration specifically, isotonic regression is used only when there is sufficient data, because with little data it tends to overfit.
19. How Will You Know Which Machine Learning Algorithm to Choose for Your Classification Problem?
While there is no fixed rule to choose an algorithm for a classification problem, you can follow these guidelines:If accuracy is a concern, test different algorithms and cross-validate themIf the training dataset is small, use models that have low variance and high biasIf the training dataset is large, use models that have high variance and little bias
18. Explain How a System Can Play a Game of Chess Using Reinforcement Learning.
Reinforcement learning has an environment and an agent. The agent performs some actions to achieve a specific goal. Every time the agent performs a task that takes it towards the goal, it is rewarded. And every time it takes a step that goes against that goal or in the reverse direction, it is penalized.
Earlier, chess programs had to determine the best moves after much research on numerous factors. Building a machine designed to play such games would require many rules to be specified. With reinforcement learning, we don't have to deal with this problem, as the learning agent learns by playing the game. It will make a move (decision), check if it's the right move (feedback), and keep the outcomes in memory for the next step it takes (learning). There is a reward for every correct decision the system takes and a punishment for every wrong one.
14. What is the Difference Between Supervised and Unsupervised Machine Learning?
Supervised learning - This model learns from the labeled data and makes a future prediction as output Unsupervised learning - This model uses unlabeled input data and allows the algorithm to act on that information without guidance.
12. What is Semi-supervised Machine Learning?
Supervised learning uses data that is completely labeled, whereas unsupervised learning uses no training data.In the case of semi-supervised learning, the training data contains a small amount of labeled data and a large amount of unlabeled data.Semi-supervised Learning
41) What are support vector machines?
Support vector machines are supervised learning algorithms used for classification and regression analysis.
4. How will you visualize missing values, outliers, skewed data and correlations using plots and grids? Answer
Data Preprocessing and Wrangling (19 Questions)
7) What are the five popular algorithms of Machine Learning?
Decision Trees
Neural Networks (back propagation)
Probabilistic networks
Nearest Neighbor
Support vector machines
33) Why ensemble learning is used?
Ensemble learning is used to improve the classification, prediction, function approximation etc of a model.
1) What is Machine learning?
Machine learning is a branch of computer science which deals with system programming in order to automatically learn and improve with experience. For example, robots are programmed so that they can perform tasks based on data they gather from sensors; they automatically learn programs from data.
28. What is Decision Tree Classification?
A decision tree builds classification (or regression) models as a tree structure, with datasets broken up into ever-smaller subsets while developing the decision tree, literally in a tree-like way with branches and nodes. Decision trees can handle both categorical and numerical data.
48) What is sequence learning?
Sequence learning is a method of learning from data where the input and/or output are sequences, so the order of the elements matters.
47) What are the different categories you can categorized the sequence learning process?
Sequence prediction
Sequence generation
Sequence recognition
Sequential decision
36) What is the general principle of an ensemble method and what is bagging and boosting in ensemble method?
The general principle of an ensemble method is to combine the predictions of several models built with a given learning algorithm in order to improve robustness over a single model. Bagging is an ensemble method for improving unstable estimation or classification schemes, and it mainly reduces the variance of the combined model. Boosting methods build models sequentially and mainly reduce the bias of the combined model. Both boosting and bagging can therefore reduce error, chiefly through the variance and bias terms respectively.
42) What are the components of relational evaluation techniques?
The important components of relational evaluation techniques are:
Data Acquisition
Ground Truth Acquisition
Cross Validation Technique
Query Type
Scoring Metric
Significance Test
6) What is inductive machine learning?
Inductive machine learning involves the process of learning by example, where a system tries to induce a general rule from a set of observed instances.
3) What is Overfitting in Machine learning?
In machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting is normally observed when a model is excessively complex, having too many parameters relative to the number of training data points. A model that has been overfit exhibits poor predictive performance.
33. What is Kernel SVM?
Kernel SVM is the abbreviated version of the kernel support vector machine. Kernel methods are a class of algorithms for pattern analysis, and the most common one is the kernel SVM.
45) What is batch statistical learning?
Statistical learning techniques allow learning a function or predictor from a set of observed data that can make predictions about unseen or future data. These techniques provide guarantees on the performance of the learned predictor on the future unseen data based on a statistical assumption on the data generating process.
28) Explain the two components of Bayesian logic program?
A Bayesian logic program consists of two components. The first component is a logical one; it consists of a set of Bayesian clauses, which capture the qualitative structure of the domain. The second component is a quantitative one; it encodes the quantitative information about the domain.
16) What is algorithm independent machine learning?
Machine learning where the mathematical foundations are independent of any particular classifier or learning algorithm is referred to as algorithm-independent machine learning.
4. What are the magic functions in IPython?
5. What is the purpose of writing "inline" with "%matplotlib" (%matplotlib inline)?
23. What is a Random Forest?
A random forest is a supervised machine learning algorithm that is generally used for classification problems. It operates by constructing multiple decision trees during the training phase. The random forest chooses the decision of the majority of the trees as the final decision.
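A minimal random forest sketch with scikit-learn; the dataset choice is arbitrary.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 decision trees vote; the majority class becomes the prediction
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))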
Q11: What's a Fourier transform?
Answer: A Fourier transform is a generic method to decompose generic functions into a superposition of symmetric functions. Or as this more intuitive tutorial puts it, given a smoothie, it's how we find the recipe. The Fourier transform finds the set of cycle speeds, amplitudes, and phases to match any time signal. A Fourier transform converts a signal from time to frequency domain—it's a very common way to extract features from audio signals or other time series such as sensor data.More reading: Fourier transform (Wikipedia)
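A small numpy sketch of the idea: decomposing a synthetic signal into its frequency components (the 5 Hz and 12 Hz tones are made up).

import numpy as np

fs = 100                                    # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.fft.rfft(signal)              # time domain -> frequency domain
freqs = np.fft.rfftfreq(len(signal), 1 / fs)
print(freqs[np.argsort(np.abs(spectrum))[-2:]])  # the two strongest components: ~5 Hz and ~12 Hz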
Q14: What's the difference between a generative and discriminative model?
Answer: A generative model will learn categories of data while a discriminative model will simply learn the distinction between different categories of data. Discriminative models will generally outperform generative models on classification tasks.More reading: What is the difference between a Generative and Discriminative Algorithm? (Stack Overflow)
Q35: What are the data types supported by JSON?
Answer: This tests your knowledge of JSON, another popular file format that wraps with JavaScript. There are six basic JSON datatypes you can manipulate: strings, numbers, objects, arrays, booleans, and null values.More reading: JSON datatypes
Q23: What evaluation approaches would you work to gauge the effectiveness of a machine learning model?
Answer: You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data. You should then implement a choice selection of performance metrics: here is a fairly comprehensive list. You could use measures such as the F1 score, the accuracy, and the confusion matrix. What's important here is to demonstrate that you understand the nuances of how a model is measured and how to choose the right performance measures for the right situations.More reading: How to Evaluate Machine Learning Algorithms (Machine Learning Mastery)
9.1 - What are some key business metrics for (S-a-a-S startup | Retail bank | e-Commerce site)?
Bagging, or Bootstrap Aggregating, is an ensemble method in which the dataset is first divided into multiple subsets through resampling. Then, each subset is used to train a model, and the final predictions are made through voting or averaging across the component models. Bagging is performed in parallel.
Learn more about bagging, boosting, and stacking in machine learning
9. Business Applications
How machine learning can help different types of businesses.
29) What are Bayesian Networks (BN) ?
Bayesian Network is used to represent the graphical model for probability relationship among a set of variables.
2.2 - When would you use GD over SDG, and vice-versa?
Both algorithms are methods for finding a set of parameters that minimize a loss function by evaluating parameters against data and then making adjustments.In standard gradient descent, you'll evaluate all training samples for each set of parameters. This is akin to taking big, slow steps toward the solution.In stochastic gradient descent, you'll evaluate only 1 training sample for the set of parameters before updating them. This is akin to taking small, quick steps toward the solution.Learn more about SGD vs. GD
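A toy sketch of the two update styles for a one-parameter least squares problem; the data, learning rate and epoch count are all made-up assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)   # true slope is 3

w_gd, w_sgd, lr = 0.0, 0.0, 0.05
for epoch in range(50):
    # gradient descent: one update per pass, using the gradient over all samples
    w_gd -= lr * np.mean(2 * (w_gd * x - y) * x)
    # stochastic gradient descent: one small update per individual sample
    for i in rng.permutation(len(x)):
        w_sgd -= lr * 2 * (w_sgd * x[i] - y[i]) * x[i]

print(w_gd, w_sgd)  # both end up close to the true slope of 3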
5) How can you avoid overfitting ?
Overfitting can be avoided by using a lot of data; it tends to happen when you have a small dataset and try to learn from it. If you only have a small dataset and are forced to build a model from it, you can use a technique known as cross validation. In this method, the dataset is split into two sections, a testing and a training dataset: the training dataset is used to fit the model, while the testing dataset is used only to evaluate it.
In this technique, a model is usually given a dataset of known data on which training is run (the training data set) and a dataset of unknown data against which the model is tested. The idea of cross validation is to define a dataset to "test" the model during the training phase.
21. When Will You Use Classification over Regression?
Classification is used when your target is categorical, while regression is used when your target variable is continuous. Both classification and regression belong to the category of supervised machine learning algorithms.
Examples of classification problems include:
Predicting yes or no
Estimating gender
Breed of an animal
Type of color
Examples of regression problems include:
Estimating sales and price of a product
Predicting the score of a team
Predicting the amount of rainfall
15) Explain what is the function of Supervised Learning?
Classification
Speech recognition
Regression
Predict time series
Annotate strings
9. What is Deep Learning?
Deep learning is a subset of machine learning that involves systems that think and learn like humans using artificial neural networks. The term deep comes from the fact that you can have several layers of neural networks. One of the primary differences between machine learning and deep learning is that feature engineering is done manually in machine learning. In the case of deep learning, the model consisting of neural networks will automatically determine which features to use (and which not to use).
17) What is the difference between artificial learning and machine learning?
Designing and developing algorithms that learn behaviours from empirical data is known as machine learning. Artificial intelligence, in addition to machine learning, also covers other aspects like knowledge representation, natural language processing, planning, robotics, etc.
34) When to use ensemble learning?
Ensemble learning is used when you build component classifiers that are more accurate and independent from each other.
7. What Is a False Positive and False Negative and How Are They Significant?
False positives are cases which get wrongly classified as True but are actually False. False negatives are cases which get wrongly classified as False but are actually True.
In the term False Positive, the word Positive refers to the Yes row of the predicted value in the confusion matrix. The complete term indicates that the system predicted it as positive, but the actual value is negative.
So, looking at the confusion matrix, we get:
False positive = 3
True positive = 12
Similarly, in the term False Negative, the word Negative refers to the No row of the predicted value in the confusion matrix. The complete term indicates that the system predicted it as negative, but the actual value is positive.
So, looking at the confusion matrix, we get:
False negative = 1
True negative = 9
14) Explain what is the function of Unsupervised Learning?
Find clusters of the data
Find low-dimensional representations of the data
Find interesting directions in data
Interesting coordinates and correlations
Find novel observations / database cleaning
3.1 - What is the Box-Cox transformation used for?
GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large. That means GD is preferable for small datasets while SGD is preferable for larger ones. In practice, however, SGD is used for most applications because it minimizes the error function well enough while being much faster and more memory efficient for large datasets.
3. Data Preprocessing
Dealing with missing data, skewed distributions, outliers, etc.
6. What are the basic steps to implement any Machine Learning algorithm in Python?
Implement KNN in Python
Implement SVC in Python
Implement Naive Bayes in Python
Implement Simple Linear Regression in Python
Implement Multiple Linear Regression in Python
Implement Decision Tree for Classification Problem in Python
Implement Decision Tree for Regression Problem in Python
Implement Random Forest for Classification Problem in Python
Implement Random Forest for Regression Problem in Python
Implement Adaboost in Python
Implement XGBoost For Classification Problem in Python
Implement XGBoost For Regression Problem in Python
11) What is Training set and Test set?
In various areas of information science like machine learning, a set of data used to discover the potentially predictive relationship is known as the training set. The training set is the set of examples given to the learner, while the test set is used to test the accuracy of the hypotheses generated by the learner; it is the set of examples held back from the learner. The training set is distinct from the test set.
38) What is an Incremental Learning algorithm in ensemble?
Incremental learning is the ability of an algorithm to learn from new data that becomes available after a classifier has already been generated from an existing dataset.
3. Explain various plots and grids available for data exploration in seaborn and matplotlib libraries?
Joint Plot, Distribution Plot, Box Plot, Bar Plot, Regression Plot, Strip Plot, Heatmap, Violin Plot, Pair Plot and Grid, Facet Grid
Q12: What's the difference between probability and likelihood?
Answer: Probability quantifies how likely particular outcomes are, given fixed model parameters; likelihood quantifies how plausible particular parameter values are, given the observed data.
More reading: What is the difference between "likelihood" and "probability"? (Cross Validated)
20. How is Amazon Able to Recommend Other Things to Buy? How Does the Recommendation Engine Work?
Once a user buys something from Amazon, Amazon stores that purchase data for future reference and finds products that are most likely also to be bought. This is possible because of the association algorithm, which can identify patterns in a given dataset.
4. How Do You Handle Missing or Corrupted Data in a Dataset?
One of the easiest ways to handle missing or corrupted data is to drop those rows or columns, or to replace the values entirely with some other value.
There are useful methods in Pandas:
isnull() and dropna() will help find the columns/rows with missing data and drop them
fillna() will replace the missing values with a placeholder value
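A minimal pandas sketch of these calls; the tiny DataFrame is made up.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "salary": [50000, 60000, np.nan]})

print(df.isnull().sum())       # count missing values per column
print(df.dropna())             # drop rows containing any missing value
print(df.fillna(df.mean()))    # or replace them with a placeholder (here, the column mean)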
20) In what areas Pattern Recognition is used?
Pattern Recognition can be used in:
Computer Vision
Speech Recognition
Data Mining
Statistics
Information Retrieval
Bio-Informatics
10. What do you mean by Loss Function? Name some commonly used Loss Functions. Define Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, Sum of Absolute Error, Sum of Squared Error, R Square Method, Adjusted R Square Method. Answer
Python (16 Questions)
1. What are the commonly used libraries in Python for Machine Learning? Explain pandas, numpy, sklearn, matplotlib, seaborn and scipy libraries.
17. What Is naive in the Naive Bayes Classifier?
The classifier is called naive because it makes assumptions that may or may not turn out to be correct. The algorithm assumes that the presence of one feature of a class is not related to the presence of any other feature (absolute independence of features), given the class variable.For instance, a fruit may be considered to be a cherry if it is red in color and round in shape, regardless of other features. This assumption may or may not be right (as an apple also matches the description).
24) What are the two methods used for the calibration in Supervised Learning?
The two methods used for predicting well-calibrated probabilities in Supervised Learning are:
Platt Calibration
Isotonic Regression
These methods are designed for binary classification; extending them to multi-class problems is not trivial.
35) What are the two paradigms of ensemble methods?
The two paradigms of ensemble methods are:
Sequential ensemble methods
Parallel ensemble methods
32) What is ensemble learning?
To solve a particular computational program, multiple models such as classifiers or experts are strategically generated and combined. This process is known as ensemble learning.
3.3 - What are 3 ways of reducing dimensionality?
Winsorize (cap at threshold).
Transform to reduce skew (using Box-Cox or similar).
Remove outliers if you're certain they are anomalies or measurement errors.
5.1 - What are the advantages and disadvantages of decision trees?
Yes, it's definitely possible. One common beginner mistake is re-tuning a model or training new models with different parameters after seeing its performance on the test set.
In this case, it's the model selection process that causes the overfitting. The test set should not be tainted until you're ready to make your final selection.
Learn more about overfitting in machine learning
5. Supervised Learning
Learning from labeled data using classification and regression models.
34. What Are Some Methods of Reducing Dimensionality?
You can reduce dimensionality by combining features with feature engineering, removing collinear features, or using algorithmic dimensionality reduction.Now that you have gone through these machine learning interview questions, you must have got an idea of your strengths and weaknesses in this domain.
4.2 - If you split your data into train/test splits, is it still possible to overfit your model?
You have to find a balance, and there's no right answer for every problem.If your test set is too small, you'll have an unreliable estimation of model performance (performance statistic will have high variance). If your training set is too small, your actual model parameters will have high variance.A good rule of thumb is to use an 80/20 train/test split. Then, your train set can be further split into train/validation or into partitions for cross-validation.See an example in Python
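For instance, a quick sketch of the 80/20 rule of thumb with scikit-learn; the toy arrays below stand in for any features and target.

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# 80% train / 20% test; the train part can be split again for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 40 10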
23. What is the difference between GBM and XGBoost? Answer
24. K-Means Clustering
What are the various types of Clustering? How will you differentiate between Hierarchical (Agglomerative and Divisive) and Partitional (K-Means, Fuzzy C-Means) Clustering?
How do you decide the value of "K" in the K-Means Clustering Algorithm? What is the Elbow method? What is WSS (Within Sum of Squares)? How do we calculate WSS? How is the Elbow method used to calculate the value of "K" in the K-Means Clustering Algorithm?
How do we find centroids and reposition them in a cluster? How many times do we need to reposition the centroids? What do you mean by convergence of clusters?
8. What are Skewed Variables and Outliers in the dataset? What are the various ways to visualize and remove these? What do you mean by log transformation of skewed variables? Answer 1, Answer 2, Answer 3, Answer 4, Answer 5
9. What are the various ways to handle missing and invalid data in a dataset? What is Imputer? Answer 1, Answer 2, Answer 3, Answer 4, Answer 5
10. What is the difference between Mean, Median and Mode? How are these terms used to impute missing values in numeric variables? Answer
18) What is classifier in machine learning?
A classifier in machine learning is a system that inputs a vector of discrete or continuous feature values and outputs a single discrete value, the class.
Q24: How would you evaluate a logistic regression model?
Answer: A subsection of the question above. You have to demonstrate an understanding of what the typical goals of a logistic regression are (classification, prediction, etc.) and bring up a few examples and use cases.More reading: Evaluating a logistic regression (CrossValidated), Logistic Regression in Plain English
Q22. Running a binary classification tree algorithm is the easy part. Do you know how tree splitting takes place, i.e., how does the tree decide which variable to split on at the root node and succeeding nodes?
Answer: A classification tree makes decisions based on the Gini index and node entropy. In simple words, the tree algorithm finds the best possible feature which can divide the data set into the purest possible children nodes.
The Gini index says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure. We can calculate Gini as follows:
1. Calculate Gini for sub-nodes, using the formula sum of squares of the probabilities of success and failure (p^2 + q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
Entropy is the measure of impurity, given (for a binary class) by:
Entropy = -p log2(p) - q log2(q)
Here p and q are the probabilities of success and failure respectively in that node. Entropy is zero when a node is homogeneous. It is maximum when both classes are present in a node at 50%-50%. Lower entropy is desirable.
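A small sketch of both node measures written as plain Python functions, following the formulas in the answer above (purely illustrative).

import numpy as np

def gini_score(p):
    # Gini score p^2 + q^2 as used above; the Gini impurity is 1 minus this value
    q = 1 - p
    return p ** 2 + q ** 2

def entropy(p):
    # binary entropy -p*log2(p) - q*log2(q), treating 0*log2(0) as 0
    q = 1 - p
    return -sum(v * np.log2(v) for v in (p, q) if v > 0)

print(gini_score(0.5), entropy(0.5))  # 50/50 node: score 0.5, entropy 1.0 (most impure)
print(gini_score(1.0), entropy(1.0))  # pure node: score 1.0, entropy 0.0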
Q38: How would you implement a recommendation system for our company's users?
Answer: A lot of machine learning interview questions of this type will involve the application of machine learning models to a company's problems. You'll have to research the company and its industry in depth, especially the revenue drivers the company has, and the types of users the company takes on in the context of the industry it's in. More reading: How to Implement A Recommendation System? (Stack Overflow)
22. How Do You Design an Email Spam Filter?
Building a spam filter involves the following process:
- The email spam filter will be fed with thousands of emails, each of which already has a label: spam or not spam.
- The supervised machine learning algorithm will then determine which type of emails are being marked as spam based on spam words like lottery, free offer, no money, full refund, etc.
- The next time an email is about to hit your inbox, the spam filter will use statistical analysis and algorithms like Decision Trees and SVM to determine how likely the email is to be spam.
- If the likelihood is high, it will label it as spam, and the email won't hit your inbox.
- Based on the accuracy of each model, we will use the algorithm with the highest accuracy after testing all the models.
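A toy sketch of the supervised step, using a Naive Bayes text classifier instead of the Decision Tree/SVM mentioned above purely for brevity; the four emails and their labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win the lottery now", "free offer no money down",
          "meeting agenda for monday", "quarterly project status report"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["claim your free offer"]))        # predicted label
print(model.predict_proba(["claim your free offer"]))  # spam likelihood behind that label
```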
30) Why is an instance-based learning algorithm sometimes referred to as a lazy learning algorithm?
Instance-based learning algorithms are also referred to as lazy learning algorithms because they delay the induction or generalization process until classification is performed.
37) What is the bias-variance decomposition of classification error in ensemble methods?
The expected error of a learning algorithm can be decomposed into bias and variance. The bias term measures how closely the average classifier produced by the learning algorithm matches the target function. The variance term measures how much the learning algorithm's predictions fluctuate for different training sets.
50) Give a popular application of machine learning that you see on a day-to-day basis.
The recommendation engine implemented by major ecommerce websites uses Machine Learning.
10) What is the standard approach to supervised learning?
The standard approach to supervised learning is to split the set of examples into a training set and a test set.
8. What Are the Three Stages of Building a Model in Machine Learning?
The three stages of building a machine learning model are:
- Model Building: Choose a suitable algorithm for the model and train it according to the requirement.
- Model Testing: Check the accuracy of the model using the test data.
- Applying the Model: Make the required changes after testing and use the final model for real-time projects.
Here, it's important to remember that once in a while, the model needs to be checked to make sure it's working correctly. It should be modified to make sure that it is up to date.
24. Considering a Long List of Machine Learning Algorithms, given a Data Set, How Do You Decide Which One to Use?
There is no master algorithm for all situations. Choosing an algorithm depends on the following questions:
- How much data do you have, and is it continuous or categorical?
- Is the problem related to classification, association, clustering, or regression?
- Are the variables predefined (labeled), unlabeled, or a mix?
- What is the goal?
Based on the answers to these questions, an appropriate algorithm can be selected.
13. What is the difference between Decision Tree and Random Forest? Answer
14. Bias and VarianceWhat is the difference between Bias and Variance? What's the trade-off between Bias and Variance?What is the general cause of Overfitting and Underfitting? What steps will you take to avoid Overfitting and Underfitting? AnswerHint: You should explain Dimensionality Reduction Techniques, Regularization, Cross-validation, Decision Tree Pruning and Ensemble Learning Techniques.
15. Cross ValidationWhat is Cross Validation? What is the difference between K-Fold Cross Validation and LOOCV (Leave One Out Cross Validation)?What are Hyperparameters? How does Cross Validation help in Hyperparameter Tuning? AnswerWhat are the advantages and disadvantages of Cross Validation? Answer
16. RegularizationWhat is Regularization? When should one use Regularization in Machine Learning? How is it helpful in reducing the overfitting problem? Can regularization lead to underfitting of the model?What is the difference between Lasso (L1 Regularization) and Ridge (L2 Regularization) Regression? Which one provides better results? Which one to use and when? AnswerWhat is Elastic Net Regression?
17. Ensemble LearningWhat do you mean by Ensemble Learning?What are the various Ensemble Learning Methods? What is the difference between Bagging (Bootstrap Aggregating) and Boosting? AnswerWhat are the various Bagging and Boosting Algorithms? Differentiate between Random Forest, AdaBoost, Gradient Boosting Machine (GBM) and XGBoost? Answer 1, Answer 2, Answer 3
18. AdaBoost What do you know about the AdaBoost Algorithm? What are Stumps? Why are the stumps called Weak Learners? How do we calculate the order of stumps (which stump should be the first one, which should be the second, and so on)? How do we calculate the Error and Amount of Say of each stump? What is the mathematical formula?
14. What are the various metrics present in sklearn library to measure the accuracy of an algorithm? Describe classification_report, confusion_matrix, accuracy_score, f1_score, r2_score, score, mean_absolute_error, mean_squared_error.
15. Pandas LibraryData Exploration using Pandas Library in PythonCreating Pandas DataFrame using CSV, Excel, Dictionary, List and TupleBoolean Indexing: How to filter Pandas Data Frame?How to find missing values in each row and column using Apply function in Pandas library?How to calculate Mean and Median of numeric variables using Pandas library?Sorting datasets based on multiple columns using sort_valuesHow to view and change datatypes of variables or features in a dataset?How to print Frequency Table for all categorical variables using value_counts() function?Frequency Table: How to use pandas value_counts() function to impute missing values?How to use Pandas Lambda Functions for Data Wrangling?How to separate numeric and categorical variables in a dataset using Pandas and Numpy Libraries in Python?
16. Scipy LibraryHow to find mode of a variable using Scipy library to impute missing values?
Practical Implementations (5 Questions)
14. What is Feature Scaling? What is the difference between Normalization and Standardization? Answer 1, Answer 2, Answer 3, Answer 4
15. Which Machine Learning Algorithms require Feature Scaling (Standardization and Normalization) and which not? Answer
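As a quick illustrative aside on the two approaches named in the previous question, a short sketch on made-up numbers (scikit-learn assumed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # features on very different scales

print(StandardScaler().fit_transform(X))  # standardization: (x - mean) / std
print(MinMaxScaler().fit_transform(X))    # normalization: (x - min) / (max - min), into [0, 1]
```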
1. What is Multicollinearity? What is the difference between Covariance and Correlation? How are these terms related with each other? Answer 1, Answer 2
2. Feature Selection and Feature ExtractionWhat do you mean by Curse of Dimensionality? How to deal with it? What is Dimension Reduction in Machine Learning? Why is it required? AnswerWhat is the difference between Feature Selection and Feature Extraction? What are the various Dimensionality Reduction Techniques? Answer
1. Write a pseudo code for a given algorithm.
2. What are the parameters on which we decide which algorithm to use for a given situation?
19. What is the difference between Random Forest and AdaBoost? Answer
20. GBM (Gradient Boosting Machine)What is GBM (Gradient Boosting Machine)? What is Gradient Descent? Why is it so named? How will you calculate the Step Size and Learning Rate in Gradient Descent?When to stop descending the gradient? What is Stochastic Gradient Descent?
21. What is the difference between the AdaBoost and GBM? Answer
22. XGBoost What is XGBoost Algorithm?How is XGBoost more efficient than GBM (Gradient Boosting Machine)? AnswerWhat are the advantages of XGBoost Algorithm? Answer
25. What is the difference between KNN and K-Means Clustering algorithms?
26. Time Series AnalysisWhat are the various components of Time Series Analysis? What do you mean by Trend, Seasonality, Irregularity and Cyclicity?Why should the data be stationary to perform Time Series Analysis? How will you know that your data is stationary? What are the various tests you will perform to check whether the data is stationary or not? How will you achieve stationarity in the data?How will you use the Rolling Statistics (Rolling Mean and Standard Deviation) method and the ADF (Augmented Dickey-Fuller) test to measure stationarity in the data?What are the ways to achieve stationarity in Time Series data?What is the ARIMA model? How is it used to perform Time Series Analysis?When should Time Series Analysis not be used?
27. Sentiment AnalysisWhat do you mean by Sentiment Analysis? How do you identify Positive, Negative and Neutral sentiments? What are Polarity and Subjectivity in Sentiment Analysis?
Accuracy Measurement (10 Questions)
1. Name some metrics which we use to measure the accuracy of classification and regression algorithms.Hint: Classification metrics: Confusion Matrix, Classification Report, Accuracy Score etc.Regression metrics: MAE, MSE, RMSE Answer
3. What is Factor Analysis? What is the difference between Exploratory and Confirmatory Factor Analysis? Answer
4. Principal Component AnalysisWhat is Principal Component Analysis (PCA)?How do we find Principal Components through Projections and Rotations? How will you find your first Principal Component (PC1) using SVD? What is Singular Vector or Eigenvector? What do you mean by Eigenvalue and Singular Value? How will you calculate it? What do you mean by Loading Score? How will you calculate it?"Principal Component is a linear combination of existing features." Illustrate this statement. How will you find your second Principal Component (PC2) once you have discovered your first Principal Component (PC1)? How will you calculate the variation for each Principal Component? What is Scree Plot? How is it useful? How many Principal Components can you draw for a given sample dataset? Why is PC1 more important than PC2 and so on?What are the advantages and disadvantages of PCA? Answer
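As a hedged illustration of the explained-variance questions above, a short scikit-learn sketch (the 95% cutoff is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=0.95)          # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # PC1 explains the most variance, PC2 the next most, ...
print(X_reduced.shape)                # fewer columns than the original feature matrix
```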
4. What is Sensitivity (True Positive Rate) and Specificity (True Negative Rate)? How will you calculate it from Confusion Matrix? What is its formula?
5. What is the difference between Precision and Recall? How will you calculate it from the Confusion Matrix? What is its formula?
6. What do you mean by the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the ROC Curve)? How is this curve used to measure the performance of a classification model?
Q38. When does regularization become necessary in Machine Learning?
Answer: Regularization becomes necessary when the model begins to overfit or underfit. This technique adds a cost term to the objective function for bringing in more features. Hence, it tries to push the coefficients of many variables towards zero, reducing the cost term. This helps to reduce model complexity so that the model can become better at predicting (generalizing).
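An illustrative comparison on synthetic data, assuming scikit-learn; the alpha values are arbitrary, and lasso's tendency to zero out coefficients is the point being shown:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y)
    near_zero = sum(abs(c) < 1e-6 for c in model.coef_)
    print(f"{type(model).__name__}: {near_zero} of 20 coefficients driven to ~0")
```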
19. What do you understand by Fourier Transform? How is it used in Machine Learning?
Dimensionality Reduction (9 Questions)
49) What are two techniques of Machine Learning?
The two techniques of Machine Learning are:
- Genetic Programming
- Inductive Learning
1. What Are the Different Types of Machine Learning?
There are three types of machine learning:
- Supervised Learning: In supervised machine learning, a model makes predictions or decisions based on past or labeled data. Labeled data refers to sets of data that are given tags or labels, and thus made more meaningful.
- Unsupervised Learning: In unsupervised learning, we don't have labeled data. A model can identify patterns, anomalies, and relationships in the input data.
- Reinforcement Learning: Using reinforcement learning, the model can learn based on the rewards it receives for its previous action. Consider an environment where an agent is working. The agent is given a target to achieve. Every time the agent takes some action toward the target, it is given positive feedback. And, if the action taken moves it away from the goal, the agent is given negative feedback.
3. Deploy the model
Training Set: The training set consists of examples given to the model to analyze and learn from. Typically, 70% of the total data is taken as the training dataset. This is labeled data used to train the model.
Consider a case where you have labeled data for 1,000 records. One way to train the model is to expose all 1,000 records during the training process. Then you take a small set of the same data to test the model, which would give good results in this case. But this is not an accurate way of testing. So, we set aside a portion of that data, called the test set, before starting the training process. The remaining data, called the training set, is used for training the model. The training set passes through the model multiple times until the accuracy is high and errors are minimized.
Test Set: Now, we pass the test data to check if the model can accurately predict the values and determine whether training is effective. If you get errors, you either need to change your model or retrain it with more data.
Regarding the question of how to split the data into a training set and a test set, there is no fixed rule, and the ratio can vary based on individual preferences.
8.1 - Why are ensemble methods superior to individual models?
AUROC is robust to class imbalance, unlike raw accuracy. For example, if you want to detect a type of cancer that's prevalent in only 1% of the population, you can build a model that achieves 99% accuracy by simply classifying everyone as cancer-free. Learn more about class imbalance in machine learning
8. Ensemble Learning: Combining multiple models for better performance.
5.2 - What are the advantages and disadvantages of neural networks?
Advantages: Decision trees are easy to interpret, nonparametric (which means they are robust to outliers), and there are relatively few parameters to tune.
Disadvantages: Decision trees are prone to overfitting. However, this can be addressed by ensemble methods like random forests or boosted trees.
Learn more: Overview of modern machine learning algorithms
Q19: How would you handle an imbalanced dataset?
Answer: An imbalanced dataset is when you have, for example, a classification test and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be skewed if you have no predictive power on the other category of data! Here are a few tactics to get over the hump:
- Collect more data to even the imbalances in the dataset.
- Resample the dataset to correct for imbalances.
- Try a different algorithm altogether on your dataset.
What's important here is that you have a keen sense for what damage an unbalanced dataset can cause, and how to balance that. More reading: 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset (Machine Learning Mastery)
Q4. You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model's performance? What can you do about it?
Answer: If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might only be predicting the majority class correctly, while our class of interest is the minority class (4%): the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine class-wise performance of the classifier. If the minority class performance is found to be poor, we can undertake the following steps:
1. We can use undersampling, oversampling or SMOTE to make the data balanced.
2. We can alter the prediction threshold by doing probability calibration and finding an optimal threshold using the AUC-ROC curve.
3. We can assign weights to classes such that the minority classes get larger weights.
4. We can also use anomaly detection.
Know more: Imbalanced Classification
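A hedged sketch of steps 2 and 3 above (class weighting plus a tuned threshold), on synthetic data with roughly 4% positives; the 0.3 threshold is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.96, 0.04], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
preds = (proba >= 0.3).astype(int)  # threshold would normally be tuned on the ROC curve
print("Sensitivity (minority-class recall):", recall_score(y_te, preds))
```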
Q28: Pick an algorithm. Write the pseudo-code for a parallel implementation.
Answer: This kind of question demonstrates your ability to think in parallelism and how you could handle concurrency in programming implementations dealing with big data. Take a look at pseudocode frameworks such as Peril-L and visualization tools such as Web Sequence Diagrams to help you demonstrate your ability to write code that reflects parallelism. More reading: Writing pseudocode for parallel programming (Stack Overflow)
Q31: Which data visualization libraries do you use? What are your thoughts on the best data visualization tools?
Answer: What's important here is to define your views on how to properly visualize data and your personal preferences when it comes to tools. Popular tools include R's ggplot, Python's seaborn and matplotlib, and tools such as Plot.ly and Tableau. More reading: 31 Free Data Visualization Tools (Springboard)
6.1 - Explain Latent Dirichlet Allocation (LDA).
If the training set is small, high bias / low variance models (e.g. Naive Bayes) tend to perform better because they are less likely to overfit. If the training set is large, low bias / high variance models (e.g. Logistic Regression) tend to perform better because they can reflect more complex relationships.
6. Unsupervised Learning: Learning from unlabeled data using factor and cluster analysis models.
Q45: Where do you usually source datasets?
Answer: Kaggle, scraping, and premade datasets. Machine learning interview questions like these try to get at the heart of your machine learning interest. Somebody who is truly passionate about machine learning will have gone off and done side projects on their own, and have a good idea of what great datasets are out there. If you're missing any, check out Quandl for economic and financial data, and Kaggle's Datasets collection for another great list.
6.2 - Explain Principal Component Analysis (PCA).
Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or classifying documents by subject matter. LDA is a generative model that represents documents as a mixture of topics that each have their own probability distribution of possible words. The "Dirichlet" distribution is simply a distribution of distributions. In LDA, documents are distributions of topics that are distributions of words. Learn more: Intuitive explanation of the Dirichlet distribution
7.1 - What is the ROC Curve and what is AUC (a.k.a. AUROC)?
PCA is a method for transforming features in a dataset by combining them into uncorrelated linear combinations. These new features, or principal components, sequentially maximize the variance represented (i.e. the first principal component has the most variance, the second principal component has the second most, and so on). As a result, PCA is useful for dimensionality reduction because you can set an arbitrary variance cutoff. Learn more about PCA
7. Model Evaluation: Making decisions based on various performance metrics.
7.2 - Why is Area Under ROC Curve (AUROC) better than raw accuracy as an out-of-sample evaluation metric?
The ROC (receiver operating characteristic) curve is the performance plot for binary classifiers of True Positive Rate (y-axis) vs. False Positive Rate (x-axis). AUC is the area under the ROC curve, and it's a common performance metric for evaluating binary classification models. It's equivalent to the expected probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative. Learn more about the ROC Curve
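A minimal sketch computing the curve and the AUC from predicted probabilities (not hard labels), assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)  # the ROC curve: TPR vs. FPR at each threshold
print("AUC:", roc_auc_score(y_te, proba))
```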
1.3 - Explain the Bias-Variance Tradeoff.
The difficulty of searching through a solution space becomes much harder as you have more features (dimensions). Consider the analogy of looking for a penny in a line vs. a field vs. a building. The more dimensions you have, the higher volume of data you'll need. Learn more about the Curse of Dimensionality (and reducing dimensions)
Machine Learning Interview Questions: Company/Industry Specific
These machine learning interview questions deal with how to implement your general machine learning knowledge to a specific company's requirements. You'll be asked to create case studies and extend your knowledge of the company and industry you're applying for with your machine learning skills.
Machine Learning Interview Questions: Programming
These machine learning interview questions test your knowledge of programming principles you need to implement machine learning principles in practice. Machine learning interview questions tend to be technical questions that test your logic and programming skills: this section focuses more on the latter.
5.3 - How can you choose a classifier based on training set size?
Advantages: Neural networks (specifically deep NNs) have led to performance breakthroughs for unstructured datasets such as images, audio, and video. Their incredible flexibility allows them to learn patterns that no other ML algorithm can learn.
Disadvantages: However, they require a large amount of training data to converge. It's also difficult to pick the right architecture, and the internal "hidden" layers are incomprehensible.
Q30: Describe a hash table.
Answer: A hash table is a data structure that produces an associative array. A key is mapped to certain values through the use of a hash function. They are often used for tasks such as database indexing. More reading: Hash table (Wikipedia)
Q47: How would you simulate the approach AlphaGo took to beat Lee Sedol at Go?
Answer: AlphaGo beating Lee Sedol, the best human player at Go, in a best-of-five series was a truly seminal event in the history of machine learning and deep learning. The Nature paper above describes how this was accomplished with "Monte-Carlo tree search with deep neural networks that have been trained by supervised learning, from human expert games, and by reinforcement learning from games of self-play." More reading: Mastering the game of Go with deep neural networks and tree search (Nature)
Q29: What are some differences between a linked list and an array?
Answer: An array is an ordered collection of objects. A linked list is a series of objects with pointers that direct how to process them sequentially. An array assumes that every element has the same size, unlike the linked list. A linked list can more easily grow organically: an array has to be pre-defined or re-defined for organic growth. Shuffling a linked list involves changing which pointers point where; meanwhile, shuffling an array is more complex and takes more memory. More reading: Array versus linked list (Stack Overflow)
Q13. How are True Positive Rate and Recall related? Write the equation.
Answer: True Positive Rate = Recall. Yes, they are equal, with the formula TP / (TP + FN). Know more: Evaluation Metrics
Q8. You are assigned a new project which involves helping a food delivery company save more money. The problem is, the company's delivery team isn't able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, they end up delivering food for free. Which machine learning algorithm can save them?
Answer: You might have started hopping through the list of ML algorithms in your mind. But, wait! Such questions are asked to test your machine learning fundamentals. This is not a machine learning problem; it is a route optimization problem. A machine learning problem consists of three things:
1. There exists a pattern.
2. You cannot solve it mathematically (even by writing exponential equations).
3. You have data on it.
Always look for these three factors to decide if machine learning is a tool to solve a particular problem.
Q32: Given two strings, A and B, of the same length n, find whether it is possible to cut both strings at a common point such that the first part of A and the second part of B form a palindrome.
Answer: You'll often get standard algorithms and data structures questions as part of your interview process as a machine learning engineer that might feel akin to a software engineering interview. In this case, this comes from Google's interview process. There are multiple ways to check for palindromes—one way of doing so if you're using a programming language such as Python is to reverse the string and check to see if it still equals the original string, for example. The thing to look out for here is the category of questions you can expect, which will be akin to software engineering questions that drill down to your knowledge of algorithms and data structures. Make sure that you're totally comfortable with the language of your choice to express that logic.More reading: Glassdoor machine learning interview questions
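One straightforward O(n²) sketch of the specific question above (a two-pointer check can bring this down to O(n)); the function and variable names are mine:

```python
def can_form_palindrome(a: str, b: str) -> bool:
    """Return True if some common cut index i makes a[:i] + b[i:] a palindrome."""
    for i in range(len(a) + 1):
        candidate = a[:i] + b[i:]
        if candidate == candidate[::-1]:
            return True
    return False

print(can_form_palindrome("xbdef", "cabax"))  # True: "x" + "abax" -> "xabax"
print(can_form_palindrome("abc", "def"))      # False
```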
Q27: Do you have experience with Spark or big data tools for machine learning?
Answer: You'll want to get familiar with the meaning of big data for different companies and the different tools they'll want. Spark is the big data tool most in demand now, able to handle immense datasets with speed. Be honest if you don't have experience with the tools demanded, but also take a look at job descriptions and see what tools pop up: you'll want to invest in familiarizing yourself with them. More reading: 50 Top Open Source Tools for Big Data (Datamation)
Q5. Why is naive Bayes so naive?
Answer: Naive Bayes is so naive because it assumes that all of the features in a data set are equally important and independent. As we know, these assumptions are rarely true in real-world scenarios.
2.1 - What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?
Predictive models have a tradeoff between bias (how well the model fits the data) and variance (how much the model changes based on changes in the inputs). Simpler models are stable (low variance) but they don't get close to the truth (high bias). More complex models are more prone to being overfit (high variance) but they are expressive enough to get close to the truth (low bias). The best model for a given problem usually lies somewhere in the middle. Learn more about the Bias-Variance Tradeoff
2. Optimization: Algorithms for finding the best parameters for a model.
9.2 - How can you help our marketing team be more efficient?
Thinking about key business metrics, often shortened as KPIs (Key Performance Indicators), is an essential part of a data scientist's job. Here are a few examples, but you should practice brainstorming your own. Tip: When in doubt, start with the easier question of "how does this business make money?"
- SaaS startup: customer lifetime value, new accounts, account lifetime, churn rate, usage rate, social share rate
- Retail bank: offline leads, online leads, new accounts (segmented by account type), risk factors, product affinities
- e-Commerce: product sales, average cart value, cart abandonment rate, email leads, conversion rate
Machine Learning Interview Questions: General Machine Learning Interest
This series of machine learning interview questions attempts to gauge your passion and interest in machine learning. The right answers will serve as a testament to your commitment to being a lifelong learner in machine learning.
Q8: Explain the difference between L1 and L2 regularization.
Answer: L2 regularization tends to spread error among all the terms, while L1 is more binary/sparse, with many variables either being assigned a 1 or 0 in weighting. L1 corresponds to setting a Laplace prior on the terms, while L2 corresponds to a Gaussian prior.
Q2: What is the difference between supervised and unsupervised machine learning?
Answer: Supervised learning requires labeled training data. For example, in order to do classification (a supervised learning task), you'll need to first label the data you'll use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling data explicitly. More reading: Classic examples of supervised vs. unsupervised learning (Springboard)
Q18: What's the F1 score? How would you use it?
Answer: The F1 score is a measure of a model's performance. It is the harmonic mean of the precision and recall of a model, with results tending to 1 being the best, and those tending to 0 being the worst. You would use it in classification tests where true negatives don't matter much. More reading: F1 score (Wikipedia)
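A tiny sketch of the relationship on made-up labels, assuming scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(p, r, f1_score(y_true, y_pred))  # scikit-learn's F1
print(2 * p * r / (p + r))             # same value from the harmonic-mean formula
```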
8.2 - Explain bagging.
They average out biases, reduce variance, and are less likely to overfit. There's a common line in machine learning which is: "ensemble and get 2%." This implies that you can build your models as usual and typically expect a small performance boost from ensembling.
Q11. After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, none of the models could perform better than the benchmark score. Finally, you decided to combine those models. Though ensembled models are known to return high accuracy, you are unfortunate. Where did you miss?
Answer: As we know, ensemble learners are based on the idea of combining weak learners to create strong learners. But these learners provide superior results only when the combined models are uncorrelated. Since we have used 5 GBM models and got no accuracy improvement, it suggests that the models are correlated. The problem with correlated models is that all the models provide the same information. For example: if model 1 has classified User1122 as 1, there are high chances models 2 and 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built on the premise of combining weak, uncorrelated models to obtain better predictions.
Q6: What is Bayes' Theorem? How is it useful in a machine learning context?
Answer: Bayes' Theorem gives you the posterior probability of an event given what is known as prior knowledge. Mathematically, it's expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition sample. Say you had a 60% chance of actually having the flu after a flu test, but out of people who had the flu, the test will be false 50% of the time, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after having a positive test? Bayes' Theorem says no. It says that you have a (0.6 × 0.05) / ((0.6 × 0.05) + (0.5 × 0.95)) ≈ 0.0594, or a 5.94%, chance of having the flu. Bayes' Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier. That's something important to consider when you're faced with machine learning interview questions. More reading: An Intuitive (and Short) Explanation of Bayes' Theorem (BetterExplained)
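Reproducing the flu numbers above, as a quick sanity check:

```python
p_pos_given_flu = 0.6     # chance the test is positive given flu, as in the example
p_flu = 0.05              # prior probability of flu in the population
p_pos_given_no_flu = 0.5  # false positive rate in the example

posterior = (p_pos_given_flu * p_flu) / (
    p_pos_given_flu * p_flu + p_pos_given_no_flu * (1 - p_flu)
)
print(posterior)  # ~0.0594, i.e. about a 5.94% chance of actually having the flu
```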
Q1: What's the trade-off between bias and variance?
Answer: Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you're using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.Variance is error due to too much complexity in the learning algorithm you're using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You'll be carrying too much noise from your training data for your model to be very useful for your test data.The bias-variance decomposition essentially decomposes the learning error from any algorithm by adding the bias, the variance and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you'll lose bias but gain some variance — in order to get the optimally reduced amount of error, you'll have to tradeoff bias and variance. You don't want either high bias or high variance in your model.More reading: Bias-Variance Tradeoff (Wikipedia)
Q10. You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?
Answer: Chances are, you might be tempted to say No, but that would be incorrect. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated. For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variables, which is misleading.
Q20: When should you use classification over regression?
Answer: Classification produces discrete values and maps the dataset to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (e.g., if you wanted to know whether a name was male or female rather than just how correlated it was with male and female names). More reading: Regression vs Classification (Math StackExchange)
Q36: How would you build a data pipeline?
Answer: Data pipelines are the bread and butter of machine learning engineers, who take data science models and find ways to automate and scale them. Make sure you're familiar with the tools to build data pipelines (such as Apache Airflow) and the platforms where you can host models and pipelines (such as Google Cloud or AWS or Azure). Explain the steps required in a functioning data pipeline and talk through your actual experience building and scaling them in production. More reading: 10 Minutes to Building A Machine Learning Pipeline With Apache Airflow
Q13: What is deep learning, and how does it contrast with other machine learning algorithms?
Answer: Deep learning is a subset of machine learning that is concerned with neural networks: how to use backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data. In that sense, deep learning represents an unsupervised learning algorithm that learns representations of data through the use of neural nets. More reading: Deep learning (Wikipedia)
Q7: Why is "Naive" Bayes naive?
Answer: Despite its practical applications, especially in text mining, Naive Bayes is considered "Naive" because it makes an assumption that is virtually impossible to see in real-life data: the conditional probability is calculated as the pure product of the individual probabilities of components. This implies the absolute independence of features — a condition probably never met in real life. As a Quora commenter put it whimsically, a Naive Bayes classifier that figured out that you liked pickles and ice cream would probably naively recommend you a pickle ice cream.
Q10: What's the difference between Type I and Type II error?
Answer: Don't think that this is a trick question! Many machine learning interview questions will be an attempt to lob basic questions at you just to make sure you're on top of your game and you've prepared all of your bases. Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn't, while Type II error means that you claim nothing is happening when in fact something is. A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn't carrying a baby. More reading: Type I and type II errors (Wikipedia)
Q12. How is kNN different from k-means clustering?
Answer: Don't be misled by the "k" in their names. You should know that the fundamental difference between these two algorithms is that k-means is unsupervised in nature and kNN is supervised in nature. k-means is a clustering algorithm; kNN is a classification (or regression) algorithm. The k-means algorithm partitions a data set into clusters such that each cluster formed is homogeneous and the points in each cluster are close to each other. The algorithm tries to maintain enough separability between these clusters. Due to its unsupervised nature, the clusters have no labels. The kNN algorithm tries to classify an unlabeled observation based on its k (which can be any number) surrounding neighbors. It is also known as a lazy learner because it involves minimal training of the model: it does not build a generalization from the training data ahead of time, but uses the stored training points directly at prediction time.
Q21: Name an example where ensemble techniques might be useful.
Answer: Ensemble techniques use a combination of learning algorithms to optimize better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data). You could list some examples of ensemble methods (bagging, boosting, the "bucket of models" method) and demonstrate how they could increase predictive power. More reading: Ensemble learning (Wikipedia)
Q48: What are your thoughts on GPT-3 and OpenAI's model?
Answer: GPT-3 is a new language generation model developed by OpenAI. It was marked as exciting because with very little change in architecture, and a ton more data, GPT-3 could generate what seemed to be human-like conversational pieces, up to and including novel-size works and the ability to create code from natural language. There are many perspectives on GPT-3 throughout the Internet — if it comes up in an interview setting, be prepared to address this topic (and trending topics like it) intelligently to demonstrate that you follow the latest advances in machine learning.More reading: Language Models are Few-Shot Learners
Q50: What are some of your favorite APIs to explore?
Answer: If you've worked with external data sources, it's likely you'll have a few favorite APIs that you've gone through. You can be thoughtful here about the kinds of experiments and pipelines you've run in the past, along with how you think about the APIs you've used before. More reading: Awesome APIs
Q34: How do XML and CSVs compare in terms of size?
Answer: In practice, XML is much more verbose than CSVs are and takes up a lot more space. CSVs use some separators to categorize and organize data into neat columns. XML uses tags to delineate a tree-like structure for key-value pairs. You'll often get XML back as a way to semi-structure data from APIs or HTTP responses. In practice, you'll want to ingest XML data and try to process it into a usable CSV. This sort of question tests your familiarity with data wrangling and sometimes messy data formats. More reading: How Can XML Be Used?
Q15: What cross-validation technique would you use on a time series dataset?
Answer: Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data; it is inherently ordered chronologically. If a pattern emerges in later time periods, for example, your model may still pick up on it even if that effect doesn't hold in earlier years! You'll want to do something like forward chaining, where you model on past data and then look at forward-facing data:
Fold 1: training [1], test [2]
Fold 2: training [1 2], test [3]
Fold 3: training [1 2 3], test [4]
Fold 4: training [1 2 3 4], test [5]
Fold 5: training [1 2 3 4 5], test [6]
More reading: Using k-fold cross-validation for time-series model selection (CrossValidated)
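The same forward-chaining scheme is available off the shelf; a minimal sketch assuming scikit-learn:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)  # six chronologically ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("training:", train_idx, "test:", test_idx)
# training: [0] test: [1]
# training: [0 1] test: [2]  ... and so on, never testing on the past
```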
Q3: How is KNN different from k-means clustering?
Answer: K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data you want to classify an unlabeled point into (thus the nearest neighbor part). K-means clustering requires only a set of unlabeled points and a threshold: the algorithm will take unlabeled points and gradually learn how to cluster them into groups by computing the mean of the distance between different points.The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn't—and is thus unsupervised learning.
Q41: What are the last machine learning papers you've read?
Answer: Keeping up with the latest scientific literature on machine learning is a must if you want to demonstrate an interest in a machine learning position. This overview of deep learning in Nature by the scions of deep learning themselves (from Hinton to Bengio to LeCun) can be a good reference paper and an overview of what's happening in deep learning — and the kind of paper you might want to cite. More reading: What are some of the best research papers/books for machine learning?
Q9. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?
Answer: Low bias occurs when the model's predicted values are near the actual values. In other words, the model becomes flexible enough to mimic the training data distribution. While that sounds like a great achievement, do not forget that a flexible model has no generalization capabilities: when this model is tested on unseen data, it gives disappointing results. In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance problem. Bagging algorithms divide a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression). Also, to combat high variance, we can:
1. Use regularization techniques, where higher model coefficients get penalized, lowering model complexity.
2. Use the top n features from a variable importance chart. Maybe, with all the variables in the data set, the algorithm is having difficulty finding the meaningful signal.
Q46: How do you think Google is training data for self-driving cars?
Answer: Machine learning interview questions like this one really test your knowledge of different machine learning methods, and your inventiveness if you don't know the answer. Google is currently using recaptcha to source labeled data on storefronts and traffic signs. They are also building on training data collected by Sebastian Thrun at GoogleX—some of which was obtained by his grad students driving buggies on desert dunes! More reading: Waymo Tech
Q6. Explain prior probability, likelihood and marginal likelihood in the context of the naive Bayes algorithm.
Answer: Prior probability is nothing but the proportion of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any further information. For example: in a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0 (not spam) is 30%. Hence, we can estimate that there is a 70% chance that any new email would be classified as spam. Likelihood is the probability of classifying a given observation as 1 in the presence of some other variable. For example: the probability that the word FREE is used in a previous spam message is the likelihood. Marginal likelihood is the probability that the word FREE is used in any message.
Q1. You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)
Answer: Processing high dimensional data on a limited memory machine is a strenuous task; your interviewer would be fully aware of that. The following are methods you can use to tackle such a situation:
1. Since we have lower RAM, we should close all other applications on our machine, including the web browser, so that most of the memory can be put to use.
2. We can randomly sample the data set. This means we can create a smaller data set, let's say, having 1000 variables and 300000 rows, and do the computations.
3. To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, we'll use correlation. For categorical variables, we'll use the chi-square test.
4. Also, we can use PCA and pick the components which can explain the maximum variance in the data set.
5. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option.
6. Building a linear model using Stochastic Gradient Descent is also helpful.
7. We can also apply our business understanding to estimate which predictors can impact the response variable. But this is an intuitive approach; failing to identify useful predictors might result in a significant loss of information.
Note: For points 4 and 5, make sure you read about online learning algorithms and Stochastic Gradient Descent. These are advanced methods.
Q16: How is a decision tree pruned?
Answer: Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning. Reduced error pruning is perhaps the simplest version: starting with the leaves, replace each node with its most popular class; if predictive accuracy does not decrease, keep the change. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy. More reading: Pruning (decision trees)
Q5: Define precision and recall.
Answer: Recall is also known as the true positive rate: the amount of positives your model claims compared to the actual number of positives there are throughout the data. Precision is also known as the positive predictive value, and it is a measure of the amount of accurate positives your model claims compared to the number of positives it actually claims. It can be easier to think of recall and precision in the context of a case where you've predicted that there were 10 apples and 5 oranges in a case of 10 apples. You'd have perfect recall (there are actually 10 apples, and you predicted there would be 10) but 66.7% precision because out of the 15 events you predicted, only 10 (the apples) are correct.More reading: Precision and recall (Wikipedia)
Q42: Do you have research experience in machine learning?
Answer: Related to the last point, most organizations hiring for machine learning positions will look for your formal experience in the field. Research papers, co-authored or supervised by leaders in the field, can make the difference between you being hired and not. Make sure you have a summary of your research experience and papers ready — and an explanation for your background and lack of formal research experience if you don't.
Q25: What's the "kernel trick" and how is it useful?
Answer: The kernel trick involves kernel functions that can operate in higher-dimensional spaces without explicitly calculating the coordinates of points within that dimension: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space. This gives them the very useful attribute of calculating the coordinates of higher dimensions while being computationally cheaper than the explicit calculation of said coordinates. Many algorithms can be expressed in terms of inner products. Using the kernel trick enables us to effectively run algorithms in a high-dimensional space with lower-dimensional data. More reading: Kernel method (Wikipedia)
Q44: How would you approach the "Netflix Prize" competition?
Answer: The Netflix Prize was a famed competition where Netflix offered $1,000,000 for a better collaborative filtering algorithm. The winning team, called BellKor, achieved a 10% improvement and used an ensemble of different methods to win. Some familiarity with the case and its solution will help demonstrate you've paid attention to machine learning for a while. More reading: Netflix Prize (Wikipedia)
Q4: Explain how a ROC curve works.
Answer: The ROC curve is a graphical representation of the contrast between true positive rates and the false positive rate at various thresholds. It's often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out or the probability it will trigger a false alarm (false positives). More reading: Receiver operating characteristic (Wikipedia)
Q22: How do you ensure you're not overfitting with a model?
Answer: This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations. There are three main methods to avoid overfitting:
- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
- Use cross-validation techniques such as k-folds cross-validation.
- Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting.
Q39: How can we use your machine learning skills to generate revenue?
Answer: This is a tricky question. The ideal answer would demonstrate knowledge of what drives the business and how your skills could relate. For example, if you were interviewing for music-streaming startup Spotify, you could remark that your skills at developing a better recommendation model would increase user retention, which would then increase revenue in the long run. The startup metrics Slideshare linked above will help you understand exactly what performance indicators are important for startups and tech companies as they think about revenue and growth. More reading: Startup Metrics for Startups (500 Startups)
Q40: What do you think of our current data process?
Answer: This kind of question requires you to listen carefully and impart feedback in a manner that is constructive and insightful. Your interviewer is trying to gauge if you'd be a valuable member of their team and whether you grasp the nuances of why certain things are set the way they are in the company's data process based on company or industry-specific conditions. They're trying to see if you can be an intellectual peer. Act accordingly. More reading: The Data Science Process Email Course (Springboard)
Q3. You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of the data would remain unaffected? Why?
Answer: This question has enough hints for you to start thinking! Since the data is spread around the median, let's assume it's a normal distribution. We know that in a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (or mode, or median, which coincide here), which leaves ~32% of the data outside that range. Therefore, ~32% of the data would remain unaffected by the missing values.
Q37: What do you think is the most valuable data in our business?
Answer: This question or questions like it really try to test you on two dimensions. The first is your knowledge of the business and the industry itself, as well as your understanding of the business model. The second is whether you can pick how correlated data is to business outcomes in general, and then how you apply that thinking to your context about the company. You'll want to research the business model and ask good questions to your recruiter—and start thinking about what business problems they probably want to solve most with their data. More reading: Three Recommendations For Making The Most Of Valuable Data
Q49: What models do you train for fun, and what GPU/hardware do you use?
Answer: This question tests whether you've worked on machine learning projects outside of a corporate role and whether you understand the basics of how to resource projects and allocate GPU-time efficiently. Expect questions like this to come from hiring managers that are interested in getting a greater sense behind your portfolio, and what you've done independently. More reading: Where to get free GPU cloud hours for machine learning
Q17: Which is more important to you: model accuracy or model performance?
Answer: This question tests your grasp of the nuances of machine learning model performance! Machine learning interview questions often look towards the details. There are models with higher accuracy that can perform worse in predictive power — how does that make sense?Well, it has everything to do with how model accuracy is only a subset of model performance, and at that, a sometimes misleading one. For example, if you wanted to detect fraud in a massive dataset with a sample of millions, a more accurate model would most likely predict no fraud at all if only a vast minority of cases were fraud. However, this would be useless for a predictive model—a model designed to find fraud that asserted there was no fraud at all! Questions like this help you demonstrate that you understand model accuracy isn't the be-all and end-all of model performance.More reading: Accuracy paradox (Wikipedia)
Q9: What's your favorite algorithm, and can you explain it to me in less than a minute?
Answer: This type of question tests your understanding of how to communicate complex and technical nuances with poise and the ability to summarize quickly and efficiently. Make sure you have a choice and make sure you can explain different algorithms so simply and effectively that a five-year-old could grasp the basics!
Q7. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why?
Answer: Time series data is known to possess linearity. On the other hand, a decision tree algorithm is known to work best at detecting non-linear interactions. The reason the decision tree failed to provide robust predictions is that it couldn't map the linear relationship as well as a regression model did. Therefore, we learned that a linear regression model can provide robust predictions given that the data set satisfies its linearity assumptions.
Q15. After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check if he's right? Without losing any information, can you still build a better model?
Answer: To check multicollinearity, we can create a correlation matrix to identify and remove variables having a correlation above 75% (deciding a threshold is subjective). In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of multicollinearity. But removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Also, we can add some random noise to the correlated variables so that the variables become different from each other. But adding noise might affect the prediction accuracy, so this approach should be used carefully. Know more: Regression
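A hedged sketch of the VIF check, assuming statsmodels and pandas on made-up columns (x1 and x2 are deliberately correlated):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})

X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 2))
# x1 and x2 show large VIFs; x3 stays near 1
```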
Q51: How do you think quantum computing will affect machine learning?
Answer: With the recent announcement of more breakthroughs in quantum computing, this question asks whether you can use the new format and way of thinking about hardware as a useful lens on classical computing and machine learning, and on the hardware nuances that might make some algorithms much easier to run on a quantum machine. Demonstrating some knowledge in this area helps show that you're interested in machine learning at a much higher level than just implementation details.
Q14. You have built a multiple regression model. Your model's R² isn't as good as you wanted. For improvement, you remove the intercept term, and your model's R² jumps from 0.3 to 0.8. Is it possible? How?
Answer: Yes, it is possible. We need to understand the significance of the intercept term in a regression model. The intercept term represents the model's prediction with no independent variables, i.e. the mean prediction. The formula is R² = 1 - Σ(y - ŷ)² / Σ(y - ȳ)², where ŷ is the predicted value and ȳ is the mean of y. When the intercept term is present, R² evaluates your model with respect to the mean model. In its absence, no such baseline is available, and R² is effectively computed as 1 - Σ(y - ŷ)² / Σ y². Because Σ y² is usually much larger than Σ(y - ȳ)², the ratio becomes smaller and R² comes out artificially higher.
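A quick synthetic demonstration of this effect (the data and variable names are made up; statsmodels reports the uncentered R² when the model has no constant term):

# Dropping the intercept can inflate R-squared even though the fit is no better.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(10, 1, 200)                  # predictor with a large positive mean
y = 5 + 0.2 * x + rng.normal(0, 1, 200)     # weak relationship, large baseline

with_intercept = sm.OLS(y, sm.add_constant(x)).fit()
no_intercept = sm.OLS(y, x).fit()           # intercept term removed

# With an intercept, the denominator of R^2 is sum((y - y_mean)^2).
# Without one, statsmodels uses the uncentered total sum of squares, sum(y^2),
# a far larger denominator, so R^2 looks dramatically better.
print(with_intercept.rsquared)              # small
print(no_intercept.rsquared)                # much larger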
Q2. Is rotation necessary in PCA? If yes, why? What will happen if you don't rotate the components?
Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by the components. This makes the components easier to interpret. Don't forget, that's the motive of doing PCA: we aim to select fewer components (than features) which can explain the maximum variance in the data set. Rotation doesn't change the relative location of the components; it only changes the actual coordinates of the points. If we don't rotate the components, the effect of PCA will diminish and we'll have to select more components to explain the variance in the data set.
Know more: PCA
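As a rough sketch of how this plays out in practice, the snippet below fits PCA and picks the number of components needed to explain a chosen share of variance; the data is synthetic and the 0.95 threshold is only an illustrative choice:

# Choosing the number of principal components from cumulative explained variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                  # stand-in for a real feature matrix

X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale
pca = PCA().fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components, cumulative[n_components - 1])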
Q26: How do you handle missing or corrupted data in a dataset?
Answer: You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods, isnull() and dropna(), that will help you find columns of data with missing or corrupted values and drop them. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
More reading: Handling missing data (O'Reilly)
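A minimal Pandas sketch of those calls, using a tiny made-up DataFrame:

# Finding and handling missing values with isnull(), dropna(), and fillna().
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "income": [50000, 62000, np.nan]})

print(df.isnull().sum())             # count missing values per column
cleaned = df.dropna()                # drop any row containing a missing value
filled = df.fillna(0)                # or replace missing values with a placeholder
filled_mean = df.fillna(df.mean())   # or impute each column with its mean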
1.2 - What is the "Curse of Dimensionality?"
Parametric models are those with a finite number of parameters; to predict new data, you only need to know the parameters of the model. Examples include linear regression, logistic regression, and linear SVMs. Non-parametric models are those with an unbounded number of parameters, allowing for more flexibility; to predict new data, you need to know the parameters of the model and the state of the data observed so far. Examples include decision trees, k-nearest neighbors, and topic models using Latent Dirichlet Allocation.
Learn more about parametric vs. non-parametric models
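A small illustrative contrast using synthetic data and scikit-learn (the two models are simply convenient examples of each family):

# Parametric vs. non-parametric regressors on the same synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 1000)

# Parametric: after fitting, prediction needs only a fixed set of parameters
# (here, three coefficients and an intercept), however large the training set was.
lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)

# Non-parametric: k-nearest neighbors has no fixed parameter vector; it keeps
# the training observations and consults them at prediction time.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(knn.predict(X[:2]))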
4.1 - How much data should you allocate for your training, validation, and test sets?
Removing collinear features. Performing PCA, ICA, or other forms of algorithmic dimensionality reduction. Combining features with feature engineering.
Learn more about feature engineering best practices
4. Sampling & Splitting
How to split your datasets to tune parameters and avoid overfitting.
3.2 - What are 3 data preprocessing techniques to handle outliers?
The Box-Cox transformation is a generalized "power transformation" that transforms data to make the distribution more normal. For example, when its lambda parameter is 0, it's equivalent to the log-transformation. It's used to stabilize the variance (eliminate heteroskedasticity) and normalize the distribution.
Learn more about the Box-Cox transformation
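A minimal sketch using SciPy's boxcox on made-up, right-skewed data (Box-Cox requires strictly positive inputs):

# Applying a Box-Cox transformation; lambda is fitted by maximum likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # strictly positive, skewed

transformed, fitted_lambda = stats.boxcox(skewed)
print(fitted_lambda)          # a value near 0 behaves like a log-transform
print(skewed.std(), transformed.std())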
9.2 - How can you help our marketing team be more efficient?
The answer will depend on the type of company. Here are some examples.
Clustering algorithms to build custom customer segments for each type of marketing campaign.
Natural language processing for headlines to predict performance before running ad spend.
Predicting conversion probability based on a user's website behavior in order to create better re-targeting campaigns.
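As one concrete sketch of the first idea (the file and column names are hypothetical), customers could be segmented with a simple clustering model and each segment targeted with its own campaign:

# Hypothetical sketch: k-means customer segmentation for targeted campaigns.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("customers.csv")            # hypothetical data
features = customers[["recency_days", "order_count", "total_spend"]]

X = StandardScaler().fit_transform(features)        # put features on one scale
customers["segment"] = KMeans(n_clusters=4, n_init=10).fit_predict(X)

# Each segment can then get its own campaign, messaging, and budget.
print(customers.groupby("segment")[["recency_days", "order_count", "total_spend"]].mean())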