Data Science Interviews


What is a confusion matrix?

A confusion matrix, also known as an error matrix, is a summarized table used to assess the performance of a classification model. The number of correct and incorrect predictions are summarized with count values and broken down by each class.
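
As a rough illustration, here is a minimal sketch using scikit-learn's confusion_matrix on made-up labels; the rows correspond to actual classes and the columns to predicted classes.

```python
# A minimal sketch with made-up true and predicted labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]
```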

Explain what a false positive and a false negative are. Why is it important to distinguish these from each other? Provide examples of when false positives are more important than false negatives and when false negatives are more important than false positives.

A false positive is an incorrect identification of the presence of a condition when it's absent. A false negative is an incorrect identification of the absence of a condition when it's actually present. An example of when false negatives are more important than false positives is when screening for cancer. It's much worse to say that someone doesn't have cancer when they do, instead of saying that someone does and later realizing that they don't. This is a subjective argument, but false positives can be worse than false negatives from a psychological point of view. For example, a false positive for winning the lottery could be a worse outcome than a false negative because people normally don't expect to win the lottery anyway.

What's the difference between an AdaBoosted tree and a Gradient Boosted tree?

AdaBoost is a boosting algorithm that is similar to Random Forests but has a couple of significant differences: 1. Rather than a forest of full trees, AdaBoost typically makes a forest of stumps (a stump is a tree with only one node and two leaves). 2. Each stump's decision is not weighted equally in the final decision. Stumps with less total error (higher accuracy) have a greater say. 3. The order in which the stumps are created is important, as each subsequent stump emphasizes the samples that were incorrectly classified by the previous stump. Gradient Boost is similar to AdaBoost in the sense that it builds multiple trees, where each tree is built off of the previous tree. Unlike AdaBoost, which builds stumps, Gradient Boost builds trees with usually 8 to 32 leaves. More importantly, Gradient Boost differs from AdaBoost in the way that the decision trees are built. Gradient Boost starts with an initial prediction, usually the average. Then, a decision tree is built based on the residuals of the samples. A new prediction is made by taking the initial prediction plus a learning rate times the outcome of the residual tree, and the process is repeated.
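
To make the contrast concrete, here is a hedged sketch using scikit-learn's AdaBoostClassifier (whose default base learners are stumps) and GradientBoostingClassifier (deeper trees fit to residuals, scaled by a learning rate); the dataset is synthetic and the hyperparameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: an ensemble of weighted stumps, each emphasizing the previous stump's mistakes.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Gradient Boosting: deeper trees fit to residuals, each scaled by a learning rate.
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0).fit(X_train, y_train)

print("AdaBoost accuracy:", ada.score(X_test, y_test))
print("Gradient Boost accuracy:", gbt.score(X_test, y_test))
```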

What is an inlier?

An inlier is a data observation that lies within the general distribution of the dataset but is nonetheless unusual or erroneous. Because it sits among the rest of the data, it is typically harder to identify than an outlier and often requires external data to detect.

Do you think 50 small decision trees are better than a large one? Why?

Another way of asking this question is "Is a random forest a better model than a decision tree?" And the answer is yes because a random forest is an ensemble method that takes many weak decision trees to make a strong learner. Random forests are more accurate, more robust, and less prone to overfitting.

Briefly explain how a basic neural network works

At its core, a neural network is essentially a network of mathematical equations. It takes one or more input variables, and by passing them through a network of equations, produces one or more output variables. A neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer consists of one or more feature variables (also called input or independent variables), denoted x1, x2, ..., xn. Each hidden layer consists of one or more hidden nodes or hidden units, and the output layer consists of one or more output units. Each node in a neural network is composed of two functions: a linear function and an activation function. This is where things can get a little confusing, but for now, think of the linear function as some line of best fit, and think of the activation function like a light switch, which results in a number between 0 and 1.
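
As a toy sketch of a single node, the snippet below applies a linear function followed by a sigmoid activation (the "light switch" that squashes output between 0 and 1); the inputs and weights are made-up values.

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes any number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input features x1, x2, x3 (made up)
w = np.array([0.8, 0.1, -0.4])   # this node's weights (made up)
b = 0.2                          # bias term

z = np.dot(w, x) + b             # linear function, like a line of best fit
a = sigmoid(z)                   # activation output between 0 and 1
print(z, a)
```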

What is the difference between bagging and boosting?

Bagging, also known as bootstrap aggregating, is the process in which multiple models of the same learning algorithm are trained with bootstrapped samples of the original dataset. Then, like the random forest example above, a vote is taken on all of the models' outputs. Boosting is a variation of bagging where each individual model is built sequentially, iterating on the previous one. Specifically, any data points that are falsely classified by the previous model are emphasized in the following model. This is done to improve the overall accuracy of the ensemble. Once the first model is built, the falsely classified points are emphasized, together with a second bootstrapped sample, to train the second model. Then, the ensemble (models 1 and 2) is evaluated against the test dataset and the process continues.
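
A minimal sketch of the two ideas with scikit-learn on a synthetic dataset: BaggingClassifier trains decision trees independently on bootstrapped samples and votes, while AdaBoostClassifier builds models sequentially, reweighting misclassified points.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: independent trees on bootstrapped samples, aggregated by vote.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0).fit(X, y)

# Boosting: models built sequentially, each emphasizing the previous model's mistakes.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bagging.score(X, y), boosting.score(X, y))
```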

What is the difference between online and batch learning?

Batch learning, also known as offline learning, is when you learn over groups of patterns. This is the type of learning that most people are familiar with, where you source a dataset and build a model on the whole dataset at once. Online learning, on the other hand, is an approach that ingests data one observation at a time. Online learning is data-efficient because the data is no longer required once it is consumed, which technically means that you don't have to store your data.
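
One way to see the difference, assuming scikit-learn and synthetic data: a batch model is fit on the whole dataset at once, whereas an online-style model can be updated incrementally with partial_fit as observations stream in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Batch (offline) learning: the whole dataset is used in one fit.
batch_model = LogisticRegression().fit(X, y)

# Online-style learning: the model is updated chunk by chunk, and the data
# does not need to be kept around afterwards.
online_model = SGDClassifier(random_state=0)
for i in range(0, len(X), 100):
    online_model.partial_fit(X[i:i + 100], y[i:i + 100], classes=np.array([0, 1]))

print(batch_model.score(X, y), online_model.score(X, y))
```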

What are ridge and lasso regression and what are the differences between them?

Both L1 and L2 regularization are methods used to reduce the overfitting of training data. Least Squares minimizes the sum of the squared residuals, which can result in low bias but high variance. L2 Regularization, also called ridge regression, minimizes the sum of the squared residuals plus lambda times the slope squared. This additional term is called the Ridge Regression Penalty. This increases the bias of the model, making the fit worse on the training data, but also decreases the variance. If you take the ridge regression penalty and replace it with the absolute value of the slope, then you get Lasso regression or L1 regularization. L2 is less robust but has a stable solution and always one solution. L1 is more robust but has an unstable solution and can possibly have multiple solutions.
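
A minimal scikit-learn sketch of both penalties on synthetic data; alpha plays the role of lambda, and the values used here are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: penalizes the sum of squared coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: penalizes the sum of absolute coefficients

# Ridge shrinks coefficients smoothly toward zero; lasso can set some exactly to zero.
print(ridge.coef_)
print(lasso.coef_)
```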

What is collinearity? What is multicollinearity? How do you deal with it?

Collinearity is a linear association between two predictors. Multicollinearity is a situation where two or more predictors are highly linearly related. This can be problematic because it undermines the statistical significance of an independent variable. While it may not necessarily have a large impact on the model's accuracy, it affects the variance of the prediction and reduces the quality of the interpretation of the independent variables. You could use the Variance Inflation Factors (VIF) to determine if there is any multicollinearity between independent variables — a standard benchmark is that if the VIF is greater than 5 then multicollinearity exists.
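
A sketch of a VIF check, assuming statsmodels is available; the DataFrame and its columns are made up, with x2 deliberately almost collinear with x1.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.0, 8.2, 9.9, 12.1],   # nearly collinear with x1
    "x3": [5, 3, 6, 2, 7, 1],
})

X = sm.add_constant(df)   # VIFs are usually computed with an intercept included
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif)   # values above roughly 5 suggest multicollinearity
```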

What is cross-validation?

Cross-validation is essentially a technique used to assess how well a model performs on a new independent dataset. The simplest example of cross-validation is when you split your data into three groups: training data, validation data, and testing data, where you use the training data to build the model, the validation data to tune the hyperparameters, and the testing data to evaluate your final model.
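
A sketch of the three-way split described above, using scikit-learn's train_test_split twice on synthetic data; the split ratios are just an example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the test set, then split the remainder into train and validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# Train on X_train, tune hyperparameters against X_val, report final results on X_test.
print(len(X_train), len(X_val), len(X_test))   # roughly a 60/20/20 split
```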

How can you avoid overfitting your model?

Cross-validation: Cross-validation is a technique used to assess how well a model performs on a new independent dataset. The simplest example is splitting your data into two groups, training data and testing data, where you use the training data to build the model and the testing data to test it.
Regularization: Overfitting often occurs when a model is overly complex, for example when it fits high-degree polynomials. Regularization reduces overfitting by penalizing that complexity.
Reduce the number of features: You can also reduce overfitting by simply reducing the number of input features. You can do this by manually removing features, or you can use a technique called Principal Component Analysis, which projects higher-dimensional data (eg. 3 dimensions) into a smaller space (eg. 2 dimensions).
Ensemble learning techniques: Ensemble techniques take many weak learners and convert them into a strong learner through bagging and boosting. These techniques tend to overfit less than their individual counterparts.

What is the difference between a validation set and a test set?

Generally, the validation set is used to tune the hyperparameters of your model, while the testing set is used to evaluate your final model.

What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?

Data profiling: Almost everyone starts off by getting an understanding of their dataset. More specifically, you can look at the shape of the dataset with .shape and a description of your numerical variables with .describe().
Data visualizations: Sometimes, it's useful to visualize your data with histograms, boxplots, and scatterplots to better understand the relationships between variables and also to identify potential outliers.
Syntax errors: This includes making sure there's no stray white space, making sure letter casing is consistent, and checking for typos. You can check for typos by using .unique() or by using bar graphs.
Standardization or normalization: Depending on the dataset you're working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that the different scales of different variables don't negatively impact the performance of your model.
Handling null values: There are a number of ways to handle null values, including deleting rows with null values altogether, replacing null values with the mean/median/mode, replacing null values with a new category (eg. unknown), predicting the values, or using machine learning models that can deal with null values.
Other steps include removing irrelevant data, removing duplicates, and type conversion.
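
A minimal pandas sketch of a few of these steps; the DataFrame and its columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Toronto", "toronto ", "Montreal", "Montreal", "Montreal"],
    "price": [100.0, 102.0, None, 95.0, 95.0],
})

print(df.shape)        # data profiling
print(df.describe())   # summary of numerical variables

df["city"] = df["city"].str.strip().str.title()   # fix white space and casing
print(df["city"].unique())                        # check for remaining typos

df["price"] = df["price"].fillna(df["price"].median())   # one way to handle nulls
df = df.drop_duplicates()                                # remove duplicate rows
```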

What is ensemble learning?

Ensemble learning is a method where multiple learning algorithms are used in conjunction. The purpose of doing so is that it allows you to achieve higher predictive performance than if you were to use an individual algorithm by itself. An example of this is random forests.

How should you deal with unbalanced binary classification?

First, you want to reconsider the metrics that you'd use to evaluate your model. The accuracy of your model might not be the best metric to look at, and an example explains why: say 99 bank withdrawals were not fraudulent and 1 withdrawal was. If your model simply classified every instance as "not fraudulent", it would have an accuracy of 99%! Therefore, you may want to consider using metrics like precision and recall. Another method to improve unbalanced binary classification is to increase the cost of misclassifying the minority class; by increasing that penalty, the model should classify the minority class more accurately. Lastly, you can improve the balance of classes by oversampling the minority class or by undersampling the majority class.
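
A sketch of the cost-based idea with scikit-learn on synthetic, imbalanced data: class_weight="balanced" raises the penalty for misclassifying the minority class, and the model is judged with precision and recall rather than accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with roughly a 95/5 class split.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Increase the cost of misclassifying the minority class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```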

What happens if the learning rate is set too high or too low?

If the learning rate is too low, your model will train very slowly as minimal updates are made to the weights through each iteration. Thus, it would take many updates before reaching the minimum point. If the learning rate is set too high, this causes undesirable divergent behavior to the loss function due to drastic updates in weights, and it may fail to converge.
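
A toy sketch of both failure modes, using plain gradient descent on f(x) = x^2; the learning rates are illustrative.

```python
def gradient_descent(lr, steps=20, x=5.0):
    # Minimize f(x) = x^2; its gradient is 2x.
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(gradient_descent(lr=0.001))   # too low: barely moves toward the minimum at 0
print(gradient_descent(lr=0.1))     # reasonable: converges close to 0
print(gradient_descent(lr=1.1))     # too high: updates overshoot and diverge
```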

How are collaborative filtering and content-based filtering similar? different?

In content-based filtering, you use the properties of the objects to find similar products. For example, using content-based filtering, a movie recommender may recommend movies of the same genre or movies directed by the same director. In collaborative filtering, your behavior is compared to other users and users with similar behavior dictate what is recommended to you. To give a very simple example, if you bought a tv and another user bought a tv as well as a recliner, you would be recommended the recliner as well.

What is principal component analysis? Explain the sort of problems you would use PCA for.

In its simplest sense, PCA involves projecting higher-dimensional data (eg. 3 dimensions) into a smaller space (eg. 2 dimensions). This results in a lower-dimensional representation of the data (2 dimensions instead of 3) in which each component is a combination of the original variables. PCA is commonly used for compression purposes, to reduce required memory and to speed up the algorithm, as well as for visualization purposes, making it easier to summarize data.
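
A minimal scikit-learn sketch: project random 3-dimensional data down to 2 principal components and check how much variance each component keeps.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 features (made up)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # now 100 observations, 2 components

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```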

How does K-Nearest Neighbor work

K-Nearest Neighbors is a classification technique where a new sample is classified by looking at its k nearest classified points, hence 'k-nearest'. For example, if k=1, an unclassified point simply takes the class of its single nearest neighbor. If the value of k is too low, the prediction can be overly sensitive to outliers. However, if it's too high, it may overlook classes with only a few samples.
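
A minimal sketch with scikit-learn's KNeighborsClassifier on synthetic data, varying k to show its effect; the values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small k is sensitive to outliers; very large k can wash out small classes.
for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
```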

Why is mean square error a bad measure of model performance? What would you suggest instead?

Mean Squared Error (MSE) gives a relatively high weight to large errors — therefore, MSE tends to put too much emphasis on large deviations. A more robust alternative is MAE (mean absolute error).

Is mean imputation of missing data acceptable practice? Why or why not?

Mean imputation is the practice of replacing null values in a data set with the mean of the data. Mean imputation is generally bad practice because it doesn't take into account feature correlation. For example, imagine we have a table showing age and fitness score and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score from an age range of 15 to 80, then the eighty-year-old will appear to have a much higher fitness score than he actually should. Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.

Why is Naive Bayes "naive"?

Naive Bayes is naive because it makes the strong assumption that the features are independent of one another given the class, which is rarely ever the case in practice.

What is pruning in decision trees?

Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections or branches of the tree that provide little to no power for classifying instances.

What are random forests? Why is Naive Bayes better?

Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of all of the predictions of each decision tree. By relying on a "majority wins" model, it reduces the risk of error from an individual tree. For example, if a single decision tree predicted 0 but the mode of all of the trees in the forest was 1, the forest's prediction would be 1. This is the power of random forests. Random forests offer several other benefits: strong performance, the ability to model non-linear boundaries, no cross-validation needed, and built-in feature importance. Naive Bayes is better in the sense that it is easy to train and to understand both the process and the results, whereas a random forest can seem like a black box. Therefore, a Naive Bayes algorithm may be better in terms of implementation and interpretability. However, in terms of performance, a random forest is typically stronger because it is an ensemble technique.
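
A sketch of the majority-vote ensemble and the feature importances it exposes, using scikit-learn on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each tree votes; the forest returns the most common prediction.
print(rf.predict(X[:5]))
print(rf.feature_importances_)   # relative importance of each of the 8 features
```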

What is the difference between precision and recall?

Recall attempts to answer "What proportion of actual positives was identified correctly?" Precision attempts to answer "What proportion of positive identifications was actually correct?"
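
In terms of confusion-matrix counts, with made-up numbers:

```python
# tp = true positives, fp = false positives, fn = false negatives (made-up counts).
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)   # of everything flagged positive, how much was right?
recall = tp / (tp + fn)      # of everything actually positive, how much was found?

print(precision, recall)     # 0.8, ~0.67
```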

What are recurrent neural networks?

Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are commonly used to recognize the pattern of sequences in data, including time-series data, stock market data, etc...

What is Supervised vs Unsupervised learning?

Supervised learning involves learning on a labeled dataset where the target variable is known. Unsupervised learning is used to draw inferences and find patterns from input data without references to labeled outcomes — there's no target variable.

When would you use random forests Vs SVM and why?

There are a couple of reasons why a random forest can be a better choice of algorithm than a support vector machine: Random forests allow you to determine feature importance; SVMs can't do this. Random forests are much quicker and simpler to build than an SVM. For multi-class classification problems, SVMs require a one-vs-rest method, which is less scalable and more memory intensive.

How can outliers be treated?

There are a couple of ways: 1. Remove outliers if they're a garbage value. 2. You can try a different model. For example, a non-linear model might treat an outlier differently than a linear model. 3. You can normalize the data to narrow the range. 4. You can use algorithms that account for outliers, such as random forests.

Explain what the bootstrap sampling method is and give an example of when it's used

Technically speaking, the bootstrap sampling method is a resampling method that uses random sampling with replacement. It's an essential part of the random forest algorithm, as well as other ensemble learning algorithms
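
A minimal NumPy sketch: draw repeated samples with replacement from a made-up dataset and use the spread of the resampled means to estimate uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # made-up sample

boot_means = []
for _ in range(1000):
    resample = rng.choice(data, size=len(data), replace=True)   # with replacement
    boot_means.append(resample.mean())

# The spread of the bootstrap means approximates the uncertainty of the sample mean.
print(np.percentile(boot_means, [2.5, 97.5]))
```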

Why is Rectified Linear Unit a good activation function?

The Rectified Linear Unit, also known as the ReLU function, is known to be a better activation function than the sigmoid function and the tanh function because it allows gradient descent to proceed faster. For sigmoid and tanh, when the input is very large (or very small), the slope is very small, which slows gradient descent significantly. This is not the case for the ReLU function, whose slope stays constant for positive inputs.
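
A small sketch comparing the gradients of the two activations: the sigmoid's gradient shrinks toward zero for large inputs, while the ReLU's gradient stays at 1 for any positive input.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)              # derivative of the sigmoid

def relu_grad(z):
    return (z > 0).astype(float)    # derivative of ReLU: 0 for negative z, 1 for positive z

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(z))   # near zero at the extremes, which slows gradient descent
print(relu_grad(z))      # stays 1 for all positive inputs
```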

What is the bias-variance tradeoff?

The bias of an estimator is the difference between the expected value and true value. A model with a high bias tends to be oversimplified and results in underfitting. Variance represents the model's sensitivity to the data and the noise. A model with high variance results in overfitting. Therefore, the bias-variance tradeoff is a property of machine learning models in which lower variance results in higher bias and vice versa. Generally, an optimal balance of the two can be found in which error is minimized.

What are the support vectors in SVM?

The support vectors are the data points that lie closest to the separating hyperplane and touch the boundaries of the maximum margin; they are the points that define the margin.

How are weights initialized in a Network?

The weights of a neural network must be initialized randomly in order to break symmetry before training with stochastic gradient descent. If you initialized all weights to the same value (i.e. zero or one), then every hidden unit would receive exactly the same signal and the same gradient update, so they would all learn the same thing. For example, if all weights are initialized to 0, all hidden units get zero signal.

What are the drawbacks of a linear model?

There are a couple of drawbacks of a linear model: A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity. A linear model can't be used for discrete or binary outcomes. You can't vary the model flexibility of a linear model.

Give several ways to deal with missing values

There are a number of ways to handle null values, including the following:
You can omit rows with null values altogether.
You can replace null values with measures of central tendency (mean, median, mode) or replace them with a new category (eg. 'None').
You can predict the null values based on other variables. For example, if a row has a null value for weight but it has a value for height, you can replace the null value with the average weight for that given height (as shown in the sketch below).
Lastly, you can leave the null values if you are using a machine learning model that automatically deals with them.
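
A sketch of the "predict from another variable" idea with pandas: fill a missing weight with the average weight for that height group. The DataFrame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [160, 160, 175, 175, 190, 190],
    "weight": [60.0, np.nan, 75.0, 73.0, np.nan, 85.0],
})

# Replace each missing weight with the mean weight of rows sharing the same height.
df["weight"] = df["weight"].fillna(df.groupby("height")["weight"].transform("mean"))
print(df)
```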

What are the assumptions required for linear regression? What if some of these assumptions are violated?

There are four assumptions associated with a linear regression model: 1. Linearity: the relationship between X and the mean of Y is linear. 2. Homoscedasticity: the variance of the residuals is the same for any value of X. 3. Independence: observations are independent of each other. 4. Normality: for any fixed value of X, Y is normally distributed. Extreme violations of these assumptions will make the results unreliable. Small violations will result in greater bias or variance of the estimates.

Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model.

There are two main ways that you can do this: A) Adjusted R-squared. R-squared is a measurement that tells you the proportion of the variance in the dependent variable that is explained by the independent variables. In simpler terms, while the coefficients estimate trends, R-squared represents the scatter around the line of best fit. However, every additional independent variable added to a model always increases the R-squared value — therefore, a model with several independent variables may seem to be a better fit even if it isn't. This is where adjusted R² comes in: it compensates for each additional independent variable and only increases if a given variable improves the model beyond what would be expected by chance. This is important since we are creating a multiple regression model. B) Cross-validation. A method common to most people is cross-validation: splitting the data into three sets (training, validation, and testing data).
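
The adjusted R-squared formula can be written as a one-liner; n is the number of observations, p the number of independent variables, and the figures below are hypothetical.

```python
def adjusted_r2(r2, n, p):
    # Discounts R-squared for each additional predictor.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The raw R-squared rises when variables are added, but the adjusted value
# barely moves if the new variables add little.
print(adjusted_r2(0.80, n=100, p=3))   # ~0.794
print(adjusted_r2(0.81, n=100, p=6))   # ~0.798
```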

What are the feature selection methods used to select the right variables?

There are two types of methods for feature selection: filter methods and wrapper methods.
Filter methods include the following: linear discriminant analysis, ANOVA, and Chi-Square.
Wrapper methods include the following:
Forward selection: we test one feature at a time and keep adding features until we get a good fit.
Backward selection: we start with all the features and remove them one at a time to see what works better.

How does XGBoost handle the bias-variance tradeoff?

XGBoost is an ensemble machine learning algorithm that leverages the gradient boosting algorithm. In essence, XGBoost is like a bagging and boosting technique on steroids. Therefore, you can say that XGBoost handles bias and variance similarly to any boosting technique. Boosting is an ensemble meta-algorithm that reduces both bias and variance by taking a weighted average of many weak models. By focusing on weak predictions and iterating through models, the error (and thus the bias) is reduced. Similarly, because it takes a weighted average of many weak models, the final model has a lower variance than each of the weaker models themselves.

How can you select k for k means?

You can use the elbow method, which is a popular method used to determine the optimal value of k. Essentially, what you do is plot the squared error for each value of k on a graph (value of k on the x-axis and squared error on the y-axis). Once the graph is made, the point where the distortion declines the most is the elbow point.
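
A sketch of the elbow method with scikit-learn: fit k-means for several values of k and watch where the within-cluster squared error (inertia_) stops dropping sharply. The blob data and range of k are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances; the "elbow"
    # is where it stops dropping sharply (around k=4 for this data).
    print(k, round(km.inertia_, 1))
```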

How can you identify outliers?

Z-score/standard deviations: if we know that 99.7% of data in a normally distributed data set lies within three standard deviations of the mean, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points that are outside of this range. Likewise, we can calculate the z-score of a given point, and if it's beyond +/- 3, then it's an outlier. Note that there are a few contingencies that need to be considered when using this method: the data must be normally distributed, it is not applicable for small data sets, and the presence of too many outliers can throw off the z-score.
Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. This comes to approximately 2.698 standard deviations.
Other methods include DBSCAN clustering, Isolation Forests, and Robust Random Cut Forests.
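
A sketch of both rules on a made-up one-dimensional sample with two planted outliers.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=0, scale=1, size=200), [8.5, -7.0])   # two planted outliers

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])

# IQR rule: flag points outside Q1 - 1.5*IQR or Q3 + 1.5*IQR.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])
```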

