TECHNICAL - STATISTICS & MACHINE LEARNING
What are the bounds of correlation between any two variables?
-1 to 1
What is the difference between a decision tree and a random forest?
A random forest is a collection of decision trees, each trained on a bootstrapped sample of the data (and a random subset of features at each split), whose results are aggregated into one final prediction.
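A minimal sketch of how the two compare in practice (assuming scikit-learn is available; the iris data and parameter values are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single tree: fast and interpretable, but prone to overfitting.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A forest: many trees fit on bootstrapped samples with random feature subsets,
# with predictions aggregated by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("tree  :", tree.score(X_test, y_test))
print("forest:", forest.score(X_test, y_test))
```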
What is machine learning?
A subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data. Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task. Machine Learning programs are also designed to learn and improve over time when exposed to new data. Three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
Explain convolutional neural networks (CNNs) to me as if I do not know data science.
Convolutional neural networks are generally used for image data. By only considering interactions among "close" pixels, they drastically cut down the number of parameters the network needs to learn.
Would you use dimensionality reduction with a random forest or gradient boosted model? Why or why not?
Curse of dimensionality: past a certain point, adding more features degrades model performance (increased overfitting). Reducing the dimensions reduces training time, but whether it leads to a better model depends entirely on the dataset. Tree-based models like random forests and gradient boosting already perform implicit feature selection, so dimensionality reduction is usually less critical for them.
Why might multiple random weak trees be better than one long tree? (Note: This is the difference between boosting and a deep decision tree.)
Deep tree: pros: easy to interpret, less effort for data preparation during pre-processing. Cons: overfitting and high variance error (the result will change based on changes to the training set). Multiple weak trees: pros: many weak learners come together to form a strong learner with lower variance. Cons: since the final prediction is based on the mean of the predictions from the individual trees, it won't give precise values for a regression model and is harder to interpret.
What is better, false positives or false negatives? Why?
It depends on the problem. In tumor detection we want to avoid false negatives (missing a real tumor); in spam detection we want to avoid false positives (flagging a legitimate email as spam).
Given a gradient f(x) = x + 8, what is the antiderivative f(x)?
F(x) = (1/2)x^2 + 8x + C
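A quick check with SymPy (assuming it is installed) confirms the antiderivative:

```python
import sympy as sp

x, C = sp.symbols("x C")
f = x + 8                    # the given gradient
F = sp.integrate(f, x) + C   # antiderivative plus the constant of integration
print(F)                     # x**2/2 + 8*x + C
```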
Suppose you're trying to cluster your observations into groups. However, you have too many features in your data. How would you decide which ones to use in the clustering model?
Feature extraction: Principal Component Analysis (PCA) compresses the dataset onto a lower-dimensional feature subspace while retaining most of the relevant information. Feature selection: drop the least important variables from the model (e.g., identify redundant or uninformative features with a correlation heatmap).
What hyperparameters are available in LSTMs?
For each model layer:
● Number of hidden layers: more hidden layers = more complex relationships learned.
● Number of nodes in each hidden layer: more hidden nodes = more relationships learned.
● Choice of activation function: sigmoid, tanh, ReLU.
Regularization techniques for model layers:
● Penalty parameters at each hidden layer
● Dropout
● Early stopping
Model compile:
● Loss function: regression = MSE; classification = binary cross-entropy.
● Optimizer: e.g., Adam.
Model fit:
● Number of epochs: how many passes are made through the training data.
● Batch size: controls how often the weights of the network are updated.
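A hedged Keras sketch showing where each of these hyperparameters appears (layer sizes, dropout rate, epochs, and the toy data shapes are placeholder choices, not recommendations):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy sequence data: 100 samples, 10 timesteps, 1 feature (placeholder shapes).
X = np.random.rand(100, 10, 1)
y = np.random.rand(100, 1)

model = keras.Sequential([
    layers.LSTM(32, activation="tanh", return_sequences=True, input_shape=(10, 1)),
    layers.Dropout(0.2),                 # regularization between layers
    layers.LSTM(16, activation="tanh"),
    layers.Dense(1),                     # regression output
])

model.compile(loss="mse", optimizer="adam")   # loss and optimizer set at compile time

early_stop = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
model.fit(X, y, epochs=20, batch_size=16,     # epochs and batch size set at fit time
          validation_split=0.2, callbacks=[early_stop], verbose=0)
```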
What is gradient boosting?
Gradient boosting trains many models in a gradual, additive, sequential manner. Each new model is fit to the gradients of the loss function, which identify the shortcomings of the current ensemble. The loss function is a measure of how well the model's predictions fit the underlying data.
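A minimal scikit-learn sketch of gradient boosting on synthetic regression data (the hyperparameter values are illustrative only):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fit to the negative gradient of the loss (the residuals, for
# squared error) of the current ensemble, then added with a small learning rate.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3,
                                random_state=0).fit(X_train, y_train)
print("R^2 on test:", gbr.score(X_test, y_test))
```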
What clustering algorithms have you used before? Can you think of a reason why we'd use a clustering algorithm for a specific business problem?
K-means clustering (e.g., geocoded crime analysis); Density-Based Spatial Clustering of Applications with Noise, DBSCAN (e.g., the "smiley face" example, clustering wines by their properties in class). Marketing and sales: clustering algorithms can group together people with similar traits and likelihood to purchase. Once you have the groups, you can run tests on each group with different marketing copy to better target your messaging in the future; clusters can also drive product recommendations for customers.
What are the assumptions of a linear regression model?
Linear relationship: the relationship between the independent and dependent variables is linear. Multivariate normality: check the distribution of the residuals with a histogram. No or little multicollinearity: the independent variables should not be too highly correlated with each other. No autocorrelation: autocorrelation occurs when the residuals are not independent of each other; this typically occurs in stock prices, where each price is not independent of the previous price. Homoscedasticity: a scatter plot of the residuals is a good way to check that they have equal variance across the regression line.
In a linear model, how would you interpret a QQ plot?
A quantile-quantile (Q-Q) plot plots the quantiles of one data set against the quantiles of another. In linear regression it is commonly used to check whether the residuals follow a normal distribution, or to confirm that separately received training and test sets come from populations with the same distribution. If the points fall along the 45-degree reference line, the two distributions are similar; systematic deviations (Y-values consistently below X-values, or vice versa) indicate different distributions. https://medium.com/@premal.matalia/q-q-plot-in-linear-regression-explained-ab040567d86f
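A small sketch (assuming SciPy and Matplotlib are available; the residuals here are simulated stand-ins) of a Q-Q plot of residuals against a normal distribution, where points close to the reference line suggest roughly normal residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

residuals = np.random.normal(size=200)            # stand-in for model residuals

stats.probplot(residuals, dist="norm", plot=plt)  # sample quantiles vs. normal quantiles
plt.title("Q-Q plot of residuals")
plt.show()
```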
If you were a machine learning model, what would you be and why?
Random Forest: I'm a high achiever; I'm popular with a variety of different people; I'm flexible and can adapt to almost any situation; sometimes my friends complain that they don't fully understand me; and I need time to recharge because I'm highly complex by nature.
What is stacking?
Stacking is an ensemble machine learning algorithm that learns how to best combine the predictions from multiple well-performing machine learning models.
After testing several models, how do you decide which is the best?
blank
How do you calculate the L1 distance between two vectors?
blank
How do you calculate the maximum model lift if the prevalence of the model is set at 20%? ([Helpful link here](https://en.wikipedia.org/wiki/Lift_(data_mining)).)
blank
What's the difference between statistics, machine learning and deep learning. What is AI?
blank
When processing text, what do you do about frequently occurring words, like "a," "and," "the," etc.?
blank
When you pulled data from an API, what format was it in?
blank
technical person?
blank
How is the ROC curve created?
By plotting the true positive rate (TPR) against the false positive rate (FPR) at every classification threshold.
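A minimal scikit-learn sketch of building the ROC curve from predicted probabilities (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per threshold; plotting them traces out the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, proba)
print("AUC:", roc_auc_score(y_test, proba))
```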
How would you need to preprocess your target variable if it were a categorical variable assuming you want to run a regression model?
Convert it to numerical values, e.g., by dummying (one-hot encoding) the categorical variable.
How would you convey the correlation between two variables to a nontechnical audience?
Correlation shows the strength of the relationship between two variables. If the correlation is high, then as one variable changes, so does the other. For example, the more you exercise, the more calories you burn, so exercise and calories burned have a positive correlation: when one increases, so does the other.
What is a p value
The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true; a small p-value (commonly < 0.05) suggests the observed effect is unlikely under the null hypothesis. https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8
How do you investigate whether there are any relationships between your features and the target?
Primarily a correlation heatmap; scatter plots of each feature against the target also help reveal non-linear relationships.
Why are missing values a concern in modeling?
We risk losing important data if we drop them, many models cannot handle missing values at all, and careless imputation can bias the results.
What is the correlation between X and X^2?
It depends on the distribution of X. If X takes only positive values, X and X^2 are strongly positively correlated; but if X is symmetric about zero (e.g., standard normal), the correlation is exactly zero, because correlation measures only linear association and the relationship between X and X^2 is nonlinear.
What is the correlation between X and X?
The correlation of a variable with itself is exactly 1, the highest possible correlation.
How do you correct for variance in a regression problem?
Overfitting (training error much lower than test error; high variance / low bias):
- Get more data
- Perform regularization and scale the data
- Reduce the number of features
- For neural networks: penalty parameters at each hidden layer, dropout, early stopping
Given a confusion matrix, calculate precision.
Precision is a metric that quantifies the number of correct positive predictions made. Precision = TruePositives / (TruePositives + FalsePositives)
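For example, with a hypothetical confusion matrix of TP = 40, FP = 10, FN = 5, TN = 45, precision = 40 / (40 + 10) = 0.8. A quick sketch in code (laying out the matrix in scikit-learn's [[TN, FP], [FN, TP]] convention):

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual, columns are predicted.
cm = np.array([[45, 10],
               [ 5, 40]])

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
print(precision)   # 0.8
```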
Explain PCA to me.
Principal Component Analysis (PCA) is a dimensionality-reduction algorithm: it transforms a large set of variables into a smaller set of uncorrelated components that still contains most of the information in the original data.
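A minimal scikit-learn sketch: standardize first, then keep enough components to explain (say) 95% of the variance; the 95% target is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=0.95)                   # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```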
How is stochastic gradient descent different than gradient descent?
Gradient descent computes the gradient using the whole dataset; stochastic gradient descent (SGD) computes the gradient using a single sample (or a small mini-batch) at a time.
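A rough NumPy sketch of the difference for a one-variable linear regression (the learning rate and number of iterations are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)
lr = 0.1

# Batch gradient descent: one update per pass, gradient over the whole dataset.
w = 0.0
for _ in range(100):
    grad = -2 * np.mean(X[:, 0] * (y - w * X[:, 0]))
    w -= lr * grad

# Stochastic gradient descent: one update per randomly chosen sample.
w_sgd = 0.0
for _ in range(100):
    i = rng.integers(len(y))
    grad = -2 * X[i, 0] * (y[i] - w_sgd * X[i, 0])
    w_sgd -= lr * grad

print(w, w_sgd)   # both should approach the true slope of 3 (SGD more noisily)
```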
How do you select the right model for a problem? (Note: My wording here is verbatim. I asked for clarification about "what kind of problem" and they said "Any problem." This forced me to describe a few different approaches.)
https://towardsdatascience.com/how-to-choose-the-right-machine-learning-algorithm-for-your-application-1e36c32400b9 Based on the problem, we identify which model to use (Is it classification or regression? Do we need an interpretable model?). 1. What type of data do I have? If the input data is labeled, it is a supervised learning problem (regression, random forest, decision tree, naive Bayes, k-nearest neighbors); if not, it is an unsupervised problem (e.g., k-means clustering). If the output is numeric, use regression; if the goal is to find groups without labels, it is a clustering problem. 2. What are the size and characteristics of the data? For small training datasets, high-bias / low-variance classifiers (naive Bayes, logistic regression) tend to perform better than low-bias / high-variance classifiers such as kNN. For really large datasets or those with many features, neural networks or boosted trees can be an excellent choice. If the data are roughly linear, a linear model may fit best, such as linear or logistic regression or an SVM (support vector machine); if the data are more complex, an algorithm like random forest is a better fit. 3. Define the problem: do I need to clearly interpret findings, and what metric am I optimizing? Knowing how important it is to explain the model to stakeholders, versus capturing interesting trends while giving up explainability, is key to picking a model. For instance, self-driving cars need blazing-fast prediction times, whereas fraud detection systems need to update their models quickly to stay current with the latest phishing attacks. For other cases like medical diagnosis, we care about accuracy (or area under the ROC curve) much more than training time.
What are the common pitfalls when building a predictive model? How do you avoid them?
- Not enough data: get more data.
- Bias in the data: investigate before discarding features, have clear rules and procedures in place for the experiment, and be wary of statistical relationships that encode existing prejudices.
- Overfitting: reduce the number of features, paying most attention to the features that have the greatest effect on the target variable.
- Underfitting: add more complexity with feature engineering.
- Forgetting about outliers or mishandling null values.
What is the difference between kmeans and k-nearest neighbors?
K-means is a clustering algorithm: it tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other. It is unsupervised because the points have no external classification. K-nearest neighbors is a classification (or regression) algorithm that determines the classification of a point by combining the classifications of the K nearest points. It is supervised because you are trying to classify a point based on the known classifications of other points.
Why would you choose to use NLTK as opposed to other natural language packages?
blank
What is the difference between unsupervised and supervised learning?
Supervised machine learning: we know what our output values should be, and the goal is to learn a function that best approximates the relationship between input and output observable in the data. Includes classification and regression models. Algorithms: logistic regression, naive Bayes, support vector machines, artificial neural networks, and random forests. Unsupervised machine learning: there are no labeled outputs, so the goal is to infer the natural structure present within a set of data points. Includes clustering and association. Algorithms: k-means clustering and dimensionality reduction (eliminating redundant features).
What is boosting?
(Repeat; see "What is gradient boosting?" above.)
What are some models you could use to solve a classification problem?
- Logistic regression: binary output
- KNN: multiclass classification, no assumptions about the data, fewer than ~50,000 samples
- Random forest / decision tree classifier: multi-class problems such as object detection
- Naive Bayes: text data
- SVM (support vector machines): e.g., speech recognition
Let's say you build a model that performs poorly. How would you go about improving it?
- Treat missing and outlier values
- Proper feature selection
- Select the correct machine learning algorithm
- Tune the algorithm: grid search to find the best parameters (for example, in a random forest we have parameters like max_features, n_estimators, random_state)
Overfitting (training error much lower than test error; high variance / low bias):
- Get more data
- Perform regularization and scale the data
- Reduce features
- Penalty parameters at each hidden layer, dropout, early stopping
Underfitting (training error close to test error; high bias / low variance):
- Add complexity to the model: more features
- Feature engineering: create interaction terms / polynomial features
- Ensemble methods: bagging and boosting
Why would you perform PCA even if you don't have a lot of features?
- If the model is overfit
- If the features have high multicollinearity
In what cases would you NOT use PCA?
-Since PCA distorts the interpretability of our features, we should not use PCA if our goal is to interpret the output of our model. -If we have relatively few features as inputs, PCA is unlikely to have a large positive impact on our model.
What is the geometric interpretation of an eigenvalue?
An eigenvector points in a direction that is only stretched, not rotated, by the transformation, and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed. Even in a multidimensional vector space, the eigenvector is not rotated.
How can you assess whether your model is overfitting?
1. Split the data into training and test sets. 2. Evaluate the model on the training set. 3. Evaluate the model on the test set. If the model's R-squared (or other) score is much better on the training set than on the test set, the model is likely overfitting.
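A minimal sketch of that train/test comparison with scikit-learn (an unconstrained decision tree on synthetic data, chosen because it overfits visibly):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=20, noise=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)

# A large gap (train score much higher than test score) signals overfitting.
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
```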
What is the definition of gradient?
A gradient simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning.
Explain neural networks to our CEO.
A neural network makes predictions by learning the different interactions in our data. For example, weather has a lot of features: humidity, temperature, cloud cover, past events, etc. A neural network can help us predict what the weather will be two days from now based on how those features have interacted in the past. If I enter the weather data from the past five years, the neural network will learn from it. It will recognize that two days after it's 95 degrees, sunny, and low humidity, more often than not it will still be 95 degrees, sunny, with low humidity. However, if it's 35 degrees, humid, and cloudy, then in two days it will more likely be 30 degrees, humid, and snowing. So it weighs different combinations of the features to make accurate predictions.
What is bagging?
Bootstrap aggregating. Decision trees are powerful machine learning models, but they have limitations; in particular, trees that are grown very deep tend to learn highly irregular patterns (i.e., they overfit their training sets). Bagging (bootstrap aggregating) mitigates this problem by fitting many trees to different bootstrapped sub-samples of the training set and then averaging (or voting over) their predictions.
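A minimal scikit-learn sketch of bagging decision trees directly (BaggingClassifier's default base estimator is a decision tree; Random Forest does this internally and also subsamples features at each split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
# Default base estimator is a decision tree; each one sees a bootstrap sample.
bagged_trees = BaggingClassifier(n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```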
What are the pros and cons of bagging, boosting, and stacking?
Bagging:
- Objective: increase generalization power.
- Split the training data into random subsets (sampling with replacement); each underlying model has the same voting power, and the ensemble output is the average of the underlying models.
- Example: Random Forest is bagging applied to decision trees.
- Pros: helps when we face variance or overfitting in the model, since combining N learners of the same size on the same algorithm averages out the high variance introduced by the overlapping bootstrap samples.
- Cons: not helpful in case of bias or underfitting in the data; can still carry the bias of the base learner; averaging smooths over the highest and lowest predictions, which may differ widely.
Boosting:
- Objective: increase accuracy.
- Train a model on the whole training data, then train the next model on the errors (residuals) of the previous model; repeat until convergence.
- Examples: AdaBoost and XGBoost are variants of boosting.
- Pros: reweights higher-accuracy and lower-accuracy samples and combines the results; the net error is re-evaluated at each learning step; works well with interactions; helps when dealing with bias or underfitting in the data; multiple variants are available (AdaBoost, LPBoost, XGBoost, GradientBoost, BrownBoost).
- Cons: often ignores overfitting or variance issues in the data set; increases the complexity of the classifier; time and computation can be expensive.
Stacking:
- Objective: increase model accuracy and generalization power.
- Train each underlying model on the whole training data, then train another model (e.g., logistic regression) to learn how to best combine their outputs.
- Pros: can yield improvements in performance; reduces variance and creates a more robust model by combining the predictions of multiple models.
- Cons: stacked models can take significantly longer to train, require more memory, and are slower and more computationally expensive at prediction time.
How would you evaluate a classification model? What metrics other than accuracy might you use, and why?
Confusion matrix:
- Recall / sensitivity: correctly classified positives relative to actual positive cases; measures how good the model is at catching positive classes. In tumor detection recall matters most, since a false negative could risk a person's life.
- Precision / positive predictive value: true positives relative to predicted positives; measures how good the model is when the prediction is positive. In fraud detection, low precision means we give a lot of customers headaches by flagging more transactions as fraudulent than actually are.
- F1 score: the harmonic mean of precision and recall; 1 indicates perfect precision and recall, so the higher the F1 score, the better the model.
Other things to note:
- Specificity / true negative rate: correctly classified negatives relative to actual negative cases; the counterpart of sensitivity.
You have a dataset of very few training examples. You have a much much bigger set of outside data for which you're trying to make predictions. How do you determine whether your small training set is large enough to build a model and reliably deploy it to a much larger outside set?
blank
What types of regression models do you know and what are the differences among them?
Logistic regression: used when the dependent variable is binary; a classification method. Linear regression: fits the best line through the data points to predict numerical values; easy to understand, and you clearly see the biggest drivers of the model. Polynomial regression: fits a curve of best fit to improve on the R2 score of linear regression. Ridge: a modification of linear regression that penalizes the model for the sum of the squared values of the weights. Lasso: a modification of linear regression that penalizes the model for the sum of the absolute values of the weights. Elastic Net: a hybrid of Lasso and Ridge that includes both the absolute-value and squared penalties; useful when there are multiple correlated features.
What is your favorite use case for a machine learning model?
Naive Bayes classifier: analyzing sentiment in product reviews. K-means clustering: segmenting people based on their interests. Support vector machine (SVM): predicting how likely someone is to click on an online ad, or how many patients a hospital will need to serve in a given period. Linear regression: optimizing product-level price points. Logistic regression: weather applications predicting rainfall and conditions. Decision trees: understanding consumer behaviour on a website.
What do you know about topic modeling and how can you apply it to business problems?
Topic modeling (e.g., Latent Dirichlet Allocation) discovers the themes that occur across a collection of documents; in business it can group support tickets, reviews, or survey responses by theme. It is part of natural language processing (NLP), the field of getting computers to understand language the way humans do. NLP has many applications, including: voice-to-text services for people who are hard of hearing, automated chatbots for organizations, translation services, sentiment analysis (classifying text as having positive or negative sentiment), and spam filters.
What are the advantages of using a random forest over a logistic regression model?
Random forest: pros: robust to outliers, works well with non-linear data, lower risk of overfitting, high-quality models. Cons: biased with categorical variables; predictions are not easy to interpret. Logistic regression: pros: easier to interpret, good for simple data sets. Cons: tendency to overfit with many features, can't solve non-linear problems, not as high performing.
How do recurrent neural networks (RNNs) work?
Recurrent neural networks are a special type of neural network in which the output of a layer at one time step is fed back in as input at the next time step, so the network can learn from past data. Essentially, the network learns data that follows a sequence.
What is bias?
Selection bias occurs when the sample obtained is not representative of the population intended to be analysed. Model bias is error introduced by over-simplifying the machine learning algorithm; it can lead to underfitting, because the model makes simplified assumptions to make the target function easier to learn. Low-bias machine learning algorithms: decision trees, k-NN, SVM. High-bias machine learning algorithms: linear regression, logistic regression.
When bootstrapping, how would you estimate the error between your estimate and the true population value as a function of your sample size? (This relies on the Central Limit Theorem.)
The standard error of an estimator is its standard deviation; it tells us how far the sample estimate typically deviates from the true parameter. By the Central Limit Theorem, the estimated standard error of the sample mean uses the sample standard deviation S in place of the population standard deviation: SE = S / sqrt(n). For example, with a sample of size 30, sample mean 228.06, and sample standard deviation 166.97, the estimated standard error of the sample mean is 166.97 / sqrt(30) = 30.48.
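A hedged NumPy sketch comparing the analytic standard error (S / sqrt(n)) with a bootstrap estimate (the sample here is simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=200, size=30)     # placeholder sample of n = 30

# Analytic estimate via the Central Limit Theorem: SE = s / sqrt(n)
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))

# Bootstrap estimate: standard deviation of the means of many resamples (with replacement)
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(5000)]
se_bootstrap = np.std(boot_means, ddof=1)

print(se_formula, se_bootstrap)   # the two should be close
```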
What are time series data?
Time series data is data that is collected at different points in time. Because data points in time series are collected at adjacent time periods there is potential for correlation between observations.
Let's say you're building a neural network model to classify images. Your image dataset is very small (say only 5 images). You need more data to build a relatively reliable model but there's no place for you to get more data from. What would you do?
blank
What is the bias-variance tradeoff?
Variance is error introduced by an overly complex model that learns noise from the training data set and performs badly on the test data set; it can lead to high sensitivity and overfitting. Normally, as you increase the complexity of your model, you see a reduction in error due to lower bias in the model. However, this only happens up to a point: as you continue to make your model more complex, you end up overfitting, and your model starts suffering from high variance. The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance; in general, increasing the bias decreases the variance, and increasing the variance decreases the bias.
Under the hood, how is a linear regression model created? Assume there is only one X variable.
blank
What are the drawbacks of data augmentation?
blank
What is the relationship among CNNs, RNNs and LSTMs?
blank
What type of data are neural networks best for?
blank
What types of neural networks are you familiar with?
blank
What use cases are good applications of neural networks?
blank
Why are CNNs good for image data?
blank
Aside from the middle observation in a dataset, how would you describe the median? (Answer: It is the quantity that minimizes the mean absolute error.)
blank
What is bootstrapping?
Bootstrapping is random resampling with replacement. We bootstrap when fitting bagged decision trees so that we can fit multiple decision trees on slightly different sets of data; bagged decision trees tend to outperform single decision trees. Bootstrapping can also be used to conduct hypothesis tests and generate confidence intervals directly from resampled data.
What is a standard deviation?
The standard deviation is the average amount of variability in a data set: it tells you, on average, how far each value lies from the mean.
What is the algebraic interpretation of an eigenvalue?
Algebraically, an eigenvalue λ of a matrix A is a root of the characteristic polynomial det(A − λI) = 0. Its algebraic multiplicity is the largest integer k such that (λ − λi)^k divides that polynomial evenly.
When are RNNs used?
natural language processing and time series tasks: -Text-to-speech recognition -Predicting the next word in a sentence -Time series prediction
Given summary statistics, which variable should we drop from our regression model so we don't have multicollinearity?
Rather than simply dropping a variable, we can transform some of the highly correlated variables to make them less correlated while still retaining their information; this lowers the VIF score (e.g., change "year built" to "age of home"). Principal Component Analysis (PCA) is commonly used to reduce the dimension of the data by decomposing it into a number of independent factors, but this makes interpreting the coefficients difficult. So we should try to reduce the correlation by selecting the right variables and transforming them if needed; if we must drop one, drop the variable with the highest VIF (or the one least related to the target).
Given a function, how do you find its maxima or minima?
Where the slope is zero. To find this, take the derivative, set it equal to zero, and solve; the sign of the second derivative tells you whether the point is a maximum or a minimum.
Can you run linear regression on both numerical and categorical features?
Yes, but you have to dummy (one-hot encode) the categorical variables first.
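A small pandas sketch of one-hot encoding (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "sqft": [1500, 2200, 1800],
    "neighborhood": ["North", "South", "North"],   # categorical feature
})

# drop_first=True avoids the dummy-variable trap (perfect multicollinearity).
X = pd.get_dummies(df, columns=["neighborhood"], drop_first=True)
print(X)
```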
In a level of technical detail that's most comfortable to you, describe to me how DBSCAN works and in what cases would you choose it over Kmeans.
DBSCAN works on the density of points: the idea is that if a particular point belongs to a cluster, it should be near lots of other points in that cluster. This allows us to:
● detect areas of high density and low density,
● detect cluster patterns that k-means might not be able to detect,
● work well on noisy datasets.
Choosing DBSCAN over k-means depends on the data and the objective. K-means: I want to cluster crime data by location to form pseudo-neighborhoods. DBSCAN: I want to find hot spots of crime. K-means tends to create similar-sized clusters no matter how the data are scattered.
What is the difference between parameters and hyperparameters?
Model parameters are learned by the model itself: for example, 1) the weights or coefficients of the independent variables in a linear regression model, 2) the weights or coefficients of the independent variables in an SVM, 3) the split points in a decision tree. Model hyperparameters are set by hand to tune model performance: for example, 1) the kernel and slack penalty (C) in an SVM, 2) the value of K in KNN, 3) the depth of a decision tree. If you have to specify a value manually, it is probably a hyperparameter.
How does backpropagation work in neural nets?
blank
How would you detect anomalies in a dataset? How would you deal with them?
- Z-score: any data point more than 3 standard deviations from the mean is very likely to be anomalous.
- Boxplot: any data points above or below the whiskers can be considered outliers.
- Scatterplot / DBSCAN: data points that do not belong to any cluster.
- Isolation forest: fits a model specifically to flag anomalous points.
Then investigate the outliers. If an outlier is bad data or a wrong calculation, drop or (preferably) correct it, and take notes to log the change. Outliers may also be due to random variation or may indicate something scientifically interesting.
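A minimal z-score sketch with NumPy (the 3-standard-deviation cutoff is a common rule of thumb, not a universal threshold; the data and injected outliers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=500), [95, 120])  # two injected outliers

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]    # points more than 3 standard deviations out
print(outliers)
```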
If I were to ask you to use clustering on some of the features in your Ames housing dataset which I see you used for a project, how would you preprocess the data before applying the clustering?
1. Check and correct for outliers and missing data, since k-means is very sensitive to outliers. 2. Transform and scale the data so distributions are more symmetrical and everything is on the same scale. 3. Reduce features to the top numeric predictors with no collinearity (k-means only works with numerical variables), or plan to use k-prototypes clustering for mixed data. 4. Scatter plot those variables to get an idea of the value of k to use.
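A hedged sketch of that preprocessing pipeline with scikit-learn (the column names and values are placeholders, not the actual Ames schema, and k = 2 is arbitrary):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Placeholder numeric features standing in for top Ames predictors.
df = pd.DataFrame({"gr_liv_area": [856, 1262, 1786, 2198],
                   "overall_qual": [5, 6, 7, 8],
                   "year_built": [1961, 1958, 2001, 2005]})

features = df.dropna().copy()                        # handle missing values first
features["age"] = 2010 - features["year_built"]      # example transformation
features = features.drop(columns="year_built")

# Scale, then cluster: k-means is distance-based, so scaling matters.
model = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(features)
print(labels)
```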
What are the assumptions of a simple linear regression model?
1. Linear relationship: there is a linear relationship between the features (X) and the target (Y). Check with a scatter plot. 2. Independence: observations are independent of each other. Check correlations, e.g., with a seaborn heatmap. 3. Homoscedasticity: the variance of the residuals is the same for any value of X. Check with a scatter plot of residuals against predictions. 4. Normality: for any fixed value of X, the residuals are normally distributed. Check with a histogram of residuals.
How do decision trees work?
A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. Example: say the question is whether to hire someone, and the first (root) node is experience. If the applicant is really experienced you continue; if not, you say no. But experience isn't all you're looking for: the next node on the tree asks whether the person is a proper fit for the company culture. If the experienced person fits the culture you continue; if they don't, you say no. The next nodes might be whether they are a team player, whether they have a growth mindset, and then their salary requirements. Each node has two branches, and you continue until you reach the end, the leaves. That is how a decision tree works.
What is the difference between R squared and adjusted R squared?
Both are evaluation metrics for regression problems. R-squared measures the degree to which the variance in the dependent variable (target) can be explained by the independent variables (features): the higher the R-squared, the more variation is explained by the inputs, and hence the better the model. For example, an R-squared of 0.7 means 70% of the variation in the target is explained by the features. One drawback of R-squared is that it assumes every variable helps explain the variation in the target, which might not always be true. Adjusted R-squared measures the variation in the target explained by only the features that actually help with prediction; unlike R-squared, it penalizes you for adding features that are not useful for predicting the target.
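The penalty can be written as a one-line formula; a quick worked example with made-up numbers (n = 100 observations, p = 5 features, R^2 = 0.70):

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n, p, r2 = 100, 5, 0.70
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))   # 0.684, slightly lower, since each feature costs a degree of freedom
```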
If you're comparing decision trees and logistic regression, what are the pros and cons of each?
Decision trees: pros: automatically take into account interactions between variables; less data prep, since values don't need to be normalized/scaled and missing values are OK; easy to understand. Cons: likely to overfit; unstable, since a small change in the data can cause a large change in the tree; not powerful enough for complex data. Logistic regression: pros: easier to interpret, good for simple data sets. Cons: tendency to overfit with many features, can't solve non-linear problems, not as high performing.
Explain linear regression from a point of view that our CEO could understand.
I won $100 on Monday, $200 on Tuesday, and $300 on Wednesday; how much will I win on Thursday? If you answered $400, you just did linear regression: you made a prediction based on some information. One key feature of linear regression is that the prediction is a straight line, drawn to get as close as possible to all of the past data points. For example, picture a graph where the line along the bottom is the day (Monday, Tuesday, Wednesday) and the line up the side is the winnings ($100, $200, $300). If we plot those points and draw a straight line through them, we can predict that on Thursday we will win $400.
When would you use PCA?
Use PCA if you want to reduce the number of variables but can't identify which ones to remove entirely, if you want your resulting variables to be independent of one another, and if you are comfortable making your predictors less interpretable.
How would you typically handle missing data?
Investigate to decide whether deleting or imputing null values is the best course of action. If the data are missing completely at random and they are a small portion of the total dataset, I would delete them. However, missing values are rarely missing completely at random, so doing this without thorough investigation can result in a biased dataset. When there are few missing observations, we can impute null values with the mean or median for numerical features or the mode for categorical features. However, when many values are missing, mean or median imputation can reduce the variation in the data. Depending on the data, we can also use logistic regression, linear regression, or random sample imputation to impute.
What are precision, recall, F1 score, and AUC ROC? When would you use each of them?
Precision, recall, and F1 share the same use case: use them when you care more about the positive class or when the data are imbalanced.
- Precision: when you predict positive, how often you are actually right.
- Recall: out of the actual positives, how many you predicted correctly.
- F1 score: the harmonic mean of precision and recall, so it takes both false positives and false negatives into account.
ROC/AUC share the same use case: use them when you ultimately care about ranking predictions (rather than outputting well-calibrated probabilities) and you care roughly equally about the positive and negative classes.
- ROC: a curve of true positive rate against false positive rate across probability thresholds; it helps analyze how the classifier's behaviour changes with the threshold and choose an acceptable false positive rate.
- AUC: the area under the ROC curve, a measure of separability; it tells how well the model distinguishes between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s (by analogy, distinguishing patients with and without a disease), and it can be used as a summary of model skill to compare two models.
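A minimal scikit-learn sketch computing all four metrics on a held-out set (synthetic, deliberately imbalanced data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)  # imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]

print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1       :", f1_score(y_test, pred))
print("roc auc  :", roc_auc_score(y_test, proba))
```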
Why might a random forest be better than a decision tree? (Note: What they were looking for is to talk about how random forests work.)
Random forest: an ensemble of decision trees, so the individual errors of the trees are averaged out and overall variance and error are reduced. Pros: robust to outliers, works well with non-linear data, lower risk of overfitting, high-quality models. Cons: a black box (it is hard to know what is happening, and predictions are not easy to interpret); biased with categorical variables. Decision trees: pros: automatically take into account interactions between variables; less data prep, since values don't need to be normalized/scaled and missing values are OK; easy to understand. Cons: likely to overfit; unstable, since a small change in data can cause a large change in the tree; not powerful enough for complex data.
What is the difference between a regression problem and a classification problem? Provide an example of each.
Regression: continuous output variable = a real value, such as an integer or floating point value. These are often quantities, such as amounts and sizes. Common algorithms: linear regression, support vector regression, and regression trees. Regression can be evaluated using root mean squared error, whereas classification cannot. Ex: predicting home prices. Classification: discrete output variables = labels or categories. Common algorithms: logistic regression, naive Bayes, decision trees, and k-nearest neighbors. Classification can be evaluated using accuracy, whereas regression cannot. Ex: distinguishing between apples and oranges.
Aside from the sum of the observations divided by the number of observations, how would you describe the mean? (Answer: It is the quantity that minimizes the sum of squared errors.)
blank
How is standard deviation used to construct a confidence interval?
blank
Suppose you have an X variable that impacts Y differently depending on another* X variable. (For example: the dosage of a medication impacts your overall recovery time differently if you are a man or woman.) How do you account for this in a linear regression?
Include an interaction term (the product of the two X variables, e.g., dosage x sex) in the regression. When a regression equation includes an interaction term, the first question is whether the interaction term contributes in a meaningful way to the explanatory power of the equation. You can answer that by assessing the statistical significance of the interaction term and by comparing the coefficient of determination with and without it. If the interaction term is statistically significant, it is probably important; and if the coefficient of determination is also much bigger with the interaction term, it is definitely important. If neither of these outcomes is observed, the interaction term can be removed from the regression equation. In the dosage example, we would expect to keep the interaction term.
What is the difference between precision and recall?
Precision quantifies the proportion of positive predictions that are actually correct: Precision = TruePositives / (TruePositives + FalsePositives). Recall quantifies how many of the actual positives the model correctly identified, so unlike precision it gives an indication of missed positive predictions: Recall = TruePositives / (TruePositives + FalseNegatives). Comparing the two helps us pick the threshold that lets the model most efficiently distinguish between the two classes.
How does logistic regression work?
Logistic regression uses a link function to effectively "bend" our line of best fit into a curve of best fit that matches the range of values we're interested in (between 0 and 1). This lets us predict the probability of success.
What is regularization? Why would you use it?
When a model tries too hard to capture the noise in the training dataset, it is overfitting: it is too complex and won't make accurate predictions on new data. The properly fit model is in the Goldilocks zone, neither too simple nor too complicated. Regularization techniques discourage learning an overly complex or flexible model, helping us find that Goldilocks zone without a substantial increase in bias or making the model too generalized. L1 (Lasso) shrinks some coefficients all the way to zero, effectively reducing the dataset to only the most important features affecting the target variable; it does well when a few variables are much better predictors of the target than the rest. L2 (Ridge) doesn't completely prune features like Lasso, but shrinks the magnitude of the coefficients while letting the more predictive features keep larger ones so they can capture more of the signal. It's a good tool for handling multicollinearity when you must keep all your predictors, since it shrinks the predictors but never eliminates them, and it works well when there are many predictors of about the same magnitude (all with similar power to predict the target).
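A minimal scikit-learn sketch of L1 vs L2 (the alpha values are illustrative only, and features are scaled before the regularized regressions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# 20 features, but only 5 actually carry signal.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# Lasso zeroes out unhelpful coefficients; Ridge only shrinks them toward zero.
print("lasso zero coefs:", sum(c == 0 for c in lasso[-1].coef_))
print("ridge zero coefs:", sum(c == 0 for c in ridge[-1].coef_))
```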
Is there a model that you tend to use frequently in your projects? Please explain how that model works.
blank
Why might accuracy not be the best error metric for classification?
Accuracy summarizes the performance of a classification model as the number of correct predictions divided by the total number of predictions. It fails on classification problems with a skewed class distribution, because a model that always predicts the majority class can still look accurate. It also doesn't account for real-world costs: our definition of "good" depends on the problem we are solving. A confusion matrix helps show the true cost/benefit of each kind of error.
If a client who is not a math/data person were to ask you what are MCMC methods, how would you describe it to them?
MCMC is made up of two pieces: Monte Carlo and Markov chains. Monte Carlo: a simple simulation, e.g., having a computer simulate a card game over and over to figure out the probability of winning. Markov chain: a sequence of steps where the chance of the next step depends only on where you are now, not on the whole history (like a board game where your next move depends only on your current square). MCMC methods combine the two: they wander through possible values in a Markov-chain fashion and use the simulated draws to approximate quantities, like averages, that are too hard to compute directly.
Given three columns containing some missing data (one categorical and two quantitative columns), how would you impute values for a fourth column, where this fourth column can take on only positive integer values?
- Mean/median: only for numerical values; works at the column level and doesn't take into account interactions with other features.
- Most frequent / mode: works with categorical data; can introduce bias and doesn't factor in correlations between features.
- k-NN imputation: usually more accurate than the above, but sensitive to outliers.
- Model-based / deep learning imputation: can handle categorical data with feature encoding.
Since the fourth column only takes positive integer values, round the imputed values to the nearest valid integer (or just use the mode).
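A hedged scikit-learn sketch of two of these strategies on toy data (the column names and values are made up; KNNImputer uses the other columns when filling gaps):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"age":    [25, 32, np.nan, 41],
                   "income": [40_000, np.nan, 55_000, 70_000]})

median_imputed = SimpleImputer(strategy="median").fit_transform(df)
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)   # uses the other columns
print(median_imputed)
print(knn_imputed)
```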
How would we know if we have any multicollinearity in our model?
Multicollinearity occurs when the independent variables (features) are highly correlated with each other; it makes the coefficient estimates unstable and can contribute to overfitting, where the model does great on the known training set but fails on the unknown test set. To detect it: start with a correlation matrix / heatmap, but this won't show collinearity that exists among 3 or more variables. The variance inflation factor (VIF) can determine whether independent variables are collinear with each other: a VIF of 1 means a feature is not correlated with the others, and higher numbers mean stronger correlation. If VIF returns a number greater than about 5, those features should be reduced, e.g., by dropping one or combining them with PCA.
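A minimal statsmodels sketch of computing VIF per feature (assuming statsmodels is installed; the DataFrame here is synthetic, with "sqft" and "rooms" deliberately correlated to trigger high VIFs):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
X = pd.DataFrame({"sqft": rng.normal(1500, 300, 100)})
X["rooms"] = X["sqft"] / 250 + rng.normal(0, 0.5, 100)   # deliberately correlated with sqft
X["age"] = rng.normal(40, 10, 100)

Xc = add_constant(X)   # include an intercept when computing VIF
vif = pd.Series([variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
                index=Xc.columns)
print(vif)   # high values for sqft/rooms flag the multicollinearity
```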
What are some disadvantages to time series analysis methods?
The entire process is affected by time, some series more quickly and others more slowly, so we need an idea of how the model will behave as conditions change. Every time-series prediction tries to capture human or system behaviour, and if the underlying system changes its logic, a trained model's performance will decay over time: some models decay fast, others slowly. To track this, monitor the model's performance over time and retrain it when performance drops below a threshold.