Stats & Trees

What is a confidence interval and what does it represent?

- A confidence interval with a C% confidence level comes from a procedure that produces an interval containing the true population parameter C% of the time.
- Meaning if we repeated the sampling and built the interval over and over, for example 100 times, we would expect about C of the 100 intervals to contain the true population parameter.
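
The repeated-sampling reading can be checked with a small simulation. This is only a sketch: the population (normal, mean 10, standard deviation 2), the sample size and the number of repeats are made-up values, and the interval assumes the population standard deviation is known.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
mu, sigma, n = 10.0, 2.0, 50             # hypothetical population and sample size
confidence = 0.95
z = norm.ppf(1 - (1 - confidence) / 2)   # about 1.96 for a 95% interval

n_repeats = 2000
samples = rng.normal(mu, sigma, size=(n_repeats, n))
means = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)      # known-sigma interval half width

# For each repeat, does the interval around the sample mean contain mu?
covered = (means - half_width <= mu) & (mu <= means + half_width)
coverage = covered.mean()
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should land close to 0.95, which is what the confidence level promises about the procedure rather than about any single interval.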

What is linear regression ?

- Algorithm that uses a linear combination of features to predict a continuous target
- Weights are adjusted to reduce the error between the linear combination and the actual target value

How do we detect multicollinearity? Under what circumstances can we overlook multicollinearity ?

- Compute the Variance Inflation Factor (VIF). Loop through the features, and for each feature i fit a regression model that uses feature i as the target and the remaining features as predictors. The VIF of feature i is 1/(1 - R_i^2), where R_i^2 is the R^2 of that regression. VIF > 10 is generally regarded as a sign of collinearity.
- We can overlook multicollinearity if we are only concerned with predictions
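
The VIF loop described above can be sketched with plain NumPy. The helper name `vif` and the toy data are illustrative assumptions, not part of the original card; libraries such as statsmodels ship their own implementation.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        target = X[:, i]                            # feature i becomes the target
        others = np.delete(X, i, axis=1)            # all remaining features
        A = np.column_stack([np.ones(n), others])   # add an intercept column
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - resid.var() / target.var()         # R^2 of this regression
        out[i] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.05 * rng.normal(size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)               # unrelated feature
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this toy data the first two features should show very large VIFs, while the unrelated third feature stays close to 1.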

Briefly describe the principles behind (Lasso and Ridge) regularization? What are the differences between Lasso and Ridge regularization?

- The cost function is the squared error + regularization parameter * sum of abs(coefficients) for L1 (Lasso), or + regularization parameter * sum of squared coefficients for L2 (Ridge).
- Lasso regularization tends to drive coefficients exactly to 0, whereas ridge regularization tends to shrink coefficients towards 0 without making them exactly 0.
- Lasso is better for truly sparse models.
- Lasso / Ridge regularization can be thought of as a kind of Bayesian estimate of the beta coefficients. Lasso (L1) corresponds to a Laplacian prior on the coefficients. Ridge (L2) corresponds to a Gaussian prior on the coefficients.
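
To make the two cost functions concrete, here is a minimal sketch. The function name `penalized_cost` and the tiny dataset are hypothetical, and the intercept is left out for brevity (in practice it is usually not penalized).

```python
import numpy as np

def penalized_cost(X, y, beta, lam, penalty):
    """Squared error plus an L1 ("lasso") or L2 ("ridge") penalty."""
    residuals = y - X @ beta
    sse = float(residuals @ residuals)
    if penalty == "lasso":
        return sse + lam * float(np.abs(beta).sum())
    if penalty == "ridge":
        return sse + lam * float((beta ** 2).sum())
    raise ValueError("penalty must be 'lasso' or 'ridge'")

X = np.array([[1.0], [2.0]])
y = np.array([1.0, 2.0])
beta = np.array([2.0])                  # deliberately too large
lasso_cost = penalized_cost(X, y, beta, lam=0.5, penalty="lasso")  # 5 + 0.5 * 2
ridge_cost = penalized_cost(X, y, beta, lam=0.5, penalty="ridge")  # 5 + 0.5 * 4
print(lasso_cost, ridge_cost)
```

The squared term makes ridge penalize large coefficients more heavily than lasso, while lasso keeps a constant pull on small coefficients, which is one way to see why it can push them all the way to 0.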

What are the steps one can take to correct for the violated assumptions ?

- Heteroscedasticity can be reduced by taking log or other transformation of the features / target variables
- Non-normality of residuals can be reduced by taking log or other transformation of the features / target variables
- Dependence can be taken into account by including lag terms or more complex time series models
- Non-linearity can be taken into account by including polynomials, interactions, or some other transformation

What does it mean in multiple regression when a point has high leverage? What constitutes an influential point? Why are they important?

- High leverage means some feature(s) of a data point deviate far from the mean value of the feature(s) across data points
- Influence = Leverage x Residual
- High leverage and high residual data points are influential, i.e. they affect the model (𝛽s) a lot when absent / present
- Important because you do not want a few data points controlling your model, and if they do, you should know why

Given you have trained and tested a model with cross validation and you deploy the model and it performs significantly more poorly than in cross validation. Provide some reasonable explanations.

- Likely you are overfitting the model.
- Assumptions about the data have changed, e.g. the same feature no longer affects the outcome in the same way
- Target leak: some feature(s) in the trained model are not attainable in the same way when the model is deployed. For example, we might have sorted the data by outcome and assigned IDs to it when we trained the model, but the IDs are not assigned in the same way when we are trying to predict the outcome at deployment.
- Would look at some measure of feature importance / plots and re-examine the sensibility of the trained model

What is the difference between logistic regression and linear regression ?

- Linear regression is used when the target is a continuous variable, and logistic regression is used for classification problems.
- Logistic regression uses the logit (log-odds) link to model a discrete, binary target
- Logistic regression models the probability of obtaining the positive class (y=1)

Describe the 2 ways to decide the parameters for the given distribution that best fit the dataset?

- Maximum Likelihood Estimation: Compute the likelihood of observing all the data points given a certain distribution and its parameters (product of the PMF / PDF of all the data points). Select the parameters that yield the highest likelihood.
- Method of Moments: Estimate the parameters from the sample statistics (mean, variance...) by deriving equations that relate population moments (expected values of powers of the random variable under consideration) to the parameters of interest.

Coefficient of features (in logistic regression) is the opposite sign of what you expect. What might be the problem ?

- Multicollinearity
- Assumption validity
- Outliers
- Confounding variables

What is multicollinearity, and why is it a problem?

- Multicollinearity is when two or more features are linearly correlated with each other
- The coefficients of the regression model become unstable and do not reflect the true coefficients. Removing or adding a data point might change the coefficients significantly

How do you use logistic regression to tackle a classification problem with multiple labels?

- One-vs-rest: Fit one binary logistic regression model per class (that class vs all the others) and predict the class whose model gives the highest probability.
- Multinomial logistic regression (softmax regression): Modified version of logistic regression that predicts a multinomial probability (i.e. more than two classes) for each input example.

Can you compare the beta coefficients in multiple linear regression? If not, what will allow us to ?

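- Not directly, because the size of each coefficient depends on the scale of its feature.
- Normalizing (subtract the mean and divide by the standard deviation column-wise) the features and target makes the coefficients comparable
- Since the variables are normalized, we do not need to fit a bias (𝛽0) term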

What are some causes of over-fitting? How to avoid them?

Causes:
- Overly complex model, too many features
- Too many polynomial terms / transformations of features
- The result is inaccurate predictions due to over-learning from noise and edge cases in the data
How to avoid:
- Regularization
- Feature selection
- Use K-fold cross validation to decide the complexity of the model

What are some causes of under-fitting? How to avoid them? What disadvantages do they bring?

Causes:
- Overly simplistic model, not enough features
- Not enough polynomial terms / transformations of features
Disadvantages:
- Can't interpret coefficients
- Inaccurate predictions due to under-learning from the data
How to avoid:
- Add more features and use a more complex (non-linear) model

How to determine how many features to select for your model? What difference does it make if you are fitting on a large (does not fit into memory) dataset?

- Product knowledge
- Cross validation
- Forwards / Backwards elimination (Not feasible with a big dataset, computationally intensive)
- Regularization
- Adjusted R^2 with K-fold cross validation (F1 score with K-fold if logistic regression)
- Random forest feature importance

You have built a logistic regression model to predict fraud, how do you decide the threshold for deciding fraud or not fraud?

- Profit curve
- Associate a cost with false positives and a benefit with true positives
- Compute net earnings at each threshold along either the ROC / Precision-Recall curve
- Pick the threshold with the highest net earnings
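
A profit curve can be sketched in a few lines. Everything numeric here is invented for illustration: the labels, the model scores, the candidate thresholds, the benefit of 100 per caught fraud and the cost of 30 per false alarm.

```python
def profit_at(threshold, y_true, scores, benefit_tp, cost_fp):
    """Net earnings when every score >= threshold is flagged as fraud."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    return tp * benefit_tp - fp * cost_fp

y_true = [0, 0, 1, 1, 0, 1]                 # 1 = fraud (hypothetical labels)
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]    # hypothetical predicted probabilities
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

profits = {t: profit_at(t, y_true, scores, benefit_tp=100, cost_fp=30)
           for t in thresholds}
best_threshold = max(profits, key=profits.get)
print(profits, best_threshold)
```

With these made-up costs the best threshold is well below 0.5, because missing a fraud case is assumed to be more expensive than a false alarm.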

What are the methods to test the assumptions ?

- QQ plot of the residuals to check for normality
- Graphing residuals to check for patterns (heteroskedasticity and nonlinearity)

What are the assumptions of a linear regression model ?

- Relationship between X and y is linear
- Errors are independent (lack of autocorrelation)
- Errors are normally distributed
- Errors are homoskedastic
- No multicollinearity

What are some ways to regularize a logistic regression model?

- Ridge regularization (L2)
- Lasso regularization (L1)
- Or elastic net regularization
- Principal component analysis
- Better feature selection!

Interpretation of logistic regression model coefficients

- Say 𝛽1 = -0.0621. An increase in X1 by 1 unit multiplies the odds by exp(-0.0621) = 0.94, i.e. the odds decrease by about 6%.
- Say 𝛽1 = 0.0621. An increase in X1 by 1 unit multiplies the odds by exp(0.0621) = 1.06, i.e. the odds increase by about 6%.
- Say 𝛽1 = 0.0621 and feature 1 is log2 transformed. An increase in X1 by a factor of 2 (a 1 unit increase in log2(X1)) multiplies the odds by exp(0.0621) = 1.06.
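
The multiplicative factors follow from exponentiating the coefficient, since the model is linear in the log-odds. A quick check (the variable names are only for illustration):

```python
import math

beta_1 = 0.0621

# A 1 unit increase in the feature adds beta to the log-odds,
# so it multiplies the odds by exp(beta).
decrease_factor = math.exp(-beta_1)   # coefficient of -0.0621
increase_factor = math.exp(beta_1)    # coefficient of +0.0621
print(round(decrease_factor, 2), round(increase_factor, 2))
```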

Using the cost function, describe two algorithms to fit a linear regression model

- Solve analytically (the normal equation): take the partial derivative of the cost function with respect to B, set it = 0 and solve for B.
- Use gradient descent.
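
Both approaches can be compared on a toy problem. This is a sketch under assumptions: the data (y = 2 + 3x with no noise), the learning rate and the iteration count are arbitrary choices, and the cost is taken to be the mean squared error.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept
y = 2.0 + 3.0 * x                           # noise-free toy target

# 1) Analytical solution: setting the gradient to zero gives (X^T X) B = X^T y
beta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Gradient descent on the mean squared error
beta_gd = np.zeros(2)
learning_rate = 0.1
for _ in range(5000):
    gradient = (2.0 / len(y)) * X.T @ (X @ beta_gd - y)
    beta_gd -= learning_rate * gradient

print(beta_exact, beta_gd)
```

Both should recover an intercept near 2 and a slope near 3. The analytical route needs one linear solve, while gradient descent trades that for many cheap updates, which matters when the data is too large for the solve.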

Why is regularization necessary?

- To adjust for overfitting (when there is high variance in a model).
- To make the model capture the underlying signal but not the noise.

How to spot an outlier in multiple linear regression ?

- Univariate scatter plots / Boxplots
- Normalized residual plot: points more than 2 standard deviations from the mean are considered outliers

What is the cost function for logistic regression?

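- Use MLE: maximizing the likelihood is the same as minimizing the negative log likelihood
- Negative log likelihood = -sum(y_i * log(p_i) + (1 - y_i) * log(1 - p_i)), where p_i is the predicted probability that y_i = 1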
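
A direct translation of that formula, as a sketch (the function name and the example labels and probabilities are made up):

```python
import math

def negative_log_likelihood(y, p):
    """Sum of -[y*log(p) + (1-y)*log(1-p)] over all data points."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p))

confident_and_right = negative_log_likelihood([1, 0], [0.9, 0.1])
confident_and_wrong = negative_log_likelihood([1, 0], [0.1, 0.9])
print(confident_and_right, confident_and_wrong)
```

Confidently wrong predictions are penalized much more heavily than confidently right ones are rewarded, which is the behavior the cost function is meant to have.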

How do you use the proximity matrix to impute missing values?

- First impute missing values with the median of the feature
- Fit the trees and compute the proximity matrix
- Impute each missing value by taking a weighted average (weighted by proximity) of the non-missing values of that feature
- Fit the trees again, recompute the proximity matrix and repeat the process

What is the interpretation of 𝛽? What does 𝛽0 (the bias) mean ?

A 1 unit increase in the feature results in a predicted increase in y of B units (all else held constant). A value of 0 for all features results in a predicted value of y of B0.

What is random forest better/worse than logistic regression at?

A random forest is a nonparametric model. It fits features that have a non-linear relationship with the target better. In a random forest the feature importance does not tell you the direction of a feature's effect on the prediction. The coefficients in a logistic regression, if the assumptions hold, are much more informative. They give both a direction and a measure of strength/impact on the odds.

Pros of logistic regression (LR) over RF:
- High interpretability compared to RF as it is a linear model.
- Less computation than RF. Little hyperparameter tuning is needed apart from the regularization strength (if any) and the threshold for class labels, which is usually chosen as 0.5. In RF the proportion of samples and features drawn for each tree, the depth of the trees, and a split criterion such as information gain or the Gini index have to be chosen.
- Less chance of overfitting, whereas the deep trees in an RF can overfit.

Cons of logistic regression compared with RF:
- Can't learn non-linear decision boundaries and has high bias compared to RF, which has low bias and is flexible enough to learn highly non-linear decision boundaries. The variance in RF is also reduced due to bootstrapping and voting.
- Some of the limitations of linear regression also apply to LR, such as serial correlation (non-independent observations) and multicollinearity, which affect the standard errors of the estimated parameters.
- Can't handle missing values without imputation, whereas some tree-based implementations can handle them directly.
- In LR, features generally need to be scaled and normalized (especially with regularization), unlike RF, which is unaffected by scaling.
- Appropriate features must be selected before fitting the model, unlike RF, which selects features in its decision trees. As an alternative, the logistic regression model can be regularized with lasso to select the features.

How do you calculate the 95% confidence interval of a sample mean?

As long as n > 30, from the central limit theorem we know the sample mean will be distributed approximately normally. Calculate the sample mean. If you know the population standard deviation, the interval is sample mean +/- Z * (population standard deviation)/sqrt(n), where Z = norm.ppf(0.975) (about 1.96). If you don't know the population standard deviation you can estimate it using the sample standard deviation.
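
A minimal sketch of the calculation when the population standard deviation is unknown and the sample standard deviation is plugged in. The helper name and the data are hypothetical; for small samples a t critical value would be the more careful choice.

```python
import math
from scipy.stats import norm

def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation interval for the mean, using the sample sd."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    z = norm.ppf(1 - (1 - confidence) / 2)   # norm.ppf(0.975) for 95%
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

data = [float(v % 7) for v in range(40)]     # made-up sample with n > 30
low, high = mean_confidence_interval(data)
print(low, high)
```

Note that `norm.ppf(0.025)` returns the negative critical value, so using the upper quantile keeps the plus/minus formula readable.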

Explain what precision, recall and specificity are.

- Recall = TP/(TP + FN) = # true positives / # actual positives. Important with cancer detection. Recall alone is not a good metric since predicting everything as positive will yield a recall of 1. Performance on the actual negatives has no impact on recall.
- Precision = TP/(TP + FP) = # true positives / # things predicted as positive. Important with spam detection. If the threshold for predicting positive is high, then precision tends to be high.
- Specificity = TN/(TN + FP) = # true negatives / # actual negatives. Only concerned with performance on the actual negative class. Performance on actual positives has no impact on specificity. Specificity alone is not a good metric since predicting everything as negative will yield a specificity of 1.
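
The three definitions, computed from confusion-matrix counts. The counts below are invented for illustration, and the second call shows why recall alone is misleading.

```python
def classification_metrics(tp, fp, tn, fn):
    """Recall, precision and specificity from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }

# A hypothetical classifier on 10 actual positives and 90 actual negatives
metrics = classification_metrics(tp=8, fp=4, tn=86, fn=2)

# Predicting everything as positive: perfect recall, poor everything else
all_positive = classification_metrics(tp=10, fp=90, tn=0, fn=0)
print(metrics, all_positive)
```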

Why bootstrap data and sample features for each tree?

- Reduce variance
- Reduce correlation between trees
- If trees are correlated, effectively fewer trees are built
- The ensemble effect of the trees is to lower variance
- Sampling of features allows less predictive features to be considered, without the most predictive features dominating

How to find the decision boundary in the feature space for a trained logistic regression model ? (Work through an example when regressed with 1 feature)

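Say p = 0.5 is the boundary between the positive and negative classes, and we have 𝜃 since the model is trained. Set 0.5 = 1/(1 + e^(-𝜃^T X)). This holds exactly when 𝜃^T X = 0. Rearrange the equation to get X on the left hand side and plot the line. With 1 feature, 𝜃0 + 𝜃1 * x = 0, so the boundary is the single point x = -𝜃0/𝜃1.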
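
A worked one-feature example, using made-up values for the trained intercept and weight:

```python
import math

theta_0, theta_1 = -3.0, 1.5        # hypothetical trained intercept and weight

def predict_proba(x):
    """Predicted probability of the positive class for a single feature x."""
    return 1.0 / (1.0 + math.exp(-(theta_0 + theta_1 * x)))

# p = 0.5 exactly when theta_0 + theta_1 * x = 0
x_boundary = -theta_0 / theta_1
print(x_boundary, predict_proba(x_boundary))
```

Points with x above the boundary are classified as positive here because the weight is positive; with a negative weight the sides would swap.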

Describe what statistical power is. Illustrate the concept with an example.

Statistical power, or the power of a hypothesis test, is the probability that the test correctly rejects the null hypothesis, i.e. the probability of accepting the alternative hypothesis if it is true.

Things that will affect power:
- Sample size (n). Other things being equal, the greater the sample size, the greater the power of the test.
- Effect size. A larger effect size increases power. (If the alternative hypothesis is very close to the null hypothesis it will be harder to detect.)
- Variance. Lower variance will increase power.
- Significance level (α). The lower the significance level, the lower the power of the test.

Power = 1 - pr(type 2 error) = pr(reject null hypothesis | alternative is true) = pr(X > critical value | alternative is true) for a one-tailed test.

Example: We flip a coin 100 times. The null hypothesis is a fair coin, so the number of heads is approximately normal with mean 50 and variance 100*0.5*0.5 = 25. The alternative hypothesis is p = 0.6, so the mean is 60 and the variance is 100*0.6*0.4 = 24. Suppose we do a one-tailed hypothesis test at the 5% significance level.

```
from scipy.stats import norm

# Distribution of the number of heads under the null hypothesis
h0 = norm(loc=50, scale=(100*0.5*0.5)**0.5)
critical_value = h0.ppf(0.95)          # about 58.22

# Distribution of the number of heads under the alternative hypothesis
h1 = norm(loc=60, scale=(100*0.6*0.4)**0.5)
power = 1 - h1.cdf(critical_value)     # about 0.64
```

How to avoid overfitting (low bias and high variance)?

You can prune the tree.

Pre-pruning: restrict the tree while it is being grown, for example with
- max_depth: the maximum depth of the tree
- max_leaf_nodes: the maximum number of leaf nodes
- min_impurity_decrease: a node will be split only if the split induces a decrease of the impurity >= this value

Post-pruning:
- Reduce the depth of the tree after fitting. Nodes and subtrees are replaced with leaves to reduce complexity

How do you test a model to ensure it is robust?

You can use cross-validation to see how your model performs on data that is not used in the fitting process.
- Split the data into a train-test split. Put your hold-out set away.
- Split the training data using K-fold to create k training sets and k validation sets. Train your model on each training set and then evaluate it using the metric of your choice on the matching validation set. Average the validation set performance across the k folds. You can use this process to compare models, tune hyperparameters, and try to assess how the model will perform on unseen data.
- When you have a final model that you have tuned, refit your model using all of the training data.
- You can do a final evaluation on the hold-out set.

How would you prove if a coin is unfair ?

You could flip the coin a large number of times (n) and count the number of heads (k). Your null hypothesis would be p = 0.5 (it is a fair coin). Your alternative hypothesis would be p != 0.5. You could find the p-value: assuming the null hypothesis is true, what is the probability you observe something as extreme or more extreme than what you observed? Using the binomial distribution you could calculate your p-value. For k >= n/2, p-value = 2 * (1 - binom.cdf(k-1, n, 0.5)); for k < n/2 use the lower tail instead, 2 * binom.cdf(k, n, 0.5) (cap the result at 1). Depending on the significance level of your choosing you could reject or fail to reject the null hypothesis that it is a fair coin. If p-value < significance level then you would reject the null hypothesis. (Or you could approximate using the normal distribution, since with large n the normal distribution is a good approximation of the binomial distribution.)
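
A sketch of the two-sided binomial p-value that works whichever side of n/2 the count falls on. The helper name and the example of 60 heads in 100 flips are illustrative; doubling the smaller tail is one common convention for a two-sided test.

```python
from scipy.stats import binom

def two_sided_p_value(k, n, p=0.5):
    """Double the smaller tail probability, capped at 1."""
    lower_tail = binom.cdf(k, n, p)       # P(X <= k)
    upper_tail = binom.sf(k - 1, n, p)    # P(X >= k)
    return min(1.0, 2 * min(lower_tail, upper_tail))

p_value = two_sided_p_value(60, 100)      # 60 heads in 100 flips
print(p_value)
```

In this example the p-value comes out slightly above 0.05, so at the 5% level 60 heads out of 100 would not quite be enough to reject the fair-coin hypothesis.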

Given you have a significance level of 0.05 and you have run 20 independent comparisons (t-tests) without correcting your significance level, what is your effective rate of false positives?

Assuming every null hypothesis is true: pr(at least one false positive) = 1 - pr(no false positives) = 1 - (0.95)^20 = 0.64!
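
The same arithmetic for a few different numbers of comparisons, as a small sketch (the function name is made up, and the tests are assumed independent with every null hypothesis true):

```python
def family_wise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20):
    print(m, round(family_wise_error_rate(0.05, m), 2))
```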

How does a decision tree work? (For classification)

Each feature and feature value are evaluated for a split. The feature/value that gives the greatest information gain is chosen. Node impurity can be measured using different criteria such as Shannon entropy or the Gini index. With entropy, information gain = entropy of parent node - weighted average entropy of the child nodes. When the split maximizing information gain is chosen and made, this process is repeated at each of the child nodes. This is repeated until the leaf nodes are pure or there are no other features to split on.
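
Information gain with Shannon entropy can be sketched as follows. The helper names and the toy labels are illustrative, and only a single binary split is considered.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]
weak_gain = information_gain(parent, [1, 1, 1, 0], [1, 0, 0, 0])
perfect_gain = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
print(weak_gain, perfect_gain)
```

A tree-building routine would compute this gain for every candidate feature/value pair and keep the split with the largest value.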

What is entropy?

Entropy is the expected (or average) level of surprise or uncertainty in a variable's possible outcomes. Lower levels of entropy indicate lower levels of uncertainty. Shannon entropy is -sum(p_i * log2(p_i)), summed over the variable's possible outcomes.

How is feature importance computed?

Calculated as the total decrease in node impurity from splits on that feature, weighted by the probability of reaching each node (which is approximated by the proportion of samples reaching that node), and averaged over all trees of the ensemble. The higher the value, the more important the feature. Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values).

Explain the difference between mutual exclusivity and independence.

- Two events are independent if P(X and Y) = P(X) * P(Y). The outcome of one event doesn't affect the other. Independence means there is no relationship between the occurrences of the events.
- Two events are mutually exclusive if P(X and Y) = 0. Mutually exclusive means the occurrence of one event precludes the other.
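
A single roll of a fair die gives a concrete example of both ideas. The events chosen here are an illustration and are not taken from the original card.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}            # one roll of a fair die

def prob(event):
    """Probability of an event (a set of outcomes) under a fair die."""
    return Fraction(len(event & outcomes), len(outcomes))

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
odd = {1, 3, 5}

# Independent: P(even and <=4) = 1/3 = P(even) * P(<=4) = 1/2 * 2/3
independent = prob(even & at_most_four) == prob(even) * prob(at_most_four)

# Mutually exclusive: even and odd can never happen together
mutually_exclusive = prob(even & odd) == 0

# Mutually exclusive events with non-zero probabilities are not independent
even_odd_independent = prob(even & odd) == prob(even) * prob(odd)
print(independent, mutually_exclusive, even_odd_independent)
```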

What are the differences between running linear regression on a large (does not fit into memory) vs a small dataset?

Optimization:
- Cannot do optimization on the whole dataset at once
- Use stochastic / mini-batch gradient descent
- Compute the gradient and update the parameters (𝛽s) on one / a few data points at a time until the cost converges

Interpretation:
- p-values of the 𝛽s are no longer very meaningful: even tiny effects will be statistically significant given large enough data
- The normality assumption matters less (with a large dataset the coefficient estimates are approximately normal anyway)
- Usually cannot examine a residual plot for diagnostics (unless you sample data points)

What is the F-score?

The harmonic mean of precision and recall.

Describe the central limit theorem. As a data scientist, state a few scenarios where you cannot use the CLT.

If X1, X2, X3, ... Xn are independent, identically distributed random variables drawn from a distribution with mean mu and finite variance sigma^2, then when the sample size is sufficiently large the sample mean X_bar is approximately normally distributed with mean mu and standard deviation sigma/sqrt(n).

Scenarios where you cannot use the CLT:
- Sampling is not random (not iid)
- The underlying distribution does not have a finite mean / variance
- The statistic is not the sum / mean (e.g. median, variance ...)

What is out-of-bag (OOB) error?

It is a measure for estimating the prediction error. When bootstrapping, approximately 1/3 (1/e, about 37%) of the data isn't used to fit a given tree. This data can be used to evaluate the tree. The out-of-bag (OOB) error is the average error for each (x_i, y_i), calculated using predictions from the trees that do not contain (x_i, y_i) in their respective bootstrap sample.
- Find all models (or trees, in the case of a random forest) that were not trained on the OOB instance.
- Take the majority vote of these models' results for the OOB instance and compare it to the true value of the OOB instance.
- Compile the OOB error for all instances in the OOB dataset.

Out-of-bag error and cross-validation (CV) are different methods of estimating the error of a machine learning model. Over many iterations, the two methods should produce very similar error estimates. That is, once the OOB error stabilizes, it will converge to the cross-validation (specifically leave-one-out cross-validation) error. The advantage of the OOB method is that it requires less computation and allows one to test the model as it is being trained.
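
The "about 1/3" figure comes from the chance that a given row is never picked in n draws with replacement, (1 - 1/n)^n, which approaches 1/e as n grows. A quick check (the function name is illustrative):

```python
import math

def expected_oob_fraction(n):
    """Chance that one given row is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n

frac = expected_oob_fraction(1000)
print(frac, math.exp(-1))
```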

Why is post-pruning better than max depth for avoiding over-fitting?

- Max depth (or stopping early) is too short-sighted, since a seemingly worthless split early on might be followed by a very good split later on
- Post-pruning gives the tree depth that provides the best prediction
- Post-pruning merges pairs of nodes from the lowest level of the tree and computes the test error
- However, max depth is less computationally intensive

Compute the variance of the following population: 10, 12, 34, 24.

Mean = 1/4 (10 + 12 + 34 + 24) = 20. Variance = 1/4 [(10-20)^2 + (12-20)^2 + (34-20)^2 + (24-20)^2] = 1/4 (100 + 64 + 196 + 16) = 94
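
The same numbers can be checked with the standard library. Since the question calls the values a population, the divisor is n; the sample variance (divisor n - 1) is shown only for contrast.

```python
import statistics

population = [10, 12, 34, 24]
mean = statistics.mean(population)
pop_variance = statistics.pvariance(population)    # divides by n
sample_variance = statistics.variance(population)  # divides by n - 1
print(mean, pop_variance, sample_variance)
```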

What is the proximity matrix and how is it computed?

The proximity matrix is an N * N matrix describing how close each pair of data points is to each other.
- For each tree, run each data point down the tree and record its terminal node
- If 2 data points end up in the same terminal node, add one to their proximity
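
The counting step can be sketched with NumPy. The input is assumed to be an array of terminal-node indices with one row per data point and one column per tree (for instance, the kind of array scikit-learn's `apply` method returns for a fitted forest); the function name and the tiny example are hypothetical.

```python
import numpy as np

def proximity_counts(leaves):
    """Count, for every pair of points, the trees where they share a leaf."""
    leaves = np.asarray(leaves)
    # Compare every point with every other point, tree by tree
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.sum(axis=2)

# 3 data points, 2 trees: entry [i][t] is the leaf of point i in tree t
leaves = [[1, 2],
          [1, 3],
          [4, 3]]
prox = proximity_counts(leaves)
print(prox)
```

Dividing the counts by the number of trees would give proximities between 0 and 1, which is a common normalization.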

What are the differences between a precision-recall curve and a ROC curve?

A ROC curve is FPR vs TPR: for each threshold it plots the point (FPR, TPR). The ROC curve represents a more balanced perspective of the prediction performance on both the positive and negative classes. A precision-recall curve plots the point (recall, precision) for each threshold. It places more emphasis on predicting the positive class correctly while keeping the number of false positives among the predicted positives low.

What is the difference between using the mean vs the median as a centrality measure?

The mean is more sensitive to outliers. If the distribution is skewed the median will likely be a better measure of centrality.

What is the interpretation of the p-value of 𝛽 ? What is the null and alternative hypothesis?

The null hypothesis is B = 0. The alternative hypothesis is B != 0. The p-value gives the probability of getting a B estimate as extreme or more extreme than what you observed, given the null hypothesis that B = 0. If the p-value is less than the significance level then you can reject the null hypothesis that B = 0.

Explain what a p-value is ?

The probability of observing something at least as extreme as what you observed **given the null hypothesis is true**.

Explain what the standard error of mean is.

The standard error is the standard deviation of the sampling distribution of the mean. If a statistically independent sample of n observations is taken from a statistical population with a standard deviation of sigma, then the mean value calculated from the sample will have an associated standard error of the mean of sigma/sqrt(n). The standard error of the mean is a measure of the dispersion of sample means around the population mean. That is, the sample mean is less variable than an individual observation X, because as the sample size increases, sample means cluster more closely around the population mean. The smaller the sd of X or the larger the sample size, the smaller the standard error is.

Why does a random forest perform better than a decision tree?

There are two reasons.
- Using bootstrap aggregation of trees, you can average the results to reduce variance.
- By randomly restricting the feature options at each split, you can decorrelate the trees in your random forest.

Bias stays low because the trees are grown to be large. Aggregating decorrelated trees reduces variance.

What is the cost function for linear regression ?

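The mean squared error (MSE): the average of (y_i - predicted y_i)^2 over all data points. Fitting the model means choosing the coefficients that minimize the MSE.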

