Capital One Final Round


Explain how to handle missing values

- Delete rows with missing data. Pro: keeps the model robust; con: data loss, so avoid this when there are a lot of nulls.
- For numerical values: impute by filling in the mean or median. Pro: easy to implement and prevents data loss; con: risk of data leakage (compute the statistic on the training split only).
- For categorical values: impute with the mode, or with a new "unknown" category instead of the mode when the number of nulls is large. Pro: easy to implement and prevents data loss; con: data leakage, or more categories hurting model performance because of sparse data.
- For temporal values: fill in the most recent value (last observation carried forward) or interpolate, e.g. a value between the most recent observations before and after the gap.
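A minimal pandas sketch of the numerical and categorical cases (the column names `age` and `city` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, None, 40.0],    # numerical column with a missing value
    "city": ["NY", None, "LA"],   # categorical column with a missing value
})

# Numerical: fill with the median (in practice, compute the median on the
# training split only to avoid leaking test-set information)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical: fill with an explicit "unknown" category
df["city"] = df["city"].fillna("unknown")

# Temporal: on a time-ordered series you could instead forward-fill (LOCF)
# with .ffill() or use .interpolate()
```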

How do you detect multicollinearity?

- High standard errors on the regression coefficients, or large changes in the coefficients across different samples of the data.
- Pairwise correlation matrix between all pairs of features, which can show that two columns are related (same idea as a scatter plot). Only reveals bivariate relationships.
- VIF (variance inflation factor): VIF_j = 1/(1 - Rj^2), where Rj^2 comes from the auxiliary regression of variable j on the other predictors. When the VIF is greater than 5 or 10, the variable is associated with multicollinearity.
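The VIF can be computed directly from the auxiliary-regression definition; a NumPy sketch (the collinear test data is made up for illustration):

```python
import numpy as np

def vif(X, j):
    # Auxiliary regression: column j on all other columns (plus intercept)
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x3 = rng.normal(size=200)
x2 = 2 * x1 + 0.1 * rng.normal(size=200)  # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
# vif(X, 1) is large (x2 is collinear); vif(X, 2) is near 1 (x3 is independent)
```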

If you have 70 red marbles, and the ratio of green to red marbles is 2 to 7, how many green marbles are there?

20

How to handle different levels of categories between train and test sets?

After one-hot encoding, drop columns from the test set that do not appear in the train set, and add all-zero columns to the test set for categories that appear only in train.
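With pandas, `reindex` does both steps at once (the `color` data is made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "yellow"]})  # "yellow" is unseen in train

train_ohe = pd.get_dummies(train)
# reindex drops test-only columns and adds all-zero columns
# for train-only categories
test_ohe = pd.get_dummies(test).reindex(columns=train_ohe.columns, fill_value=0)
```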

What's the difference between a random forest and a gradient boosted tree?

Both are tree-based ensemble algorithms. Random forest is a bagging algorithm and gradient boosted trees are a boosting algorithm. Random forest works by taking random samples of the data and random samples of the features, fitting a decision tree on each subset, and repeating this with many random samples/trees. It then averages (regression) or takes the majority vote (classification) over the trees' predictions. Gradient boosted trees work by starting from an initial prediction and fitting sequential decision trees that predict the error of the ensemble so far; the initial prediction and the predicted errors from the sequential trees are added to reach the final prediction. In a random forest each tree is trained independently (so training parallelizes), while in gradient boosting each tree is trained sequentially to improve on the previous trees.
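The residual-fitting idea behind boosting can be sketched with one-split "stump" learners; this is a toy illustration of the mechanism, not how production GBT libraries are implemented:

```python
import numpy as np

def fit_stump(x, y):
    # Best single-split regressor: returns (threshold, left_mean, right_mean)
    best = None
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((y - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    t, lo, hi = stump
    return np.where(x <= t, lo, hi)

def boost(x, y, n_rounds=50, lr=0.1):
    # Start from the mean, then repeatedly fit stumps to the residuals
    # and add a damped version of each stump's prediction
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_rounds):
        resid = y - pred
        stump = fit_stump(x, resid)
        pred += lr * predict_stump(stump, x)
    return pred

x = np.arange(20.0)
y = np.where(x > 10, 4.0, 1.0)  # step function the ensemble should learn
pred = boost(x, y)
```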

How would you build a model to predict credit card fraud?

Data preprocessing, feature engineering, then anomaly detection (if unlabeled) or classification with a lot of emphasis on resampling techniques for the class imbalance (if labeled)

Are false positives or false negatives more important?

It depends on the business situation. For example, when detecting credit card fraud, a false positive would be declaring a transaction fraudulent when it was truly benign and a false negative would be declaring a transaction benign when it is truly fraudulent. In this case, a false negative is more important since the cost of not detecting fraud is higher than the cost of incorrectly flagging a transaction. Perhaps for a medical classification, a false positive for having a disease may be more important than a false negative if the cost of incorrectly treating the patient is more than the cost of not treating the patient if it is a dangerous treatment.

How would you explain the multinomial distribution and write python code on a whiteboard to represent this distribution

It is a generalization of the binomial distribution (the distribution of success/failure counts) to a finite number k of mutually exclusive class outcomes whose probabilities p1, ..., pk sum to 1. P(X1=x1, ..., Xk=xk) = (n! / (x1! * ... * xk!)) * p1^x1 * ... * pk^xk, where n = x1 + ... + xk. This translates to: the probability of seeing x1 instances of class 1, x2 instances of class 2, and so on, equals the probability of drawing one particular sequence with those counts times the number of ways to rearrange that sequence. Whiteboard outline of the probability mass function: def prob_class_counts(num_class_a=0, num_class_b=0, num_class_c=0): sum the counts to get n; divide n! by the factorial of each count; multiply by constants for each class's true probability raised to its count.
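A runnable generalization of that whiteboard sketch, taking the counts and class probabilities as lists:

```python
import math

def multinomial_pmf(counts, probs):
    # P(X1=x1,...,Xk=xk) = n!/(x1!*...*xk!) * p1^x1 * ... * pk^xk
    n = sum(counts)
    coef = math.factorial(n)
    for x in counts:
        coef //= math.factorial(x)   # exact: each factorial divides evenly
    p = 1.0
    for x, pr in zip(counts, probs):
        p *= pr ** x
    return coef * p
```

With k = 2 this reduces to the binomial PMF, which is a quick sanity check.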

What does regularization do?

It penalizes large coefficients (extra features) in a model so that you rely on the features with the greatest predictive power. L1 regularization (LASSO regression) can set coefficients exactly to 0, effectively selecting features; L2 regularization (ridge regression) shrinks them toward 0 without zeroing them out.
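The L2 shrinkage effect can be seen with the ridge closed-form solution; a NumPy sketch with made-up data:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=100)

w_small = ridge(X, y, lam=0.1)
w_large = ridge(X, y, lam=100.0)
# A larger penalty shrinks the coefficient vector toward zero
```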

If you're attempting to predict a customer's gender, and you only have 100 data points, what problems could arise?

Overfitting

Explain the bias-variance tradeoff

The bias-variance tradeoff relates to model performance. A natural way to evaluate models is mean squared error, the expected squared difference between predicted and actual values. This expectation can be rewritten as bias squared plus variance (plus irreducible error from noise in the data). Since a good model minimizes mean squared error (predicts values close to the actual values), a good model minimizes both bias and variance. However, bias and variance can be at odds with each other, because optimizing one tends to worsen the other. Driving bias down produces an overfit model, one that has memorized the training dataset and closely predicts the training values. An overfit model, however, is likely to have high variance, since with different training data it would produce very different predictions. Likewise, driving variance down can produce an underfit model: a simple model whose predictions are consistent but not sophisticated enough to predict values well. Such a model has low variance, because a simple model will not change much with different data, but high bias, since it is not sophisticated enough to predict values close to the actuals. An ideal model balances this tradeoff so that neither bias nor variance can be improved without compromising the other (the model is neither overfit nor underfit).
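The tradeoff can be illustrated by fitting polynomials of different degree to noisy data; a sketch with made-up data (a more flexible model always fits the training set at least as closely, i.e. lower bias on train, at the cost of higher variance on new samples):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)  # noisy sine wave

def train_mse(degree):
    # Fit a polynomial of the given degree and measure training error
    coefs = np.polyfit(x, y, degree)
    pred = np.polyval(coefs, x)
    return np.mean((pred - y) ** 2)

# train_mse(1) is the underfit (high-bias) line;
# train_mse(9) fits the training noise much more closely (high variance)
```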

How will you set the threshold for credit card fraud detection model?

Use a ROC curve and choose the threshold p at the elbow. A ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis) at different values of p, the classification threshold. For fraud specifically, the relative business cost of false positives versus false negatives should also inform where on the curve to operate.
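One (FPR, TPR) point of the curve per candidate threshold can be computed directly from the confusion-matrix counts; a NumPy sketch with made-up scores:

```python
import numpy as np

def roc_point(y_true, scores, threshold):
    # One (FPR, TPR) point on the ROC curve for a given threshold
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    tn = np.sum(~pred & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr = roc_point(y_true, scores, 0.5)
```

Sweeping the threshold over a grid and collecting these points traces out the full curve.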

Given a die, would it be more likely to get a single 6 in six rolls, at least two 6s in twelve rolls, or at least one-hundred 6s in six-hundred rolls?

Each scenario is binomial with p = 1/6: compare P(X >= 1) for n = 6, P(X >= 2) for n = 12, and P(X >= 100) for n = 600. Working it out, P(X >= 1) = 1 - (5/6)^6 ≈ 0.665 is the largest, so a single 6 in six rolls is most likely; the probabilities decrease toward 1/2 as n grows.
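Working it out in code with the binomial tail probability:

```python
from math import comb

def p_at_least(k, n, p=1/6):
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p1 = p_at_least(1, 6)      # at least one 6 in six rolls
p2 = p_at_least(2, 12)     # at least two 6s in twelve rolls
p3 = p_at_least(100, 600)  # at least one hundred 6s in six hundred rolls
```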

What is R^2?

Coefficient of determination: the percent of variation in y that is accounted for by the regression on x. It is a measure of how closely the data fit the regression line. E.g., R^2 = 0.88 for gpa = m*study_time + b means 88% of the variation in GPA is accounted for by its regression on study time. The equation is 1 - (residual sum of squares / total sum of squares), where the residual sum of squares is the sum of squared differences between predicted and actual values, and the total sum of squares is the sum of squared differences between y and y-bar. Not to be confused with the correlation r, which is the strength of a linear relationship between two quantitative variables.
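That equation translates directly into code:

```python
def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_residual / SS_total
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot
```

A perfect fit gives 1.0, and always predicting the mean gives 0.0.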

Describe the modeling process from the beginning

Data cleaning, feature engineering, EDA, split into train and test, model, evaluate

How would you derive new features from features that already exist?

Depends on the business situation. You can create categorical and numerical columns from time columns (month, day of week, days between dates, etc.), or combine multiple columns into a single column.
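A pandas sketch of the date-derived features (the `signup`/`last_purchase` columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"signup": ["2023-01-15", "2023-03-02"],
                   "last_purchase": ["2023-02-01", "2023-03-10"]})
df["signup"] = pd.to_datetime(df["signup"])
df["last_purchase"] = pd.to_datetime(df["last_purchase"])

# Derive categorical and numerical features from the dates
df["signup_month"] = df["signup"].dt.month
df["signup_dayofweek"] = df["signup"].dt.dayofweek
df["days_to_purchase"] = (df["last_purchase"] - df["signup"]).dt.days
```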

How do you join two data sets?

A foreign key is the set of columns in one table that references the primary key of another table; it links the two datasets, and you join on it.
- left join: keep all rows of the first table, adding matching columns from the second
- right join: keep all rows of the second table, adding matching columns from the first
- inner join: keep only rows whose keys match in both tables
- outer (full) join: keep all rows from both tables, filling non-matches with nulls
- cross join: Cartesian product, pairing every row of the first table with every row of the second
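A pandas sketch of the row counts each join produces (the `customers`/`orders` tables are made up for illustration; `cust_id` plays the role of the foreign key):

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["a", "b", "c"]})
orders = pd.DataFrame({"cust_id": [2, 3, 4], "amount": [10, 20, 30]})

inner = customers.merge(orders, on="cust_id", how="inner")  # matching keys only
left = customers.merge(orders, on="cust_id", how="left")    # all customers
outer = customers.merge(orders, on="cust_id", how="outer")  # all rows, NaN-filled
```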

What would the distribution of daily commutes in New York City look like?

multimodal distribution

Suppose you were given two years of transaction history. What features would you use to predict credit risk?

transaction amounts, frequency of transactions, remaining balances, defaults, transaction types

