Machine Learning & Mathematics
44. How do we multiply matrices?
-A∈Rn×mA∈Rn×m and B∈Rm×pB∈Rm×p -Each entry: ABij=∑mk=1AikBkj
26. What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
-Effect would be similar to regularization: avoid overfitting -Used to increase robustness
11. Explain what a false positive and a false negative are. Why is it important these from each other? Provide examples when false positives are more important than false negatives, false negatives are more important than false positives and when these two types of errors are equally important
-False positive Improperly reporting the presence of a condition when it's not in reality. Example: HIV positive test when the patient is actually HIV negative -False negative Improperly reporting the absence of a condition when in reality it's the case. Example: not detecting a disease when the patient has this disease. When false positives are more important than false negatives: - In a non-contagious disease, where treatment delay doesn't have any long-term consequences but the treatment itself is grueling - HIV test: psychological impact When false negatives are more important than false positives: - If early treatment is important for good outcomes - In quality control: a defective item passes through the cracks! - Software testing: a test to catch a virus has failed
39. Which kernels do you know? How to choose a kernel?
-Gaussian kernel -Linear kernel -Polynomial kernel -Laplace kernel -Esoteric kernels: string kernels, chi-square kernels -If number of features is large (relative to number of observations): SVM with linear kernel ; e.g. text classification with lots of words, small training example -If number of features is small, number of observations is intermediate: Gaussian kernel -If number of features is small, number of observations is small: linear kernel
20. What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?
-Good data is definitely more important than good models -If quality of the data wasn't of importance, organizations wouldn't spend so much time cleaning and preprocessing it! -Even for scientific purpose: good data (reflected by the design of experiments) is very important How do you define good? - good data: data relevant regarding the project/task to be handled - good model: model relevant regarding the project/task - good model: a model that generalizes on external data sets Is there a universal good model? - No, otherwise there wouldn't be the overfitting problem! - Algorithm can be universal but not the model - Model built on a specific data set in a specific organization could be ineffective in other data set of the same organization - Models have to be updated on a somewhat regular basis Are there any models that are definitely not so good? - "all models are wrong but some are useful" George E.P. Box - It depends on what you want: predictive models or explanatory power - If both are bad: bad model
15. When would you use random forests Vs SVM and why?
-In a case of a multi-class classification problem: SVM will require one-against-all method (memory intensive) -If one needs to know the variable importance (random forests can perform it as well) -If one needs to get a model fast (SVM is long to tune, need to choose the appropriate kernel and its parameters, for instance sigma and epsilon) -In a semi-supervised learning context (random forest and dissimilarity measure): SVM can work only in a supervised learning mode
29. Do we always need the intercept term in a regression model?
-It guarantees that the residuals have a zero mean -It guarantees the least squares slopes estimates are unbiased -the regression line floats up and down, by adjusting the constant, to a point where the mean of the residuals is zero
37. What is the maximal margin classifier? How this margin can be achieved?
-When the data can be perfectly separated using a hyperplane, there actually exists an infinite number of these hyperplanes -Intuition: a hyperplane can usually be shifted a tiny bit up, or down, or rotated, without coming into contact with any of the observations -Large margin classifier: choosing the hyperplance that is farthest from the training observations -This margin can be achievec using support vectors
40. Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
-When the number of features is large comparing to the number of observations (e.g. document-term matrix) -SVM will perform better in this reduced space
23. Do you think 50 small decision trees are better than a large one? Why?
-Yes! -More robust model (ensemble of weak learners that come and make a strong learner) -Better to improve a model by taking many small steps than fewer large steps -If one tree is erroneous, it can be auto-corrected by the following -Less prone to overfitting
14. What are feature vectors?
-n-dimensional vector of numerical features that represent some object -term occurrences frequencies, pixels of an image etc. -Feature space: vector space associated with these vectors
9. Explain what resampling methods are and why they are useful
-repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model -example: repeatedly draw different samples from training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fit differ -most common are: cross-validation and the bootstrap -cross-validation: random sampling with no replacement -bootstrap: random sampling with replacement -cross-validation: evaluating model performance, model selection (select the appropriate level of flexibility) -bootstrap: mostly used to quantify the uncertainty associated with a given estimator or statistical learning method
30. What are the assumptions required for linear regression? What if some of these assumptions are violated?
1. The data used in fitting the model is representative of the population 2. The true underlying relation between xx and yy is linear 3. Variance of the residuals is constant (homoscedastic, not heteroscedastic) 4. The residuals are independent 5. The residuals are normally distributed -Predict yy from xx: 1) + 2) -Estimate the standard error of predictors: 1) + 2) + 3) -Get an unbiased estimation of yy from xx: 1) + 2) + 3) + 4) -Make probability statements, hypothesis testing involving slope and correlation, confidence intervals: 1) + 2) + 3) + 4) + 5) Note: - Common mythology: linear regression doesn't assume anything about the distributions of xx and yy - It only makes assumptions about the distribution of the residuals - And this is only needed for statistical tests to be valid - Regression can be applied to many purposes, even if the errors are not normally distributed
50. What the Newton's method is?
No answer given - colin
13. What does NLP stand for?
"Natural language processing"! -Interaction with human (natural) and computers languages -Involves natural language understanding Major tasks: - Machine translation - Question answering: "what's the capital of Canada?" - Sentiment analysis: extract subjective information from a set of documents, identify trends or public opinions in the social media - Information retrieval
28. How would you define and measure the predictive power of a metric?
-Predictive power of a metric: the accuracy of a metric's success at predicting the empirical -They are all domain specific -Example: in field like manufacturing, failure rates of tools are easily observable. A metric can be trained and the success can be easily measured as the deviation over time from the observed -In information security: if the metric says that an attack is coming and one should do X. Did the recommendation stop the attack or the attack never happened?
42. What is curse of dimensionality? How does it affect distance and similarity measures?
-Refers to various phenomena that arise when analyzing and organizing data in high dimensional spaces -Common theme: when number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse -Issue with any method that requires statistical significance: the amount of data needed to support the result grows exponentially with the dimensionality -Issue when algorithms don't scale well on high dimensions typically when O(nkn)O(nkn) -Everything becomes far and difficult to organize Illustrative example: compare the proportion of an inscribed hypersphere with radius rr and dimension d to that of a hypercube with edges of length 2r2r - Volume of such a sphere is Vsphere=2rdπd/2dΓ(d/2)Vsphere=2rdπd/2dΓ(d/2) - The volume of the cube is: Vcube=2rdVcube=2rd As d increases (space dimension), the volume of hypersphere becomes insignificant relative to the volume of the hypercube: limd→∞VsphereVcube=πd/2d2d−1Γ(d/2)=0 limd→∞VsphereVcube=πd/2d2d−1Γ(d/2)=0 - Nearly all of the dimensional space is far away from the center - It consists almost entirely of the corners of the hypercube, with no middle!
18. How do you test whether a new credit risk scoring model works?
-Test on a holdout set -Kolmogorov-Smirnov test Kolmogorov-Smirnov test: - Non-parametric test - Compare a sample with a reference probability distribution or compare two samples - Quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution - Or between the empirical distribution functions of two samples - Null hypothesis (two-samples test): samples are drawn from the same distribution - Can be modified as a goodness of fit test - In our case: cumulative percentages of good, cumulative percentages of bad
24. Why is mean square error a bad measure of model performance? What would you suggest instead?
-see question 3 about metrics in regression -It puts too much emphasis on large deviations (squared) -Alternative: mean absolute deviation
43. What is Ax=bAx=b? How to solve it?
-A matrix equation/a system of linear equations -calculate the inverse of AA (if non singular) -can be done using Gaussian elimination
5. Explain what a local optimum is and why it is important in a specific context, such as K-means clustering. What are specific ways of determining if you have a local optimum problem? What can be done to avoid local optima?
-A solution that is optimal in within a neighboring set of candidate solutions -In contrast with global optimum: the optimal solution among all others -K-means clustering context: It's proven that the objective cost function will always decrease until a local optimum is reached. Results will depend on the initial random cluster assignment -Determining if you have a local optimum problem: Tendency of premature convergence Different initialization induces different optima -Avoid local optima in a K-means context: repeat K-means and take the solution that has the lowest cost
22. What are the drawbacks of linear model? Are you familiar with alternatives (Lasso, ridge regression)?
-Assumption of linearity of the errors -Can't be used for count outcomes, binary outcomes -Can't vary model flexibility: overfitting problems -Alternatives: see question 4 about regularization
17. How do you know if one algorithm is better than other?
-In terms of performance on a given data set? -In terms of performance on several data sets? -In terms of efficiency? In terms of performance on several data sets: - "Does learning algorithm A have a higher chance of producing a better predictor than learning algorithm B in the given context?" - "Bayesian Comparison of Machine Learning Algorithms on Single and Multiple Datasets", A. Lacoste and F. Laviolette - "Statistical Comparisons of Classifiers over Multiple Data Sets", Janez Demsar In terms of performance on a given data set: - One wants to choose between two learning algorithms - Need to compare their performances and assess the statistical significance One approach (Not preferred in the literature): - Multiple k-fold cross validation: run CV multiple times and take the mean and sd - You have: algorithm A (mean and sd) and algorithm B (mean and sd) - Is the difference meaningful? (Paired t-test) Sign-test (classification context): -Simply counts the number of times A has a better metrics than B and assumes this comes from a binomial distribution. Then we can obtain a p-value of the HoHo test: A and B are equal in terms of performance. Wilcoxon signed rank test (classification context): -Like the sign-test, but the wins (A is better than B) are weighted and assumed coming from a symmetric distribution around a common median. Then, we obtain a p-value of the HoHo test. Other (without hypothesis testing): - AUC - F-Score - See question 3
8. What is latent semantic indexing? What is it used for? What are the specific limitations of the method?
-Indexing and retrieval method that uses singular value decomposition to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text -Based on the principle that words that are used in the same contexts tend to have similar meanings -"Latent": semantic associations between words is present not explicitly but only latently -For example: two synonyms may never occur in the same passage but should nonetheless have highly associated representations Used for: -Learning correct word meanings -Subject matter comprehension -Information retrieval -Sentiment analysis (social network analysis) Here's a great tutorial on it.
21. Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
-Naïve: the features are assumed independent/uncorrelated -Assumption not feasible in many cases -Improvement: decorrelate features (covariance matrix into identity matrix)
12. What is the difference between supervised learning and unsupervised learning? Give concrete examples
-Supervised learning: inferring a function from labeled training data -Supervised learning: predictor measurements associated with a response measurement; we wish to fit a model that relates both for better understanding the relation between them (inference) or with the aim to accurately predicting the response for future observations (prediction) -Supervised learning: support vector machines, neural networks, linear regression, logistic regression, extreme gradient boosting -Supervised learning examples: predict the price of a house based on the are, size.; churn prediction; predict the relevance of search engine results. -Unsupervised learning: inferring a function to describe hidden structure of unlabeled data -Unsupervised learning: we lack a response variable that can supervise our analysis -Unsupervised learning: clustering, principal component analysis, singular value decomposition; identify group of customers -Unsupervised learning examples: find customer segments; image segmentation; classify US senators by their voting.
2. Is it better to design robust or accurate algorithms?
-The ultimate goal is to design systems with good generalization capacity, that is, systems that correctly identify patterns in data instances not seen before -The generalization performance of a learning system strongly depends on the complexity of the model assumed -If the model is too simple, the system can only capture the actual data regularities in a rough manner. In this case, the system has poor generalization properties and is said to suffer from underfitting -By contrast, when the model is too complex, the system can identify accidental patterns in the training data that need not be present in the test set. These spurious patterns can be the result of random fluctuations or of measurement errors during the data collection process. In this case, the generalization capacity of the learning system is also poor. The learning system is said to be affected by overfitting -Spurious patterns, which are only present by accident in the data, tend to have complex forms. This is the idea behind the principle of Occam's razor for avoiding overfitting: simpler models are preferred if more complex models do not significantly improve the quality of the description for the observations -Quick response: Occam's Razor. It depends on the learning task. Choose the right balance -Ensemble learning can help balancing bias/variance (several weak learners together = strong learner)
51. Imagine you have N pieces of rope in a bucket. You reach in and grab one end-piece, then reach in and grab another end-piece, and tie those two together. What is the expected value of the number of loops in the bucket?
-There are nn entirely unattached pieces of rope in a bucket -A loop: any number of rope attached in a closed chain -Suppose the expected number of loops for n−1n−1 pieces of rope is denoted Ln−1Ln−1 -Consider the bucket of nn pieces of rope; there are 2n2n rope ends Pick an end of rope. Of the remaining 2n−12n−1 ends of rope, only one end creates a loop (the other end of the same piece of rope). There are then n−1n−1 untied pieces of rope. The rest of the time, two separates pieces of rope are tied together and there are effectively n−1n−1 untied pieces of rope. The recurrence is therefore: -Ln=12n−1+Ln−1Ln=12n−1+Ln−1 Clearly, L1=1L1=1 so: -Ln=∑nk=112k−1=H2n−Hn2Ln=∑k=1n12k−1=H2n−Hn2 -Where HkHk is the kthkth harmonic number Since Hk≐γ+lnkHk≐γ+lnk for large-ish k, where gamma=0.57722. is the Euler-Mascheroni constant, we have: -Ln≐ln(2n)−ln(n)2=ln2n−−√
3. How to define/select metrics?
-Type of task: regression? Classification? -Business goal? -What is the distribution of the target variable? -What metric do we optimize for? -Regression: RMSE (root mean squared error), MAE (mean absolute error), WMAE(weighted mean absolute error), RMSLE (root mean squared logarithmic error)... -Classification: recall, AUC, accuracy, misclassification error, Cohen's Kappa... Common metrics in regression: -Mean Squared Error Vs Mean Absolute Error RMSE gives a relatively high weight to large errors. The RMSE is most useful when large errors are particularly undesirable. The MAE is a linear score: all the individual differences are weighted equally in the average. MAE is more robust to outliers than MSE. RMSE=1n∑ni=1(yi−y^i)2−−−−−−−−−−−−−−√RMSE=1n∑i=1n(yi−y^i)2 MAE=1n∑ni=1|yi−y^i|MAE=1n∑i=1n|yi−y^i| -Root Mean Squared Logarithmic Error RMSLE penalizes an under-predicted estimate greater than an over-predicted estimate (opposite to RMSE) RMSLE=1n∑ni=1(log(pi+1)−log(ai+1))2−−−−−−−−−−−−−−−−−−−−−−−−−−−√RMSLE=1n∑i=1n(log(pi+1)−log(ai+1))2 Where pipi is the ith prediction, aiai the ith actual response, log(b)log(b) the natural logarithm of bb. -Weighted Mean Absolute Error The weighted average of absolute errors. MAE and RMSE consider that each prediction provides equally precise information about the error variation, i.e. the standard variation of the error term is constant over all the predictions. Examples: recommender systems (differences between past and recent products) WMAE=1∑wi∑ni=1wi|yi−y^i|WMAE=1∑wi∑i=1nwi|yi−y^i| Common metrics in classification: -Recall / Sensitivity / True positive rate: High when FN low. Sensitive to unbalanced classes. Sensitivity=TPTP+FNSensitivity=TPTP+FN -Precision / Positive Predictive Value High when FP low. Sensitive to unbalanced classes. Precision=TPTP+FPPrecision=TPTP+FP -Specificity / True Negative Rate High when FP low. Sensitive to unbalanced classes. Specificity=TNTN+FPSpecificity=TNTN+FP -Accuracy High when FP and FN are low. Sensitive to unbalanced classes (see "Accuracy paradox") Accuracy=TP+TNTN+TP+FP+FNAccuracy=TP+TNTN+TP+FP+FN -ROC / AUC ROC is a graphical plot that illustrates the performance of a binary classifier (SensitivitySensitivity Vs 1−Specificity1−Specificity or SensitivitySensitivity Vs SpecificitySpecificity). They are not sensitive to unbalanced classes. AUC is the area under the ROC curve. Perfect classifier: AUC=1, fall on (0,1); 100% sensitivity (no FN) and 100% specificity (no FP) -Logarithmic loss Punishes infinitely the deviation from the true value! It's better to be somewhat wrong than emphatically wrong! logloss=−1N∑ni=1(yilog(pi)+(1−yi)log(1−pi))logloss=−1N∑i=1n(yilog(pi)+(1−yi)log(1−pi)) -Misclassification Rate Misclassification=1n∑iI(yi≠y^i)Misclassification=1n∑iI(yi≠y^i) -F1-Score Used when the target variable is unbalanced. F1Score=2Precision×RecallPrecision+Recall
4. Explain what regularization is and why it is useful. What are the benefits and drawbacks of specific methods, such as ridge regression and lasso?
-Used to prevent overfitting: improve the generalization of a model -Decreases complexity of a model -Introducing a regularization term to a general loss function: adding a term to the minimization problem -Impose Occam's Razor in the solution Ridge regression: -We use an L2L2 penalty when fitting the model using least squares -We add to the minimization problem an expression (shrinkage penalty) of the form λ×∑coefficientsλ×∑coefficients -λ: tuning parameter; controls the bias-variance tradeoff; accessed with cross-validation -A bit faster than the lasso β^ridge=argminβ{∑ni=1(yi−β0−∑pj=1xijβj)2+λ∑pj=1β2j}β^ridge=argminβ{∑i=1n(yi−β0−∑j=1pxijβj)2+λ∑j=1pβj2} The Lasso: -We use an L1L1 penalty when fitting the model using least squares -Can force regression coefficients to be exactly: feature selection method by itself β^lasso=argminβ{∑ni=1(yi−β0−∑pj=1xijβj)2+λ∑pj=1||βj||}
16. How do you take millions of users with 100's transactions each, amongst 10k's of products and group the users together in meaningful segments?
1. Some exploratory data analysis (get a first insight) -Transactions by date -Count of customers Vs number of items bought -Total items Vs total basket per customer -Total items Vs total basket per area 2. Create new features (per customer): Counts: -Total baskets (unique days) -Total items -Total spent -Unique product id Distributions: -Items per basket -Spent per basket -Product id per basket -Duration between visits -Product preferences: proportion of items per product cat per basket 3. Too many features, dimension-reduction? PCA? 4. Clustering: -PCA 5. Interpreting model fit -View the clustering by principal component axis pairs PC1 Vs PC2, PC2 Vs PC1. -Interpret each principal component regarding the linear combination it's obtained from; example: PC1=spendy axis (proportion of baskets containing spendy items, raw counts of items and visits)
33. What is a decision tree?
1. Take the entire data set as input 2. Search for a split that maximizes the "separation" of the classes. A split is any test that divides the data in two (e.g. if variable2>10) 3. Apply the split to the input data (divide step) 4. Re-apply steps 1 to 2 to the divided data 5. Stop when you meet some stopping criteria 6. (Optional) Clean up the tree when you went too far doing splits (called pruning) Finding a split: methods vary, from greedy search (e.g. C4.5) to randomly selecting attributes and split points (random forests) Purity measure: information gain, Gini coefficient, Chi Squared values Stopping criteria: methods vary from minimum size, particular confidence in prediction, purity criteria threshold Pruning: reduced error pruning, out of bag error pruning (ensemble methods)
19. What is: collaborative filtering, n-grams, cosine distance?
Collaborative filtering: - Technique used by some recommender systems - Filtering for information or patterns using techniques involving collaboration of multiple agents: viewpoints, data sources. 1. A user expresses his/her preferences by rating items (movies, CDs.) 2. The system matches this user's ratings against other users' and finds people with most similar tastes 3. With similar users, the system recommends items that the similar users have rated highly but not yet being rated by this user n-grams: - Contiguous sequence of n items from a given sequence of text or speech - "Andrew is a talented data scientist" - Bi-gram: "Andrew is", "is a", "a talented". - Tri-grams: "Andrew is a", "is a talented", "a talented data". - An n-gram model models sequences using statistical properties of n-grams; see: Shannon Game - More concisely, n-gram model: P(Xi|Xi−(n−1)...Xi−1)P(Xi|Xi−(n−1)...Xi−1): Markov model - N-gram model: each word depends only on the n−1n−1 last words Issues: - when facing infrequent n-grams - solution: smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams - Methods: Good-Turing, Backoff, Kneser-Kney smoothing Cosine distance: - How similar are two documents? - Perfect similarity/agreement: 1 - No agreement : 0 (orthogonality) - Measures the orientation, not magnitude Given two vectors A and B representing word frequencies: cosine-similarity(A,B)=⟨A,B⟩||A||⋅||B||
31. What is collinearity and what to do with it? How to remove multicollinearity?
Collinearity/Multicollinearity: - In multiple regression: when two or more variables are highly correlated - They provide redundant information - In case of perfect multicollinearity: β=(XTX)−1XTyβ=(XTX)−1XTy doesn't exist, the design matrix isn't invertible - It doesn't affect the model as a whole, doesn't bias results - The standard errors of the regression coefficients of the affected variables tend to be large - The test of hypothesis that the coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanatory (Type II error) - Leads to overfitting Remove multicollinearity: - Drop some of affected variables - Principal component regression: gives uncorrelated predictors - Combine the affected variables - Ridge regression - Partial least square regression Detection of multicollinearity: - Large changes in the individual coefficients when a predictor variable is added or deleted - Insignificant regression coefficients for the affected predictors but a rejection of the joint hypothesis that those coefficients are all zero (F-test) - VIF: the ratio of variances of the coefficient when fitting the full model divided by the variance of the coefficient when fitted on its own - rule of thumb: VIF>5VIF>5 indicates multicollinearity - Correlation matrix, but correlation is a bivariate relationship whereas multicollinearity is multivariate
25. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?
Example with linear regression: - F-statistic (ANOVA) F=RSS1−RSS2p2−p1RSS2n−p2F=RSS1−RSS2p2−p1RSS2n−p2 p1p1: number of parameters of model 1 p2p2: number of parameters of model 2 nn: number of observations Under the null hypothesis that model 2 doesn't provide a significantly better fit than model 1, FF will have an FF distribution with (p2−p1,n−p2)(p2−p1,n−p2) degrees of freedom. The null hypothesis is rejected if the FF calculated from the data is greater than the critical value of the FF distribution for some desired significance level. Others: AIC/BIC (regression), cross-validation: assessing test error on a test/validation set
34. What impurity measures do you know?
Gini -Gini=1−∑jp2jGini=1−∑jpj2 Information Gain/Deviance -InformationGain=∑jpjlog2pjInformationGain=∑jpjlog2pj -Better than Gini when pjpj are very small: multiplying very small numbers leads to rounding errors, we can instead take logs.
1. What is cross-validation? How to do it right?
It's a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. Mainly used in settings where the goal is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to define a data set to test the model in the training phase (i.e. validation data set) in order to limit problems like overfitting, and get an insight on how the model will generalize to an independent data set. Examples: leave-one-out cross validation, K-fold cross validation How to do it right? -the training and validation data sets have to be drawn from the same population -predicting stock prices: trained for a certain 5-year period, it's unrealistic to treat the subsequent 5-year a draw from the same population -common mistake: for instance the step of choosing the kernel parameters of a SVM should be cross-validated as well Bias-variance trade-off for k-fold cross validation: Leave-one-out cross-validation: gives approximately unbiased estimates of the test error since each training set contains almost the entire data set (n−1n−1 observations). But: we average the outputs of n fitted models, each of which is trained on an almost identical set of observations hence the outputs are highly correlated. Since the variance of a mean of quantities increases when correlation of these quantities increase, the test error estimate from a LOOCV has higher variance than the one obtained with k-fold cross validation Typically, we choose k=5k=5 or k=10k=10, as these values have been shown empirically to yield test error estimates that suffer neither from excessively high bias nor high variance.
38. What is a kernel? Explain the kernel trick
No answer given - colin
41. What is an Artificial Neural Network? What is back propagation?
No answer given - colin
45. What is singular value decomposition? What is an eigenvalue? And what is an eigenvector?
No answer given - colin
46. What's the relationship between PCA and SVD?
No answer given - colin
47. Can you derive the ordinary least square regression formula?
No answer given - colin
48. What is the difference between a convex function and non-convex?
No answer given - colin
49. What is gradient descent method? Will gradient descent methods always converge to the same point?
No answer given - colin
32. How to check if the regression model fits the data well?
R squared/Adjusted R squared: - R2=RSStot−RSSresRSStot=RSSregRSStot=1−RSSresRSStotR2=RSStot−RSSresRSStot=RSSregRSStot=1−RSSresRSStot - Describes the percentage of the total variation described by the model - R2R2 always increases when adding new variables: adjusted R2R2 incorporates the model's degrees of freedom F test: - Evaluate the hypothesis HoHo: all regression coefficients are equal to zero Vs H1H1: at least one doesn't - Indicates that R2R2 is reliable RMSE: - Absolute measure of fit (whereas R2R2 is a relative measure of fit)
35. What is random forest? Why is it good?
Random forest? (Intuition): - Underlying principle: several weak learners combined provide a strong learner - Builds several decision trees on bootstrapped training samples of data - On each tree, each time a split is considered, a random sample of mm predictors is chosen as split candidates, out of all pp predictors - Rule of thumb: at each split m=p-√m=p - Predictions: at the majority rule Why is it good? - Very good performance (decorrelates the features) - Can model non-linear class boundaries - Generalization error for free: no cross-validation needed, gives an unbiased estimate of the generalization error as the trees is built - Generates variable importance
7. Explain what precision and recall are. How do they relate to the ROC curve?
See question 3. "How to define/select metrics? Do you know compound metrics?". When using Precision/Recall curves. -Type of task: regression? Classification? -Business goal? -What is the distribution of the target variable? -What metric do we optimize for? -Regression: RMSE (root mean squared error), MAE (mean absolute error), WMAE(weighted mean absolute error), RMSLE (root mean squared logarithmic error)... -Classification: recall, AUC, accuracy, misclassification error, Cohen's Kappa... Common metrics in regression: -Mean Squared Error Vs Mean Absolute Error RMSE gives a relatively high weight to large errors. The RMSE is most useful when large errors are particularly undesirable. The MAE is a linear score: all the individual differences are weighted equally in the average. MAE is more robust to outliers than MSE. RMSE=1n∑ni=1(yi−y^i)2−−−−−−−−−−−−−−√RMSE=1n∑i=1n(yi−y^i)2 MAE=1n∑ni=1|yi−y^i|MAE=1n∑i=1n|yi−y^i| -Root Mean Squared Logarithmic Error RMSLE penalizes an under-predicted estimate greater than an over-predicted estimate (opposite to RMSE) RMSLE=1n∑ni=1(log(pi+1)−log(ai+1))2−−−−−−−−−−−−−−−−−−−−−−−−−−−√RMSLE=1n∑i=1n(log(pi+1)−log(ai+1))2 Where pipi is the ith prediction, aiai the ith actual response, log(b)log(b) the natural logarithm of bb. -Weighted Mean Absolute Error The weighted average of absolute errors. MAE and RMSE consider that each prediction provides equally precise information about the error variation, i.e. the standard variation of the error term is constant over all the predictions. Examples: recommender systems (differences between past and recent products) WMAE=1∑wi∑ni=1wi|yi−y^i|WMAE=1∑wi∑i=1nwi|yi−y^i| Common metrics in classification: -Recall / Sensitivity / True positive rate: High when FN low. Sensitive to unbalanced classes. Sensitivity=TPTP+FNSensitivity=TPTP+FN -Precision / Positive Predictive Value High when FP low. Sensitive to unbalanced classes. Precision=TPTP+FPPrecision=TPTP+FP -Specificity / True Negative Rate High when FP low. Sensitive to unbalanced classes. Specificity=TNTN+FPSpecificity=TNTN+FP -Accuracy High when FP and FN are low. Sensitive to unbalanced classes (see "Accuracy paradox") Accuracy=TP+TNTN+TP+FP+FNAccuracy=TP+TNTN+TP+FP+FN -ROC / AUC ROC is a graphical plot that illustrates the performance of a binary classifier (SensitivitySensitivity Vs 1−Specificity1−Specificity or SensitivitySensitivity Vs SpecificitySpecificity). They are not sensitive to unbalanced classes. AUC is the area under the ROC curve. Perfect classifier: AUC=1, fall on (0,1); 100% sensitivity (no FN) and 100% specificity (no FP) -Logarithmic loss Punishes infinitely the deviation from the true value! It's better to be somewhat wrong than emphatically wrong! logloss=−1N∑ni=1(yilog(pi)+(1−yi)log(1−pi))logloss=−1N∑i=1n(yilog(pi)+(1−yi)log(1−pi)) -Misclassification Rate Misclassification=1n∑iI(yi≠y^i)Misclassification=1n∑iI(yi≠y^i) -F1-Score Used when the target variable is unbalanced. F1Score=2Precision×RecallPrecision+Recall
10. What is principal component analysis? Explain the sort of problems you would use PCA for. Also explain its limitations as a method
Statistical method that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components. Reduce the data from nn to kk dimensions: find the kk vectors onto which to project the data so as to minimize the projection error. Algorithm: 1) Preprocessing (standardization): PCA is sensitive to the relative scaling of the original variable 2) Compute covariance matrix ΣΣ 3) Compute eigenvectors of ΣΣ 4) Choose kk principal components so as to retain xx% of the variance (typically x=99x=99) Applications: 1) Compression - Reduce disk/memory needed to store data - Speed up learning algorithm. Warning: mapping should be defined only on training set and then applied to test set 2. Visualization: 2 or 3 principal components, so as to summarize data Limitations: - PCA is not scale invariant - The directions with largest variance are assumed to be of most interest - Only considers orthogonal transformations (rotations) of the original variables - PCA is only based on the mean vector and covariance matrix. Some distributions (multivariate normal) are characterized by this but some are not - If the variables are correlated, PCA can achieve dimension reduction. If not, PCA just orders them according to their variances
6. Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model
Validation using R2R2: - % of variance retained by the model - Issue: R2R2 is always increased when adding variables - R2=RSStot−RSSresRSStot=RSSregRSStot=1−RSSresRSStotR2=RSStot−RSSresRSStot=RSSregRSStot=1−RSSresRSStot Analysis of residuals: - Heteroskedasticity (relation between the variance of the model errors and the size of an independent variable's observations) - Scatter plots residuals Vs predictors - Normality of errors - Etc. : diagnostic plots Out-of-sample evaluation: with cross-validation
27. Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with?
data reduction techniques other than PCA?: -Partial least squares: like PCR (principal component regression) but chooses the principal components in a supervised way. Gives higher weights to variables that are most strongly related to the response step-wise regression? - the choice of predictive variables are carried out using a systematic procedure - Usually, it takes the form of a sequence of F-tests, t-tests, adjusted R-squared, AIC, BIC - at any given step, the model is fit using unconstrained least squares - can get stuck in local optima - Better: Lasso step-wise techniques: - Forward-selection: begin with no variables, adding them when they improve a chosen model comparison criterion - Backward-selection: begin with all the variables, removing them when it improves a chosen model comparison criterion Better than reduced data: -Example 1: If all the components have a high variance: which components to discard with a guarantee that there will be no significant loss of the information? -Example 2 (classification): - One has 2 classes; the within class variance is very high as compared to between class variance - PCA might discard the very information that separates the two classes Better than a sample: - When number of variables is high relative to the number of observations
36. How do we train a logistic regression model? How do we interpret its coefficients?
log(odds)=log(P(y=1|x)P(y=0|x))=log(odds)=log(P(y=1|x)P(y=0|x))= is a linear function of the input features Minimization objective/Cost function: -J(β)=−1m∑mi=1yilog(hβ(xi))+(1−yi)log(1−hβ(xi))J(β)=−1m∑i=1myilog(hβ(xi))+(1−yi)log(1−hβ(xi)) -Where: hβ(x)=g(βTx)hβ(x)=g(βTx) and g(z)=11+e−zg(z)=11+e−z (sigmoid function) -Intuition: - if yi=0yi=0, J(β)=log(1−hβ(x)i)J(β)=log(1−hβ(x)i), will converge to ∞∞ as hβ(x)ihβ(x)i becomes far from 0 - Converse: when yi=1yi=1, J(β)=log(hβ(x)i)J(β)=log(hβ(x)i), will converge to ∞∞ as hβ(x)ihβ(x)i becomes far from 1 Interpretation of the coefficients: the increase of logoddslogodds for the increase of one unit of a predictor, given all the other predictors are fixed.