BA 476 midterm 1
Why does lasso lead to sparse models?
Suppose you have two predictors with estimates β1 = 1, β2 = 10, and someone forces you to decrease the total size of the coefficients by 1. Assume MSE is affected the same way regardless of which coefficient you decrease. Under the ridge penalty, shrinking β2 from 10 to 9 cuts the penalty by 10² − 9² = 19, while shrinking β1 from 1 to 0 only cuts it by 1² − 0² = 1, so ridge shrinks larger coefficients before smaller ones and small coefficients stick around. Under the lasso penalty the saving is 1 either way, so there is no reason to keep a small coefficient and it can be driven exactly to zero.
Linear Regression; Measuring accuracy of a model
* Outliers have a large impact on residuals. - One measure of success is the average squared residual (MSE -> average prediction error). - We square so that positive and negative residuals don't cancel out when we sum them, and so the total reflects the true error of the whole model. * MSE = Variance + Bias^2 + Irreducible Error
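A minimal sketch of computing MSE from residuals (made-up numbers, not course data):

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

residuals = y_true - y_pred      # can be positive or negative
mse = np.mean(residuals ** 2)    # squaring removes the sign before averaging
print(mse)
```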
Why does lasso lead to sparse models? (2)
*extra info: 'elastic net" combines ridge and lasso because lasso can be unstable with highly correlated variables. So "elastic net" is the best of both: gets stabilkity from ridge and sparsity from lasso.
Overfitting and Underfitting General
- (Process 2) Random model: MSE_train = MSE_test, both large. - (Process 1) Linear regression model: MSE_train < MSE_test. Underfitting: both MSE_train and MSE_test are large (the model didn't learn anything). Overfitting: MSE_train is small and MSE_test is large (the model learnt noise). Generalizable models: MSE_train ≈ MSE_test. Ultimate goal: low MSE_test.
Navigating the Bias-Variance Trade off -> Can we do anything other than manual variable selection?
- So far we have been fitting least squares, which chooses coefficients to minimize the loss function on the training set. But what if we change the loss function? Instead of plain MSE we can use ridge regression or the lasso (a newer method).
More graph examples of bias variance tradeoff
- Look over lecture 3 for more graph examples.
LASSO (Least Absolute Shrinkage and Selection Operator)
- The Lasso is a recent advance in statistical learning. • Select model parameters to minimize MSE + α·∑|β|. - The penalty term now uses the absolute values of the coefficients instead of their squares (L1 norm instead of L2). • Like ridge, lasso shrinks coefficients towards zero. • Unlike ridge, the lasso sets some coefficients to zero exactly! • Yields sparse, parsimonious models and does variable selection.
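A small sketch (synthetic data, arbitrary alpha) showing that lasso zeroes some coefficients while ridge only shrinks them:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5, random_state=1)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso sets many coefficients exactly to zero; ridge almost never does
print("Lasso zero coefficients:", (lasso.coef_ == 0).sum())
print("Ridge zero coefficients:", (ridge.coef_ == 0).sum())
```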
When can you use ML?
- There is some pattern - the pattern is hard to pin down mathematically. - We have data on it. - It is a program that improves with experience.
Train-Test Paradigm
- We care about out-of-sample (generalization) accuracy. • Keep some data separate and treat it as unknown. • Compute MSE on the test/holdout data (average squared residuals). • From the perspective of the algorithm, future data ≈ test data. • Avoids overfitting to the training data. • Potential problems with using a test set: we train on less data, and the test set must be representative.
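A minimal sketch of the train-test paradigm with scikit-learn (illustrative synthetic data, 20% holdout):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=20, random_state=0)

# Keep 20% aside and treat it as unknown future data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```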
Linear Regression (in 1-D) Single predictor
Goal: choose a 'line of best fit'. More formally: estimate parameters (β0, β1) based on (X, Y). - Given parameter estimates β0, β1 and any new X': Y_hat = β0 + β1·X'. When we train a model we come up with estimates for the betas. Residual: r_i = y_i − y_hat_i, i.e. how much we are off on instance i. If the residual is negative we overpredicted; if positive, we underpredicted.
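A tiny sketch (made-up numbers) of estimating β0, β1 in one dimension and reading off the residuals:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

beta1, beta0 = np.polyfit(x, y, deg=1)  # least-squares slope and intercept
y_hat = beta0 + beta1 * x
residuals = y - y_hat                   # negative = overpredicted, positive = underpredicted
print(beta0, beta1, residuals)
```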
How do we do Linear Regression?
Goal: we want the 'best' line. - Set up an objective function that measures how close our predictions are to the observed points. - The best line minimizes the objective function (a loss between the true labels and the predictions, e.g. the squared residuals). - Assumption: the relationship between the predictors and the outcome is linear.
Regularization
Key idea in modern analytics. • Automatically constrain the model to a subset of variables. • To do so we minimize a different objective function than MSE. • Why would minimizing something other than MSE (on the training data) work? We will still evaluate our predictions using MSE on the test data, after all. • Essentially, we are trading off some bias for some variance: hopefully we sacrifice a bit in terms of bias to gain a lot in terms of variance. * As alpha increases, the penalty gets larger and the size of the coefficients goes down.
Tuning hyperparameters
Many machine learning models have tunable parameters for which we need to select a value. • How should we choose α in Lasso/Ridge? (The method below is general, not specific to them.) -> Choosing the right α is the difference between glory and disaster. -> Recall: higher α = smaller coefficients β, higher bias + lower variance.
Choosing subsets of predictors
Objective: balance model complexity and prediction accuracy. How about this: for every possible combination of predictors, fit a linear regression, compute the test MSE, and pick the best model. Would this work? Is it practical? There are C(p, k) ways to choose k predictors from p (~155 million for p = 30, k = 15). * We would get unbiased estimates, so in principle this would work, BUT the computational power needed to fit all of these models is far too much! As a result there are heuristics that avoid the full search, like stepwise regression.
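A quick sanity check of that count (Python standard library only):

```python
from math import comb

# Number of ways to choose k = 15 predictors out of p = 30
print(comb(30, 15))  # 155117520, roughly 155 million
```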
Handling Outliers
Outliers: - Standard deviations for normally distributed data (99.7% within 3σ). - Random cut / isolation forests: is a point easily isolated? - Clustering techniques: instances far from cluster centers / in their own clusters, etc. - Sklearn has some built-in estimators for anomaly detection (fit, predict). Missing values (ask: why is it missing?): • Drop the predictor/instance, not ideal if all instances are affected. • Univariate imputation, for example mean/median/most frequent. • MICE. • Mark missing values as missing (add_indicator).
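A minimal anomaly-detection sketch with scikit-learn's IsolationForest (synthetic data; the contamination value is an assumption):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # bulk of the data
               rng.normal(8, 1, size=(5, 2))])    # a few far-away points

iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X)   # -1 = outlier, 1 = inlier
print("outliers found:", (labels == -1).sum())
```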
Objective: Prediction vs Inference
Prediction: given new x_k, predict y_k. Estimate Y_hat = f(X); Y_hat is our estimate. - We don't care what f looks like; treat it like a black box (e.g. can we accurately predict a heart attack / loan default?). - Goal is accuracy. Inference: find the relationships between X and Y. - Estimate Y_hat = f(X). - What is the relationship between Y and the various X_i (e.g. what determines credit card default / heart attack)? - Goal is interpretability.
Overfitting and Underfitting Process 1
Process 1: 1. Take a dataset and split it randomly into train and test. 2. Select coefficients for a linear model using linear regression. 3. Compute MSE_train and MSE_test for this model. * Remember: linear regression selects coefficients by minimizing loss (MSE) on the training set, so MSE_train is usually small. - On average MSE_train < MSE_test.
Overfitting and Underfitting Process 2
Process 2: 1. Take a dataset and split it randomly into train and test. 2. Select random coefficients for a linear model. 3. Using the random model, compute MSE_train and MSE_test. Linear models have the form y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε, where we set the betas randomly. - MSE_train is usually large. - On average MSE_train = MSE_test.
Bias-Variance Trade-off for Cross validation
(Here we are referring to the bias/variance of our estimate of accuracy.) - The holdout method has the highest bias. Why? As in 2-fold CV, each model is trained on only a fraction of the data, so we tend to overestimate the test error. - LOOCV has very high variance: 1. Bias is low, since we drop just one observation for each model. 2. Variance is high: compare two of the models; they use nearly identical data and are very similar models, yet their performance estimates differ a lot. 3. Recall that the variance of a sum of correlated variables increases with the correlation. * k-fold CV lets us find a good trade-off empirically.
Regularization : Ridge Regression
* Need to STANDARDIZE the data before applying the penalty! - Ridge regression minimizes the objective MSE + α·∑β². - MSE is the usual average squared residual. • The second term penalizes large coefficients, known as shrinkage. • α is a tuning parameter supplied by the user (or chosen by CV). • α trades off between MSE and penalizing complex models. - If α = 0, ridge becomes ordinary linear regression -> more flexible -> more likely to overfit. - If α is very large, there is significant pressure from the penalty -> variance goes down -> bias goes up.
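A minimal sketch of standardizing before fitting ridge (the pipeline and alpha value are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=15, random_state=0)

# Standardize first so the penalty treats all coefficients on the same scale
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_)
```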
Reinforcement Learning
- Environment with states 𝑠 - Set of available actions 𝑎 - Probability of going to new state 𝑠′, given action - Reward for going from 𝑠 to 𝑠′
Drawbacks of the holdout method (80% train 20% test)
- Error estimates can be highly variable. • Only a subset of the data is used for training: the model may do worse when trained on fewer observations, so we overestimate the out-of-sample error.
Increasing the flexibility of linear regression
- Feature transformations increase flexibility - Linear regression = PARAMETRIC MODEL
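A small sketch of a feature transformation that adds flexibility while staying parametric (the degree and data are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(0, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() - x.ravel() ** 2 + np.random.default_rng(0).normal(0, 0.1, 50)

# Still a linear (parametric) model, but in the transformed features x and x^2
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[1.5]]))
```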
Nested Cross-Validation
- Intuitively, we use cross-validation for both hyper-parameter tuning and performance estimation. - Do CV in an inner-outer loop structure called nested CV. - Inner loop: select the best hyper-parameters. What is 'best'? Best accuracy, OR a heuristic: the most parsimonious model with accuracy within 1 standard deviation of the best model. - Outer loop: estimate the performance of a model using the best hyper-parameters discovered in the inner loop. - We do this for every type of model (Lasso, etc.) and get estimates of the performance of a tuned version of each model. - Nested cross-validation does not tell us what the best hyper-parameters are! It tells us what generalization error to expect if we use cross-validation to tune the hyper-parameters of a specific algorithm (the average error over the outer loop). - Once we select the best-performing algorithm using nested CV, how do we tune its hyper-parameters? We run CV on the entire dataset (as usual).
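A minimal nested-CV sketch: the inner loop tunes alpha with GridSearchCV, the outer loop estimates generalization error (the data and alpha grid are placeholders):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)

# Inner loop: pick the best alpha; outer loop: estimate performance of that tuned model
inner = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="neg_mean_squared_error")
print("estimated test MSE:", -np.mean(outer_scores))
```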
LASSO OR RIDGE?
- It depends on your dataset •Lasso assumes many predictors truly have ZERO effect •If that's the case for your data, Lasso will be better than ridge
How do we select the right k in cross validation?
- Largely an empirical question that depends on the application. - Larger k (more folds) gives a less biased estimate of test MSE. - As k increases, so does the variance of our estimate of test MSE. - Practical concern: larger k means greater computational cost (since we have to train the model k+1 times). Rule of thumb: 5- or 10-fold CV tends to work well.
Unsupervised Learning- Clustering
- No outcomes Y - What do we do now?
Supervised Learning- Regression Problem
- Predict a continuous value/number. - Note, a regression problem ≠ regression (OLS)
K-Fold Cross Validation - model evaluation
- Repeat the holdout procedure several times so that every instance/fold gets a chance to be in the validation set. •Record (and average) validation set performance across iterations + This uses all our data for training − Might be pessimistic for small k (why?) Final (deployment) model: train on all data
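A minimal k-fold evaluation sketch with scikit-learn (5 folds chosen arbitrarily, synthetic data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# Each instance lands in the validation fold exactly once; average the 5 fold scores
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("CV estimate of test MSE:", -scores.mean())
```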
Other common Regression metrics
- Root mean squared error (RMSE): the square root of MSE, which puts the error back in the original units and makes it more interpretable. - Mean absolute error (MAE): removes the sign of the residuals without squaring; we average the absolute values, so large residuals are not weighted extra heavily. - Mean absolute percentage error (MAPE): divides each absolute residual by the true value, so identical residuals get a larger weight when the true value is small.
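A small sketch of the three metrics with scikit-learn (made-up values; every residual is 10, yet MAPE weights the small-y case most):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

y_true = np.array([100.0, 50.0, 10.0])
y_pred = np.array([110.0, 60.0, 20.0])   # every residual is 10

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # the error on y=10 dominates
print(rmse, mae, mape)
```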
Supervised Learning- Classification
- Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations - New data is classified based on the training set. And we predict a category/label
Tradeoffs between Accuracy and Interpretability
Trade-off between model flexibility and interpretability.
Navigating Bias and Variance
What tools do we have at our disposal? Flexibility: number of predictors in the model, degree of polynomial, number of non-zero coefficients. Other reasons to encourage models with few predictors/non-zeros: simple models are easy to interpret and may generalize better; what if the number of predictors is large (p > n)? As model flexibility increases, variance increases. Tools: 1. Variable selection 2. Forward/backward selection 3. Ridge regression: MSE + λ·∑β² 4. Lasso: MSE + λ·∑|β| 5. Elastic net: MSE + λ(α·∑|β| + (1−α)·∑β²)
Navigating bias and variance-> What tools do we have to our disposal?
What tools do we have at our disposal? - Polynomial features give us more model flexibility; so far our only remedy has been manual variable selection. - Flexibility: number of predictors in our model, degree of polynomial. - Other reasons to encourage models with few predictors/non-zeros: 1. Simple models are easy to interpret and may generalize better. 2. What if the # of predictors is large (p > n)?
Suppose we do forward selection and repeatedly add the predictor that most increases training accuracy (decreases error). We stop the first time training accuracy decreases (error increases). How many of our p predictors will we have added when we stop?
p (all of them). Adding a predictor can never increase training error (at worst its coefficient can be set to zero), so training accuracy never decreases and we never stop early.
Handling Missing values: Imputation
- Univariate imputation, for example mean/median/most frequent. • Multivariate imputation -> k-NN. - Multivariate (iterative) imputation with chained equations (MICE). MICE algorithm sketch: 1. Univariate imputation to fill missing values (mean/median). 2. Choose one predictor as the target and reset its missing values. 3. Train a regression model with that target as y. 4. Use the predicted values to fill in its missing values. 5. Repeat 2-4 for each X_i with missing values = 1 cycle. Cycle until the iteration limit / convergence.
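A minimal sketch of MICE-style imputation using scikit-learn's IterativeImputer (still experimental, hence the extra import; the data is made up):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 8.0, 9.0],
              [np.nan, 4.0, 3.0]])

# Each feature with missing values is modeled as a function of the others, cycled until convergence
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```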
Supervised Learning Workflow
1. Define the problem to be solved. 2. Collect labelled training data (clean the data). 3. Choose an ML algorithm and fit it to the data (feature engineering). 4. Evaluate the model according to the chosen metric. 5. Use the model to predict out of sample.
Is Ridge Regression Perfect?
1. Ridge models include all variables even though some of them end up with very small coefficients 2. Doesn't do variable selection 3. Ridge is not parsimonious (we want models with a small number of predictors, not small coefficients for many predictors)
Backward Selection
1. Start with a model that contains ALL variables. 2. Fit p linear regressions, removing one predictor X_p each time, and compute MSE(p). 3. Choose the model with the smallest MSE(p). 4. Repeat.
Forward Selection
1. Start with a model that only contains an intercept. 2. Fit p linear regressions, one for each predictor X_p, and compute MSE(p). 3. Add the predictor X_p with the smallest MSE(p) to the model. 4. Repeat.
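A minimal forward-selection sketch using scikit-learn's SequentialFeatureSelector (it scores candidates by cross-validation rather than training MSE, and the stopping point of 3 features is an arbitrary assumption):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10, random_state=0)

# Greedily add the predictor that most improves the CV score, up to 3 predictors
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of selected predictors
```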
Cross Validation
A way to estimate model performance on new data. Three main reasons why we evaluate model performance: 1. Estimate how well we will do in data we have not seen yet (generalization performance) 2. Tweak the learning algorithm to improve its performance (e.g., by tuning 𝛼) 3. Compare different algorithms to select the best one
Bias var tradeoff example
As flexibility increases, variance goes up and bias goes down. Therefore, we are facing a trade-off between bias and variance. * In the MSE decomposition, the variance of the error term is irreducible, BUT the rest of the equation (bias and variance) is reducible.
Bias-Variance Tradeoff
Bias: the difference between our average estimate and the truth, e.g. the average of the estimated β̂'s (over many training sets) versus the true β's. Variance: the variance of our set of estimates (the β̂'s): are they similar or do they vary a lot between datasets? - Reducing MSE on your training data does not necessarily reduce MSE on data you have not trained on (the test set). • MSE_test could be going up as MSE_train is going down! • This is a fundamental trade-off in machine learning, called the bias-variance trade-off: increasing bias decreases variance and vice versa.
Linear Regression
Simple approach to supervised learning •Assumes linear relationship between predictors, outcome •Assumption is often false, but can work well even when false.
What is a good accuracy?
Suppose I tell you I have been studying market data for 2010-2015 and have a great model for predicting stock market movements. How can I convince you it is 'good'? - We should care about accuracy on future, unknown data, i.e. we want our models to `generalize.' - 𝑀𝑆𝐸𝑇𝑟𝑎𝑖𝑛 may be overly optimistic, since we evaluate it on the same data we trained on.
k-fold cross validation - model selection
Three main reasons why we evaluate model performance: 1. Estimate how well we will do on data we have not seen yet (generalization performance). 2. Tweak the learning algorithm to improve its performance (e.g., by tuning α). 3. Compare different algorithms to select the best one.
A Basic and WRONG attempt at Cross validation vs an improved attempt
WRONG: 1. Set aside a holdout/test set. 2. Fit models with different values of α on train, evaluate on test. 3. Select α*, the value of α that minimizes MSE on the test set. 4. Use MSE_α* as your performance estimate. Why shouldn't you do this? Define "training" as "fixing the entire model: coefficients and parameters". We know we should not look at the test set while choosing coefficients; the same is true for parameters. Improved: 1. Split the data: set aside X_test; split the rest in two: train X_train + validation X_val. 2. Fit different models (different α's) on training, evaluate on validation. 3. Select the parameter value that does best on the validation set, call it α*. 4. Train a new model with α* on the entire training set (X_train ∪ X_val). 5. Evaluate on the test set to get a performance estimate. 6. Train on X with α* to get the production model. This is theoretically sound and something we can do. Practical disadvantage: we have to "throw away" even more data to create both a test and a validation set. Can we do better?
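A sketch of the improved procedure under assumed splits (roughly 60% train / 20% validation / 20% test) and an arbitrary alpha grid:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

# 1. Carve off the test set, then split the rest into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# 2-3. Pick the alpha that does best on the validation set
alphas = [0.01, 0.1, 1.0, 10.0]
val_mse = [mean_squared_error(y_val, Lasso(alpha=a).fit(X_train, y_train).predict(X_val))
           for a in alphas]
best_alpha = alphas[int(np.argmin(val_mse))]

# 4-5. Refit on train + validation, estimate performance on the untouched test set
final = Lasso(alpha=best_alpha).fit(X_rest, y_rest)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))

# 6. Production model: refit on all data with the chosen alpha
production = Lasso(alpha=best_alpha).fit(X, y)
```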
What is supervised learning?
supervised learning uses labeled input and output data Given labelled data with outcome variables 𝑌=(𝑦1,...,𝑦𝑛): - 𝑛 is the number of observations in the dataset - Labels (𝑦's) can be sales, prices, categories (spam vs not spam). Predictors 𝑋𝑗 are features associated with observation 𝑦𝑖: -𝑥𝑖1 can be money spent on TV ads, 𝑥𝑖2 on radio etc. - 𝑋 can capture the features of a house (size, neighborhood, #beds) - 𝑋 can be the words that appear in an email .
Parametric or Non-Parametric?
The closer the parametric form is to the true form, the better the parametric method will do. - In higher dimensions (many predictors, few observations per predictor), parametric methods tend to do better.
K-NN -> Non parametric
k-NN is perhaps the simplest non-parametric estimator: - Assume we have some notion of distance between instances. - For a new instance x_0, let N_k(x_0) be the k instances closest to x_0. - Regression: predict the average value (1/k)·∑_{i ∈ N_k(x_0)} y_i. - Classification: predict the class most common in N_k(x_0).
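A small k-NN regression sketch (synthetic data; k = 5 is an arbitrary choice):

```python
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=100, n_features=2, noise=5, random_state=0)

# Predict the average y of the 5 nearest training instances
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(X[:3]))
```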
