BA 476 midterm 1


Why does lasso lead to sparse models?

Suppose you have two predictors with estimates β1 = 1, β2 = 10, and someone forces you to decrease the total size of the coefficients by 1. Assume MSE is affected the same way regardless of which coefficient you decrease. Under the ridge penalty (squared coefficients), shrinking β2 from 10 to 9 reduces the penalty by 100 - 81 = 19, while shrinking β1 from 1 to 0 reduces it by only 1. So, all else being equal, the ridge penalty encourages shrinking larger coefficients before smaller ones, and small coefficients stick around. Under the lasso penalty (absolute values), either choice reduces the penalty by exactly 1, so lasso is just as willing to push a small coefficient all the way to zero, which is what produces sparsity.
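
A minimal numeric sketch of the penalty arithmetic above (plain Python, nothing from the course assumed):

```python
# Penalty reduction from shrinking a coefficient by 1, starting from b1 = 1, b2 = 10.
b1, b2 = 1.0, 10.0

# Ridge penalty uses squared coefficients.
ridge_drop_b1 = b1**2 - (b1 - 1)**2   # 1 - 0   = 1
ridge_drop_b2 = b2**2 - (b2 - 1)**2   # 100 - 81 = 19

# Lasso penalty uses absolute values.
lasso_drop_b1 = abs(b1) - abs(b1 - 1)  # 1
lasso_drop_b2 = abs(b2) - abs(b2 - 1)  # 1

print(ridge_drop_b1, ridge_drop_b2)  # 1.0 19.0 -> ridge prefers shrinking the big coefficient
print(lasso_drop_b1, lasso_drop_b2)  # 1.0 1.0  -> lasso is indifferent, so small betas can hit 0
```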

Linear Regression; Measuring accuracy of a model

* Outliers have a large impact on residuals. - One measure of success is the average squared residual (MSE, the average prediction error). - We square the residuals so that positive and negative errors do not cancel when we sum them, and so that large errors count more toward the overall error of the model. * MSE = Variance + Bias² + Irreducible Error
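
A small sketch of computing MSE from residuals (numpy assumed available; the numbers are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 10.0])
y_pred = np.array([2.5, 6.0, 7.0, 8.0])

residuals = y_true - y_pred   # negative -> overpredicted, positive -> underpredicted
mse = np.mean(residuals**2)   # squaring keeps errors from cancelling out
print(mse)
```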

Why does lasso lead to sparse models? (2)

*Extra info: 'elastic net' combines ridge and lasso, because lasso can be unstable with highly correlated variables. Elastic net aims for the best of both: it gets stability from ridge and sparsity from lasso.
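
A hedged sketch comparing the three penalties with scikit-learn (the dataset is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data: 10 predictors, only 3 truly matter.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mixes the L1 and L2 penalties

print("ridge nonzeros:", np.sum(ridge.coef_ != 0))  # typically all 10
print("lasso nonzeros:", np.sum(lasso.coef_ != 0))  # typically fewer: sparse
print("enet  nonzeros:", np.sum(enet.coef_ != 0))
```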

Overfitting and Underfitting General

- (Process 2) Random model: MSE_train = MSE_test, both large.
- (Process 1) Linear regression model: MSE_train < MSE_test.
- Underfitting: both MSE_train and MSE_test are large (didn't learn anything).
- Overfitting: MSE_train is small and MSE_test is large (learnt noise).
- Generalizable models: MSE_train ≈ MSE_test.
- Ultimate goal: low MSE_test.

Navigating the Bias-Variance Trade off -> Can we do anything other than manual variable selection?

- So far we have been fitting least squares, which chooses coefficients to minimize the loss function (MSE) on the training set. But what if we change the loss function? Instead of minimizing plain MSE we can use ridge regression or the lasso (a more recent method).

More graph examples of bias variance tradeoff

- Look over lecture 3 for more graph examples.

LASSO (Least Absolute Shrinkage and Selection Operator)

- The Lasso is a relatively recent advance in statistical learning. •Select model parameters to minimize MSE + α⋅∑|β|. •The penalty term now uses the absolute values of the coefficients instead of their squares (L1 norm instead of L2). •Like Ridge, Lasso shrinks coefficients towards zero. •Unlike Ridge, the Lasso sets some coefficients to exactly zero! •Yields sparse, parsimonious models and performs variable selection.
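
A hedged sketch of how lasso zeroes out more coefficients as α grows (scikit-learn on synthetic data, illustrative α grid):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

for alpha in [0.01, 1.0, 10.0, 100.0]:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    # With larger alpha, more coefficients are set to exactly zero (variable selection).
    print(f"alpha={alpha:>6}: {np.sum(coef == 0)} of {coef.size} coefficients are exactly 0")
```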

When can you use ML?

- There is some pattern. - The pattern is hard to pin down mathematically. - We have data on it. - ML is a program that improves with experience.

Train-Test Paradigm

- We care about out-of-sample (generalization) accuracy. •Keep some data separate and treat it as unknown. •Compute MSE on the test/holdout data (average squared residuals). •From the perspective of the algorithm, future data ≈ test data. •This avoids overfitting to the training data. •Potential problems with using a test set: we train on less data, and the test set must be representative.
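
A minimal sketch of the train-test paradigm with scikit-learn (synthetic data; the 80/20 split is an assumption for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# Hold out 20% of the data and treat it as "future" data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))  # the number we care about
```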

Linear Regression (in 1-D) Single predictor

Goal: choose a 'line of best fit'. More formally: estimate parameters (β_0, β_1) based on (X, Y). - Suppose we have parameter estimates β_0, β_1; then given any X': y_hat = β_0 + β_1·X'. When we train a model we come up with estimates for the betas. - Residual: r_i = y_i - y_hat_i, i.e. how much we are off on instance i. If the residual is negative we overpredicted; if positive, we underpredicted.
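
A tiny 1-D sketch with scikit-learn (the hours-studied vs. exam-score numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied vs exam score.
x = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([52, 58, 65, 68, 77], dtype=float)

model = LinearRegression().fit(x, y)
print("beta_0 (intercept):", model.intercept_)
print("beta_1 (slope):    ", model.coef_[0])

y_hat = model.predict(x)
residuals = y - y_hat   # negative -> overpredicted, positive -> underpredicted
print("residuals:", residuals)
```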

How do we do Linear Regression?

Goal: we want the 'best' line. - Set up an objective function that measures how close our predictions are to the observed points. - The best line minimizes the objective function (a loss between the true labels and the predictions, e.g. squared residuals). - Assumption: the relationship between predictors and outcome is linear.

Regularization

Key idea in modern analytics. •Automatically constrain the model to a subset of variables. •To do so we minimize a different objective function than MSE. •Why would minimizing something other than MSE (on the training data) work? We will still evaluate our predictions using MSE on the test data, after all. •Essentially, we are trading off some bias for some variance. Hopefully the trade-off is good: we sacrifice a bit in terms of bias to gain a lot in terms of variance. •As alpha increases, the penalty gets larger and the size of the coefficients goes down (see the sketch below).
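
A hedged sketch of the "larger α, smaller coefficients" effect using scikit-learn's Ridge on synthetic data (the α grid is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    # The total coefficient size shrinks as the penalty weight alpha grows.
    print(f"alpha={alpha:>8}: sum of |coef| = {np.abs(coef).sum():.1f}")
```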

Tuning hyperparameters

Many machine learning models have tunable parameters for which we need to select a value. •How should we choose α in Lasso/Ridge? (The approach is general, not specific to Lasso/Ridge.) •Choosing the right α is the difference between glory and disaster. •Recall: higher α = smaller coefficients β, higher bias + lower variance.
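
A hedged sketch of tuning α with cross-validation via scikit-learn's GridSearchCV (synthetic data, arbitrary α grid):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=15.0, random_state=0)

# Try several alphas, scoring each by cross-validated (negative) MSE.
search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"])
```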

Choosing subsets of predictors

Objective: balance model complexity and prediction accuracy. How about this: for every possible combination of predictors, fit a linear regression, compute test MSE, and pick the best model. Would this work? Is it practical? There are C(p, k) ways to choose k predictors from p (about 155 million for p = 30, k = 15). * We would be getting unbiased estimates, so this would be good, BUT the computational cost of fitting all of these models is far too high. As a result there are heuristics that avoid the exhaustive search, like stepwise regression (forward/backward selection).
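
A quick check of that count (Python standard library only):

```python
from math import comb

# Number of ways to choose 15 predictors out of 30.
print(comb(30, 15))  # 155117520, roughly 155 million models to fit
```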

Handling Outliers

Detecting outliers: •Standard deviations for normally distributed data (99.7% of values fall within 3σ). •Random cut / isolation forests: is a point easily isolated? •Clustering techniques: instances far from cluster centers, in their own clusters, etc. •Sklearn has some built-in estimators for anomaly detection (fit, predict); see the sketch below. Handling missing values (first ask: why is it missing?): •Drop the predictor/instance (not ideal if all instances are affected). •Univariate imputation, for example mean/median/most frequent. •MICE. •Mark missing values as missing (add_indicator).
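
A hedged sketch of isolation-forest-style anomaly detection with scikit-learn (the data and the injected outliers are synthetic):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:5] += 6.0  # shift a few points far away so they act as outliers

iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)   # +1 = inlier, -1 = flagged as outlier
print("flagged outliers:", np.sum(labels == -1))
```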

Objective: Prediction vs Inference

Prediction: given a new x_k, predict y_k. Estimate Y_hat = f(X); Y_hat is our estimate. - We don't care what f looks like; treat it like a black box (e.g. can we accurately predict a heart attack / loan default?). - Goal is accuracy. Inference: find the relationships between X and Y. - Estimate Y_hat = f(X). - What is the relationship between Y and the various X_i (e.g. what determines credit card default / heart attack?). - Goal is interpretability.

Overfitting and Underfitting Process 1

Process 1: 1. Take a dataset and split it randomly into train and test. 2. Select coefficients for a linear model using linear regression. 3. Compute MSE_train and MSE_test for this model. *Remember: linear regression selects coefficients by minimizing loss (MSE) on the training set, so MSE_train is usually small. - On average MSE_train < MSE_test.

Overfitting and Underfitting Process 2

Process 2: 1. Take a dataset and split it randomly into train and test. 2. Select random coefficients for a linear model. 3. Using the random model, compute MSE_train and MSE_test. Linear models have the form y = β0 + β1·x1 + β2·x2 + β3·x3 + β4·x4 + ε, where here we set the betas randomly. - MSE_train is usually large. - On average MSE_train = MSE_test.

Bias-Variance Trade-off for Cross validation

(Here we are referring to the bias/variance of our estimate of accuracy.) - The holdout method has the highest bias. Why? Consider what happens during 2-fold CV: each model is trained on only half the data. - LOOCV: 1. Bias is low, since we drop just one observation for each model. 2. Variance is high: compare two of the models; do they use similar data and yield similar estimates? They are very similar models, yet give very different performance estimates (each estimate comes from a single observation). 3. Recall that the variance of correlated variables increases with correlation. *k-fold CV lets us find a good trade-off empirically.

Regularization : Ridge Regression

* Need to STANDARDIZE the data before applying the penalty! - Ridge regression minimizes the objective MSE + α⋅∑β². - MSE is the usual average squared residual. •The second term penalizes large coefficients, known as shrinkage. •α is a tuning parameter supplied by the user (or chosen by CV). •α trades off between MSE and penalizing complex models. •If α = 0, ridge becomes ordinary linear regression: more flexible, more likely to overfit. •If α is very large, there is significant pressure from the penalty: variance goes down, bias goes up.
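
A hedged sketch of standardizing before the penalty, using a scikit-learn pipeline on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Scaling happens inside the pipeline, so the penalty sees all predictors on the same scale
# (and, under cross-validation, scaling statistics come from the training folds only).
model = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```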

Reinforcement Learning

- Environment with states 𝑠 - Set of available actions 𝑎 - Probability of going to new state 𝑠′, given action - Reward for going from 𝑠 to 𝑠′

Drawbacks of the holdout method (80% train 20% test)

- Error estimates can be highly variable (they depend on which observations land in the test set). •Only a subset of the data is used for training: the model may do worse when trained on fewer observations, so we overestimate the out-of-sample error.

Increasing the flexibility of linear regression

- Feature transformations (e.g. polynomial features) increase flexibility while the model stays linear in its coefficients. - Linear regression = PARAMETRIC MODEL.
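
A minimal sketch of adding polynomial features with scikit-learn (the quadratic data below is made up to show the idea):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A nonlinear relationship: y depends on x^2.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 2.0 * x[:, 0] ** 2 + rng.normal(scale=1.0, size=100)

# Still linear regression, just on transformed features (1, x, x^2).
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[2.0]])))  # should be close to 2 * 2^2 = 8
```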

Nested Cross-Validation

- Intuitively, we use cross-validation for both hyper-parameter tuning and performance estimation.
- Do CV in an inner-outer loop structure called nested CV.
- Inner loop: select the best hyper-parameters. What is 'best'? Best accuracy, OR a heuristic: the most parsimonious model with accuracy within 1 standard deviation of the best model.
- Outer loop: estimate the performance of a model using the best hyper-parameters discovered in the inner loop.
- We do this for every type of model (Lasso, etc.) and get estimates of the performance of a tuned version of each model.
- Nested cross-validation does not tell us what the best hyper-parameters are! It tells us what generalization error to expect if we use cross-validation to tune the hyper-parameters of a specific algorithm (the average error over the outer loop).
- Once we select the best-performing algorithm using nested CV, how do we tune its hyper-parameters? We run CV on the entire dataset (as usual). A sketch follows below.
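
A hedged sketch of nested CV with scikit-learn, wrapping an inner GridSearchCV in an outer cross_val_score (synthetic data, arbitrary α grid):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=15.0, random_state=0)

# Inner loop: tune alpha with 5-fold CV.
inner = GridSearchCV(Lasso(max_iter=10000),
                     param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                     scoring="neg_mean_squared_error", cv=5)

# Outer loop: estimate the generalization error of "Lasso tuned by CV".
outer_scores = cross_val_score(inner, X, y, scoring="neg_mean_squared_error", cv=5)
print("estimated test MSE:", -outer_scores.mean())

# For the deployment hyper-parameter, run the inner CV on the full dataset.
print("alpha for deployment:", inner.fit(X, y).best_params_["alpha"])
```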

LASSO OR RIDGE?

- It depends on your dataset •Lasso assumes many predictors truly have ZERO effect •If that's the case for your data, Lasso will be better than ridge

How do we select the right k in cross validation?

- Largely an empirical question; it depends on the application. - Larger k (more folds) gives a less biased estimate of test MSE. - But as k increases, so does the variance of our estimate of test MSE. - Practical concern: larger k means greater computational cost (since we have to train the model k + 1 times, counting the final deployment model). - Rule of thumb: 5- or 10-fold CV tends to work well.

Unsupervised Learning- Clustering

- No outcomes Y - What do we do now?

Supervised Learning- Regression Problem

- Predict a continuous value/number. - Note, a regression problem ≠ regression (OLS)

K-Fold Cross Validation - model evaluation

- Repeat the holdout procedure several times so that every instance/fold gets a chance to be in the validation set. •Record (and average) validation-set performance across iterations. •Plus: this uses all our data. •Minus: might be pessimistic for small k (why? each model is trained on only a fraction (k-1)/k of the data). •Final (deployment) model: train on all data. See the sketch below.
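
A minimal sketch of k-fold CV for model evaluation with scikit-learn (synthetic data, 5 folds assumed):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

# 5-fold CV: each fold takes a turn as the validation set.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print("estimated test MSE:", -scores.mean())

# Final deployment model: train on all the data.
final_model = LinearRegression().fit(X, y)
```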

Other common Regression metrics

- Root mean squared error (RMSE): the square root of MSE, which puts the error back in the original units and makes it more interpretable. - Mean absolute error (MAE): removes the sign of the residuals by taking absolute values instead of squaring, so large errors are not weighted extra. - Mean absolute percentage error (MAPE): measures each absolute error relative to the true value, so for identical residuals, instances with smaller true values get larger weight.
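
A small sketch computing these metrics with scikit-learn (made-up numbers; mean_absolute_percentage_error assumes a reasonably recent scikit-learn):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

y_true = np.array([100.0, 50.0, 200.0, 10.0])
y_pred = np.array([110.0, 45.0, 190.0, 14.0])

mse = mean_squared_error(y_true, y_pred)
print("RMSE:", np.sqrt(mse))                         # back in the original units
print("MAE: ", mean_absolute_error(y_true, y_pred))  # average |residual|
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # errors relative to y_true
```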

Supervised Learning- Classification

- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations. - New data is classified based on the training set, and we predict a category/label.

Tradeoffs between Accuracy and Interpretability

Trade-off between model flexibility and interpretability.

Navigating Bias and Variance

What tools do we have at our disposal? Flexibility: the number of predictors in the model, the degree of the polynomial, the number of non-zero coefficients. Other reasons to encourage models with few predictors/nonzeros: simple models are easy to interpret and may generalize better; and what if the number of predictors is large (p > n)? As a model's flexibility increases, its variance increases. Tools: 1. Variable selection 2. Forward/backward selection 3. Ridge regression: MSE + λ⋅∑β² 4. Lasso: MSE + λ⋅∑|β| 5. Elastic net: MSE + λ(α⋅∑|β| + (1−α)⋅∑β²)

Navigating bias and variance-> What tools do we have to our disposal?

What tools do we have at our disposal? - Polynomial features give us more model flexibility; so far we have controlled flexibility only through manual variable selection. - Flexibility: number of predictors in our model, degree of polynomial. - Other reasons to encourage models with few predictors/nonzeros: 1. Simple models are easy to interpret and may generalize better. 2. What if the # of predictors is large (p > n)?

Suppose we do forward selection and repeatedly add the predictor that most increases training accuracy (decreases error). We stop the first time training accuracy decreases (error increases). How many of our p predictors will we have added when we stop?

p (all of them). Adding a predictor can never increase training error: the fitted model could always set the new coefficient to zero and recover the previous fit. So the stopping condition never triggers.

Handling Missing values: Imputation

- Univariate imputation, for example mean/median/most frequent. •Multivariate imputation, e.g. k-NN. •Multivariate (iterative) imputation with chained equations (MICE). MICE algorithm sketch: 1. Univariate imputation to fill missing values (mean/median). 2. Choose one predictor as the target and reset its missing values. 3. Train a regression model with that target as y. 4. Use the predicted values to fill in its missing values. 5. Repeat 2-4 for each X_i with missing values (= 1 cycle). Cycle until the iteration limit or convergence. A scikit-learn sketch follows below.
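
A hedged sketch of MICE-style imputation with scikit-learn's IterativeImputer (still behind an experimental flag, hence the extra import; the matrix is made up):

```python
import numpy as np
# IterativeImputer must be explicitly enabled in scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 6.0],
              [3.0, 6.0, np.nan],
              [4.0, 8.0, 12.0]])

# Each feature with missing values is modeled as a function of the others,
# and the cycle repeats until max_iter or convergence (MICE-style).
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```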

Supervised Learning Workflow

1. Define the problem to be solved. 2. Collect labelled training data (clean the data). 3. Choose an ML algorithm and fit it to the data (feature engineering). 4. Evaluate the model according to the chosen metric. 5. Use the model to predict out of sample.

Is Ridge Regression Perfect?

1. Ridge models include all variables even though some of them end up with very small coefficients 2. Doesn't do variable selection 3. Ridge is not parsimonious (we want models with a small number of predictors, not small coefficients for many predictors)

Backward Selection

1. Start with a model that contains ALL variables. 2. Fit p linear regressions, removing one predictor X_j each time, and compute MSE(j). 3. Choose the model with the smallest MSE(j), i.e. drop that predictor. 4. Repeat.

Forward Selection

1. Start with a model that only contains an intercept. 2. Fit p linear regressions, one for each predictor X_j not yet in the model, and compute MSE(j). 3. Add the predictor X_j with the smallest MSE(j) to the model. 4. Repeat. (See the stepwise-selection sketch below.)
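
A hedged sketch of forward/backward stepwise selection using scikit-learn's SequentialFeatureSelector (note it scores candidates by cross-validation rather than training MSE; synthetic data, and the choice of 3 features is arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# direction="forward" starts from the intercept-only model and adds predictors;
# direction="backward" starts from all predictors and removes them.
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected predictors:", selector.get_support(indices=True))
```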

Cross Validation

A way to estimate model performance on new data. Three main reasons why we evaluate model performance: 1. Estimate how well we will do in data we have not seen yet (generalization performance) 2. Tweak the learning algorithm to improve its performance (e.g., by tuning 𝛼) 3. Compare different algorithms to select the best one

Bias var tradeoff example

As flexibility increases, variance goes up and bias goes down. Therefore, we face a trade-off between bias and variance. *In the MSE decomposition, the variance of the error term (noise) is irreducible, BUT the rest of the equation (bias² and variance of the estimate) is reducible.

Bias-Variance Tradeoff

Bias: the difference between our average estimate and the truth, e.g. between the average estimate of the β̂'s (over many training sets) and the true β's. Variance: the variance of our set of estimates (the β̂'s): are they similar, or do they vary a lot between datasets? - Reducing MSE on your training data does not necessarily reduce MSE on data you have not trained on (the test set). •MSE_test could be going up as MSE_train is going down! •This is a fundamental trade-off in machine learning, called the bias-variance trade-off: increasing bias decreases variance and vice versa.

Linear Regression

Simple approach to supervised learning •Assumes linear relationship between predictors, outcome •Assumption is often false, but can work well even when false.

What is a good accuracy?

Suppose I tell you I have been studying market data for 2010-2015 and have a great model for predicting stock market movements. How can I convince you it is 'good'? - We should care about accuracy on future, unknown data, i.e. we want our models to 'generalize'. - MSE_train may be overly optimistic, since we evaluate it on the same data we trained on.

k-fold cross validation - model selection

Three main reasons why we evaluate model performance: 1. Estimate how well we will do on data we have not seen yet (generalization performance). 2. Tweak the learning algorithm to improve its performance (e.g., by tuning α). 3. Compare different algorithms to select the best one.

A Basic and WRONG attempt at Cross validation vs an improved attempt

WRONG: 1. Set aside a holdout/test set. 2. Fit models with different values of α on train, evaluate on test. 3. Select α*, the value of α that minimizes MSE on the test set. 4. Use MSE_α* as your performance estimate. Why shouldn't you do this? Define 'training' as 'fixing the entire model: coefficients and parameters.' We know we should not look at the test set while choosing coefficients; the same is true for parameters.
Improved: 1. Split the data: set aside X_test; split the rest in 2: train X_train + validation X_val. 2. Fit different models (different α's) on training, evaluate on validation. 3. Select the parameter value that does best on the validation set, call it α*. 4. Train a new model with α* on the entire training set (X_train ∪ X_val). 5. Evaluate on the test set to get a performance estimate. 6. Train on X with α* to get the production model. This is theoretically sound and something we can do. Practical disadvantage: we have to 'throw away' even more data to create both a test and a validation set. Can we do better?
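
A hedged sketch of the improved train/validation/test procedure with scikit-learn (synthetic data; the split sizes and α grid are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=20, n_informative=5,
                       noise=15.0, random_state=0)

# 1. Carve out a test set, then split the rest into train + validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# 2-3. Pick the alpha that does best on the validation set.
alphas = [0.01, 0.1, 1.0, 10.0]
val_mse = []
for a in alphas:
    model = Lasso(alpha=a, max_iter=10000).fit(X_train, y_train)
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))
best_alpha = alphas[int(np.argmin(val_mse))]

# 4-5. Refit on train + validation, then estimate performance on the untouched test set.
final = Lasso(alpha=best_alpha, max_iter=10000).fit(X_rest, y_rest)
print("best alpha:", best_alpha, "test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```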

What is supervised learning?

Supervised learning uses labelled input and output data. Given labelled data with outcome variables Y = (y_1, ..., y_n): - n is the number of observations in the dataset. - Labels (y's) can be sales, prices, or categories (spam vs. not spam). Predictors X_j are features associated with observation y_i: - x_i1 can be money spent on TV ads, x_i2 on radio, etc. - X can capture the features of a house (size, neighborhood, # of beds). - X can be the words that appear in an email.

Parametric or Non-Parametric?

The closer the assumed parametric form is to the true form, the better the parametric method will do. - In higher dimensions (many predictors, few observations per predictor), parametric methods tend to do better than non-parametric ones.

K-NN -> Non parametric

k-NN is perhaps the simplest non-parametric estimator: - Assume we have some notion of distance between instances. - For a new instance x_0, let N_k(x_0) be the k instances closest to x_0. - Regression: predict the average value (1/k) ∑_{i ∈ N_k(x_0)} y_i. - Classification: predict the class most common in N_k(x_0).
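
A tiny sketch of both k-NN variants with scikit-learn (made-up 1-D data so the neighbors are easy to see):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_cls = np.array([0, 0, 0, 1, 1, 1])

# Regression: average the y's of the 3 nearest neighbors.
print(KNeighborsRegressor(n_neighbors=3).fit(X, y_reg).predict([[2.5]]))    # ~2.0

# Classification: majority vote among the 3 nearest neighbors.
print(KNeighborsClassifier(n_neighbors=3).fit(X, y_cls).predict([[10.5]]))  # [1]
```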

