Ch 6 Linear Model Selection and Regularization

Backward Stepwise Selection

Begin with the full model containing all p predictors, then iteratively remove the least useful predictor, one at a time. (Requires n > p so that the full model can be fit.)
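As an illustrative sketch (not the canonical algorithm), scikit-learn's SequentialFeatureSelector can run the backward pass on toy data; note that it scores each removal by cross-validation rather than by training RSS/R2, and the number of features to keep is a placeholder.

```python
# Sketch of backward stepwise selection (toy data; n_features_to_select
# and the use of CV scoring are placeholder choices).
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                      # n > p
y = X[:, 1] - 2 * X[:, 4] + rng.normal(size=100)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,    # placeholder: how many predictors to keep
    direction="backward",      # start from the full model, drop one at a time
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained predictors
```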

What kind of model does the lasso yield?

A sparse model

Dimension Reduction Steps

1. Obtain the M transformed predictors Z1, ..., ZM, where Zm = sum over j of phi_jm * Xj. 2. Fit the linear model by least squares using the M predictors Z1, ..., ZM in place of the original p predictors.
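A minimal numpy sketch of these two steps, using made-up data and a made-up loading matrix Phi (in practice Phi would come from PCA or PLS).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # toy n x p predictor matrix
y = rng.normal(size=100)         # toy response
Phi = rng.normal(size=(10, 3))   # placeholder p x M loading matrix

# Step 1: form the transformed predictors Z_m = sum_j phi_jm * X_j
Z = X @ Phi                      # n x M matrix of new predictors

# Step 2: fit the linear model on the M predictors Z instead of the p X's
fit = LinearRegression().fit(Z, y)
print(fit.coef_)                 # M coefficients for Z_1, ..., Z_M
```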

Considerations with Higher Dimensions

1. Regularization or shrinkage is crucial. 2. Careful tuning-parameter selection is crucial. 3. Test error increases as p increases, unless the additional features are truly associated with the response. 4. Multicollinearity becomes extreme.

Dimensionally Reduced Coefficients

A special case of least squares in which the beta coefficients are constrained. The constraint has the potential to bias the coefficient estimates, but when p is large relative to n, selecting a value of M << p can significantly reduce the variance.

Comparing Models

A good way to compare models is to plot each model against a measure of fit such as R2 or MSE, ideally using an estimate of test error rather than training fit.

Deviance

A measure that plays the role of RSS for a broader class of models fit by maximum likelihood. (Deviance = -2 * the maximized log-likelihood.)

Cp

A statistic that adds a penalty of 2*d*sigma_hat^2 to the training RSS, where sigma_hat^2 is an estimate of the variance of the error term and d is the number of predictors: Cp = (RSS + 2*d*sigma_hat^2)/n. The penalty increases as the number of predictors increases. Cp takes on a small value for models with a low test error.

Partial Least Squares

A supervised alternative to PCR. PLS first identifies a new set of features Z1, ..., ZM and then fits a linear model via least squares, but it uses Y to identify new features that not only approximate the old features well but are also related to the response.
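A minimal sketch using scikit-learn's PLSRegression on toy data; the choice of 3 components is a placeholder.

```python
# Sketch of partial least squares; data and n_components are placeholders.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

pls = PLSRegression(n_components=3)   # identify Z_1, ..., Z_M using y
pls.fit(X, y)
print(pls.predict(X[:5]))             # fitted values for the first 5 rows
```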

PCA

A technique for reducing the dimension of an n x p data matrix X. The first principal component direction defines the line that is as close as possible to the data.
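A small sketch using scikit-learn's PCA on a toy n x p matrix; components_[0] holds the first principal component direction.

```python
# Sketch: PCA on a toy n x p matrix; the data are placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 6))

pca = PCA()
pca.fit(X)                               # PCA centers the columns internally
print(pca.components_[0])                # first principal component direction
print(pca.explained_variance_ratio_)     # share of variance per component
```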

Bias-Variance and Regularization

As we increase the regularization strength, we increase the bias and decrease the variance.

Forward Stepwise Selection

Begin with a model containing no predictors, then add predictors to the model one at a time until all of the predictors are in the model. At each step, the variable that gives the greatest additional improvement to the fit is added. (Can be used even when n < p.)
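A hand-rolled sketch of the greedy forward pass, scoring each candidate by training R2 on toy data; in practice one would then choose among the resulting nested models using cross-validation, Cp, AIC, BIC, or adjusted R2.

```python
# Manual sketch of forward stepwise selection (toy data; scoring by
# training R^2 is a placeholder choice).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
y = X[:, 2] + 0.5 * X[:, 5] + rng.normal(size=100)

selected = []
remaining = list(range(X.shape[1]))
while remaining:
    # try adding each remaining predictor and keep the best one
    scores = {j: LinearRegression().fit(X[:, selected + [j]], y)
                                   .score(X[:, selected + [j]], y)
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
print(selected)   # order in which predictors entered the model
```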

Regularizing Coefficients

By constraining the coefficients, we can substantially reduce the variance at the cost of a negligible increase in bias.

Ridge Regression

Coefficients are estimated by minimizing the RSS plus a penalty on the beta coefficients, controlled by a tuning parameter. Like OLS, ridge attempts to minimize the RSS, but it adds a regularization penalty that shrinks the coefficients towards 0. The tuning parameter controls the relative impact of the two terms (the penalty and the RSS); as it approaches infinity, the beta coefficients approach 0. The penalty is not applied to the intercept. Equivalently, we seek the set of coefficient estimates for which the RSS is as small as possible, subject to the requirement that the l2 norm of beta not exceed a budget.
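A minimal sketch of ridge regression with scikit-learn, standardizing the predictors first; the penalty strength alpha = 10.0 and the toy data are placeholders.

```python
# Sketch: ridge regression on standardized predictors (alpha is a placeholder).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_)       # shrunk towards (not to) zero
print(ridge.named_steps["ridge"].intercept_)  # the intercept is not penalized
```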

PCR

Construct the first M principal components Z1, ..., ZM and then use these components as predictors in a linear model fit with least squares.
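A sketch of PCR as a pipeline of standardization, PCA, and least squares; M = 3 components and the toy data are placeholders.

```python
# Sketch of principal components regression: PCA builds Z_1, ..., Z_M,
# then least squares is fit on those components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 15))
y = X[:, 0] + rng.normal(size=120)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.predict(X[:5]))   # fitted values for the first 5 observations
```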

AIC

Defined for a large class of models fit by maximum likelihood. With Gaussian errors, maximum likelihood and least squares are equivalent. AIC adds a penalty for additional features, similar to Cp.

Features and Ridge Regression

Each feature should be standardized (centered to have mean zero and scaled to have standard deviation one) before ridge regression is performed.

One Standard Error Rule

First calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
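A numpy sketch of the rule, assuming cv_errors[i, k] holds the fold-k cross-validation error of the model with i + 1 predictors (the numbers here are made up).

```python
# Sketch of the one-standard-error rule on a toy grid of CV errors.
import numpy as np

rng = np.random.default_rng(6)
cv_errors = rng.normal(loc=[5.0, 3.0, 2.0, 1.9, 1.95], scale=0.3,
                       size=(10, 5)).T       # 5 model sizes x 10 folds

mean_err = cv_errors.mean(axis=1)
se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])

best = mean_err.argmin()
threshold = mean_err[best] + se_err[best]
# smallest model whose estimated error is within one SE of the minimum
chosen_size = int(np.argmax(mean_err <= threshold)) + 1
print(chosen_size)
```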

Subset Selection

Identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on this reduced set of variables.

Prediction Accuracy of OLS with Small N

If n is not much larger than p, there can be a lot of variability in the least squares fit, resulting in overfitting and poor predictions. If n < p, there is no longer a unique least squares coefficient estimate: the variance is infinite and the method cannot be used at all.

Shrinkage / Regularization

Involves fitting a model containing all p predictors; however, the estimated coefficients are shrunken towards 0 relative to the least squares estimates.

Adjusted R2

A large value of adjusted R2 indicates a model with a small test error; unlike R2, it penalizes the inclusion of unnecessary variables.

Lasso and Feature Selection

The lasso performs feature selection, since it can set coefficients exactly to 0.

Lasso v Ridge

The lasso tends to perform better when a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or zero. Ridge will perform better when the response is a function of many predictors, all with coefficients of roughly equal size.

L2 Regularization/norm

The l2 norm ||beta||_2 measures the distance of beta from zero. As the regularization parameter increases, the l2 norm of the ridge coefficient estimates will always decrease.

OLS Betas and Scale

OLS beta coefficient estimates are scale equivariant: multiplying Xj by a constant c simply scales the estimate of beta_j by a factor of 1/c, so Xj * beta_hat_j is unchanged.

Dimension Reduction

Projecting the p predictors into an M-dimensional space, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. These M projections are then used as predictors to fit a linear regression model by least squares.

Prediction Accuracy OLS with Large N

Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. If n >> p, that is, if n, the number of observations, is much larger than p, the number of variables, then the OLS estimates also tend to have low variance and will perform well on test observations.

RSS and Features

RSS always decreases as the number of variables in the model increases.

Model Interpretability

By removing variables, or setting their coefficients to 0, we can obtain a model that is more easily interpreted.

Bayesian Viewpoint for Ridge and Lasso

Ridge and lasso follow directly from a Bayesian viewpoint: assume the usual linear model with normal errors and place a prior on the coefficients. A Gaussian prior yields ridge regression as the posterior mode, while a double-exponential (Laplace) prior yields the lasso.

Ridge Regression and Scale

Ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant, so each feature should be standardized to put the predictors on the same scale before fitting.

Disadvantage of Ridge Regression

Ridge regression includes all p features in the final model: the regularization term shrinks the coefficients towards 0 but never sets any of them exactly to 0 (unless the tuning parameter is infinite), which can hurt model interpretability.

Why Ridge Improves OLS

Ridge regression's advantage over OLS is rooted in the bias-variance tradeoff: as the regularization parameter increases, the flexibility of the fit decreases, leading to decreased variance but increased bias. Ridge regression can perform well even when n is close to p.

Test MSE

Sum((y - y_hat)^2)/n, computed over the test observations.

BIC

Tends to take on a small value for models with a low test error. Derived from a Bayesian point of view; it replaces the 2*d*sigma_hat^2 penalty of Cp with log(n)*d*sigma_hat^2, and so places a heavier penalty on models with more features.

How does PCR determine the linear combinations of X1 -> Xp

The PCR approach identifies the linear combinations, or directions, that best represent the predictors X1, ..., Xp in an unsupervised way, since Y is not used to help determine the principal components.

First Principal Component Direction

The direction of the data along which the observations vary the most.

Curse of Dimensionality

The fact that test error tends to increase as the number of features p grows, unless the additional features are truly associated with the response.

Key Idea of PCR

The key idea is that often a small number of principal components suffice to explain most of the variability in the data as well as the relationship with the response. We assume that the directions in which X1 -> Xp show the most variation are the directions that are associated with Y.

Training v Test Error

Training error tends to underestimate the test error. This is because the training error decreases, and R2 increases, as more variables are added to the model, regardless of whether those variables are actually useful.

l1 Norm

The l1 norm ||beta||_1 = Sum(|beta_j|); it uses absolute values instead of squared terms to measure the distance of the vector from 0.

Advantages of using Validation > Estimating the test MSE

Using the validation or cross-validation approach, we can directly estimate the test error. The approach also makes fewer assumptions about the true underlying model than Cp, AIC, BIC, or adjusted R2.

Hybrid Approaches

Variables are added to the model sequentially; however, after adding each new variable, the method may also remove any variables that no longer provide an improvement to the model fit.

When to use Ridge over OLS

When the OLS estimates have high variance.

Lasso and OLS

When the OLS estimates have high variance, the lasso can reduce the variance at the cost of a small increase in bias.

When does PCR perform well?

PCR performs well when the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response.

Sparse Models

Models that involve only a subset of the variables.

The Lasso

A regularization method that shrinks coefficients towards 0 and can even set some of them exactly to 0. We seek the set of coefficient estimates that leads to the smallest RSS, subject to the constraint that the l1 norm of the beta coefficients not exceed a budget.
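A minimal sketch with scikit-learn's Lasso showing the resulting sparsity; alpha and the toy data are placeholders.

```python
# Sketch: the lasso zeroes out some coefficients exactly, yielding a
# sparse model (alpha = 0.1 is a placeholder).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = StandardScaler().fit_transform(rng.normal(size=(100, 10)))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)                  # several entries are typically exactly 0.0
print(np.flatnonzero(lasso.coef_))  # the selected features
```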

