Ch 6 Linear Model Selection and Regularization
Backward Stepwise Selection
Begin with the full model containing all p predictors and iteratively remove the least useful predictor, one at a time. Requires n > p so that the full model can be fit.
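For illustration, a minimal sketch of backward stepwise selection using scikit-learn's SequentialFeatureSelector; the synthetic dataset and the choice to retain 5 predictors are assumptions for the example, not part of the notes.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 200 observations, 10 candidate predictors (n > p).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Start from all predictors and greedily drop the least useful one at a time.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", cv=5
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the predictors that were kept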
What kind of model does the Lasso Yield
A sparse model
Dimension Reduction Steps
1. Obtain the M transformed predictors Z_1, ..., Z_M, where Z_m = sum_{j=1}^{p} phi_{jm} X_j. 2. Fit a linear model using the Z_m predictors via least squares.
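A minimal numpy sketch of the two steps, under the assumption that the projection matrix Phi is simply random for illustration (in practice PCR or PLS would choose it):

import numpy as np

rng = np.random.default_rng(0)
n, p, M = 100, 10, 3
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Step 1: build the transformed predictors Z_m = sum_j phi_jm * X_j.
Phi = rng.normal(size=(p, M))   # column m holds the weights phi_1m, ..., phi_pm
Z = X @ Phi

# Step 2: fit a least squares model on Z_1, ..., Z_M (plus an intercept).
design = np.column_stack([np.ones(n), Z])
theta, *_ = np.linalg.lstsq(design, y, rcond=None)
print(theta)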
Considerations with Higher Dimensions
1. Regularization or shrinkage is crucial. 2. Careful tuning parameter selection is crucial. 3. Test error tends to increase as p increases, unless the additional features are truly associated with the response. 4. Multicollinearity becomes extreme.
Dimensionally Reduced Coefficients
A special case of least squares in which the beta coefficients are constrained. The constraint has the potential to bias the coefficient estimates, but when p is large relative to n, selecting a value of M << p can significantly reduce the variance.
Comparing Models
A good way to compare models is to plot each model against a measure of fit such as R2 or MSE.
Deviance
A measure that plays the role of RSS for a broader class of models; it equals -2 times the maximized log-likelihood.
Cp
A statistic that adds a penalty of 2 d sigma_hat^2 to the training RSS: Cp = (1/n)(RSS + 2 d sigma_hat^2), where d is the number of predictors and sigma_hat^2 is an estimate of the variance of the error term. The penalty increases as the number of predictors increases, and Cp takes on a small value for models with low test error.
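A hedged numpy sketch of the Cp computation; estimating sigma_hat^2 from the full model's residuals is one common convention, and the data and submodel are illustrative.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=1)
n, p = X.shape

# Estimate the error variance from the full least squares fit.
full = LinearRegression().fit(X, y)
rss_full = np.sum((y - full.predict(X)) ** 2)
sigma2_hat = rss_full / (n - p - 1)

# Cp for a submodel containing d predictors.
d = 3
sub = LinearRegression().fit(X[:, :d], y)
rss = np.sum((y - sub.predict(X[:, :d])) ** 2)
cp = (rss + 2 * d * sigma2_hat) / n
print(cp)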
Partial Least Squares
A supervised alternative to PCR which first identifies a new set of features Z_1 -> Z_M and then fits a linear model via least squares, but uses Y as a supervisor to identify new features that not only approximate the old features but are also related to the response.
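A minimal sketch of partial least squares with scikit-learn's PLSRegression; the two components and the synthetic data are illustrative choices.

from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

pls = PLSRegression(n_components=2)
pls.fit(X, y)                  # Y is used when constructing the directions
print(pls.x_weights_.shape)    # (10, 2): the weights defining Z_1 and Z_2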
PCA
A technique for reducing the dimension of an n x p data matrix X. The first principal component vector defines the line that is as close as possible to the data.
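A minimal PCA sketch on synthetic data; the number of components kept (2) is an arbitrary choice for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)               # scores on the first two components
print(pca.components_[0])              # first principal component direction
print(pca.explained_variance_ratio_)   # share of variance each component explains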
bias - variance and regularization
As we increase the regularization strength, we increase the bias and decrease the variance.
Forward Stepwise Selection
Begin with a model containing no predictors, then add predictors to the model one at a time until all of the predictors are in the model. At each step, the variable that gives the greatest additional improvement to the fit is added. Can be used even when n < p.
Regularizing Coefficients
By constraining the coefficients, we can substantially reduce the variance at the cost of a negligible increase in bias.
Ridge Regression
Coefficients are estimated by minimizing the RSS plus a penalty on the beta coefficients controlled by a tuning parameter. Like OLS, ridge regression attempts to minimize the RSS, but it adds a regularization penalty that shrinks the coefficients towards 0. The tuning parameter controls the relative impact of the two terms (the penalty and the RSS); as it approaches infinity, the beta coefficients approach 0. The penalty is not applied to the intercept. Equivalently, we seek the set of coefficient estimates that makes the RSS as small as possible, subject to the requirement that the regularized (l2) size of the beta coefficients not exceed a budget.
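A minimal ridge sketch, assuming scikit-learn's Ridge (whose alpha plays the role of the tuning parameter lambda and whose intercept is left unpenalized); the data and alpha value are illustrative.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

alpha = 5.0
ridge = Ridge(alpha=alpha).fit(X, y)
rss = np.sum((y - ridge.predict(X)) ** 2)
penalty = alpha * np.sum(ridge.coef_ ** 2)
print(rss + penalty)   # the quantity ridge regression minimizes over the coefficients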
PCR
Construct the first M principal components Z_1 -> Z_M and then use these components as predictors in a linear model fit with least squares.
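A hedged PCR sketch as a scikit-learn pipeline: standardize, take the first M principal components, then fit least squares on those components; M = 3 and the data are illustrative.

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=15, noise=5.0, random_state=0)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.predict(X[:5]))   # predictions from the component-based linear model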
AIC
Defined for a large class of models fit by maximum likelihood. With Gaussian errors, maximum likelihood and least squares are the same. Adds a penalty for including additional features, similar to Cp.
Features and Ridge Regression
Each column of the feature matrix X should be centered to have a mean of zero before ridge regression is performed.
One Standard Error Rule
First calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
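A hedged numpy sketch of the rule; cv_mse stands in for cross-validated test MSE estimates (rows = folds, columns = model sizes) and is simulated here purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
cv_mse = rng.normal(loc=[5.0, 3.0, 2.5, 2.4, 2.45], scale=0.3, size=(10, 5))

mean_mse = cv_mse.mean(axis=0)                                   # estimated test MSE per size
se_mse = cv_mse.std(axis=0, ddof=1) / np.sqrt(cv_mse.shape[0])   # its standard error

best = np.argmin(mean_mse)
threshold = mean_mse[best] + se_mse[best]
chosen_size = np.argmax(mean_mse <= threshold) + 1               # smallest size within one SE
print(chosen_size)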
Subset Selection
Identifying a subset of the p predictors that we believe to be related to the response variable, then fitting a model using least squares on this reduced set of variables.
Prediction Accuracy of OLS with Small N
If n is not much larger than p, there can be a lot of variability in the least squares fit, resulting in overfitting and poor predictions. If p > n, there is no longer a unique least squares coefficient estimate and the variance is infinite, so the method cannot be used at all.
Shrinkage / Regularization
Involves fitting a model containing all p predictors; however, the estimated coefficients are shrunken towards 0 relative to the least squares estimates.
Adjusted R2
Adjusted R2 = 1 - [RSS/(n - d - 1)] / [TSS/(n - 1)], where d is the number of predictors. Unlike R2, it penalizes the inclusion of unnecessary variables; a large value indicates a model with a small test error.
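A minimal sketch of the adjusted R2 formula on an illustrative least squares fit.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
n, d = X.shape

fit = LinearRegression().fit(X, y)
rss = np.sum((y - fit.predict(X)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
print(adj_r2)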
Lasso and Feature Selection
The lasso performs feature selection, since it can set coefficients exactly to zero.
Lasso v Ridge
The lasso tends to perform better when a small number of predictors have substantial coefficients and the remaining coefficients are very small or zero. Ridge tends to perform better when the response is a function of many predictors, all with coefficients of roughly equal size.
L2 Regularization/norm
The l2 norm ||beta||_2 = sqrt(sum_j beta_j^2) measures the distance of beta from zero. As the regularization parameter increases, the l2 norm of the ridge coefficient estimates always decreases.
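A hedged demonstration of that shrinkage, assuming scikit-learn's Ridge; the alpha grid and data are arbitrary.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

for alpha in [0.01, 1, 100, 10000]:
    beta = Ridge(alpha=alpha).fit(X, y).coef_
    # The l2 norm of the coefficient vector shrinks as alpha grows.
    print(alpha, np.linalg.norm(beta))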
OLS Betas and Scale
OLS coefficient estimates are scale equivariant: multiplying X_j by a constant c simply scales the estimated coefficient by 1/c, so the fitted values are unchanged.
Dimension Reduction
Projecting the p predictors into an M-dimensional space where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. These M projections are then used as predictors to fit a linear regression model by least squares.
Prediction Accuracy OLS with Large N
Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. If n >> p, that is, if n, the number of observations, is much larger than p, the number of variables, the OLS estimates also tend to have low variance and will perform well on test observations.
RSS and Features
RSS always decreases as the number of variables in the model increases.
Model Interpretability
By removing irrelevant variables, or setting their coefficients to 0, we obtain a model that is more easily interpreted.
Bayesian Viewpoint for Ridge and Lasso
Ridge regression and the lasso follow directly from a Bayesian viewpoint: assume the usual linear model with normal errors and place a prior on the coefficients. A Gaussian prior yields the ridge solution as the posterior mode (and mean), while a double-exponential (Laplace) prior yields the lasso solution as the posterior mode.
Ridge Regression and Scale
Ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant, so it is best to standardize the features so that they are all on the same scale.
Disadvantage of Ridge Regression
Ridge regression includes all p features in the final model: the penalty shrinks the coefficients towards 0 but never sets them exactly to 0 (unless the tuning parameter is infinite), which hurts interpretability when p is large.
Why Ridge Improves OLS
Ridge regression's advantage over OLS is rooted in the bias-variance trade-off: as the regularization parameter increases, the flexibility of the fit decreases, leading to decreased variance but increased bias. Ridge regression can perform well when n is close to p, where OLS has high variance.
Test MSE
Test MSE = (1/n) * sum_i (y_i - y_hat_i)^2, averaged over the test observations.
BIC
Derived from a Bayesian point of view, similar to AIC and Cp; it tends to take on a small value for models with low test error, but it replaces the 2 d sigma_hat^2 penalty with log(n) d sigma_hat^2 and so places a heavier penalty on models with many features.
How does PCR determine the linear combinations of X1 -> Xp
The PCR approach involves identifying the linear combinations, or directions, that best represent the predictors X1 -> Xp in an unsupervised way, since Y is not used to help determine the principal components.
First Principal Component Direction
The direction along which the observations vary the most.
Curse of Dimensionality
The tendency of the test error to increase as the number of features grows, unless the additional features are truly associated with the response.
Key Idea of PCR
The key idea is that often a small number of principal components suffice to explain most of the variability in the data as well as the relationship with the response. We assume that the directions in which X1 -> Xp show the most variation are the directions that are associated with Y.
Training v Test Error
Training error tends to underestimate the test error. This is because the training RSS decreases and R2 increases as more variables are added to the model, regardless of whether those variables improve the test error.
l1 Norm
The l1 norm ||beta||_1 = sum_j |beta_j| uses absolute values instead of squared terms to measure the distance of the coefficient vector from 0.
Advantages of using Validation > Estimating the test MSE
Using the validation or cross-validation approach, we can directly estimate the test error, and the approach makes fewer assumptions about the true underlying model.
Hybrid Approaches
Variables are added to the model sequentially; however, after adding each new variable, the method may also remove any variables that no longer provide an improvement to the model fit.
When to use Ridge over OLS
When the OLS estimates have high variance.
Lasso and OLS
When the OLS estimates have high variance, the lasso can reduce the variance at the cost of a small increase in bias.
When does PCR perform well
When the first few principal components are sufficient to capture most of the variation in the predictors and the relationship with the response.
Sparse Models
Models that involve only a subset of the variables.
The Lasso
A regularization method that shrinks coefficients towards 0 and can even set some of them exactly to 0. We seek the set of coefficient estimates that leads to the smallest RSS, subject to the constraint that the l1 norm of the beta coefficients does not exceed a budget.
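A minimal lasso sketch, assuming scikit-learn's Lasso (which minimizes RSS/(2n) + alpha * ||beta||_1); the data and alpha are illustrative.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)                                       # several entries are exactly 0
print(np.sum(lasso.coef_ != 0), "nonzero coefficients")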