CS5565 Chapter 6: Linear Model Selection and Regularization

How many models do backward and forward stepwise selection search through?

1 + p(p+1)/2 (the null model, plus p + (p-1) + ... + 1 candidate fits).
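
As a quick sanity check (a minimal Python sketch; the values of p are arbitrary), compare this count with the 2^p models of best subset selection:

# Models fit by stepwise selection (the null model, then p, p-1, ..., 1
# candidates per step) versus best subset selection (all 2^p subsets).
for p in (10, 20, 40):
    stepwise = 1 + p * (p + 1) // 2
    best_subset = 2 ** p
    print(f"p={p:>2}: stepwise {stepwise:>4}, best subset {best_subset:>15,}")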

What is AIC, mathematically?

AIC = (1/(n*sigma^2)) * (RSS + 2*d*sigma^2), where sigma^2 is an estimate of the error variance; for least squares models, AIC is proportional to Cp.

Besides computational limitations, what's another problem with best subset selection?

The large search space (2^p models) can lead to overfitting and high variance of the coefficient estimates.

What is the key idea behind PCR?

A small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response. In other words, we assume the directions in which X1,...,Xp show the most variation are the directions that are associated with Y.

What is Adjusted R^2, mathematically?

Adjusted R^2 = 1 - (RSS/(n-d-1)) / (TSS/(n-1)); a large adjusted R^2 value indicates a model with small test error.

What is ridge regression's advantage over least squares?

As lambda increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.

What is BIC, mathematically?

BIC = (1/n) * (RSS + log(n)*d*sigma^2); we choose the model with the smallest BIC.

What is backward stepwise selection?

Begins with all p predictors, and then iteratively removes the least useful predictor.

How does forward stepwise selection work?

Begins with no predictors, and then adds predictors to the model one at a time until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
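
A minimal Python sketch of this greedy search, using scikit-learn's LinearRegression with training RSS as the measure of improvement (the toy data and the function name forward_stepwise are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Return the order in which predictors enter the model, adding at
    each step the variable that most reduces the training RSS."""
    p = X.shape[1]
    remaining, selected = list(range(p)), []
    for _ in range(p):
        rss = {}
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss[j] = np.sum((y - fit.predict(X[:, cols])) ** 2)
        best = min(rss, key=rss.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data in which only the first two predictors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)
print(forward_stepwise(X, y))  # predictors 0 and 1 should enter first

The p + 1 nested models this search produces would then be compared using Cp, AIC, BIC, adjusted R^2, or cross-validation.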

How is PLS similar to PCR?

Both first identify a new set of features Z1,...,ZM that are linear combinations of the original features, and then fit a linear model via least squares using these M new features.

In what field is PLS popular?

Chemometrics, where many variables arise from digitized spectrometry signals.

How would you select the tuning parameter for ridge regression or lasso?

Choose a grid of lambda values, and compute the cross-validation error for each value of lambda. Then select the tuning parameter value for which the cross-validation error is the smallest.
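
A sketch of this procedure with scikit-learn, where lambda is exposed as alpha; the grid endpoints and the toy data are arbitrary illustrative choices:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# Grid of lambda values spanning several orders of magnitude on a log scale.
grid = {"alpha": np.logspace(-4, 4, 50)}
search = GridSearchCV(Ridge(), grid, scoring="neg_mean_squared_error", cv=10)
search.fit(X, y)
print("lambda with smallest CV error:", search.best_params_["alpha"])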

What does PCR do?

It constructs the first M principal components and then uses these components as the predictors in a linear regression model that is fit using least squares.
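
A minimal PCR sketch as a scikit-learn pipeline (standardize, keep the first M components, then least squares); M = 3 and the synthetic data are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

M = 3  # number of components; in practice chosen by cross-validation
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
mse = -cross_val_score(pcr, X, y, scoring="neg_mean_squared_error", cv=5).mean()
print(f"M={M}: CV MSE = {mse:.1f}")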

When looking for the best model what are we looking for in Cp?

The Cp statistic tends to take on a small value for models with a low test error, so when determining which of a set of models is best, we choose the model with the lowest Cp.

What are the four ways of adjusting the training error for the model size available?

Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted R^2.
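
All four can be computed from a fitted model's RSS; here is a sketch using the formulas on the Cp, AIC, BIC, and adjusted R^2 cards, where sigma2 is an estimate of the error variance (typically taken from the full least squares model) and the helper name is illustrative:

import numpy as np

def selection_criteria(rss, tss, n, d, sigma2):
    """Cp, AIC, BIC, and adjusted R^2 for a least squares model with d
    predictors fit on n observations."""
    cp = (rss + 2 * d * sigma2) / n
    aic = (rss + 2 * d * sigma2) / (n * sigma2)
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2

# Illustrative values: pick the model with the smallest Cp, AIC, or BIC,
# or the largest adjusted R^2.
print(selection_criteria(rss=120.0, tss=500.0, n=100, d=5, sigma2=1.3))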

In PCR and PLS how are the number of principal components chosen?

Cross-validation.

What is the first principal component?

The direction of the data along which the observations have the most variance.

Why is PCR not a feature selection method?

Each of the M principal components used in the regression is a linear combination of all p of the original features.

What are the steps of the one-standard-error rule?

First, calculate the standard error of the estimated test MSE for each model size. Second, select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
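
A sketch of the rule in Python with made-up per-fold CV errors (all numbers are purely illustrative):

import numpy as np

# cv_mse[k] holds per-fold CV error estimates for the model with k+1
# predictors; three folds here, purely for illustration.
cv_mse = [np.array([5.1, 4.9, 5.3]),   # 1 predictor
          np.array([3.2, 3.0, 3.4]),   # 2 predictors
          np.array([3.1, 2.9, 3.5]),   # 3 predictors
          np.array([3.0, 3.1, 3.3])]   # 4 predictors

means = np.array([m.mean() for m in cv_mse])
ses = np.array([m.std(ddof=1) / np.sqrt(len(m)) for m in cv_mse])

best = means.argmin()                           # lowest point on the curve
threshold = means[best] + ses[best]             # one standard error above it
chosen = np.flatnonzero(means <= threshold)[0]  # smallest qualifying model
print("model size chosen:", chosen + 1)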

What is Cp, mathematically?

For a fitted least squares model containing d predictors, the Cp estimate of test MSE is Cp = (1/n) * (RSS + 2*d*sigma^2), where sigma^2 is an estimate of the variance of the error epsilon associated with each response measurement.

What is the rationale behind the one-standard-error rule?

If a set of models appear to be more or less equally good, then we might as well choose the simplest model -- that is, the model with the smallest number of predictors.

When will PCR perform best?

In cases when the first few principal components are sufficient to capture most of the variation in the predictors, as well as the relationship with the response.

What is a major advantage of LASSO over ridge regression?

It produces simpler and more interpretable models that involve only a subset of the predictors.

What is an advantage to using validation and cross-validation?

It provides a direct estimate of the test error and makes fewer assumptions about the true underlying model. Also, it can be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance sigma^2.

What is the amount of shrinkage in ridge regression controlled by?

Lambda, the tuning parameter that multiplies the ridge penalty.

What does LASSO stand for?

Least Absolute Shrinkage and Selection Operator

What does a low RSS or a high R^2 indicate?

Low training error

What does ridge regression do?

Minimizes the usual regression criterion plus a penalty term on the squared l2 norm of the coefficient vector. As such, it shrinks the coefficients towards zero. This introduces some bias, but can greatly reduce the variance, resulting in a better mean squared error.
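
A numpy sketch of the closed-form ridge solution, beta(lambda) = (X^T X + lambda*I)^(-1) X^T y, assuming centered predictors and no intercept; it also shows the two limiting cases of lambda:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=50)

for lam in (0.0, 10.0, 1e6):
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(f"lambda={lam:>9}: {np.round(beta, 3)}")
# lambda = 0 recovers the least squares fit; as lambda grows, the
# coefficients shrink toward zero (but never become exactly zero).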

What does a large lambda mean for ridge regression?

More shrinkage, and so we get different coefficient estimates for different values of lambda.

Why is the overall benefit of PLS relative to PCR a wash?

PLS can reduce bias, but also has the potential to increase variance.

How is PLS unlike PCR?

PLS identifies new features in a supervised way -- that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also are related to the response.

What is the supervised alternative to PCR?

Partial Least Squares (PLS)
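
A short scikit-learn sketch choosing the number of PLS directions M by cross-validation (synthetic data; the range of M is arbitrary):

from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

for M in range(1, 6):
    pls = PLSRegression(n_components=M)
    mse = -cross_val_score(pls, X, y, scoring="neg_mean_squared_error",
                           cv=5).mean()
    print(f"M={M}: CV MSE = {mse:.1f}")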

What does PCA stand for?

Principal Components Analysis

What does PCR stand for?

Principal components regression

Why should least squares not be used when p (number of features) > n (number of observations)?

Regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are zero. Put simply, least squares is too flexible in this setting and overfits the data.

How does LASSO differ from ridge regression?

Ridge regression uses the squared l2 penalty ||B||_2^2, whereas the LASSO uses the l1 penalty ||B||_1 (not squared). This seemingly minor difference causes some coefficient estimates to be exactly zero.
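
To see this effect, a small scikit-learn comparison on toy data (the penalty weights alpha are arbitrary illustrative values):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

print("ridge:", np.round(Ridge(alpha=1.0).fit(X, y).coef_, 2))  # all nonzero
print("lasso:", np.round(Lasso(alpha=0.5).fit(X, y).coef_, 2))  # mostly zero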

Is PCR more related to ridge regression or LASSO?

Ridge regression. One can even think of ridge regression as a continuous version of PCR.

What is a drawback of PCR, in terms of supervision?

The PCR approach involves identifying linear combinations, or directions, that best represent the predictors X1,...,Xp. There is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

What is another interpretation for first principal component besides variance?

The first principal component vector defines the line that is as close as possible to the data.

How is best subset selection computationally limited?

The number of possible models grows as 2^p, which is very rapid and computationally infeasible for p > 40.

What does backward stepwise selection require?

That the number of samples n is larger than the number of variables p.

How is deviance related to the fit of a model?

The smaller the deviance, the better the fit.

What is the purpose of PCA?

To reduce the number of dimensions or variables within large data sets.

In ridge regression, what happens when lambda = infinity?

We get Beta(ridge) = 0.

In ridge regression, what happens when lambda = 0?

We get the linear regression estimate.

When would LASSO outperform ridge regression?

When a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero.

When is forward stepwise selection viable?

When p is very large, even when n < p.

Where does ridge regression work best?

When the least squares estimates have high variance.

When would ridge regression outperform LASSO?

When the response is a function of many predictors, all with coefficients of roughly equal size.

When does test error decrease?

It decreases as the number of observations (samples) increases.

How does PCR mitigate overfitting?

By estimating only M << p coefficients.

What happens to lambda as the bias increases in ridge regression?

Lambda (the amount of shrinkage) increases; larger lambda means more bias.

Which least squares alternatives can be used for high-dimensional data?

Less flexible models, such as forward stepwise selection, ridge regression, the LASSO, and PCR.

Why must you be careful when selecting a model with a low RSS or high R^2?

Low RSS and high R^2 indicate low training error, but we are concerned with low TEST error. A low training error in no way indicates a low test error.

What is the tuning parameter for PCR and PLS?

The number M of directions: principal components for PCR, and partial least squares directions for PLS.

What happens to the variance as lambda (amount of shrinkage) increases for ridge regression?

Variance decreases.

When does test error increase?

It increases with the number of dimensions, with the variance of the training error, and with the coefficient magnitudes (large magnitudes correlate with flexibility and the potential for overfitting).

