ISLR ch 6
Give algorithm for forward stepwise selection.
1. Let M0 denote the null model, which contains no predictors.
2. For k = 0, ..., p − 1:
   (a) Consider all p − k models that augment the predictors in Mk with one additional predictor.
   (b) Choose the best among these p − k models, and call it Mk+1. Here best is defined as having the smallest RSS or highest R2.
3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.
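A minimal Python sketch of this algorithm, for illustration only; the synthetic X and y and the use of scikit-learn's LinearRegression are assumptions, not part of ISLR (which works in R):

```python
# Forward stepwise selection sketch: at each step, add the predictor that
# most reduces the training RSS. Assumes a NumPy feature matrix X (n x p)
# and response y; uses scikit-learn for the least squares fits.
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(model, X, y):
    resid = y - model.predict(X)
    return float(resid @ resid)

def forward_stepwise(X, y):
    n, p = X.shape
    selected = []                      # predictors in the current model M_k
    path = [list(selected)]            # M_0, M_1, ..., M_p
    remaining = set(range(p))
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:            # the p - k candidate models
            cols = selected + [j]
            model = LinearRegression().fit(X[:, cols], y)
            r = rss(model, X[:, cols], y)
            if r < best_rss:
                best_rss, best_j = r, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(list(selected))
    return path    # then choose among M_0..M_p with CV, Cp, BIC, or adjusted R^2

# Example usage with synthetic data (only X0 and X2 are signal variables):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)
print(forward_stepwise(X, y))
```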
Give the best subset selection algorithm. What is tricky about step 2 of this process?
1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, ..., p:
   (a) Fit all (p choose k) models that contain exactly k predictors.
   (b) Pick the best among these (p choose k) models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R2.
3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.
Step 2 is the tricky part: it requires fitting all 2^p possible models in total, which quickly becomes computationally infeasible as p grows. Within step 2, RSS and R2 can be used only because the models being compared all contain the same number of predictors.
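A minimal best subset sketch under the same assumptions (NumPy arrays and scikit-learn's LinearRegression); only practical for small p:

```python
# Best subset selection sketch: for each size k, fit all "p choose k" models
# and keep the one with smallest RSS (M_k). Only practical for small p,
# since 2^p models are fit in total. Assumes NumPy arrays X (n x p) and y.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    p = X.shape[1]
    best_per_size = {0: ()}                       # M_0: the null model (intercept only)
    for k in range(1, p + 1):
        best_rss, best_combo = np.inf, None
        for combo in combinations(range(p), k):   # all "p choose k" models of size k
            cols = list(combo)
            model = LinearRegression().fit(X[:, cols], y)
            resid = y - model.predict(X[:, cols])
            r = float(resid @ resid)
            if r < best_rss:
                best_rss, best_combo = r, combo
        best_per_size[k] = best_combo             # M_k
    return best_per_size   # then choose among M_0..M_p via CV, Cp, BIC, or adjusted R^2
```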
Give the algorithm for backward stepwise selection.
1. Let Mp denote the full model, which contains all p predictors.
2. For k = p, p − 1, ..., 1:
   (a) Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
   (b) Choose the best among these k models, and call it Mk−1. Here best is defined as having the smallest RSS or highest R2.
3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.
In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches. What are they?
1. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting. 2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in Chapter 5.
How many models must be considered in best subset selection?
2^p
How does the lasso perform variable selection?
The lasso penalty takes the form λ Σj |βj|. As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the l1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection.
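A small illustrative sketch, assuming scikit-learn's Lasso and Ridge on synthetic data, showing that the lasso zeroes out coefficients while ridge only shrinks them:

```python
# Lasso variable selection sketch: with a sufficiently large penalty, some
# coefficients are exactly zero. scikit-learn's `alpha` plays the role of
# the tuning parameter lambda. Synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # only 2 signal variables

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # typically ~2
print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # all 10 (shrunk, not zeroed)
```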
If you were to use best subset selection for logistic regression what changes might you make?
Although best subset selection is presented here for least squares regression, the same ideas apply to other types of models, such as logistic regression. In the case of logistic regression, instead of ordering models by RSS in Step 2 of Algorithm 6.1, we use the deviance, a measure that plays the role of RSS for a broader class of models.
What is an alternative to choosing the best model using Cp, AIC, BIC, or adjusted R2?
As an alternative to the approaches just discussed, we can directly estimate the test error using the validation set and cross-validation methods. We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest.
Which subset selection method is viable when n < p? Why?
Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
How do you choose the tuning parameter lambda?
Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values, and compute the cross-validation error for each value of λ. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
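A sketch of this procedure with scikit-learn (which calls the tuning parameter alpha); the grid and the synthetic data are arbitrary choices for illustration:

```python
# Choosing lambda by cross-validation: compute the CV error over a grid of
# values, pick the value with smallest CV error, then refit on all the data.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 20))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=150)

alphas = np.logspace(-3, 1, 100)                 # the grid of lambda values
cv_model = LassoCV(alphas=alphas, cv=10).fit(X, y)
best_alpha = cv_model.alpha_                     # value with smallest CV error
# LassoCV already refits at alpha_; the explicit refit mirrors the text above.
final_model = Lasso(alpha=best_alpha).fit(X, y)
print("selected lambda:", best_alpha)
```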
Explain how Cp works
Essentially, the Cp statistic adds a penalty of 2dσ̂2 to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. The penalty increases as the number of predictors in the model increases; this is intended to adjust for the corresponding decrease in training RSS. If σ̂2 is an unbiased estimate of σ2 in (6.2), then Cp is an unbiased estimate of test MSE. So when determining which of a set of models is best, we choose the model with the lowest Cp value.
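Written out explicitly (ISLR eq. 6.2), for a least squares model with d predictors:

```latex
% Cp statistic for a least squares model with d predictors;
% \hat{\sigma}^2 is an estimate of the variance of the error term.
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right)
```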
Illustrate with a graph how the ridge coefficients change as λ increases or decreases.
For example, the black solid line represents the ridge regression estimate for the income coefficient, as λ is varied. At the extreme left-hand side of the plot, λ is essentially zero, and so the corresponding ridge coefficient estimates are the same as the usual least squares estimates. But as λ increases, the ridge coefficient estimates shrink towards zero. When λ is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the null model that contains no predictors.
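A rough sketch of how such a coefficient-path plot could be produced with scikit-learn and matplotlib; the synthetic data here is an assumption (not the Credit data used in the book's figure):

```python
# Ridge coefficient path: fit ridge regression over a grid of lambda values
# and plot each standardized coefficient as a function of lambda.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(rng.normal(size=(100, 4)))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

alphas = np.logspace(-2, 5, 100)
coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]

plt.plot(alphas, coefs)                  # one curve per coefficient
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("standardized coefficients")  # curves shrink towards zero as lambda grows
plt.show()
```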
Describe forward selection
Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
Give an advantage and a disadvantage of forward stepwise selection.
Forward stepwise selection's computational advantage over best subset selection is clear. Another advantage: forward stepwise selection can be applied even in the high-dimensional setting where n < p; however, in this case, it is possible to construct submodels M0, ..., Mn−1 only, since each submodel is fit using least squares, which will not yield a unique solution if p ≥ n. Though forward stepwise tends to do well in practice, it is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors. For instance, suppose that in a given data set with p = 3 predictors, the best possible one-variable model contains X1, and the best possible two-variable model instead contains X2 and X3. Then forward stepwise selection will fail to select the best possible two-variable model, because M1 will contain X1, so M2 must also contain X1 together with one additional variable.
What is the role of shrinkage penalty?
However, the second term, λ Σj βj^2, called a shrinkage penalty, is small when β1, ..., βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
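The full ridge criterion (ISLR eq. 6.5):

```latex
% Ridge regression objective: RSS plus the shrinkage penalty.
\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2
  \;=\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2
```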
What method is used to select the number of components in PCR?
In PCR, the number of principal components, M, is typically chosen by cross-validation.
Describe backward step-wise selection
It begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
Explain how model interpretability can be improved for OLS
It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted.
What is principal components regression (PCR)?
Let Z1, Z2, ..., ZM represent M < p linear combinations of our original p predictors. The principal components regression (PCR) approach involves constructing the first M principal components, Z1, ..., ZM, and then using these components as the predictors in a linear regression model that is fit using least squares. The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response.
How does BIC differ from Cp?
Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Notice that BIC replaces the 2dσ̂2 used by Cp with a log(n)dσ̂2 term, where n is the number of observations. Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp.
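One common form matching the description above (scaling conventions vary between editions of the book):

```latex
% BIC for a least squares model with d predictors; compare with
% C_p = (RSS + 2 d \hat{\sigma}^2)/n. Since \log n > 2 for n > 7,
% BIC penalizes additional variables more heavily than Cp.
\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\,\hat{\sigma}^2\right)
```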
Give advantages of backward stepwise
Like forward stepwise selection, the backward selection approach searches through only 1 + p(p+1)/2 models, and so can be applied in settings where p is too large to apply best subset selection.
If you have p predictors, how many distinct principal components can you have?
p principal components
What is a drawback of PCR?
The PCR approach that we just described involves identifying linear combinations, or directions, that best represent the predictors X1, ..., Xp. These directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response does not supervise the identification of the principal components. Consequently, PCR suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
What is PCA?
Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables.
Explain how the prediction accuracy of OLS can be improved.
Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. If n ≫ p—that is, if n, the number of observations, is much larger than p, the number of variables—then the least squares estimates tend to also have low variance, and hence will perform well on test observations. However, if n is not much larger than p, then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training. And if p > n, then there is no longer a unique least squares coefficient estimate: the variance is infinite so the method cannot be used at all. By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.
Name two shrinkage methods
Ridge and lasso
What is a disadvantage of ridge regression?
Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty λ βj2 in (6.5) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large.
What does shrinkage method do?
Shrinkage methods fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. Shrinking the coefficient estimates can significantly reduce their variance.
What are signal and noise variables?
Signal variables are predictors that are related to the response, while noise variables represent the unrelated predictors.
Name three important classes of methods that are alternatives to OLS.
Subset selection, shrinkage, and dimension reduction.
How many models does forward stepwise selection fit?
The null model plus the sum from k = 0 to p − 1 of (p − k) candidate fits, for a total of 1 + p(p + 1)/2 models.
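In symbols, with a worked comparison from the book (p = 20 gives 211 models versus 2^20 = 1,048,576 for best subset selection):

```latex
% Number of models fit by forward stepwise selection: the null model plus
% the p - k candidates at step k. For p = 20: 1 + 210 = 211 models,
% versus 2^{20} = 1{,}048{,}576 for best subset selection.
1 + \sum_{k=0}^{p-1}(p-k) = 1 + \frac{p(p+1)}{2}
```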
What is AIC?
The AIC criterion is defined for a large class of models fit by maximum likelihood.
Plot MSE vs. the number of components, along with variance and bias. Explain.
The curves are plotted as a function of M, the number of principal components used as predictors in the regression model. As more principal components are used in the regression model, the bias decreases, but the variance increases. This results in a typical U-shape for the mean squared error.
What is deviance?
The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit.
What is the first principal component direction of the data? Give two definitions.
The first principal component direction of the data is that along which the observations vary the most. Equivalently, the first principal component vector defines the line that is as close as possible to the data.
Explain the intuition behind adjusted R2.
The intuition behind the adjusted R2 is that once all of the correct variables have been included in the model, adding additional noise variables will lead to only a very small decrease in RSS. Since adding noise variables leads to an increase in d, such variables will lead to an increase in RSS/(n − d − 1) and consequently a decrease in the adjusted R2. Therefore, in theory, the model with the largest adjusted R2 will have only correct variables and no noise variables. Unlike the R2 statistic, the adjusted R2 statistic pays a price for the inclusion of unnecessary variables in the model.
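The formula, for a least squares model with d variables:

```latex
% Adjusted R^2, where TSS = \sum_i (y_i - \bar{y})^2 is the total sum of squares.
\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}
```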
How is the lasso penalty different from ridge?
The only difference is that the βj^2 term in the ridge regression penalty (6.5) has been replaced by |βj| in the lasso penalty (6.7).
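Side by side:

```latex
% Ridge (6.5) and lasso (6.7) objectives differ only in the penalty:
% an l2 penalty \sum_j \beta_j^2 versus an l1 penalty \sum_j |\beta_j|.
\text{ridge: } \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2
\qquad
\text{lasso: } \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
```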
The second principal component Z2
The second principal component Z2 is a linear combination of the variables that is uncorrelated with Z1, and has largest variance subject to this constraint. It turns out that the zero correlation condition of Z1 with Z2 is equivalent to the condition that the direction must be perpendicular, or orthogonal, to the first principal component direction.
Describe Shrinkage methods
This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.
Describe subset selection
This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
Describe how dimension reduction works
This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.
What is an advantage of directly estimating the test error (via a validation set or cross-validation) over AIC, BIC, Cp, and adjusted R2?
This procedure has an advantage relative to AIC, BIC, Cp, and adjusted R2, in that it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model. It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance σ2.
Describe best subset selection
To perform best subset selection, we fit a separate least squares regression for each possible combination of the p predictors. That is, we fit all p models that contain exactly one predictor, all (p choose 2) = p(p − 1)/2 models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best.
AIC, BIC, and Cp were presented here for linear models fit using least squares; however, these quantities can also be defined for more general types of models. True or false?
True
for least squares models, Cp and AIC are proportional to each other. True or false
True
For adjusted R2, is a small value or a large value better?
Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R2 indicates a model with a small test error.
What is the one-standard-error rule in model selection?
We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. The rationale here is that if a set of models appear to be more or less equally good, then we might as well choose the simplest model—that is, the model with the smallest number of predictors.
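A small NumPy sketch of the rule, assuming a hypothetical array cv_errors with one row of fold-wise MSEs per model size, ordered from smallest to largest model:

```python
# One-standard-error rule: pick the smallest model whose estimated test MSE
# is within one standard error of the minimum estimated test MSE.
import numpy as np

def one_standard_error_rule(cv_errors):
    # cv_errors: (n_model_sizes x n_folds) array of cross-validated MSEs,
    # rows ordered from the smallest to the largest model (assumed input).
    mean_err = cv_errors.mean(axis=1)                      # estimated test MSE per size
    se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    best = np.argmin(mean_err)                             # lowest point on the curve
    threshold = mean_err[best] + se_err[best]
    # index of the smallest model within one SE of the minimum
    return int(np.flatnonzero(mean_err <= threshold)[0])
```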
Is PCA a feature selection method? Why or why not?
We note that even though PCR provides a simple way to perform regression using M < p predictors, it is not a feature selection method. This is because each of the M principal components used in the regression is a linear combination of all p of the original features.
Name approaches that can be used to select among a set of models with different numbers of variables.
We now consider four such approaches: Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted R2.
The shrinkage penalty is applied to β1, ..., βp, but not to the intercept β0. Why?
We want to shrink the estimated association of each variable with the response; however, we do not want to shrink the intercept, which is simply a measure of the mean value of the response when xi1 = xi2 = . . . = xip = 0.
Why do you standardize each predictor for PCR?
When performing PCR, we generally recommend standardizing each predictor, using (6.6), prior to generating the principal components. This standardization ensures that all variables are on the same scale. In the absence of standardization, the high-variance variables will tend to play a larger role in the principal components obtained, and the scale on which the variables are measured will ultimately have an effect on the final PCR model. However, if the variables are all measured in the same units (say, kilograms, or inches), then one might choose not to standardize them.
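A sketch of a standardized PCR fit as a scikit-learn pipeline; the synthetic data (with deliberately mixed scales) is an assumption, and the choice M = 3 is arbitrary here but would normally come from cross-validation:

```python
# PCR with standardization: scale the predictors, compute the first M
# principal components, and regress y on them by least squares.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 8)) * np.array([1, 10, 100, 1, 1, 1, 1, 1])  # mixed scales
y = X[:, 0] + 0.05 * X[:, 2] + rng.normal(size=120)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
scores = cross_val_score(pcr, X, y, cv=5, scoring="neg_mean_squared_error")
print("PCR CV MSE:", -scores.mean())
```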
Illustration of Lasso coefficients as a function of lambda.
When λ = 0, then the lasso simply gives the least squares fit, and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero. However, in between these two extremes, the ridge regression and lasso models are quite different from each other.
What are dimension reduction methods?
a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.
Why might we want to use another fitting procedure instead of least squares?
alternative fitting procedures can yield better prediction accuracy and model interpretability.
When would PCR perform poorly? And when would it perform well?
PCR will perform poorly when the data were generated in such a way that many principal components are required in order to adequately model the response. In contrast, PCR will tend to do well in cases when the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response.
Explain how the hybrid subset selection works?
Hybrid versions of forward and backward stepwise selection are available, in which variables are added to the model sequentially, in analogy to forward selection. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.
What is a disadvantage of best subset selection?
It suffers from computational limitations. The number of possible models that must be considered grows rapidly as p increases. Consequently, best subset selection becomes computationally infeasible for values of p greater than around 40, even with extremely fast modern computers. There are computational shortcuts, so-called branch-and-bound techniques, for eliminating some choices, but these have their limitations as p gets large. They also only work for least squares linear regression.
Compare lasso and ridge
The lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors. Neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. As with ridge regression, when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions.
Give a disadvantage of backward stepwise selection
like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors.
sparse models
models that involve only a subset of the variables.
Explain partial least squares. How is it different from PCR?
Partial least squares (PLS) is a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new set of features Z1, ..., ZM that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. But unlike PCR, PLS identifies these new features in a supervised way: it makes use of the response Y in order to identify new features that not only approximate the old features well, but also are related to the response.
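A sketch comparing the two with scikit-learn (PLSRegression versus a PCA-plus-least-squares pipeline); the synthetic data and the choice M = 2 are illustrative assumptions:

```python
# PCR vs. PLS on the same data: both reduce to M components, but PLS uses
# the response y when constructing them.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 6))
y = X[:, 0] - X[:, 4] + rng.normal(size=150)

pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=2))
# (PLSRegression also centers and scales internally; the outer scaler simply
# keeps the two pipelines parallel.)

for name, model in [("PCR", pcr), ("PLS", pls)]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(name, "CV MSE:", round(mse, 3))
```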
Name two dimension reduction methods.
Principal components regression and partial least squares.
What is feature selection or variable selection?
The process of excluding irrelevant variables from a multiple regression model.
Why are RSS and R2 not suitable for selecting the best model among a collection of models with different numbers of predictors?
the model containing all of the predictors will always have the smallest RSS and the largest R2, since these quantities are related to the training error. Instead, we wish to choose a model with a low test error. As is evident here, and as we show in Chapter 2, the training error can be a poor estimate of the test error.
What can you infer from a small lambda value that has been chosen as the optimal lambda?
the value is relatively small, indicating that the optimal fit only involves a small amount of shrinkage relative to the least squares solution.
What is the tuning parameter in ridge regression?
λ ≥ 0 is a tuning parameter. The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.
The first principal component formula
Z1 = φ11 × (pop − mean(pop)) + φ21 × (ad − mean(ad)), where φ11 and φ21 are the principal component loadings and pop and ad are the two predictors.