Quiz 4, Stats 1361
Difference in penalty term for BIC vs AIC/Cp
Here, the penalty term is log(n)dσ̂^2 instead of the 2dσ̂^2 used in AIC and Mallows' Cp
Constructing up to p principal components
The idea is that these should be ordered in terms of information: Z1 explains the most about Y, Z2 explains the next most about Y given the information already contained in Z1, and so on. Keep in mind that while we can create p principal components, we generally only want to use the first few (in practice, think 20-30 predictor variables reduced to 2 or 3 principal components). The number of principal components is chosen by cross-validation (see the sketch below).
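A minimal sketch of principal component regression with the number of components chosen by cross-validation; the scikit-learn pipeline, the toy data, and the grid of 1-10 components are illustrative assumptions, not code from the course.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 30 predictors, only a few of which actually matter.
X, y = make_regression(n_samples=200, n_features=30, n_informative=3,
                       noise=10.0, random_state=0)

# Standardize, project onto the first m principal components, then regress on them.
pcr = Pipeline([("scale", StandardScaler()),
                ("pca", PCA()),
                ("ols", LinearRegression())])

# Let 5-fold cross-validation pick the number of components m.
search = GridSearchCV(pcr, {"pca__n_components": list(range(1, 11))}, cv=5)
search.fit(X, y)
print("chosen number of components:", search.best_params_["pca__n_components"])
```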
Lasso Regression equation
In this case we choose β̂ to minimize RSS + λ Σ|β_j|, i.e. the usual loss function plus an L1 penalty on the magnitude of the coefficients
Forward (Stepwise) Selection
A computationally efficient alternative to best subset selection. Begin with the null model, then add predictors one by one until all are in the model; at each step, the variable that gives the greatest additional improvement to the fit is added. Continue this process to get the best model of each size, M0, M1, ..., Mp, then choose the best among these using Cp, AIC, BIC, CV, or Adjusted R^2 (see the sketch below).
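A hedged sketch of the forward stepwise procedure described above; the RSS-based scoring, the function name, and the inputs X, y are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Return the chosen predictor set of each size 0, 1, ..., p (M0, ..., Mp)."""
    p = X.shape[1]
    selected, remaining = [], list(range(p))
    best_of_each_size = [list(selected)]          # M0: the null model
    while remaining:
        # Add the single remaining predictor that most reduces the training RSS.
        rss_for = {}
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss_for[j] = np.sum((y - fit.predict(X[:, cols])) ** 2)
        best_j = min(rss_for, key=rss_for.get)
        selected.append(best_j)
        remaining.remove(best_j)
        best_of_each_size.append(list(selected))  # M1, ..., Mp
    return best_of_each_size  # then compare M0, ..., Mp via Cp/AIC/BIC/CV
```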
Because we account for __________ adjusted R^2 only increases with _________ when those additional __________make the model substantially _________
d (# of predictors); d; predictors; more accurate
Smaller sample sizes may actually be better if _________________ than that of larger datasets that may be available
their data quality is better
AIC/BIC/Cp/Adjusted R^2 give us a way to
"objectively" choose a good model while balancing accuracy and complexity
In high dimensional settings we must have _______________ ________________ And what does this mean?
"perfect" multicollinearity any/every predictor can be written as a linear combination of the others. In linear algebra terms, this means we can't have linearly independent columns
Mallows Cp equation
Cp = (1/n)(RSS + 2dσ̂^2), where d is the number of parameters
Akaike Information Criterion: AIC equation
AIC = (1/(nσ̂^2))(RSS + 2dσ̂^2)
BIC (Bayesian Information Criterion) equation
BIC = (1/n)(RSS + log(n)dσ̂^2)
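A small sketch computing the three criteria above from a model's RSS; it assumes σ̂^2 has already been estimated (e.g. from the full model), and the function name and arguments are made up for illustration.

```python
import numpy as np

def selection_criteria(rss, n, d, sigma2_hat):
    """Cp, AIC, and BIC as defined above; smaller values are preferred."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    return cp, aic, bic
```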
Non adjusted, simple R^2 formula
1 - (RSS/TSS)
Best subset selection process
1. Construct the null model M0 with no predictors. 2. For i = 1, ..., p, construct all (p choose i) possible models containing i predictors; denote the best among these by Mi (to choose the best we can use RSS or R^2, since the models are all the same size). 3. Given the best model of each size, M0, M1, ..., Mp, choose the best among these using Cp, AIC, BIC, CV, or Adjusted R^2. Basically, make all of these models and then choose the best of the best (see the sketch below).
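A hedged sketch of steps 1-2 of best subset selection; scoring each fixed-size model by RSS and the names X, y are assumptions for illustration (and the nested loops are exactly why this approach is computationally expensive).

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Return the lowest-RSS subset of each size 0, 1, ..., p (M0, ..., Mp)."""
    p = X.shape[1]
    best_of_each_size = {0: ()}                       # M0: the null model
    for i in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(range(p), i):        # all (p choose i) models
            fit = LinearRegression().fit(X[:, list(cols)], y)
            rss = np.sum((y - fit.predict(X[:, list(cols)])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best_of_each_size[i] = best_cols              # Mi
    return best_of_each_size  # then compare M0, ..., Mp via Cp/AIC/BIC/CV
```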
What can "big data" mean
1. Large n or 2. Large p or 3. Large n and large p or 4. Large p relative to n
Two types of transformation to use in dimension reduction
1. Principal Component Analysis (when used for regression, called Principal Component Regression) 2. Partial Least Squares
K fold cross validation steps
1. Split the data into k groups. 2. Construct the model k times, each time holding out one group to use as the test set. 3. In doing so, obtain k estimates of the test error. 4. Average over these k estimates to get a final estimate of the generalizability error (see the sketch below).
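A minimal sketch of the k-fold procedure for a linear model, using held-out MSE as the error estimate; KFold and LinearRegression come from scikit-learn, and the inputs are assumed numpy arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def kfold_cv_mse(X, y, k=5):
    """Average held-out MSE over k folds: an estimate of the test error."""
    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        pred = fit.predict(X[test_idx])
        fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
    return np.mean(fold_errors)
```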
Relationship of AIC and Cp for models fit via least squares
AIC is proportional to Mallows' Cp for models fit via least squares
In ridge regression we choose to minimize
RSS + λ Σ β_j^2, i.e. the usual loss function plus a penalty on the magnitude of the coefficients; the minimizer is β̂_Ridge
PCA why is the line that would result in the largest variance in the projections the best?
Basically, you're trying to simplify the data while maintaining the main underlying patterns, so you want the transformations that retain the largest amount of the variance in the data
Issue with using Cp, AIC, BIC, or Adjusted R^2 when p >n
Because these would be perfect fits: we'd estimate σ̂^2 = 0, so the penalty terms break down. Adjusted R^2 has similar issues.
Backward Stepwise Selection
Begins with the full model Mp and then iteratively removes the least useful predictor one at a time. Construct all possible models with p − 1 predictors by removing one variable at a time; keep the best and call it Mp−1. Beginning with Mp−1, construct all possible models with p − 2 predictors by removing one variable at a time; keep the best and call it Mp−2. Continue this process to get the best model of each size, M0, M1, ..., Mp, then choose the best among these using Cp, AIC, BIC, CV, or Adjusted R^2.
How to choose between models selected by Best SS, FSS, and BSS
Can use the measures we defined previously (Cp, AIC, BIC, CV, Adjusted R^2 ) to compare results from different approaches
Principal Component Analysis (PCA)
Each transformation (each new predictor Zi) is called a principal component. The first principal component is the direction in which the data varies the most.
Practical implications of "perfect multicollinearity"
Even if you can find a good "small" set of predictors, there are almost certainly many such good, small sets. Hence, while finding such a set is useful for prediction, you can't really ever claim to have found the "best" set.
What are the good properties of OLS And why would we give these up for ridge regression?
Fitting via OLS produces estimates with good properties, namely unbiasedness: the OLS estimates won't systematically under/over-estimate the "true" values of the β's. Why give this up? The bias-variance tradeoff: maybe we can give up a little bias for a big reduction in variance.
Best subset selection (Best SS) general intuiton
Given a total of p possible predictors, build every possible model using every possible subset of the p predictor variables.
High dimensional data best practice
In practice, the best option is often to find a good set of predictors that are easy to measure (in the real world) and/or that make the most sense in a particular context
Dimension reduction big idea
Instead of building a linear model of the form Y = β0 + β1X1 + ... + βpXp + error, we build a linear model of the form Y = θ0 + θ1Z1 + ... + θmZm + error, where the Z's are linear combinations of the original predictors. Generally we want m to be small.
PCA and PLS end goal
Instead of fitting a bigger linear model on a lot of predictors, transform the predictors and fit a small model on the transformations
A linear model need only be __________________ to be linear
Linear in the parameters β. We could have a predictor equal to X^2 (a polynomial term) or X1*X2 (an interaction term); the model just needs to be a linear combination of the parameters.
If we have two models M1 and M2 with similar (training) accuracy but M1 uses fewer predictor variables (less complexity), then _________ would be preferred
M1
Is more data always better?
NO
Would Best SS, FSS, and BSS produce the same models?
No - not guaranteed to and usually don't
Do AIC/BIC/Cp/Adjusted R^2 always choose the same model?
No! But they typically pick a similar one
Does FSS guarantee we find the best model
No! Suppose we have a data set with 3 predictors and the best one-predictor model uses just X1. Then the best two-predictor model under forward stepwise must include X1, so if the truly optimal two-predictor model were X2 and X3, forward stepwise would fail to identify it.
Does backwards stepwise always choose the best model
No! Same rationale as forward stepwise not always choosing the best
OLS when we have more predictors than we have observations: (p>n)
The OLS solution is not well-defined (no unique solution); OLS will have infinite variance; OLS won't work at all.
Why do we expect dimension reduction to work
Oftentimes, different predictor variables contain similar (redundant) information about the response.
high dimensional data
p is large relative to n: p ≈ n, p > n, or p >> n
As p increases and becomes larger than n
RSS → 0 and R^2 → 1
Downsides of PCA
The regression is on the Zi, not the original Xi, so depending on the model you fit, you can lose some interpretability. We're also inherently assuming that the directions of maximum variation in the data are the most informative about Y.
______________________ provide a solution when p > n where OLS does not, but those solutions will ___________ as if we had started with the "right" (smaller) set of predictors to begin with
Regularization methods; never be as good
Difference between lasso and ridge regression
Ridge regression: all predictors are still left in the model (coefficients are shrunk but not set to 0). Lasso: sets many of the coefficients exactly to 0 (i.e. removes predictor variables from the model), so it fits a sparse model that only involves a subset of the predictors.
PLS can be thought of as a _______________ to PCA
A supervised alternative: PLS uses information about Y to make the transformations. PLS is not always better, though; it can often reduce bias (it's more flexible), but comes with added variance.
Difference between PCA and PLS line
The PLS line has a smaller slope (with X1 on the x-axis and X2 on the y-axis), meaning there is less change in X2 per unit change in X1; essentially we're capturing "more" of X1, which must have been more strongly correlated with the response.
What is penalized in ridge regression
The magnitude of the coefficients
Ridge regression How can/should we choose the value of λ
The same way we talked about choosing any tuning parameter - Cross-Validation
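A small sketch of picking λ by cross-validation with scikit-learn's RidgeCV (which calls λ "alpha"); the toy data and the candidate grid are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only.
X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)
X_std = StandardScaler().fit_transform(X)   # ridge's penalty is scale-sensitive

# Try a grid of candidate lambda values and keep the one with the best CV error.
lambdas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=lambdas, cv=5).fit(X_std, y)
print("chosen lambda:", ridge.alpha_)
```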
What controls sparcity of lasso model
The tuning parameter λ. Larger λ ⇒ more shrinkage ⇒ sparser model (i.e. more coefficients set to 0; fewer predictors in the model)
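A small illustration (made-up data) of how larger λ values zero out more lasso coefficients; scikit-learn's Lasso uses "alpha" for λ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_std = StandardScaler().fit_transform(X)

# As lambda grows, more coefficients are set exactly to 0 (a sparser model).
for lam in [0.1, 1.0, 10.0, 100.0]:
    fit = Lasso(alpha=lam, max_iter=10000).fit(X_std, y)
    n_zero = int(np.sum(fit.coef_ == 0.0))
    print(f"lambda={lam:>6}: {n_zero} of {len(fit.coef_)} coefficients set to 0")
```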
What effect does ridge regression have? How would the coefficient values chosen here compare to βˆOLS
These coefficients would often be smaller. The penalty term has the effect of shrinking the magnitude of the coefficients
Partial least squares (PLS)
Very similar idea to PCA, but instead of basing the direction (projection) only on what maximizes the variance, we choose the new predictors Zi based on a projection that both summarizes the original predictors and relates to the response (see the sketch below).
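A hedged sketch of fitting a small PLS model with scikit-learn's PLSRegression; the toy data and the choice of m = 3 components are arbitrary (in practice m would be chosen by cross-validation).

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Keep m = 3 supervised components Z1, Z2, Z3, constructed using both X and y.
pls = PLSRegression(n_components=3).fit(X, y)
Z = pls.transform(X)            # the new predictors Z1, ..., Zm
print("shape of the transformed predictors:", Z.shape)
```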
disadvantage of best subset selection
We will build a ton of models so it's computationally expensive
OLS vs Dimension Reduction
When the dimension-reduction predictors are chosen well, this kind of model can do substantially better than OLS. Ideally: a better fit with fewer predictors.
When do we expect Lasso to outperform OLS
When the OLS estimates have high variance
When do we expect ridge regression with (λ>0) to do well?
When the OLS estimates have high variance
When do we expect Lasso to outperform ridge regression
When there are a lot of predictors, not all of which are relevant
When is the penalty for BIC greater than AIC
Whenever n > e^2 ≈ 7.4 (so for any n ≥ 8), log(n) > 2 and hence log(n)dσ̂^2 > 2dσ̂^2
Does best subset selection guarantee we will find the best model?
Yes because we're building every possible model
For a fixed number of predictors p, is having a larger sample size n always better
Yes, of course, but assuming the quality of the data is the same
Two predictor case form of PLS
Z1 = c1(X1 − X̄1) + c2(X2 − X̄2), where X̄1 and X̄2 are the sample means. PLS gives more weight to the variable that marginally (by itself) is more strongly correlated with Y; e.g., if X1 is more correlated with Y, the magnitude of c1 will be larger.
The first principal component
Z1 = c1(X1 − X̄1) + c2(X2 − X̄2), where X̄1 and X̄2 are the sample means
The second principal component
Z2 is the next direction in which the data varies the most, but we insist that Z2 not be correlated with Z1. This means Z2 must be orthogonal to Z1 (see the sketch below).
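A small numeric check (toy data assumed) that the first principal component's weights come out of scikit-learn's PCA as described above, and that Z1 and Z2 are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two correlated toy predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=200)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2).fit(X)     # PCA centers the data internally
Z = pca.transform(X)                 # columns are Z1 and Z2
print("first PC weights (c1, c2):", pca.components_[0])
print("correlation of Z1 and Z2:", np.corrcoef(Z[:, 0], Z[:, 1])[0, 1])  # ~0
```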
BIC assigns a _______ penalty than Cp and AIC, hence BIC prefers models with _______ parameters
a harsher penalty; BIC will "prefer" models with fewer parameters
k-fold Cross-validation gives us __________ way of estimating generalizability error of a particular model
a robust (way); much better than using only the training error or even a single train/test split
Ultimate goal for AIC/BIC/Cp
Adjust the training error in order to account for model complexity. Each measure can be seen as an estimate of the test error; smaller values are preferred.
Think of Mallows Cp as
an adjustment to the training error that is meant to estimate the test error
Best SS, FSS, and BSS provide a means for finding "________" which we can evaluate with __________
best candidate models; AIC, BIC, Cp, Adjusted R^2
Because for PCA the Zi still depend on the original predictors, we consider this kind of process _________ instead of ____________
dimension reduction; variable selection
Cross validation gives estimates of test error but _____________
doesn't penalize more complex methods
RSS
RSS = Σ_i (y_i − ŷ_i)^2 = Σ_i (y_i − β0 − β1x_i1 − ... − βpx_ip)^2. With ordinary least squares (OLS), we choose the coefficient estimates β̂_OLS as the values of β that minimize the RSS (see the sketch below).
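A minimal sketch (toy data assumed) of computing β̂_OLS by minimizing the RSS with numpy's least-squares solver.

```python
import numpy as np

# Toy data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# Add an intercept column, then solve for the beta that minimizes the RSS.
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, rss, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("beta_hat:", beta_hat)
print("RSS:", float(rss[0]))
```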
In practice, we're often interested in finding a model with a good trade-off between these three ___________
fit, complexity, and interpretability
In the case of linear models, models with more terms are ___________ and will have _________ RSS
more flexible; lower
K fold cross validation is to estimate the
generalizability error of a particular model
In shrinkage and regularization we use penalty terms in _________
in fitting the models rather than to evaluate models already fit
First principal component
The direction in which the data varies the most. This means that if we were to project our data onto a line, we would want the line that results in the largest variance of the projections.
Ridge regression For large values of λ, the penalty term becomes ___________
larger ⇒ the coefficient estimates shrink by a large amount (i.e. β̂_Ridge → 0)
We think of RSS as a _____________ and to fit the model, we choose the coefficient estimates βˆ as the values of β that_____________ We call this approach
loss function; minimize this loss function; OLS (ordinary least squares)
AIC is technically only defined for models fit via __________
maximum likelihood
When we talked about model selection with cross validation, we were only interested in _____
model fit; we ignored practical concerns such as (i) model complexity and (ii) understanding what kinds of scientific questions we might want/need to be able to answer
In high dimensional settings _______________ is always the case
multicollinearity: the idea that some combination of variables (almost entirely) explains some other variable
For p > n we can think of this as _____________________
no bias (a perfect fit) but infinite variance. The same holds for larger values of n and p, as long as p > n.
In dimension reduction we're no longer using the ________
the original predictors directly. Instead, we're using a new, smaller set of transformed predictors.
Mallows Cp adds a ______ to ______ that _______________________
a penalty; RSS; that increases as the number of parameters d increases, to account for the fact that RSS decreases as d increases
an L2 penalty
A penalty term on the magnitude of the coefficients, λ Σ β_j^2. The "2" in L2 refers to the fact that we're squaring the coefficient values in the penalty term.
The penalty term has the effect of shrinking the magnitude of the coefficients this is called
regularization. In the specific context of linear regression, this typically means adding some kind of penalty to the least squares approach.
Lasso and Ridge regression both _________________
shrink the magnitude of the coefficients but in different ways
High dimensional data (p > n): In general, as _______ grows, so does __________. This is called the curse of ________________
the dimension (p); the test error; dimensionality
Ridge regression when λ = 0 what happens to the penalty
the penalty goes away, so that β̂_Ridge = β̂_OLS
What happens with ridge regression if we change the scale of a predictor
Unlike OLS, the coefficient does not simply rescale; the resulting change in β̂_Ridge for that predictor depends on c, λ, and the scaling of the other predictors (which is why predictors are typically standardized before fitting).
For Mallows' Cp, if σ̂^2 is an unbiased estimate of σ^2,
then Mallows' Cp is an unbiased estimate of the test error
Lambda in ridge regression is the
tuning parameter (analogous to the k in k-nearest neighbors) that controls how much weight we give the penalty
When we do lasso and ridge our β̂'s are no longer _____________ but the hope is that what we __________________________
unbiased; the bias we add in will be more than made up for by a drop in variance
the training error will _____________ the generalizability (test) error
underestimate
Ridge regression, λ and bias variance tradeoff
When λ = 0, we have little (zero) bias but possibly high variance. For large λ, we get a more rigid (less flexible) model with β̂_Ridge close to 0 ⇒ higher bias, lower variance. Remember: the larger λ is, the more the coefficient estimates shrink (β̂_Ridge → 0).
k-fold Cross-validation is very useful in determining________
which models are outperforming others. It can be used to compare across different kinds of models (like LDA vs QDA), or to choose the "best" from a class of models, like choosing the value of a tuning parameter such as the k in kNN.