Quiz 4, Stats 1361

Difference in penalty term for BIC vs AIC/Cp

Here, the penalty term is log(n)·d·σ̂^2 instead of the 2dσ̂^2 used in Mallow's Cp and AIC

Constructing up to p principal components

The idea is that these should be ordered in terms of information: Z1 captures the most variation in the predictors (and, we hope, the most information about Y), Z2 the next most given the information already contained in Z1, and so on. Keep in mind that while we can create p principal components, we generally only want to use the first few (in practice, think 20-30 predictor variables reduced to 2 or 3 principal components). The number of principal components is chosen by cross-validation; see the sketch below.
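
A minimal sketch of this in scikit-learn: standardize, PCA-transform, then regress on the components, with cross-validation picking how many components to keep. The synthetic data, the pipeline names, and the 1-10 component grid are illustrative assumptions, not part of the course material.

    # Principal component regression: scale, PCA, then OLS on the components;
    # cross-validation chooses how many components to keep.
    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=200, n_features=25, noise=10.0, random_state=0)

    pcr = Pipeline([("scale", StandardScaler()),   # PCA is sensitive to predictor scale
                    ("pca", PCA()),
                    ("ols", LinearRegression())])

    search = GridSearchCV(pcr, {"pca__n_components": range(1, 11)},
                          scoring="neg_mean_squared_error", cv=5)
    search.fit(X, y)
    print("components chosen by CV:", search.best_params_["pca__n_components"])

Standardizing first is a common choice here, since the PCA directions otherwise depend on the units of the predictors.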

Lasso Regression equation

We choose β̂_Lasso to minimize RSS + λ Σ |βj|; in this case we are also minimizing over β, but the penalty is on the absolute values of the coefficients (an L1 penalty) rather than their squares.

Forward (Stepwise) Selection

A computationally efficient alternative to best subset selection. It begins with the null model, then adds predictors one at a time until all are in the model; at each step, the variable that gives the greatest additional improvement to the fit is added. Continue this process to get the best model of each size, M0, M1, ..., Mp, then choose the best among these using Cp, AIC, BIC, CV, or Adjusted R^2 (a sketch follows below).
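
A rough sketch of the greedy search, assuming numeric arrays X (n x p) and y; the function name and the use of RSS as the within-size criterion are illustrative choices.

    # Greedy forward selection: at each step, add the predictor that most
    # reduces the RSS of an OLS fit; returns the best model of each size.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_selection_path(X, y):
        n, p = X.shape
        remaining, selected, path = list(range(p)), [], []
        for _ in range(p):
            best_rss, best_j = np.inf, None
            for j in remaining:
                cols = selected + [j]
                fit = LinearRegression().fit(X[:, cols], y)
                rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
                if rss < best_rss:
                    best_rss, best_j = rss, j
            selected.append(best_j)
            remaining.remove(best_j)
            path.append(list(selected))          # M1, M2, ..., Mp
        return path   # then compare sizes with Cp / AIC / BIC / CV / adjusted R^2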

Because we account for __________, adjusted R^2 only increases with _________ when those additional __________ make the model substantially _________

d (# of predictors); d; predictors; more accurate

Smaller sample sizes may actually be better if _________________ than that of larger datasets that may be available

if the data quality is better

AIC/BIC/Cp/Adjusted R^2 give us a way to

"objectively" choose a good model while balancing accuracy and complexity

In high dimensional settings we must have _______________ ________________ And what does this mean?

"perfect" multicollinearity any/every predictor can be written as a linear combination of the others. In linear algebra terms, this means we can't have linearly independent columns

Mallows Cp equation

Cp = (1/n)(RSS + 2dσ̂^2), where d is the number of parameters

Akaike Information Criterion: AIC equation

AIC = (1/(nσ̂^2))(RSS + 2dσ̂^2)

BIC (Bayesian Information Criterion) equation

BIC = (1/n)(RSS + log(n)·d·σ̂^2)
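
A small helper that computes the three formulas above exactly as written; rss, n, d, and sigma2_hat (an estimate of the error variance, e.g. from the full model) are inputs you supply.

    import numpy as np

    def cp_aic_bic(rss, n, d, sigma2_hat):
        """Training-error adjustments from the cards above; smaller is better."""
        cp  = (rss + 2 * d * sigma2_hat) / n
        aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
        bic = (rss + np.log(n) * d * sigma2_hat) / n
        return cp, aic, bic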

Non adjusted, simple R^2 formula

1 - (RSS/TSS)

Best subset selection process

1. Construct the null model M0 with no predictors. 2. For i = 1, ..., p, construct all (p choose i) possible models containing i predictors and denote the best among these by Mi; to choose the best you can use RSS or R^2, since the models are the same size. 3. Given the best model of each size, M0, M1, ..., Mp, choose the best among these using Cp, AIC, BIC, CV, or Adjusted R^2. Basically, build all of these models and then choose the best of the best (a sketch follows below).
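
A brute-force sketch of the procedure, assuming numeric X and y; it fits 2^p models, so it is only feasible for small p.

    # Best subset selection: fit every subset, keep the lowest-RSS model of each size.
    from itertools import combinations
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def best_subset_by_size(X, y):
        n, p = X.shape
        best = {}                                  # size i -> (RSS, columns of M_i)
        for i in range(1, p + 1):
            for cols in combinations(range(p), i):
                cols = list(cols)
                fit = LinearRegression().fit(X[:, cols], y)
                rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
                if i not in best or rss < best[i][0]:
                    best[i] = (rss, cols)
        return best   # then compare sizes with Cp / AIC / BIC / CV / adjusted R^2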

What can "big data" mean

1. Large n or 2. Large p or 3. Large n and large p or 4. Large p relative to n

Two types of transformation to use in dimension reduction

1. Principal Component Analysis (sometimes called Principal Component Regression) 2. Partial Least Squares

K fold cross validation steps

1. Split data into k groups 2. Construct the model k times, each time holding out one group to use as the test set 3. In doing so, obtain k estimates of the test error. 4. Average over these k estimates to get a final estimate of the generalizability error
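
A minimal sketch of these four steps for a linear model, assuming numeric X and y; the choice of k = 5 and of MSE as the error measure are illustrative.

    # Manual k-fold cross-validation following the steps above.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    def kfold_mse(X, y, k=5):
        fold_errors = []
        for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            model = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on k-1 folds
            resid = y[test_idx] - model.predict(X[test_idx])             # held-out fold
            fold_errors.append(np.mean(resid ** 2))                      # one test-error estimate
        return np.mean(fold_errors)                                      # average the k estimates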

Relationship of AIC and Cp for models fit via least squares

AIC is proportional to Mallow's Cp for models fit via least squares

In ridge regression we choose to minimize

RSS + λ Σ βj^2; that is, β̂_Ridge minimizes the usual least squares loss plus an L2 penalty on the coefficient magnitudes

PCA why is the line that would result in the largest variance in the projections the best?

Basically, you're trying to simplify the data while maintaining the main underlying patterns, so you want the transformations that retain the largest amount of the variance in the data.

Issue with using Cp, AIC, BIC, or Adjusted R^2 when p >n

Because when p > n we get perfect fits, we'd estimate σ̂ = 0, which breaks Cp, AIC, and BIC. Adjusted R^2 has similar issues.

Backward Stepwise Selection

Begins with the full model and then iteratively removes the least useful predictor one at a time: construct all possible models with p − 1 predictors by removing one predictor at a time, keep the best, and call it Mp−1. Beginning with Mp−1, construct all possible models with p − 2 predictors by removing one at a time, keep the best, and call it Mp−2. Continue this process to get the best model of each size, M0, M1, ..., Mp, then choose the best among these using Cp, AIC, BIC, CV, or Adjusted R^2 (a sketch follows below).
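
For reference, scikit-learn's SequentialFeatureSelector can run the same greedy search in the backward direction; a sketch, where keeping 5 features is an arbitrary illustrative choice (in practice you'd compare sizes with CV or the criteria above).

    # Backward stepwise via scikit-learn's built-in greedy selector.
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    backward = SequentialFeatureSelector(LinearRegression(),
                                         n_features_to_select=5,   # illustrative
                                         direction="backward", cv=5)
    # backward.fit(X, y); backward.get_support() marks the retained predictors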

How to choose between models selected by Best SS, FSS, and BSS

Can use the measures we defined previously (Cp, AIC, BIC, CV, Adjusted R^2 ) to compare results from different approaches

Principal Component Analysis (PCA)

Each transformation (each new predictor Zi) is called a principal component. The first principal component is the direction in which the data varies the most.

Practical implications of "perfect multicollinearity"

Even if you can find a good "small" set of predictors, there are almost certainly many such good, small sets. Hence, while finding such a set is useful for prediction, you cannot really ever claim to have found the "best" set.

What are the good properties of OLS And why would we give these up for ridge regression?

Fitting via OLS produces estimates with good properties, namely unbiasedness: the OLS estimates won't systematically under- or over-estimate the "true" values of the β's. Why give this up? The bias-variance tradeoff! Maybe we can give up a little bias for a big reduction in variance.

Best subset selection (Best SS) general intuiton

Given a total of p possible predictors, build every possible model using every possible subset of the p predictor variables.

High dimensional data best practice

In practice, the best option is often to find a good set of predictors that are easy to measure (in the real world) and/or that make the most sense in a particular context.

Dimension reduction big idea

Instead of building a linear model of the form Y = β0 + β1X1 + ... + βpXp + error, we build a linear model of the form Y = θ0 + θ1Z1 + ... + θmZm + error, where the Z's are linear combinations of the original predictors. Generally, we want m to be small.

PCA and PLS end goal

Instead of fitting a bigger linear model on a lot of predictors, transform the predictors and fit a small model on the transformations

A linear model need only be __________________ to be linear

Linear in the parameters β. We could have X2 = X1^2 (polynomial term) or X3 = X1·X2 (interaction term); the model just needs to be a linear combination of the parameters.

If we have two models M1 and M2 with similar (training) accuracy but M1 uses fewer predictor variables (less complexity), then _________ would be preferred

M1

Is more data always better?

NO

Would Best SS, FSS, and BSS produce the same models?

No - not guaranteed to and usually don't

Do AIC/BIC/Cp/Adjusted R^2 always choose the same model?

No! But they typically pick a similar one

Does FSS guarantee we find the best model

No! Suppose we have a data set with 3 predictors, and the best one-predictor model uses just X1. Then the best two-predictor model under forward stepwise must include X1, so if the true optimal two-predictor model were X2 and X3, forward stepwise would fail to identify it.

Does backwards stepwise always choose the best model

No! Same rationale as forward stepwise not always choosing the best

OLS when we have more predictors than we have observations: (p>n)

The OLS solution is not well-defined (no unique solution); OLS will have infinite variance; OLS won't work at all.

Why do we expect dimension reduction to work

Oftentimes, different predictor variables contain similar information about the response.

high dimensional data

p is large relative to n: p ≈ n, p > n, or p ≫ n

As p increases and becomes larger than n

RSS → 0 and R^2 → 1

Downsides of PCA

Regression is on the Zi, not the original Xi, so depending on the model you fit, you can lose some interpretability. We're inherently assuming that the direction of max variation in the data is also the most informative about Y.

______________________ provide a solution when p > n where OLS does not, but those solutions will ___________ as if we had started with the "right" (smaller) set of predictors to begin with

Regularization methods; never be as good

Difference between lasso and ridge regression

Ridge regression: predictors are still left in the model. Lasso sets many of the coefficients equal to 0 (i.e., removes predictor variables from the model), so lasso fits a sparse model that only involves a subset of the predictors.
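
A quick sketch of this difference on synthetic data; the alpha value (scikit-learn's name for λ) and the data-generating settings are arbitrary illustrative choices.

    # Ridge keeps every predictor; lasso zeroes many coefficients out.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=5.0, random_state=0)
    ridge = Ridge(alpha=10.0).fit(X, y)
    lasso = Lasso(alpha=10.0).fit(X, y)
    print("ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))   # typically 0
    print("lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))   # typically many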

PLS can be thought of as a _______________ to PCA

A supervised alternative: PLS uses information about Y to make the transformation. PLS is not always better, though; it can often reduce bias (it is more flexible), but comes with added variance.

Difference between PCA and PLS line

The PLS line has a smaller slope (with X1 on the x-axis and X2 on the y-axis), meaning there is less change in X2 per unit change in X1; essentially, we're capturing "more" of X1, which must have been more strongly correlated with the response.

What is penalized in ridge regression

The magnitude of the coefficients

Ridge regression How can/should we choose the value of λ

The same way we talked about choosing any tuning parameter - Cross-Validation
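
A sketch using scikit-learn's built-in cross-validated fitters (reusing the X and y from the ridge/lasso sketch above); the candidate grid of penalty values is an arbitrary choice.

    # Choosing lambda (called alpha in scikit-learn) by cross-validation.
    import numpy as np
    from sklearn.linear_model import LassoCV, RidgeCV

    alphas = np.logspace(-3, 3, 50)                  # candidate penalty values
    ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
    lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
    print("lambda chosen for ridge:", ridge_cv.alpha_)
    print("lambda chosen for lasso:", lasso_cv.alpha_)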

What controls sparsity of lasso model

The tuning parameter λ. Larger λ ⇒ more shrinkage ⇒ sparser model (i.e., more coefficients set to 0; fewer predictors in the model).

What effect does ridge regression have? How would the coefficient values chosen here compare to βˆOLS

These coefficients would often be smaller. The penalty term has the effect of shrinking the magnitude of the coefficients

Partial least squares (PLS)

Very similar idea to PCA, but instead of the direction (projection) based only on what maximizes the variance, we choose the new predictors Zi based on a projection that both summarizes the original predictors and relates to the response
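
A sketch using scikit-learn's PLSRegression, with the number of components chosen by cross-validation; the 1-10 grid and the data X, y are placeholders.

    # Partial least squares; cross-validation picks the number of components m.
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import GridSearchCV

    pls_search = GridSearchCV(PLSRegression(), {"n_components": range(1, 11)},
                              scoring="neg_mean_squared_error", cv=5)
    # pls_search.fit(X, y); pls_search.best_params_["n_components"] is the chosen m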

disadvantage of best subset selection

We will build a ton of models so it's computationally expensive

OLS vs Dimension Reduction

When the dimension reduction predictors are chosen well, this kind of model can do substantially better than OLS; ideally, a better fit with fewer predictors.

When do we expect Lasso to outperform OLS

When the OLS estimates have high variance

When do we expect ridge regression with (λ>0) to do well?

When the OLS estimates have high variance

When do we expect Lasso to outperform ridge regression

When there are a lot of predictors, not all of which are relevant

When is the penalty for BIC greater than AIC

Whenever n > 7, log(n) > 2, so log(n)·d·σ̂^2 > 2dσ̂^2.

Does best subset selection guarantee we will find the best model?

Yes because we're building every possible model

For a fixed number of predictors p, is having a larger sample size n always better

Yes, of course, assuming the quality of the data is the same.

Two predictor case form of PLS

Z1 = c1(X1 − X̄1) + c2(X2 − X̄2), where X̄1 and X̄2 are the sample means. PLS gives more weight to the variable that marginally (by itself) is more strongly correlated with Y; e.g., if X1 is more correlated with Y, the magnitude of c1 will be larger.

The first principal component

Z1 = c1(X1 − X̄1) + c2(X2 − X̄2), where X̄1 and X̄2 are the sample means.

The second principal component

Z2 is the next direction in which the data varies the most, but we insist that Z2 not be correlated with Z1. This means that Z2 must be orthogonal to Z1.

BIC assigns a _______ penalty than Cp and AIC, hence BIC prefers models with _______ parameters

A harsher penalty; fewer parameters (BIC will "prefer" models with fewer parameters).

k-fold Cross-validation gives us __________ way of estimating generalizability error of a particular model

A robust way; much better than using only training error or even a single train/test split.

Ultimate goal for AIC/BIC/Cp

Adjust the training error in order to account for model complexity. Each measure can be seen as an estimate of the test error; smaller values are preferred.

Think of Mallows Cp as

an adjustment to the training error that is meant to estimate the test error

Best SS, FSS, and BSS provide a means for finding "________" which we can evaluate with __________

"Best" candidate models; AIC, BIC, Cp, Adjusted R^2.

Because for PCA the Zi still depend on the original predictors, we consider this kind of process _________ instead of ____________

Dimension reduction; variable selection.

Cross validation gives estimates of test error but _____________

doesn't penalize more complex methods

RSS

RSS = Σ (yi − ŷi)^2, the sum of squared residuals. With ordinary least squares (OLS), we choose the coefficient estimates β̂_OLS as the values of β that minimize the RSS.

In practice, we're often interested in finding a model with a good trade-off between these three: ___________

fit, complexity, and interpretability

In the case of linear models, models with more terms are ___________ and will have _________ RSS

more flexible; lower

K-fold cross-validation is used to estimate the

generalizability error of a particular model

In shrinkage and regularization we use penalty terms in _________

in fitting the models rather than to evaluate models already fit

First principal component

is the direction in which the data varies the most. This means that if we were to project our data onto a line, we would want the line that would result in the largest variance in the projections.

Ridge regression For large values of λ, the penalty term becomes ___________

larger ⇒ the coefficient estimates shrink by a large amount (i.e., β̂_Ridge → 0)

We think of RSS as a _____________ and to fit the model, we choose the coefficient estimates β̂ as the values of β that _____________. We call this approach

A loss function; minimize this loss function; OLS (ordinary least squares).

AIC is technically only defined for models fit via __________

maximum likelihood

When we talked about model selection with cross validation, we were only interested in _____

Model fit; we ignored practical concerns such as (i) model complexity and (ii) understanding what kinds of scientific questions we might want/need to be able to answer.

In high dimensional settings _______________ is always the case

Multicollinearity: the idea that some combination of variables (almost entirely) explains some other variable.

For p > n we can think of this as _____________________

No bias (perfect fit) but infinite variance. The same holds for larger values of n and p.

In dimension reduction we're no longer using the ________

The original predictors directly. Instead, we're using a new, smaller set of transformed predictors.

Mallows Cp adds a ______ to ______ that _______________________

A penalty; RSS; increases as the number of parameters d increases, to account for the fact that RSS decreases as d increases.

an L2 penalty

A penalty term on the magnitude of the coefficients. The "2" in L2 refers to the fact that we're squaring the coefficient values in the penalty term.

The penalty term has the effect of shrinking the magnitude of the coefficients; this is called

Regularization. In the specific context of linear regression, this typically means adding some kind of penalty to the least squares approach.

Lasso and Ridge regression both _________________

shrink the magnitude of the coefficients but in different ways

High dimensional data (p > n): in general, as _______ grows, so does __________. This is called the curse of ________________

The dimension (p); test error; dimensionality.

Ridge regression when λ = 0 what happens to the penalty

The penalty goes away, so that β̂_Ridge = β̂_OLS.

What happens with ridge regression if we change the scale of a predictor

The resulting change in β̂_Ridge depends on c, λ, and the scaling of the other predictors (see the sketch below).
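
A small sketch of the contrast with OLS (where rescaling a predictor by c just rescales its coefficient by 1/c); the synthetic data and the factor of 1000 are illustrative.

    # Rescale one predictor: OLS coefficients simply rescale, ridge coefficients don't.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Ridge

    X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
    Xc = X.copy()
    Xc[:, 0] *= 1000.0                        # change the units of the first predictor

    ols, ols_c = LinearRegression().fit(X, y), LinearRegression().fit(Xc, y)
    rid, rid_c = Ridge(alpha=1.0).fit(X, y), Ridge(alpha=1.0).fit(Xc, y)

    print(ols.coef_[0], ols_c.coef_[0] * 1000)   # essentially identical
    print(rid.coef_[0], rid_c.coef_[0] * 1000)   # typically differ

This scale dependence is one reason predictors are usually standardized before fitting ridge or lasso.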

For Mallow's Cp: if σ̂^2 is unbiased for σ^2,

then Mallow's Cp is an unbiased estimate of the test error

Lambda in ridge regression is the

Tuning parameter (analogous to the k in k-nearest neighbors) that controls how much weight we give the penalty.

When we do lasso and ridge, our β̂'s are no longer _____________ but the hope is that what we __________________________

Unbiased; add in as bias, we will more than make up for with a drop in variance.

the training error will _____________ the generalizability (test) error

underestimate

Ridge regression, λ and bias variance tradeoff

When λ = 0, we have little (zero) bias, but possibly high variance. For large λ, we get a more rigid (less flexible) model, with β̂_Ridge close to 0 ⇒ higher bias, lower variance. This is because, remember, for higher λ the coefficient estimates shrink by a larger amount (i.e., β̂_Ridge → 0).

k-fold Cross-validation is very useful in determining________

Which models are outperforming others. It can be used to compare across different kinds of models (like LDA vs. QDA), or to choose the "best" from a class of models (like choosing the value of a tuning parameter such as the k in kNN); see the sketch below.
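
A sketch of the tuning-parameter case, assuming a regression problem with data X, y; the 1-20 grid for k is an illustrative choice.

    # Using k-fold CV to pick the k in k-nearest neighbors.
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsRegressor

    knn_search = GridSearchCV(KNeighborsRegressor(),
                              {"n_neighbors": range(1, 21)},
                              scoring="neg_mean_squared_error", cv=5)
    # knn_search.fit(X, y); knn_search.best_params_["n_neighbors"] is the chosen k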

