ML: Model Selection and Regularization

Methods for Feature Subset Selection

- Exhaustive (best subset)
- Forward stepwise
- Backward stepwise
- Criteria for comparing subsets (e.g., AIC, BIC, adjusted R²)

What to do if p > n

- Shrinkage
- Feature subset selection

Lasso Regression Disadvantages

The lasso can be sensitive to correlated predictors: it may select one predictor while ignoring others that are highly correlated with it.

Ridge Regression Bias and Variance

- Reduces flexibility -> higher bias (risk of underfitting), lower variance
- λ (lambda) controls the flexibility

What happens when you take a more flexible model?

- Better performance on the training data
- Worse performance on the test data
- Better performance in the limit of many training objects

Disadvantages of Best Subset Selection

- Computationally intensive
- When p is large, the selected model may not generalize well

Optimal Model Selection

- Estimate the test error indirectly with AIC, BIC, Mallow's Cp, or adjusted R²
- Estimate the test error directly with a test set or cross-validation (CV); see the sketch below
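A minimal sketch of the direct approach, estimating test MSE with 5-fold cross-validation; the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=100)

# 5-fold CV; negated MSE is scikit-learn's convention for error scores
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("estimated test MSE:", -scores.mean())
```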

Methods for Shrinkage

- Ridge regression
- The lasso

Ridge Regression

- As the tuning parameter λ → ∞, the coefficients approach zero; when λ = 0, ridge reduces to ordinary least squares. Select λ through cross-validation (see the sketch below).
- The ridge coefficient estimates \hat{\beta}^R_1, \hat{\beta}^R_2, \ldots, \hat{\beta}^R_p minimize:

  \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

- Ridge regression coefficients are not scale equivariant: the penalty weights all coefficients equally, so the fit is heavily affected by the magnitudes of the predictors, and predictors should be standardized before fitting.
- Because the penalty shrinks coefficients toward zero but never exactly to zero, ridge involves all of the predictors in the final model.
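A minimal scikit-learn sketch of the points above, on illustrative synthetic data: it standardizes the predictors first (ridge is not scale equivariant) and chooses λ, called `alpha` in scikit-learn, by cross-validation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # n = 100 observations, p = 10 predictors
y = 3.0 * X[:, 0] + rng.normal(size=100)

# Standardize first: the ridge penalty would otherwise weight
# predictors unevenly according to their scales.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 13)),  # lambda grid searched by CV
)
model.fit(X, y)
print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
print("coefficients:", model.named_steps["ridgecv"].coef_)
```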

Adjusted R2

A measure of goodness of fit of a regression that is adjusted for degrees of freedom, and hence does not automatically increase when another independent variable is added to the regression. Choose the model with the highest adjusted R². It is always lower than the standard R².
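For reference, the standard definition for n observations and p predictors:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```

Since (n − 1)/(n − p − 1) ≥ 1 for p ≥ 1, the adjusted value never exceeds the standard R².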

Shrinkage

Adjust coefficients so that some features are used to a lesser extent or not at all. Regularization -> reduced model flexibility -> less overfitting.

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors. Explain your answers: Which of the three models with k predictors has the smallest test RSS?

Any of the three methods may happen to find the model with the lowest test RSS, but best subset is the most likely to do so, since it searches all candidate models.

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors. Explain your answers: Which of the three models with k predictors has the smallest training RSS?

Best subset. It considers all possible k-predictor models, whereas forward and backward selection may miss some; best subset selection therefore always attains the lowest training RSS.

Bayesian Information Criterion (BIC)

Model selection technique that trades off model fit against model complexity. When comparing models, the one with the lower BIC is preferred.
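One standard form, where k is the number of estimated parameters, n the number of observations, and L̂ the maximized likelihood; the ln(n) factor penalizes complexity more heavily than AIC once n ≥ 8:

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```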

Standard Least Squares is Scale Equivariant

Multiplying X_j by a constant c scales β̂_j by a factor of 1/c, so the product X_j β̂_j (and hence the fit) remains the same. Least squares is not sensitive to changes in scale.
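A small numpy check of this property, on illustrative data: rescaling a column of X by c divides its fitted coefficient by c, leaving the fitted values unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)

c = 1000.0
X_scaled = X.copy()
X_scaled[:, 0] *= c                     # rescale the first predictor by c
beta_scaled, *_ = np.linalg.lstsq(X_scaled, y, rcond=None)

print(beta[0], beta_scaled[0] * c)      # identical up to floating point
```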

Largest Risk in Modelling

Overfitting: making the model too specific to the training data, so that it will not generalize.

One Standard Error Rule

Select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
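A minimal sketch of the rule; the per-model CV error means and standard errors below are hypothetical, made up for illustration.

```python
import numpy as np

model_sizes = np.array([1, 2, 3, 4, 5])
cv_errors = np.array([5.2, 3.0, 2.8, 2.7, 2.9])  # hypothetical mean CV error per size
cv_ses = np.array([0.4, 0.3, 0.3, 0.3, 0.3])     # hypothetical standard errors

best = np.argmin(cv_errors)                       # lowest point on the curve (size 4)
threshold = cv_errors[best] + cv_ses[best]        # one SE above the minimum
chosen = model_sizes[np.argmax(cv_errors <= threshold)]  # smallest size within it
print("one-SE choice:", chosen)                   # size 2, not the minimizing size 4
```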

Disadvantage of Ridge Regression

It maintains high dimensionality: all of the predictors are still used in the final model, since coefficients are shrunk but never set exactly to zero.

Akaike information criterion (AIC)

The AIC is a measure of the trade-off between the goodness of fit of a model and the complexity of the model. The lower, the better.
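The usual definition, with k the number of estimated parameters and L̂ the maximized likelihood:

```latex
\mathrm{AIC} = 2k - 2 \ln(\hat{L})
```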

Best Subset Selection

The best subset selection procedure fits all possible models that can be formed from the predictors and chooses the one with the best fit according to a predefined criterion, such as adjusted R², AIC, or BIC. It can provide a reliable way to select the best set of predictors, especially when the number of predictors is not too large.
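A minimal exhaustive-search sketch on illustrative synthetic data: it fits every subset of predictors (2^p fits, so feasible only for small p) and keeps, for each size k, the subset with the lowest training RSS; a criterion such as AIC, BIC, or adjusted R² would then choose among the sizes.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))                      # n = 80, p = 6 (illustrative)
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(size=80)

best_by_size = {}
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        fit = LinearRegression().fit(X[:, cols], y)
        rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
        if k not in best_by_size or rss < best_by_size[k][1]:
            best_by_size[k] = (subset, rss)

for k, (subset, rss) in sorted(best_by_size.items()):
    print(f"k={k}: predictors {subset}, training RSS {rss:.1f}")
```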

Feature subset selection

Use only a subset of the attributes (variables) in the dataset: select k out of p features.

Mallow's Cp

Mallow's Cp assesses the quality of a linear regression model and is used to compare models with different numbers of predictor variables, taking into account both the goodness of fit and the number of predictors used. It applies to a variety of linear regression models, including multiple linear regression, polynomial regression, and stepwise regression, and to models with both continuous and categorical predictors. A smaller Cp indicates a better fit, with the ideal value close to p + 1; larger values indicate overfitting, smaller values underfitting.
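One common form, assuming p counts the predictors (excluding the intercept) and σ̂² is the error variance estimated from the full model; under this convention a well-specified model has Cp ≈ p + 1:

```latex
C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2(p + 1)
```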

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

i. False: the lasso is less flexible (the λ penalty constrains the coefficients), so bias increases and variance decreases.
ii. False: less flexibility means an increase in bias and a decrease in variance, not the reverse.
iii. True: the lasso is less flexible and gives better predictions when the decrease in variance outweighs the increase in bias.

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors. Explain your answers: True or False:
i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1)-variable model identified by best subset selection.

i. True: forward stepwise starts with the null model and adds one predictor at a time, so the (k+1)-variable model is the k-variable model plus one more predictor.
ii. True: backward stepwise obtains the k-variable model by removing one predictor from the (k+1)-variable model, so its predictors are a subset of the (k+1)-variable model's.
iii. False: the forward and backward selections are independent of each other.
iv. False: same reason as iii.
v. False: the best (k+1)-variable subset does not necessarily contain the best k-variable subset.

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (c) Repeat (a) for non-linear methods relative to least squares.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

ii. True: non-linear methods are more flexible than least squares, and they give improved prediction accuracy when the increase in variance is less than the decrease in bias.

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (b) Ridge regression relative to least squares.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. True: as with the lasso, ridge regression is less flexible, so it has higher bias and lower variance, and it gives better predictions when the decrease in variance outweighs the increase in bias.

Consequences of Shrinkage Methods

- Increased bias
- Reduced variance

Lasso Regression

Lasso regression adds a penalty term to the least squares objective, but instead of the squared magnitudes of the coefficients it uses the sum of their absolute values. It effectively performs feature selection by shrinking some predictor coefficients exactly to zero. λ controls the strength of the penalty, as in ridge regression. The lasso therefore tends to produce more interpretable models, with fewer nonzero coefficients than ridge regression.
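A minimal scikit-learn sketch on illustrative synthetic data with only two true signal variables; `LassoCV` picks λ (`alpha`) by cross-validation, and some coefficients come out exactly zero.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 4.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(size=100)  # only 2 true signals

model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)
lasso = model.named_steps["lassocv"]
print("chosen lambda:", lasso.alpha_)
print("nonzero coefficients:", np.flatnonzero(lasso.coef_))  # typically [0, 3]
```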

Shrinkage Methods

Reduce the magnitude of predictor coefficients toward zero, or set them to zero entirely.

Backward Subset Selection

Start with all predictors and iteratively remove predictors until a stopping criterion is reached. Requires n > p in order to fit the full model.

Forward Subset Selection

Start with no predictors and iteratively add predictors until a stopping criterion is reached. n does not need to be greater than p. See the sketch below.
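A short sketch using scikit-learn's `SequentialFeatureSelector` (available in scikit-learn 0.24+), which implements greedy stepwise selection; flipping `direction` to "backward" gives backward stepwise instead.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + rng.normal(size=60)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,   # stop criterion: keep k = 2 predictors
    direction="forward",      # start from no predictors and add greedily
)
sfs.fit(X, y)
print("selected predictors:", np.flatnonzero(sfs.get_support()))
```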

