3.3 Stepwise Selection and Regularization
2 Methods to Improve Models
1) Stepwise Selection > Alternative to test RMSE for selecting the best model > Keep good predictors and throw out bad predictors > Uses AIC & BIC
2) Regularization > Ridge Regression > LASSO Regression > Elastic Net
Backward Selection
> Null model: model with NO predictors, only the intercept
> Biggest model: model with ALL possible predictors
MAIN IDEA: Remove 1 bad variable at a time
Step 1) Start with the biggest model
Step 2) Consider all models with 1 variable removed from the biggest model. Select the model that has the lowest AIC/BIC
Step 3) Consider all models with 1 additional variable removed from the model selected in step 2. Select the model that has the lowest AIC/BIC
Step 4) Repeat the procedure until the model from the step directly above has a lower AIC/BIC than every newer, smaller model, i.e. until removing another variable no longer lowers the AIC/BIC
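A minimal sketch of backward selection driven by AIC, written in Python with statsmodels (assumed tooling, not named in these notes). It assumes a pandas DataFrame `df` whose column "y" is the target and whose remaining columns are the candidate predictors; statsmodels' `.aic` uses the log-likelihood form rather than the SSE* form in these notes, but model rankings on a fixed dataset are the same.

```python
import numpy as np
import statsmodels.api as sm

def backward_select_aic(df, target="y"):
    predictors = [c for c in df.columns if c != target]

    def fit_aic(cols):
        # intercept-only design when no predictors remain
        X = sm.add_constant(df[cols]) if cols else np.ones((len(df), 1))
        return sm.OLS(df[target], X).fit().aic

    current_aic = fit_aic(predictors)              # Step 1: start with the biggest model
    while predictors:
        # Steps 2/3: AIC of every model with one variable removed
        candidates = [(fit_aic([p for p in predictors if p != drop]), drop)
                      for drop in predictors]
        best_aic, worst_var = min(candidates)
        if best_aic < current_aic:                 # removal lowers AIC: keep going
            predictors.remove(worst_var)
            current_aic = best_aic
        else:                                      # Step 4: no removal helps, stop
            break
    return predictors, current_aic
```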
Ridge Regression
> Restricts model coefficients such that the total sum of the squared coefficients is less than some threshold... the intercept coefficient is not included in this sum
> For a model with 2 variables/coefficients, the ridge restriction limits the possible combinations of the 2 coefficients to a circle centered at the origin. The first SSE band/contour that touches the edge of the circle gives the coefficients that minimize the SSE subject to the constraint
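A minimal sketch of fitting a ridge regression with scikit-learn (an assumed tool, not named in these notes); the simulated `X` and `y` are placeholders. Note that scikit-learn calls the shrinkage parameter `alpha` rather than lambda, and the intercept is not penalized.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # two predictors
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

ridge = Ridge(alpha=1.0)                           # alpha plays the role of lambda
ridge.fit(X, y)
print(ridge.intercept_, ridge.coef_)               # the intercept is not penalized
```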
Ridge vs. LASSO Regression: Similarities
1) Both are helpful in high dimensions (i.e. lots of predictors in dataset) 2) Both use cross validation to select the best value for lambda (shrinkage parameter)
Forward vs. Backward Selection: Similarities
1) Both perform variable selection > only keep variables that are valuable predictors > help find a balance between model flexibility and interpretability
2) Both are GREEDY algorithms > only consider the immediate best variable to add (forward) or the immediate worst variable to remove (backward) > do not guarantee the global/overall best model > Solution to this: Best Subset Selection... the downside is that it is much more time- and computationally demanding
Ridge Regression: Factoids
1) For 0 < lambda < infinity, ridge regression essentially never sets a coefficient exactly to 0 (i.e. NO VARIABLE SELECTION)
2) Ridge regression is still useful with high-dimensional data > only some predictors will have meaningful values... although predictors may have non-0 coefficients, their coefficient values may be so small that those predictors have no significant effect on the predictions
Stepwise Selection: Procedure
1) Forward Selection 2) Backward Selection 3) Best Subset Selection
4 Possible Combinations of Stepwise Selection
1) Forward Selection with BIC
2) Forward Selection with AIC
3) Backward Selection with BIC
4) Backward Selection with AIC
> Sorted in order of ascending flexibility
> The 2 most justifiable combinations:
If the model aim is interpretability: #1 favors the least flexible/most interpretable model
If the model aim is accurate predictions: #4 favors the most flexible/least interpretable model
> Other combinations are still justifiable: if Forward Selection with BIC results in too few predictors, we might instead consider Forward Selection with AIC
Forward vs. Backward Selection: Differences
1) High Dimensionality
> Forward: GOOD because we start with the null model and add 1 variable at a time until the procedure stops. It is very unlikely that the forward procedure ever reaches models with a large number of predictors
> Backward: BAD because we start with the biggest model, so the starting point may be a questionable model
2) Complementing Predictors
> REFRESH: a complementing predictor is a predictor that by itself may not be very valuable, but when paired with another predictor it adds valuable information
> Forward: BAD because the greedy algorithm may never detect the complementing predictor, since we start with the null model
> Backward: GOOD because we start with the biggest model, so all complementing predictors are included at the starting point
Ridge vs. LASSO Regression: Differences
1) Variable selection > Ridge does NOT perform variable selection > LASSO coefficient values CAN equal exactly 0, and therefore this regularization method CAN perform variable selection
Ridge Regression & Standardized Predictors
> ALL regularization models expect standardized predictors
> When predictors have different scales, so too do their respective coefficients. Thus it isn't fair to shrink all coefficients evenly (coefficients for predictors with a smaller scale face greater restriction than those for predictors with a larger scale)
Example: the target is house value. The predictors are 1) square feet, ranging from 0 - 10,000, and 2) # of bedrooms (0 - 5). The coefficient for square feet will likely be very small because adding 1 square foot to a house does not change the price much. The coefficient for # of bedrooms is probably much larger because 1 additional bedroom increases the value of a house by a lot. If we performed ridge regression on the non-standardized predictors, the large # of bedrooms coefficient would be unfairly penalized far more than the small coefficient for square feet
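A minimal sketch of standardizing before ridge, loosely following the house-value example above; the simulated `sqft` and `beds` data are hypothetical, and the scikit-learn pipeline is an assumed tool choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
sqft = rng.uniform(0, 10_000, size=200)            # large-scale predictor
beds = rng.integers(0, 6, size=200)                # small-scale predictor
price = 150 * sqft + 20_000 * beds + rng.normal(0, 50_000, size=200)

X = np.column_stack([sqft, beds])
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, price)
# After standardization the two coefficients are on a comparable scale,
# so the ridge penalty shrinks them fairly.
print(model.named_steps["ridge"].coef_)
```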
Backward Selection & the Hierarchical Principle
> Backward selection follows the hierarchical principle: we must first drop the interaction term before the individual terms/predictors become candidates for dropping
Forward Selection & Categorical Variables
> Forward selection can help identify what levels of a factor/categorical variable are actually useful > Use binarization to manually create a dummy variable for each level of the factor (0 or 1) > Forward selection procedure will then include useful/predictive levels in the final model
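A minimal sketch of binarizing a factor with pandas (an assumed tool); the `region` column and its levels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "S", "E", "W", "S"],
                   "y": [1, 2, 3, 4, 2]})
dummies = pd.get_dummies(df["region"], prefix="region")   # one 0/1 column per level
df = pd.concat([df.drop(columns="region"), dummies], axis=1)
print(df.columns.tolist())
# Forward selection can now consider region_E, region_N, ... individually.
```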
Forward Selection & the Hierarchical Principle
> Forward selection follows the hierarchical principle: only consider models with an interaction after both variables involved in the interaction have been included in previous steps
Elastic Net Regression
> Hybrid between Ridge and LASSO regression > More general regularized regression > alpha hyperparameter is used to determine the weight to be put on the Ridge penalty vs. the weight to be put on the LASSO penalty > alpha = 0 results in RIDGE > alpha = 1 results in LASSO > often we use cross validation to tune lambda and manually tune alpha
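A minimal sketch of elastic net with scikit-learn (an assumed tool). Beware the naming mismatch: the mixing weight these notes call alpha is scikit-learn's `l1_ratio`, while scikit-learn's `alpha` plays the role of lambda.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# l1_ratio is the ridge-vs-LASSO mixing weight (manually chosen here);
# the shrinkage parameter (lambda) is tuned by 5-fold cross validation.
enet = ElasticNetCV(l1_ratio=0.5, cv=5)
enet.fit(X, y)
print(enet.alpha_, enet.coef_)
```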
Ridge Regression & Lambda
> Lambda: hyperparameter that controls the amount of shrinkage, inversely related to flexibility
> Higher lambda results in LOWER FLEXIBILITY. There is a greater penalty. Lambda = infinity results in the null model because all slope coefficients are shrunk to 0
> Lower lambda results in HIGHER FLEXIBILITY. There is a smaller penalty. Lambda = 0 results in NO shrinkage (ordinary least squares)
> We use cross validation to find the best value for lambda
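A minimal sketch of tuning lambda by cross validation for ridge regression, using scikit-learn's RidgeCV (an assumed tool); the candidate lambda grid is arbitrary.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=200)

# Candidate lambda values on a log scale; the one with the lowest
# cross-validation error is selected.
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5)
ridge_cv.fit(X, y)
print(ridge_cv.alpha_)                             # the selected lambda
```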
LASSO Regression
> Least Absolute Shrinkage and Selection Operator
> Closely related to Ridge regression
> Ridge Regression: restrict the total sum of squared coefficients to be below some threshold
> LASSO Regression: restrict the total sum of the absolute values of the coefficients to be below some threshold. Remember, the intercept coefficient is not included in this sum
> For a model with 2 predictors, the restricted area is a diamond centered at the origin. The first SSE band/contour that touches the edge of the diamond gives the coefficients that minimize the SSE subject to the constraint. Because the diamond has corners on the axes, this contact often occurs where one coefficient is exactly 0
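A minimal sketch showing LASSO's variable selection with scikit-learn (an assumed tool): with only 2 truly informative predictors among 8, several fitted coefficients come out exactly 0.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=200)   # only 2 informative predictors

lasso = Lasso(alpha=0.5)                           # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)                                 # noise predictors shrink exactly to 0
```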
Forward Selection
> Null model: model with NO predictors, only the intercept
> Biggest model: model with ALL possible predictors
MAIN IDEA: Add 1 good variable at a time
Step 1) Consider the null model and all models with exactly 1 variable. Compare the AIC/BIC of each of these models and select the model with the lowest AIC/BIC
Step 2) Consider all models that add 1 explanatory variable to the model selected in step 1. Compare the AIC/BIC of each of these models and select the model with the lowest AIC/BIC
Step 3) If the 2-variable model selected in step 2 has an AIC/BIC that is LOWER than the model selected in step 1, forward selection continues. If the model selected in step 2 has a HIGHER AIC/BIC than the model selected in step 1, the procedure ends and the model from step 1 is the final model
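A minimal sketch of forward selection driven by AIC, mirroring the backward version shown earlier; it again assumes a pandas DataFrame `df` with target column "y" and uses statsmodels' log-likelihood AIC (assumed tooling).

```python
import numpy as np
import statsmodels.api as sm

def forward_select_aic(df, target="y"):
    remaining = [c for c in df.columns if c != target]
    selected = []

    def fit_aic(cols):
        X = sm.add_constant(df[cols]) if cols else np.ones((len(df), 1))
        return sm.OLS(df[target], X).fit().aic

    current_aic = fit_aic(selected)                # start from the null model
    while remaining:
        # AIC of every model with one candidate variable added
        candidates = [(fit_aic(selected + [add]), add) for add in remaining]
        best_aic, best_var = min(candidates)
        if best_aic < current_aic:                 # addition lowers AIC: keep it
            selected.append(best_var)
            remaining.remove(best_var)
            current_aic = best_aic
        else:                                      # no addition helps: stop
            break
    return selected, current_aic
```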
AIC Formula
AIC = SSE* + 2p
> SSE* is very similar to SSE
> p: # of predictors
> SSE* decreases as flexibility increases
> p increases as flexibility increases
> If SSE* decreases by MORE THAN 2 when we add a new predictor, the new predictor is worth it because the AIC nets lower
> A high AIC can be due to: 1) p is large (high flexibility) or 2) SSE* is large (low flexibility)
> We want to find a flexibility that strikes a BALANCE to minimize AIC
Stepwise Selection Criterion: AIC & BIC
AIC: Akaike Information Criterion BIC: Bayesian Information Criterion AIC & BIC curves are U shaped as flexibility increases Pick model with the lowest AIC/BIC
Regularization Models
AKA Shrinkage Methods 1) Ridge Regression 2) LASSO Regression 3) Elastic Nets
BIC Formula
BIC = SSE* + (ln n)(p)
> n: # of observations in the dataset
> ln n is greater than 2 for all n > 7 (i.e. n >= 8, since ln 8 ≈ 2.08)
> Therefore, BIC almost always has a HIGHER PENALTY for adding variables, and thus BIC favors smaller models compared to AIC
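A minimal sketch comparing AIC and BIC for one fitted model using statsmodels (an assumed tool). statsmodels reports the log-likelihood forms rather than the SSE* forms in these notes, but the behavior is the same: BIC's per-parameter penalty exceeds AIC's once n >= 8.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.aic, fit.bic)    # BIC applies the larger per-parameter penalty here (n = 100)
```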
Comparing Ridge Regression and LASSO Regression
We CANNOT say that one is more or less flexible than the other. Lambda is inversely related to flexibility, but that comparison assumes a FIXED VALUE for LAMBDA, and in practice each method is tuned to its own lambda
1 Standard Error Rule
If a model's CV error is within 1 standard error of the minimum CV error, we consider it "good enough". This allows us to choose a less flexible/more interpretable model if that better suits the modeling context
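A minimal sketch of applying the 1 standard error rule to LASSO using the per-fold errors stored by scikit-learn's LassoCV (an assumed tool); the rule picks the largest lambda (least flexible model) whose CV error is within 1 SE of the minimum.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

cv = LassoCV(cv=5).fit(X, y)
mean_mse = cv.mse_path_.mean(axis=1)               # CV error at each candidate lambda
se_mse = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])

best = mean_mse.argmin()
threshold = mean_mse[best] + se_mse[best]
# Largest lambda (least flexible model) whose CV error is within 1 SE of the minimum
lambda_1se = max(a for a, m in zip(cv.alphas_, mean_mse) if m <= threshold)
print(cv.alphas_[best], lambda_1se)
```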
LASSO Regression & the Hierarchical Principle
The hierarchical principle is ignored for LASSO: the penalty may shrink a main effect to 0 while keeping an interaction term that involves it
LASSO Regression & Lambda
Just like Ridge regression, lambda represents the shrinkage parameter
Motivation for Regularization
OLS finds the regression coefficients that minimize the SSE. As the # of predictors increases, bias decreases but variance increases. Q: Is it possible to have a model with a LARGE # of predictors but REDUCE the VARIANCE by INCREASING BIAS slightly? A: Yes, the solution is to restrict the coefficients and shrink them towards 0
Benefits of Cross Validation
REMEMBER: It is common to partition data into 3 sets... 1) training set 2) validation set 3) test set... the validation set is used to choose between models. Cross validation achieves the same goal as the validation set without setting aside a separate validation split > Cross validation needs ONLY THE TRAINING SET
k-Fold Cross Validation: Algorithm
Step 1) Randomly divide the data into k folds
Step 2) For the first iteration, remove fold #1 to be used as the "test" set. Train the model on the remaining k-1 folds
Step 3) Calculate the "test" RMSE using fold #1 as the test data
Step 4) For the second iteration, remove fold #2 to be used as the "test" set. Train the model on the other k-1 folds
Step 5) Calculate the "test" RMSE using fold #2 as the test data
Step 6) Repeat these steps until each fold has been excluded exactly once
Step 7) Average the k individual RMSE values; this is the CV ERROR
> In the context of ridge regression, run the entire algorithm for many different lambda values and choose the lambda value with the lowest CV error
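A minimal sketch of k-fold cross validation for ridge regression at a single lambda value, using scikit-learn's KFold (an assumed tool); to tune lambda, the same loop would be repeated over a grid of lambda values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=200)

k = 5
rmses = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])   # train on k-1 folds
    resid = y[test_idx] - model.predict(X[test_idx])           # held-out fold
    rmses.append(np.sqrt(np.mean(resid ** 2)))

cv_error = np.mean(rmses)          # average of the k fold-level RMSE values
print(cv_error)
```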