Resampling Methods
What are resampling methods?
Tools that involve repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain more information about the fitted model
Application phase
apply your freshly developed model to real-world data and obtain the results
Training phase
train your model by pairing the input with the expected output
Two resampling methods
• Cross Validation
• Bootstrapping
LOOCV vs. the Validation Set Approach
• LOOCV has less bias
• We repeatedly fit the statistical learning method on training data that contains n-1 obs., i.e. almost the entire data set is used
• LOOCV produces a less variable MSE
• The validation set approach produces a different MSE each time it is applied, due to the randomness in the splitting process, while performing LOOCV multiple times will always yield the same results, because we split off 1 obs. each time (see the sketch below)
• LOOCV is computationally intensive (a disadvantage)
• We fit each model n times!
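A minimal sketch (assuming scikit-learn, with synthetic data standing in for a real training set) that makes the contrast concrete: the validation set MSE shifts with each random split, while the LOOCV estimate is the same on every run.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score

# Synthetic data standing in for a real training set.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(scale=2.0, size=100)

model = LinearRegression()

# Validation set approach: the MSE depends on the random split.
for seed in (1, 2, 3):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)
    mse = np.mean((model.fit(X_tr, y_tr).predict(X_val) - y_val) ** 2)
    print(f"validation MSE (seed={seed}): {mse:.3f}")   # differs per seed

# LOOCV: no randomness in the splits, so repeated runs give identical estimates.
loocv_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error").mean()
print(f"LOOCV MSE: {loocv_mse:.3f}")                    # same on every run
```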
k-fold Cross Validation
• LOOCV is computationally intensive, so we can run k-fold Cross Validation instead
• With k-fold Cross Validation, we divide the data set into K different parts (e.g. K = 5, or K = 10, etc.)
• We then remove the first part, fit the model on the remaining K-1 parts, and see how good the predictions are on the left-out part (i.e. compute the MSE on the first part)
• We then repeat this K different times, taking out a different part each time
• By averaging the K different MSEs we get an estimated validation (test) error rate for new observations (see the sketch below)
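A minimal sketch of 10-fold CV (assuming scikit-learn and synthetic data); cross_val_score handles the K splits, the K fits, and the per-fold MSEs for us.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for a real training set.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(scale=2.0, size=200)

# K = 10 folds; each observation is left out exactly once.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mses = -cross_val_score(LinearRegression(), X, y, cv=kf,
                             scoring="neg_mean_squared_error")
print(f"per-fold MSEs: {np.round(fold_mses, 2)}")
print(f"10-fold CV estimate of the test MSE: {fold_mses.mean():.3f}")
```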
Putting aside that LOOCV is more computationally intensive than k-fold CV... which is better, LOOCV or k-fold CV?
• LOOCV has less bias than k-fold CV (when k < n)
• But LOOCV has higher variance than k-fold CV (when k < n)
• Thus, there is a bias-variance trade-off between the two
Conclusion:
• We tend to use k-fold CV with K = 5 or K = 10
• These are the magical K's
• It has been empirically shown that they yield test error rate estimates that suffer neither from excessively high bias nor from very high variance
Auto Data: LOOCV vs. K-fold CV
• Left: LOOCV error curve
• Right: 10-fold CV was run many times, and the figure shows the slightly different CV error rates
• LOOCV is a special case of k-fold CV, where k = n
• Both are stable, but LOOCV is more computationally intensive! (see the PowerPoint)
• Suppose that we want to predict mpg from horsepower
• Two models:
  • mpg ~ horsepower
  • mpg ~ horsepower + horsepower^2
• Which model gives a better fit?
• Randomly split the Auto data set into training (196 obs.) and validation (196 obs.) parts
• Fit both models using the training data set
• Then evaluate both models using the validation data set
• The model with the lowest validation (testing) MSE is the winner! (a sketch follows below)
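A minimal sketch of this comparison (assuming scikit-learn/pandas and a cleaned local Auto.csv with numeric mpg and horsepower columns; the file path is hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

auto = pd.read_csv("Auto.csv")   # hypothetical local copy of the Auto data
X = auto[["horsepower"]].to_numpy(dtype=float)
y = auto["mpg"].to_numpy(dtype=float)

# Random split into a training half (196 obs.) and a validation half.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=196, random_state=1)

# Model 1: mpg ~ horsepower
lin = LinearRegression().fit(X_tr, y_tr)
mse_lin = np.mean((lin.predict(X_val) - y_val) ** 2)

# Model 2: mpg ~ horsepower + horsepower^2 (add the squared column by hand)
X_tr2, X_val2 = np.hstack([X_tr, X_tr ** 2]), np.hstack([X_val, X_val ** 2])
quad = LinearRegression().fit(X_tr2, y_tr)
mse_quad = np.mean((quad.predict(X_val2) - y_val) ** 2)

print(f"validation MSE, linear:    {mse_lin:.2f}")
print(f"validation MSE, quadratic: {mse_quad:.2f}")   # the lower MSE wins
```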
Advantages of the Validation Set Approach
• Simple
• Easy to implement
Typical Approach: The Validation Set Approach
• Suppose that we would like to find a set of variables that gives the lowest test (not training) error rate
• If we have a large data set, we can achieve this goal by randomly splitting the data into training and validation (testing) parts
• We would then use the training part to build each possible model (i.e. the different combinations of variables) and choose the model that gave the lowest error rate when applied to the validation data (a sketch follows below)
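A minimal sketch of that search (assuming scikit-learn; the synthetic predictors are hypothetical stand-ins for real variables): fit every combination of candidate variables on the training part and keep the one with the lowest validation MSE.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 4 candidate predictors, only two of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

best_mse, best_subset = np.inf, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        model = LinearRegression().fit(X_tr[:, cols], y_tr)
        mse = np.mean((model.predict(X_val[:, cols]) - y_val) ** 2)
        if mse < best_mse:
            best_mse, best_subset = mse, subset

print(f"lowest validation MSE {best_mse:.3f} with predictors {best_subset}")
```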
Bootstrap
• The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.
• For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient. (see the sketch below)
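A minimal sketch of the bootstrap (NumPy only, on synthetic data): estimate the standard error of a regression slope by refitting the model on B data sets resampled with replacement.

```python
import numpy as np

# Synthetic data with a true slope of 1.5.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.5 * x + rng.normal(scale=2.0, size=n)

def slope(x, y):
    # Least-squares slope via a degree-1 polynomial fit.
    return np.polyfit(x, y, 1)[0]

B = 1000
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)      # draw n indices with replacement
    boot_slopes[b] = slope(x[idx], y[idx])

print(f"bootstrap SE of the slope: {boot_slopes.std(ddof=1):.4f}")
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"rough 95% percentile CI: ({lo:.3f}, {hi:.3f})")
```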
Disadvantages of the Validation Set Approach
• The validation MSE can be highly variable
• Only a subset of the observations is used to fit the model (the training data), and statistical methods tend to perform worse when trained on fewer observations (look at the starred image on page 10)
Leave-One-Out Cross Validation (LOOCV) (see PowerPoint slide 11)
• This method is similar to the Validation Set Approach, but it tries to address the latter's disadvantages
• For each suggested model, do:
  • Split the data set of size n into:
    • Training data set (blue), size: n-1
    • Validation data set (beige), size: 1
  • Fit the model using the training data
  • Validate the model using the validation data, and compute the corresponding MSE
• Repeat this process n times, leaving out a different observation each time, and average the n MSEs to obtain the LOOCV estimate of the test error (see the sketch below)
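A minimal sketch of that loop written out explicitly (assuming scikit-learn and synthetic data), mirroring the n train/validate splits described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

# Synthetic data standing in for a real training set.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X.ravel() + rng.normal(scale=2.0, size=50)

mses = []
for train_idx, val_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on n-1 obs.
    pred = model.predict(X[val_idx])                            # predict the 1 held-out obs.
    mses.append((pred[0] - y[val_idx][0]) ** 2)

print(f"LOOCV estimate of the test MSE: {np.mean(mses):.3f}")
```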
Validation/Test phase
• estimate how well your model has been trained
• estimate model properties (mean error for numeric predictors, classification errors for classifiers, recall and precision for IR models, etc.)
The validation/test phase is often split into two parts:
• First you look at your models and select the best-performing approach using the validation data (= validation)
• Then you estimate the accuracy of the selected approach (= test) (a sketch follows below)
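A minimal sketch of the two-part split (assuming scikit-learn and synthetic data): the validation set picks the best of three candidate models, and the held-out test set then estimates the accuracy of that choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a mild quadratic trend.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X.ravel() + 0.3 * X.ravel() ** 2 + rng.normal(scale=2.0, size=300)

# 60% train / 20% validation / 20% test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

def design(X, d):
    # Polynomial design matrix with columns X, X^2, ..., X^d.
    return np.hstack([X ** p for p in range(1, d + 1)])

# Validation step: compare polynomial degrees 1-3 on the validation data.
val_mses = {}
for d in (1, 2, 3):
    m = LinearRegression().fit(design(X_tr, d), y_tr)
    val_mses[d] = np.mean((m.predict(design(X_val, d)) - y_val) ** 2)
best_d = min(val_mses, key=val_mses.get)

# Test step: estimate the accuracy of the selected model on untouched data.
final = LinearRegression().fit(design(X_tr, best_d), y_tr)
test_mse = np.mean((final.predict(design(X_te, best_d)) - y_te) ** 2)
print(f"selected degree {best_d}; test MSE = {test_mse:.3f}")
```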