M7 - Cross Validation
If the k-th original observation, (xk, yk), is not chosen by any of the n bootstrap samples, then we say that (xk, yk) is "out-of-bag (OOB)"; in other words, (xk, yk) is not in the chosen collection of bootstrap samples. If bootstrap samples are pulled independently, then what is the probability that (xk, yk) is OOB?
(1 - (1/n))^n The probability that a bootstrap observation is not (xk, yk) is equal to 1-1/n. Hence, the probability that (xk, yk) is not in the bootstrap sample must be (1−1/n)^n, since n samples are drawn independently to build the bootstrap sample.
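A quick sanity check of this formula is to simulate bootstrap draws. The sketch below (a minimal example assuming numpy, with arbitrary choices n = 100 and k = 0) compares the simulated out-of-bag frequency with (1 - 1/n)^n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 10_000   # arbitrary illustration values

oob_count = 0
for _ in range(trials):
    draws = rng.integers(0, n, size=n)   # one bootstrap sample: n indices drawn with replacement
    if 0 not in draws:                   # observation k = 0 was never selected -> out-of-bag
        oob_count += 1

print("simulated P(OOB):", oob_count / trials)
print("formula (1 - 1/n)^n:", (1 - 1 / n) ** n)
```

The two printed values should agree to roughly two decimal places.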
Advantages of K-fold CV over standard CV:
- Provides a less volatile estimate of the test MSE
- Uses more observations in training
- Provides a less overestimated test MSE
Drawbacks of the validation set approach:
- The MSE_VS estimate varies highly with how the dataset is split
- Only half the observations are used in training (maybe not enough)
- The MSE_VS tends to overestimate the test MSE
Suppose we have a dataset with 10 observations. What is the probability that the first bootstrap observation is the second observation (x2,y2) in our original dataset? (Answer as a real number with one decimal place.)
0.1 Since the bootstrap samples are formed by sampling with replacement, each of the 10 observations is equally likely to be picked, so the probability that the first bootstrap observation is (x2, y2) is 1/10 = 0.1; it does not matter which particular observation index we ask about.
What is the probability that the k-th original observation (xk, yk) is one of the points in the collection of bootstrap samples?
1 - (1 - 1/n)^n Take the complement of (1−1/n)^n, the probability that an observation is OOB, to get the probability that an observation is in the bootstrap sample.
Main Steps for K-fold CV for Classification
1) Randomly partition the available data samples into K (almost) equal-sized parts: D1, D2, ..., DK. Denote the sizes by N1, N2, ..., NK.
2) For k = 1 to K, do the following:
   a. Train your classifier using the data D \ Dk (remove the k-th part).
   b. Estimate the misclassification error rate ERR_k on the left-out part Dk, i.e., the fraction of observations in Dk that the classifier labels incorrectly. (Instead of the MSE used for linear regression, use the misclassification error rate.)
3) The K-fold cross-validation misclassification error estimate ERR_KF is obtained by repeating step 2 for every part D1, D2, ..., DK and averaging the ERR_k values. (A code sketch follows below.)
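A minimal sketch of these steps in Python, using scikit-learn's KFold and, as arbitrary choices for illustration, a logistic regression classifier on the built-in iris data (any classification dataset and classifier would work the same way):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)   # example data, purely for illustration
K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=0)   # step 1: partition into K parts

errors = []
for train_idx, test_idx in kf.split(X):                          # step 2: loop over folds
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])                          # 2a: train on D \ Dk
    err_k = np.mean(clf.predict(X[test_idx]) != y[test_idx])     # 2b: misclassification rate on Dk
    errors.append(err_k)

err_kf = np.mean(errors)                                         # step 3: average the ERR_k
print("ERR_KF:", err_kf)
```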
Main Steps (Validation Set approach): Given a dataset D = {(x_i, y_i)}_{i=1}^N, take the following steps:
1) Randomly partition the available data samples into two subsets: a training subset D_TS and a validation subset D_VS.
2) Train the model f(x; θ) using the data in the training subset to obtain the fitted parameters ^θ.
3) Use the fitted model to predict the outputs in the validation subset and compute the validation MSE:
   MSE_VS = (1/N_VS) * Σ_{i ∈ D_VS} (y_i − f(x_i; ^θ))²
(A code sketch follows below.)
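A minimal sketch of the validation set approach, assuming scikit-learn and synthetic data generated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: y = 3x + noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=200)

# 1) Split into a training subset and a validation subset
X_tr, X_vs, y_tr, y_vs = train_test_split(X, y, test_size=0.5, random_state=1)

# 2) Train the model on the training subset
model = LinearRegression().fit(X_tr, y_tr)

# 3) Predict on the validation subset and compute MSE_VS
mse_vs = mean_squared_error(y_vs, model.predict(X_vs))
print("MSE_VS:", mse_vs)
```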
To estimate the testing error, we can use resampling methods:
1) Validation Set Approach
2) K-Fold Cross-Validation
3) The Bootstrap (also used to estimate the standard errors of our parameters)
Now suppose we have a dataset with n observations, indexed {1,...,k,...,n}. The probability that the first bootstrap observation is not the k-th original observation, (xk, yk), is ____.
1-1/n The probability that the first bootstrap observation is the k-th original observation is 1/n, so the complement is 1-1/n.
K-fold Cross-validation (KCV) main steps:
1. Randomly partition the available data samples into K (almost) equal-sized parts: D1, D2, ..., DK. Denote the sizes by N1, N2, ..., NK.
2. For k = 1 to K, do the following:
   a. Train the model using the data D \ Dk (remove the k-th part) to get the estimate ^θ_k; for example, for k = 2, train on every part except D2.
   b. Estimate the test MSE for the left-out part Dk by evaluating the fitted model on Dk:
      MSE_k = (1/N_k) * Σ_{i ∈ D_k} (y_i − f(x_i; ^θ_k))²
3. The K-fold cross-validation estimate is obtained by repeating step 2 with each part D1, D2, ..., DK held out in turn and averaging:
   MSE_KCV = (1/K) * Σ_{k=1}^{K} MSE_k
(A code sketch follows below.)
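A short sketch of the same procedure using scikit-learn's cross_val_score, which performs the train-on-D\Dk / evaluate-on-Dk loop internally (the synthetic data and LinearRegression are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: y = 3x + noise
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=2)
# cross_val_score trains on D \ Dk and scores on the held-out Dk for each fold
neg_mse = cross_val_score(LinearRegression(), X, y,
                          cv=kf, scoring="neg_mean_squared_error")
mse_kcv = -neg_mse.mean()   # average the per-fold MSEs
print("MSE_KCV:", mse_kcv)
```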
When n=1000, what is the probability that the k-th original observation is in the collection of bootstrap samples?
63% 1 - (1 - 1/n)^n = 1 - (1 - 1/1000)^1000 ≈ 0.632 ≈ 63%
When n=10, what is the probability that the n-th original observation is in the collection of bootstrap samples?
65% 1 - (1 - 1/n)^n = 1 - (1 - 1/10)^10 ≈ 0.651 ≈ 65%
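Both values can be checked directly from the formula, e.g.:

```python
# verify 1 - (1 - 1/n)^n for the two cases above
for n in (10, 1000):
    print(n, round(1 - (1 - 1 / n) ** n, 4))
# 10   -> 0.6513 (about 65%)
# 1000 -> 0.6323 (about 63%)
```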
What is D^B?
D^B denotes a bootstrap sample of the data. You create M of them, D^B1, D^B2, ..., D^BM, fit the model on each one, and then use the spread of the M resulting estimates to measure the standard deviation (standard error) of the estimator.
The correct procedure in cross validation is first training an algorithm on the complete dataset, then partitioning a portion of that dataset to test the algorithm (true or false).
False - If we are running cross validation, we always want to split a dataset into a training set and a testing set before training the model.
The purpose of splitting a dataset into a training subset and a validation subset is to lower the run-time of the training process (true or false).
False - The purpose of splitting a dataset is so that the model does not train on testing data, because we want to understand how our model performs on data it has not seen before.
Consider a dataset that is evenly partitioned into two subsets: a training subset and a validation subset. If we build an algorithm that correctly predicts all outputs using the training subset, we are guaranteed to have it accurately predict all outputs using the validation subset (true or false).
False - The training set may not be indicative of the testing set.
Main Idea: The Bootstrap
Given a dataset D = {(x_i, y_i)}_{i=1}^N, the bootstrap can generate M artificial datasets, D^B1, D^B2, ..., D^BM, by resampling the original data with replacement.
Workflow: The Bootstrap
Given a dataset with N samples, bootstrap involves randomly drawing N samples (with replacement) from the original dataset to create a new "bootstrap" sample. This process is repeated many times (usually thousands of times) to create a distribution of statistics (e.g., mean, standard error) that can be used for inference.
Advantages: The Bootstrap
It provides a way to estimate the uncertainty of a statistic without making strong assumptions about the underlying population distribution. Bootstrap is particularly useful when the dataset is small or when traditional parametric methods are not applicable.
The Bootstrap process
Loop to create the bootstrap datasets D^B1, ..., D^BM:
1) For m = 1 to M, do the following:
   a) Sample N random observations from D with replacement.
   b) Include these N observations in an artificial dataset D^Bm, called a bootstrap sample.
   c) Train your model f(x; θ) using the dataset D^Bm to compute the estimate ^θ^Bm.
2) Estimate the standard deviation of the model coefficients using
   SE(^θ) ≈ sqrt( (1/(M − 1)) * Σ_{m=1}^{M} (^θ^Bm − mean(^θ^B))² ),
   where mean(^θ^B) = (1/M) * Σ_{m=1}^{M} ^θ^Bm is the empirical mean of the bootstrap estimates.
(A code sketch follows below.)
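A minimal sketch of this loop for the slope coefficient of a simple linear regression, assuming numpy and scikit-learn with synthetic data (the true slope of 2 is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y = 2x + noise
rng = np.random.default_rng(3)
N = 200
X = rng.uniform(0, 5, size=(N, 1))
y = 2 * X.ravel() + rng.normal(0, 1, size=N)

M = 1000                                   # number of bootstrap samples
slopes = np.empty(M)
for m in range(M):
    idx = rng.integers(0, N, size=N)       # a) sample N observations with replacement
    Xb, yb = X[idx], y[idx]                # b) the bootstrap sample D^Bm
    slopes[m] = LinearRegression().fit(Xb, yb).coef_[0]   # c) train on D^Bm

# 2) the empirical standard deviation of the M estimates approximates SE of the slope
print("bootstrap SE of the slope:", slopes.std(ddof=1))
```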
Main Idea of Resampling methods
Main Idea: Resampling methods estimate the test error by holding out a subset of the training observations and using them later on for testing.
Advantages: K-fold Cross-Validation
Provides a more robust estimate of model performance, reduces the risk of overfitting, and utilizes the entire dataset for both training and validation.
Advantages: Validation Set:
Simplicity, quick to implement, and provides a single performance estimate for the model.
What does this code do: X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2,random_state = 1)
Split the dataset into 80% training set and 20% test set.
What does this code do:
kf = KFold(n_splits=11, random_state=42, shuffle=True)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    print(len(train_index))
    print(len(test_index))
The code will run cross validation a total of 11 times (one loop per fold). In each loop, the code separates the dataset into 460 observations of training data (train_index) and 46 observations of testing data (test_index). Because we are running 11-fold cross validation, the algorithm splits the dataset into 10 folds of training data and 1 fold of testing data; the 10 training folds equate to 506 * 10/11 = 460 observations. All 13 features are still used when training, but this does not change the 460-row training split, since the partition is over observations, not features.
Workflow: K-fold Cross-Validation
The dataset is divided into K equal-sized folds or subsets. The model is trained K times, each time using K-1 folds for training and one fold for validation. This process is repeated K times (folds) so that every data point is used for validation once.
Workflow: Validation Set
The dataset is typically split into two parts, a training set (used for model training) and a validation set (used for model evaluation). The model is trained on the training set, and its performance is evaluated on the validation set.
Does the validation set approach tend to overestimate the test mean square error (true or false)?
True - Because only part of the data is used for training, the fitted model tends to perform worse than one trained on the full dataset, so MSE_VS tends to overestimate the test MSE.
Cross validation prevents knowledge about the test set from leaking into the model (true or false).
True - Leakage occurs when information about the test data inadvertently influences the model during training. In cross validation, we want to prevent information from the data used to test the model from leaking into the data used to train it. This separation of training and testing is the essence of cross validation.
What are the Validation Set approach, K-fold Cross-Validation, and The Bootstrap used for?
Used in machine learning and statistics for model evaluation and selection.
Validation Set (VS) Approach
VS is used to estimate the test MSE of a parametric model f(x; θ).
Difference between Validation Set, K-fold Cross-Validation, The Bootstrap
Validation set is a simple method for assessing model performance on a separate dataset. K-fold cross-validation is a technique for robustly estimating model performance by partitioning the dataset into multiple subsets. The Bootstrap is a resampling technique for estimating the distribution of statistics and making inferences about a population.
What is the probability that a particular point in the original dataset is in the collection of bootstrap samples when the number of points in the dataset tends to infinity? Hint: e^x = lim_n→∞(1 + x/n)^n
We need to compute the limit lim_{n→∞} [1 − (1 − 1/n)^n] = 1 − lim_{n→∞} (1 + (−1)/n)^n = 1 − e^{−1} ≈ 0.63. Notice that this value is approximately 63%, which was the answer to the previous question (n = 1000).
Explain K-fold CV: Numerical Example
We use the N-fold CV approach (also called LOOCV, leave-one-out cross-validation) and the 10-fold CV approach to estimate the test MSE of polynomial models of various degrees.
If we wish to implement Leave-One-Out-Cross-Validation (LOOCV), the value of the parameter n_splits in the function KFold would be______?
n - LOOCV is essentially n-fold CV: n − 1 observations are used to train the model, while 1 observation is used to test it. This split is looped until every observation (all 506 in the earlier example) has been used as test data exactly once, giving an exhaustive cross validation. (A code sketch follows below.)
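A small sketch, assuming a toy X with 10 rows, showing that KFold(n_splits=n) and scikit-learn's LeaveOneOut produce the same leave-one-out splits:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)        # hypothetical data with n = 10 observations
n = len(X)

loocv = KFold(n_splits=n)               # n_splits = n gives one test observation per fold
print(sum(1 for _ in loocv.split(X)))   # 10 folds

loo = LeaveOneOut()                     # equivalent helper provided by scikit-learn
print(sum(1 for _ in loo.split(X)))     # also 10 folds
```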