Bias-Variance Tradeoff
early stopping
A regularization method that ends model training before the training loss finishes decreasing. In early stopping, you end training when the loss on a validation set starts to increase, that is, when generalization performance worsens. So even if the optimizer (gradient descent or whatever) hasn't converged, stop once the validation error starts to increase!
more complex models have lower
BIAS! (generally, if you don't know the underlying distribution) because a more complex model has more freedom to capture the underlying function
derivation of bias-variance decomposition
Add and subtract hbar(x), expand the square, and note that the cross term is 0 (see the separate card); then add and subtract ybar(x) in the remaining term, whose cross term is also 0:

$$
\begin{aligned}
E_D\,E_{x,y\sim P}\big[(h_D(x)-y)^2\big]
&= E_D\,E_{x,y\sim P}\big[(h_D(x)-\bar h(x))^2\big] + E_{x,y\sim P}\big[(\bar h(x)-y)^2\big] + 2\,E_D\,E_{x,y\sim P}\big[(h_D(x)-\bar h(x))(\bar h(x)-y)\big] \\
&= \underbrace{E_D\,E_{x\sim P}\big[(h_D(x)-\bar h(x))^2\big]}_{\text{variance}} + E_{x,y\sim P}\big[(\bar h(x)-y)^2\big] \\
&= \text{variance} + E_{x\sim P}\big[(\bar h(x)-\bar y(x))^2\big] + E_{x,y\sim P}\big[(\bar y(x)-y)^2\big] + 2\,E_{x,y\sim P}\big[(\bar h(x)-\bar y(x))(\bar y(x)-y)\big] \\
&= \text{variance} + \underbrace{E_{x\sim P}\big[(\bar h(x)-\bar y(x))^2\big]}_{\text{bias}^2} + \underbrace{E_{x,y\sim P}\big[(\bar y(x)-y)^2\big]}_{\text{noise}}
\end{aligned}
$$

(E_D drops from the hbar and ybar terms because they do not depend on D.)
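A minimal numerical check of this decomposition (not from the notes; the setup, y = sin(x) + Gaussian noise with x uniform on [0, 3] and h_D a degree-1 least-squares polynomial, is an assumption chosen only for illustration):

```python
# Empirically verify: expected test error = bias^2 + variance + noise
import numpy as np

rng = np.random.default_rng(0)
ybar = np.sin                     # Bayes optimal predictor: ybar(x) = E[y|x]
sigma = 0.3                       # label noise standard deviation
x_test = np.linspace(0, 3, 200)   # test points (x taken uniform on [0, 3])

preds = []
for _ in range(2000):             # 2000 training sets D, each of size 20
    x = rng.uniform(0, 3, size=20)
    y = ybar(x) + rng.normal(0, sigma, size=20)
    coeffs = np.polyfit(x, y, deg=1)           # h_D = least-squares line
    preds.append(np.polyval(coeffs, x_test))   # h_D evaluated at the test points
preds = np.array(preds)

hbar = preds.mean(axis=0)                        # hbar(x) = E_D[h_D(x)]
variance = preds.var(axis=0).mean()              # E_x E_D[(h_D(x) - hbar(x))^2]
bias_sq = ((hbar - ybar(x_test)) ** 2).mean()    # E_x[(hbar(x) - ybar(x))^2]
noise = sigma ** 2                               # E_{x,y}[(ybar(x) - y)^2]

# Left-hand side: expected test error, estimated with fresh noisy labels
y_test = ybar(x_test) + rng.normal(0, sigma, size=preds.shape)
test_err = ((preds - y_test) ** 2).mean()
print(test_err, bias_sq + variance + noise)      # the two numbers should roughly agree
```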
noise formal value
E_D E_{x,y~P} (ybar(x)-y)^2 = E_{x,y~P} (ybar(x)-y)^2 (the expectation over D can be dropped because nothing in the term depends on D)
why does 2 ED Ex,y~P [(hD(x) - hbar(x)) (hbar(x) - y)] = 0
In E_D[(hD(x) - hbar(x)) (hbar(x) - y)], the expectation over D only applies to hD; everything else is constant with respect to D. Since E_D[hD(x)] = hbar(x), the first factor has expectation zero, hence the whole term is 0.
wbar for fixed design LR is
wbar = E[w^] = E_e[w^], because in the fixed design setting the noise e is the only source of randomness in the dataset.
expected test error given A (the algorithm)
E_D E_{x,y~P} (hD(x) - y)^2, with hD = A(D). Now we don't know which dataset we will get, so we also take the expectation over all datasets D.
bias-variance decomposition
E_D Ex,y~P (hD(x)-y)^2 (expected test error) = bias^2 + variance + noise
hbar of a bagged classifier
E_{Di}[1/m SUM_i hDi] = E_{Di~D}[hDi] ≈ E_{D~P}[hD] = hbar (~ means "drawn from": bootstrapped sets Di drawn from D approximate datasets drawn from P, so the bagged classifier approximates hbar)
fixed design ridge linear regression generalization error
E_e [ SUM over i=1...n of (w^T xi - w*T xi)^2 ]. (Recall that in the lecture notes the generalization error was w.r.t. one (x, y) pair, but here we consider it over ALL n fixed design points, hence the SUM.) w^ is our learned regressor, w* is the Bayes optimal one; the expectation is w.r.t. the noise e because that's the only randomness.
decomposing fixed design ridge LR generalization error
E_e SUM over i=1...n of (w^T xi - w*T xi)^2 = SUM over i=1...n of E_e (w^T xi - w*T xi)^2, and each term decomposes into BIAS^2 + VARIANCE (no noise term here, because we compare against w*T xi, the noiseless Bayes prediction, rather than a noisy label y). Hence the overall bias^2 and variance terms each contain a sum over the design points.
expected test error given hD
E_{x,y~P} (hD(x) - y)^2, because we're given hD = A(D): the dataset is known (there is only one), so no expectation over D is needed.
Ybar(x)
ybar(x) = E_{y|x}[y] = integral over y of y * P(y|x); also written E[Y|X=x]. It is the Bayes optimal predictor (for squared loss)!
expectation of w^ from fixed design ridge LR
Ee[w^] = (XXT + lambda I)^-1 X [XT w* + Ee[e]]; e is Gaussian noise with mean 0, so Ee[w^] = (XXT + lambda I)^-1 XXT w* = w* - lambda (XXT + lambda I)^-1 w* (see the matrix identity below).
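The last simplification uses a standard matrix identity; a short check in the same notation:

$$(XX^\top + \lambda I)^{-1} XX^\top = (XX^\top + \lambda I)^{-1}\big((XX^\top + \lambda I) - \lambda I\big) = I - \lambda\,(XX^\top + \lambda I)^{-1},$$

so Ee[w^] = w* - lambda (XXT + lambda I)^-1 w*, which tends to w* as lambda -> 0 (no bias) and to 0 as lambda -> inf.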
model capacity (x-axis) vs test and train error (y-axis)
MORE CAPACITY = MORE COMPLEX. Training error decreases monotonically; test error forms a parabola (U shape). The left side is underfitting / high bias (regime 2): both test and train errors are large. The right side is overfitting / high variance (regime 1): train error is low but test error is large.
complex models under or overfit?
Overfit
techniques for finding the best lambda/hyperparameters
Hyperparameter search that minimizes validation error: grid/random search (with k-fold cross validation or some validation set), telescopic search (with k-fold cross validation or some validation set), Bayesian optimization. In each case, for every lambda/setting in some candidate list you run k-fold cross validation.
bias term of fixed design ridge LR equation and behavior as lambda -> inf or 0
SUM over i=1...n of ((wbar - w*)^T xi)^2 [because hbar(xi) = wbar^T xi (h is the regressor) and ybar(xi) = w*^T xi (since yi = w*^T xi + ei); the SUM comes from considering all n design points, not just one test point]
= lambda^2 (w*)^T (XXT + lambda I)^-1 XXT (XXT + lambda I)^-1 w*.
Using the eigendecomposition XXT = U diag(sigma_i) UT, this equals (w*)^T U E UT w*, where E is diagonal with entries sigma_i / (1 + sigma_i/lambda)^2.
As lambda -> inf, the diagonal entries go to sigma_i, so bias^2 -> SUM_i (xi^T w*)^2.
As lambda -> 0, the diagonal entries go to 0, so bias^2 -> 0.
variance of fixed design LR equation and behavior as lambda -> inf or 0
SUM over i=1...n of E_e (w^T xi - wbar^T xi)^2 [where wbar^T xi is hbar(xi) and w^T xi is hD(xi); the SUM comes from summing over all n design points]
= SUM_i sigma_i^2 / (sigma_i + lambda)^2.
As lambda -> inf, variance -> 0, because w^ -> 0 no matter what the noise is, so there is hardly any randomness left.
As lambda -> 0, each sigma_i^2 / (sigma_i + lambda)^2 -> 1, so variance -> d (assuming XXT has full rank d). A numerical check follows below.
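A minimal simulation sketch (not from the notes) that checks these closed-form bias^2 and variance expressions against a Monte Carlo estimate; the dimensions (d = 5, n = 50), lambda = 1, and unit noise variance are assumptions for illustration:

```python
# Compare the eigenvalue formulas for fixed-design ridge bias^2 and variance
# with a brute-force Monte Carlo estimate over re-drawn noise.
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 50, 1.0
X = rng.normal(size=(d, n))           # columns are the fixed design points x_i
w_star = rng.normal(size=d)

# Closed-form quantities via the eigendecomposition of X X^T
sig, U = np.linalg.eigh(X @ X.T)      # eigenvalues sigma_i, eigenvectors u_i
proj = U.T @ w_star                   # components of w* in the eigenbasis
bias_sq_formula = np.sum(lam**2 * sig / (sig + lam)**2 * proj**2)
var_formula = np.sum(sig**2 / (sig + lam)**2)

# Monte Carlo: redraw the noise, solve ridge, look at predictions X^T w_hat
A = np.linalg.solve(X @ X.T + lam * np.eye(d), X)   # (XX^T + lam I)^{-1} X
preds = []
for _ in range(20000):
    y = X.T @ w_star + rng.normal(size=n)           # y_i = w*^T x_i + e_i
    preds.append(X.T @ (A @ y))                     # predictions at design points
preds = np.array(preds)

bias_sq_mc = np.sum((preds.mean(axis=0) - X.T @ w_star) ** 2)
var_mc = np.sum(preds.var(axis=0))
print(bias_sq_formula, bias_sq_mc)    # should roughly agree
print(var_formula, var_mc)            # should roughly agree
```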
do bias, noise and variance describe the training or test set?
TRAINING SET! So we are able to adjust and fine-tune the model BEFORE release into production. Makes sense, since high bias is related to simple models = underfitting = high training error, while complex models = overfitting = low training error. (Noise, though, is technically part of both the training and the test set, since it is data-intrinsic.)
does changing the feature representation affect noise
YES. Example: if all features are removed, the error is only noise and it is very large. Recall that noise contains ybar(x), i.e., what the Bayes optimal predictor would do; if you change the representation of x, you change ybar(x). Making the feature representation more accurate will reduce the noise (noise can occur due to errors/ambiguity in the features).
h is WHAT
A CLASSIFIER (or regressor), i.e., a function of x! So it's never going to be just w; it'll be something like h(x) = wT x.
telescopic search
A way to find the best lambda: 1st, find the best order of magnitude for lambda; 2nd, do a fine-grained search around the best lambda found so far. Example: try 0.01, 0.1, 1, 10, 100; if 10 is best, try 5, 10, 15, 20, ..., 95. For each lambda, run k-fold cross validation (or just use a validation set). See the sketch below.
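A minimal sketch of telescopic search (not from the notes); cv_error(lam) is a hypothetical function that returns the k-fold cross-validation error for a given lambda:

```python
# Two-stage (telescopic) search: coarse over orders of magnitude, then fine.
import numpy as np

def telescopic_search(cv_error):
    # Stage 1: best order of magnitude
    coarse = [10.0 ** k for k in range(-2, 3)]            # 0.01, 0.1, 1, 10, 100
    best_coarse = min(coarse, key=cv_error)
    # Stage 2: fine-grained grid around the best coarse value
    # (e.g. best_coarse = 10 gives 5, 10, 15, ..., 95)
    fine = np.arange(best_coarse / 2, best_coarse * 10, best_coarse / 2)
    return min(fine, key=cv_error)
```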
grid search
A way to find the best lambda, and usable when you have multiple hyperparameters: fix a set of values for each parameter and try out each combination (i.e., try out a grid). This can be bad because the number of settings grows exponentially with the number of parameters, and the model may be insensitive to one parameter, so you waste time fiddling with it. For each lambda / set of parameters, run k-fold cross validation (or just use a validation set). See the sketch below.
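A minimal grid-search sketch (not from the notes); cv_error(params) is a hypothetical function that returns the k-fold cross-validation error for a dict of hyperparameter values:

```python
# Try every combination of candidate values; cost is exponential in the
# number of hyperparameters (e.g. 5 lambdas x 4 Cs = 20 settings).
from itertools import product

def grid_search(cv_error, grid):
    # grid: dict mapping hyperparameter name -> list of candidate values
    names = list(grid)
    best_params, best_err = None, float("inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        err = cv_error(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params

# Usage (hypothetical):
# grid_search(cv_error, {"lam": [0.01, 0.1, 1, 10, 100], "C": [0.1, 1, 10, 100]})
```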
random search
A way to find the best lambda, and usable when you have multiple hyperparameters: select hyperparameters by randomly generating them within pre-defined intervals (instead of from a grid). For a fixed budget this explores MORE distinct values of each individual hyperparameter, and the search cost is NOT exponential in the number of hyperparameters. See the sketch below.
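A minimal random-search sketch (not from the notes), reusing the same hypothetical cv_error(params); the log-uniform sampling intervals are assumptions for illustration:

```python
# Sample hyperparameters at random within pre-defined intervals; the budget
# (n_trials) is fixed and does not grow with the number of hyperparameters.
import numpy as np

def random_search(cv_error, n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    best_params, best_err = None, float("inf")
    for _ in range(n_trials):
        params = {
            "lam": 10.0 ** rng.uniform(-2, 2),   # log-uniform in [0.01, 100]
            "C":   10.0 ** rng.uniform(-1, 2),   # log-uniform in [0.1, 100]
        }
        err = cv_error(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params
```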
regime 1 high variance remedies
Add more training data, reduce model complexity, bagging. (Regime 1 = OVERFITTING!)
how does changing lambda in regularizer affect training/validation error
As lambda increases, the model gets simpler, so the training error increases over time = underfitting (looks like the right half of an x^2 curve). The validation error forms a parabola: it first decreases and then also increases with greater lambda. See https://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote13.html
weighted average of a function (aka h bar)
The average of all the functions hD, taken at every point: hbar(x) = E[hD(x)], i.e., a (weighted) average at each point x.
why does 2ED Ex,y~P (hbar(x)-ybar(x))(ybar(x)-y) go to 0
Because P(x, y) = P(x) P(y|x), i.e., x ~ P(x) and y ~ P(y|x), so E_{x,y~P} = E_x E_{y|x}. We can also remove E_D because nothing in the term depends on D. So 2 E_D E_{x,y~P} (hbar(x)-ybar(x))(ybar(x)-y) = 2 E_x E_{y|x} (hbar(x)-ybar(x))(ybar(x)-y) = 2 E_x [(hbar(x)-ybar(x)) * (ybar(x) - E_{y|x}[y])], and E_{y|x}[y] is exactly ybar(x), hence the term is zero!
why is ybar needed
Because, especially in regression, even if you have the same x you don't always get the same y (label); hence the expectation over y given x!
why do the equations for bias/noise/variance use test points if bias/noise/variance are defined for the training set?
Because they form the theoretical decomposition of the expected test error if the model were pushed to production, so they use expected values and test points. They are a metric for evaluating how good a model is BEFORE release, so it can be tuned in case of high bias or high variance.
why are models that overfit low bias?
Because overfitting means high accuracy and LOW ERROR on the training set: the model is flexible enough that, averaged over datasets, it tracks the underlying function closely, which is low bias.
as number of data points increases to inf, what happens to bias/variance
Bias stays constant, because bias is caused by the modeling assumptions, which don't change. (THIS IS UNRELATED TO THE "# of training instances vs. training/testing error" graph: bias in that graph is NOT increasing, it's just higher than the variance.) Also, bias is an expectation that is NOT taken w.r.t. D, so all datasets are already averaged into hbar; changing the current D won't matter because we already considered all of them. Variance decreases, because more data = more likely to find the underlying pattern (harder to memorize, so less overfitting; the model is forced to fit the actual trends).
increasing lambda does what wrt bias/variance
bias is monotonically increasing with lambda variance is monotonically decreasing with lambda
bias may refer to
bias squared ASSUME BIAS SQUARED!!!!!!!!!!!!!!!!!!!!!!!!!
as model complexity increases, what happens to bias
decreases
does boosting increase variance
It depends on the type of weak learner relative to the amount of training data, but generally yes, because boosting produces a more complex and more powerful model.
advantage and disadvantage of k-fold cross validation with k->n
The error estimate improves because each fold trains on more data, but the validation procedure becomes slower (k -> n means n model fits, i.e., leave-one-out).
what does h bar represent
The expected behavior of the ERM solution h_D. Since the ERM solution is random (D is random), hbar = E_D[h_D], i.e., hbar(x) = E_D[h_D(x)] for all x.
the bias represents a component of error that is intentionally added by the data scientist
False. You're trying to find the right modeling assumptions; you aren't intentionally trying to get it wrong. Bias is a consequence of the chosen hypothesis class, not an error added on purpose.
fixed design ridge linear regression
Features x1...xn are fixed (no randomness in x); yi = w*T xi + ei with ei ~ N(0,1), so the only randomness in the dataset (xi, yi) is from the noise ei. A larger lambda in the regularizer means a simpler model = larger bias, smaller variance; when lambda -> inf, w^ -> 0.
what happens for fixed design ridge LR when we have 5000 datasets and, for a given lambda, solve ridge LR on each dataset to get w^1...w^5000, then estimate the mean of these, plot them, and compare to w*?
For lambda = 0, the mean is exactly the optimal w*, and the plotted w^'s are fairly spread out. Greater lambda moves the mean away from w* (bias), and the cloud of dots shrinks in size (lower variance). A simulation sketch follows below.
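A minimal simulation sketch of this experiment (not from the notes); d = 2 (so the w^'s could be plotted in 2-D), n = 30 design points, unit noise variance, the particular w*, and 5000 datasets per lambda are assumptions for illustration:

```python
# Re-draw the noise 5000 times per lambda, solve ridge each time, and look at
# the mean and spread of the resulting w_hat's.
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 30
X = rng.normal(size=(d, n))                  # fixed design (columns x_i)
w_star = np.array([1.0, -2.0])

def ridge_w_hats(lam, n_datasets=5000):
    A = np.linalg.solve(X @ X.T + lam * np.eye(d), X)   # (XX^T + lam I)^{-1} X
    w_hats = []
    for _ in range(n_datasets):
        y = X.T @ w_star + rng.normal(size=n)            # fresh noise = fresh dataset
        w_hats.append(A @ y)
    return np.array(w_hats)

for lam in [0.0, 1.0, 10.0, 100.0]:
    W = ridge_w_hats(lam)
    print(lam, W.mean(axis=0), W.std(axis=0))
    # lam = 0: mean ~ w_star with a large spread; larger lam: mean drifts away
    # from w_star (more bias) while the spread shrinks (less variance).
```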
gaussian vs logistic regression bias/variance
Gaussian (a generative model estimating many simple 1-D probability distributions): high bias because its distributional assumptions are usually not quite right, low variance because only those many simple 1-D distributions are estimated. Logistic regression: low bias in comparison, because it makes no (possibly incorrect) distributional assumptions about the data, and higher variance because the fit depends more on the entire dataset's variation.
tuning lambda allows you to control what
generalization error
generalization error for linear regression w squared loss
Given a dataset D, hypothesis class H, squared loss (h(x) - y)^2, and hD = the ERM solution: generalization (expected test) error = E_D E_{x,y~P} (h_D(x) - y)^2
formal definition of variance
hbar = E_D[hD] (the average ERM solution); ybar(x) = E[y|x] (Bayes optimal). Variance is the difference between hD and hbar: E_D E_x (hD(x) - hbar(x))^2, i.e., the fluctuation of our random model around its mean!
formal definition of bias^2
hbar = E_D[hD] (the average ERM solution); ybar(x) = E[y|x] (Bayes optimal). Bias^2 is the difference between hbar and the best predictor ybar: E_x [(hbar(x) - ybar(x))^2]. THIS IS THE SQUARE OF THE BIAS: the difference between the ideal case and what you're doing on average, i.e., the difference between our mean model and the best.
what is hbar(x) when h(x) = w0
hbar(x) = E_y[y], the true mean of the y's (each hD predicts the sample mean of the y's in its dataset, and the expected sample mean is the true mean).
uniform random sampling of labels bias/variance
High bias (because it is highly inaccurate on average); high variance (because the predicted labels change a lot across datasets/runs, since they are random).
underfitting impact on bias/variance
High bias (a narrow range of models to choose from, e.g., linear only), i.e., biased towards TOO SIMPLE models. Low variance (models trained on different datasets from the same distribution look relatively similar).
regime 2 (high bias) meaning and symptoms
High bias is causing the underperformance. Test error and training error are close, and both are ABOVE the desired tolerance epsilon. SYMPTOM: training error is higher than epsilon (and test error is also above epsilon). So both train and test errors are high, because of UNDERFITTING!
how do validation/training error compare for underfitting
high validation error, high training error
how do validation/training error compare for overfitting
high validation error, low training error
regime 1 (high variance) meaning & symptoms
High variance is causing the poor performance. Training error is below the error threshold epsilon, but test error is much higher (small train error, large test error). Symptoms: 1. training error is much lower than test error; 2. training error is lower than epsilon; 3. test error is above epsilon. The cause is OVERFITTING to noise!
overfitting impact on validation and training error
higher and higher validation error but lower and lower training error as we memorize the dataset
intuitive meaning of variance
How much your classifier changes if you train on a different training set, i.e., how overfit is your classifier to a specific training set? How far is it from the average classifier? RECALL THE BULLSEYE: variance is the SPREAD!
why does test error typically have a parabola shape if x-axis is lambda
If the x-axis is lambda, then for a while increasing lambda helps (simpler models = less overfitting), until the model starts underfitting and the test error rises again.
graphical illustration of bias/variance
Imagine a target: low bias / low variance = points clustered at the bullseye; low bias / high variance = points scattered around the bullseye; high bias / low variance = points clustered away from the bullseye; high bias / high variance = points scattered far from the bullseye.
adding more data does what to training error
increases
as model complexity increases , what happens to variance
increases
noise
Independent of the model: it is data-intrinsic! It measures ambiguity due to the data distribution and feature representation (an aspect of the DATA): the difference between what the Bayes optimal predictor does and the actual label. It is unavoidable; it cannot be reduced by choosing a better model.
inf lambda means what for bias/variance
Infinite lambda = maximal bias, (near) zero variance
intuitive meaning of bias
The inherent error of the classifier, even with infinite training data, due to the model being biased towards a specific kind of solution (e.g., biased towards linear classifiers). You can think of it as: (1) related to training accuracy, since it is CAUSED by the modeling ASSUMPTIONS; (2) how much our expected model differs from the underlying model (RECALL THE BULLSEYE: bias is nearness to the bullseye); (3) roughly the opposite of expressivity/capacity/flexibility/complexity (higher complexity/non-linearity = more freedom to fit the underlying model = less bias).
logistic regression on linearly separable and non-linearly separable data variance/bias
Linearly separable data: low bias (a linear solution suits linearly separable data), low variance. Non-linearly separable data: high bias (a linear solution is not accurate for non-linear data), low variance. ALWAYS LOW VARIANCE (like most linear models).
overfitting
The model is too complex and fits the noise too closely, so it can't generalize well to unseen test examples.
can we practically calculate the lambda that minimizes bias + variance + noise
No, because you would most likely need w* (the optimal predictor) or the true distribution, which you don't have. So use techniques like cross validation instead.
overfitting impact on bias and variance
No strong bias (the hypothesis class is often expressive, e.g., polynomials UP TO a certain order such as 5th-order); high variance (models trained on different datasets from the same distribution look different).
what is often bias/variance of linear models (linear SVM, logistic regression, etc.)
Often high bias (most data is not linearly separable) and low variance (simple models). A linear classifier has a linear decision boundary; these boundaries don't bend themselves around outliers (the separating hyperplane is based on the shape of the majority of the data, not really on outliers), which generalizes well = low variance.
is high variance associated with overfitting or underfitting
Overfitting: with different datasets, your model changes wildly because it is focused on the NOISE!
how to detect regime 1/regime 2 (issues with high var/bias) in general?
Plot the test and training error against the number of training points/instances. Training error is concave, curving upwards to the right; test error is convex, decreasing to the right. The left side is overfitting (regime 1, high variance); the right side is the high-bias regime (regime 2). BUT THIS DOES NOT MEAN THAT BIAS INCREASES WITH TRAINING DATA.
Bayes optimal regressor in linear regression
Predict E[y|x], i.e., the INTEGRAL over y of y * P(y|x) (because for every feature vector x, like a house's details, there is a distribution over labels y, e.g., the selling price).
how to select best model from data
Select the right order of polynomial for regression; select the right lambda for ridge regularization; select the right penalty for slack variables in soft-margin SVM (i.e., C).
bias/variance for kNN with small k vs. large k
small k: low bias (close points are more accurate), high variance (no averaging effect) large k: high bias (using points that aren't actually close), low variance (averaging effect)
what does bias-variance decomposition decompose
The squared error E_D E_{x,y} (hD(x)-y)^2. If the loss were not squared, the decomposition would not work, because you couldn't complete the square (the cross terms wouldn't cancel).
how do testing and training error change as the # training instances increases (size of training set increases) for fixed model complexity
Test error curves downwards and asymptotically approaches the acceptable test error epsilon (hopefully). Training error curves upwards, because it is harder to (over)fit the data when you keep adding training points: at some point your fixed model complexity is no longer enough to match the complexity of the data.
what is hD
The ERM solution; more generally, given a machine learning algorithm A, hD = A(D) (i.e., the classifier you get by running the algorithm on dataset D).
what is variance a property of
the algorithm itself
purpose of k fold cross validation - do you still need a test set
To estimate the validation error, especially when you have little data. This validation error allows you to tune your model and change hyperparameters. You STILL NEED A TEST SET to get a true idea of the generalization error, but only after tuning; never use the test set for tuning!
underfitting
too simple model cannot capture trend in data
underfitting impact on validation and training error
The model is too simple and can't even account for the provided data, so both training and validation error are high.
relationship between model complexity (x-axis, aka decreasing lambda) and error (y-axis)
Total error is a parabola, so there is an optimum. Variance curves upwards (grows with complexity); bias^2 curves downwards (shrinks with complexity). Roughly where they intersect is where the minimal total error is, so there is some lambda that minimizes it!
why adding more training data does not always help reduce your testing error below a desired threshold epsilon > 0
Training error is (roughly) a lower bound on test error, and adding more data increases training error; if your training error is already too high (> epsilon), adding more data won't bring the test error below epsilon, because it is bounded by the training error. A BETTER EXPLANATION: if you ALREADY HAVE high bias, adding more data cannot change that bias, because your modeling assumptions haven't changed, so your test error plateaus above epsilon.
overfitting means the learned classifier is too specific to the training data
true
underfitting means that the learned classifier is not expressive enough and produces high training/validation error
true
If we have a validation set, we do not need to do k-fold cross validation for hyperparameter tuning
True. Both are forms of validation, and you only need one of them to tune hyperparameters.
the variance measures the randomness in the trained classifier/regressor caused by randomness in the training set
True. The data is random, so a model fit to the data will also vary with different data: is your model (roughly) the same across datasets or not? High variance is usually bad, because the model is too sensitive to the particular dataset.
the noise measures the error that the bayes-optimal classifier/regressor would have
True. Labels can be noisy, so there is inherent randomness in the data; even if you knew the actual distribution you would still not be 100% accurate. You can't control the noise.
is high bias associated with underfitting or overfitting
Underfitting: your model is biased towards too narrow a selection of hypotheses, i.e., a too simple, restricted model that is highly biased towards its assumptions (e.g., linearity).
how to select right lambda for ridge LR
Use k-fold cross-validation for every lambda in a candidate set and choose the lambda with the smallest average validation error. A sketch follows below.
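A minimal sketch of this procedure (not from the notes); it assumes a data matrix X with rows as examples, labels y, squared loss, and the particular candidate lambda list shown:

```python
# Select lambda for ridge regression by k-fold cross-validation.
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_cv_error(X, y, lam, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]                                    # i-th fold = validation
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))  # avg squared validation error
    return np.mean(errs)                                  # average over the k folds

def select_lambda(X, y, lambdas=(0.01, 0.1, 1.0, 10.0, 100.0)):
    return min(lambdas, key=lambda lam: kfold_cv_error(X, y, lam))
```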
regime 2 high bias remedies
Use a more complex model (kernelize, use something non-linear), add features, boosting. (Regime 2 = UNDERFITTING.)
does using more/less features help overfitting/underfitting
Using fewer features helps with overfitting; using more features helps with underfitting.
lambda (x-axis) vs test/validation error and train error (y-axis)
Validation error forms a parabola sitting above the training error curve; training error curves upwards (increases with lambda). This means the left-hand side (small lambda) is overfitting and the right-hand side (large lambda) is underfitting.
what happens when linear predictor is h(x) = w0
w0 models the mean of the y's in the data (i.e., the intercept b), because we're minimizing squared loss and have to predict a constant, and the constant that minimizes squared loss is the mean (quick check below). This is a good estimate of the true mean of the y's, hence low variance!
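A one-line check that the constant minimizing squared loss is the mean (standard calculus, not specific to these notes):

$$\frac{d}{dw_0}\sum_{i=1}^{n}(w_0 - y_i)^2 = 2\sum_{i=1}^{n}(w_0 - y_i) = 0 \quad\Longrightarrow\quad w_0 = \frac{1}{n}\sum_{i=1}^{n} y_i .$$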
fixed design ridge LR solution in terms of e
w^ = (XXT + lambda I)^-1 X (XT w* + e), because Y = XT w* + e (i.e., yi = w*T xi + ei). HENCE w^ is a random quantity due to e!
fixed design ridge LR objective in matrix/vector form
w^ = argmin over w of ||XT w - Y||^2 + lambda ||w||^2, where Y = XT w* + e because yi = w*T xi + ei. See the derivation below.
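Setting the gradient of this objective to zero recovers the closed-form solution stated on the previous card (same notation; X has the xi as its columns):

$$\nabla_w\big(\|X^\top w - Y\|^2 + \lambda\|w\|^2\big) = 2X\,(X^\top w - Y) + 2\lambda w = 0 \quad\Longrightarrow\quad \hat w = (XX^\top + \lambda I)^{-1} X\,Y .$$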
how does cross validation relate to generalization error of ridge LR
We perform ridge LR on all folds but the i-th; the i-th fold Di is used for the validation error, which is SUM over (x,y) in Di of (w^_i T x - y)^2 / |Di|, i.e., we use Di to test w^_i. This validation error is approximately E_{x,y~P} (w^_i T x - y)^2 (the TEST ERROR of w^_i). Averaging the validation error over all k folds approximately gives E_D [E_{x,y~P} (w^_D T x - y)^2] (AVERAGING OVER DATASETS), i.e., the generalization error of ridge LR with that lambda (the total-error curve in the lambda vs. error graph).
is hbar deterministic
yes the expectation removes all randomness
does adding more useful features change variance
Yes; variance will typically increase because you have more degrees of freedom, so you can fit more (including the noise).
can k-fold cross validation be used to select c for SVM
yes, same process as lambda
zero lambda means what for bias/variance
zero lambda = zero bias high variance