Bias-Variance Tradeoff

early stopping

A regularization method that ends model training before the training loss finishes decreasing: you stop training when the loss on a validation set starts to increase, i.e., when generalization performance worsens. Even if the optimizer (gradient descent or whatever) has not converged, stop once the validation error starts to rise.
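
A runnable toy sketch of the idea (my own illustration, not from the notes): gradient descent on a linear-regression training loss, stopped when the held-out validation loss stops improving. The patience rule and all constants are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 30
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=2.0, size=n)        # noisy labels

X_tr, y_tr = X[:100], y[:100]                         # training split
X_val, y_val = X[100:], y[100:]                       # held-out validation split

w = np.zeros(d)
lr, patience, bad_steps = 1e-3, 10, 0
best_val, best_w = np.inf, w.copy()
for step in range(10_000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr) # gradient of the training MSE
    w -= lr * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)      # loss on the validation set
    if val_loss < best_val:
        best_val, best_w, bad_steps = val_loss, w.copy(), 0
    else:
        bad_steps += 1                                 # validation loss went up
        if bad_steps >= patience:                      # stop before training converges
            break
w = best_w                                             # keep the best-validation weights
print(f"stopped at step {step}, validation MSE {best_val:.3f}")
```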

more complex models have lower

Bias (generally, if you don't know the underlying distribution), because a more complex model has more freedom to capture the underlying function.

derivation of bias-variance decomposition

E_D E_x,y~P (h_D(x) - y)^2   [using the add-and-subtract trick with z = hbar(x)]
= E_D E_x,y~P (h_D(x) - hbar(x))^2 + E_D E_x,y~P (hbar(x) - y)^2 + 2 E_D E_x,y~P [(h_D(x) - hbar(x))(hbar(x) - y)]
The cross term goes to zero, so this is
= E_D E_x,y~P (h_D(x) - hbar(x))^2 + E_D E_x,y~P (hbar(x) - y)^2
= VARIANCE + E_D E_x,y~P (hbar(x) - y)^2   [using the trick again with z = ybar(x)]
= VARIANCE + E_D E_x,y~P (hbar(x) - ybar(x))^2 + E_D E_x,y~P (ybar(x) - y)^2 + 2 E_D E_x,y~P [(hbar(x) - ybar(x))(ybar(x) - y)]
The second cross term also goes to zero, so this is
= VARIANCE + BIAS^2 + E_D E_x,y~P (ybar(x) - y)^2   [the last term is the NOISE]
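
A quick Monte Carlo sanity check of the decomposition (my own sketch; the setting, with x ~ Uniform[0,1], ybar(x) = sin(2*pi*x), Gaussian label noise, and a degree-1 least-squares fit as the ERM solution, is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                  # ybar(x) = E[y|x]
sigma = 0.3                                          # label noise std dev
n, n_datasets = 20, 2000

x_test = rng.uniform(size=5000)                      # test points x ~ P
y_test = f(x_test) + rng.normal(scale=sigma, size=x_test.size)

preds = np.empty((n_datasets, x_test.size))
for j in range(n_datasets):                          # draw many datasets D
    x = rng.uniform(size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    coeffs = np.polyfit(x, y, deg=1)                 # h_D = ERM over linear functions
    preds[j] = np.polyval(coeffs, x_test)            # h_D(x) on the test points

h_bar = preds.mean(axis=0)                           # hbar(x) = E_D[h_D(x)]
total = np.mean((preds - y_test) ** 2)               # E_D E_{x,y} (h_D(x) - y)^2
variance = np.mean((preds - h_bar) ** 2)             # E_D E_x (h_D(x) - hbar(x))^2
bias2 = np.mean((h_bar - f(x_test)) ** 2)            # E_x (hbar(x) - ybar(x))^2
noise = np.mean((f(x_test) - y_test) ** 2)           # E_{x,y} (ybar(x) - y)^2 ~ sigma^2
print(total, bias2 + variance + noise)               # the two should roughly match
```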

noise formal value

E_D E_x,y~P (ybar(x) - y)^2 = E_x,y~P (ybar(x) - y)^2 (the outer E_D can be dropped because nothing inside depends on D).

why does 2 ED Ex,y~P [(hD(x) - hbar(x)) (hbar(x) - y)] = 0

The expectation E_D only affects h_D(x); everything else is constant with respect to D. Since E_D[h_D(x)] = hbar(x), the factor (h_D(x) - hbar(x)) has expectation zero, so the whole term is 0.

wbar for fixed design LR is

wbar = E_D[w^] = E_e[w^], since the noise e is the only source of randomness in the dataset.

expected test error given A (the algorithm)

E_D E_x,y~P (h_D(x) - y)^2, where h_D = A(D). We don't know which dataset we will get, so we take the expectation over all datasets D.

bias-variance decomposition

E_D Ex,y~P (hD(x)-y)^2 (expected test error) = bias^2 + variance + noise

hbar of a bagged classifier

E_{D_i}[(1/m) SUM_i h_{D_i}] = E_{D_i~D}[h_{D_i}] ≈ E_{D_i~P}[h_{D_i}] = hbar ('~' means 'drawn from'): the bootstrap sets D_i are drawn from D, which approximates drawing them from P.

fixed design ridge linear regression generalization error

E_e [ SUM over i = 1...n of (w^T x_i - w*T x_i)^2 ]. (Recall that in the lecture notes the generalization error was w.r.t. one (x, y) pair, but we can also consider it over all n points, hence the sum.) Here w^ is our learned classifier and w* is the Bayes-optimal one; the expectation is w.r.t. the noise e because that is the only source of randomness.

decomposing fixed design ridge LR generalization error

E_e SUM over i = 1...n of (w^T x_i - w*T x_i)^2 = SUM over i of E_e (w^T x_i - w*T x_i)^2 = SUM over i of (bias_i^2 + variance_i); the noise term vanishes here because the target is ybar(x_i) = w*T x_i itself. Hence the overall bias and variance terms each contain a sum.

expected test error given hD

E_x,y~P (h_D(x) - y)^2, because we are given h_D = A(D), so the dataset is known (there is only one) and no expectation over D is needed.

Ybar(x)

ybar(x) = E_{y|x}[y] = integral over y of y * P(y|x), i.e., E[Y | X = x]. This is basically the Bayes-optimal predictor (under squared loss).

expectation of w^ from fixed design ridge LR

E_e[w^] = (XX^T + lambda I)^-1 X (X^T w* + E_e[e]). Since e is Gaussian noise with mean 0, E_e[w^] = (XX^T + lambda I)^-1 XX^T w* = w* - lambda (XX^T + lambda I)^-1 w*.
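
A quick numerical check that the two forms of the expectation agree (my own sketch; X, w*, and lambda = 0.5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 50, 0.5
X = rng.normal(size=(d, n))                       # columns are the fixed x_i
w_star = rng.normal(size=d)
A = X @ X.T + lam * np.eye(d)
form1 = np.linalg.solve(A, X @ X.T @ w_star)      # (XX^T + lam I)^-1 XX^T w*
form2 = w_star - lam * np.linalg.solve(A, w_star) # w* - lam (XX^T + lam I)^-1 w*
print(np.allclose(form1, form2))                  # True
```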

model capacity (x-axis) vs test and train error (y-axis)

More capacity = more complex model. Training error decreases as capacity grows; test error forms a parabola (U shape). The left side (low capacity) is Regime 2 = underfitting: both training and test error are large. The right side (high capacity) is Regime 1 = overfitting / high variance.

complex models under or overfit?

Overfit

techniques for finding the best lambda/hyperparameters

A parameter search that minimizes validation error: grid search or random search (with k-fold cross-validation or some validation set), telescopic search (likewise), Bayesian optimization. In each case, run k-fold cross-validation for every lambda/setting in some list and keep the one with the lowest average validation error.
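
A short scikit-learn sketch of grid search with 5-fold cross-validation (an illustration, not from the course; Ridge's alpha parameter plays the role of lambda and the dataset is synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}        # candidate lambdas
search = GridSearchCV(Ridge(), param_grid, cv=5,             # 5-fold cross-validation
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)          # lambda with the lowest average validation error
```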

bias term of fixed design ridge LR equation and behavior as lambda -> inf or 0

SUM over i of ((wbar - w*)^T x_i)^2 [this is the bias^2 because hbar(x_i) = wbar^T x_i (h is the classifier) and ybar(x_i) = w*^T x_i (since y_i = w*^T x_i + e_i); the sum comes from considering all n points, not just one test point]
= lambda^2 w*^T (XX^T + lambda I)^-1 XX^T (XX^T + lambda I)^-1 w*.
Using the eigendecomposition of XX^T, this equals w*^T U E U^T w*, where the diagonal of E has entries sigma_i / (1 + sigma_i/lambda)^2.
As lambda -> inf, the diagonal entries go to sigma_i, so the bias^2 goes to SUM over i of (x_i^T w*)^2.
As lambda -> 0, the diagonal entries go to 0, so the bias^2 goes to 0.

variance of fixed design LR equation and behavior as lambda -> inf or 0

SUM over i of E_e (w^T x_i - wbar^T x_i)^2 [where wbar^T x_i is hbar(x_i), w^T x_i is h_D(x_i), and the sum again runs over all n points]
= SUM over i of sigma_i^2 / (sigma_i + lambda)^2.
As lambda -> inf, the variance -> 0, because w^ -> 0 so there is hardly any randomness left.
As lambda -> 0, each term sigma_i^2 / (sigma_i + lambda)^2 -> 1, so the variance -> d.

do bias, noise and variance describe the training or test set?

The training set, so that we can adjust and fine-tune the model before releasing it into production. This makes sense because bias is tied to simple models = underfitting = high training error, while complex models = overfitting = low training error. (Noise is technically part of both the training and the test set, since it is data-intrinsic.)

does changing the feature representation affect noise

Yes. For example, if all features are removed, the error is pure noise and very large. Recall that the noise term contains ybar(x), i.e., what the Bayes-optimal predictor would do; if you change the representation of x, you change ybar(x). Making the feature representation more accurate will reduce the noise (noise can arise from errors in the features).

h is WHAT

A classifier, so it is never just w by itself; it will be something like h(x) = w^T x.

telescopic search

A way to find the best lambda. First find the best order of magnitude for lambda; second, do a finer-grained search around the best lambda found so far. Example: try 0.01, 0.1, 1, 10, 100; if 10 is best, try 5, 10, 15, 20, ..., 95. For each lambda, run k-fold cross-validation (or just use a validation set).
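
A small sketch of telescopic search with scikit-learn (an illustration, not from the course; the synthetic dataset and the grids are arbitrary, and Ridge's alpha plays the role of lambda):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

def avg_val_error(lam):
    # mean k-fold validation MSE for a given lambda
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

coarse = [0.01, 0.1, 1, 10, 100]                      # step 1: best order of magnitude
best = min(coarse, key=avg_val_error)
fine = [best * f for f in np.linspace(0.5, 9.5, 19)]  # step 2: finer grid around it
best = min(fine, key=avg_val_error)
print("chosen lambda:", best)
```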

grid search

A way to find the best lambda, and more generally to tune multiple hyperparameters: fix a set of candidate values for each parameter and try every combination (i.e., try out the grid). This can be bad because the number of settings grows exponentially with the number of parameters, and the model may be insensitive to one parameter, so you waste time fiddling with it. For each lambda / parameter combination, run k-fold cross-validation (or just use a validation set).

random search

A way to find the best lambda, and to tune multiple hyperparameters: select hyperparameters by sampling them randomly from pre-defined intervals (instead of from a grid). This explores more distinct values of each individual hyperparameter, and the search cost is not exponential in the number of hyperparameters.

regime 1 high variance remedies

This is overfitting. Remedies: add more training data, reduce model complexity, bagging.

how does changing lambda in regularizer affect training/validation error

As lambda increases the model gets simpler, so training error increases (eventually underfitting); this looks like the right half of an x^2 curve. Validation error forms a parabola: it decreases at first and then increases as lambda gets larger. See https://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote13.html

weighted average of a function (aka h bar)

The average of all the functions h_D, taken point by point: hbar(x) = E_D[h_D(x)], a (weighted) average of the predictions at each point x.

why does 2ED Ex,y~P (hbar(x)-ybar(x))(ybar(x)-y) go to 0

Because P(x, y) = P(x) P(y|x), i.e., x ~ P(x) and y ~ P(y|x), we can write E_{x,y~P} = E_x E_{y|x}. We can also drop E_D because nothing in the term depends on D. So 2 E_D E_x,y~P (hbar(x) - ybar(x))(ybar(x) - y) = 2 E_x [(hbar(x) - ybar(x)) (ybar(x) - E_{y|x}[y])], and E_{y|x}[y] = ybar(x), so the second factor is zero and the whole term vanishes.

why is ybar needed

Because, especially in regression, the same x does not always come with the same label y; there is a whole distribution P(y|x), hence the expectation.

why do the equations for bias/noise/variance use test points if bias/noise/variance are defined for the training set?

Because they form a theoretical measure of the test error the model would incur if pushed to production, which is why they use expectations over test points. It is a metric for evaluating how good a model is before release, so the model can be tuned in case of high bias or variance.

why are models that overfit low bias?

Because overfitting means high accuracy and low error on the training set; a model complex enough to overfit has the flexibility to capture the underlying function, so its bias is low.

as number of data points increases to inf, what happens to bias/variance

Bias stays constant, because bias is caused by the modeling assumptions, which do not change. (This is unrelated to the '# of training instances vs. training/test error' graph; the bias in that graph is not increasing, it is just higher than the variance.) Also, bias is an expectation that is not taken w.r.t. D: all possible datasets are already averaged into hbar, so changing the current D does not matter. Variance decreases, because more data makes it more likely to find the underlying pattern: the data is harder to memorize, so there is less overfitting and the model is forced to fit the actual trends.

increasing lambda does what wrt bias/variance

Bias is monotonically increasing with lambda; variance is monotonically decreasing with lambda.

bias may refer to

Bias squared; when 'bias' appears in the decomposition, assume it means bias^2.

as model complexity increases, what happens to bias

decreases

does boosting increase variance

It depends on the type of weak learner relative to the amount of training data, but boosting generally increases variance because the combined model is more complex and more powerful.

advantage and disadvantage of k-fold cross validation with k->n

The error decreases because each model is trained on more data, but the validation procedure becomes slower (k -> n means training n models, i.e., leave-one-out).

what does h bar represent

The expected behavior of the ERM solution h_D: since the ERM solution is random (it depends on the random dataset D), hbar = E_D[h_D], i.e., hbar(x) = E_D[h_D(x)] for all x.

the bias represents a component of error that is intentionally added by the data scientist

False. Bias comes from the modeling assumptions; you are trying to find the right assumptions, not intentionally trying to get them wrong.

fixed design ridge linear regression

The features x_1, ..., x_n are fixed (no randomness); y_i = w*^T x_i + e_i with e_i ~ N(0, 1), so the only randomness in the dataset (x_i, y_i) comes from the noise e_i. A larger lambda in the regularizer means a simpler model, hence larger bias and smaller variance; as lambda -> inf, w^ -> 0.

what happens for fixed design ridge LR when we have 5000 datasets and, for a given lambda, solve ridge LR on each dataset to get w^_1, ..., w^_5000, then estimate the mean of these, plot them, and compare to w*?

For lambda = 0, the mean is exactly the optimal w*, and the plotted w^'s are fairly spread out. A greater lambda moves the mean away from w* and shrinks the circle of dots.
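
A runnable version of this experiment (my own sketch; d = 2, n = 100, the particular w*, and the lambda grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_datasets = 2, 100, 5000
X = rng.normal(size=(d, n))                        # fixed design: the x_i never change
w_star = np.array([1.0, -2.0])

for lam in [0.0, 1.0, 10.0, 100.0]:
    A_inv = np.linalg.inv(X @ X.T + lam * np.eye(d))
    w_hats = np.empty((n_datasets, d))
    for j in range(n_datasets):
        e = rng.normal(size=n)                     # the only randomness is the noise
        Y = X.T @ w_star + e
        w_hats[j] = A_inv @ (X @ Y)                # ridge solution for this dataset
    mean_w = w_hats.mean(axis=0)
    spread = w_hats.std(axis=0).mean()
    # lambda = 0: mean ~= w*, large spread; larger lambda: mean shrinks away
    # from w* (more bias) and the cloud of w_hats tightens (less variance).
    print(lam, mean_w, spread)
```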

gaussian vs logistic regression bias/variance

The Gaussian model has high bias because its assumptions are usually not right, and low variance because it only estimates many one-dimensional probability distributions. Logistic regression has low bias in comparison, because it makes no possibly-incorrect assumptions about the data, and high variance, because it depends more on the variation of the entire dataset.

tuning lambda allows you to control what

generalization error

generalization error for linear regression w squared loss

Given a dataset D, a hypothesis class H, the squared loss (h(x) - y)^2, and h_D = the ERM solution, the generalization (expected test) error is E_D E_x,y~P (h_D(x) - y)^2.

formal definition of variance

With hbar = E_D[h_D] (the average ERM solution) and ybar(x) = E[y|x] (the Bayes-optimal prediction), the variance is the expected squared difference between h_D and hbar: E_D E_x (h_D(x) - hbar(x))^2, i.e., the fluctuation of our random model around its mean.

formal definition of bias^2

With hbar = E_D[h_D] (the average ERM solution) and ybar(x) = E[y|x] (the Bayes-optimal prediction), bias^2 is the squared difference between hbar and the best predictor ybar: E_x [(hbar(x) - ybar(x))^2]. Note this is the square of the bias: the gap between our average model and the ideal one.

what is hbar(x) when h(x) = w0

hbar(x) = E_y[y], the true mean of the labels: each h_D predicts the mean of the y's in its dataset, and averaging over datasets gives the true mean.

uniform random sampling of labels bias/variance

High bias (the predictions are highly inaccurate) and high variance (the labels/predictions change a lot across datasets, since they are random).

underfitting impact on bias/variance

High bias (a narrow range of models to choose from, e.g., linear only), i.e., the model is biased towards too-simple hypotheses. Low variance (models trained on different datasets from the same distribution look relatively similar).

regime 2 (high bias) meaning and symptoms

High bias is causing underperformance. Test error and training error are close, and both are above the desired tolerance epsilon. Symptom: training error is higher than epsilon (test error is also above epsilon). Both train and test errors are high because of underfitting.

how do validation/training error compare for underfitting

high validation error, high training error

how do validation/training error compare for overfitting

high validation error, low training error

regime 1 (high variance) meaning & symptoms

High variance is causing poor performance: training error is below the error threshold epsilon but test error is much higher (small train error, large test error). Symptoms: 1. training error is lower than test error; 2. training error is lower than epsilon; 3. test error is above epsilon. This happens because of overfitting to noise.

overfitting impact on validation and training error

Validation error gets higher and higher while training error gets lower and lower as the model memorizes the training set.

intuitive meaning of variance

How much your classifier changes if you train on a different training set, i.e., how overfit it is to a specific training set and how far it is from the average classifier. Recall the bullseye picture: variance is the spread.

why does test error typically have a parabola shape if x-axis is lambda

If the x-axis is lambda, then increasing lambda helps for a while (simpler models overfit less), so test error decreases; past some point the model starts underfitting and test error rises again, giving the parabola.

graphical illustration of bias/variance

Imagine a target: low bias / low variance = points tightly clustered at the bullseye; low bias / high variance = points scattered around the bullseye; high bias / low variance = points clustered away from the bullseye; high bias / high variance = points scattered far from the bullseye.

adding more data does what to training error

increases

as model complexity increases , what happens to variance

increases

noise

Independent of the model: it is data-intrinsic. It measures the ambiguity due to the data distribution and the feature representation (an aspect of the data), i.e., the difference between what the Bayes-optimal predictor outputs and the actual label. It is unavoidable and can never become zero.

inf lambda means what for bias/variance

Infinite lambda = large bias, little variance.

intuitive meaning of bias

The inherent error of the classifier on the training set, even with infinite training data, due to the model being biased towards a specific kind of solution (e.g., biased towards linear classifiers). Think of it as: (1) related to training accuracy; it is caused by the modeling assumptions, i.e., a training-accuracy issue. (2) How much our expected model differs from the underlying model (recall the bullseye: bias is nearness to the bullseye). (3) Roughly the inverse of expressivity/capacity/flexibility/complexity: higher complexity/non-linearity = more freedom to fit the underlying model = less bias.

logistic regression on linearly separable and non-linearly separable data variance/bias

Linearly separable data: low bias (a linear solution suits linearly separable data), low variance. Non-linearly separable data: high bias (a linear solution is not accurate for non-linear data), still low variance. Like most linear models, logistic regression has low variance in both cases.

overfitting

The model is too complex and fits the noise too perfectly, so it cannot generalize well to unseen test examples.

can we practically calculate the lambda that minimizes bias + variance + noise

No, because you would most likely need w*, the optimal classifier (which we do not have); instead use techniques like cross-validation.

overfitting impact on bias and variance

No strong bias (the hypothesis class is expressive, e.g., polynomials up to 5th order). High variance (models trained on different datasets from the same distribution look different).

what is often bias/variance of linear models (linear SVM, logistic regression, etc.)

Often high bias (most data are not linearly separable) and low variance (simple models). A linear classifier has a linear decision boundary, and such boundaries do not bend themselves around outliers: the separating hyperplane follows the shape of the majority of the data rather than individual outliers, which generalizes well = low variance.

is high variance associated with overfitting or underfitting

Overfitting: with different datasets your model changes wildly, because it is fitting the noise.

how to detect regime 1/regime 2 (issues with high var/bias) in general?

Plot the test and training error against the number of training points/instances. Training error is concave and curves upward to the right; test error is convex and decreases to the right. The left side (few points) corresponds to overfitting (high variance); the right side corresponds to underfitting (high bias). Note: this does not mean that bias increases with more training data.
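
A sketch of how to generate these curves with scikit-learn's learning_curve (an illustration, not from the course; the synthetic dataset and the ridge model are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=500, n_features=30, noise=10.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y, cv=5, scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 8))

train_err = -train_scores.mean(axis=1)   # curves upward as the training set grows
val_err = -val_scores.mean(axis=1)       # curves downward toward a plateau
for s, tr, va in zip(sizes, train_err, val_err):
    # large gap (val >> train): high variance / Regime 1;
    # both high and close together: high bias / Regime 2.
    print(s, round(tr, 1), round(va, 1))
```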

Bayes optimal regressor in linear regression

Predict E[y|x] = integral over y of y * P(y|x); for every feature vector x (e.g., housing details) there is a whole distribution of labels y (e.g., selling prices).

how to select best model from data

Select the right order of polynomial for regression; select the right lambda for ridge regularization; select the right penalty C on the slack variables in soft-margin SVM.

bias/variance for kNN with small k vs. large k

Small k: low bias (nearby points give accurate predictions), high variance (little averaging effect). Large k: high bias (it uses points that are not actually close), low variance (strong averaging effect).
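
A quick scikit-learn sketch of this effect (an illustration, not from the course; the synthetic dataset and the values of k are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# flip_y adds label noise so that k=1 has something to overfit to
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for k in [1, 15, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # k=1: near-perfect training accuracy, lower test accuracy (high variance).
    # very large k: training and test accuracy both drop (high bias).
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))
```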

what does bias-variance decomposition decompose

The expected squared error E_D E_x,y~P (h_D(x) - y)^2. If the loss were not squared, the decomposition would not work, because you could not complete the square (the cross terms would not vanish).

how do testing and training error change as the # training instances increases (size of training set increases) for fixed model complexity

Test error curves downward and asymptotically approaches the acceptable test error epsilon. Training error curves upward, because it gets harder to overfit when you keep adding training points: at some point your fixed model complexity is no longer enough to match the data, and the model can no longer memorize everything.

what is hD

The ERM solution; more generally, given a machine learning algorithm A, h_D = A(D), i.e., the classifier returned by running the algorithm on the dataset D.

what is variance a property of

the algorithm itself

purpose of k fold cross validation - do you still need a test set

To estimate the validation error, especially when you have little data. This validation error lets you tune your model and change hyperparameters. You still need a test set to get a true idea of the generalization error, but only after tuning; never use the test set for tuning.

underfitting

The model is too simple and cannot capture the trend in the data.

underfitting impact on validation and training error

Too simple a model cannot even account for the provided data, so both training and validation error are high.

relationship between model complexity (x-axis, aka decreasing lambda) and error (y-axis)

Total error is a parabola, so there is an optimum. Variance curves upward with complexity; bias^2 curves downward. Roughly where they intersect is where the total error is minimal, so there is some lambda that minimizes it.

why does adding more training data not always help reduce your test error below a desired threshold epsilon > 0

Training error is a lower bound on test error, and adding more data increases training error; if your training error is already too high (> epsilon), adding more data will not bring the test error down, because it is bounded by the training error. A better explanation: if you already have high bias, adding more data cannot change that bias, because your modeling assumptions have not changed, so the test error plateaus and will not drop below epsilon.

overfitting means the learned classifier is too specific to the training data

true

underfitting means that the learned classifier is not expressive enough and produces high training/validation error

true

If we have a validation set, we do not need to do k-fold cross validation for hyperparameter tuning

True; both are forms of validation, and you only need one of them to tune hyperparameters.

the variance measures the randomness in the trained classifier/regressor caused by randomness in the training set

True. The data is random, so a model fit to the data will also vary with different data; variance asks how much your model changes. High variance is usually bad, because it means you are too sensitive to the particular training data.

the noise measures the error that the bayes-optimal classifier/regressor would have

True. Labels can be noisy, so there is inherent randomness in the data; even if you knew the actual distribution, you would still not be 100% accurate. You cannot control the noise.

is high bias associated with underfitting or overfitting

Underfitting: your model is biased towards too narrow a selection of hypotheses, i.e., too simple and restricted; it is highly biased towards its assumptions (e.g., linearity).

how to select right lambda for ridge LR

Use k-fold cross-validation for every lambda in a candidate set and choose the lambda with the smallest average validation error.

regime 2 high bias remedies

This is underfitting. Remedies: use a more complex model (kernelize, use non-linear models), add features, boosting.

does using more/less features help overfitting/underfitting

Using fewer features helps with overfitting; using more features helps with underfitting.

lambda (x-axis) vs test/validation error and train error (y-axis)

Validation error forms a parabola above the training error; training error curves upward as lambda grows. The left-hand side (small lambda) is overfitting; the right-hand side (large lambda) is underfitting.

what happens when linear predictor is h(x) = w0

w0 models the mean of the y's in the data (i.e., the bias term b), because we are minimizing squared loss with a constant prediction, and the constant that minimizes squared loss is the mean. This is a good estimate of the true mean of the y's, hence low variance.

fixed design ridge LR solution in terms of e

w^ = (XX^T + lambda I)^-1 X (X^T w* + e), because Y = X^T w* + e (i.e., y_i = w*^T x_i + e_i). Hence w^ is a random quantity through e.

fixed design ridge LR objective in matrix/vector form

w^ = argmin over w of ||X^T w - Y||^2 + lambda ||w||^2, where Y = X^T w* + e (i.e., y_i = w*^T x_i + e_i).

how does cross validation relate to generalization error of ridge LR

We perform ridge LR on all but the i-th fold; the i-th fold D_i is used for the validation error, SUM over (x, y) in D_i of (w^_i^T x - y)^2 / |D_i|, i.e., D_i is used to test w^_i. This validation error approximates E_x,y~P (w^_i^T x - y)^2, the test error. Averaging the validation error over all k folds approximates E_D [E_x,y~P (w^_D^T x - y)^2] (averaging over datasets), i.e., the generalization error of ridge LR with that lambda (the black total-error parabola in the lambda vs. error graph).

is hbar deterministic

Yes; the expectation removes all the randomness.

does adding more useful features change variance

Yes; variance will increase because you have more degrees of freedom and can fit more.

can k-fold cross validation be used to select c for SVM

yes, same process as lambda

zero lambda means what for bias/variance

Zero lambda = zero bias, high variance.
