KD/ball/decision trees/parametric/bagging/boosting

weak learner

a classifier whose accuracy is only slightly above 50%; one that is trained on its own weighted data {(pi, xi, yi)} and solves argmin_h SUM_i pi * 1[h(xi) ≠ yi]

what kind of decision boundaries/coloring will you get for RF w/ 2 trees

3 colorings either trees agree positive, agree negative, or disagree

Adaboost setup

binary classification with yi in {+1, -1}; weak learners h in H are binary, h(xi) in {+1, -1}; the loss function is exponential: L(H) = 1/n SUM_i e^{-yi H(xi)}

goal of boosting

learn an ensemble

after n samples, what is probability that z1 is never sampled when bootstrapping?

(1-1/n)^n approaches 1/e as n -> inf
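
A quick numeric check of this limit (illustrative only):

```python
import math

# P(a specific point z1 is never drawn in n bootstrap draws) = (1 - 1/n)^n,
# which approaches 1/e ≈ 0.368 as n grows.
for n in (10, 100, 1000, 100000):
    print(n, (1 - 1 / n) ** n)
print("1/e =", 1 / math.e)
```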

boosting what are the values of yi h(xi)

-1 or 1!!!! so we can consider all i where h(xi) = yi or where it does not

boosting dl(y^i, yi) /dy^i

-yi e^-H(xi)yi

SUM of wi from lecture notes for boosting

1

SUM wi is what in boosting

1, hence SUM_{i: h(xi) = yi} wi + SUM_{i: h(xi) ≠ yi} wi = 1

bagging (bootstrap aggregation) algorithm

1. construct P^(xi, yi) = 1/n
2. treat P^ as ground truth and draw k datasets D1...Dk from P^ using bootstrapping (so each Di is a bootstrapped sample); drawing from P^ is basically drawing from P — identically distributed but not independent
3. train a classifier on each: h^i = ID3(Di)
4. averaging/aggregation: hbar = (1/k) SUM_i h^i (THIS STEP REDUCES VARIANCE!)
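
A minimal sketch of this procedure, assuming scikit-learn's DecisionTreeClassifier as the base learner and {-1, +1} labels; the names `bagged_fit` / `bagged_predict` are illustrative, not from the notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit(X, y, k=100, seed=0):
    """Train k trees, each on a bootstrap sample (n draws with replacement from D)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)   # sample from P-hat = uniform over D
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Aggregate by averaging the individual predictions (majority vote for +/-1 labels)."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return np.sign(votes)
```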

advantages of bagging

1. easy to implement
2. reduces variance
3. the prediction is an average of many classifiers, so you get a mean score and a variance (uncertainty of the prediction)
4. provides an unbiased estimate of the test error via the out-of-bag error (without reducing the training set!)

goal of decision trees

1. find a maximally compact tree 2. with pure leaves (a single label per leaf). With pure leaves, predicting the most common label of a leaf is a great estimate, because test points whose features route them to that leaf tend to have similar labels.

why is random forest popular

1. it has only 2 parameters, m and k (m being the # of classifiers), and it is insensitive to both, so they are easy to set 2. decision trees don't need much preprocessing or feature scaling (features can be in different units, categorical or real-valued; you could rescale everything to [0, 1] and the tree would not change)

kd tree algorithm

1. identify which side of the dividing wall the test point lies in
2. find the nearest neighbor on the same side
3. compute the distance between the test point and the dividing wall; if this distance > d(xt, xNN on the same side), we are done and get roughly a 2x speedup; if not, compute the distances to the points on the other side as well
apply recursively to subdomains

regularizations in decision trees (reduces model complexity for high variance situations...)

1. minimum # of examples per leaf (no split if # examples < threshold)
2. max depth (no split once the depth limit is hit) — more depth means more splits (# leaves is exponential in depth), so deep trees are more likely to split on outliers
3. max # of nodes (stop growing once the tree hits the node limit); 2 and 3 are similar if the tree is balanced
4. pruning the tree

variants of RF

1. split each training set into two disjoint partitions DA and DB; build the tree on DA and estimate the leaf labels on DB; stop splitting if a leaf has only a single point of DB in it — this makes each tree, and the RF classifier, consistent
2. don't grow each tree to full depth; instead prune based on the left-out samples (can improve the bias-variance tradeoff)

total training loss of ensemble Ht

1/n SUM l(Ht(xi), yi)

adaboost on full CART trees without depth limit

the CART tree has zero training error, so ε = 0 and the step size α = ½ ln((1−ε)/ε) is infinite — division by zero!

binary classification on weighted data

D = {(pi, xi, yi)} where SUM pi = 1, pi ≥ 0, and pi = |wi| / SUM_j |wj| — basically the pi are NORMALIZED WEIGHTS

normalizer Z in boosting is? how does this help us interpret wi?

Z is equivalent to the loss function no matter how many updates you make. At any point in the algorithm, Z is the loss of the entire ensemble: Z = SUM e^{-yi H(xi)}, where the big H = SUM a_t h_t is the whole ensemble. Each weight is divided by this constant Z, which makes the weights always sum to 1. This means wi is that point's contribution to the total loss (the exponential loss of the ensemble's prediction on that point).

for hbar = SUM h^i/k (so bagging) what happens when k->inf where h^i results from bootstrapped dataset Di

Each h^i comes from a bootstrapped dataset Di, and when k -> inf, h bar does not approach the true average classifier because the bootstrapped datasets Di are not truly independent. (can't apply WLLN) So variance can't ever be zero, but variance does decrease in practice.

In bagging, each classifier in the ensemble is trained on a data set that is independently and identically distributed

FALSE! The data is identically distributed but NOT independently sampled: independent sampling means one sample tells you nothing about another, which is not the case here — seeing D1 gives you a good idea of what is in D2, since there is likely significant overlap.

squared loss impurity for regression trees

L(S) = 1/|S| SUM (y - y^)^2 where y^ = 1/|S| SUM y (average label) can think of it like reducing variance similar labels = small variance impurity in this case measures variance

gini impurity for a leaf formula

G(S) = SUM over k of pk (1-pk)

GBRT pseudocode

H = 0
for t = 1...T:
    ti = yi - H(xi) for all i   [the residuals; assumes squared loss]
    h = argmin_h SUM_i (h(xi) - ti)^2
    H = H + a*h
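
A runnable sketch of this pseudocode under the squared-loss assumption, using scikit-learn's DecisionTreeRegressor as the fixed-depth weak learner (`gbrt_fit` / `gbrt_predict` are illustrative names):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, T=100, a=0.1, depth=3):
    """Gradient boosting with squared loss: each tree fits the residuals t_i = y_i - H(x_i)."""
    H = np.zeros(len(y))          # current ensemble predictions H(x_i), initialized to 0
    trees = []
    for _ in range(T):
        t = y - H                                        # residuals = negative gradient
        h = DecisionTreeRegressor(max_depth=depth).fit(X, t)
        H += a * h.predict(X)                            # H <- H + a*h
        trees.append(h)
    return trees

def gbrt_predict(trees, X, a=0.1):
    return a * sum(h.predict(X) for h in trees)
```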

entropy over a tree

H(S) = |SL|/|S| H(SL) + |SR|/|S| H(SR)

adaboost pseudocode

H0 = 0, init wi = 1/n
for t = 1...T:
    ht+1 = argmin_h e, where e = SUM_{i: h(xi) ≠ yi} wi
    if e < 0.5: compute a, set Ht+1 = Ht + a*ht+1, update the wi
    otherwise: return Ht (no improvement on the loss possible)
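
A minimal AdaBoost sketch matching this loop, assuming decision stumps (depth-1 scikit-learn trees) as weak learners and labels in {-1, +1}; `adaboost_fit` / `adaboost_predict` are illustrative names:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=100):
    n = len(y)
    w = np.full(n, 1.0 / n)                # normalized weights, sum to 1
    ensemble = []                          # list of (alpha, weak learner) pairs
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()           # weighted classification error
        if eps <= 0 or eps >= 0.5:         # no usable weak learner (or a perfect one): stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        w *= np.exp(-alpha * y * pred)             # mistakes get heavier, correct points lighter
        w /= 2 * np.sqrt(eps * (1 - eps))          # renormalize: Z becomes Z * 2*sqrt(eps(1-eps))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(a * h.predict(X) for a, h in ensemble))
```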

Under what conditions on your training set will a CART tree (with unlimited depth) obtain 0% training error

If there are no two training inputs with identical features but different labels

Generic boosting (aka Anyboost) algorithm

Input: loss l, step size α, data {(xi, yi)}
H0 = 0
for t = 0...T-1:
    ri = ∂l(Ht(x1), y1, ..., Ht(xn), yn) / ∂H(xi)
    ht+1 = argmin_h SUM_i ri h(xi)
    if SUM_i ri ht+1(xi) < 0 (this means L(Ht + α_t+1 ht+1) < L(Ht)!):
        Ht+1 = Ht + α_t+1 ht+1
    else: return Ht (no improvement!)

proof/intuition that bootstrapping (drawing w replacement from set D) is like drawing from P

Let Q be the distribution that picks a sample from D uniformly; we claim Q(X=x) = P(X=x). Imagine we first draw D from P and then Q picks a sample from D; this is the same as Q picking a slot in D and then filling that slot with a draw from P. Which slot we picked doesn't matter, so Q(X=x) = P(X=x). BUT the data is not independent — the datasets Di are not independent!

What is pk in Gini impurity

Let Sk subset S where Sk = (x,y) in S with label k S = S1 U S2... pk = |Sk| / |S| = frac of inputs in S with label k

in CART does the relative scaling between features matter

NO — each split thresholds a single feature, so monotone rescaling of a feature does not change the tree

in boosting to minimize L(y^) can we just do gradient descent on y^ directly? if not, what to do instead? Let y^ be [Ht(x1)...Ht(xn)] Let L(y^) = SUM l(y^i, yi) [total training loss of ensemble Ht]

NO! The ŷ that minimizes L must come from some ensemble H (so ŷ is constrained to a manifold, and −∇L(ŷ) may move us off this manifold). Instead, find an h such that [h(x1), ..., h(xn)] is close to −∇L(ŷ).

does decision tree and KNN return same prediction

No! One easy way to see this that the decision boundaries of a decision tree are always axis-aligned, whereas the decision boundaries of 1NN, the Voronoi diagram of the points, are not necessarily.

time-complexity of KNN adding more data point to train on

O(d)

time-complexity of KNN classifying one test point

O(dn)

time-complexity of KNN training

O(dn)

after how many iterations is adboost training error 0%

O(log(n))

cost of finding best split in regression tree

O(n log n)
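
A sketch of why O(n log n) suffices per feature: sort once (O(n log n)), then score every threshold in O(1) with prefix sums of y and y², using SSE(S) = SUM y² − (SUM y)²/|S|. The helper name `best_split_1d` is illustrative:

```python
import numpy as np

def best_split_1d(x, y):
    """Best threshold on one feature under squared-loss impurity."""
    order = np.argsort(x)                       # O(n log n)
    xs, ys = x[order], y[order]
    csum, csum2 = np.cumsum(ys), np.cumsum(ys ** 2)
    total, total2, n = csum[-1], csum2[-1], len(ys)

    best_loss, best_t = np.inf, None
    for i in range(1, n):                       # split between positions i-1 and i: O(n) scan
        if xs[i] == xs[i - 1]:
            continue                            # identical feature values: no valid threshold here
        sse_left = csum2[i - 1] - csum[i - 1] ** 2 / i
        sse_right = (total2 - csum2[i - 1]) - (total - csum[i - 1]) ** 2 / (n - i)
        if sse_left + sse_right < best_loss:
            best_loss, best_t = sse_left + sse_right, (xs[i - 1] + xs[i]) / 2
    return best_t, best_loss
```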

why can P^(zi) = 1/n be considered a good approximation of P in bootstrapping?

P^ can approx P's mean and variance E_z~P^ [z] = SUM zi/n -> E_z~P[z] E_z~P^ [z^2] = SUM zi^2/n -> E_z~P[z^2] all by weak law of large numbers

Bootstrapping

a random sampling with replacement technique: given D = {zi ~ P}, approximate P with P^(zi) = 1/n, treat P^ as if it were P, and sample from P^ uniformly at random n times with replacement (the resulting datasets are NOT independent)

regression tree algorithm

RT(S):
if |S| <= k, set the leaf value to the average label (or a linear regressor), where k is a hyperparameter meaning you won't split nodes with # points <= k
else: over all dimensions and all values, find the split minimizing |SL|/|S| var(SL) + |SR|/|S| var(SR), then call RT(SL), RT(SR)

epsilon in adaboost

SUM of wi over all i where h_t+1(xi) ≠ yi (with wi normalized as in the notes), aka SUM of pi over all i where h_t+1(xi) ≠ yi (from lecture); this is the weak learner's loss (the loss of ht+1), i.e. the total weight of the examples where ht+1 made a mistake — the WEIGHTED CLASSIFICATION ERROR

How to make KD tree and traverse one? traversal complexity?

Split the data recursively in half on exactly one feature, rotating through the features (a good heuristic is to pick the feature with maximum variance). Draw the actual tree! To classify a point, traverse down to its leaf, then compute the nearest neighbor among the points in that box by brute force; compare against the distance to the dividing walls to check whether it really is the nearest neighbor, otherwise compute more distances on the other side. Traversal time is O(log n).
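
A compact KD-tree sketch (build plus 1-NN search with the wall-distance pruning rule described above); everything here — Node, build_kdtree, nearest — is an illustrative toy, not the notes' exact code:

```python
import numpy as np

class Node:
    def __init__(self, points=None, feature=None, threshold=None, left=None, right=None):
        self.points, self.feature, self.threshold = points, feature, threshold
        self.left, self.right = left, right

def build_kdtree(X, depth=0, leaf_size=5):
    if len(X) <= leaf_size:
        return Node(points=X)                        # leaf stores its points
    f = depth % X.shape[1]                           # rotate through features
    t = np.median(X[:, f])
    left, right = X[X[:, f] <= t], X[X[:, f] > t]
    if len(left) == 0 or len(right) == 0:            # degenerate split (many ties): make a leaf
        return Node(points=X)
    return Node(feature=f, threshold=t,
                left=build_kdtree(left, depth + 1, leaf_size),
                right=build_kdtree(right, depth + 1, leaf_size))

def nearest(node, q, best=None, best_d=np.inf):
    """Descend into the query's box first; only cross a wall if the wall is
    closer than the current best distance."""
    if node.points is not None:                      # leaf: brute force over its few points
        for p in node.points:
            d = np.linalg.norm(p - q)
            if d < best_d:
                best, best_d = p, d
        return best, best_d
    near, far = ((node.left, node.right) if q[node.feature] <= node.threshold
                 else (node.right, node.left))
    best, best_d = nearest(near, q, best, best_d)
    if abs(q[node.feature] - node.threshold) < best_d:
        best, best_d = nearest(far, q, best, best_d)  # wall is closer: must check the other side
    return best, best_d
```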

Given a distribution P you can sample a training set D and obtain a classifier h. Imagine you train m such classifiers h1, . . . , hm on m data sets D1, . . . , Dm, each drawn i.i.d. from the data distribution P. As you increase m from m = 1 to m >> 0, how does bias of h^ compare to bias of h?

The bias is unaffected, i.e. the bias of ĥ is identical to the bias of h, because E[ĥ] = E[h]

does random forest decrease variance

YES overall because it decreases correlation between classifiers but it increases the variance of individual trees

if epsilon in boosting is big, what does that mean for alpha (learning rate)

alpha is small means weight this weak learner very little cuz it did terribly

when is it possible to find pure leaves/consistent tree

always possible if for x=x', y=y' (if 2 feature vectors are the same, they have the same label)

what does adaboost return

an ensemble

what is normalizing constant Z in adaboost identical to

Z = SUM wi = SUM e^-yi H(xi) this is the loss function!!! adaboost uses exponential loss!!!!! then it makes sense for wi to be contribution towards overall loss

what training error does adaboost achieve with sufficient time

ZERO PERCENT!!!!!!!!! (but prone to overfitting)

variance reduction of possibly correlated RV's

Let (x1, x2, x3) ~ N(0, Σ), where σij = E[xi xj] − E[xi]E[xj] (= E[xi xj] here since the means are 0); σij = 0 if independent. Then var(x̄) = σ²/3 + SUM_{i ≠ j} σij/9. So when the RVs are positively correlated, averaging may not reduce the variance much — but reducing the correlation will reduce the variance.

when does adaboost terminate - does it terminate the moment it reaches 0% training error

adaboost is a for loop, so it does not stop the moment it reaches 0% training error — it keeps boosting as long as there is a weak learner with weighted training error < 0.5 (0% training error does NOT mean every weak learner is correct). Only if the best weak learner has e = 1/2 does it exit the for loop and terminate early.

formal convergence of adaboost

after T iterations, the exponential loss satisfies 1/n SUM exp(-yi HT(xi)) <= (1 - 4γ^2)^{T/2}; the zero-one training error 1/n SUM 1[sign(HT(xi)) ≠ yi] is upper bounded by the exponential loss, so it obeys the same bound, and the number of training mistakes is at most n(1 - 4γ^2)^{T/2}. SO WHEN THE WEAK LEARNERS ARE GOOD (better than 50% accuracy by a margin γ), the training loss decreases exponentially!

adaboost renormalization

after you take a step, you need to recompute all weights and renormalize (the weights depended on the current ensemble Ht). The unnormalized update is wi_{t+1} = wi e^{-a yi h(xi)}; the normalizer becomes Z * 2 sqrt(e(1-e)); so wi_{t+1} <- wi e^{-a yi h(xi)} / (2 sqrt(e(1-e)))

shrinkage

another name for the step size a (setting it small acts as regularization)

optimal step size of adaboost

alpha = argmin_a l(H + a*h); compute the derivative w.r.t. a, set it to zero, and solve: a = 0.5 ln((1-e)/e)
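
A quick numeric sanity check (illustrative): with weighted error eps, the per-step loss factor is (1-eps)·e^{-a} + eps·e^{a}, and its minimizer should match the closed form:

```python
import numpy as np

eps = 0.2
a = np.linspace(0.01, 3, 10000)
loss = (1 - eps) * np.exp(-a) + eps * np.exp(a)   # loss factor of H + a*h on weighted data
print("numeric argmin:", a[np.argmin(loss)])      # ~0.693
print("closed form:   ", 0.5 * np.log((1 - eps) / eps))
```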

finding best weak learner in terms of derivatives

argmin SUM dl/dH(xi) h(xi)

ensemble of classifiers means

average of multiple classifiers

random forest in a nutshell

bagged decision trees

bagging reduces ... boosting reduces...

bagging reduces variance boosting reduces bias

ball trees advantage over KD trees

ball trees fix curse of dim and axis aligned splits partition with hyperspheres instead of boxes (partition on low-dimensional manifold that doesn't align with axes)

ID3 algorithm pseudocode

Base cases: if all labels are the same, return a leaf with that label; if not, but ALL feature vectors in the set are identical (or only one point is left), return a leaf with the most common label (or mean label for regression) — we can't split further.
Otherwise: over all features f and all possible splits f > t, find the one that minimizes the impurity of the split, e.g. |SL|/|S| G(SL) + |SR|/|S| G(SR) — i.e. find the split making the left/right impurities smallest (for each feature, sort on that feature and try the splits; pick the xf and t that minimize), with SL being the points with xf <= t and SR those with xf > t. Then call ID3 recursively on the two subtrees. (A sketch is below.)
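
A small ID3-style sketch with Gini impurity and midpoint thresholds, assuming integer class labels 0..c-1; `id3`, `gini`, `predict` are illustrative names and this is not the notes' exact implementation:

```python
import numpy as np
from collections import Counter

def gini(y):
    """G(S) = sum_k p_k (1 - p_k)."""
    p = np.bincount(y) / len(y)
    return np.sum(p * (1 - p))

def id3(X, y, depth=0, max_depth=None):
    # Base cases: pure labels, identical inputs, or depth limit -> leaf with majority label.
    if len(set(y)) == 1 or (X == X[0]).all() or (max_depth is not None and depth >= max_depth):
        return Counter(y).most_common(1)[0][0]
    best = None
    for f in range(X.shape[1]):
        vals = np.unique(X[:, f])
        for t in (vals[:-1] + vals[1:]) / 2:          # midpoints between consecutive values
            left = X[:, f] <= t
            # weighted impurity |SL|/|S| G(SL) + |SR|/|S| G(SR)
            imp = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if best is None or imp < best[0]:
                best = (imp, f, t)
    _, f, t = best
    left = X[:, f] <= t
    return (f, t, id3(X[left], y[left], depth + 1, max_depth),
                  id3(X[~left], y[~left], depth + 1, max_depth))

def predict(tree, x):
    while isinstance(tree, tuple):                    # descend until we hit a leaf label
        f, t, l, r = tree
        tree = l if x[f] <= t else r
    return tree
```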

benefit of random forest

because for every tree and every split we randomly select a subset of the features, the correlation between h^i and h^j is reduced (at the cost of increasing the variance of the individual trees); now var(x̄) = σ²/n + SUM_{i ≠ j} σij/n², and the second term shrinks as the correlations shrink

why does ID3 not stop if no split can improve impurity? does it stop if all splits lead to the exact same impurity? how does this relate to myopic?

because the next split may help — trees are myopic: they greedily choose the split that minimizes impurity, which may not be globally optimal (e.g. XOR: the first split doesn't improve impurity, but the next one does). ID3 only stops splitting when all inputs are identical, all labels are identical, or the max depth / # nodes is reached. It doesn't matter if all splits lead to the same impurity — just choose a split at random.

why is KD tree faster usually

because only need to check 1 box for nearest neighbor (unless close to boundary)

bias of a single decision tree vs bagged treees/random forest

the bias of a single tree is generally low because few assumptions are made about the data — it is accurate but can overfit. For BAGGED trees / random forests, the bias stays the same: think of it as E[bagged classifier] ≈ E[single decision tree] (approximately, by the weak law of large numbers, since RF keeps the classifiers fairly independent)

setup for boosting (adaboost)

binary classification data and a loss function (e.g. exponential: exp(-y H(x)))

bagging vs bootstrapping

bootstrapping is the random sampling technique; bagging (bootstrap aggregation) is doing it many times and averaging the resulting classifiers

what size is bootstrap sample

by default it is the same size as the original sample D (the training set)

how can boosted classifiers be stopped prematurely in test time

cascades spend little time on common case but more time on rare case if clear which way prediction will go, stop after a few weak learners!

CART

classification and regression trees

gradient boosted regression tree (GBRT) setup

classification or regression; weak learners are regressors, typically fixed-depth regression trees; the step size a is a hyperparameter; the loss function is any differentiable convex loss L(H) = SUM_i l(H(xi), yi)

motivation of random forest

consider h^i, h^j (i ≠ j) in bagging: they are not independent under the true distribution P because of overlapping samples, so the variance may not be reduced much. Random forests make h^i and h^j as independent as possible, i.e. they DECREASE the CORRELATION of the trees (thus reducing variance), although not making them entirely independent. This also helps with overfitting (if some features don't generalize well, a tree may skip them).

how does partitioning in KD trees speed up testing

consider one neighbor case 1. identify which side the test point lies in 2. find the nearest neighbor on the same side 3. compute the distance between test point and the dividing wall, if this distance > d(xt, xNN on same side), we are done and get 2x speedup aka if the distance to the partition is larger than distance to our closest neighbor, none of the points inside the partition can be closer!

motivation for bagging

consider a trained decision tree h^ = ID3(D): h^ is a random quantity with high variance, and high variance means overfitting! So we want to average it out: hbar = (1/k) SUM_i h^i has LOWER VARIANCE

proof that if distance to partition is larger than distance to point on the same side, none of the points in the partition can be closer

let d(xt, x) be the distance between the test point and a candidate x lying on the other side of the wall; decompose d(xt, x) = d1 + d2, where d1 is the part on the test point's side and d2 the part on the candidate's side. Let dw be the shortest distance from xt to the wall; then d1 >= dw, so d(xt, x) = d1 + d2 >= dw + d2 >= dw. Hence if dw is larger than the distance to the current best nearest-neighbor candidate, we can discard everything behind the wall.

entropy overview

define impurity as how close we are to uniform use KL-divergence to compute closeness uses same p1...pk as Gini impurity

how to partition in KD

divide data into two halves, left and right, along 1 feature for each training input, remember the half it lies in

variance reduction of independent RV's

do this by averaging: consider i.i.d. xi ~ N(0, σ²); the variance of x̄ = (1/n) SUM xi is σ²/n — REDUCED! (because Var(sum of independent RVs) = sum of the variances and Var(aX) = a² Var(X))
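
A tiny simulation (illustrative) confirming that averaging n i.i.d. draws divides the variance by n:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 25, 200000
xbar = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)   # many realizations of the average
print("empirical var(xbar):", xbar.var())                      # ≈ sigma^2 / n
print("sigma^2 / n        :", sigma ** 2 / n)
```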

decision tree advantage over nearest neighbor

they don't store the training data (instead store the # of points of each label per leaf — typically pure, so just the leaf's label); they are fast at test time (tree traversal time, prediction is the majority label of the leaf); and they don't need a metric (splits are based on feature thresholds, not distances)

decision trees overview

don't store the training data; use it to build a tree structure that divides the space into regions with similar labels. The root node represents the entire set; divide into subtrees by comparing a feature value against a threshold t; ideally the leaves are pure (all points have the same label), so keep dividing until this is the case

why does inner loop in adaboost only update Ht when e < 1/2

e > 0.5 is impossible (for a negation-closed H); e = 0.5 is not useful because then the weak learner h is only as good as a coin toss (and the step size a would be 0); meanwhile e <= 0.5 - γ implies we approximate the gradient well: the inner product of -∇L with ht+1 is > 0, i.e. within 90 degrees

what implies a stronger learner

each weak learner doing better than a random coin toss (error at most 0.5 - γ)

what kind of method is bagging

ensemble method

weak learner's loss/weighted classification error upper bound and why does it imply approximating gradient well

epsilon <= 1/2 - γ with γ > 0; this holds when H is symmetric (h in H implies -h in H). It means the inner product between the true negative gradient and [ht+1(x1), ..., ht+1(xn)] is >= 2γ SUM |wj| > 0, i.e. within 90 degrees, so the step IMPROVES THE OBJECTIVE

AdaBoost with decision trees (depth 3) is non-parametric

false — it is parametric: the size of the final model is not a function of the number of training instances n. The number of weak classifiers is not tied to n; the number of training weights wi is, but this question is about the FINAL model, which is the ensemble — the weights are not shipped to production!

out-of-bag error

for each training point, find all bootstrap sets that don't contain it and form the average classifier h~i(x) over the corresponding trees; the out-of-bag error is the average loss these classifiers incur: (1/n) SUM_{(xi,yi) in D} loss(h~i(xi), yi). It is a good estimate of the test error because it only uses classifiers that haven't seen the point, and leaving out the classifiers that did see it matters little when the ensemble is large.
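
A sketch of the out-of-bag estimate on top of bagging (assumes {-1, +1} labels and scikit-learn trees; `bagging_with_oob` is an illustrative name):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_with_oob(X, y, k=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, in_bag = [], []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)                        # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        in_bag.append(np.bincount(idx, minlength=n) > 0)        # which points this tree saw

    mistakes = []
    for i in range(n):
        # vote only with the trees whose bootstrap sample did NOT contain point i
        votes = [t.predict(X[i:i + 1])[0] for t, seen in zip(trees, in_bag) if not seen[i]]
        if votes:                                               # (a point seen by every tree is skipped)
            mistakes.append(np.sign(np.mean(votes)) != y[i])
    return trees, float(np.mean(mistakes))                      # (trees, OOB error)
```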

boosting algorithm in english

every iteration is a step of approximate gradient descent: we want to approximate the negative gradient of the loss with some h in our hypothesis class; we Taylor-approximate the loss and find an ht+1 that minimizes the resulting linear term, then update Ht+1 = Ht + a ht+1

ID3 algorithm overview

generates decision tree from dataset

in boosting, how to find h that minimizes l(Ht + ah)?

gradient descent in function space

show that finding the best weak learner in adaboost is equivalent to minimizing epsilon/weighted classification error

ht+1 = argmin_h L(H + a h) ≈ argmin_h <∇ℓ(H), h> = argmin_h SUM_i (∂ℓ/∂H(xi)) h(xi) = argmin_h SUM_i -yi e^{-yi H(xi)} h(xi) = argmin_h SUM_i -yi h(xi) wi = argmin_h [ SUM_{h(xi) ≠ yi} wi - SUM_{h(xi) = yi} wi ] = argmin_h SUM_{h(xi) ≠ yi} wi = argmin_h epsilon

parametric algorithm

has a constant set of parameters independent of the # of training samples — think of the amount of space needed to store the TRAINED classifier (the space the model would take if you packaged it and sent it to production). E.g. the perceptron has w and b, where w depends on the dimension of the training data but not on HOW MANY training samples were used

for hbar = SUM h^i / k, what happens when n->inf

hbar -> E_{D~P^}[ID3(D)]; as n -> inf, P^ -> P, so this expectation approaches E_{D~P}[ID3(D)], the expected decision tree (from the perspective of E_D[(1/m) SUM hi] = (1/m) SUM E_D[hi] = E_{D~P}[ID3(D)]). That limit is deterministic — zero variance! So infinite data WOULD give us zero variance (even though the bootstrap datasets aren't independent), but this is infeasible in practice, so instead we make the classifiers as independent as possible.

KL-divergence

higher KL divergence means more mismatch; 0 means a perfect match. KL(p||q) = SUM pk log(pk/qk) >= 0. Let q1...qc be uniform (qk = 1/c); then KL(p||q) = SUM pk log(pk) - pk log(qk) = SUM pk log(pk) + pk log(c) = SUM pk log(pk) + log(c), since SUM pk = 1 and log(c) is a constant. So when q is the uniform distribution, KL(p||q) is just SUM pk log(pk) up to an additive constant.
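
A short check of this identity (illustrative), with a uniform reference distribution q:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_k p_k log(p_k / q_k); >= 0, and 0 iff p == q (not symmetric)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0                                   # treat 0 * log 0 as 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p = np.array([0.7, 0.2, 0.1])
q = np.full(3, 1 / 3)                           # uniform, c = 3
print(kl(p, q))                                 # equals sum p_k log p_k + log(c):
print(np.sum(p * np.log(p)) + np.log(3))
print(kl(p, q) == kl(q, p))                     # False: KL is not symmetric
```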

what does alpha (learning rate) in adaboost depend on

ht+1's performance weak learner's performance

what is true about Di ~ D and Di ~ P in bootstrapping

identically distributed!! Each dataset Di is drawn from P, but not independently (UNLESS CONDITIONED ON D - conditional probability RESTRICTS) Q(X=xi) = P(X=xi) hence drawing Di from D is the same as drawing it from P

are bagged datasets independent

if considering P as underlying distribution, no! because knowing one set tells you likelihood of points in other set but if considering D, then yes because it becomes your whole universe (you aren't operating on a subset of the universe anymore)

what are values of the splits

if splitting between values x1 and x2, the threshold is (x1 + x2)/2

why does checking distance to wall make sense

if the distance to the partition is larger than distance to our closest neighbor, none of the points inside the partition can be closer!

decision trees parametric or not

if trained to full depth, they are non-parametric (depth of decision tree scales as func of training data, in practice O(log2(n))) but if limit tree depth by max value, is parametric (upper bound of model size known prior to observing training data)

if the distance to the ball is greater than dist to current closest neighbor, what do we do

ignore that ball entirely because d(xt, x) = d1 + d2, and if d1 >= db, then overall >= db + d2 >= db

random forest

in ID3, for every split, randomly select k < d features, find split only using these k

where do KD trees store data

in leaf nodes each training point lies in exactly one leaf node

adaboost algorithm (from lecture)

init H1 = h1
for t = 1, 2, ...:
    compute wi = -yi exp(-yi Ht(xi)) and pi = |wi| / SUM_j |wj| (so pi is proportional to exp(-yi Ht(xi)))
    find ht+1 = argmin_h SUM_i pi 1[h(xi) ≠ yi] (THIS IS A WEIGHTED BINARY CLASSIFICATION PROBLEM)
    weak learner's loss: e = SUM_{i: yi ≠ ht+1(xi)} pi
    Ht+1 = Ht + a ht+1

boosting algorithm in terms of pi, etc.

init H1 = h1
for t = 1, 2, ...:
    compute ŷi = Ht(xi)
    compute wi = ∂L(ŷi, yi)/∂ŷi and pi = |wi| / SUM_j |wj|
    find ht+1 = argmin_h SUM_i pi 1[h(xi) ≠ -sign(wi)]
    Ht+1 = Ht + a ht+1

boosting algorithm at very high level

initialize H1 = h1 in ℋ; for t = 1, ...: find a new classifier ht+1 such that Ht+1 = Ht + a ht+1 (the ensemble at t+1) has smaller training error, i.e. an ht+1 that minimizes l(Ht + a ht+1) where l(H) = (1/n) SUM l(H(xi), yi). This is equivalent to finding ht+1 = argmax <[ht+1(x1), ..., ht+1(xn)], -∇L(ŷ)>: maximizing the inner product minimizes the angle theta, i.e. moves in the direction as close to the true negative gradient as possible, i.e. decreases the loss as much as possible.

ball tree construction pseudocode

input: set S, n = |S|, leaf size k
BALLTREE(S, k):
    if |S| < k: stop
    otherwise pick a random x0; pick x1 = argmax_x d(x0, x); pick x2 = argmax_x d(x, x1) (these two steps pick a direction x1 - x2 with large spread!)
    project the data onto (x1 - x2), i.e. compute (x1 - x2)^T xi, and take the median m of the projections
    SL = points with projection < m, SR = points with projection >= m
    return tree(center c = mean(S), radius r = max_x d(x, c), children = BALLTREE(SL, k), BALLTREE(SR, k))
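
A direct translation of this pseudocode (illustrative; returns nested dicts rather than a tree class):

```python
import numpy as np

def balltree(S, k=10, rng=None):
    """Recursively split S with a hyperplane perpendicular to a direction of large
    spread (x1 - x2), at the median of the projections onto that direction."""
    if rng is None:
        rng = np.random.default_rng(0)
    if len(S) <= k:
        return {"points": S}
    x0 = S[rng.integers(len(S))]                          # random point
    x1 = S[np.argmax(np.linalg.norm(S - x0, axis=1))]     # farthest from x0
    x2 = S[np.argmax(np.linalg.norm(S - x1, axis=1))]     # farthest from x1
    proj = S @ (x1 - x2)                                  # project onto the spread direction
    m = np.median(proj)
    SL, SR = S[proj < m], S[proj >= m]
    if len(SL) == 0 or len(SR) == 0:                      # degenerate split: stop here
        return {"points": S}
    c = S.mean(axis=0)
    return {"center": c,
            "radius": np.max(np.linalg.norm(S - c, axis=1)),
            "left": balltree(SL, k, rng),
            "right": balltree(SR, k, rng)}
```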

if datasets Di were truly independent, what happens to 1/m SUM hDi

it approaches hbar as m -> inf by the WLLN, hence variance 0! But this is not actually the case, because we bootstrap the samples, so the datasets (and hence the hDi) aren't truly independent

why is KNN slow during testing

it does a lot of unnecessary work: it computes the distance to every training point for every test point

Assume you pre-process all your features in the following way: you sort each feature independently. For each feature, you then assign all those inputs that share the lowest feature value a new feature value of 1, all those with the second lowest value a 2, etc. How does this affect the trees that you construct?

it doesn't change them — the relative ordering is maintained, so the same splits are considered for each feature and the best one is still the same

in adaboost, how to interpret wi

it is the contribution of each training point towards the overall loss

how does increasing number of trees in RF create a better decision boundary

it smooths because it's averaging decisions across trees

adaboost is powerful bc

it turns any weak learner that can classify a weighted version of the training set with error below 0.5 into a strong learner whose training error decreases exponentially; it requires only O(log n) steps until it is consistent

for what algorithms is boosting ineffective

kNN, unlimited-depth decision trees, kernel SVMs (all essentially zero-bias, highly non-linear — there is no bias left to reduce); linear classifiers on non-linearly-separable data (not much use in ensembling linear classifiers); random labeling or mode labeling (not weak learners, since they don't actually learn). NOT naive Bayes (it's hard to say whether its bias is high without knowing how you modeled p([x]_a | y)).

approximation approach to finding min size tree w pure leaves

keep splitting data to minimize an impurity function, which measures label purity amongst children

regression trees and what does decision boundary look like? what about classification decision boundaries?

labels are continuous; uses squared-loss impurity; each leaf stores something like a linear regression fit or the average label of its points; the prediction surface is piecewise constant (can overfit to individual labels). For classification trees, the decision boundary looks more like a kNN classifier's boundary than a KD-tree partition — don't draw it like a KD tree.

how does adaboost find the best weak learner (aka find ht+1 = argmin)

let Z = SUM e^{-yi H(xi)} (normalizing factor) and wi = (1/Z) e^{-yi H(xi)} [from lecture notes], so SUM wi = 1. With ri = ∂l/∂H(xi) = -yi e^{-yi H(xi)}, we need to solve ht+1 = argmin_h SUM ri h(xi) = argmin_h SUM_{i: h(xi) ≠ yi} wi = ε. We want SUM ri ht+1(xi) to be negative, and for that we just need ε < 0.5.

good choice of m and k in RF

let k = sqrt(d) m is as large as possible

GBRT how to plug in to anyboost formulation of ht+1 = argmin SUM ri h(xi)

let ti = -ri = yi - H(xi) [when using squared loss]; then argmin_h SUM ri h(xi) simplifies to argmin_h SUM (h(xi) - ti)^2

CART overall pros/cons

lightweight and fast during testing, but not competitive in accuracy on their own; they can become strong via bagging (random forests) or boosting (gradient boosted trees)

entropy formula and derivation

max over p of KL(p||q) with q uniform = max SUM_k pk log(pk)   [we want to maximize the difference from uniform] = min of -SUM pk log(pk) = min over p of H(S), so the entropy impurity is H(S) = -SUM_k pk log(pk)

gini impurity graph

maximized when the labels are uniformly distributed (each label equally likely) and goes to zero as one label dominates; for a 2-label problem it is the inverted parabola 2p(1-p), maximized at p = 0.5 where the Gini impurity is 0.5

in adaboost, if SUM ri h(xi) < 0, what is true about epsilon

it means epsilon < 1/2 (so the weak learner has better than 50% accuracy)

what does large pi in boosting mean, pi = |wi| / SUM |wj|

it means the point contributes a lot to the loss; points that were classified wrong in a previous iteration get larger weight in the next one, meaning fixing them is MOST important for reducing the loss

how to negate a tree

negate predictions

is KL-divergence a metric

no because not symmetric KL(p||q) is not the same as KL(q||p)

do random forests need a training/validation split

no because we have out of bag error! so we already have a so-called "validation error" estimate and so we don't need to reduce the training set

adaboost - can epsilon > 0.5?

no! if H is negation closed (for every h in H we also have -h in H) it can't happen: if h had error e > 0.5, then -h would have error 1 - e < 0.5, so flipping h to -h gives a classifier with smaller error — but h was the minimizer, contradiction!

is it computationally tractable to find a pure and maximally compact tree (the global optimum)

no! it is NP-hard to find the minimum-size tree; use a greedy approximation instead (so ID3 finds a local optimum)

are gradient boosting algorithms part of boosting family

not necessarily, because they don't guarantee that the training error decreases exponentially; such algorithms are therefore sometimes called stage-wise regression

does bagging with bootstrapping achieve zero variance

not really: ĥ_D = (1/m) SUM hDi no longer approaches hbar, because the weak law of large numbers only applies to i.i.d. samples. With infinite training data it would, but that's not usually the case.

what happens in adaboost when two training inputs in a binary classification problem are identical in features but have diff labels

one point is labeled correctly for a while, but the other point's weight keeps increasing until it must be classified correctly to reduce the loss; now the first point is incorrect and its weight increases, and this loops again. Eventually both points have high weights that dominate the training set, no weak learner can do better than 50% accuracy, and the algorithm stops.

ball tree general idea

partition the data along its underlying manifold instead of the entire feature space: pick a random point, find the point farthest from it, then the point farthest from that one; draw the line connecting them and the perpendicular bisector at the median value of the projections onto that line; split the points with this bisector

what will a weak learner focus on classifying correctly

points with HIGH weights — getting the high-weight points right reduces the loss the most

ensemble method

predictions are made by combining a collection of models: the ensemble is H(x) = SUM_t a_t h_t(x); combine multiple simpler algorithms to get a better learning algorithm

pros and cons of adaboost as a result of using exponential loss

pro: converges fast con: bad for noisy data, lots of overfitting

pros and cons of KD trees

pros: exact; easy to build (find the MEDIAN of a feature)
cons: the curse of dimensionality makes KD trees ineffective in high dimensions (too much splitting, and points tend to be evenly spread out, so you have to check MORE boxes); all splits are axis-aligned (but the data may not be aligned with the features, e.g. it may lie along a diagonal); all training data still needs to be stored; in the worst case it is as slow as plain kNN

pros and cons of decision trees

pros: fast at test time
cons: prone to overfitting (outliers may get split out even though they aren't important), so high variance

random forest ID3 vs regular ID3

regular ID3 searches for the best split over all d dimensions; the ID3 inside a random forest only searches over k randomly picked dimensions

boosting as two player zero sum game

the row player plays a hypothesis h; the column player plays an example (x, y); the row player incurs loss 1[h(x) ≠ y] and the column player incurs loss -1[h(x) ≠ y]; boosting amounts to running an algorithm that finds the Nash equilibrium of this game

random forest algorithm pseudocode? two sources of randomness

sample m datasets D1...Dm with replacement (bootstrapping) — randomness source 1
for each Dj, train a full decision tree (max depth = inf), but before each split sample k features and use only these for the split — randomness source 2
the final classifier is the average of these trees
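
A compact sketch of exactly those two randomness sources, leaning on scikit-learn's max_features to do the per-split feature subsampling (`random_forest_fit` / `random_forest_predict` are illustrative names; assumes {-1, +1} labels):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, m=100, k=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = k or max(1, int(np.sqrt(d)))                   # common default: k = sqrt(d)
    forest = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)               # randomness 1: bootstrap sample
        tree = DecisionTreeClassifier(max_features=k)  # randomness 2: k random features per split
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def random_forest_predict(forest, X):
    return np.sign(np.mean([t.predict(X) for t in forest], axis=0))
```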

non-parametric algorithm

scales with the # of training samples, e.g. kNN: during training it stores the entire training set, so the number of parameters / storage required grows linearly with the training set size

what question does boosting answer

scenario: high-bias classifiers with high training error (e.g. CART trees with limited depth); can such weak learners be combined to generate a strong learner with low bias?

assumption of RF/decision trees

similar inputs have similar labels

ball tree performance comparison with KD trees

slower than KD trees for d <= 3 but faster in high dimensions; both are affected by the curse of dimensionality, but ball trees tend to still work if the data lies on a low-dimensional manifold (so they usually do better than KD trees in high dimensions). Notably, a ball-tree split costs O(n) (you have to project onto the new line), whereas a KD-tree split only requires the median along a single coordinate, which is much cheaper.

KD trees

a space-partitioning data structure for organizing points in k-dimensional space; it can speed up kNN at test time by partitioning the feature space, i.e. recursively subdividing along features to make kNN fast

KD tree construction

split the data recursively in half (so the # of points is equal on both sides) on exactly one feature; rotate through the features (a good heuristic is to pick the feature with max variance) — this means you can split on the same feature many times. So the root may test f1 > t1 (yes or no), its children may test f2 > t2 and f2 > t3, and so forth.

stochastic gradient boosting

subsample training data for each weak learner combines benefits of bagging + boosting

how is adaboost iterative

the next iteration (next weak learner) focuses on the data that the previous weak learner misclassified; reweighting means misclassified data gains more weight

training loss vs training error

training loss is often an upper bound on training error

finding [h(x1)...h(xn)]T that is close to −∇L(y ̂ ) in boosting gradient descent in function space as weighted binary classification

the training set can be rewritten as {(pi, xi, -sign(wi))}, with wi = ∂l(ŷi, yi)/∂ŷi [this is ri in the lecture notes], ŷi = Ht(xi), and pi = |wi| / SUM_j |wj|; then ht+1 = argmin_h SUM_i pi 1[h(xi) ≠ -sign(wi)]. This ht+1 later gets multiplied by alpha before being added to Ht.

One advantage of bagging is that all ensemble members (i.e. classifiers) can be trained in parallel

true

one advantage of Random Forests is that you obtain meaningful probability estimates as your output predictions P(y|x)

true probabilities are calculated from the votes of the different trees!!!!

adaboost how are weights changed in the update to reflect classifier accuracy

the unnormalized weight is updated to wi * e^{-a yi h(xi)}; if the classifier is incorrect the factor is e^{a} > 1, so the weight increases; if it is correct the factor is e^{-a} < 1, so the weight decreases

what does bagging reduce

variance

effect of increasing ensemble size on bias/variance/noise

variance decreases, it becomes 1/m of variance of single classifier bias and noise are unaffected bc they use expectations, which don't depend on m

how does bagging reduce variance in terms of variance equation (WHEN Di are independently drawn from P)

variance is E[(the classifier you drew - the average classifier)^2]; the bagged classifier is ĥ_D = (1/m) SUM h_Di, and by the weak law of large numbers this approaches E_{D~P}[ID3(D)], which is essentially the average classifier. So since ĥ_D approaches hbar, the variance goes to 0 — WHEN the Di are independently drawn from P.

gradient descent in functional space idea

we want to minimize l(H + a h) ≈ l(H) + a ∇ℓ(H)^T h (first-order Taylor approximation), so we want to minimize l(H) + SUM (∂l/∂H(xi)) h(xi), i.e. min_h SUM (∂l/∂H(xi)) h(xi) = min_h SUM ri h(xi), where ri = ∂l/∂H(xi). As long as SUM ri h(xi) < 0, we make progress — so we just need some function that can solve this minimization. NOTE: this minimization can be rewritten as a weighted binary classification problem!

what do we not want in terms of impurity

we don't want a uniform distribution! don't want p1 = ... = pc = 1/c (each label equally likely in the leaf)

approximate gradient descent for minimizing L(y) where yt+1 = yt - ng^t g^t is not gradient of L(yt) under what condition of g^t can we guarantee L(yt+1) < L(yt)? and proof

when <ĝt, ∇L(yt)> > 0 (so you are still moving in an acceptable direction, within 90 degrees of the true gradient). Proof (first-order Taylor, using ĝt^T gt > 0): let gt = ∇L(yt) and decompose the estimate as ĝt = ĝ_parallel + ĝ_perp with ĝ_parallel = a gt (the projection onto gt). Then L(yt+1) ≈ L(yt) - η gt^T ĝt = L(yt) - η gt^T (a gt + ĝ_perp) = L(yt) - η a gt^T gt, and a > 0 because the angle between the gradient estimate and the true gradient is less than 90 degrees (imagine drawing it out!), so the loss decreases.

when is adaboost a bad idea

when the data has label noise! exponential loss ensures the mislabeled data points will also be classified correctly means OVERFITTING!!!!!!

when does adaboost stop

when no weak learner achieves accuracy better than 50% (i.e. weighted error is not below 0.5); at this point it RETURNS THE ENSEMBLE and exits the for loop

using more classifiers averaged together for bagging does what when using excessively high number of classifiers

will slow down classifier reduces variance a lot but won't increase its error (BIAS)

bagging in test time

y^ = [p // 1-p] where p = # of trees predicting -1 / k CLASSIFICATION

are CART trees with limited depth weak learners

yes

is pi in boosting a probability

yes SUM pi = 1 pi >= 0

can decision trees fit non-linear trends

yes — the prediction function is piecewise constant over axis-aligned regions, so it can fit non-linear trends

can weak learners be combined together to generate a strong learner w low bias?

yes create ensemble classifier H = SUM at ht(x) build in iterative fashion (in iteration t add classifier at ht(x)) similar to gradient descent iterations

does adaboost converge very fast

yes! because we know the optimal step size alpha

gradient descent for minimizing L(y), when is L(y+1) < L(yt)

yt+1 = yt - ngt gt = gradient of L(yt) when n is small and gt is not zero, L(yt+1) < L(yt)

what depth is the root

zero

gini impurity G^T(S) of a tree

|SL|/|S| GT(SL) + |SR|/|S| GT(SR) where SL and SR are disjoint, S = SL U SR |SL|/|S| = frac of inputs in left subtree

