KD/ball/decision trees/parametric/bagging/boosting
weak learner
classifier whose accuracy is slightly above 50%; one which trains on weighted data {(pi, xi, yi)} and solves argmin_h SUM pi * 1[h(xi) is not yi]
what kind of decision boundaries/coloring will you get for RF w/ 2 trees
3 colorings either trees agree positive, agree negative, or disagree
Adaboost setup
classification with yi in {-1, +1}; weak learners h in H are binary, h(xi) in {-1, +1}; loss function is exponential: L(H) = 1/n SUM e^(-yi H(xi))
goal of boosting
learn an ensemble
after n samples, what is probability that z1 is never sampled when bootstrapping?
(1-1/n)^n approaches 1/e as n -> inf
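A quick numeric check of this limit (plain Python, just for intuition):

```python
# (1 - 1/n)^n approaches 1/e ~ 0.36788 as n grows
for n in (10, 100, 10_000):
    print(n, (1 - 1/n) ** n)   # 0.3487..., 0.3660..., 0.36786...
```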
boosting what are the values of yi h(xi)
-1 or 1!!!! so we can consider all i where h(xi) = yi or where it does not
boosting dl(y^i, yi) /dy^i
-yi e^-H(xi)yi
SUM of wi from lecture notes for boosting
1
SUM wi is what in boosting
1 hence SUM wi over h(xi) = yi + SUM wi over h(xi) noteq yi = 1
bagging (bootstrap aggregation) algorithm
1. construct P^(xi,yi) = 1/n 2. treat P^ as ground truth and draw k datasets D1...Dk from P^ (using bootstrapping - so each Di is a bootstrapped sample); DRAWING FROM P^ is basically drawing from P...but not independently...so identically distributed but not independent 3. train a classifier on each: h^i = ID3(Di) 4. averaging/aggregation: hbar = 1/k SUM h^i (THIS STEP REDUCES VARIANCE!) [sketch below]
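A minimal sketch of these four steps (assuming numpy arrays and scikit-learn; helper names like bagged_fit are made up here, and sklearn's DecisionTreeClassifier stands in for ID3):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit(X, y, k=50, seed=0):
    """Steps 1-3: draw k bootstrapped datasets from P^ and train one tree per dataset."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)                             # sample n slots from P^ with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))   # h^i = "ID3"(D_i)
    return trees

def bagged_predict(trees, X):
    """Step 4: hbar = average of the h^i (majority vote for labels in {-1, +1})."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return np.sign(votes)   # note: exact ties map to 0 in this sketch
```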
advantages of bagging
1. easy to implement 2. reduces variance 3. the prediction is an average of many classifiers, so you get a mean score and a variance (uncertainty of the prediction) 4. provides an unbiased estimate of the test error via the out-of-bag error (and can do this without reducing the training set!)
goal of decision trees
1. find a compact tree (maximally compact) 2. with pure leaves (a single label per leaf) [WITH PURE LEAVES, THE COMMON LABEL BECOMES A GREAT ESTIMATE! because with pure leaves, test points whose features land in that leaf tend to all have similar labels]
why is random forest popular
1. has only 2 hyperparameters, m and k (m being the # of classifiers/trees, k the # of features sampled per split); it is insensitive to both, so BOTH ARE EASY TO SET 2. decision trees don't need much preprocessing or feature scaling (features can be in different units, categorical or real-valued; you could rescale everything into (0, 1] and it would not change the tree)
kd tree algorithm
1. identify which side of the dividing wall the test point lies in 2. find the nearest neighbor on that same side 3. compute the distance between the test point and the dividing wall; if this distance > d(xt, xNN on same side), we are done and get a ~2x speedup; if not, compute distances to the points on the other side as well; recursively apply this to the subdomains
regularizations in decision trees (reduces model complexity for high variance situations...)
1. minimum # of examples per leaf (no split if # examples < threshold) 2. max depth (no split if the depth limit is hit), because more depth = more splits (# leaves is exponential in depth), so deeper trees are more likely to split on outliers 3. max # of nodes (stop growing the tree once it hits the max # of nodes); 2 and 3 are similar if the tree is balanced; together these amount to PRUNING THE TREE (see the sketch below)
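For reference, these regularizers map onto scikit-learn's DecisionTreeClassifier arguments fairly directly (a sketch; the values are arbitrary, and sklearn's closest knob to "max # nodes" counts leaf nodes):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=5,    # 1. minimum # of examples per leaf
    max_depth=10,          # 2. max depth
    max_leaf_nodes=256,    # 3. cap on the number of leaves (closest analogue to "max # nodes")
)
```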
variants of RF
1. split each training set into two disjoint partitions, DA and DB; build the tree on DA and estimate the leaf labels on DB; stop splitting once a leaf has only a single DB point in it; this makes each tree (and the RF classifier) consistent 2. don't grow each tree to full depth; instead prune based on the left-out samples (can improve the bias-variance tradeoff)
total training loss of ensemble Ht
1/n SUM l(Ht(xi), yi)
adaboost on full CART trees without depth limit
the CART tree has zero training error, so epsilon = 0, hence an infinite step size alpha (division by zero in alpha = 0.5 ln((1-e)/e))!
binary classification on weighted data
D = {(pi, xi, yi)} where SUM pi = 1, pi >= 0, and pi = |wi| / SUM |wj|; BASICALLY pi is a NORMALIZED WEIGHT!!!
normalizer Z in boosting is? how does this help us interpret wi?
Z is EQUIVALENT TO THE LOSS FUNCTION NO MATTER HOW MANY UPDATES YOU MAKE! At any point in the algorithm, Z = SUM e^(-yi H(xi)) is the loss of the entire ensemble, where big H = SUM a ht is the ensemble; each weight is divided by this constant Z, which makes the weights always sum to 1; THIS MEANS wi is that point's contribution to the total loss (the exponential loss of the ensemble's prediction on that point, as a fraction of the whole)
for hbar = SUM h^i/k (so bagging) what happens when k->inf where h^i results from bootstrapped dataset Di
Each h^i comes from a bootstrapped dataset Di, and when k -> inf, hbar does not approach the true average classifier, because the bootstrapped datasets Di are not truly independent (can't apply the WLLN). So the variance can never reach zero, but it does decrease in practice.
In bagging, each classifier in the ensemble is trained on a data set that is independently and identically distributed
FALSE!!!!!!! The data is NOT independently sampled: independent sampling means having one sample tells you NOTHING about the other sample, and that is not the case here! Seeing D1 gives you a good idea of what is in D2 (likely significant overlap)
squared loss impurity for regression trees
L(S) = 1/|S| SUM (y - y^)^2 where y^ = 1/|S| SUM y (average label) can think of it like reducing variance similar labels = small variance impurity in this case measures variance
gini impurity for a leaf formula
G(S) = SUM over k of pk (1-pk)
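A tiny numpy helper for this formula (illustrative only; gini_impurity is a made-up name):

```python
import numpy as np

def gini_impurity(labels):
    """G(S) = SUM over k of p_k (1 - p_k), where p_k = fraction of S with label k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

print(gini_impurity([1, 1, 1, 1]))    # 0.0  (pure leaf)
print(gini_impurity([1, -1, 1, -1]))  # 0.5  (worst case for 2 labels)
```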
GBRT pseudocode
H = 0; for t = 1...T: compute ti = yi - H(xi) for all i; h = argmin_h SUM (h(xi) - ti)^2; H = H + a*h [assumes squared loss; sketch below]
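A short runnable version of this loop (assuming numpy + scikit-learn; sklearn's DecisionTreeRegressor plays the role of the fixed-depth regression tree, and gbrt_fit/gbrt_predict are made-up names):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, T=100, alpha=0.1, depth=3):
    """GBRT with squared loss: each tree is fit to the current residuals."""
    trees, pred = [], np.zeros(len(y))                        # H = 0
    for _ in range(T):
        t = y - pred                                          # residuals t_i = y_i - H(x_i)
        h = DecisionTreeRegressor(max_depth=depth).fit(X, t)  # h = argmin SUM (h(x_i) - t_i)^2
        pred += alpha * h.predict(X)                          # H = H + a*h
        trees.append(h)
    return trees

def gbrt_predict(trees, X, alpha=0.1):
    """Evaluate H(x) = a * SUM_t h_t(x); alpha must match the one used in gbrt_fit."""
    return alpha * np.sum([h.predict(X) for h in trees], axis=0)
```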
entropy over a tree
H(S) = |SL|/|S| H(SL) + |SR|/|S| H(SR)
adaboost pseudocode
H0 = 0; init wi = 1/n; for t = 0...: ht+1 = argmin_h e, where e = SUM of wi over {i : h(xi) is not yi}; if e < 0.5: compute a, set Ht+1 = Ht + a*ht+1, and update the wi; OTHERWISE: return Ht (because there is no further improvement on the loss)
Under what conditions on your training set will a CART tree (with unlimited depth) obtain 0% training error
If there are no two training inputs with identical features but different labels
Generic boosting (aka Anyboost) algorithm
Input: l, α, {(xi,yi)}; H0 = 0; for t = 0...T-1: ri = dl(Ht(x1),y1,...,Ht(xn),yn) / dH(xi); ht+1 = argmin_h SUM ri h(xi); if SUM ri ht+1(xi) < 0 (THIS MEANS L(Ht + at+1 ht+1) IS LESS THAN L(Ht)!!!): Ht+1 = Ht + at+1 ht+1; else: return Ht (NO IMPROVEMENT!)
proof/intuition that bootstrapping (drawing w replacement from set D) is like drawing from P
Let Q be the distribution that picks a sample from D uniformly. We can imagine that we first draw D from P and then Q picks a sample from D; this is the same as Q picking a slot in D and then filling that slot with something drawn from P. Which slot we picked doesn't matter, so Q(X=x) = P(X=x). BUT THE DATA IS NOT INDEPENDENT! THE DATASETS Di ARE NOT INDEPENDENT
What is pk in Gini impurity
Let Sk subset S where Sk = (x,y) in S with label k S = S1 U S2... pk = |Sk| / |S| = frac of inputs in S with label k
in CART does the relative scaling between features matter
NO - we only split on one feature at a time against a threshold, so rescaling a feature doesn't change which splits are available
in boosting to minimize L(y^) can we just do gradient descent on y^ directly? if not, what to do instead? Let y^ be [Ht(x1)...Ht(xn)] Let L(y^) = SUM l(y^i, yi) [total training loss of ensemble Ht]
NO! The y^ that minimizes L needs to come from some ensemble H (so y^ lives on some manifold, and −∇L(y^) may move us off this manifold); instead, find h s.t. [h(x1)...h(xn)] is close to −∇L(y^)
does decision tree and KNN return same prediction
No! One easy way to see this that the decision boundaries of a decision tree are always axis-aligned, whereas the decision boundaries of 1NN, the Voronoi diagram of the points, are not necessarily.
time-complexity of KNN adding more data point to train on
O(d)
time-complexity of KNN classifying one test point
O(dn)
time-complexity of KNN training
O(dn)
after how many iterations is adaboost training error 0%
O(log(n))
cost of finding best split in regression tree
O(n log n)
why can P^(zi) = 1/n be considered a good approximation of P in bootstrapping?
P^ can approx P's mean and variance E_z~P^ [z] = SUM zi/n -> E_z~P[z] E_z~P^ [z^2] = SUM zi^2/n -> E_z~P[z^2] all by weak law of large numbers
Bootstrapping
RANDOM SAMPLING WITH REPLACEMENT TECHNIQUE: given D = {zi ~ P}, approximate P with P^(zi) = 1/n, treat P^ as if it were P, and sample uniformly at random from P^ n times with replacement (THE RESULTING SAMPLES/DATASETS ARE NOT INDEPENDENT)
regression tree algorithm
RT(S): if |S| <= k, set the leaf value to the average label (or fit a linear regressor), where k is a hyperparameter meaning you won't split nodes with # points <= k; ELSE: over all dimensions and all values, find the split minimizing |SL|/|S| var(SL) + |SR|/|S| var(SR), then call RT(SL), RT(SR) (split search sketched below)
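A naive numpy sketch of the split search in the ELSE branch (best_split is a made-up name; this version recomputes variances from scratch, so it is O(n^2) per feature rather than the O(n log n) running-sums version the notes refer to):

```python
import numpy as np

def best_split(X, y):
    """Return (feature, threshold, score) minimizing |SL|/|S| var(SL) + |SR|/|S| var(SR)."""
    n, d = X.shape
    best = (None, None, np.inf)
    for f in range(d):
        order = np.argsort(X[:, f])
        xs, ys = X[order, f], y[order]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue                        # can't split between identical feature values
            t = (xs[i] + xs[i - 1]) / 2         # threshold halfway between neighboring values
            left, right = ys[:i], ys[i:]
            score = len(left) / n * left.var() + len(right) / n * right.var()
            if score < best[2]:
                best = (f, t, score)
    return best
```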
epsilon in adaboost
SUM of wi over all i where h_t+1(xi) is not yi (where wi is normalized as in the notes), aka SUM of pi over all i where h_t+1(xi) is not yi (from lecture), aka the weak learner's loss (loss of ht+1): the total weight of the examples where ht+1 made a mistake... AKA THE WEIGHTED CLASSIFICATION ERROR
How to make KD tree and traverse one? traversal complexity?
Split the data recursively in half on exactly one feature at a time, rotating through the features (a good heuristic is to pick the feature with maximum variance). DRAW THE ACTUAL TREE!!! When you need to classify a point, traverse down to its leaf, then compute the nearest neighbor in that box (BASICALLY RUNNING NEAREST NEIGHBORS ON THE LEAF!), compare against the distance to the dividing walls on the way back up to see if the other side could hold something closer, and otherwise compute more distances in search of the nearest neighbor... O(log n) traversal time
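In practice this build-and-traverse pipeline is available off the shelf; a sketch assuming scipy is installed:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))    # 10k training points in 3-D
X_test = rng.normal(size=(5, 3))

tree = cKDTree(X_train)                   # recursive splits on one feature at a time
dist, idx = tree.query(X_test, k=1)       # descend to a leaf, then back up, pruning far boxes
print(dist, idx)                          # distances and indices of the nearest training points
```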
Given a distribution P you can sample a training set D and obtain a classifier h. Imagine you train m such classifiers h1, . . . , hm on m data sets D1, . . . , Dm, each drawn i.i.d. from the data distribution P. As you increase m from m = 1 to m >> 0, how does bias of h^ compare to bias of h?
The bias is unaffected, i.e. the bias of hˆ is identical to the bias of h, because the E[hˆ] = E[h]
does random forest decrease variance
YES overall because it decreases correlation between classifiers but it increases the variance of individual trees
if epsilon in boosting is big, what does that mean for alpha (learning rate)
alpha is small means weight this weak learner very little cuz it did terribly
when is it possible to find pure leaves/consistent tree
always possible if for x=x', y=y' (if 2 feature vectors are the same, they have the same label)
what does adaboost return
an ensemble
what is normalizing constant Z in adaboost identical to
Z = SUM wi = SUM e^-yi H(xi) this is the loss function!!! adaboost uses exponential loss!!!!! then it makes sense for wi to be contribution towards overall loss
what training error does adaboost achieve with sufficient time
ZERO PERCENT!!!!!!!!! (but prone to overfitting)
variance reduction of possibly correlated RV's
[x1 x2 x3] ~ N(0, Σ); σij in the covariance matrix = E[xi xj] - E[xi]E[xj] = E[xi xj] here (zero means); if independent, σij = 0; var(xbar) = σ^2/3 + SUM over i≠j of σij/9; SO WHEN the RVs are POSITIVELY CORRELATED, averaging may not reduce variance much! but reducing the correlation will reduce the variance...
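A quick numpy simulation of that formula (numbers chosen arbitrarily: sigma^2 = 1, pairwise correlation rho = 0.8, averaging 3 RVs):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, rho = 3, 200_000, 0.8
cov = np.full((n, n), rho) + (1 - rho) * np.eye(n)       # sigma^2 = 1 on the diagonal

x_corr = rng.multivariate_normal(np.zeros(n), cov, size=m)
x_ind = rng.normal(size=(m, n))                          # independent case for comparison

print(x_corr.mean(axis=1).var())   # ~ 1/3 + 6*rho/9 = 0.866...  (averaging barely helps)
print(x_ind.mean(axis=1).var())    # ~ 1/3                       (full sigma^2/n reduction)
```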
when does adaboost terminate - does it terminate the moment it reaches 0% training error
adaboost is a for loop! so it doesn't have early stopping based on training error... it does NOT terminate at 0% training error! it keeps boosting as long as there is a weak learner with weighted training error < 0.5 (so 0% training error does NOT mean all weak learners are correct); OTHERWISE, when e = 1/2, it exits the for loop AND terminates!
formal convergence of adaboost
after T iterations, for the original exponential loss, we have 1/n SUM exp(-yi HT(xi)) <= (1-4γ^2)^(T/2) (this also bounds the zero-one loss 1/n SUM 1[sign(HT(xi)) is not yi], since the zero-one loss is upper bounded by the exp loss); SO WHEN THE WEAK LEARNERS ARE GOOD (BETTER THAN 50% ACCURACY), THE TRAINING LOSS DECREASES EXPONENTIALLY!
adaboost renormalization
after you take a step, you need to recompute all weights and renormalize [because the weights depend on the current ensemble Ht!!!]; the unnormalized weight is wi_t+1 = wi e^(-a h(xi) yi); the normalizer Z becomes Z * 2 sqrt(e(1-e)); so wi_t+1 <- wi e^(-a h(xi) yi) / (2 sqrt(e(1-e)))
shrinkage
aka step size a
optimal step size of adaboost
alpha = argmin over a of l(H + ah); compute the derivative w.r.t. a, set it to zero, and solve for a: a = 0.5 ln((1-e)/e) (worked out below)
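The derivative step, written out under the exponential-loss setup above (w_i are the normalized weights, epsilon is the weighted error of h):

```latex
\begin{align*}
\ell(H + \alpha h) &\propto \sum_{i:\,h(x_i)=y_i} w_i e^{-\alpha} + \sum_{i:\,h(x_i)\neq y_i} w_i e^{\alpha}
  = (1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha}
  \quad\text{(up to the positive constant } Z/n\text{)}\\
\frac{\partial}{\partial \alpha}\Big[(1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha}\Big]
  &= -(1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha} = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon}{\epsilon}
  \;\Longrightarrow\; \alpha = \tfrac{1}{2}\ln\frac{1-\epsilon}{\epsilon}
\end{align*}
```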
finding best weak learner in terms of derivatives
argmin SUM dl/dH(xi) h(xi)
ensemble of classifiers means
average of multiple classifiers
random forest in a nutshell
bagged decision trees
bagging reduces ... boosting reduces...
bagging reduces variance boosting reduces bias
ball trees advantage over KD trees
ball trees fix curse of dim and axis aligned splits partition with hyperspheres instead of boxes (partition on low-dimensional manifold that doesn't align with axes)
ID3 algorithm pseudocode
base cases: if all labels are the same, return a leaf with that label; if not, and ALL feature vectors in this set ARE THE SAME, return a leaf with the most common label (or mean label for regression) >> because we can't split further! (*or technically if only one point is left in the set, we can't split anymore*) OTHERWISE: for every feature and every possible split xf > t, find the one that minimizes the IMPURITY of the tree (ex. |SL|/|S| G(SL) + |SR|/|S| G(SR)) - aka the split making the L/R impurities smallest; so for each feature, sort on that feature and try the splits, i.e. pick the xf and t that minimize, with SL = {xf <= t} and SR = {xf > t}; then call ID3 recursively on the subtrees (sketch below)
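A compact recursive sketch of this (numpy only; id3 and gini are made-up names, thresholds are taken at observed values rather than the midpoints the notes use, and there is no depth limit):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def id3(X, y):
    """Return a nested-dict tree; leaves store the most common label."""
    # base cases: all labels identical, or all feature vectors identical (can't split further)
    if len(np.unique(y)) == 1 or np.all(X == X[0]):
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": vals[np.argmax(counts)]}
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:       # candidate thresholds for feature f
            left = X[:, f] <= t
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if best is None or score < best[0]:
                best = (score, f, t, left)
    _, f, t, left = best
    return {"feature": f, "thresh": t,
            "left": id3(X[left], y[left]), "right": id3(X[~left], y[~left])}
```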
benefit of random forest
because for every tree and every split we randomly select a subset of the features, we reduce the correlation between h^i and h^j (at the cost of increasing the variance of the individual trees); now in var(xbar) = sigma^2/n + SUM over i≠j of sigma_ij/n^2, the correlation terms sigma_ij are smaller
why does ID3 not stop if no split can improve impurity? does it stop if all splits lead to the exact same impurity? how does this relate to myopic?
because the next split may help (trees are myopic: they greedily choose the split that minimizes impurity, which may not be globally optimal); ex. XOR: the first split doesn't improve impurity, but the next one does! ID3 ONLY STOPS SPLITTING WHEN ALL INPUTS ARE IDENTICAL, all labels are identical, or the max depth / # nodes is reached; it doesn't matter if all splits lead to the same impurity - just choose a split at random
why is KD tree faster usually
because only need to check 1 box for nearest neighbor (unless close to boundary)
bias of a single decision tree vs bagged treees/random forest
bias for trees is generally pretty low because few assumptions are made about the data - a single decision tree is pretty accurate but can overfit; vs. BAGGED trees, where the BIAS STAYS THE SAME!!! think of it as E(bagged classifier) being basically E(single decision tree) [approximately, by the weak law of large numbers, since RF keeps the classifiers fairly independent]
setup for boosting (adaboost)
binary classification data; a loss function (ex. exponential: exp(-y H(x)))
bagging vs bootstrapping
bootstrapping is the random sampling technique; bagging aka bootstrap aggregation is doing it many times and averaging the resulting classifiers
what size is bootstrap sample
by default it is the same size as the original sample D (the training set)
how can boosted classifiers be stopped prematurely in test time
cascades spend little time on common case but more time on rare case if clear which way prediction will go, stop after a few weak learners!
CART
classification and regression trees
gradient boosted regression tree (GBRT) setup
classification or regression; weak learners are regressors, typically fixed-depth regression trees; the step size a is a hyperparameter; loss function = any differentiable convex loss, L(H) = SUM l(H(xi), yi)
motivation of random forest
consider h^i, h^j in bagging with i ≠ j: they are not independent under the true distribution P because of overlapping samples, so the variance may not be reduced much; random forests are introduced to make h^i, h^j as independent as possible, AKA to DECREASE THE CORRELATION OF THE TREES (thus reducing variance), though it does not make them entirely independent... it also helps with overfitting (if some features aren't that generalizable, you may skip them!)
how does partitioning in KD trees speed up testing
consider the one-neighbor case: 1. identify which side the test point lies in 2. find the nearest neighbor on the same side 3. compute the distance between the test point and the dividing wall; if this distance > d(xt, xNN on same side), we are done and get a ~2x speedup; aka if the distance to the partition is larger than the distance to our closest neighbor so far, none of the points inside the partition can be closer!
motivation for bagging
consider a trained decision tree h^ = ID3(D): h^ is a random quantity (it depends on the random draw of D) with high variance, and high variance = overfitting! so we want to average it out and get hbar = 1/k SUM h^i!!! LOWER VARIANCE
proof that if distance to partition is larger than distance to point on the same side, none of the points in the partition can be closer
d(xt,x) = distance between the test point and a candidate x that lies on the other side of the wall; d(xt,x) = d1 + d2, where d1 = the part on the test point's side and d2 = the part on the candidate's side; let dw = the shortest distance from xt to the wall, so d1 >= dw; then d(xt,x) = d1 + d2 >= dw + d2 >= dw; so if dw is larger than the distance to our current best candidate nearest neighbor, we can discard the whole other side!
entropy overview
define impurity as how close we are to uniform use KL-divergence to compute closeness uses same p1...pk as Gini impurity
how to partition in KD
divide data into two halves, left and right, along 1 feature for each training input, remember the half it lies in
variance reduction of independent RV's
do this by averaging!!! consider iid random xi, xi ~ N(0, σ^2) variance of x bar = SUM xi/n is σ^2/n SO REDUCED!!!! (BECAUSE var(SUM indep) = SUM var(indep) and var(aX) = a^2 Var(X))
decision tree advantage over nearest neighbor
doesn't store training (can instead store # points of each label per leaf - typically pure so just store label of all points) fast during test time (tree traversal time) - prediction is majority label of leaf decision trees don't need metric (split based on feature thresholds not dist)
decision trees overview
don't store training data use it to build tree structure that divides space into regions with similar labels root node represents entire set divide into subtrees by feature val >= or < t ideally leaves are pure (all points have same label), keep dividing until this is the case
why does inner loop in adaboost only update Ht when e < 1/2
e > 0.5 is impossible (when H is closed under negation); e = 0.5 is not useful because then the weak learner h is only as good as a coin toss, so not beneficial (and the step size a would be 0!); meanwhile e <= 0.5 - gamma implies we approximate the gradient well!!! it means -∇L · h > 0, i.e. within 90 degrees of the true negative gradient
what implies a stronger learner
each weaker learner doing better than random coin toss (0.5 - γ)
what kind of method is bagging
ensemble method
weak learner's loss/weighted classification error upper bound and why does it imply approximating gradient well
epsilon <= 1/2 - γ with γ > 0; this holds when H is symmetric (h in H implies -h in H); it means the inner product between the true negative gradient and [ht+1(x1)...ht+1(xn)] is >= 2γ SUM |wj| > 0, i.e. WITHIN 90 degrees of the negative gradient = IMPROVES THE OBJECTIVE!!!
AdaBoost with decision trees (depth 3) is non-parametric
false - the set of parameters is not a function of the number of training instances n; basically, the # of weak classifiers is not related to the # of training points (the number of weights is, BUT this question is about the FINAL model, and that is the ENSEMBLE! the weights are not shipped to production!)
out-of-bag error
for each training point, find all bootstrapped sets that don't contain it and form the average classifier h~i(x) over those sets; the out-of-bag error is the average loss these classifiers yield: 1/n SUM over (xi,yi) in D of loss(h~i(xi), yi); it is a good estimate of the test error because it only uses classifiers that haven't seen that point, and removing the classifiers that did see the point doesn't matter much when we have a large # of classifiers
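For reference, scikit-learn exposes this directly via the oob_score flag (a sketch with a synthetic dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(1 - rf.oob_score_)   # out-of-bag error: a test-error estimate with no validation split
```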
boosting algorithm in english
for every iteration, we are using approximate gradient descent we want to estimate the negative gradient of loss by finding a h that's in our hypothesis class we approximate the loss with taylor's and find a ht+1 that minimizes a portion of it instead then we do Ht+1 = Ht + a ht+1
ID3 algorithm overview
generates decision tree from dataset
in boosting, how to find h that minimizes l(Ht + ah)?
gradient descent in function space
show that finding the best weak learner in adaboost is equivalent to minimizing epsilon/weighted classification error
ht+1 = argmin_h L(H+ah) ≈ argmin_h <∇ℓ(H), h> = argmin_h SUM dl/dH(xi) h(xi) = argmin_h SUM -yi e^(-yi H(xi)) h(xi) = argmin_h SUM -yi h(xi) wi = argmin_h [SUM over h(xi)≠yi of wi - SUM over h(xi)=yi of wi] = argmin_h SUM over h(xi)≠yi of wi = argmin_h epsilon
parametric algorithm
has a constant set of parameters, independent of the # of training samples - think of it as the amount of space needed to store the TRAINED classifier (aka the space the model would take up if you packaged it and sent it to production); ex. the perceptron has w, b, where w depends on the dimension of the training data but not on HOW MANY training samples were used
for hbar = SUM h^i / k, what happens when n->inf
hbar -> E_{D~P^}[ID3(D)], and P^ -> P as n -> inf, so the expectation approaches E_{D~P}[ID3(D)], the expected decision tree! (from the perspective of E_D(1/m SUM hi) = 1/m SUM E_D(hi) = E_{D~P}(ID3(D))) SO IT IS DETERMINISTIC - zero variance! So infinitely many data points WOULD give us zero variance (even if the bootstrap datasets aren't independent), but that's infeasible in practice, so we make the classifiers as independent as possible instead
KL-divergence
higher KL divergence means more mismatch, 0 means a perfect match; KL(p||q) = SUM pk log(pk/qk) >= 0; let q1...qc be uniform (qk = 1/c); then KL(p||q) = SUM [pk log(pk) - pk log(qk)] = SUM [pk log(pk) + pk log(c)] (because qk = 1/c) = SUM pk log(pk) + log(c), since log(c) is a constant and SUM pk = 1; so up to the additive constant log(c), KL(p||q) is just SUM pk log(pk) when q is the uniform distribution
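A tiny numpy check of that formula (kl_to_uniform is a made-up name; 0*log(0) is treated as 0):

```python
import numpy as np

def kl_to_uniform(p):
    """KL(p || uniform over c labels) = SUM pk log(pk) + log(c)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                                      # convention: 0 * log(0) = 0
    return float(np.sum(p[nz] * np.log(p[nz])) + np.log(len(p)))

print(kl_to_uniform([0.25, 0.25, 0.25, 0.25]))   # 0.0  - perfect match with uniform
print(kl_to_uniform([0.97, 0.01, 0.01, 0.01]))   # ~1.2 - far from uniform (nearly pure leaf)
```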
what does alpha (learning rate) in adaboost depend on
ht+1's performance weak learner's performance
what is true about Di ~ D and Di ~ P in bootstrapping
identically distributed!! Each dataset Di is drawn from P, but not independently (UNLESS CONDITIONED ON D - conditional probability RESTRICTS) Q(X=xi) = P(X=xi) hence drawing Di from D is the same as drawing it from P
are bagged datasets independent
if considering P as underlying distribution, no! because knowing one set tells you likelihood of points in other set but if considering D, then yes because it becomes your whole universe (you aren't operating on a subset of the universe anymore)
what are values of the splits
if splitting between values x1 and x2, it is (x1 + x2) / 2!
why does checking distance to wall make sense
if the distance to the partition is larger than distance to our closest neighbor, none of the points inside the partition can be closer!
decision trees parametric or not
if trained to full depth, they are non-parametric (depth of decision tree scales as func of training data, in practice O(log2(n))) but if limit tree depth by max value, is parametric (upper bound of model size known prior to observing training data)
if the distance to the ball is greater than dist to current closest neighbor, what do we do
ignore that ball entirely because d(xt, x) = d1 + d2, and if d1 >= db, then overall >= db + d2 >= db
random forest
in ID3, for every split, randomly select k < d features, find split only using these k
where do KD trees store data
in leaf nodes each training point lies in exactly one leaf node
adaboost algorithm (from lecture)
init H1 = h1; For t = 1...: compute wi = -yi exp(-Ht(xi) yi) and pi = |wi| / SUM |wj| (SO pi is proportional to exp(-Ht(xi) yi)); find ht+1 = argmin_h SUM pi 1[h(xi) is not yi] (THIS IS A WEIGHTED BINARY CLASSIFICATION PROBLEM); weak learner's loss e = SUM of pi over i where yi is not ht+1(xi); Ht+1 = Ht + a ht+1 (runnable sketch below)
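A compact runnable version of this loop (assuming numpy + scikit-learn; depth-1 trees stand in for the weak learner, and adaboost/adaboost_predict are made-up names; note the guard for e = 0, which would give an infinite alpha as on the full-CART card above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n = len(y)
    p = np.full(n, 1.0 / n)                          # normalized weights p_i
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        pred = h.predict(X)
        eps = p[pred != y].sum()                     # weighted classification error
        if eps == 0 or eps >= 0.5:                   # eps=0 -> infinite alpha; eps>=0.5 -> no usable weak learner
            break
        alpha = 0.5 * np.log((1 - eps) / eps)        # optimal step size
        p = p * np.exp(-alpha * y * pred)            # mistakes get heavier
        p /= p.sum()                                 # renormalize (same as dividing by 2*sqrt(eps*(1-eps)))
        learners.append(h); alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    H = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(H)
```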
boosting algorithm in terms of pi, etc.
init H1 = h1; For t = 1...: compute y^i = Ht(xi); compute wi = dL(y^i, yi)/dy^i; pi = |wi| / SUM over j of |wj|; find ht+1 = argmin_h SUM pi 1[h(xi) is not -sign(wi)]; Ht+1 = Ht + a ht+1
boosting algorithm at very high level
initialize H1 = h1 in ℋ; For t = 1...: find a new classifier ht+1 s.t. Ht+1 (the ensemble at t+1) = Ht + a ht+1 has smaller training error, aka an ht+1 that minimizes l(Ht + a ht+1), where l(H) = 1/n SUM l(H(xi), yi); WHICH IS EQUIVALENT TO FINDING ht+1 = argmax <[ht+1(x1)..ht+1(xn)], -∇L(ŷ)>; MAXIMIZING THE INNER PRODUCT = minimizing the angle theta = moving in a direction as close to the true negative gradient as possible = decreasing the loss as much as possible
ball tree construction pseudocode
input: set S, n = |S|, k; func BALLTREE(S, k): if |S| < k, stop; otherwise pick a random x0, pick x1 = the point maximizing d(x0, x1), then pick x2 = the point maximizing d(x2, x1) ****those two steps pick a direction (x1 - x2) with large spread!!!**** project the data onto (x1 - x2), aka compute (x1 - x2)^T xi; take the median m of the projections; SL = {points with projection < m}, SR = {points with projection >= m}; return tree(center c = mean(S), radius r = max d(x, c), children = BALLTREE(SL, k), BALLTREE(SR, k)) (sketch below)
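A direct numpy transcription of that pseudocode (balltree is a made-up name; a guard is added for the degenerate case where all projections coincide):

```python
import numpy as np

def balltree(S, k=20):
    """Recursive ball-tree construction on a 2-D array S of points."""
    if len(S) < k:
        return {"points": S}                                  # small enough: leaf
    x0 = S[np.random.randint(len(S))]                         # random point
    x1 = S[np.argmax(np.linalg.norm(S - x0, axis=1))]         # farthest from x0
    x2 = S[np.argmax(np.linalg.norm(S - x1, axis=1))]         # farthest from x1 -> large-spread direction
    proj = S @ (x1 - x2)                                      # project onto (x1 - x2)
    m = np.median(proj)
    SL, SR = S[proj < m], S[proj >= m]
    if len(SL) == 0 or len(SR) == 0:                          # all projections equal: stop splitting
        return {"points": S}
    center = S.mean(axis=0)
    return {"center": center,
            "radius": float(np.max(np.linalg.norm(S - center, axis=1))),
            "left": balltree(SL, k), "right": balltree(SR, k)}
```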
if datasets Di were truly independent, what happens to 1/m SUM hDi
it approaches hbar as m -> inf due to WLLN hence variance of 0! but this is not actually the case bc we bootstrap samples so the datasets aren't independent (hDis aren't truly indep)
why is KNN slow during testing
it does a lot of unnecessary work
Assume you pre-process all your features in the following way: you sort each feature independently. For each feature, you then assign all those inputs that share the lowest feature value a new feature value of 1, all those with the second lowest value a 2, etc. How does this affect the trees that you construct?
it doesn't the relative ordering is maintained! so we are still processing the same splits across the different features, so the best one is still the same
in adaboost, how to interpret wi
it is the contribution of each training point towards the overall loss
how does increasing number of trees in RF create a better decision boundary
it smooths because it's averaging decisions across trees
adaboost is powerful bc
it turns any weak learner that can classify a weighted version of training set w below 0.5 error into a strong learner whose training error decreases exponentially requires only O(log(n)) steps until consistent
for what algorithms is boosting ineffective
kNN, unlimited-depth decision trees, kernel SVMs (all essentially have zero bias - all are highly non-linear!!!); linear classifiers on non-linearly separable data, because there is not much use in ensembling linear classifiers; random labeling or mode labeling (low bias / they don't actually learn, so they aren't weak learners); NOT naive bayes (probably because it's too vague to tell whether the bias is high without knowing how you modelled p([x]a|y))
approximation approach to finding min size tree w pure leaves
keep splitting data to minimize an impurity function, which measures label purity amongst children
regression trees and what does decision boundary look like? what about classification decision boundaries?
labels are continuous; uses squared loss impurity; each leaf stores either a linear regressor or the average label of its points; the regression prediction is piecewise constant (can overfit to each label)... whereas for classification trees the decision boundary looks like a kNN classifier boundary made of axis-aligned pieces - don't draw it like a KD tree
how does adaboost find the best weak learner (aka find ht+1 = argmin)
let Z = SUM e^(-yi H(xi)) (the normalizing factor) and wi = (1/Z) e^(-yi H(xi)) [from the lecture notes], so SUM wi = 1; ri = dl/dH(xi) = -yi e^(-yi H(xi)); we need to solve ht+1 = argmin_h SUM ri h(xi), which works out to argmin_h SUM over i where h(xi) is not yi of wi = ϵ; we want SUM ri h(xi) to be negative, and for that to happen we just need ϵ < 0.5
good choice of m and k in RF
let k = sqrt(d) m is as large as possible
GBRT how to plug in to anyboost formulation of ht+1 = argmin SUM ri h(xi)
let ti = -ri = yi - H(xi) [when using squared loss] argmin SUM ri h(xi) simplifies to argmin SUM (h(xi) - ti)^2
CART overall pros/cons
lightweight, fast during testing but not competitive in accuracy (can become strong via bagging(random forests) or boosting (gradient boosted trees))
entropy formula and derivation
max over p of KL(p||q) = max over p of SUM over k of pk log(pk) // WE WANT TO MAXIMIZE THE DIFFERENCE FROM UNIFORM = min over p of -SUM pk log(pk) = min over p of H(S); so the entropy impurity is H(S) = -SUM pk log(pk) (and a good split minimizes it)
gini impurity graph
maximized when the labels are uniformly distributed (each label is equally likely) and goes to zero as one label dominates; for a 2-label problem it is an inverted parabola, G = 2p(1-p), maximized at p = 0.5 where the gini impurity is 0.5
in adaboost, if SUM ri h(xi) < 0, what is true about epsilon
it means epsilon < 0.5 (so the weak learner is better than random guessing, i.e. better than 50% weighted error)
what does large pi in boosting mean, pi = |wi| / SUM |wj|
means it contributes a lot to loss so points in a previous iteration that were classified wrong, now have larger weight in next iteration, means fixing them is MOST important to loss!
how to negate a tree
negate predictions
is KL-divergence a metric
no because not symmetric KL(p||q) is not the same as KL(q||p)
do random forests need a training/validation split
no because we have out of bag error! so we already have a so-called "validation error" estimate and so we don't need to reduce the training set
adaboost - can epsilon > 0.5?
no! if H is negation closed (for every h in H we also have -h in H) it can't happen: if h has error e, then -h has error 1-e, so you could just flip h to -h and obtain a classifier with smaller error; but h was chosen to minimize the error - contradiction!
is it computationally tractable to find a pure and maximally compact tree (the global optimum)
no! NP hard to find min size tree use greedy instead to approximate (so ID3 finds local optimum)
are gradient boosting algorithms part of boosting family
not necessarily bc they don't guarantee that training error decreases exponentially so these algorithms are called stage-wise regression sometimes
does bagging with bootstrapping achieve zero variance
not really because h^D = 1/m SUM hDi no longer approaches hbar because weak law of large numbers only works for iid samples if you got inf training samples it would, but that's not usually the case
what happens in adaboost when two training inputs in a binary classification problem are identical in features but have diff labels
one point is labeled correctly for a while, but the other point's weight keeps increasing until it needs to be classified correctly to reduce the loss; now the first point is incorrect and its weight increases... this loops forever... SO both points end up with high weights that dominate the training set, the weak learner can't do better than 50% weighted accuracy, and the algorithm stops
ball tree general idea
partition the data on the underlying manifold instead of the entire feature space: pick a random point, find the point farthest from it, then find the point farthest from that one; draw the line connecting them and draw the perpendicular bisector at the median value of the projections onto that line; split the points with this bisector
what will a weak learner focus on classifying correctly
points with HIGH weights means getting HIGH weights right will reduce loss more
ensemble method
predictions are made based on the combination of a collection of models ensemble H(x) = SUM at ht(x) combine multiple simpler algs to get a better learning alg
pros and cons of adaboost as a result of using exponential loss
pro: converges fast con: bad for noisy data, lots of overfitting
pros and cons of KD trees
pros: exact, easy to build (find the MEDIAN of a feature); cons: the curse of dimensionality makes KD trees ineffective in high dimensions (TOO MUCH splitting, and points tend to be roughly evenly spaced in high dimensions, so you have to check MORE boxes!); all splits are axis-aligned (but the data may not be aligned with the features, ex. it may lie along a diagonal); you still need to store all the training data!!!; in the worst case it is the same as plain KNN
pros and cons of decision trees
pros: fast at test time cons: prone to overfitting (with outliers, may split them out even though they aren't important) so high variance
random forest ID3 vs regular ID3
regular ID3 looks for split in all d dimensions ID3 in random forest is only looking in k randomly picked dims
boosting as two player zero sum game
row player plays hypothesis col player plays example (x,y) row player gets loss 1[h(x) is not y] col player gets loss -1[h(x) is not y] boosting = running alg to find nash equilibrium of game
random forest algorithm pseudocode? two sources of randomness
sample m datasets D1...Dm with replacement (bootstrapping) (THIS IS RANDOMNESS 1); for each Dj, train a full decision tree (max depth = inf), but before each split sample k features and use only these for the split (RANDOMNESS 2); the final classifier is the average of these trees (sketch below)
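Both sources of randomness in one short sketch (assuming numpy + scikit-learn; max_features='sqrt' supplies the per-split feature sampling, and random_forest_fit/random_forest_predict are made-up names):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, m=100, seed=0):
    rng = np.random.default_rng(seed)
    n, trees = len(X), []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                   # randomness 1: bootstrap sample
        tree = DecisionTreeClassifier(max_depth=None,      # grow to full depth
                                      max_features="sqrt", # randomness 2: k = sqrt(d) features per split
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    votes = np.mean([t.predict(X) for t in trees], axis=0)  # average the trees
    return np.sign(votes)                                    # majority vote for labels in {-1, +1}
```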
non-parametric algorithm
scales w # training samples ex. KNN bc during training, store entire training data, so num params/storage required grows linearly w training set size
what question does boosting answer
scenario: large bias classifiers with high training error (CART trees w limited depth) can weak learners be combined together to generate a strong learner w low bias?
assumption of RF/decision trees
similar inputs have similar labels
ball tree performance comparison with KD trees
slower than KD trees in d <= 3 but faster in high dimensions; both are affected by the curse of dimensionality, but ball trees tend to still work if the data lies on a low-dimensional manifold (so they usually work better than KD trees in high dimensions)... notably, the ball tree computation for a split is O(n) [you have to project everything onto the new line!], whereas a KD tree split is cheaper (just take the median along one coordinate)
KD trees
space partitioning data structure for organizing points in K-dimensional space can speed up KNN during testing partitions feature space! aka recursively subdivide along features to make KNN fast
KD tree construction
split data recursively in half (as in # of points is equal on both sides) on exactly one feature rotate through features (good heuristic is to pick feature with max variance) - MEANS YOU CAN SPLIT ON THE SAME FEATURE MANY TIMES so root may be f1 (feature 1) > t1 (either yes or no)... children could be f2 > t2 and f2 > t3...so forth
stochastic gradient boosting
subsample training data for each weak learner combines benefits of bagging + boosting
how is adaboost iterative
the next iteration (next weak learner) focuses on the data that the previous weak learner misclassified; reweighting = misclassified data gains more weight
training loss vs training error
training loss is often an upper bound on training error
finding [h(x1)...h(xn)]T that is close to −∇L(y ̂ ) in boosting gradient descent in function space as weighted binary classification
the training set can be rewritten as {(pi, xi, -sign(wi))}, with wi = dl(y^i, yi)/dy^i [THIS IS ri in the lecture notes], y^i = Ht(xi), and pi = |wi| / SUM |wj|; then ht+1 = argmin_h SUM pi 1[h(xi) is not -sign(wi)]; THIS ht+1 LATER GETS MULTIPLIED BY ALPHA BEFORE BEING ADDED TO Ht!!!
One advantage of bagging is that all ensemble members (i.e. classifiers) can be trained in parallel
true
one advantage of Random Forests is that you obtain meaningful probability estimates as your output predictions P(y|x)
true probabilities are calculated from the votes of the different trees!!!!
adaboost how are weights changed in the update to reflect classifier accuracy
the unnormalized weight is updated to wi * e^(-a h(xi) yi); if the classifier is incorrect on that point, the factor is e^a > 1, so the weight increases; if the classifier is right, the factor is e^-a < 1, so the weight decreases
what does bagging reduce
variance
effect of increasing ensemble size on bias/variance/noise
variance decreases, it becomes 1/m of variance of single classifier bias and noise are unaffected bc they use expectations, which don't depend on m
how does bagging reduce variance in terms of variance equation (WHEN Di are independently drawn from P)
variance is E[(the classifier you drew - the average classifier)^2]; the classifier resulting from bagging is h^_D = 1/m SUM hDi; by the weak law of large numbers this approaches E_{D~P}(ID3(D)), which is essentially the average classifier; so since h^_D approaches hbar, the variance goes to 0 - WHEN the Di are independently drawn from P
gradient descent in functional space idea
want to minimize l(H + ah) ≈ l(H) + a ∇ℓ(H)^T h (Taylor approximation); so we want to minimize l(H) + a SUM dl/dH(xi) h(xi), i.e. minimize SUM ri h(xi), where ri = dl/dH(xi); as long as SUM ri h(xi) < 0, we make progress! so we just need some procedure that solves this minimization; NOTE: this minimization can be rewritten as a weighted binary classification problem!
what do we not want in terms of impurity
we don't want uniform distribution! don't want p1 = .... pc = 1/c (then each leaf is equally likely)
approximate gradient descent for minimizing L(y) where yt+1 = yt - ng^t g^t is not gradient of L(yt) under what condition of g^t can we guarantee L(yt+1) < L(yt)? and proof
when <g^t, ∇L(yt)> > 0 (SO YOU'RE STILL GOING IN AN OKAY DIRECTION), aka a positive inner product; proof (using a FIRST-ORDER TAYLOR expansion and g^t · gt > 0): let ∇L(yt) = gt and let g^t be the estimate; decompose g^t = g^t_parallel + g^t_perp with g^t_parallel = a*gt (the projection; a > 0 because the angle between the estimated and true gradient is less than 90 degrees - imagine drawing it out!); then L(yt+1) ≈ L(yt) - n gt^T g^t = L(yt) - n gt^T (a*gt + g^t_perp) = L(yt) - (n*a) gt^T gt = L(yt) - something positive
when is adaboost a bad idea
when the data has label noise! exponential loss ensures the mislabeled data points will also be classified correctly means OVERFITTING!!!!!!
when does adaboost stop
when weak learners no longer achieve accuracy better than 50% (aka error not less than 50%) AT THIS POINT RETURN THE ENSEMBLE!!!!!!!!!!!! exit the for loop!!!!!
using more classifiers averaged together for bagging does what when using excessively high number of classifiers
will slow down classifier reduces variance a lot but won't increase its error (BIAS)
bagging in test time
ŷ = [p, 1-p], where p = (# of trees predicting -1) / k (CLASSIFICATION)
are CART trees with limited depth weak learners
yes
is pi in boosting a probability
yes SUM pi = 1 pi >= 0
can decision trees fit non-linear trends
yes could be considered piecewise linear
can weak learners be combined together to generate a strong learner w low bias?
yes create ensemble classifier H = SUM at ht(x) build in iterative fashion (in iteration t add classifier at ht(x)) similar to gradient descent iterations
does adaboost converge very fast
yes! because we know the optimal step size alpha
gradient descent for minimizing L(y), when is L(y+1) < L(yt)
yt+1 = yt - n*gt, where gt = the gradient of L at yt; when n is small and gt is not zero, L(yt+1) < L(yt)
what depth is the root
zero
gini impurity G^T(S) of a tree
|SL|/|S| GT(SL) + |SR|/|S| GT(SR) where SL and SR are disjoint, S = SL U SR |SL|/|S| = frac of inputs in left subtree