Perceptron + MLE/MAP + Naive Bayes + Logistic/linear Regression
assumptions required for theorem that perceptron algorithm makes at most 1/γ^2 mistakes
all inputs xi live in the unit sphere (||xi|| <= 1); there exists a separating hyperplane w* with ||w*|| = 1 (the OPTIMAL hyperplane); γ is the distance from the hyperplane to the closest point
derivation of normal equation for linear regression
argmin over w of SUM (wT xi - yi)^2 = argmin ||XT w - Y||^2 = argmin (XT w - Y)T (XT w - Y); this is finding Y^ = XT w in the column space of XT that is closest to the vector Y, which is just an orthogonal projection. The gradient is 2(XXT w - XY); setting it to 0 gives the normal equation XXT w = XY
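A minimal numpy sketch of this, assuming the note's convention that data points are the columns of X (the toy data here is purely illustrative):

import numpy as np

# Toy data: X is d x n with one data point per column, as in these notes.
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))                      # columns are the points xi
w_true = np.array([1.0, -2.0, 0.5])
Y = X.T @ w_true + 0.1 * rng.normal(size=n)      # yi = w^T xi + noise

# Normal equation: XX^T w = XY (np.linalg.solve is preferred over an explicit inverse)
w_hat = np.linalg.solve(X @ X.T, X @ Y)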
ridge linear regression
argmin over w of lambda ||w||^2 + SUM (wT xi - yi)^2; the solution is w^ = (XXT + lambda I)^-1 XY. This works even if XXT is not full rank, because for lambda > 0, XXT + lambda I is always positive definite. Ridge is the MAP solution when the errors are assumed to be normal and w is given a normal prior
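A short sketch of the ridge closed form under the same column-convention for X (lambda and the toy data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))                     # columns are the points xi
Y = rng.normal(size=50)
lam = 0.1                                        # any lam > 0 makes XX^T + lam*I positive definite
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(3), X @ Y)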
in coin flip, how does map and mle estimate relate
as n -> inf, theta_MAP approaches theta_MLE, since the prior parameters alpha, beta become IRRELEVANT for LARGE n
how does MAP treat parameters
as random variables so they have some probability distribution (ex. Beta)
how many mistakes does perceptron algorithm make (aka RUNTIME)
at most 1/γ^2, when the following hold: all inputs xi live in the unit sphere (||xi|| <= 1); there exists a separating hyperplane w* with ||w*|| = 1; γ is the distance from the hyperplane to the closest point. SO THE BOUND APPLIES AFTER THE DATA IS NORMALIZED!
bias term and necessity
b necessary because otherwise the hyperplane defined by w would always have to go through the origin
why is NB faster than LR
because LR requires iterative optimization (gradient descent), while NB training is just counting and prediction is just multiplication
why does naive bayes seek to estimate P(Y|x)
because then you can use the Bayes optimal classifier h(x_test) = argmax over y of P(y|x_test)
does y(xTw*) = |xT w star|
yes: w* is a perfect classifier, so all training points are classified correctly and therefore y = sign(xT w*)
assumptions of perceptron algorithm
binary classification; data is linearly separable
why is normal equation for linear regression often not used
computing inverse is too computationally expensive so often use gradient descent, etc.
is naive bayes assumption conditional independence or true independence!
conditional independence! because the label absolutely affects the feature values xa. Imagine complete independence: then e.g. the probability that the color is orange would not depend on the label (banana vs orange) at all, which is clearly not true
proof that perceptron algorithm makes at most 1/γ^2 mistakes
consider the effect of an update on wT w*: (w + yx)T w* >= wT w* + γ, so wT w* grows by at least γ per update. Meanwhile (w + yx)T (w + yx) = wT w + 2y(wT x) + y^2 (xT x) <= wT w + 1, since 2y(wT x) <= 0 (w was wrong on x) and 0 <= y^2 (xT x) <= 1 (y^2 = 1 and all xT x <= 1), so wT w grows by at most 1. Since w = 0 initially, wT w* = 0 and wT w = 0 at first, so after M updates: 1. wT w* >= Mγ and 2. wT w <= M. By Cauchy-Schwarz, wT w* <= ||w|| ||w*|| = ||w|| <= sqrt(M), so Mγ <= sqrt(M), and therefore the number of updates M <= 1/γ^2
how should you compute maximum
take the derivative and set it to 0, AND THEN check the second derivative to ensure it truly is a max!
positive examples from perceptron are such that
dot(w, x) + b > 0
negative examples from perceptron are such that
dot(x,w) + b < 0
choose P(xa|y) and P(x|y) for naive bayes with categorical features
each feature a falls into one of Ka categories (ex. single/married), so we can use MLE or MAP. The parameter estimate is P(xa = j | y = c) = (# samples with label c that have feature a with value j + l) / (# samples with label c + l * Ka); set l = 0 to get the MLE estimator, l > 0 for MAP/smoothing. The generative model assumes data is generated by first choosing the label, then tossing a die per feature to get its actual value. Then P(x|y = c) = PROD over a of theta_jac, where theta_jac = P(xa = j | y = c)
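A tiny sketch of this estimator with smoothing parameter l (the toy arrays and the helper name theta are made up for illustration):

import numpy as np

# toy data: each row is a sample, each column a categorical feature (integer-coded)
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])   # n = 4 samples, 2 features
y = np.array([0, 0, 1, 1])                        # labels
l = 1                                             # l = 0 gives MLE, l > 0 gives MAP/smoothing

def theta(a, j, c, K_a):
    """Estimate P(x_a = j | y = c) with +l smoothing."""
    in_class = X[y == c]
    return (np.sum(in_class[:, a] == j) + l) / (len(in_class) + l * K_a)

print(theta(a=0, j=1, c=1, K_a=2))   # -> (2 + 1) / (2 + 2) = 0.75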
generative learning/algorithms
estimate P(x, y), often by estimating P(y) and P(x|y) [PROBABILITIES]; does NOT draw boundaries between classes, instead models how the data is placed in space
what is parameter estimation for n independent coin flips where theta = P(head), MLE vs MAP
MLE estimate is n1/n, where n1 = # heads, from argmax P(D|theta) = argmax ln likelihood (looking at the SPECIFIC SEQUENCE of coin flips, NOT the binomial count). The MAP estimate is (n1 + m) / (n + 2m): smoothing adds m to the numerator and 2m to the denominator, which assumes a 50% prior
how to use mle in binary classification
estimate P(y|x): assume P(y|x; theta) has some distribution involving theta. MLE gives argmax P(D|theta) = argmax PROD P(xi, yi | theta) = argmax SUM ln P(yi | xi; theta). Notice P(xi, yi | theta) = P(yi | xi; theta) P(xi), and the P(xi) term does not depend on theta, so it drops out of the argmax
binary classification with MAP
estimate P(y|x): assume P(y|x; theta) has some distribution involving theta. argmax P(theta | D) = argmax P(theta) PROD P(xi, yi | theta) = argmax ln P(theta) + SUM ln P(yi | xi; theta)
when the Naive bayes assumption doesn't hold, the algorithm cannot be effective
false
naive bayes assumption
feature values are INDEPENDENT GIVEN THE LABEL, i.e. (formally) P(x|y) = PROD over a of P(xa | y), where xa = [x]a is the value of feature a; e.g. it implies all words in an email are conditionally independent given the label
goal of linear regression
find h(x) = w^T x that fits data well where w has b = w0 appended and x has 1 appended
MLE principle
find the theta that maximizes the likelihood of the data: theta^ = argmax P(D|theta) = argmax likelihood = argmax log likelihood; take the derivative (or partial derivatives) WRT theta and set to 0
is naive bayes generative or discriminative
generative because it models P(x|y)
how is MAP a Bayesian perspective
the ground-truth parameter is a RV; search for the parameter maximizing the posterior
how is MLE a frequentist perspective
the ground-truth parameter is unknown but fixed; search for the parameter that makes the data as likely as possible
if yi is -1, 1 and features are multinomial then naive bayes returns
h(x) = argmax P(y) PROD P(xa | y) EQUALS sign(wT x + b)
naive bayes Bayes classifier definition
h(x) = argmax P(y|x) = argmax P(x|y)P(y), because P(x) doesn't depend on y so it can be dropped from Bayes rule; = argmax PROD P(xa|y) P(y) = argmax SUM log P(xa | y) + log P(y)
h function for perceptron
h(xi) = sign (wT xi + b) MORE COMMONLY: sign(wTx) because we add 1 to end of xi and b to end of w
naive bayes laplace smoothing
handles the zero-probability problem: P(x has item a | y) may be 0 if that word does not appear at all in training, so add l to the numerator (l = 1 is Laplace) and l * d to the denominator (d being the total # of words/letters)
naive bayes as linear classifier derivation case
happens when yi is in {-1, 1} and features are multinomial (labels do NOT always have to be -1, 1 for NB to be linear): h(x) = argmax P(y) PROD P(xa | y) = sign(wT x + b), where [w]a = log(theta_a+) - log(theta_a-) and b = log P(y=1) - log P(y=-1); recall P(xa | y = 1) is theta_a+ ^ xa
relationship between weight vector w and hyperplane
hyperplane dot(x,w) + b = 0 is perpendicular to w
guaranteed convergence of perceptron
if dataset is linearly separable, perceptron will find separating hyperplane in finite # of updates (otherwise loops forever)
how is MLE consistent
if our model assumption is correct (i.e. the coin flips really are Bernoulli), then the parameter estimate -> optimal as n -> inf (as long as samples are iid - unbiased estimator)
geometric intuition of perceptron
if a point is incorrectly classified, you take its label y and add y*x to w, the vector perpendicular to the hyperplane; this moves the corresponding line the right way, e.g. if the label should be -1, you want to subtract x from w
perceptron algorithm
init w = 0 (misclassifies everything); while TRUE do: m = 0 (# misclassifications); for (xi, yi) in D: if yi(wT xi) <= 0 then w = w + yi*xi and m++; end for; if m = 0 then break; end while
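A runnable numpy sketch of this loop (the toy points, with a constant 1 appended, are illustrative; labels assumed in {-1, +1}):

import numpy as np

def perceptron(X, y):
    """X: n x d array (rows are points, constant 1 already appended); y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    while True:
        m = 0                                   # misclassifications this pass
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:              # wrong side (or exactly on the hyperplane)
                w = w + yi * xi
                m += 1
        if m == 0:                              # no mistakes: separating hyperplane found
            return w

# only terminates if the data is linearly separable
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))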
input, hypothesis, hypothesis class and loss func for linear regression
input: dataset {(xi, yi)} with xi in R^d and yi in R; hypothesis: h(x) = wT x; hypothesis class: all linear functions; loss: squared loss (wT xi - yi)^2, or equivalently the average 1/n SUM (wT xi - yi)^2
how to imagine naive bayes as solving many problems
it is solving multiple 1D probability estimation problems
logistic regression MLE estimate for w (b is absorbed into w)
maximize the conditional likelihood [SINCE THIS IS ABOUT discriminative learning, our "data" is y!]: max P(y|X, w) = PROD P(yi | xi, w); taking the log, wMLE = argmin [sometimes with a 1/n factor] SUM log(1 + e^(-yi wT xi)). No closed form, so use gradient descent on the negative log likelihood SUM log(1 + e^(-yi wT xi))
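A minimal sketch of this negative log likelihood and its gradient for labels in {-1, +1} (rows of X are points here; np.logaddexp is used for numerical stability):

import numpy as np

def nll(w, X, y):
    """Negative log likelihood: SUM_i log(1 + exp(-yi w^T xi))."""
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def nll_grad(w, X, y):
    """Gradient: -SUM_i (1 - P(yi | xi, w)) * yi * xi."""
    p = 1.0 / (1.0 + np.exp(-y * (X @ w)))      # P(yi | xi, w)
    return -(X.T @ ((1.0 - p) * y))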
MAP stands for
maximum a posteriori probability
MLE stands for
maximum likelihood estimation
consider the effect of an update on w, what is true of y(xTw) and y(xTw*)
y(xT w) <= 0 because x is misclassified by w (otherwise we wouldn't update); y(xT w*) > 0 because w* classifies all points correctly
predicting y^ given xtest using normal equation
y^ = xtestT (XXT)^-1 XY = SUM over i of (xtestT (XXT)^-1 xi) * yi; the term in parentheses is a scalar, like a similarity score between xtest and xi
can linear regression be derived with MAP/MLE, if so how
yes: assume some distribution for P(y|x; w) (this gives the likelihood, since P(D|w) = PROD P(yi|xi; w)), estimate w with MLE/MAP, and use that w in the linear prediction
even with the naive bayes assumption we still have to determine how to model p(x|y)
yes need to choose P(xa | y)
can perceptron alg be applied to infinite dimension space
yes as long as margin exists so no curse of dimensionality
does naive bayes reduce the amount of data
yes, because instead of needing a ton of data to get accurate estimates of P(X|Y) (very hard to see instances of all combinations of X), you only need enough data to estimate each P(Xa|Y); this is much easier because a specific feature value (instead of an entire feature vector) occurs with much higher probability
can naive bayes be used with MLE and MAP
yes! parameters can be fit in many different ways
for all x, how does y(xTw*) = |xT w star| relate to γ
y(xT w*) = |xT w*| >= γ: w* is a perfect classifier, so all training points are correct and therefore y = sign(xT w*); the >= γ part follows from the definition of the margin
normal equation linear regression
XXT w = XY
how are x and w modified at initial part of perceptron algorithm
xi becomes [xi 1] w becomes [w b] this means h(xi) = sign(wT x)
how to tell if xi is classified correctly for perceptron
xi is on the correct side of the hyperplane if yi(wT xi) > 0, i.e. the signs match (requires yi in {-1, 1})
use logistic regression to make a prediction
y = 1 if P(1|xtest, w) > 0.5 otherwise -1
in the binomial case with a beta distribution prior, MAP essentially hallucinates samples
True. The hallucinated samples mean that with less data the result is closer to the prior, while with more data the data dominates
how does naive bayes estimate P(y|x)
P(y|x) = P(x|y)P(y)/P(x) by Bayes rule! Estimate P(y = c) = SUM I(yi = c) / n, and estimate P(x|y) by assuming P(x|y) = PROD P(xa | y), where xa is one element of the feature vector
steps to MAP
1. define prior P(Theta) 2. define likelihood P(D|theta) 3. compute posterior proportional to prior x likelihood 4. compute max!
logistic function shape
1/(1 + e^-x): as x -> -inf, y -> 0; as x -> inf, y -> 1; y = 0.5 at x = 0. The flipped version 1/(1 + e^x): as x -> inf, y -> 0; as x -> -inf, y -> 1; y = 0.5 at x = 0
ordinary least squares (OLS)
A method for estimating the parameters of a multiple linear regression model: the OLS estimate is obtained by minimizing the sum of squared residuals, min over w of 1/n SUM (xiT w - yi)^2 (aka squared loss). It is optimized with gradient descent etc., or via the closed form w = (XXT)^-1 XY. OLS is the MLE when the errors are assumed normally distributed
Given x, how does naive Bayes know which class to predict
For each class y = c, compute argmax over c of P(y = c | x), applying naive Bayes to exactly the features present in x; do NOT add additional features. Even if x does not cover the full feature space, naive Bayes can still be used
how does MAP differ from MLE in terms of the parameter estimate
MAP often has a regularization term that represents the prior but also prevents w from getting too big
what is MLE estimate for P(y|x) and why does this not work (hence a need for naive bayes)?
MLE estimate: P(y|x) = P(y,x)/P(x), with P(x) estimated as (1/n) SUM I[xi = x] and P(y,x) as (1/n) SUM I[xi = x AND yi = y]. This fails in high-dimensional spaces or for continuous x because there are few (or no) training vectors with exactly the same features as x, so numerator and denominator -> 0: for continuous x, P(x = x) is 0, and if you never see x in training the denominator is 0 too!
logistic regression with SGD as optimizer
objective function: the MLE objective (kind of like a loss). Initialize w = 0; while not converged: randomly sample one training pair (x, y), compute the gradient of the MLE objective on that pair, and do the SGD update w^(t+1) = w^t + a (1 - P(y|x, w^t)) y x
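A sketch of that SGD loop in numpy (the step size a, iteration count, and function name are illustrative; labels in {-1, +1}, rows of X are points):

import numpy as np

def logistic_sgd(X, y, a=0.1, iters=1000, seed=0):
    """One-sample-at-a-time SGD on the logistic MLE objective."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        i = rng.integers(len(y))                        # randomly sample one training pair
        p = 1.0 / (1.0 + np.exp(-y[i] * (w @ X[i])))    # P(yi | xi, w)
        w = w + a * (1.0 - p) * y[i] * X[i]             # the update from the note above
    return w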
ordinary least squares vs ridge regression
OLS has no regularization: w = (XXT)^-1 XY. Ridge has L2 regularization, with closed form w = (XXT + lambda I)^-1 XY. OLS results from MLE; ridge results from MAP (Gaussian prior on w)
choose P(x|y) for naive bayes with continuous features
P(xa|y = c) is a Gaussian; estimate its parameters using MLE or MAP, using only the samples with the correct class to estimate the mean and variance: nc = # samples with label c, mu_ac = (1/nc) SUM xia over samples with label c, sigma^2_ac = (1/nc) SUM (xia - mu_ac)^2 over samples with label c. The full P(x|y = c) is then also Gaussian, with mean mu_c and covariance Sigma_c = diagonal matrix with sigma^2_ac on the diagonal
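A sketch of the per-class mean/variance estimates and a log-density prediction (the small variance floor 1e-9 is an implementation detail, not from the notes; function names are made up):

import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class priors, means, and variances (MLE), one Gaussian per feature."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict(x, classes, priors, means, variances):
    """argmax_c log P(y=c) + SUM_a log N(x_a; mu_ac, sigma_ac^2)."""
    log_post = np.log(priors) - 0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)
    return classes[np.argmax(log_post)]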
for continuous features, how does (gaussian) naive bayes relate to logistic regression
P(y|x) becomes EXACTLY the logistic regression form 1/(1 + e^(-y(wT xi + b))) for y in {+1, -1} WHEN THE PER-FEATURE VARIANCES ARE THE SAME ACROSS CLASSES, so GNB and LR produce the same model in that case; more generally, if the NB assumption holds, then any exponential-family NB and LR will asymptotically produce the same model
logistic regression equation
P(y|xi) = 1/(1 + e^(-y(wT xi + b))); estimate w and b directly with MLE or MAP by maximizing the conditional likelihood PROD P(yi | xi; w, b) [SINCE THIS IS ABOUT discriminative learning, our "data" is y!]
logistic regression vs naive bayes
the two models produce the same result as n -> inf IF the NB assumption holds, so logistic regression is the discriminative counterpart (same results, just different modeling). LR does not model P(x|y); it models P(y|x) directly, so LR is more flexible because we don't have to assume a form for P(x|y)
choose P(x|y) for naive bayes with multinomial features and prediction
means features represent COUNTS, but each word/item is still a TRIAL; ex. text document categorization: xa = j means that in document x, the a-th word appears j times (EACH FEATURE IS A COUNT). So use the model P(x | m, y = c) = m!/(x1! ... xd!) * PROD over a of theta_ac^xa, where m is the length of the sequence (so P(x|y) is MULTINOMIAL instead). theta_ac [the probability of selecting word a] = (# times word a appears in all class-c docs + l) / (# words in all class-c docs combined + l * d), where d is the size of the vocabulary. Prediction uses PROD theta_ac^xa (the multinomial coefficient doesn't depend on c)
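A sketch of the multinomial estimates theta_ac and a prediction on a count vector, done in log space (toy counts; l = 1 here is Laplace smoothing):

import numpy as np

# toy word-count matrix: rows are documents, columns are the d words in the vocabulary
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
y = np.array([0, 0, 1, 1])
l, d = 1, X.shape[1]

# theta[c, a] = (# times word a appears in class-c docs + l) / (# words in class-c docs + l*d)
theta = np.array([(X[y == c].sum(axis=0) + l) / (X[y == c].sum() + l * d) for c in [0, 1]])
log_prior = np.log(np.array([np.mean(y == c) for c in [0, 1]]))

x_test = np.array([2, 0, 1])
# prediction: argmax_c log P(y=c) + SUM_a xa * log(theta[c, a])
print(np.argmax(log_prior + x_test @ np.log(theta).T))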
margin γ of hyperplane
γ = min over all (xi, yi) in D of |xiT w*|, with w* being the w that correctly classifies everything; basically the distance of the CLOSEST POINT to the hyperplane (recall ||w*|| is rescaled to be 1)
discriminative algorithms/learning
model P(y|x) - basically learns boundaries between classes, cannot be used to generate new data
what does naive bayes do
models P(x|y) and makes assumptions about its distribution; the parameters of that distribution are estimated with MLE or MAP
MAP estimate for coin flip (INDEPENDENT COIN FLIPS not binomial, Beta prior)
theta_MAP = (n1 + alpha - 1) / (n + alpha + beta - 2), where alpha, beta are the Beta prior's parameters and n1 = # heads. The alpha - 1 and beta - 1 act like hallucinated heads/tails that incorporate the prior, mattering more for small n. Same form as the MLE, just with HALLUCINATED counts!
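A tiny worked example, assuming a Beta(2, 2) prior and 7 heads in 10 flips:

n, n1 = 10, 7                 # 10 flips, 7 heads
alpha, beta = 2, 2            # assumed Beta prior (centered at 0.5)
theta_mle = n1 / n                                       # 0.7
theta_map = (n1 + alpha - 1) / (n + alpha + beta - 2)    # 8/12 ~ 0.667, pulled toward the prior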
naive bayes decision boundaries are always linear
no
will cube loss work for linear regression
no: cubed loss (wT x - y)^3 is negative whenever wT x is smaller than y, so minimizing it rewards large errors in that direction - a sign issue!
how does adding a constant dimension to w allow data to become separable?
initially no separating hyperplane through the origin may exist, but by adding a constant dimension the data moves to a higher-dimensional space where a separating hyperplane through the origin does exist
is theta in MLE a RV
no, just a parameter: a fixed property of the distribution/experiment
does perceptron alg require data to be iid
no: the # updates <= 1/γ^2 bound needs no iid assumption, it just requires the data to be linearly separable (the data can even be selected by an adversary!)
can we use absolute loss for linear regression
you can, but it's not preferred: squared loss penalizes big mistakes more heavily, and squared loss gives a closed-form solution while absolute loss doesn't
naive bayes and logistic regression decision boundaries
the decision boundary occurs where P(Y=a|x) = P(Y=b|x); for binary logistic regression, that is where P(y|x) = 0.5 (because the two probabilities sum to 1)
naive bayes gaussian P refers to
probability density (PDF)! So P(X = c) is not zero but an actual number: it is not the probability (which would require an integral for a continuous Gaussian) but a FUNCTION VALUE. This is because we really care about the likelihood, not the probability, though we write it with P anyway
for perceptron proof, how to scale all xi
divide every xi by the largest norm in the dataset: this preserves relative spacing and makes ||xi|| <= 1. Scaling is only needed for the proof, not for running the perceptron algorithm
w* in perceptron
such that yi(xiT w*) > 0 for all i (the optimal w)
how does MAP work
take a prior P(theta); define the posterior P(theta | D); by Bayes rule, the posterior is proportional to prior x likelihood; so theta_MAP = argmax P(theta | D) = argmax P(theta) P(D|theta) = argmax ln P(theta) + ln P(D|theta)
what does it mean to estimate P(y|x) is gaussian for linear regression
the label y is drawn from a Gaussian centered at wT x, which is equivalent to saying the errors are N(0, sigma^2)
what happens if model assumption is wrong for MLE
then estimated parameter may not be optimal
P(D | theta) for n independent coin tosses where theta = P(head)
P(D | theta) = theta^n1 (1 - theta)^(n - n1), where n1 = # of heads
purpose of map/mle in classification
to help you estimate P(y|x): you still have to model P(y|x) yourself, but the chosen distribution depends on some parameter theta, and MLE/MAP help you estimate that theta
logistic regression MAP estimate for w (b is absorbed into w)
treat w as a RV, e.g. with prior w ~ N(0, sigma^2 I); we want the posterior P(w | D) = P(w | X, y). Then w_MAP = argmin SUM log(1 + e^(-yi wT xi)) + lambda wT w, with lambda = 1/(2 sigma^2). No closed-form solution; use gradient descent on this negative log posterior
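A sketch of the gradient of that regularized objective (same {-1, +1} label convention as the MLE sketch above; lam and the function name are illustrative):

import numpy as np

def map_grad(w, X, y, lam):
    """Gradient of SUM log(1 + exp(-yi w^T xi)) + lam * w^T w, with rows of X as points."""
    p = 1.0 / (1.0 + np.exp(-y * (X @ w)))      # P(yi | xi, w)
    return -(X.T @ ((1.0 - p) * y)) + 2.0 * lam * w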
the NB assumption holds when features are independent given the label, true or false
true
true or false: MLE works great for large n but bad for small n
true: if your data is just 2 flips that both came up heads, MLE gives a 100% head rate, even though the data is far too small
smoothing parameters naive bayes
used in P(x|y) to prevent the zero-probability issue: add l to the numerator and l * K to the denominator (K = # of categories / vocabulary size); l > 0 corresponds to MAP, l = 0 to MLE, and l = 1 is Laplace smoothing
solution to linear regression when XXT is full rank using normal equation
w = (XXT)^-1 XY so Y^ = XT w = XT (XXT)^-1 XY
how to tell what w* should be if given wt, xt, and yt
well, if wtT xt is classified wrong, then we know the angle between xt and w* must be such that sign(xtT w*) matches yt
when does MAP work better than MLE
when n is small and the prior is accurate: MLE can overfit when n is small, but if the prior belief is wrong, then MAP is terrible!
when do NB and LR produce asymptotically the same model
when the naive Bayes assumption holds for an exponential-family NB, because then exponential-family NB is a linear classifier, and so is LR. They produce the same model even non-asymptotically if NB is Gaussian AND the per-feature variance is the same across classes
performance of LR vs naive bayes
with little data, NB outperforms if the modeling assumption is right (LR needs more data to avoid overfitting); LR outperforms with more data because NB usually doesn't choose P(x|y) perfectly; if NB chooses the right P(x|y), then LR and NB converge to the same result in the limit (but NB gets there faster)
cos(theta) in perceptron is equal to
cos(theta) = wtT w* / ||wt||_2 (since ||w*|| = 1), where wt is the current weight vector