Perceptron + MLE/MAP + Naive Bayes + Logistic/Linear Regression


assumptions required for theorem that perceptron algorithm makes at most 1/γ^2 mistakes

1. all inputs xi live in the unit sphere, ||xi|| <= 1; 2. there exists a separating hyperplane w* with ||w*|| = 1 (the optimal w); 3. γ is the distance from the hyperplane to the closest point

derivation of normal equation for linear regression

argmin over w of SUM (wT xi - yi)^2 = argmin ||XT w - Y||^2 = argmin (XT w - Y)T (XT w - Y). This is finding the prediction Y^ = XT w that lies in the span of the columns of XT and is closest to Y, i.e. the orthogonal projection of Y onto that span. The gradient is 2(XXT w - XY); setting it to 0 gives the normal equation XXT w = XY.
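
A minimal numpy sketch of solving the normal equation under the convention above (X is d x n, one example per column); the toy data and numbers are made up for illustration.

```python
import numpy as np

# Toy data (made up): d = 3 features, n = 50 examples; columns of X are examples.
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))
w_true = np.array([1.0, -2.0, 0.5])
Y = X.T @ w_true + 0.1 * rng.normal(size=n)   # noisy labels

# Normal equation X X^T w = X Y: solve the linear system rather than forming the inverse.
w_hat = np.linalg.solve(X @ X.T, X @ Y)
print(w_hat)   # close to w_true
```

Using np.linalg.solve avoids explicitly inverting XXT, which is both cheaper and numerically safer.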

ridge linear regression

argmin over w of lambda ||w||^2 + SUM (wT xi - yi)^2. The solution is w^ = (XXT + lambda I)^-1 XY. This works even if XXT is not full rank, because for lambda > 0, XXT + lambda I is always positive definite. Ridge is the MAP solution when the errors are assumed Gaussian and w is given a Gaussian prior.
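
The same sketch with the ridge closed form; lambda and the data are made-up illustration values.

```python
import numpy as np

# Ridge closed form w = (X X^T + lambda I)^-1 X Y, same d x n convention as above.
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))
Y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 0.1   # any lambda > 0 makes X X^T + lambda*I positive definite, hence invertible
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y)
print(w_ridge)
```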

in coin flip, how does map and mle estimate relate

as n -> inf, theta_MAP approaches theta_MLE, since the prior parameters alpha and beta become irrelevant for large n

how does MAP treat parameters

as random variables, so they have some probability distribution (e.g. a Beta prior)

how many mistakes does perceptron algorithm make (aka RUNTIME)

at most 1/γ^2 when the following hold: all inputs xi live in the unit sphere (||xi|| <= 1); there exists a separating hyperplane w* with ||w*|| = 1; γ is the distance from the hyperplane to the closest point. So the bound applies AFTER the data is normalized.

bias term and necessity

b is necessary because otherwise the hyperplane defined by w would always have to pass through the origin

why is NB faster than LR

because LR has no closed form and is fit with gradient descent, while NB training is just counting and prediction is just multiplying the estimated probabilities

why does naive bayes seek to estimate P(Y|x)

because then you can use the Bayes optimal classifier h(x_test) = argmax over y of P(y|x_test)

does y(xT w*) = |xT w*|?

yes, because w* is a perfect classifier, so all training points are classified correctly and therefore y = sign(xT w*), which makes y(xT w*) = |xT w*|

assumptions of perceptron algorithm

binary classification; the data is linearly separable

why is normal equation for linear regression often not used

computing the inverse is too computationally expensive for large d, so gradient descent etc. is often used instead

is naive bayes assumption conditional independence or true independence?

conditional independence! The label absolutely affects the feature values: under complete independence, the feature "color = orange" would be independent of "color = yellow" regardless of the label (banana vs. orange), which is clearly not true. The features are only assumed independent once the label is known.

proof that perceptron algorithm makes at most 1/γ^2 mistakes

consider the effect of one update on wT w*: (w + yx)T w* >= wT w* + γ, so wT w* grows by at least γ per update. Meanwhile (w + yx)T (w + yx) = wT w + 2y(wT x) + y^2 (xT x) <= wT w + 1, because 2y(wT x) <= 0 (w was wrong on x) and 0 <= y^2 (xT x) <= 1 (y^2 = 1 and all xT x <= 1), so wT w grows by at most 1 per update. Since wT w* = 0 and wT w = 0 at the start (w = 0 initially), after M updates: 1. wT w* >= Mγ and 2. wT w <= M. By Cauchy-Schwarz, Mγ <= wT w* <= ||w|| ||w*|| = ||w|| <= sqrt(M), so the number of updates M <= 1/γ^2.

how should you compute maximum

take the derivative and set it to 0, AND THEN check the second derivative to ensure the critical point truly is a maximum

positive examples from perceptron are such that

dot(w, x) + b > 0

negative examples from perceptron are such that

dot(x,w) + b < 0

choose P(xa|y) and P(x|y) for naive bayes with categorical features

each feature a falls into one of Ka categories (e.g. single/married), so we can use MLE or MAP. The parameter estimate is P(xa = j | y = c) = (# samples with label c whose feature a has value j + l) / (# samples with label c + l * Ka); set l = 0 to get the MLE estimator, l > 0 for MAP-style smoothing. The generative model assumes data is generated by first choosing the label, then tossing a (biased) die for each feature to get its value. Then P(x | y = c) = PROD over a of theta_jac, where theta_jac = P(xa = j | y = c).
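
A minimal numpy sketch of this counting-based estimator on tiny made-up data; the function name categorical_nb_fit and the numbers are illustrative assumptions, not from the source.

```python
import numpy as np

def categorical_nb_fit(X, y, n_categories, l=1):
    """Estimate theta[c, a, j] = P(x_a = j | y = c) with +l smoothing (l = 0 gives the MLE)."""
    classes = np.unique(y)
    n, d = X.shape
    theta = np.zeros((len(classes), d, n_categories))
    prior = np.zeros(len(classes))
    for ci, c in enumerate(classes):
        Xc = X[y == c]
        prior[ci] = len(Xc) / n
        for a in range(d):
            counts = np.bincount(Xc[:, a], minlength=n_categories)
            theta[ci, a] = (counts + l) / (len(Xc) + l * n_categories)
    return classes, prior, theta

# Tiny made-up example: 2 categorical features, each with 3 possible values.
X = np.array([[0, 1], [0, 2], [1, 1], [2, 0], [2, 2], [1, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
classes, prior, theta = categorical_nb_fit(X, y, n_categories=3)
print(theta[0, 0])   # P(x_0 = j | y = 0) for j = 0, 1, 2
```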

generative learning/algorithms

estimate P(x, y), often by estimating P(y) and P(x|y) (probabilities). Generative models do NOT draw boundaries between classes; they model how the data is placed in space.

what is parameter estimation for n independent coin flips where theta = P(head), MLE vs MAP

the MLE estimate is n1/n, where n1 = # of heads. With smoothing, the MAP-style estimate is (n1 + m) / (n + 2m), since smoothing adds m to the numerator and 2m to the denominator (this assumes a 50% prior). Both come from argmax over theta of P(D|theta) = argmax of the log-likelihood, written for the specific observed sequence of coin flips, not the binomial distribution.

how to use mle in binary classification

estimate P(y|x): assume P(y|x; theta) has some distribution involving theta. MLE gives argmax over theta of P(D | theta) = argmax PROD P(xi, yi | theta) = argmax SUM ln P(yi | xi; theta). Note P(xi, yi | theta) is replaced by P(yi | xi; theta) because we do not care about the P(xi) term (it does not depend on theta).

binary classification with MAP

estimate P(y|x): assume P(y|x; theta) has some distribution involving theta. Then argmax over theta of P(theta | D) = argmax P(theta) PROD P(xi, yi | theta) = argmax ln P(theta) + SUM ln P(yi | xi; theta).

when the Naive bayes assumption doesn't hold, the algorithm cannot be effective

false

naive bayes assumption

feature values are INDEPENDENT GIVEN THE LABEL (formally): P(x|y) = PROD over a of P(xa | y), where xa = [x]a is the value of feature a. For spam filtering this implies all words in an email are conditionally independent given the label.

goal of linear regression

find h(x) = wT x that fits the data well, where the bias b = w0 is appended to w and a constant 1 is appended to x

MLE principle

find the theta that maximizes the likelihood of the data: theta_MLE = argmax P(D|theta) = argmax likelihood = argmax log-likelihood. Compute it by taking the (partial) derivative with respect to theta and setting it to 0.

is naive bayes generative or discriminative

generative because it models P(x|y)

how is MAP a Bayesian perspective

the ground-truth parameter is a random variable; search for the parameter that maximizes the posterior

how is MLE a frequentist perspective

the ground-truth parameter is unknown but fixed; search for the parameter that makes the data as likely as possible

if yi is -1, 1 and features are multinomial then naive bayes returns

h(x) = argmax over y of P(y) PROD P(xa | y), which equals sign(wT x + b)

naive bayes Bayes classifier definition

h(x) = argmax over y of P(y|x) = argmax P(x|y)P(y) (P(x) doesn't depend on y, so it can be dropped from Bayes rule) = argmax PROD P(xa | y) * P(y) = argmax SUM log P(xa | y) + log P(y)

h function for perceptron

h(xi) = sign(wT xi + b), or more commonly sign(wT xi) after appending a 1 to the end of xi and b to the end of w

naive bayes laplace smoothing

handles the zero-probability problem: P(x has item a | y) may be 0 if that word never appears in training. So add l to the numerator (l = 1 is Laplace smoothing) and l * d to the denominator (d being the total # of words/categories).

naive bayes as linear classifier derivation case

happens when yi is in {-1, +1} and the features are multinomial (the labels do not literally have to be -1, +1 for this to hold). Then h(x) = argmax over y of P(y) PROD P(xa | y) = sign(wT x + b), where [w]a = log(theta_a+) - log(theta_a-) and b = log P(y = +1) - log P(y = -1); recall that under the multinomial model P(x | y = +1) is proportional to PROD theta_a+ ^ xa.
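
A small numpy check (with made-up multinomial parameters) that the argmax form and the sign(wT x + b) form agree; all numbers here are illustrative assumptions.

```python
import numpy as np

# NB decision rule  argmax_y P(y) * prod_a theta_{a,y}^{x_a}  vs.  sign(w^T x + b),
# with w_a = log(theta_a+) - log(theta_a-) and b = log P(y=+1) - log P(y=-1).
rng = np.random.default_rng(1)
d = 5
theta_pos = rng.dirichlet(np.ones(d))   # P(word a | y = +1), made up
theta_neg = rng.dirichlet(np.ones(d))   # P(word a | y = -1), made up
p_pos, p_neg = 0.4, 0.6

w = np.log(theta_pos) - np.log(theta_neg)
b = np.log(p_pos) - np.log(p_neg)

x = rng.integers(0, 4, size=d)          # word counts for one "document"
score_pos = np.log(p_pos) + x @ np.log(theta_pos)
score_neg = np.log(p_neg) + x @ np.log(theta_neg)
nb_label = 1 if score_pos > score_neg else -1
linear_label = 1 if w @ x + b > 0 else -1
print(nb_label == linear_label)         # True: same decision
```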

relationship between weight vector w and hyperplane

hyperplane dot(x,w) + b = 0 is perpendicular to w

guaranteed convergence of perceptron

if dataset is linearly separable, perceptron will find separating hyperplane in finite # of updates (otherwise loops forever)

how is MLE consistent

if our model assumption is correct (i.e. that the coin flips really are Bernoulli), then the parameter estimate converges to the optimal value as n -> inf (as long as the samples are i.i.d.); in the limit it is an unbiased estimator

geometric intuition of perceptron

if a point is incorrectly classified, you add its label times x to w (the vector perpendicular to the hyperplane); this moves the corresponding hyperplane the right way. E.g. if the label should be -1, you want to subtract x from w.

perceptron algorithm

initialize w = 0 (misclassifies everything); while TRUE do: m = 0 (# misclassifications); for (xi, yi) in D: if yi(wT xi) <= 0 then w = w + yi*xi and m++; end for; if m = 0 then break; end while
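
A minimal Python/numpy version of this loop, assuming y in {-1, +1} and a constant 1 feature already appended for the bias; the toy data and the max_epochs guard are illustrative assumptions.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron as described above: X has one row per example, y in {-1, +1}.
    Loops until an epoch with no mistakes (guarded in case the data is not separable)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or exactly on the hyperplane)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Tiny linearly separable toy set (made up), with a constant 1 appended for the bias.
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # matches y
```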

input, hypothesis, hypothesis class and loss func for linear regression

input: dataset of (xi, yi) with xi in R^d and yi in R. Hypothesis: h(x) = wT x. Hypothesis class: all linear functions. Loss: squared loss (wT xi - yi)^2, or equivalently the average 1/n SUM (wT xi - yi)^2.

how to imagine naive bayes as solving many problems

it is solving multiple 1D probability estimation problems

logistic regression MLE estimate for w (b is absorbed into w)

maximize the conditional likelihood (since this is discriminative learning, our "data" is y): max P(y | X, w) = PROD P(yi | xi, w). Taking the log, w_MLE = argmin [sometimes scaled by 1/n] SUM log(1 + e^(-yi wT xi)). There is no closed form, so use gradient descent on this negative log-likelihood.
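
A minimal numpy sketch of gradient descent on this negative log-likelihood, on made-up data; the function names, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, lr=0.5, steps=2000):
    """Gradient descent on sum_i log(1 + exp(-y_i w^T x_i)),
    with y in {-1, +1} and the bias absorbed into w."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X.T @ (y * sigmoid(-margins))) / len(y)   # averaged gradient of the NLL
        w -= lr * grad
    return w

# Made-up toy data; the constant 1 column plays the role of the bias b.
rng = np.random.default_rng(0)
X = np.c_[rng.normal(size=(200, 2)), np.ones(200)]
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.2 > 0, 1, -1)
w = logistic_mle(X, y)
print(np.mean(np.sign(X @ w) == y))   # training accuracy, should be near 1.0
```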

MAP stands for

maximum a posteriori probability

MLE stands for

maximum likelihood estimation

consider the effect of an update on w, what is true of y(xTw) and y(xTw*)

y(xT w) <= 0 because x is misclassified by w (otherwise we wouldn't update); y(xT w*) > 0 because w* classifies everything correctly

predicting y^ given xtest using normal equation

y^ = xtestT (XXT)^-1 XY = SUM over i of (xtestT (XXT)^-1 xi) * yi; the term in parentheses is a scalar, acting like a similarity score between xtest and xi
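
A quick numpy check, on made-up data, that the two forms of the prediction agree.

```python
import numpy as np

# x_test^T (X X^T)^-1 X Y  vs.  sum_i ( x_test^T (X X^T)^-1 x_i ) * y_i, X being d x n.
rng = np.random.default_rng(0)
d, n = 3, 40
X = rng.normal(size=(d, n))
Y = X.T @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=n)
x_test = rng.normal(size=d)

M = np.linalg.inv(X @ X.T)
direct = x_test @ M @ X @ Y
weighted = sum((x_test @ M @ X[:, i]) * Y[i] for i in range(n))
print(np.isclose(direct, weighted))   # True
```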

can linear regression be derived with MAP/MLE, if so how

yes: assume some distribution for P(y|x; w) (this is the likelihood), since P(D|w) = PROD P(yi | xi; w); estimate w from it and use that w in the linear model

even with the naive bayes assumption we still have to determine how to model p(x|y)

yes need to choose P(xa | y)

can perceptron alg be applied to infinite dimension space

yes as long as margin exists so no curse of dimensionality

does naive bayes reduce the amount of data

yes: instead of needing a huge amount of data to get accurate estimates of P(X|Y) (it is very hard to see instances of every combination of X), you only need enough data to estimate each P(Xa|Y). That is much easier, because a specific single feature value (rather than an entire feature vector) occurs far more often.

can naive bayes be used with MLE and MAP

yes! parameters can be fit in many different ways

for all x, how does y(xTw*) = |xT w star| relate to γ

|xT w*| >= γ by definition of the margin, and because w* is a perfect classifier all training points are correct, so y = sign(xT w*) and therefore y(xT w*) = |xT w*| >= γ

normal equation linear regression

XXT w = XY

how are x and w modified at initial part of perceptron algorithm

xi becomes [xi 1] and w becomes [w b]; this means h(xi) = sign(wT xi)

how to tell if xi is classified correctly for perceptron

xi is on the correct side of the hyperplane iff yi(wT xi) > 0 (the signs match); this requires yi in {-1, +1}

use logistic regression to make a prediction

predict y = +1 if P(y = +1 | xtest, w) > 0.5, otherwise y = -1

in the binomial case with a beta distribution prior, MAP essentially hallucinates samples

True. The hallucinated samples mean that with less data the estimate stays closer to the prior, while with more data the data dominates.

how does naive bayes estimate P(y|x)

P(y|x) = (P(x|y)P(y))/P(x) by Bayes rule. Estimate P(y = c) by SUM I(yi = c) / n, and then assume P(x|y) = PROD P(xa | y), where xa is a single feature value.

steps to MAP

1. define the prior P(theta) 2. define the likelihood P(D|theta) 3. compute the posterior, which is proportional to prior x likelihood 4. compute the maximum!

logistic function shape

1/(1 + e^-x): as x -> -inf, y -> 0; as x -> +inf, y -> 1; y = 0.5 at x = 0. 1/(1 + e^x): as x -> +inf, y -> 0; as x -> -inf, y -> 1; y = 0.5 at x = 0.

ordinary least squares (OLS)

A method for estimating the parameters of a (multiple) linear regression model by minimizing the sum of squared residuals; it assumes the errors are normally distributed. OLS is min over w of 1/n SUM (xiT w - yi)^2 (aka squared loss), optimized either with gradient descent etc. or via the closed form w = (XXT)^-1 XY.

Given x, how does naive Bayes know which class to predict

For each class y = c, compute P(y = c | x) via naive Bayes using just that x, and predict the argmax. Do not add additional features; even if x does not cover the full feature space, naive Bayes can still be applied.

how does MAP differ from MLE in terms of the estimated parameter

MAP often adds a regularization term that represents the prior and also prevents w from getting too big

what is MLE estimate for P(y|x) and why does this not work (hence a need for naive bayes)?

MLE estimate: P(y|x) = P(y, x)/P(x), with P(x) estimated by SUM I[xi = x] and P(y, x) by SUM I[xi = x AND yi = y] (each divided by n). This fails in high-dimensional spaces or for continuous x, because there are few (or no) training vectors with exactly the same features as x, so both the numerator and denominator go to 0: for continuous x, P(X = x) = 0, and if x never appears in training, the denominator is 0 too.

logistic regression with SGD as optimizer

use the MLE objective (negative log-likelihood) as the loss. Initialize w = 0; while not converged: randomly sample one training pair (x, y), compute the gradient of the MLE objective on that pair, and apply the SGD update w^(t+1) = w^t + a (1 - P(y | x, w^t)) y x.
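
A minimal numpy sketch of this SGD loop on made-up data; the learning rate, step count, and function name are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, lr=0.1, steps=5000, seed=0):
    """SGD on the logistic loss: sample one (x, y), then
    w <- w + lr * (1 - P(y | x, w)) * y * x, where P(y | x, w) = sigmoid(y w^T x)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))
        p = sigmoid(y[i] * (X[i] @ w))     # P(y_i | x_i, w)
        w += lr * (1.0 - p) * y[i] * X[i]  # the SGD update from the card above
    return w

# Made-up toy data; the constant 1 column plays the role of the bias.
rng = np.random.default_rng(1)
X = np.c_[rng.normal(size=(200, 2)), np.ones(200)]
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)
w = logistic_sgd(X, y)
print(np.mean(np.sign(X @ w) == y))
```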

ordinary least squares vs ridge regression

OLS has no regularization: w = (XXT)^-1 XY. Ridge has L2 regularization, with closed form w = (XXT + lambda I)^-1 XY. OLS results from MLE; ridge results from MAP.

choose P(x|y) for naive bayes with continuous features

P(xa | y = c) is a Gaussian; estimate its parameters with MLE or MAP using only the samples of the correct class. With nc = # samples with label c: mu_ac = (1/nc) SUM xia over samples with label c, and sigma^2_ac = (1/nc) SUM (xia - mu_ac)^2 over the same samples. The full P(x | y = c) is then also Gaussian, with mean vector mu_c and diagonal covariance Sigma_c with the sigma^2_ac on the diagonal.
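
A minimal numpy sketch of these per-class, per-feature estimates and the resulting classifier, on made-up data; function names are illustrative.

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """MLE of per-class, per-feature Gaussian parameters:
    mu[c, a] and var[c, a] over the samples with label c."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([X[y == c].var(axis=0) for c in classes])
    prior = np.array([np.mean(y == c) for c in classes])
    return classes, prior, mu, var

def gaussian_nb_predict(x, classes, prior, mu, var):
    # argmax over c of  log P(y=c) + sum_a log N(x_a; mu_ca, var_ca)
    log_post = np.log(prior) - 0.5 * np.sum(
        np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)
    return classes[np.argmax(log_post)]

# Made-up 2D data: class 0 centered at (0, 0), class 1 centered at (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = gaussian_nb_fit(X, y)
print(gaussian_nb_predict(np.array([2.5, 3.2]), *params))   # likely class 1
```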

for continuous features, how does (gaussian) naive bayes relate to logistic regression

P(y|x) becomes EXACTLY logistic regression when the per-feature variances are the same across classes: P(y|x) = 1/(1 + e^(-y(wT xi + b))) for y in {+1, -1}. So GNB and LR produce the same model if the per-feature variance is shared across classes; more generally, if the NB assumption holds, any exponential-family NB and LR will asymptotically produce the same model.

logistic regression equation

P(y|xi) = 1/(1 + e^(-y(wT xi + b))). Estimate w and b directly with MLE or MAP by maximizing the conditional likelihood PROD P(yi | xi; w, b) (since this is discriminative learning, our "data" is y).

logistic regression vs naive bayes

THE TWO MODELS PRODUCE THE SAME RESULT AS n -> inf IF the NB assumption holds, so logistic regression is the discriminative counterpart (same results, just different modeling). LR does not model P(x|y); it models P(y|x) directly, so LR is more flexible because it makes no assumption about P(x|y).

choose P(x|y) for naive bayes with multinomial features and prediction

means the features represent COUNTS, but each word/item is still one trial, e.g. text document categorization: xa = j means the a-th vocabulary word appears j times in document x (you do not need to see the a-th word exactly j times in training). Use the model P(x | m, y = c) = m!/(x1! ... xd!) * PROD over a of theta_ac^xa, where m is the length of the document, so P(x|y) is MULTINOMIAL. The parameter (probability of selecting word a in class c) is theta_ac = (# times word a appears in all class-c docs + l) / (# words in all class-c docs combined + l * d), where d is the size of the vocabulary. For prediction, compare PROD theta_ac^xa (times the class prior) across classes.
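
A minimal numpy sketch of multinomial NB with this smoothing, on a tiny made-up bag-of-words example; the class names, counts, and function names are illustrative assumptions.

```python
import numpy as np

def multinomial_nb_fit(X, y, l=1):
    """X[i, a] = count of word a in document i.
    theta[c, a] = (# times word a appears in class-c docs + l) /
                  (total words in class-c docs + l * d)."""
    classes = np.unique(y)
    d = X.shape[1]
    theta = np.array([(X[y == c].sum(axis=0) + l) /
                      (X[y == c].sum() + l * d) for c in classes])
    prior = np.array([np.mean(y == c) for c in classes])
    return classes, prior, theta

def multinomial_nb_predict(x, classes, prior, theta):
    # The multinomial coefficient m!/(x_1!...x_d!) is the same for every class,
    # so argmax of log P(y=c) + sum_a x_a * log theta[c, a] is enough.
    return classes[np.argmax(np.log(prior) + x @ np.log(theta).T)]

# Tiny made-up word counts: 4 docs, 3 words in the vocabulary.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
y = np.array(["sports", "sports", "politics", "politics"])
params = multinomial_nb_fit(X, y)
print(multinomial_nb_predict(np.array([2, 0, 1]), *params))   # likely "sports"
```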

margin γ of hyperplane

min over all (xi, yi) in D of |xiT w*|, with w* being the w that correctly classifies everything and ||w*|| rescaled to 1; i.e. the distance from the hyperplane to its CLOSEST POINT

discriminative algorithms/learning

model P(y|x) - basically learns boundaries between classes, cannot be used to generate new data

what does naive bayes do

models P(x|y) and makes assumptions on its distribution parameters of distribution are estimated with MLE or MAP

MAP estimate for coin fip (INDEPENDENT COIN FLIPS not binomial, Beta prior)

(n1 + alpha - 1) / (n + alpha + beta - 2), where alpha, beta are the Beta prior parameters and n1 = # heads. The prior acts like hallucinated heads/tails that matter more when n is small; it has the same form as MLE, just with the hallucinated counts added.
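
A quick numeric check of the MLE vs MAP formulas, with made-up counts and an assumed Beta(5, 5) prior.

```python
# MLE vs MAP (Beta(alpha, beta) prior) for theta = P(heads) from n independent flips.
n1, n = 3, 4                 # made-up data: 3 heads out of 4 flips
alpha, beta = 5, 5           # prior that "hallucinates" 4 extra heads and 4 extra tails

theta_mle = n1 / n
theta_map = (n1 + alpha - 1) / (n + alpha + beta - 2)
print(theta_mle, theta_map)  # 0.75 vs ~0.58: MAP is pulled toward the 0.5 prior

# With much more data the prior washes out and MAP -> MLE.
n1, n = 3000, 4000
print(n1 / n, (n1 + alpha - 1) / (n + alpha + beta - 2))
```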

naive bayes decision boundaries are always linear

no

will cube loss work for linear regression

no: cubed loss can be negative when wT x is smaller than y, so minimizing it rewards large errors of one sign (a sign issue)

how does adding a constant dimension to w allow data to become separable?

initially there may be no separating hyperplane through the origin; by adding a constant dimension the data lives in a higher-dimensional space, where a separating hyperplane through the origin can exist

is theta in MLE a RV

no, it is just a parameter: a fixed property of the distribution/experiment

does perceptron alg require data to be iid

no, there is no i.i.d. assumption: the bound of at most 1/γ^2 updates only requires the data to be linearly separable (the data can even be chosen by an adversary)

can we use absolute loss for linear regression

it is not preferred: squared loss penalizes big mistakes more, and squared loss also gives a closed-form solution while absolute loss doesn't

naive bayes and logistic regression decision boundaries

the boundary occurs where P(Y = a | x) = P(Y = b | x); for (binary) logistic regression, specifically where P(y|x) = 0.5, since the two probabilities sum to 1

naive bayes gaussian P refers to

probability density (PDF)! So P(X = c) is not zero but an actual number, because it is not a probability (which would require an integral for a continuous Gaussian) but a function value. We really care about the likelihood rather than the probability, but we write it with P anyway.

for perceptron proof, how to scale all xi

scale all xi by the largest norm: this preserves relative spacing but ensures ||xi|| <= 1. The scaling is only needed for the proof, not for running the perceptron algorithm.

w* in perceptron

such that yi(xiT w*) > 0 for all i (the optimal w)

how does MAP work

take a prior P(theta) and define the posterior P(theta | D); by Bayes rule the posterior is proportional to prior x likelihood, so theta_MAP = argmax P(theta | D) = argmax P(theta) P(D|theta) = argmax ln P(theta) + ln P(D|theta)

what does it mean to estimate P(y|x) is gaussian for linear regression

the label y is drawn from a Gaussian centered at wT x; equivalently, the errors are N(0, sigma^2)

what happens if model assumption is wrong for MLE

then estimated parameter may not be optimal

P(D | theta) for n independent coin tosses where theta = P(head)

theta^n1 * (1 - theta)^(n - n1), where n1 = # of heads

purpose of map/mle in classification

to help you estimate P(y|x): you still have to model P(y|x) yourself, but that distribution depends on some parameter theta, and MLE/MAP is how you estimate that theta

logistic regression MAP estimate for w (b is absorbed into w)

treat w as a random variable, e.g. with prior w ~ N(0, sigma^2 I). We want the posterior P(w | D) = P(w | X, y); maximizing it gives w_MAP = argmin SUM log(1 + e^(-yi wT xi)) + lambda wT w, with lambda = 1/(2 sigma^2). There is no closed-form solution, so use gradient descent on this negative log-posterior.
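
A minimal numpy sketch of gradient descent on this regularized objective; it mirrors the MLE sketch earlier with an extra 2*lambda*w term in the gradient. Data and hyperparameters are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_map(X, y, lam=0.1, lr=0.5, steps=2000):
    """Gradient descent on the negative log-posterior
    sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w  (Gaussian prior on w)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X.T @ (y * sigmoid(-margins))) + 2.0 * lam * w
        w -= lr * grad / len(y)
    return w

# Made-up setup as before; larger lam shrinks w toward the prior mean 0.
rng = np.random.default_rng(0)
X = np.c_[rng.normal(size=(200, 2)), np.ones(200)]
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
print(np.linalg.norm(logistic_map(X, y, lam=0.0)),
      np.linalg.norm(logistic_map(X, y, lam=10.0)))   # larger lam -> smaller ||w||
```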

the NB assumption holds when features are independent given the label, true or false

true

true or false: MLE works great for large n but bad for small n

true: if your data is just 2 flips that both came up heads, MLE gives a 100% heads rate, even though the data is tiny

smoothing parameters naive bayes

used in P(x|y) to prevent the zero-probability issue: add l to the numerator and l * (# of possible feature values) to the denominator. l > 0 corresponds to MAP, l = 0 to MLE, and l = 1 is Laplace smoothing.

solution to linear regression when XXT is full rank using normal equation

w = (XXT)^-1 XY, so Y^ = XT w = XT (XXT)^-1 XY

how to tell what w* should be if given wt, xt, and yt

well, if wt misclassifies xt, then we know w* must form an angle with xt such that the sign of xtT w* matches yt

when does MAP work better than MLE

when n is small and the prior is accurate: MLE can overfit when n is small, but if the prior belief is wrong, then MAP is terrible

when do NB and LR produce asymptomatically the same model

when the naive Bayes assumption holds for an exponential-family model, because then exponential-family NB is a linear classifier and so is LR. They produce the same model even non-asymptotically if NB is Gaussian AND the per-feature variance is the same across classes.

performance of LR vs naive bayes

with little data, NB outperforms if the modeling assumption is right; LR needs more data to avoid overfitting. LR outperforms with more data, because NB usually doesn't choose P(x|y) perfectly. If NB chooses the right P(x|y), then LR and NB converge to the same result in the limit (but NB gets there faster).

cos(theta) in perceptron is equal to

wtT w* divided by ||wt||_2 (since ||w*|| = 1), where wt is the current weight vector

