Naive Bayes
MLE for multinomials
-Let X in {1, ..., k} be a discrete random variable with k values, where P(X = j) = theta_j
-then P(X) is a multinomial distribution, and I(X = j) is the indicator function for the event X = j
-given data x(1), ..., x(n), the likelihood is L(theta) = prod_i prod_j theta_j^I(x(i) = j) = prod_j theta_j^N_j, where N_j is the number of examples with X = j
-the maximum likelihood estimate for each parameter is theta_hat_j = N_j / n
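A minimal sketch of these estimates in Python (the toy data and variable names are illustrative, not from the notes):

import numpy as np

# Toy sample of a discrete variable X taking values in {0, ..., k-1}
x = np.array([0, 2, 1, 0, 2, 2, 1, 0])
k = 3

# N_j = number of examples with X = j
counts = np.bincount(x, minlength=k)

# MLE for a multinomial: theta_hat_j = N_j / n
theta_hat = counts / counts.sum()
print(theta_hat)  # [0.375 0.25  0.375]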
Naive Bayes classifiers
-instead of learning a function f that assigns labels, learn a conditional probability distribution over the output of f
-P(f(x) | x) = P(f(x) = y | x1, x2, ..., xp)
-the probabilities can then be used for classification as well as other tasks, such as ranking
Maximum likelihood estimation
-most widely used method of parameter estimation
-"learn" the best parameters by finding the values of theta that maximize the likelihood
-often easier to work with the log-likelihood (see the sketch below)
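As an illustration (a coin-flip example added here, not from the notes), maximizing the log-likelihood over a grid of theta values picks out the same parameter as maximizing the likelihood itself:

import numpy as np

# 10 Bernoulli observations (7 heads, 3 tails); theta = P(heads)
data = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
thetas = np.linspace(0.01, 0.99, 99)

# log L(theta) = sum_i log p(x_i | theta)
log_lik = np.array([np.sum(np.log(np.where(data == 1, t, 1 - t))) for t in thetas])

# argmax of the log-likelihood is also the argmax of the likelihood
print(thetas[np.argmax(log_lik)])  # ~0.7, matching the count-based MLE 7/10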
Numerical Stability
-multiplying many probabilities can cause numerical problems
-we need underflow prevention
-better to sum the logs of the probabilities than to multiply the probabilities themselves
-the class with the highest final un-normalized log-probability score is still the most probable (see the sketch below)
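A small sketch of why this matters (the numbers are made up for illustration):

import numpy as np

# 1,000 attribute likelihoods per class, all small: their product underflows to 0.0
probs_class_a = np.full(1000, 1e-4)
probs_class_b = np.full(1000, 2e-4)
print(np.prod(probs_class_a), np.prod(probs_class_b))  # 0.0 0.0, so the classes cannot be compared

# Summing logs avoids underflow and preserves the ranking of the classes
print(np.sum(np.log(probs_class_a)), np.sum(np.log(probs_class_b)))
# class b has the higher (less negative) log score, so it is the more probable class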
NBC learning model space
-parametric model with a specific form
-models vary based on the parameter estimates in the conditional probability distributions (CPDs)
Naive Bayes classifier
-simplifying (naive) assumption: the attributes are conditionally independent given the class (see the code sketch below)
-Strengths: easy to implement, often performs well even when the assumption is violated, can be learned incrementally
-Weaknesses: the conditional independence assumption produces skewed probability estimates, and dependencies among attributes cannot be modeled
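A minimal Naive Bayes classifier for discrete attributes, assuming attributes and labels are encoded as small integers (this is a sketch of the standard algorithm, not code from the course):

import numpy as np

class DiscreteNaiveBayes:
    """Naive Bayes for discrete attributes, with Laplace smoothing."""

    def fit(self, X, y, n_values, n_classes):
        n, p = X.shape
        # Prior: P(y = c), smoothed
        self.log_prior = np.log((np.bincount(y, minlength=n_classes) + 1.0) / (n + n_classes))
        # CPDs: P(x_j = v | y = c) for each attribute j, smoothed (Laplace correction)
        self.log_cpd = np.zeros((p, n_classes, n_values))
        for j in range(p):
            for c in range(n_classes):
                counts = np.bincount(X[y == c, j], minlength=n_values)
                self.log_cpd[j, c] = np.log((counts + 1.0) / (counts.sum() + n_values))
        return self

    def predict(self, X):
        # Conditional independence: log-score(y | x) = log P(y) + sum_j log P(x_j | y)
        scores = np.array([
            self.log_prior + sum(self.log_cpd[j, :, x[j]] for j in range(X.shape[1]))
            for x in X
        ])
        return scores.argmax(axis=1)

# Tiny illustrative data set: two binary attributes, two classes
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
model = DiscreteNaiveBayes().fit(X, y, n_values=2, n_classes=2)
print(model.predict(np.array([[0, 1], [1, 0]])))  # [0 1]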
Bayes rule for probabilistic classifier
-the learner considers a set of candidate labels and attempts to find the most probable one y in Y given the observed data
-such a maximally probable assignment is called the maximum a posteriori (MAP) assignment; Bayes theorem is used to compute it (spelled out below)
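In symbols (the standard MAP rule, written out here since the slide equation is not reproduced in the notes):

y_{MAP} = \arg\max_{y \in Y} P(y \mid x)
        = \arg\max_{y \in Y} \frac{P(x \mid y)\, P(y)}{P(x)}
        = \arg\max_{y \in Y} P(x \mid y)\, P(y)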
Score Function: Likelihood
-Let D = {x(1), ..., x(n)}
-assume the data D are independently sampled from the same distribution p(X | theta)
-the likelihood function represents the probability of the data as a function of the model parameters (written out below)
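As a formula (the standard i.i.d. likelihood definition, added for completeness):

L(\theta \mid D) = p(D \mid \theta) = \prod_{i=1}^{n} p(x^{(i)} \mid \theta)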
Likelihood
-Likelihood is not a probability distribution
-it gives the relative probability of the data given a parameter value
-the numerical value of L is not meaningful on its own; only the ratio of two likelihood scores is
NBC learning search algorithm
MLE optimization of the parameters (a convex optimization problem, so it yields an exact solution)
Laplace correction
-numerator: add 1
-denominator: add k, where k = the number of possible values of X
-smoothed estimate: theta_hat_j = (N_j + 1) / (n + k) (see the sketch below)
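A sketch of the correction applied to the multinomial estimate from earlier (toy counts, not from the notes):

import numpy as np

# Counts for X in {0, 1, 2}; value 2 never occurs in the training data
counts = np.array([5, 3, 0])
k = len(counts)
n = counts.sum()

mle = counts / n                  # [0.625 0.375 0.   ]; zero probability for value 2
laplace = (counts + 1) / (n + k)  # [0.5454... 0.3636... 0.0909...]; no zeros remain
print(mle, laplace)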
Likelihood function
allows us to determine unknown parameters based on known outcomes
Probability distribution
allows us to predict unknown outcomes based on known parameters
NBC learning scoring function
likelihood of data given NBC model form
Zero counts
-problem: if an attribute value does not occur in the training examples, we assign zero probability to that value
-a single zero makes the whole class-conditional probability equal to 0
-solution: adjust zero counts by smoothing the probability estimates (e.g. the Laplace correction above)
Bayes rule
P(y | x) = P(x | y) P(y) / P(x)
P(y|x)
the posterior probability of y: the probability that y is the correct label, given that the sample x has been observed
P(y)
the prior probability of label y; reflects background knowledge held before any data is observed. If there is no prior information, use a uniform distribution
P(x|y)
the probability of observing the sample x, given that y is its label (the likelihood)
P(x)
the probability that this sample of the data is observed, with no knowledge of the label (the evidence)
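Putting the four quantities together, a tiny worked example of Bayes rule with made-up numbers (not from the slides):

# Made-up priors and likelihoods for two labels and one observed sample x
prior = {"spam": 0.4, "ham": 0.6}            # P(y)
likelihood = {"spam": 0.03, "ham": 0.001}    # P(x | y)

# Evidence: P(x) = sum_y P(x | y) P(y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior: P(y | x) = P(x | y) P(y) / P(x)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)  # spam is the MAP label: {'spam': 0.952..., 'ham': 0.047...}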