Bayes-Machine Learning: Mid-term
What are some features of Bayesian learning methods?
* Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
* Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
* New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
* Bayesian methods can accommodate hypotheses that make probabilistic predictions.
What are two practical difficulties in applying Bayesian methods?
* They typically require initial knowledge of many probabilities.
* There is significant computational cost required to determine the Bayes optimal hypothesis.
Why is Naive Bayes cool?
* Inference is cheap.
* Few parameters.
* Estimate parameters with labeled data.
* Connects inference and classification.
* Empirically successful.
* Doesn't model interrelationships between attributes.
* Ordering is preserved.
* One unseen attribute spoils the whole lunch.
What are two reasons why Bayesian methods are important to the study of machine learning?
1. Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.
2. They provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
What assumptions must be met to perform Brute-Force h(MAP)?
1. The training data is noise free.
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable than any other.
What is naive Bayes' inductive bias?
That all things are at least possible: with smoothing, every attribute value is assumed to have some nonzero probability, so a single unseen attribute value does not drive a class probability to zero.
How does Bayes theorem work?
Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.
Describe Bayesian belief networks
Bayesian belief networks represent the joint probability distribution for a set of variables. They do so by combining a set of conditional independence assumptions with local conditional probabilities, which makes it practical to represent and reason about probabilistic quantities over complex spaces (networks).
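The joint probability then factors as P(y_1, ..., y_n) = Π_i P(y_i | Parents(y_i)). Below is a minimal sketch, using a hypothetical three-variable network (Storm -> Lightning -> Thunder) with made-up conditional probability tables, of how a full joint probability is computed from the local tables.

```python
# Minimal belief-network sketch with hypothetical variables
# Storm -> Lightning -> Thunder and made-up conditional probability tables.
# The joint factors as P(S, L, T) = P(S) * P(L | S) * P(T | L).

p_storm = {True: 0.2, False: 0.8}                      # P(Storm)
p_lightning = {True: {True: 0.7, False: 0.3},          # P(Lightning | Storm)
               False: {True: 0.1, False: 0.9}}
p_thunder = {True: {True: 0.95, False: 0.05},          # P(Thunder | Lightning)
             False: {True: 0.05, False: 0.95}}

def joint(storm, lightning, thunder):
    """P(Storm=storm, Lightning=lightning, Thunder=thunder)."""
    return (p_storm[storm]
            * p_lightning[storm][lightning]
            * p_thunder[lightning][thunder])

print(joint(True, True, True))   # 0.2 * 0.7 * 0.95 = 0.133
```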
Describe Bayesian learning
Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data.
Describe Lc(i).
L_C(i) denotes the description length of message i with respect to encoding C, i.e., the number of bits required to encode (transmit) message i using code C.
What is the MDL algorithm?
Choose h(MDL) where: h(MDL) = argmin_{h in H} [ L_C1(h) + L_C2(D|h) ]
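A minimal sketch of this selection rule, assuming the description lengths (in bits) of each candidate hypothesis and of the data given that hypothesis are already available; all numbers are made up for illustration.

```python
# MDL sketch: pick the hypothesis minimizing L_C1(h) + L_C2(D | h),
# using made-up description lengths in bits.
candidates = {
    "h1": {"len_h": 10, "len_data_given_h": 50},   # simple h, many exceptions
    "h2": {"len_h": 25, "len_data_given_h": 20},   # more complex h, fewer exceptions
    "h3": {"len_h": 60, "len_data_given_h": 5},    # very complex h, almost no exceptions
}

h_mdl = min(candidates,
            key=lambda h: candidates[h]["len_h"] + candidates[h]["len_data_given_h"])
print(h_mdl)  # "h2": 45 total bits beats 60 for h1 and 65 for h3
```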
What values should we specify for P(h) when performing h(MAP)?
Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign the same prior probability to every hypothesis h in H. Because these priors must sum to 1, P(h) = 1 / |H| for all h in H.
Describe maximum likelihood
In some cases, we assume that every hypothesis in H is equally probable a priori (P(h_i) = P(h_j) for all h_i and h_j in H). In this case we need only consider the term P(D|h) to find the most probable hypothesis. P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis.
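A small illustrative sketch, assuming (purely for the example) that the hypotheses are candidate bias values theta for a coin and the data are independent flips, so P(D|theta) = theta^heads * (1 - theta)^tails; the maximizing hypothesis turns out to be the observed frequency of heads.

```python
# Illustrative maximum likelihood sketch: hypotheses are candidate coin
# biases theta; the likelihood of the observed flips is theta^h * (1-theta)^t.

heads, tails = 7, 3

def likelihood(theta):
    return theta ** heads * (1 - theta) ** tails

# Search a grid of candidate hypotheses for the maximum likelihood one.
thetas = [i / 100 for i in range(1, 100)]
theta_ml = max(thetas, key=likelihood)
print(theta_ml)   # 0.7, the observed frequency of heads
```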
What is required of joint distributions that have a topological order?
The dependency structure must be acyclic: a topological ordering over the variables exists only when there are no cyclic dependencies among them.
What is MDL?
Minimum description length recommends the shortest method for re-encoding the training data, where we count both the size of the hypothesis and any additional cost of encoding the data given this hypothesis. It recommends the hypothesis that minimizes the sum of these two description lengths.
How is naive Bayes unique among learning methods with regard to the hypothesis space?
Naive Bayes does not explicitly search through the hypothesis space. Instead, the hypothesis is formed without searching, simply by counting the frequency of various data combinations within the training examples.
Describe Naive Bayes
Naive Bayes is based upon the assumption that the attribute values are conditionally independent given the target value.
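A minimal sketch under that assumption, using a tiny made-up training set: the prior P(v) and the conditionals P(a_i | v) are estimated by counting frequencies, and a new instance is assigned the class maximizing P(v) * Π_i P(a_i | v). No smoothing is applied, so an unseen attribute value zeroes out a class score.

```python
from collections import Counter, defaultdict

# Minimal naive Bayes sketch over a tiny, made-up training set.
# Each example is ({attribute: value, ...}, target_value).
train = [
    ({"outlook": "sunny", "windy": False}, "no"),
    ({"outlook": "sunny", "windy": True},  "no"),
    ({"outlook": "rain",  "windy": False}, "yes"),
    ({"outlook": "rain",  "windy": True},  "no"),
    ({"outlook": "sunny", "windy": False}, "yes"),
]

# Estimate P(v) and P(a_i = value | v) by counting frequencies.
class_counts = Counter(v for _, v in train)
cond_counts = defaultdict(Counter)              # (attribute, class) -> value counts
for attrs, v in train:
    for a, val in attrs.items():
        cond_counts[(a, v)][val] += 1

def classify(attrs):
    def score(v):
        p = class_counts[v] / len(train)        # prior P(v)
        for a, val in attrs.items():            # product of P(a = val | v)
            p *= cond_counts[(a, v)][val] / class_counts[v]
        return p
    return max(class_counts, key=score)

print(classify({"outlook": "sunny", "windy": True}))   # "no"
```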
What is Bayes theorem?
P(h|D) = P(D|h) P(h) / P(D)
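A worked instance with illustrative numbers, where h is "the patient has the disease" and D is a positive test result.

```python
# Worked Bayes theorem sketch with illustrative numbers: h = "patient has
# the disease", D = "the test came back positive".
p_h = 0.008             # prior P(h)
p_d_given_h = 0.98      # P(D | h)
p_d_given_not_h = 0.03  # P(D | not h)

# P(D) by total probability, then the posterior P(h | D) by Bayes theorem.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
p_h_given_d = p_d_given_h * p_h / p_d
print(round(p_h_given_d, 3))   # ~0.21: even after a positive test, h stays unlikely
```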
What is the Bayes Optimal Classifier?
The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities. Equivalently, it is obtained by taking a weighted vote among all members of the version space, with each candidate hypothesis weighted by its posterior probability.
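A minimal sketch with illustrative posteriors (0.4, 0.3, 0.3) for three hypotheses: the single most probable hypothesis predicts "+", but the posterior-weighted vote over all hypotheses predicts "-".

```python
from collections import defaultdict

# Bayes optimal classification sketch with illustrative posteriors:
# each hypothesis votes for a class, weighted by its posterior P(h | D).
hypotheses = [
    {"posterior": 0.4, "prediction": "+"},   # the single MAP hypothesis
    {"posterior": 0.3, "prediction": "-"},
    {"posterior": 0.3, "prediction": "-"},
]

votes = defaultdict(float)
for h in hypotheses:
    votes[h["prediction"]] += h["posterior"]

print(max(votes, key=votes.get))   # "-": 0.6 outweighs the MAP prediction "+" at 0.4
```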
Describe the significance in finding the least squared error.
Under certain assumptions any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis. The significance of this result is that it provides a Bayesian justification for many neural network and other curve fitting methods that attempt to minimize the sum of squared errors over the training data.
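The usual derivation sketch, assuming the observed target values equal the true target values plus i.i.d. zero-mean Gaussian noise of fixed variance: taking the log of the likelihood and dropping terms that do not depend on h leaves the negated sum of squared errors.

```latex
h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m}
           \frac{1}{\sqrt{2\pi\sigma^2}}
           e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}
       = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2}
       = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2
```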
Describe the maximum a posteriori (MAP) hypothesis.
The learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h in H given the observed data D (or at least one of the maximally probable hypotheses, if there are several). Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
Describe conditional independence.
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, P(X | Y, Z) = P(X | Z), so once Z is known, Y provides no additional information about X.
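A small sketch, using a made-up boolean distribution constructed so that X and Y each depend only on Z, verifying numerically that P(X | Y, Z) = P(X | Z).

```python
# Conditional independence sketch: the joint is built as P(z) P(x|z) P(y|z),
# so knowing Y adds nothing about X once Z is given.
p_z = {True: 0.3, False: 0.7}
p_x_given_z = {True: 0.9, False: 0.2}   # P(X=True | z)
p_y_given_z = {True: 0.6, False: 0.1}   # P(Y=True | z)

def joint(x, y, z):
    px = p_x_given_z[z] if x else 1 - p_x_given_z[z]
    py = p_y_given_z[z] if y else 1 - p_y_given_z[z]
    return p_z[z] * px * py

# P(X=True | Y=True, Z=True) equals P(X=True | Z=True).
p_x_given_yz = joint(True, True, True) / (joint(True, True, True) + joint(False, True, True))
print(round(p_x_given_yz, 3), p_x_given_z[True])   # 0.9 0.9
```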
What is a consistent learner?
A learning algorithm is a consistent learner provided it outputs a hypothesis that commits zero errors over the training examples. We can conclude that every consistent learner outputs a MAP hypothesis, if we assume a uniform prior probability distribution over H and noise-free training data.
For learning target functions that predict probabilities, which error function is more appropriate: sum of squared errors or cross-entropy?
cross-entropy
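For reference, with boolean targets d_i and a hypothesis h(x_i) that outputs a probability, the cross-entropy to be minimized can be written as:

```latex
-\sum_{i=1}^{m} \Big[ d_i \ln h(x_i) + (1 - d_i) \ln\big(1 - h(x_i)\big) \Big]
```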
What is the MAP hypothesis?
h(MAP) = argmax_{h in H} P(D|h) P(h)
Which rules provide Bayesian justifications for other learning methods? Map each Bayesian rule to what it justifies.
* h(ML) -- justifies minimizing the sum of squared errors.
* MDL -- justifies Occam's razor (choose the shortest explanation for the observed data).
What is the maximum likelihood hypothesis?
h(ML) = argmax_{h in H} P(D|h)
Describe Brute-Force MAP learning
It applies Bayes theorem to each hypothesis in H to calculate the posterior probability P(h|D), and then outputs the hypothesis with the highest posterior (the MAP hypothesis).
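A minimal sketch of the procedure with made-up priors and likelihoods, which also instantiates h(MAP) = argmax_{h in H} P(D|h) P(h) from above.

```python
# Brute-force MAP sketch: apply Bayes theorem to every hypothesis in H
# (made-up priors and likelihoods), then output the most probable one.
H = {
    "h1": {"prior": 0.5, "likelihood": 0.10},   # P(h), P(D | h)
    "h2": {"prior": 0.3, "likelihood": 0.40},
    "h3": {"prior": 0.2, "likelihood": 0.35},
}

p_d = sum(v["prior"] * v["likelihood"] for v in H.values())   # P(D), the normalizer
posteriors = {h: v["prior"] * v["likelihood"] / p_d for h, v in H.items()}

h_map = max(posteriors, key=posteriors.get)
print(h_map, round(posteriors[h_map], 3))   # "h2" has the largest P(h | D): 0.5
```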
What yields the same result as maximizing the likelihood p(D|h)?
Selecting the hypothesis that maximizes the logarithm of the likelihood, ln P(D|h), yields the same hypothesis as maximizing P(D|h) itself, because the logarithm is a monotonically increasing function.
What is P(h|D)?
the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D. Notice the posterior probability P(h|D) reflects the influence of the training data D, in contrast to the prior probability P(h), which is independent of D.
What is P(h)?
the prior probability of h. It may reflect any background knowledge we have about the chance that h is a correct hypothesis.
What is P(D)?
the prior probability that training data D will be observed (i.e., the probability of D given no knowledge about which hypothesis holds).
What is P(D|h)?
the probability of observing data D given some world in which hypothesis h holds.
What is VS_{H,D}?
The subset of hypotheses from H that are consistent with D (i.e., VS_{H,D} is the version space of H with respect to D).