CS 580 Final
Probability calculation for KNOWN + ALL
((# examples with B = b and A = a) + 0.5) / ((# examples with A = a) + 1)
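A minimal Python sketch of this smoothed estimate of P(B = b | A = a); the counts in the example call are hypothetical:

def smoothed_cpt_estimate(count_b_and_a, count_a):
    # ((# examples with B = b and A = a) + 0.5) / ((# examples with A = a) + 1)
    return (count_b_and_a + 0.5) / (count_a + 1)

print(smoothed_cpt_estimate(3, 10))  # 3.5 / 11 = 0.318...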
How many independent parameters (probabilities) are needed to define the joint probability distribution for n random Boolean variables?
(2^n) - 1
error
(FP + FN) / (TP + TN + FP + FN)
accuracy
(TP + TN) / (TP + TN + FP + FN)
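A small Python sketch computing the confusion-matrix measures used throughout this deck (error, accuracy, precision, recall/TP rate, FP rate, F measure); the counts are made up for illustration:

TP, TN, FP, FN = 40, 45, 5, 10            # hypothetical counts
total = TP + TN + FP + FN
error     = (FP + FN) / total             # 0.15
accuracy  = (TP + TN) / total             # 0.85
precision = TP / (TP + FP)                # 40 / 45
recall    = TP / (TP + FN)                # TP rate: 40 / 50
fp_rate   = FP / (FP + TN)                # 5 / 50
f_measure = 2 * precision * recall / (precision + recall)   # equals 2*TP / (2*TP + FP + FN)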
Entropy(ABCDEFGH)
-8 * (1/8) * log2(1/8) = 3 bits
Entropy(S)
-∑_i p_i log2(p_i)
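A minimal Python sketch of the entropy formula; the probability lists in the example calls are illustrative:

from math import log2

def entropy(probs):
    # Entropy(S) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/8] * 8))   # 3.0 bits, matching the ABCDEFGH card above
print(entropy([0.5, 0.5]))  # 1.0 bit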
Probability of "lacking support"?
0%-50%
What is the question asked when assessing inferential force?
What is the probability that the hypothesis above is true, based only on the evidence below it?
What is the question asked when assessing credibility?
What is the probability that the evidence is true?
3 applications where lots of input output pairs can be obtained
1) Credit risk assessment 2) Disease diagnosis 3) Estimating price of a house in a neighborhood
2 applications which require the software to customize to its operational environment after it is fielded
1) Email filtering 2) Speech recognition system customized to user
3 applications where it is too complex for people to manually design the algorithm, but it is simple to provide input-output pairs
1) Face recognition 2) Automatic steering 3) Hand-written character recognition
linear unit training rule is guaranteed to converge to the hypothesis with minimum squared error given ______, even when ______, and even when ______
1) Given sufficiently small learning rate 2) Even when training data contains noise 3) Even when training data isn't separable by H
Resampling allows for 1)_____ and 2)______
1) More accurate estimates of error rates while training on most cases, 2) Duplication of analysis conditions in future experiments on same data
4 different variations of learning bayesian networks
1) Network structure KNOWN, training data provides values for ALL network variables 2) Network structure KNOWN, training data provides values for SOME network variables 3) Network structure UNKNOWN, training data provides values for ALL network variables 4) Network structure UNKNOWN, training data provides values for SOME network variables
3 examples of preference bias
1) Occam's Razor, simpler explanation is better 2) Most specific hypothesis consistent with the training examples. 3) Most general hypothesis consistent with the training examples.
How to calculate prior probability of specific atomic event using Bayesian network?
1) Start with P(A, B, C) = P(A | B, C) * P(B, C). 2) Expand P(B, C) into P(B | C) * P(C). 3) For longer chains, apply the chain rule repeatedly. 4) In a Bayesian network each factor reduces to P(node | Parents(node)), read directly from the CPTs.
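A worked Python sketch of these steps for three Boolean variables; the probability values are hypothetical:

p_c          = 0.2    # P(c)
p_b_given_c  = 0.5    # P(b | c)
p_a_given_bc = 0.9    # P(a | b, c), stored in the network as P(node | Parents(node))

# Chain rule: P(a, b, c) = P(a | b, c) * P(b | c) * P(c)
p_abc = p_a_given_bc * p_b_given_c * p_c   # 0.09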
How to calculate conditional probability of something?
1) Write the query as the distribution P(C | s, d, ¬p). 2) Calculate A = P(c, s, d, ¬p). 3) Calculate B = P(¬c, s, d, ¬p). 4) Calculate alpha = 1 / (A + B). 5) Remember the query is of the form P(X | e) = alpha * Σ over y P(X, e, y). 6) Calculate the actual probability: P(c | s, d, ¬p) = alpha * A.
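A Python sketch of steps 2-6, assuming the two joint probabilities have already been computed from the full joint distribution (the numbers are made up):

A = 0.006                 # assumed P(c, s, d, ¬p)
B = 0.014                 # assumed P(¬c, s, d, ¬p)
alpha = 1 / (A + B)       # normalization constant
p_c     = alpha * A       # P(c | s, d, ¬p)  = 0.3
p_not_c = alpha * B       # P(¬c | s, d, ¬p) = 0.7; the two sum to 1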
2 ways to avoid overfitting
1) Stop growing tree earlier 2) Grow full tree, then post-prune (most effective)
perceptron training rule is guaranteed to succeed if 1) ____ and 2) ____
1) Training examples are linearly separable 2) Sufficiently small learning rate
3 things normal distributions can model
1) the height of a man, 2) velocity in any direction of a gas molecule, 3) error made in measuring a physical quantity
What is the systematic approach to answering an intelligence question?
1. Formulate all the hypotheses that will help answer the question. 2. Develop arguments that break the hypotheses into simpler hypotheses. 3. Marshal evidence to favor or disfavor these hypotheses, using assumptions to fill the gaps. 4. Evaluate the evidence and assumptions for their credibility and relevance. 5. Synthesize the resulting assessments to reach a conclusion.
Steps for building the argumentation
1. Read the problem 2. Determine the conditions that would make the top hypothesis true 3. Consider conditions that would make the sub-hypotheses true 4. Continue decomposing until the sub-hypotheses can be assessed directly from evidence
Gaussian probability distribution formula
f(x) = (1 / sqrt(2 * pi * SD^2)) * e^(-0.5 * ((x - E[X]) / SD)^2)
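The same density as a small Python function; the evaluation point in the example call is just an illustration:

from math import sqrt, pi, exp

def gaussian_pdf(x, mean, sd):
    # f(x) = (1 / sqrt(2*pi*sd^2)) * e^(-0.5 * ((x - mean) / sd)^2)
    return (1 / sqrt(2 * pi * sd**2)) * exp(-0.5 * ((x - mean) / sd) ** 2)

print(gaussian_pdf(0, 0, 1))  # ~0.3989, the peak of the standard normal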
Probability of "certain"?
100%
F measure
2*R*P / (R + P) = 2*TP/(2*TP+FP+FN)
How many functions with n boolean inputs?
2^(2^n)
Probability of "barely likely"?
50%-55%
Probability of "likely"?
55%-70%
Probability of "more than likely"?
70%-80%
Probability of "very likely"?
80%-95%
Probability of "almost certain"?
95%-99%
What is the question asked when assessing relevance?
Assuming that the evidence is true, what is the probability that the hypothesis is true?
Which attribute to split on?
The attribute with the highest information gain (equivalently, the lowest expected entropy of the resulting subsets)
restricted hypothesis space or representation bias
Bias that restricts the existing hypothesis space (choose only conjunctions, etc)
error space E[w]
E[w] = (1/2) ∑ d ∈ D (t_d - o_d)^2
mixed evidence
Evidence that is a mixture of different evidence types
credibility
Extent to which an item of evidence or a source of evidence may be believed
inferential force
Force of an item of evidence or a sub-hypothesis in favoring or disfavoring a hypothesis, the minimum of relevance and credibility
Function approximation
Given a set of training examples of an unknown target and a set of hypotheses functions, determine the hypothesis that best approximates the target function
Inductive learning of functions from examples
Given training examples <xi, f(xi)> of an unknown function f, find a good approximation of f that enables us to determine f(x)
relevance
How strongly an item of evidence or a sub-hypothesis supports a hypothesis
Batch gradient descent
I look at all the examples to compute the gradient; only after that do I update the weights.
Incremental (stochastic) gradient descent
I update the weights after each example; it is much faster because I make many more weight updates.
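A Python sketch contrasting the two update schemes for a linear unit o = w · x (the delta rule); the learning-rate parameter and data layout are assumptions for illustration:

def batch_update(w, data, eta):
    # accumulate the gradient over ALL examples, then update the weights once
    grad = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            grad[i] += (t - o) * xi
    return [wi + eta * gi for wi, gi in zip(w, grad)]

def stochastic_update(w, data, eta):
    # update the weights after EACH example
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w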
What is inferential force of balance argument?
If the inferential force of the favoring argument is less than that of the disfavoring argument, the balance argument is lacking support.
probability of a disjunction formula
P(A OR B) = P(A) + P(B) - P(A AND B)
Bayes' formula
P(A, B) = P(A | B) * P(B) = P(B | A) * P(A), hence P(A | B) = P(B | A) * P(A) / P(B)
chain rule example
P(A, B, C) = P(A|B, C) P(B|C) P(C)
query
P(X | e) = alpha * Σ over Y P(X, e, y), where e is evidence variables and Y is unobserved variables
conditioning rule
P(Y) = Σ_z P(Y | z) * P(z)
Marginalization/summing out
P(Y) = Σ_z P(Y, z); for any sets of variables Y and Z, a distribution over Y can be obtained by summing out all other variables from any joint distribution containing Y
Inference using full joint probability distribution
P(a) = Σ_i P(e_i), summing over all atomic events e_i in which a holds true
Bayes' Theorem
P(h | D) = P(D | h) * P(h) / P(D)
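A worked numeric example in Python; the prior, likelihood, and normalizer values are hypothetical:

p_h         = 0.01   # Prior      P(h)
p_d_given_h = 0.90   # Likelihood P(D | h)
p_d         = 0.05   # Normalizer P(D)

p_h_given_d = p_d_given_h * p_h / p_d   # Posterior P(h | D) = 0.18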
Bayesian network node probability
P(node | Parents(node))
binomial probability distribution formula
P(r) = [n! / (r! (n - r)!)] * p^r * (1 - p)^(n - r)
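A minimal Python sketch of the formula (math.comb requires Python 3.8+); the example numbers are illustrative:

from math import comb

def binomial_pmf(r, n, p):
    # P(r successes in n trials) = C(n, r) * p^r * (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binomial_pmf(3, 10, 0.5))   # ~0.117
print(10 * 0.5, 10 * 0.5 * 0.5)   # mean n*p = 5.0, variance n*p*(1-p) = 2.5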
main hypothesis
Possible answer to the intelligence question
Bayesian Interpretation of Probability
Probabilities quantify our uncertainty about something, being fundamentally related to information rather than repeated trials. For example, we believe the coin is equally likely to land heads or tails on the next toss.
Frequentist Interpretation of Probability
Probabilities represent long-run frequencies of events. For example, if we flip the coin many times, we expect it to land heads about half of the time.
assumption
Statement taken to be likely true, based on knowledge about similar situations and commonsense reasoning, without having any direct supporting evidence
Why is binomial probability distribution important?
The probability mass function of a binomial (n, p) random variable becomes more and more normal as n gets larger.
Central Limit Theorem
The theorem that, as the sample size n increases, the distribution of the means of randomly selected samples of size n approaches a normal distribution (a standard normal once standardized)
Any function can be approximated to arbitrary accuracy by a network with _______ hidden layers
two
variance formula
variance(X) = E[(X - E[X])^2] = Σ_x (x - E[X])^2 * P(X = x)
Change in weights in gradient descent
Δw_i = -η * ∂E[w]/∂w_i (in vector form, Δw = -η * ∇E[w])
perceptron training rule
w_i ← w_i + Δw_i, where Δw_i = η * (t - o) * x_i
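A minimal Python sketch of the rule on a made-up, linearly separable dataset (AND with targets in {-1, +1}); the learning rate and epoch count are assumptions:

def train_perceptron(data, eta=0.1, epochs=20):
    w = [0.0, 0.0, 0.0]                       # bias weight plus two input weights
    for _ in range(epochs):
        for x1, x2, t in data:
            x = [1.0, x1, x2]                 # prepend constant 1 for the bias
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]   # Δw_i = η(t - o)x_i
    return w

and_data = [(0, 0, -1), (0, 1, -1), (1, 0, -1), (1, 1, 1)]   # linearly separable, so the rule converges
print(train_perceptron(and_data))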
evidence is frequently ___________
ambiguous
inductive bias
any basis for choosing one hypothesis over another, other than strict consistency with the observed training examples
Every __________________ function can be approximated with arbitrarily small error, by a network with one hidden layer
bounded continuous
machine learning is _______
concerned with building adaptive computer systems that are able to improve their performance at some task through learning from input data, from the user, or from their own problem solving experience
data mining is _____
concerned with discovering useful information in large data repositories
pattern recognition is _______
concerned with the recognition of patterns, primarily from sensory input such as speech and vision
precision
correctly predicted positives / all predicted positives
TP rate
correctly predicted positives / all positives
recall
correctly predicted positives / all positives
evidence has various degrees of ____________
credibility
Gain(S, A)
Entropy(S) - Σ over v in Values(A) [(|S_v| / |S|) * Entropy(S_v)], where S_v is the subset of S for which attribute A has value v
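A Python sketch of the gain calculation; the labels and the split are hypothetical:

from math import log2

def entropy(labels):
    # Entropy of a list of class labels
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def gain(parent_labels, subsets):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(s) / n * entropy(s) for s in subsets)

S = ['+', '+', '+', '-', '-', '-']
print(gain(S, [['+', '+', '+'], ['-', '-', '-']]))  # 1.0: a perfect split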
evidence is commonly _____________
dissonant
overfitting
error_true(h) > error_train(h)
testimonial evidence
evidence provided by human source, credibility based on reliability, access, expertise, veracity, observational accuracy, objectivity (RACVOO)
tangible evidence
evidence that can be directly examined, credibility is based on authenticity, accuracy, and reliability (AAR)
fact
evidence that no one questions, credibility is always certain
linear unit training rule uses ______
gradient descent
As the depth of the decision tree increases, the hypothesis space _____
grows
evidence is always ___________
incomplete
evidence is usually __________
inconclusive
FP rate
incorrectly predicted positives / all negatives
Do we take maximum or minimum probability of inferential force in alternative argument?
maximum
Use gradient descent to ________ the squared error
minimize
Do we take maximum or minimum probability of inferential force in argument?
minimum
Do we take maximum or minimum probability of inferential force in AND argument?
minimum of probabilities of sub hypotheses and relevance of AND argument
missing evidence
evidence that is missing for a specific hypothesis to be true; its credibility may be based on other conclusions drawn from the argumentation
binomial probability distribution mean formula
n * p
binomial probability distribution variance
n * p * (1 - p)
preference bias
places a preference ordering over the hypotheses in the hypothesis space H.
Prior
prior probability of hypothesis h, P(h)
Normalizer
prior probability of training data D, P(D)
Likelihood
probability of D given h, P(D | h)
Posterior
probability of h given D, P(h | D)
binomial probability distribution standard deviation
root(n * p * (1 - p))
standard deviation formula
square root of variance