CS 580 Final

Probability calculation for KNOWN + ALL

((# examples with B = b and A = a) + 0.5) / ((# examples with A = a) + 1)
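
This is the smoothed estimate of P(B = b | A = a) from fully observed training data. A minimal sketch in Python, assuming the data is a list of (a_value, b_value) pairs; the function name and data layout are illustrative:

    def estimate_p_b_given_a(examples, a, b):
        # examples: list of (a_value, b_value) pairs from the training data
        count_a = sum(1 for ea, eb in examples if ea == a)
        count_ab = sum(1 for ea, eb in examples if ea == a and eb == b)
        # Smoothed estimate: add 0.5 to the numerator and 1 to the denominator
        return (count_ab + 0.5) / (count_a + 1)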

How many independent parameters (probabilities) are needed to define the joint probability distribution for n random Boolean variables?

(2^n) - 1
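
A quick illustrative check: the full joint table over n Boolean variables has 2^n entries, and because they must sum to 1, only 2^n - 1 of them are independent.

    for n in (1, 2, 3, 10):
        print(n, 2 ** n - 1)   # 1, 3, 7, 1023 independent parameters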

error

(FP + FN) / (TP + TN + FP + FN)

accuracy

(TP + TN) / (TP + TN + FP + FN)
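
A small sketch computing both error and accuracy from confusion-matrix counts; the counts below are made up for illustration:

    # Hypothetical confusion-matrix counts
    TP, TN, FP, FN = 40, 45, 5, 10
    error = (FP + FN) / (TP + TN + FP + FN)      # 0.15
    accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.85
    print(error, accuracy)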

Entropy(ABCDEFGH)

-8 * (1/8) * log2(1/8) = 3 bits

Entropy(S)

-∑pi log2 pi
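
A minimal entropy helper matching this formula, assuming the class proportions are given as a list:

    import math

    def entropy(proportions):
        # -sum(p_i * log2(p_i)) over the nonzero class proportions
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    print(entropy([1/8] * 8))   # 3.0, the Entropy(ABCDEFGH) card above
    print(entropy([0.5, 0.5]))  # 1.0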

Probability of "lacking support"?

0%-50%

What is the question asked when assessing inferential force?

What is the probability that the hypothesis above is true, based only on the evidence below it?

What is the question asked when assessing credibility?

What is the probability that the evidence is true?

3 applications where lots of input output pairs can be obtained

1) Credit risk assessment 2) Disease diagnosis 3) Estimating price of a house in a neighborhood

2 applications which require the software to customize to its operational environment after it is fielded

1) Email filtering 2) Speech recognition system customized to user

3 applications where it is too complex for people to manually design the algorithm, but it is simple to provide input-output pairs

1) Face recognition 2) Automatic steering 3) Hand-written character recognition

linear unit training rule is guaranteed to converge to the hypothesis with minimum squared error given ______, even when ______, and even when ______

1) Given sufficiently small learning rate 2) Even when training data contains noise 3) Even when training data isn't separable by H

Resampling allows for 1)_____ and 2)______

1) More accurate estimates of error rates while training on most cases, 2) Duplication of analysis conditions in future experiments on same data

4 different variations of learning bayesian networks

1) Network structure KNOWN, training data provides values for ALL network variables 2) Network structure KNOWN, training data provides values for SOME network variables 3) Network structure UNKNOWN, training data provides values for ALL network variables 4) Network structure UNKNOWN, training data provides values for SOME network variables

3 examples of preference bias

1) Occam's Razor, simpler explanation is better 2) Most specific hypothesis consistent with the training examples. 3) Most general hypothesis consistent with the training examples.

How to calculate prior probability of specific atomic event using Bayesian network?

1) Start with P(A, B, C) = P(A | B, C) * P(B, C). 2) Expand P(B, C) into P(B | C) * P(C). 3) For longer chains, apply the chain rule repeatedly.
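
A hedged numeric sketch of these steps; the conditional probabilities below are made up, not taken from any particular network:

    # Made-up conditional probabilities for illustration
    p_a_given_bc = 0.9
    p_b_given_c = 0.4
    p_c = 0.2
    # Chain rule: P(A, B, C) = P(A | B, C) * P(B | C) * P(C)
    p_abc = p_a_given_bc * p_b_given_c * p_c
    print(p_abc)  # 0.072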

How to calculate conditional probability of something?

1) Start with the query distribution P_bold(C | s, d, ¬p) 2) Calculate A = P(c, s, d, ¬p) 3) Calculate B = P(¬c, s, d, ¬p) 4) Calculate the normalizer alpha = 1 / (A + B) 5) Remember the query is of the form P(X | e) = alpha * Σ over y P(X, e, y) 6) Compute the actual probability P(c | s, d, ¬p) = alpha * A (here all variables other than the query variable are observed, so the sum has a single term)
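
A sketch of the normalization steps with made-up joint probabilities; the values are illustrative only:

    # Made-up joint probabilities for the two values of the query variable C
    A = 0.008   # P(c, s, d, not p)
    B = 0.002   # P(not c, s, d, not p)
    alpha = 1 / (A + B)
    p_c_given_evidence = alpha * A
    print(p_c_given_evidence)  # 0.8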

2 ways to avoid overfitting

1) Stop growing tree earlier 2) Grow full tree, then post-prune (most effective)

perceptron training rule is guaranteed to succeed if 1) ____ and 2) ____

1) Training examples are linearly separable 2) Sufficiently small learning rate

3 things normal distributions can model

1) the height of a man, 2) velocity in any direction of a gas molecule, 3) error made in measuring a physical quantity

What is the systematic approach to answering an intelligence question?

1. Formulate all the hypotheses that will help answer the question. 2. Develop arguments that break the hypotheses into simpler hypotheses. 3. Marshal evidence to favor or disfavor these hypotheses, using assumptions to fill the gaps. 4. Evaluate the evidence and assumptions for their credibility and relevance. 5. Synthesize the resulting assessments to reach a conclusion.

Steps for building the argumentation

1. Read the problem. 2. Determine the conditions that would make the top hypothesis true. 3. Consider the conditions that would make the sub-hypotheses true. 4. Continue in the same way down the argumentation structure.

Gaussian probability distribution formula

1 / sqrt(2 * pi * SD^2) * e^(-0.5 * ((x - E[X]) / SD)^2)
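
A direct transcription of this density into Python; the parameter names mean and sd are illustrative:

    import math

    def gaussian_pdf(x, mean, sd):
        # 1 / sqrt(2*pi*sd^2) * exp(-0.5 * ((x - mean) / sd)^2)
        return (1 / math.sqrt(2 * math.pi * sd ** 2)) * math.exp(-0.5 * ((x - mean) / sd) ** 2)

    print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989, the peak of the standard normal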

Probability of "certain"?

100%

F measure

2*R*P / (R + P) = 2*TP/(2*TP+FP+FN)
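
A quick check that the two forms of the F measure agree, using illustrative counts:

    TP, FP, FN = 40, 5, 10
    P = TP / (TP + FP)            # precision
    R = TP / (TP + FN)            # recall
    f_from_pr = 2 * R * P / (R + P)
    f_from_counts = 2 * TP / (2 * TP + FP + FN)
    print(f_from_pr, f_from_counts)  # both ~0.842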

How many functions with n boolean inputs?

2^(2^n)

Probability of "barely likely"?

50%-55%

Probability of "likely"?

55%-70%

Probability of "more than likely"?

70%-80%

Probability of "very likely"?

80%-95%

Probability of "almost certain"?

95%-99%

What is the question asked when assessing relevance?

Assuming that the evidence is true, what is the probability that the hypothesis is true?

Which attribute to split on?

The attribute with the highest information gain (equivalently, the lowest weighted entropy of the resulting subsets)

restricted hypothesis space or representation bias

Bias that restricts the hypothesis space itself (e.g., by considering only conjunctions of attributes)

error space E[w]

E[w] = (1/2) ∑ d ∈ D (t_d - o_d)^2

mixed evidence

Evidence that is a mixture of different evidence types

credibility

Extent to which an item of evidence or a source of evidence may be believed

inferential force

Force of an item of evidence or a sub-hypothesis in favoring or disfavoring a hypothesis, the minimum of relevance and credibility

Function approximation

Given a set of training examples of an unknown target and a set of hypotheses functions, determine the hypothesis that best approximates the target function

Inductive learning of functions from examples

Given training examples <xi, f(xi)> of an unknown function f, find a good approximation of f that enables us to determine f(x)

relevance

How strongly an item of evidence or a sub-hypothesis supports a hypothesis

Batch gradient descent

I look at all the examples to compute gradient. Only after that I update the weights

Incremental (stochastic) gradient descent

I update the weights after each example. It is much faster because I do many weight updates.
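
A minimal sketch contrasting the two update schemes above for a linear unit with squared-error loss; the data and learning rate are illustrative:

    def batch_gradient_descent_step(w, data, eta):
        # Accumulate the gradient over ALL examples, then update the weights once
        grad = [0.0] * len(w)
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                grad[i] += -(t - o) * xi
        return [wi - eta * gi for wi, gi in zip(w, grad)]

    def stochastic_gradient_descent_step(w, data, eta):
        # Update the weights after EACH example (many small updates per pass)
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        return w

    # Made-up (x, target) pairs; the exact least-squares weights are (1.0, 2.0)
    data = [((1.0, 2.0), 5.0), ((1.0, -1.0), -1.0)]
    w = [0.0, 0.0]
    for _ in range(100):
        w = stochastic_gradient_descent_step(w, data, 0.05)
    print(w)  # gradually approaches (1.0, 2.0)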

What is inferential force of balance argument?

If the inferential force of the favoring argument is less than the inferential force of the disfavoring argument, the balance argument is "lacking support"

probability of a disjunction formula

P(A OR B) = P(A) + P(B) - P(A AND B)

Bayes' formula

P(A, B) = P(A | B) * P(B) also P(A, B) = P(B | A) * P(A)

chain rule example

P(A, B, C) = P(A|B, C) P(B|C) P(C)

query

P(X | e) = alpha * Σ over Y P(X, e, y), where e is evidence variables and Y is unobserved variables

conditioning rule

P(Y) = Σ over z P(Y | z) P(z)

Marginalization/summing out

P(Y) = Σ P(Y, z), for any set of variables Y and Z, a distribution over Y can be obtained by summing out all other variables from any joint distribution containing Y

Inference using full joint probability distribution

P(a) = Σ P(ei), where the ei are all the atomic events in which a holds true

Bayes' Theorem

P(h | D) = P(D | h) * P(h) / P(D)

bayesian network node probability

P(node | Parents(node))

binomial probability distribution formula

P(r) = [n! / (r!(n-r)!)] * p^r * (1-p)^(n-r)
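
This formula in Python, using math.comb (Python 3.8+) for the binomial coefficient; n, r, and p below are illustrative:

    import math

    def binomial_pmf(r, n, p):
        # C(n, r) * p^r * (1 - p)^(n - r)
        return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

    print(binomial_pmf(5, 10, 0.5))  # ~0.246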

main hypothesis

Possible answer to the intelligence question

Bayesian Interpretation of Probability

Probabilities quantify our uncertainty about something, being fundamentally related to information rather than to repeated trials. For example, we believe the coin is equally likely to land heads or tails on the next toss.

Frequentist Interpretation of Probability

Probabilities represent long-run frequencies of events. For example, if we flip the coin many times, we expect it to land heads about half of the time.

assumption

Statement taken to be likely true, based on knowledge about similar situations and commonsense reasoning, without having any direct supporting evidence

Why is binomial probability distribution important?

The probability mass function of a binomial (n, p) random variable becomes more and more normal as n gets larger.

Central Limit Theorem

The theorem stating that, as the sample size n increases, the distribution of the means of randomly selected samples of size n approaches a normal distribution (a standard normal distribution after standardizing)

Any function can be approximated to arbitrary accuracy by a network with _______ hidden layers

two

variance formula

variance(X) = E[(X - E[X])^2] = Σ over x (x - E[X])^2 * P(X = x)
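
A small sketch for a discrete random variable; the distribution below is made up for illustration:

    # Made-up distribution: value -> probability
    dist = {0: 0.2, 1: 0.5, 2: 0.3}
    mean = sum(x * p for x, p in dist.items())
    variance = sum((x - mean) ** 2 * p for x, p in dist.items())
    print(mean, variance, variance ** 0.5)  # 1.1, 0.49, 0.7 (mean, variance, standard deviation)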

Change in weights in gradient descent

Δ𝑤𝑖 = -η * ∂E[w]/∂𝑤𝑖 (in vector form, Δw = -η * ∇E[w])

perceptron training rule

𝑤𝑖 ← 𝑤𝑖 + Δ𝑤𝑖, where Δ𝑤𝑖 = η(t - o)𝑥𝑖
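
A minimal sketch of one pass of this rule with a thresholded output; the data, bias handling, and learning rate are illustrative assumptions:

    def perceptron_epoch(w, data, eta=0.1):
        # One pass over the data applying w_i <- w_i + eta * (t - o) * x_i
        for x, t in data:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        return w

    # Learning OR of two inputs (x[0] = 1 acts as a bias term), targets in {-1, +1}
    data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
    w = [0.0, 0.0, 0.0]
    for _ in range(10):
        w = perceptron_epoch(w, data)
    print(w)  # converges because the examples are linearly separable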

evidence is frequently ___________

ambiguous

inductive bias

any basis for choosing one hypothesis over another, other than strict consistency with the observed training examples

Every __________________ function can be approximated with arbitrarily small error, by a network with one hidden layer

bounded continuous

machine learning is _______

concerned with building adaptive computer systems that are able to improve their performance at some task through learning from input data, from the user, or from their own problem solving experience

data mining is _____

concerned with discovering useful information in large data repositories

pattern recognition is _______

concerned with the recognition of patterns, primarily from sensory input such as speech and vision

precision

correctly predicted positives / all predicted positives = TP / (TP + FP)

TP rate

correctly predicted positives / all positives

recall

correctly predicted positives / all positives = TP / (TP + FN)

evidence has various degrees of ____________

credibility

Gain(S, A)

Entropy(S) - ∑ over v in Values(A) [(|Sv| / |S|) * Entropy(Sv)]: the current entropy minus the weighted average entropy of each subgroup produced by splitting on A (e.g., the tall and short groups)
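
A sketch of this gain computation; the labels and attribute values below are made up to mirror the tall/short example:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(labels, attribute_values):
        # Gain = Entropy(S) - sum over values v of (|Sv|/|S|) * Entropy(Sv)
        n = len(labels)
        groups = {}
        for label, v in zip(labels, attribute_values):
            groups.setdefault(v, []).append(label)
        remainder = sum((len(g) / n) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    labels = ['tall', 'tall', 'tall', 'short', 'short', 'short']
    attr   = ['yes',  'yes',  'yes',  'no',    'no',    'no']
    print(gain(labels, attr))  # 1.0: this attribute separates the classes perfectly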

evidence is commonly _____________

dissonant

overfitting

error_true(h) > error_train(h)

testimonial evidence

evidence provided by human source, credibility based on reliability, access, expertise, veracity, observational accuracy, objectivity (RACVOO)

tangible evidence

evidence that can be directly examined, credibility is based on authenticity, accuracy, and reliability (AAR)

fact

evidence that no one questions, credibility is always certain

linear unit training rule uses ______

gradient descent

As the depth of decision tree increases the hypothesis space _____

grows

evidence is always ___________

incomplete

evidence is usually __________

inconclusive

FP rate

incorrectly predicted positives / all negatives

Do we take maximum or minimum probability of inferential force in alternative argument?

maximum

Use gradient descent to ________ the squared error

minimize

Do we take maximum or minimum probability of inferential force in argument?

minimum

Do we take maximum or minimum probability of inferential force in AND argument?

minimum of probabilities of sub hypotheses and relevance of AND argument

missing evidence

missing evidence for specific hypothesis to be true, credibility may be based on other conclusions drawn from argumentation

binomial probability distribution mean formula

n * p

binomial probability distribution variance

n * p * (1 - p)

preference bias

places a preference ordering over the hypotheses in the hypothesis space H.

Prior

prior probability of hypothesis h

Normalizer

prior probability of training data D

Likelihood

probability of D given h

Posterior

probability of h given D

binomial probability distribution standard deviation

root(n * p * (1 - p))
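
A worked example tying the binomial mean, variance, and standard deviation cards together; n and p are illustrative:

    n, p = 100, 0.5
    mean = n * p                      # 50.0
    variance = n * p * (1 - p)        # 25.0
    sd = (n * p * (1 - p)) ** 0.5     # 5.0
    print(mean, variance, sd)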

standard deviation formula

square root of variance

