CS 440 Final
Learning from imitation
instead of an explicit reward function, you have expert demonstrations of the task to learn from
Policy search
instead of getting the Q-values exactly right, you only need to get their ordering right. Write down the policy as a function of some parameters and adjust the parameters to improve the expected reward
Dimensionality reduction, manifold learning
Discover a lower-dimensional surface on which the data lives
Clustering
Discover groups of "similar" data points
Conditional Distribution
Distribution over the values of 1 variable given fixed values of other variables
Machine Learning
- Getting a computer to do well on a task without explicitly programming it - Improving performance on a task based on experience
Experimentation Cycle
- Learn parameters on the training set - Tune hyperparameters (implementation choices) on the held out validation set - Evaluate performance on the test set - VERY IMPORTANT: do not peek at the test set!
Other supervision scenarios
- unsupervised learning - semi-supervised learning - active learning - lifelong learning
Other prediction scenarios
- regression - structured prediction
Perceptron training algorithm
1) Initialize weights 2) Cycle through training examples in multiple passes (epochs) 3) For each training example: - If classified correctly, do nothing - If classified incorrectly, update weights
Constructing Bayesian Networks
1. Choose an ordering of variables X1, ..., Xn 2. For i = 1 to n - add Xi to network - select parents from X1, ..., Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1). Deciding conditional independence is hard in noncausal directions
Conditional Independence
A and B are conditionally independent given C iff P(A ∧ B | C) = P(A | C) P(B | C) OR P(A | B, C) = P(A | C) OR P(B | A, C) = P(B | C)
Atomic Event
A complete specification of the state of the world, i.e. a complete assignment of domain values to all random variables: P(X1 = x1, X2 = x2, ..., Xn = xn). Atomic events are mutually exclusive and exhaustive
Why should a rational agent hold beliefs that are consistent with axioms of probability?
An agent who holds beliefs inconsistent with axioms of probability can be convinced to accept a combination of bets that is guaranteed to lose them money
Hidden Markov Models
At each time slice t, the state of the world is described by an unobservable variable Xt and an observable evidence variable Et Transition Model: distribution over the current state given the whole past history: P(Xt | X0, ..., Xt-1) = P(Xt | X0:t-1) Observation Model: P(Et | X0:t, E1:t-1)
The Forward Algorithm
Base case: priors P(X0) Suppose we know P(Xt-1 | e1:t-1) Prediction: propagate belief from Xt-1 to Xt: P(Xt | e1:t-1) = ∑P(Xt | xt-1)P(xt-1 | e1:t-1) Correction: weight by evidence et: P(Xt | e1:t) = P(Xt | et, e1:t-1) ≅ P(et | Xt)P(Xt | e1:t-1) Renormalize so that all P(Xt = x | e1:t) sum to 1
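A minimal sketch of one forward-algorithm step in Python, assuming a tabular HMM; forward_step, transition, and obs_likelihood are hypothetical names, and the 2-state numbers are made up for illustration:

```python
import numpy as np

def forward_step(prev_belief, transition, obs_likelihood):
    """One predict-correct step of the forward algorithm.

    prev_belief[i]    = P(X_{t-1} = i | e_{1:t-1})
    transition[i, j]  = P(X_t = j | X_{t-1} = i)
    obs_likelihood[j] = P(e_t | X_t = j)
    Returns P(X_t | e_{1:t}) as a normalized vector.
    """
    # Prediction: propagate belief through the transition model
    predicted = prev_belief @ transition          # P(X_t | e_{1:t-1})
    # Correction: weight by the evidence likelihood, then renormalize
    corrected = obs_likelihood * predicted
    return corrected / corrected.sum()

# Toy 2-state example (hypothetical numbers)
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
belief = np.array([0.5, 0.5])                     # prior P(X_0)
belief = forward_step(belief, T, np.array([0.9, 0.2]))
print(belief)
```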
Decision Tree classifier
Classifies an example by testing a sequence of attributes: internal nodes test attribute values, branches correspond to test outcomes, and leaves assign a class
Multi-layer Neural Network
Can learn nonlinear functions f(x) = σ(∑j w'j σ(∑k wjk xk)) The hidden layer size controls the network's capacity Training: find network weights to minimize the error between true and estimated labels of training examples: E(f) = ∑(yi - f(xi))^2 Minimization can be done by gradient descent provided f is differentiable (back-propagation)
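A minimal forward-pass sketch for a one-hidden-layer network using the sigmoid nonlinearity from the formula above; mlp_forward, W_hidden, and w_out are hypothetical names, and the weights are random placeholders:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, W_hidden, w_out):
    """f(x) = sigma( sum_j w'_j * sigma( sum_k w_jk * x_k ) )."""
    hidden = sigmoid(W_hidden @ x)   # hidden-layer activations
    return sigmoid(w_out @ hidden)   # scalar output in (0, 1)

# Hypothetical 3-input, 4-hidden-unit network with random weights
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 3))
w_out = rng.normal(size=4)
print(mlp_forward(np.array([1.0, -2.0, 0.5]), W_hidden, w_out))
```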
Markov Decision Process
Components: - States s, beginning with initial state s0 - Actions a - each state s has actions A(s) available from it - Transition model P(s' | s, a) - Reward function R(s)
Evaluation
Compute the probability of the current sequence P(e1:t) Recursive formulation: suppose we know P(e1:t-1) P(e1:t) = P(e1:t-1, et) = P(e1:t-1)P(et | e1:t-1) = P(e1:t-1)∑P(et, xt | e1:t-1) = P(e1:t-1)∑P(et | xt, e1:t-1)P(xt | e1:t-1) = P(e1:t-1)∑P(et | xt)P(xt | e1:t-1) So P(e1:t) = P(e1:t-1)∑P(et | xt)P(xt | e1:t-1): the recursion multiplies the previous sequence probability by a sum that reuses the filtering distribution P(xt | e1:t-1)
Model Parameters
Conditional probability tables for Bayesian networks
Monty Hall Problem
A contestant on a game show faces 3 closed doors, behind 1 of which is a prize. The contestant chooses a door; the host opens one of the other doors, reveals there is no prize behind it, and offers a chance to switch. EU(Switch) = (1/3)(0) + (2/3)(Prize) EU(Not Switch) = (1/3)(Prize) + (2/3)(0) TLDR: Switch
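A quick simulation sketch that checks the expected-utility argument empirically; monty_hall_trial is a hypothetical helper and the 100,000-trial count is arbitrary:

```python
import random

def monty_hall_trial(switch):
    """Simulate one game; return True if the contestant wins the prize."""
    prize = random.randrange(3)
    choice = random.randrange(3)
    # Host opens a door that is neither the contestant's choice nor the prize
    opened = next(d for d in range(3) if d != choice and d != prize)
    if switch:
        choice = next(d for d in range(3) if d != choice and d != opened)
    return choice == prize

trials = 100_000
print("switch:", sum(monty_hall_trial(True) for _ in range(trials)) / trials)   # ~2/3
print("stay:  ", sum(monty_hall_trial(False) for _ in range(trials)) / trials)  # ~1/3
```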
Parameter Smoothing
Dealing with words that were never seen or seen too few times
Update rule for differentiable perceptron
Define total classification error or loss on the training set: E(w) = ∑(yj - fw(xj))^2, where fw(xj) = σ(w*xj) Update weights by gradient descent: w <- w - α ∂E/∂w ∂E/∂w = ∑[-2(yj - f(xj))σ'(w*xj) ∂(w*xj)/∂w] = ∑[-2(yj - f(xj))σ(w*xj)(1-σ(w*xj))xj] For a single training point, the update is: w <- w + α(y - f(x))σ(w*x)(1-σ(w*x))x
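A minimal sketch of the single-point gradient update above, assuming the sigmoid output and squared-error loss; sgd_step is a hypothetical name:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(w, x, y, alpha):
    """Single-example gradient update for the differentiable perceptron.

    w and x are vectors; y is the target label; alpha is the learning rate.
    Uses f(x) = sigma(w . x) and the squared-error loss.
    """
    f = sigmoid(w @ x)
    # w <- w + alpha * (y - f(x)) * sigma'(w.x) * x, with sigma' = f * (1 - f)
    return w + alpha * (y - f) * f * (1.0 - f) * x
```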
Bag of Words Representation
Document = sequence of words Order of words = not important Each word is conditionally independent of the others given document class
Expected Utility
EU(a) = ∑P(outcome | a)U(outcome) for all outcomes of a
Maximum a Posteriori (MAP) decision
The estimate of X that minimizes the expected 0-1 loss: the value x of X with the highest posterior probability P(x | e) x = argmax P(X = x | E = e) = argmax P(E = e | X = x)P(X = x) / P(E = e) ≅ argmax P(E = e | X = x)P(X = x) So P(x | e) ≅ P(e | x)P(x) P(x | e) = posterior P(e | x) = likelihood P(x) = prior
Bellman Equation
Expected utility of taking action a in state s = ∑P(s' | s, a)U(s') Choose the optimal action π*(s) = argmax∑P(s' | s, a)U(s') Bellman Equation U(s) = R(s) + y max ∑P(s' | s,a)U(s')
HMM Inference Tasks
Filtering: What is the distribution over the current state Xt given all the evidence so far, e1:t? - the forward algorithm Smoothing: What is the distribution of some state Xk given the entire observation sequence e1:t? - the forward-backward algorithm Evaluation: compute the probability of a given observation sequence e1:t Decoding: What is the most likely state sequence X0:t given the observation sequence e1:t? - the Viterbi algorithm
Density estimation
Find a function that approximates the probability density of the data (i.e. value of the function is high for typical points and low for atypical points) - used for anomaly detection
Linear classifier
Find a linear function to separate the classes f(x) = sgn(w1x1 + w2x2 + ... + wDxD + b) = sgn(w * x + b) Pros - low-dimensional parametric representation - very fast at test time Cons - works for two classes - how to train the linear function? - what if data is not linearly separable?
Marginalization
For P(X = x), sum the probabilities of all atomic events where X=x P(X=x) = P((X=x ∧ Y=y1) ∨ ... ∨ (X = x ∧ Y=yn)) = P((x, y1) ∨ ... ∨ (x, yn)) = ∑P(x, yi) for all i in 1-n
Conditional Probability
For any 2 events A and B, P(A | B) = Probability A given B = P(A, B) / P(B)
Kolmogorov's axioms of probability
For any propositions (events) A, B - 0 <= P(A) <= 1 - P(True) = 1 and P(False) = 0 - P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Perceptron update rule
For each training instance x with label y: - Classify with current weights: y' = sgn(w*x) - Update weights: w <- w + alpha * (y-y') x - alpha = learning rate that should decay as a function of epoch t - If y' is correct, do nothing - If y' is not correct - - if y = 1 and y' = -1, wi will be increased if xi is positive or decreased if xi is negative -> w * x will get bigger - - if y = -1 and y' = 1, wi will be decreased if xi is positive or increased if xi is negative -> w * x will get smaller
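A minimal sketch of a full training loop using this update rule (and the epoch structure from the "Perceptron training algorithm" card), assuming labels in {-1, +1} and the bias folded into the feature vector; train_perceptron and alpha0 are hypothetical names:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, alpha0=1.0):
    """Minimal perceptron trainer: X is (num_examples, num_features), y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for t in range(1, epochs + 1):
        alpha = alpha0 / t                     # learning rate decays with epoch t
        for xi, yi in zip(X, y):
            y_pred = 1 if w @ xi >= 0 else -1  # classify with current weights
            if y_pred != yi:                   # update weights only on mistakes
                w += alpha * (yi - y_pred) * xi
    return w
```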
Marginal Distributions
From joint distribution P(X, Y), can find P(X), P(Y)
Bayesian Inference
General Scenario - Query variables: X - Evidence (observed) variables and their values: E = e - Unobserved variables Y Inference Problem - answer questions about query variables given evidence variables - P(X | E = e) = P(X, e) / P(e) ≅ ∑y P(X, e, y)
Reinforcement Learning
Given a regular MDP but with the transition model and reward function initially unknown, the agent must find a good policy by "learning by doing" In each time step: - Take some action - Observe the outcome of the action: successor state and reward - Update some internal representation of the environment and policy - If you reach a terminal state, just start over (each pass through the environment is called a trial)
Learning (HMM)
Given a training sample of observation sequences, the goal is to compute the model parameters: transition probabilities and observation probabilities If you have complete data, estimate them by relative frequencies; otherwise, use the EM algorithm
Exploitation
Go with the best strategy found so far Pros - Maximize reward as reflected in the current utility estimates - Avoid bad stuff Cons - Might also prevent you from discovering true optimal strategy
Overfitting
Good performance on the training/validation set, poor performance on test set
Bayesian Networks
Graphical models: a way to depict conditional independence relationships between random variables and a compact specification of full joint distributions Nodes: random variables Arcs: interactions - an arrow from one variable to another indicates direct influence - directed acyclic graph Key property: each node is conditionally independent of its non-descendants given its parents Suppose nodes X1, ..., Xn are sorted in topological order; then the joint distribution P(X1, ..., Xn) = ΠP(Xi | X1, ..., Xi-1) = ΠP(Xi | Parents(Xi)) To specify the full joint distribution, we only need to specify a conditional distribution for each node given its parents, P(X | Parents(X))
Incorporating Exploration
Idea: explore more in the beginning, become more and more greedy over time Modified strategy: a = argmax f(∑P(s' | s, a')U(s'), N(s, a')), where f is an exploration function and N(s, a') is the # of times we've taken action a' in state s f(u, n) = R+ if n < Ne (optimistic reward estimate), u otherwise
Unsupervised Learning
Idea: given only unlabeled data as input, learn some sort of structure The objective is often more vague or subjective than in supervised learning This is more of an exploratory/descriptive data analysis
Bayesian Networks Compactness
If boolean Xi has k boolean parents, its conditional probability table needs 2^k rows. If each variable has no more than k parents, the network needs O(n * 2^k) numbers instead of O(2^n) for the full joint distribution
Learning + Inference Pipeline
Learning - Training Samples -> Features + Training Labels -> Training -> Learned Model Inference - Test Sample -> Features + Learned Model -> Prediction
Parameter Learning
Inference Problem: Given values of evidence variables E=e, answer questions about query variables X using the posterior P(X | E=e) Learning Problem: Estimate the parameters of the probabilistic model P(X | E) given a training sample {(x1, e1), ..., (xn, en)} Suppose we know the network structure (but not the parameters) and have a training set of complete observations P(X | Parents(X)) = observed frequencies of different values of X for each combination of parent values Expectation maximization algorithm for dealing with missing data
Discount Factor
For infinite state sequences, discount the individual state rewards by a factor y between 0 and 1: U([s0, s1, s2, ...]) = R(s0) + yR(s1) + y^2R(s2) + ... = ∑y^t R(st) <= Rmax / (1 - y) Sooner rewards count more than later rewards; discounting keeps the total utility bounded and helps algorithms converge
Perceptron
Input with features x1, ..., xD with weights w1, ..., wD leads to output through sgn(w * x + b) Can incorporate bias as component of the weight vector by always including a feature with value set to 1
Active learning
Learning algorithm can choose its own training examples, or ask a "teacher" for an answer on selected inputs
Efficient inference
Key idea: compute the results of subexpressions in a bottom-up way and cache them for later use (dynamic programming) Polynomial time and space for polytrees - networks with at most one undirected path between any two nodes
Model-free learning
Learn how to act without explicitly learning the transition probabilities P(s' | s, a)
Model-based learning
Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and learn how to act (solve the MDP) simultaneously Learning the model - Keep track of how many times state s' follows state s when you take action a and update the transition probability P(s' | s, a) according to the relative frequencies - Keep track of the rewards R(s) Learning how to act - Estimate the utilities U(s) using Bellman's equations - Choose the action that maximizes expected future utility: π*(s) = argmax ∑P(s' | s, a)U(s')
Bayesian Decision Theory
Let x be the value predicted by the agent and x* be the true value of X. The agent has a loss function, which is 0 if x = x* and 1 otherwise. Expected loss for predicting x: Sum(L(x, x*)P(x* | e)) for all x*
Convergence of perceptron update rule
Linearly separable data - converges to perfect solution Non-separable data - converges to a minimum-error solution assuming learning rate decays as O(1/t) and examples are presented in random sequence
Semi-supervised learning
Lots of data is available, but only small portion is labeled
Quantization
Map a continuous input to a discrete (more compact) output
Bayesian Network Inference is NP-Hard
NP-hard, more precisely #P-hard: equivalent to counting satisfying assignments Can reduce satisfiability to Bayesian network inference (truth-setting components -> clause-satisfaction testing components -> overall-satisfaction testing component)
Multi-class perceptrons (one-vs-others framework)
Need to keep a weight vector wc for each class c Decision rule: c = argmax wc * x Update rule: suppose example from class c gets misclassified as c' - Update for c: wc <- wc + alpha * x - Update for c': wc' <- wc' - alpha * x
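A minimal sketch of the one-vs-others decision and update rules, assuming the per-class weight vectors are stacked as rows of a matrix W; multiclass_perceptron_step is a hypothetical name:

```python
import numpy as np

def multiclass_perceptron_step(W, x, c_true, alpha=1.0):
    """One update of a one-vs-others multi-class perceptron.

    W is a (num_classes, num_features) matrix of per-class weight vectors.
    """
    c_pred = int(np.argmax(W @ x))        # decision rule: argmax_c w_c . x
    if c_pred != c_true:                  # on a mistake, pull the true class
        W[c_true] += alpha * x            # toward x and push the wrongly
        W[c_pred] -= alpha * x            # predicted class away from it
    return W
```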
Bayes Rule
P(A | B) = P(B | A)P(A) / P(B) Can update our beliefs about A based on evidence B (P(A) = prior, P(A | B) = posterior) Key tool for probabilistic inference: can get diagnostic probability from causal probability
Mutually Exclusive Events
P(A ∨ B) = P(A) + P(B) Note that mutually exclusive events are not independent
Product Rule
P(A, B) = P(A | B)P(B) = P(B | A)P(A)
Chain Product Rule
P(A1, ..., An) = P(A1)P(A2 | A1)P(A3 | A1, A2)...P(An | A1, ..., An-1) = ΠP(Ai | A1, ..., Ai-1) from i = 1 to n
Law of Total Probability
P(X = x) = ∑P(X = x, Y = yi) = ∑P(X = x | Y = yi)P(Y = yi)
Softmax
P(c | x) = exp(wc * x) / Sum(exp(wk * x))
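A minimal sketch computing these probabilities for all classes at once; softmax_probs is a hypothetical name, and subtracting the maximum score is a standard numerical-stability trick not mentioned on the card:

```python
import numpy as np

def softmax_probs(W, x):
    """P(c | x) = exp(w_c . x) / sum_k exp(w_k . x) for each class c."""
    scores = W @ x
    scores = scores - scores.max()        # shift scores for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()
```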
Naive Bayes For Document Classification
P(class | document) ≅ P(class) * ΠP(wi | class)
Maximum Likelihood Estimate for words given class
P(word | class) = (# of occurrences of this word in docs from this class) / (total # of words in docs from this class) This choice maximizes the likelihood of the training data, ΠdΠi P(w(d,i) | class(d)), where d = index of training document and i = index of word within the document
Laplacian smoothing
Pretend you have seen every vocabulary word one more time than you actually did P(word | class) = (# of occurrences of this word in docs from this class + 1) / (total # of words in docs from this class + V), where V = size of the vocabulary
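A minimal sketch of these smoothed estimates for one class, assuming documents are already tokenized into word lists; smoothed_word_probs is a hypothetical name:

```python
from collections import Counter

def smoothed_word_probs(docs_in_class, vocabulary):
    """Laplace-smoothed P(word | class) from a list of tokenized documents.

    Each document is a list of words; vocabulary is the set of all known words.
    """
    counts = Counter(word for doc in docs_in_class for word in doc)
    total = sum(counts.values())          # total # of words in docs from this class
    V = len(vocabulary)
    return {w: (counts[w] + 1) / (total + V) for w in vocabulary}
```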
Subjectivism
Probabilities are degrees of belief How do we assign belief values? What would constrain agents to hold consistent beliefs?
Frequentism
Probabilities are relative frequencies "Reference class" problem / dealing with events that only happen once
Function approximation
So far, assumed a lookup table representation for utility function U(s) or action-utility function Q(s,a) Approximate the utility function as a weighted linear combination of features for large state space - U(s) = w1f1(s) + w2f2(s) + ... + wnfn(s) - RL algorithms can be modified to estimate these weights - More generally, functions can be nonlinear (e.g. neural networks) Benefits: - Can handle very large state spaces (games), continuous state spaces (robot control) - Can generalize to previously unseen states
Value Iteration
Start out with U(s) = 0 Iterate until convergence - During the ith iteration, update the utility of each state according to U_{i+1}(s) <- R(s) + y max ∑P(s' | s,a)U_i(s') In the limit of infinitely many iterations, guaranteed to find the correct utility values; in practice, the utility estimates (and the policy they induce) plateau after relatively few iterations
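A minimal value-iteration sketch for a small tabular MDP, assuming a hypothetical interface where P[(s, a)] lists (successor, probability) pairs, R[s] gives the reward for state s, and actions(s) returns the nonempty list of actions available in s:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Iterate the Bellman update U(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            # Best expected future utility over the actions available in s
            best = max(sum(p * U[s2] for s2, p in P[(s, a)]) for a in actions(s))
            U_new[s] = R[s] + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new
```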
Policy iteration
Start with some initial policy π0 and alternate between the following steps: - Policy evaluation: calculate U(s) for every state s under the current policy - Policy improvement: calculate a new policy π_{i+1} based on the updated utilities: π_{i+1}(s) = argmax ∑P(s' | s,a)U(s') For a fixed policy π, the Bellman equation simplifies to U(s) = R(s) + y∑P(s' | s, π(s))U(s')
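A minimal policy-iteration sketch using the same hypothetical MDP interface as the value-iteration sketch above; here policy evaluation iterates the fixed-policy Bellman update for a fixed number of sweeps rather than solving the linear system exactly:

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=50):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    pi = {s: actions(s)[0] for s in states}          # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s')
        for _ in range(eval_sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, pi[s])])
                 for s in states}
        # Policy improvement: act greedily with respect to the updated utilities
        new_pi = {s: max(actions(s),
                         key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)]))
                  for s in states}
        if new_pi == pi:
            return pi, U
        pi = new_pi
```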
Probabilistic Inference
Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e Partially observable, stochastic, episodic environment
Naive Bayes Model
Suppose we have many different types of observations E1, ..., En that we want to use to obtain evidence about an underlying hypothesis X MAP decision: x = argmax P(X = x | E1 = e1, ..., En = en) ≅ argmax P(X = x)P(E1 = e1, ..., En = en | X = x) We can make the simplifying assumption that the different features are conditionally independent given the hypothesis: P(E1 = e1, ..., En = en | X = x) = ΠP(Ei = ei | X = x) Posterior: P(X = x | E1 = e1, ..., En = en) Prior: P(x) Likelihood: ΠP(Ei = ei | X = x) MAP decision: x = argmax P(x | e) ≅ argmax P(x) * ΠP(ei | x)
Exploration
Take a new action with unknown consequences Pros - Get a more accurate model of the environment - Discover higher-reward states than the ones found so far Cons - When you're exploring, you're not maximizing your utility - Something bad might happen
Filtering (HMMs)
Task: compute the probability distribution over the current state given all the evidence so far: P(Xt | e1:t) Recursive formulation: suppose we know P(Xt-1 | e1:t-1) Prediction: P(Xt | e1:t-1) = ∑P(Xt, xt-1 | e1:t-1) = ∑P(Xt | xt-1, e1:t-1)P(xt-1 | e1:t-1) = ∑P(Xt | xt-1)P(xt-1 | e1:t-1) Correction: P(Xt | et, e1:t-1) = P(et | Xt, e1:t-1)P(Xt | e1:t-1) / P(et | e1:t-1) ≅ P(et | Xt)P(Xt | e1:t-1)
Markov Assumption
The current state is conditionally independent of all the other states given the state in the previous time step (first order) Transition Model: P(Xt | X0:t-1) = P(Xt | Xt-1) For observations, the evidence at time t depends only on the state at time t Observation Model: P(Et | X0:t, E1:t-1) = P(Et | Xt)
Solving MDPs for optimal policy
The optimal policy should maximize expected utility over all possible state sequences produced by following that policy: ∑P(sequence)U(sequence) Define the utility of a state sequence as the sum of the rewards of its individual states
Markov Assumption for MDPS
The probability of going to s' from s depends only on s and a and not on any past actions or states
Normalization Trick for Conditional Distribution
To get the whole conditional distribution P(X | Y=y) at once, select all entries in joint distribution table matching Y=y and renormalize them to sum to 1
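A tiny sketch of the trick, assuming the joint table is stored as a dict keyed by (x, y) value pairs; conditional_from_joint is a hypothetical name:

```python
def conditional_from_joint(joint, y):
    """Select joint entries matching Y=y and renormalize to get P(X | Y=y).

    joint maps (x, y) value pairs to probabilities.
    """
    selected = {x: p for (x, yv), p in joint.items() if yv == y}
    total = sum(selected.values())
    return {x: p / total for x, p in selected.items()}
```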
Joint Distribution for HMMs
Transition Model: P(Xt | Xt-1) Observation Model: P(Et | Xt) So P(X0:t, E1:t) = P(X0) ΠP(Xi | Xi-1) P(Ei | Xi)
Independence
Two events A and B are independent if and only if P(A ∧ B) = P(A, B) = P(A) P(B)
Decoding
Viterbi algorithm Task: given observation sequence e1:t, compute the most likely state sequence x0:t x*0:t = argmax P(x0:t | e1:t) The most likely path that ends in a particular state xt consists of the most likely path to some state xt-1 followed by the transition to xt Let mt(xt) denote the probability of the most likely path that ends in xt: mt(xt) = max P(x0:t-1, xt | e1:t) ≅ max P(x0:t-1, xt, e1:t) = max[mt-1(xt-1)P(xt | xt-1)P(et | xt)]
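A minimal Viterbi sketch in log space, assuming the same tabular HMM representation as the forward-algorithm sketch and strictly positive probabilities; viterbi and obs_likelihoods are hypothetical names:

```python
import numpy as np

def viterbi(prior, transition, obs_likelihoods):
    """Most likely state sequence x*_{0:T} for a tabular HMM.

    prior[i]           = P(X_0 = i)
    transition[i, j]   = P(X_t = j | X_{t-1} = i)
    obs_likelihoods[t] = vector of P(e_t | X_t = j), for t = 1..T
    """
    m = np.log(prior)                       # log m_0(x_0)
    back = []
    for lik in obs_likelihoods:
        # scores[i, j] = log[m_{t-1}(i) * P(X_t=j | X_{t-1}=i) * P(e_t | X_t=j)]
        scores = m[:, None] + np.log(transition) + np.log(lik)[None, :]
        back.append(scores.argmax(axis=0))  # best predecessor for each state
        m = scores.max(axis=0)              # log m_t(x_t)
    # Trace back the most likely path from the best final state
    path = [int(m.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))
```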
Generalization
We want a classifier that does well on never-before-seen data
Temporal Difference (TD) learning
We don't want to learn (and don't know) P(s' | s, a). Temporal difference (TD) update: - Pretend that the currently observed transition (s, a, s') is the only possible outcome and adjust the Q values towards the "local" equilibrium Qlocal(s,a) = R(s) + y max Q(s', a') Qnew(s, a) = (1 - alpha)Q(s,a) + alpha*Qlocal(s,a) Qnew(s,a) = Q(s,a) + alpha*(Qlocal(s,a) - Q(s,a)) Qnew(s,a) = Q(s,a) + alpha(R(s) + y max Q(s', a') - Q(s, a)) At each time step t - From current state s, select an action a: a = argmax f(Q(s, a'), N(s, a')) - Get the successor state s' - Perform the TD update Q(s,a) = Q(s,a) + alpha(R(s) + y max Q(s', a') - Q(s, a)) alpha = learning rate (should start at 1 and decay as O(1/t))
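A minimal sketch of one TD update on a dict-based Q table; td_update and the default gamma are hypothetical, and unseen (state, action) pairs are treated as having Q = 0:

```python
def td_update(Q, s, a, r, s_next, actions, alpha, gamma=0.9):
    """One Q-learning TD update on a dict-based Q table.

    Q is keyed by (state, action) pairs; actions(s_next) lists the actions
    available in the successor state.
    """
    # Local equilibrium value: R(s) + gamma * max_a' Q(s', a')
    q_local = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions(s_next))
    q_sa = Q.get((s, a), 0.0)
    # Move Q(s, a) a fraction alpha of the way toward the local value
    Q[(s, a)] = q_sa + alpha * (q_local - q_sa)
    return Q
```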
Smoothing
What is the distribution of some state Xk given the entire observation sequence e1:t? Forward-backward algorithm Forward message: P(Xk | e1:k) Backward message: P(ek+1:t | Xk)
Regularization
add a penalty on weight magnitudes to the objective function E(f) = ∑(yi - f(xi))^2 + λ/2 * ∑ wj^2 Discourages network weights from growing too large, encourages network to use all of its inputs "a little" rather than a few inputs "a lot"
Joint Probability Distribution
assignment of probabilities to every possible atomic event P(X1, X2, ... Xn)
Random Variables
describe the state of the world; each random variable takes on values from a domain, and those values must be mutually exclusive and exhaustive
Naive Bayes classifier
f(x) = argmax P(y|x) ≈ argmax P(y) P(x | y) = argmax P(y) ΠP(xd | y)
(k-)Nearest Neighbor classifier
f(x) = label of the training example nearest to x - All we need is a distance function for our inputs - No training required! For a new point, find the k closest points from training data Vote for class label with labels of the k points Pros - simple to implement - decision boundaries not necessarily linear - works for any number of classes - nonparametric method Cons - need good distance function - slow at test time
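A minimal k-NN prediction sketch using Euclidean distance; knn_predict is a hypothetical name, and ties in the vote are broken arbitrarily by Counter.most_common:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every example
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```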
Q-learning
learn an action-utility function Q(s, a) that tells us the value of doing action a in state s Relationship between Q-values and utilities: U(s) = max Q(s, a) Selecting an action: π*(s) = argmax Q(s,a) We don't need to know the transition model to select the next action Equilibrium constraint on Q values: Q(s, a) = R(s) + y∑P(s' | s, a) max Q(s', a')
Decision Theory
probability theory + utility theory
Events
sets of world states; probabilistic statements are defined over events, which are described using propositions about random variables P(A) = probability of the set of world states in which proposition A holds
Policy
the action that an agent takes in any given state -> the solution to an MDP
Utility Theory
used to represent and infer preferences
Maximum Likelihood Decision
x = argmax P(e | x)
Basic Supervised Learning Framework
y = f(x) - y = output - f = classification function - x = input Learning: given a training set of labeled examples {(x1, y1), ..., (xN, yN)}, estimate the parameters of the prediction function f Inference: apply f to a never before seen test example x and output the predicted value y = f(x) Key challenge = generalization
Differentiable Perceptron
σ(t) = 1 / (1 + e^-t) Output is σ(w * x + b)