CS 440 Final
Learning from imitation
instead of an explicit reward function, you have expert demonstrations of the task to learn from
Policy search
instead of getting the Q-values exactly right, you only need to get their ordering right. Write down the policy as a function of some parameters and adjust the parameters to improve the expected reward
Dimensionality reduction, manifold learning
Discover a lower-dimensional surface on which the data lives
Clustering
Discover groups of "similar" data points
Conditional Distribution
Distribution over the values of 1 variable given fixed values of other variables
Machine Learning
- Getting a computer to do well on a task without explicitly programming it - Improving performance on a task based on experience
Experimentation Cycle
- Learn parameters on the training set - Tune hyperparameters (implementation choices) on the held out validation set - Evaluate performance on the test set - VERY IMPORTANT: do not peek at the test set!
Other supervision scenarios
- unsupervised learning - semi-supervised learning - active learning - lifelong learning
Other prediction scenarios
- regression - structured prediction
Perceptron training algorithm
1) Initialize weights 2) Cycle through training examples in multiple passes (epochs) 3) For each training example: - If classified correctly, do nothing - If classified incorrectly, update weights
Constructing Bayesian Networks
1. Choose an ordering of variables X1, ..., Xn 2. For i = 1 to n - add Xi to network - select parents from X1, ..., Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1). Deciding conditional independence is hard in noncausal directions
Conditional Independence
A and B are conditionally independent given C iff P(A ∧ B | C) = P(A | C) P(B | C) OR P(A | B, C) = P(A | C) OR P(B | A, C) = P(B | C)
Atomic Event
A complete specification of the state of the world, i.e. a complete assignment of domain values to all random variables: P(X1 = x1, X2 = x2, ..., Xn = xn). Atomic events are mutually exclusive and exhaustive
Why should a rational agent hold beliefs that are consistent with axioms of probability?
An agent who holds beliefs inconsistent with axioms of probability can be convinced to accept a combination of bets that is guaranteed to lose them money
Hidden Markov Models
At each time slice t, the state of the world is described by an unobservable variable Xt and an observable evidence variable Et Transition Model: distribution over the current state given the whole past history: P(Xt | X0, ..., Xt-1) = P(Xt | X0:t-1) Observation Model: P(Et | X0:t, E1:t-1)
The Forward Algorithm
Base case: priors P(X0) Suppose we know P(Xt-1 | e1:t-1) Prediction: propagate belief from Xt-1 to Xt: P(Xt | e1:t-1) = ∑P(Xt | xt-1)P(xt-1 | e1:t-1) Correction: weight by evidence et: P(Xt | e1:t) = P(Xt | et, e1:t-1) ≅ P(et | Xt)P(Xt | e1:t-1) Renormalize so that all P(Xt = x | e1:t) sum to 1
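A minimal sketch of one forward-algorithm step in Python, assuming a tabular HMM; forward_step, transition, and obs_likelihood are hypothetical names, and the 2-state numbers are made up for illustration:

```python
import numpy as np

def forward_step(prev_belief, transition, obs_likelihood):
    """One predict-correct step of the forward algorithm.

    prev_belief[i]    = P(X_{t-1} = i | e_{1:t-1})
    transition[i, j]  = P(X_t = j | X_{t-1} = i)
    obs_likelihood[j] = P(e_t | X_t = j)
    Returns P(X_t | e_{1:t}) as a normalized vector.
    """
    # Prediction: propagate belief through the transition model
    predicted = prev_belief @ transition          # P(X_t | e_{1:t-1})
    # Correction: weight by the evidence likelihood, then renormalize
    corrected = obs_likelihood * predicted
    return corrected / corrected.sum()

# Toy 2-state example (hypothetical numbers)
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
belief = np.array([0.5, 0.5])                     # prior P(X_0)
belief = forward_step(belief, T, np.array([0.9, 0.2]))
print(belief)
```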
Decision Tree classifier
Classifies an example by testing a sequence of attributes: internal nodes test attribute values, branches correspond to test outcomes, and leaves assign a class
Multi-layer Neural Network
Can learn nonlinear functions f(x) = σ(∑j w'j σ(∑k wjk xk)) The hidden layer size controls the network's capacity Training: find network weights to minimize the error between true and estimated labels of training examples: E(f) = ∑(yi - f(xi))^2 Minimization can be done by gradient descent provided f is differentiable (back-propagation)
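A minimal forward-pass sketch for a one-hidden-layer network using the sigmoid nonlinearity from the formula above; mlp_forward, W_hidden, and w_out are hypothetical names, and the weights are random placeholders:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, W_hidden, w_out):
    """f(x) = sigma( sum_j w'_j * sigma( sum_k w_jk * x_k ) )."""
    hidden = sigmoid(W_hidden @ x)   # hidden-layer activations
    return sigmoid(w_out @ hidden)   # scalar output in (0, 1)

# Hypothetical 3-input, 4-hidden-unit network with random weights
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 3))
w_out = rng.normal(size=4)
print(mlp_forward(np.array([1.0, -2.0, 0.5]), W_hidden, w_out))
```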
Markov Decision Process
Components: - States s, beginning with initial state s0 - Actions a - each state s has actions A(s) available from it - Transition model P(s' | s, a) - Reward function R(s)
Evaluation
Compute the probability of the current sequence P(e1:t) Recursive formulation: suppose we know P(e1:t-1) P(e1:t) = P(e1:t-1, et) = P(e1:t-1)P(et | e1:t-1) = P(e1:t-1)∑P(et, xt | e1:t-1) = P(e1:t-1)∑P(et | xt, e1:t-1)P(xt | e1:t-1) = P(e1:t-1)∑P(et | xt)P(xt | e1:t-1) So P(e1:t) = P(e1:t-1)∑P(et | xt)P(xt | e1:t-1): the recursion multiplies the previous sequence probability by a sum that reuses the filtering distribution P(xt | e1:t-1)
Model Parameters
Conditional probability tables for Bayesian networks
Monty Hall Problem
A contestant on a game show faces 3 closed doors, behind 1 of which is a prize. The contestant chooses a door; the host opens one of the other doors, reveals there is no prize behind it, and offers a chance to switch. EU(Switch) = (1/3)(0) + (2/3)(Prize) EU(Not Switch) = (1/3)(Prize) + (2/3)(0) TLDR: Switch
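A quick simulation sketch that checks the expected-utility argument empirically; monty_hall_trial is a hypothetical helper and the 100,000-trial count is arbitrary:

```python
import random

def monty_hall_trial(switch):
    """Simulate one game; return True if the contestant wins the prize."""
    prize = random.randrange(3)
    choice = random.randrange(3)
    # Host opens a door that is neither the contestant's choice nor the prize
    opened = next(d for d in range(3) if d != choice and d != prize)
    if switch:
        choice = next(d for d in range(3) if d != choice and d != opened)
    return choice == prize

trials = 100_000
print("switch:", sum(monty_hall_trial(True) for _ in range(trials)) / trials)   # ~2/3
print("stay:  ", sum(monty_hall_trial(False) for _ in range(trials)) / trials)  # ~1/3
```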
Parameter Smoothing
Dealing with words that were never seen or seen too few times
Update rule for differentiable perceptron
Define total classification error or loss on the training set: E(w) = ∑(yj - fw(xj))^2, where fw(xj) = σ(w*xj) Update weights by gradient descent: w <- w - α ∂E/∂w ∂E/∂w = ∑[-2(yj - f(xj))σ'(w*xj) ∂(w*xj)/∂w] = ∑[-2(yj - f(xj))σ(w*xj)(1-σ(w*xj))xj] For a single training point, the update is: w <- w + α(y - f(x))σ(w*x)(1-σ(w*x))x
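A minimal sketch of the single-point gradient update above, assuming the sigmoid output and squared-error loss; sgd_step is a hypothetical name:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(w, x, y, alpha):
    """Single-example gradient update for the differentiable perceptron.

    w and x are vectors; y is the target label; alpha is the learning rate.
    Uses f(x) = sigma(w . x) and the squared-error loss.
    """
    f = sigmoid(w @ x)
    # w <- w + alpha * (y - f(x)) * sigma'(w.x) * x, with sigma' = f * (1 - f)
    return w + alpha * (y - f) * f * (1.0 - f) * x
```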
Bag of Words Representation
Document = sequence of words Order of words = not important Each word is conditionally independent of the others given document class
Expected Utility
EU(a) = ∑P(outcome | a)U(outcome) for all outcomes of a
Maximum a Posteriori (MAP) decision
The estimate of X that minimizes the expected 0-1 loss: the value x of X with the highest posterior probability P(x | e) x = argmax P(X = x | E = e) = argmax P(E = e | X = x)P(X = x) / P(E = e) ≅ argmax P(E = e | X = x)P(X = x) So P(x | e) ≅ P(e | x)P(x) P(x | e) = posterior P(e | x) = likelihood P(x) = prior
Bellman Equation
Expected utility of taking action a in state s = ∑P(s' | s, a)U(s') Choose the optimal action π*(s) = argmax∑P(s' | s, a)U(s') Bellman Equation U(s) = R(s) + y max ∑P(s' | s,a)U(s')
HMM Inference Tasks
Filtering: What is the distribution over the current state Xt given all the evidence so far, e1:t? - the forward algorithm Smoothing: What is the distribution of some state Xk given the entire observation sequence e1:t? - the forward-backward algorithm Evaluation: compute the probability of a given observation sequence e1:t Decoding: What is the most likely state sequence X0:t given the observation sequence e1:t? - the Viterbi algorithm
Density estimation
Find a function that approximates the probability density of the data (i.e. value of the function is high for typical points and low for atypical points) - used for anomaly detection
Linear classifier
Find a linear function to separate the classes f(x) = sgn(w1x1 + w2x2 + ... + wDxD + b) = sgn(w * x + b) Pros - low-dimensional parametric representation - very fast at test time Cons - works for two classes - how to train the linear function? - what if data is not linearly separable?
Marginalization
For P(X = x), sum the probabilities of all atomic events where X=x P(X=x) = P((X=x ∧ Y=y1) ∨ ... ∨ (X = x ∧ Y=yn)) = P((x, y1) ∨ ... ∨ (x, yn)) = ∑P(x, yi) for all i in 1-n
Conditional Probability
For any 2 events A and B, P(A | B) = Probability A given B = P(A, B) / P(B)
Kolmogorov's axioms of probability
For any propositions (events) A, B - 0 <= P(A) <= 1 - P(True) = 1 and P(False) = 0 - P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Perceptron update rule
For each training instance x with label y: - Classify with current weights: y' = sgn(w*x) - Update weights: w <- w + alpha * (y-y') x - alpha = learning rate that should decay as a function of epoch t - If y' is correct, do nothing - If y' is not correct - - if y = 1 and y' = -1, wi will be increased if xi is positive or decreased if xi is negative -> w * x will get bigger - - if y = -1 and y' = 1, wi will be decreased if xi is positive or increased if xi is negative -> w * x will get smaller
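A minimal sketch of a full training loop using this update rule (and the epoch structure from the "Perceptron training algorithm" card), assuming labels in {-1, +1} and the bias folded into the feature vector; train_perceptron and alpha0 are hypothetical names:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, alpha0=1.0):
    """Minimal perceptron trainer: X is (num_examples, num_features), y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for t in range(1, epochs + 1):
        alpha = alpha0 / t                     # learning rate decays with epoch t
        for xi, yi in zip(X, y):
            y_pred = 1 if w @ xi >= 0 else -1  # classify with current weights
            if y_pred != yi:                   # update weights only on mistakes
                w += alpha * (yi - y_pred) * xi
    return w
```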
Marginal Distributions
From joint distribution P(X, Y), can find P(X), P(Y)
Bayesian Inference
General Scenario - Query variables: X - Evidence (observed) variables and their values: E = e - Unobserved variables Y Inference Problem - answer questions about query variables given evidence variables - P(X | E = e) = P(X, e) / P(e) ≅ ∑y P(X, e, y)
Reinforcement Learning
Given a regular MDP but with the transition model and reward function initially unknown, the agent must find a good policy by "learning by doing" In each time step: - Take some action - Observe the outcome of the action: successor state and reward - Update some internal representation of the environment and policy - If you reach a terminal state, just start over (each pass through the environment is called a trial)
Learning (HMM)
Given a training sample of observation sequences, the goal is to compute the model parameters: transition probabilities and observation probabilities If you have complete data, estimate them by relative frequencies; otherwise, use the EM algorithm
Exploitation
Go with the best strategy found so far Pros - Maximize reward as reflected in the current utility estimates - Avoid bad stuff Cons - Might also prevent you from discovering true optimal strategy
Overfitting
Good performance on the training/validation set, poor performance on test set
Bayesian Networks
Graphical models: a way to depict conditional independence relationships between random variables and a compact specification of full joint distributions Nodes: random variables Arcs: interactions - an arrow from one variable to another indicates direct influence - directed acyclic graph Key property: each node is conditionally independent of its non-descendants given its parents Suppose nodes X1, ..., Xn are sorted in topological order; then the joint distribution P(X1, ..., Xn) = ΠP(Xi | X1, ..., Xi-1) = ΠP(Xi | Parents(Xi)) To specify the full joint distribution, we only need to specify a conditional distribution for each node given its parents, P(X | Parents(X))
Incorporating Exploration
Idea: explore more in the beginning, become more and more greedy over time Modified strategy: a = argmax f(∑P(s' | s, a')U(s'), N(s, a')), where f is an exploration function and N(s, a') is the # of times we've taken action a' in state s f(u, n) = R+ if n < Ne (optimistic reward estimate), u otherwise
Unsupervised Learning
Idea: given only unlabeled data as input, learn some sort of structure The objective is often more vague or subjective than in supervised learning This is more of an exploratory/descriptive data analysis
Bayesian Networks Compactness
If boolean Xi has k boolean parents, its conditional probability table needs 2^k rows. If each variable has no more than k parents, the network needs O(n * 2^k) numbers instead of O(2^n) for the full joint distribution
Learning + Inference Pipeline
Learning - Training Samples -> Features + Training Labels -> Training -> Learned Model Inference - Test Sample -> Features + Learned Model -> Prediction
Parameter Learning
Inference Problem: Given values of evidence variables E=e, answer questions about query variables X using the posterior P(X | E=e) Learning Problem: Estimate the parameters of the probabilistic model P(X | E) given a training sample {(x1, e1), ..., (xn, en)} Suppose we know the network structure (but not the parameters) and have a training set of complete observations P(X | Parents(X)) = observed frequencies of different values of X for each combination of parent values Expectation maximization algorithm for dealing with missing data
Discount Factor
For infinite state sequences, discount the individual state rewards by a factor y between 0 and 1: U([s0, s1, s2, ...]) = R(s0) + yR(s1) + y^2R(s2) + ... = ∑y^t R(st) <= Rmax / (1 - y) Sooner rewards count more than later rewards; discounting keeps the total utility bounded and helps algorithms converge
Perceptron
Input with features x1, ..., xD with weights w1, ..., wD leads to output through sgn(w * x + b) Can incorporate bias as component of the weight vector by always including a feature with value set to 1
Active learning
Learning algorithm can choose its own training examples, or ask a "teacher" for an answer on selected inputs
Efficient inference
Key idea: compute the results of subexpressions in a bottom-up way and cache them for later use (dynamic programming) Polynomial time and space for polytrees - networks with at most one undirected path between any two nodes
Model-free learning
Learn how to act without explicitly learning the transition probabilities P(s' | s, a)
Model-based learning
Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and learn how to act (solve the MDP) simultaneously Learning the model - Keep track of how many times state s' follows state s when you take action a and update the transition probability P(s' | s, a) according to the relative frequencies - Keep track of the rewards R(s) Learning how to act - Estimate the utilities U(s) using Bellman's equations - Choose the action that maximizes expected future utility: π*(s) = argmax ∑P(s' | s, a)U(s')
Bayesian Decision Theory
Let x be the value predicted by the agent and x* be the true value of X. The agent has a loss function, which is 0 if x = x* and 1 otherwise. Expected loss for predicting x: Sum(L(x, x*)P(x* | e)) for all x*
Convergence of perceptron update rule
Linearly separable data - converges to perfect solution Non-separable data - converges to a minimum-error solution assuming learning rate decays as O(1/t) and examples are presented in random sequence
Semi-supervised learning
Lots of data is available, but only small portion is labeled
Quantization
Map a continuous input to a discrete (more compact) output
Bayesian Network Inference is NP-Hard
NP-hard, more precisely #P-hard: equivalent to counting satisfying assignments Can reduce satisfiability to Bayesian network inference (truth-setting components -> clause-satisfaction testing components -> overall-satisfaction testing component)
Multi-class perceptrons (one-vs-others framework)
Need to keep a weight vector wc for each class c Decision rule: c = argmax wc * x Update rule: suppose example from class c gets misclassified as c' - Update for c: wc <- wc + alpha * x - Update for c': wc' <- wc' - alpha * x
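A minimal sketch of the one-vs-others decision and update rules, assuming the per-class weight vectors are stacked as rows of a matrix W; multiclass_perceptron_step is a hypothetical name:

```python
import numpy as np

def multiclass_perceptron_step(W, x, c_true, alpha=1.0):
    """One update of a one-vs-others multi-class perceptron.

    W is a (num_classes, num_features) matrix of per-class weight vectors.
    """
    c_pred = int(np.argmax(W @ x))        # decision rule: argmax_c w_c . x
    if c_pred != c_true:                  # on a mistake, pull the true class
        W[c_true] += alpha * x            # toward x and push the wrongly
        W[c_pred] -= alpha * x            # predicted class away from it
    return W
```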
Bayes Rule
P(A | B) = P(B | A)P(A) / P(B) Can update our beliefs about A based on evidence B (P(A) = prior, P(A | B) = posterior) Key tool for probabilistic inference: can get diagnostic probability from causal probability
Mutually Exclusive Events
P(A ∨ B) = P(A) + P(B) Note that mutually exclusive events are not independent
Product Rule
P(A, B) = P(A | B)P(B) = P(B | A)P(A)
Chain Product Rule
P(A1, ..., An) = P(A1)P(A2 | A1)P(A3 | A1, A2)...P(An | A1, ..., An-1) = ΠP(Ai | A1, ..., Ai-1) from i = 1 to n
Law of Total Probability
P(X = x) = ∑P(X = x, Y = yi) = ∑P(X = x | Y = yi)P(Y = yi)
Softmax
P(c | x) = exp(wc * x) / Sum(exp(wk * x))
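A minimal sketch computing these probabilities for all classes at once; softmax_probs is a hypothetical name, and subtracting the maximum score is a standard numerical-stability trick not mentioned on the card:

```python
import numpy as np

def softmax_probs(W, x):
    """P(c | x) = exp(w_c . x) / sum_k exp(w_k . x) for each class c."""
    scores = W @ x
    scores = scores - scores.max()        # shift scores for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()
```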
Naive Bayes For Document Classification
P(class | document) ≅ P(class) * ΠP(wi | class)
Maximum Likelihood Estimate for words given class
P(word | class) = (# of occurrences of this word in docs from this class) / (total # of words in docs from this class) This choice maximizes the likelihood of the training data, ΠdΠi P(w(d,i) | class(d)), where d = index of training document and i = index of word within the document
Laplacian smoothing
Pretend you have seen every vocabulary word one more time than you actually did P(word | class) = (# of occurrences of this word in docs from this class + 1) / (total # of words in docs from this class + V), where V = size of the vocabulary
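A minimal sketch of these smoothed estimates for one class, assuming documents are already tokenized into word lists; smoothed_word_probs is a hypothetical name:

```python
from collections import Counter

def smoothed_word_probs(docs_in_class, vocabulary):
    """Laplace-smoothed P(word | class) from a list of tokenized documents.

    Each document is a list of words; vocabulary is the set of all known words.
    """
    counts = Counter(word for doc in docs_in_class for word in doc)
    total = sum(counts.values())          # total # of words in docs from this class
    V = len(vocabulary)
    return {w: (counts[w] + 1) / (total + V) for w in vocabulary}
```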
Subjectivism
Probabilities are degrees of belief How do we assign belief values? What would constrain agents to hold consistent beliefs?
Frequentism
Probabilities are relative frequencies "Reference class" problem / dealing with events that only happen once
Function approximation
So far, assumed a lookup table representation for utility function U(s) or action-utility function Q(s,a) Approximate the utility function as a weighted linear combination of features for large state space - U(s) = w1f1(s) + w2f2(s) + ... + wnfn(s) - RL algorithms can be modified to estimate these weights - More generally, functions can be nonlinear (e.g. neural networks) Benefits: - Can handle very large state spaces (games), continuous state spaces (robot control) - Can generalize to previously unseen states
Value Iteration
Start out with U(s) = 0 Iterate until convergence - During the ith iteration, update the utility of each state according to U_{i+1}(s) <- R(s) + y max ∑P(s' | s,a)U_i(s') In the limit of infinitely many iterations, guaranteed to find the correct utility values; in practice, the utility estimates (and the policy they induce) plateau after relatively few iterations
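A minimal value-iteration sketch for a small tabular MDP, assuming a hypothetical interface where P[(s, a)] lists (successor, probability) pairs, R[s] gives the reward for state s, and actions(s) returns the nonempty list of actions available in s:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Iterate the Bellman update U(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            # Best expected future utility over the actions available in s
            best = max(sum(p * U[s2] for s2, p in P[(s, a)]) for a in actions(s))
            U_new[s] = R[s] + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new
```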
Policy iteration
Start with some initial policy π0 and alternate between the following steps: - Policy evaluation: calculate U(s) for every state s under the current policy - Policy improvement: calculate a new policy π_{i+1} based on the updated utilities: π_{i+1}(s) = argmax ∑P(s' | s,a)U(s') For a fixed policy π, the Bellman equation simplifies to U(s) = R(s) + y∑P(s' | s, π(s))U(s')
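A minimal policy-iteration sketch using the same hypothetical MDP interface as the value-iteration sketch above; here policy evaluation iterates the fixed-policy Bellman update for a fixed number of sweeps rather than solving the linear system exactly:

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=50):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    pi = {s: actions(s)[0] for s in states}          # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s')
        for _ in range(eval_sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, pi[s])])
                 for s in states}
        # Policy improvement: act greedily with respect to the updated utilities
        new_pi = {s: max(actions(s),
                         key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)]))
                  for s in states}
        if new_pi == pi:
            return pi, U
        pi = new_pi
```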
Probabilistic Inference
Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e Partially observable, stochastic, episodic environment
Naive Bayes Model
Suppose we have many different types of observations E1, ..., En that we want to use to obtain evidence about an underlying hypothesis X MAP decision: x = argmax P(X = x | E1 = e1, ..., En = en) ≅ argmax P(X = x)P(E1 = e1, ..., En = en | X = x) We can make the simplifying assumption that the different features are conditionally independent given the hypothesis: P(E1 = e1, ..., En = en | X = x) = ΠP(Ei = ei | X = x) Posterior: P(X = x | E1 = e1, ..., En = en) Prior: P(x) Likelihood: ΠP(Ei = ei | X = x) MAP decision: x = argmax P(x | e) ≅ argmax P(x) * ΠP(ei | x)
Exploration
Take a new action with unknown consequences Pros - Get a more accurate model of the environment - Discover higher-reward states than the ones found so far Cons - When you're exploring, you're not maximizing your utility - Something bad might happen
Filtering (HMMs)
Task: compute the probability distribution over the current state given all the evidence so far: P(Xt | e1:t) Recursive formulation: suppose we know P(Xt-1 | e1:t-1) Prediction: P(Xt | e1:t-1) = ∑P(Xt, xt-1 | e1:t-1) = ∑P(Xt | xt-1, e1:t-1)P(xt-1 | e1:t-1) = ∑P(Xt | xt-1)P(xt-1 | e1:t-1) Correction: P(Xt | et, e1:t-1) = P(et | Xt, e1:t-1)P(Xt | e1:t-1) / P(et | e1:t-1) ≅ P(et | Xt)P(Xt | e1:t-1)
Markov Assumption
The current state is conditionally independent of all the other states given the state in the previous time step (first order) Transition Model: P(Xt | X0:t-1) = P(Xt | Xt-1) For observations, the evidence at time t depends only on the state at time t Observation Model: P(Et | X0:t, E1:t-1) = P(Et | Xt)
Solving MDPs for optimal policy
The optimal policy should maximize expected utility over all possible state sequences produced by following that policy: ∑P(sequence)U(sequence) Define the utility of a state sequence as the sum of the rewards of its individual states
Markov Assumption for MDPS
The probability of going to s' from s depends only on s and a and not on any past actions or states
Normalization Trick for Conditional Distribution
To get the whole conditional distribution P(X | Y=y) at once, select all entries in joint distribution table matching Y=y and renormalize them to sum to 1
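A tiny sketch of the trick, assuming the joint table is stored as a dict keyed by (x, y) value pairs; conditional_from_joint is a hypothetical name:

```python
def conditional_from_joint(joint, y):
    """Select joint entries matching Y=y and renormalize to get P(X | Y=y).

    joint maps (x, y) value pairs to probabilities.
    """
    selected = {x: p for (x, yv), p in joint.items() if yv == y}
    total = sum(selected.values())
    return {x: p / total for x, p in selected.items()}
```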
Joint Distribution for HMMs
Transition Model: P(Xt | Xt-1) Observation Model: P(Et | Xt) So P(X0:t, E1:t) = P(X0) ΠP(Xi | Xi-1) P(Ei | Xi)
Independence
Two events A and B are independent if and only if P(A ∧ B) = P(A, B) = P(A) P(B)
Decoding
Viterbi algorithm Task: given observation sequence e1:t, compute the most likely state sequence x0:t x*0:t = argmax P(x0:t | e1:t) The most likely path that ends in a particular state xt consists of the most likely path to some state xt-1 followed by the transition to xt Let mt(xt) denote the probability of the most likely path that ends in xt: mt(xt) = max P(x0:t-1, xt | e1:t) ≅ max P(x0:t-1, xt, e1:t) = max[mt-1(xt-1)P(xt | xt-1)P(et | xt)]
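A minimal Viterbi sketch in log space, assuming the same tabular HMM representation as the forward-algorithm sketch and strictly positive probabilities; viterbi and obs_likelihoods are hypothetical names:

```python
import numpy as np

def viterbi(prior, transition, obs_likelihoods):
    """Most likely state sequence x*_{0:T} for a tabular HMM.

    prior[i]           = P(X_0 = i)
    transition[i, j]   = P(X_t = j | X_{t-1} = i)
    obs_likelihoods[t] = vector of P(e_t | X_t = j), for t = 1..T
    """
    m = np.log(prior)                       # log m_0(x_0)
    back = []
    for lik in obs_likelihoods:
        # scores[i, j] = log[m_{t-1}(i) * P(X_t=j | X_{t-1}=i) * P(e_t | X_t=j)]
        scores = m[:, None] + np.log(transition) + np.log(lik)[None, :]
        back.append(scores.argmax(axis=0))  # best predecessor for each state
        m = scores.max(axis=0)              # log m_t(x_t)
    # Trace back the most likely path from the best final state
    path = [int(m.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))
```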
Generalization
We want a classifier that does well on never-before-seen data
Temporal Difference (TD) learning
We don't want to learn (and don't know) P(s' | s, a). Temporal difference (TD) update: - Pretend that the currently observed transition (s, a, s') is the only possible outcome and adjust the Q values towards the "local" equilibrium Qlocal(s,a) = R(s) + y max Q(s', a') Qnew(s, a) = (1 - alpha)Q(s,a) + alpha*Qlocal(s,a) Qnew(s,a) = Q(s,a) + alpha*(Qlocal(s,a) - Q(s,a)) Qnew(s,a) = Q(s,a) + alpha(R(s) + y max Q(s', a') - Q(s, a)) At each time step t - From current state s, select an action a: a = argmax f(Q(s, a'), N(s, a')) - Get the successor state s' - Perform the TD update Q(s,a) = Q(s,a) + alpha(R(s) + y max Q(s', a') - Q(s, a)) alpha = learning rate (should start at 1 and decay as O(1/t))
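A minimal sketch of one TD update on a dict-based Q table; td_update and the default gamma are hypothetical, and unseen (state, action) pairs are treated as having Q = 0:

```python
def td_update(Q, s, a, r, s_next, actions, alpha, gamma=0.9):
    """One Q-learning TD update on a dict-based Q table.

    Q is keyed by (state, action) pairs; actions(s_next) lists the actions
    available in the successor state.
    """
    # Local equilibrium value: R(s) + gamma * max_a' Q(s', a')
    q_local = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions(s_next))
    q_sa = Q.get((s, a), 0.0)
    # Move Q(s, a) a fraction alpha of the way toward the local value
    Q[(s, a)] = q_sa + alpha * (q_local - q_sa)
    return Q
```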
Smoothing
What is the distribution of some state Xk given the entire observation sequence e1:t? Forward-backward algorithm Forward message: P(Xk | e1:k) Backward message: P(ek+1:t | Xk)
Regularization
add a penalty on weight magnitudes to the objective function E(f) = ∑(yi - f(xi))^2 + λ/2 * ∑ wj^2 Discourages network weights from growing too large, encourages network to use all of its inputs "a little" rather than a few inputs "a lot"
Joint Probability Distribution
assignment of probabilities to every possible atomic event P(X1, X2, ... Xn)
Random Variables
describe the state of the world; each random variable takes on values from a domain, and those values must be mutually exclusive and exhaustive
Naive Bayes classifier
f(x) = argmax P(y|x) ≈ argmax P(y) P(x | y) = argmax P(y) ΠP(xd | y)
(k-)Nearest Neighbor classifier
f(x) = label of the training example nearest to x - All we need is a distance function for our inputs - No training required! For a new point, find the k closest points from training data Vote for class label with labels of the k points Pros - simple to implement - decision boundaries not necessarily linear - works for any number of classes - nonparametric method Cons - need good distance function - slow at test time
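A minimal k-NN prediction sketch using Euclidean distance; knn_predict is a hypothetical name, and ties in the vote are broken arbitrarily by Counter.most_common:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every example
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```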
Q-learning
learn an action-utility function Q(s, a) that tells us the value of doing action a in state s Relationship between Q-values and utilities: U(s) = max Q(s, a) Selecting an action: π*(s) = argmax Q(s,a) We don't need to know the transition model to select the next action Equilibrium constraint on Q values: Q(s, a) = R(s) + y∑P(s' | s, a) max Q(s', a')
Decision Theory
probability theory + utility theory
Events
sets of world states; probabilistic statements are defined over events, which are described using propositions about random variables P(A) = probability of the set of world states in which proposition A holds
Policy
the action that an agent takes in any given state -> the solution to an MDP
Utility Theory
used to represent and infer preferences
Maximum Likelihood Decision
x = argmax P(e | x)
Basic Supervised Learning Framework
y = f(x) - y = output - f = classification function - x = input Learning: given a training set of labeled examples {(x1, y1), ..., (xN, yN)}, estimate the parameters of the prediction function f Inference: apply f to a never before seen test example x and output the predicted value y = f(x) Key challenge = generalization
Differentiable Perceptron
σ(t) = 1 / (1 + e^-t) Output is σ(w * x + b)