CMPSC442
Task Environment
A problem specification for which the agent is a solution
Two ways to use parent method to define subclass method
1.)Explicitly call the parent class method in the redefinition 2.) use super()
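A minimal sketch of both approaches (the class and method names here are illustrative only, not from the course materials):
class Animal:
    def speak(self):
        return "..."

class Dog(Animal):
    # 1.) Explicitly call the parent class method in the redefinition
    def speak(self):
        return Animal.speak(self) + " woof"

class Cat(Animal):
    # 2.) Use super() to reach the parent method
    def speak(self):
        return super().speak() + " meow"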
Random Variables Indexed over Time
Assume: fixed, constant, discrete time steps t ● Notation: Xa:b = Xa, Xa+1, ..., Xb-1, Xb ● Markov assumption: random variable Xt depends on a bounded subset of X0:t-1
UCS Evaluation
Complete, like BFS. Optimal for any step cost function. Time complexity contingent on step costs. Space complexity = time complexity.
Database Semantics
DB semantics for a fragment of FOL with two constants (R, J), one binary relation ○ Unique-names assumption: every constant refers to a distinct object (R ≠ J) ○ Closed world assumption: an atomic sentence not known to be true is false ○ Domain closure: bijective relation of domain elements to constant symbols
Existential Instantiation (EI)
For any sentence σ, variable v, and constant symbol k that does not appear elsewhere in the knowledge base:
UCS in words
Store the frontier as a priority queue ordered by f. Expand the node n from the frontier that has the lowest path cost f(n). Apply the goal test when a node is selected for expansion. Save any goal node in reached. Select the goal node in reached with the lowest path cost. (If there are multiple paths to the goal, the lowest cost path is found first.)
Each time step Xt is conditioned on the preceding states Xt-2, Xt-1: Second Order Markov Process
T
Single Feed Forward Neuron Cannot be an XOR Gate
T
● Connected components of a constraint graph constitute independent problems ○ If assignment Si is a solution of CSPi, then ∪Si is a solution of ∪CSPi ○ Reduction of complexity: if each CSPi has c variables from a total of n, then there are n/c subproblems, each with complexity d^c, where d is the size of the domain, giving O(d^c · n/c), which is linear in n, instead of O(d^n), which is exponential in n ● Connected components of CSPs are rare
T
Two different functions cannot have the same name, even if they have different numbers, order, or name of arguments
True
α: lower bound on MAX's outcome
β: upper bound on MIN's outcome
Majority Class Baseline for POS Tagging
● Assign the most frequent tag
Bayesian Networks
● Concisely represent any full joint probability distribution ○ Graphically represents conditional independence ○ Defined by the topology and the local probability information ● Also known as belief nets
Entropy
● Entropy measures the uncertainty of a random variable; the more uncertainty in the variable, the less information it has ● Recall the use of mutual information for feature selection for NB classifiers ○ MI presented there as a measure of how independent X and Y are ○ It can also be seen as a measure of how much information X and Y convey about each other; it is the normalized negative joint entropy
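For reference, the standard formulas (not reproduced from the slides; log base 2 assumed): H(X) = −∑x P(x) log2 P(x), and mutual information I(X;Y) = ∑x,y P(x,y) log2 [ P(x,y) / (P(x)P(y)) ] = H(X) + H(Y) − H(X,Y)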
Resolution-based Theorem Prover
● For any sentences A and B in propositional logic, resolution can decide whether A ⊨ B ● Step 1: Put the statements in conjunctive normal form (CNF) ● Step 2: Proof by contradiction
Discounted Rewards with Infinite Horizon MDP
● Optimal policy π* is independent of the start state ● Utility of a state s is the expected sum of discounted rewards over time (t=0 . . . t=∞) ● Optimal policy is the policy that gives the maximum utility
Game Search, for Two-Player, Zero-Sum Games
● Two players: MAX and MIN ● MAX moves first ● MAX and MIN take turns until the game is over ● Winner gets reward, loser gets penalty
Expectimax for "Environment" Adversary
● Uncertain outcomes of agent's actions
Utilities of Sequences
● What preferences should an agent have over reward sequences? ● More or less? ● Now or later?
Semantics ctd
● Where the network contains n random variables Xi , each entry in the joint probability distribution is P(x1 , . . . , xn ) ● In a Bayesian network, each entry P(x1 , . . . , xn ) in the joint probability distribution is defined in terms of the parameters θ:
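The factorization being referred to is presumably the standard Bayesian network chain rule: P(x1, . . . , xn) = ∏i=1..n P(xi | parents(Xi)) = ∏i=1..n θ(xi | parents(Xi))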
Expectimax Search
● Why wouldn't we know what the result of an action will be? ○ Explicit randomness: rolling dice ○ Actions can fail: when moving a robot, wheels might slip ● Values should reflect average-case (expectimax) outcomes, not worst-case (minimax) outcomes ● Expectimax search: use average score under optimal play ○ Max nodes as in minimax search ○ Chance nodes are like min nodes, represent uncertain outcomes ○ Calculate their expected utilities, i.e. take weighted average (expectation) of the children
Convergence Properties
● With fixed α, no guarantee of convergence ● With decreasing α, convergence is guaranteed ● What previous topic is this similar to?
Deriving ξ t (i,j)
1. By laws of probability, εt(i,j) = αt(i) aij bj(ot+1) βt+1(j) is the joint probability of state i at time t, state j at time t+1, and the full observation sequence 2. P(O|λ): the probability of the whole observation sequence 3. Divide εt(i,j) by P(O|λ) to get ξt(i,j)
Knowledge Engineering using FOL
1. Identify the task 2. Assemble the relevant knowledge 3. Decide on a vocabulary of predicates, functions, and constants 4. Encode general knowledge about the domain 5. Encode a description of the specific problem instance 6. Pose queries to the inference procedure and get answers 7. Debug the knowledge base
Action Selection
1. Instantiate all evidence 2. Set action node(s) each possible way; for each action value: a. Calculate the posterior for the parents of the utility node, given the evidence b. Calculate the utility for each action 3. Choose the action with the highest utility
Generalized Search Algorithm
1.)Enumerate all paths from the initial state through the state space 2.)Find the subset of paths that end in the goal state 3.)Order the solution paths by cost, ascending 4.)Select the lowest cost solution
During graph search, states are in one of three disjoint subsets
1.)States associated with expanded nodes 2.) States associated with nodes in the frontier 3.)States associated with nodes that have not been reached
Six properties of task environments
1.)States can be fully, partially, or not observable 2.)Agency involves a single or multiple agents who might co-operate, compete or confront. 3.)Successor states can be deterministic, non-deterministic or stochastic 4.)Agent decisions can be episodic or sequential 5.)The world can be static or dynamic 6.)Time and space can be discrete or continuous
Initial state S0 :
How the game is set up at the start
Constructing a Bayesian Network
A Bayesian network is a correct representation of a domain only if each node is conditionally independent of its other predecessors in the node ordering, given its parents 1. Determine the variables to model the domain, and order them such that causes precede effects 2. Choose a minimal set of parents for Xi to satisfy the equation for the local & global semantics, and insert an edge from every Parent(Xi ) to Xi 3. Add the CPT at each node. This procedure guarantees the network is acyclic, and with no redundant probability values: it cannot violate the axioms of probability.
Is-Terminal(s):
A Boolean to indicate when the game is finished
Compactness of a Bayesian Network
A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values ● Each row requires one probability value p for Xi = true (the number for Xi = false is 1-p) ● If no Boolean variable has more than k Boolean parents, the complete network requires O(n ⋅ 2^k) numbers ● Grows linearly with n versus O(2^n) for the full joint distribution
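A worked example of the savings (illustrative numbers, not from the slides): with n = 30 Boolean variables and at most k = 5 parents each, the network needs at most 30 ⋅ 2^5 = 960 numbers, versus 2^30 − 1 ≈ 10^9 for the full joint distribution.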
PL versus FOL Expressivity
A Logical Formalism Can be a Tool to Investigate Information in Language ● Propositional logic (PL) assumes the world contains statements ○ Atomic terms standing for statements ○ Operators to combine with atomic statements ● First-order logic (FOL) assumes the world contains ○ Objects: people, houses, numbers, colors, baseball games, . . . ○ Relations: red, round, prime, brother of, bigger than, part of, . . . ○ Functions: father of, best friend, one more than, plus, . . . ○ Quantification: all wumpuses smell bad, some squares are breezy ○ Statements about objects, relations, functions and their quantification
Markov Assumption for Language Modeling
A bigram language model is a Markov model: ● S, a set of states, one for each word wi ● A, a transition matrix where a(i,j) is the probability of going from state wi to state wj ○ Where the probability a(i,j) can be estimated by count(wi wj) / count(wi) ● π, a vector of initial state probabilities, where π(i) is the probability of the first word being wi
Subclasses
A class can extend the definition of another class. Allows use of methods and attributes already defined in the previous one. New class: subclass. Original: parent, ancestor, or superclass
object-oriented programming
A computer programming model that organizes software design around data, or objects, rather than functions and logic. An object can be defined as a data element with characteristic attributes and behavior defined by the class it represents
Convex Functions
A function is convex when, for any two points (x, f(x)) and (y,f(y)), a line segment connecting the two points lies above the curve f. Use of an arbitrary function g as your objective function means to use g as the criterion for achieving the problem goal. If the objective function g is convex, then the goal can be identified as z such that g(z) =0
A Naïve Bayes Model is a Bayesian Network
A graphical representation has one node for each random variable ● Directed arrows represent the conditioning effect of a parent node on its children
Generative Models Support Parameter Estimation
A joint probability distribution supports queries about any one or more random variables by marginalizing over the other variables ● Therefore the same model can be used in multiple directions: ○ To estimate the posterior probability of labels given the data ○ To estimate the posterior probability of the data given the labels
Likelihood Function
A likelihood function is a probability function (e.g., probability mass function) ● Likelihood L is considered to be a function of the parameters L(θ) for fixed x and y ○ where θ is a vector of parameters ○ x is fixed data (e.g., Naïve Bayes features) ○ y are prior probabilities of the classes ● Bayes rule: Posterior = Likelihood x Priors
List
A mutable ordered sequence of mixed types
Validity and Satisfiability
A sentence is valid if it is true in all models (tautologies; necessarily true) ○ e.g., A ∨¬A, A ⇒ A, (A ∧ (A ⇒ B)) ⇒ B ● Validity is connected to inference via the Deduction Theorem: ○ A ⊨ B if and only if (A ⇒ B) is valid ● A sentence is satisfiable if it is true in some model ○ e.g., A ∨ B ○ Determining satisfiability (the SAT problem) is NP-complete ● A sentence is unsatisfiable if it is true in no models ○ e.g., A∧¬A ● Satisfiability is connected to inference via reductio ad absurdum: A ⊨ B if and only if (A ∧¬ B) is unsatisfiable
Solution
A sequence of actions from the initial state to the goal state
Syntax
A set of nodes, one per random variable (discrete or continuous) ● A directed, acyclic graph (DAG) ○ Directed arcs for parent node as conditioning context point to the child nodes ● A conditional distribution for each node Xi given its parents that quantifies the effect of the parents on the node using a finite number of parameters θ
Tuple
A simple immutable ordered sequence of items (cannot be modified) Items can be of mixed types, including collection types
Default Dict
A subclass of dict. Does not raise a KeyError if a key lacks a value; assigns a default value. Optional default_factory arg: a function to return a default value. Can use a built-in or define a function.
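A minimal sketch (the word-count use is illustrative):
from collections import defaultdict

counts = defaultdict(int)          # default_factory is the built-in int, so missing keys get 0
for word in ["the", "cat", "the"]:
    counts[word] += 1              # no KeyError on the first access to a new key
print(counts["the"])               # 2
print(counts["dog"])               # 0, supplied by the default factory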
Defining the state space
Abstract over the real world details. State space abstraction. Action abstraction: pairs of states where 1st can be succeeded by 2nd.
Evaluation Metrics
Accuracy and Information Retrieval metrics ● A confusion matrix is a square matrix of n classes by n hypotheses ● The same confusion matrix is used for accuracy and for IR metrics ● Accuracy is preferred if all cell values can be determined ● Recall and precision are often used when TN is not known
Add-One Laplace Smoothing
Add pseudo counts for unobserved class members where the missing word could have occurred (as if you had more data): ○ Add 1 to the numerator in the conditional probabilities for every word ○ Increment the denominator total count of words in the class by the size of the vocabulary (the class size is expanded by all the pseudo counts)
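With vocabulary V and count(w, c) the number of occurrences of word w in class c (notation assumed here), the smoothed estimate is typically: P(w | c) = (count(w, c) + 1) / (∑w′∈V count(w′, c) + |V|)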
Two criteria for Good Heuristic functions
Admissibility: a good heuristic never overestimates the true cost to the goal, i.e., h(n) <= h*(n), where h*(n) is the true cost from n. Consistency: given a node n, a successor n', and the goal node G, a heuristic function is consistent if n, n' and G obey the triangle inequality
Search as Optimization
Advantages: Low memory usage; often finds reasonable solutions in large or infinite state spaces. Useful for pure optimization problems: Find or approximate the best state according to some objective function. Optimal if the space to be searched is convex.
General Description of Search
Agent formulates a goal. Agent formulates an abstract representation of the problem. Agent simulates sequences of actions until it finds a best path to the goal, or exhausts the search space. If it found the best path, it executes the path and achieves the goal.
How does a rational agent choose outcomes?
Agent cannot perfectly predict all outcomes. Agent relies on expected outcomes.
Rational Behavior
Agent is assessed by its performance meaning the consequences of its actions
Goal Test
Agent possibly achieves the goal if any state s in the belief state satisfies the goal test. Agent necessarily achieves the goal if all states s in the belief state satisfy the goal test
Motivation for Alpha-Beta Pruning
Among the possible actions at a given MIN node, MIN will always choose the one that results in MAX's lowest score
*args
An iterable for a variable number of arguments
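A minimal sketch (the function name is illustrative):
def total(*args):
    # args arrives as a tuple holding however many positional arguments were passed
    return sum(args)

print(total(1, 2, 3))   # 6
print(total())          # 0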
Performance Measure
An objective criterion for success of an agent's behavior, given the evidence provided by the percept sequence.
Simulated Annealing
Annealing: the process by which a metal cools slowly and as a result freezes into a minimum energy crystalline structure. Adopt a control parameter T, which by analogy with metallurgy is known as the system temperature. T controls the amount of randomness in picking a successor. T starts out high to promote escape from local maxima early on. T gradually decreases towards 0 over time to avoid getting thrown off track late in the search
Lambda
Anonymous functions. Used to create a function with no name. Any number of arguments, a single expression. Useful when a function is required only rarely.
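A minimal sketch (the sort-key use is illustrative):
square = lambda x: x * x                     # one expression, no name required
print(square(4))                             # 16
pairs = [(2, "b"), (1, "a")]
print(sorted(pairs, key=lambda p: p[0]))     # [(1, 'a'), (2, 'b')], key used only here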
Using Search to Solve Sensorless Problems
Applying search algorithms to sensorless problems ○ So far, we have used search algorithms to search the state space ○ The same algorithms can search the belief state ● Why should this work? ○ Percepts are irrelevant: they are always empty ○ The solution to a sensorless problem is a single sequence of actions ○ The belief state is deterministic: the agent knows its own beliefs by definition
Or-nodes
As in the deterministic search methods, search trees will contain state nodes with one or more possible action arcs
Expectations for the Transition Probabilities
Assume we have an estimate of the transition probability aij at a particular point in time t in the observation sequence ● Then the estimate of the total count for aij sums over all t ○ Formulate an expression for the expected joint probability ξt(i,j) of state i at time t and state j at time t+1, based on observation sequences ○ For the numerator of the estimate of aij, sum over ξt(i,j) for all t ○ For the denominator, sum over ξt(i,k) for all t and for all transitions i,k
Axioms versus Theorems
Axioms: Foundational statements taken as given Theorems: entailed by the axioms
Given a finite branching factor and finite state space
BFS is complete. At most it branches b times.
Four uninformed search algorithms
BFS: FIFO queue; Optimal if step costs are the same. Uniform Cost Search (Best First): Priority queue ordered by a cost function; Optimal for any step cost function. DFS: LIFO queue; Not optimal. Iterative Deepening: Adopts benefits of BFS and DFS without their limitations
First-order Markov Process
Bayesian network over time ○ Random variables . . . Xt-2 , Xt-1 , Xt , Xt+1 , Xt+2 . . . ○ Directed edges for conditional independence ● Each state Xt is conditioned on the preceding state Xt-1
Viterbi Decoding for POS Tagging
Because many of these counts are small or do not occur, smoothing is necessary for best results ● HMM taggers typically achieve about 95-96% accuracy, for the standard 36-42 set of POS tags
Value Iteration
Bellman equations characterize the optimal values: ● Value iteration computes them:
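A standard way to write both (assuming transition model T, reward R, and discount γ; the slide's own formulas are not reproduced here): V*(s) = maxa ∑s′ T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ], and the value iteration update Vk+1(s) ← maxa ∑s′ T(s, a, s′) [ R(s, a, s′) + γ Vk(s′) ]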
Two Ways to compare algorithm efficiency
Benchmarking: empirical measurement on benchmark tasks; very specific. Mathematical analysis with big O notation: asymptotic analysis of how computation time changes with the length of the input
A* search
Best known form of heuristic best-first search. Key ideas: avoid expanding expensive paths, expand the most promising first. Evaluation function f(n) = p(n) + h(n), where p(n) = the cost to reach the node and h(n) = the estimated cost to get from the node to the goal; f(n) = estimated total path cost through n to the goal. Implementation: frontier as a priority queue by increasing f(n)
Components of O() notation
Branching factor: number of successors of a search node. Depth: number of actions in the optimal solution. Maximum depth: maximum number of actions on any path
Dict
Built in container. Python dictionaries of key: value pairs are unordered and work by hashing so keys must be immutable.
Limitations of Propositional Logic
Can only state specific truths. Cannot state generic truths
Action cost
Case one: if all actions have same cost from any state, then the cost is the same as in the state space problem Case two: if the same action can have different costs, depending on the state, then the cost is a function of the belief state
Decision Networks
Chance nodes (ovals, as in Bayesian Networks) ○ Parents can be chance nodes or decision nodes ● Decisions (rectangles; no parents, treated as observed evidence) ● Utility nodes (diamonds, depend on action and chance nodes)
Concept of a Utility Function
Choosing among actions based on the desirability of their outcomes ○ Each action a in state s results in a new state s' with some probability P(result(a)=s') ○ The transition model gives the probabilities of action outcomes ● Given a utility function U(s) that quantifies the desirability of a state s ○ Expected Utility EU(a) of an action a is the sum of utilities over the outcomes, weighted by their probabilities
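In symbols, the expected utility described above is usually written EU(a) = ∑s′ P(Result(a) = s′) U(s′)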
Evaluation of A*
Complete. Time: O(b^d). Space: all nodes are stored; runs out of space before time. A good heuristic can reduce the complexity by orders of magnitude. Optimal
Iterative Deepening Evaluation
Complete. Optimal if step cost = 1. Time complexity= O(b^d) Space Complexity= O(d)
Evaluation of Greedy Best First
Complete: No - can get stuck in loops Time: O(b^m) worst case Space: O(b^m)- priority queue Optimal: No
Inference Task: Prediction
Compute posterior over a future state, based on all the evidence to date
Inference Task: Smoothing
Compute the posterior distribution over a past state ● Smoothing gives a better estimate of Xk , k ≤ t, than was available at time tk ○ More evidence is incorporated for the state Xk - evidence preceding, concurrent with, and following Xk
HMM POS Tagger as a Bayesian Network
Condition the hidden POS tag p at time t on the POS tag at time t-1: P(pi |pi-1 ) ● Condition the word w at time t on the POS tag at time t: P(wi |pi )
Benefits of Crossover
Crossover can be beneficial given an advantageous pattern (schema) on one side of the crossover point.
Simple Reflex Agent
Current percept determines agent's next action
Alpha-Beta Algorithm Description
DFS ● Pass current values of α, β down to children during search ● Update values of α and β during search: ○ Update α at MAX nodes ○ Update β at MIN nodes ● Prune remaining branches at a node whenever α ≥ β
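A compact sketch of the pruning logic described above (is_terminal, utility, and successors are hypothetical game-specific helpers):
def alpha_beta(state, alpha, beta, is_max):
    if is_terminal(state):
        return utility(state)
    if is_max:
        value = float("-inf")
        for child in successors(state):
            value = max(value, alpha_beta(child, alpha, beta, False))
            alpha = max(alpha, value)      # update alpha at MAX nodes
            if alpha >= beta:              # prune the remaining branches
                break
        return value
    value = float("inf")
    for child in successors(state):
        value = min(value, alpha_beta(child, alpha, beta, True))
        beta = min(beta, value)            # update beta at MIN nodes
        if alpha >= beta:
            break
    return value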
Methods to Handle Uncertainty for Logical Agents
● Default or nonmonotonic logic ● Fuzzy logic: truth values in [0,1] ○ Can handle different interpretations of the same predicate ● Subjectivist (Bayesian) Probability ○ Model agent's degree of belief ○ Estimate probabilities from experience (e.g., expectation as average) ○ Probabilities have a clear calculus of combination
Scope of Peano Axioms
Define multiplication as repeated addition. Define exponentiation as repeated multiplication. Define division similarly. All of Number Theory is defined from: ● One constant (zero) ● One function (Successor) ● One predicate (+) ● Nine axioms
M-Step for Naive Bayes
Define γ t (qm ) to be the new parameter value at iteration t for the prior probability of class qm ● Define γ j t (v|qm ) to be the new parameters at iteration t for the conditional probabilities of each jth word v given the class qm
E-Step for Naive Bayes
Define δ(qm |di ) to be the conditional probability of the class qm given di and the parameter values θ t-1 from iteration t−1 ● In each E-step, calculate the values δ(qm |di )
The Bellman Equations
Definition of "optimal utility" gives a simple one-step lookahead relationship amongst optimal utility value
Minimax Performance
Depth-first search (DFS) with fixed number of ply m as the limit. ● O(b^m) time complexity ● O(bm) space complexity if algorithm computes all moves at once ● O(m) space complexity if algorithm computes moves one at a time Performance will depend on ● The quality of the static evaluation function (expert knowledge) ● Depth of search (computing power and search algorithm)
Transition model: Two cases
Deterministic actions: Non-deterministic actions
Diagnostic Knowledge versus Causal Knowledge
Diagnostic network is less compact than the causal network: 1 + 2 + 4 + 2 + 4 = 13 CPT entries instead of 10 ● Causal models and conditional independence seem hardwired for humans!
Expectation Maximization and Naive Bayes
EM learns the parameters of a probabilistic model ● Meeting 28 presented EM for learning the parameters of an HMM ● Naive Bayes is one of the simplest probabilistic models ○ If the training data has class labels, the MLE parameter estimates can be computed from the observed distribution in the data ○ If the training data lacks class labels, the Naive Bayes parameters can be learned through EM
Computing Values of Each Trellis Entry
Each cell of the forward algorithm trellis αt ( j ) represents the probability of being in state j after seeing the first t observations, given the HMM λ . The value of each cell is computed by summing over the probabilities of every path that leads to this cell
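In the notation used on the other HMM cards (transition aij, emission bj, observation ot), the recursion is usually written: α1(j) = πj bj(o1); αt+1(j) = [ ∑i=1..N αt(i) aij ] bj(ot+1); and P(O|λ) = ∑j αT(j)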
Markov Assumption
Each state depends on a fixed finite number of prior states ● Future is conditionally independent of the past ● A Markov chain is a Bayesian network that incorporates time (temporal sequences of states)
Universal Instantiation (UI)
Every instantiation of a universally quantified sentence is entailed by it
Structure of CSP Problems
Examination of the CSP graph can be used to solve problems more quickly ○ Independent subproblems ○ Tree-structured CSPs ● Patterns of values can also help solve problems
Utility Based on Rational Preferences
Existence of Utility Function: if an agent's preferences obey the axioms of utility, then ○ There exists a function U such that U(A) > U(B) iff A > B ○ U(A) = U(B) iff A ∼ B ● Expected Utility of a Lottery
Magic Methods
Existing method names with leading and trailing double underscores. Built-ins that can be modified for the user's classes. Best to make them intuitive, like duck-typing. Adds magic to user classes that builds on Python syntax for built-ins.
Constructing a Search Tree
Expand each next node by applying actions
Fixed Policies
Expectimax trees max over all actions to compute the optimal values ● If we fixed a policy π(s), then the tree will be simpler, with one action per state ○ Though the tree's value would depend on which policy we fixed
Binomial Distribution: x Successes in y Trials
Experiment: n repeated trials where each trial has one of two outcomes, success or failure (Bernoulli trials) ● P (probability of success) is the same on every trial ● Trials are independent ● x = number of successful trials out of n ● Binomial Probability Mass Function
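The PMF referred to above is the standard binomial formula: P(X = x) = [ n! / (x!(n−x)!) ] p^x (1−p)^(n−x)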
Types of Search queues
FIFO: first in first out; used in breadth first tree traversal LIFO: last in first out; used in depth first tree traversal Priority Queue: Orders nodes in queue by an evaluation function; used in best first search
Problems with DFS
Fails in infinite-depth search spaces. Can be very inefficient if m>>d
Inference Task: Most Likely Explanation
Find the most likely sequence of states that could have generated a set of observations
Forward Probabilities
For a given HMM λ, given that the state q is i at time t, what is the probability that the partial observation o1 ... ot has been generated? ● Forward algorithm computes αt(i), 1 ≤ i ≤ N, 1 ≤ t ≤ T, in time O(N^2 T) using the trellis, where T is the number of observations and N is the number of hidden states
Rational Agent
For each possible percept sequence P, a rational agent should select an action a that is expected to maximize its performance measure.
● Monte Carlo Tree Search can be used instead of minimax for games ○ More efficiency for games with a high branching factor ○ No need for heuristics to inform the game state evaluation function
Formalizes the tradeoff between exploration versus exploitation
General Framework Monte Carlo
From a given position s i , simulate m move sequences to game end ● Each simulation is called a rollout or playout ● Value of s i is given by the average utility over the m simulations ● Random playouts work only for simple games, so we need ○ A playout policy (how to bias moves towards good ones) rather than randomly pick moves ○ What start positions to use for playouts, how many to do? ■ Pure Monte Carlo: do N simulations, find the next move with highest win % ■ Selection policy: Balances exploration and exploitation
GMP and FOL
GMP is a lifted Modus Ponens ● Raises Modus Ponens from variable-free propositional logic to FOL ● Inference in FOL - Lifted versions of ○ Forward Chaining ○ Backward Chaining ○ Resolution
How to Play a Game by Searching
General Scheme 1. Consider all legal successors to the current state ('board position') 2. Evaluate each successor board position 3. Pick the move which leads to the best board position 4. After your opponent's best move(s), repeat.
Gradient Descent (Alternative)
Given a locally correct formula for the gradient, perform steepest ascent hill climbing to move in the direction of grad(f) = 0
Greedy Best First
Given f(n) = p(n) +h(n) for evaluation function f(n), path cost so far p(n) and estimate to goal h(n). Greedy Best first ignores p(n) so f(n) = h(n)
Reduction in Complexity
Given n random variables that are all conditionally independent on a single causal variable, probabilistic inference goes from O(2^n ) to O(n) ● Basis for naïve bayes ● Allows probabilistic inference to scale to many variables ● Conditional independence is very common, in contrast to full independence, which is not
Backward Algorithm
Given that we are in state i at time t, the probability β of seeing the observations from time t + 1 to T, given λ, is: ● The Forward and Backward algorithms give same total probabilities over the observations: P(O)=P(o1 ,o2 ,...,oT )
Hill Climbing Details
Given the current state n, an evaluation function f(n), and the successor state s of n that maximizes f(s): if f(s) >= f(n), then move to s; else halt at n. Terminates when a peak is reached. Has no look-ahead beyond the immediate neighbors of the current state. Chooses a random best successor, if there is more than one. Cannot backtrack, since it doesn't remember where it's been
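A minimal sketch of the loop described above (successors and f are hypothetical problem-specific functions):
import random

def hill_climb(start, successors, f):
    current = start
    while True:
        candidates = successors(current)
        if not candidates:
            return current
        best_value = max(f(s) for s in candidates)
        if best_value < f(current):
            return current                 # halt: no successor is at least as good
        best = [s for s in candidates if f(s) == best_value]
        current = random.choice(best)      # random choice among equally good best successors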
Cycle Cutsets
Given the efficiency of tree-structured CSPs, turn other CSP graphs into trees ● A cycle cutset S of a graph G: any subset of vertices of G that, if removed, leaves G a tree ○ EG: Assign a value v to SA, then remove v from domains of remaining variables; SA is a cycle cutset that can be removed, leaving a forest of trees
Formulation for Estimating the Transition Probabilities
Given the expression ξ t (i,j), we can now sum over all t ● Numerator sums over all probabilities of states i at t and states j at t+1 ● Denominator sums over all probabilities of states i at t and states k at t+1 for all k in full set of states Q
Non-deterministic problems
Given the states s and actions A Current state + action = belief state consisting of alternative possible successors. Solution to search is a strategy for taking actions, rather than a specific action sequence. The strategy execution is contingent on detecting results of actions.
Deterministic Problems
Given the states s and actions A Current state + action = resulting successor state
Unification: Systematic Substitution
Given two logical statements p, q ● Find a substitution θ (unifier) that make p and q look identical Unify(p, q) = θ where Subst(θ,p) = Subst(θ,q) ● A key component of first order inference algorithms
Utility(s, p):
Gives a numerical value to player p at the terminal state
The Value of Information
How an agent chooses what information to acquire: values of any of the potentially observable chance variables in the model ○ Observation actions affect the agent's belief state ○ Value of any observation derives from the potential effect on the agent's actions
PEAS:
How to specify the task environment: Performance measure, Environment description, Actuators, Sensors
Expectimax: Average Case Instead of Worst Case
Idea: Uncertain outcomes controlled by chance, not an agent adversary
Improved Backtracking: Backjumping
Identify a variable's conflict set ○ Variable assignments that are in conflict. Jump back to reassign the most recently assigned variable in the conflict set
Constraint Learning
If a partial assignment is encountered that leads to failure (e.g., {WA = red, NT = green, Q = blue}) during a search, save the information ○ NoGood({WA = red, NT = green, Q = blue}) ● Then it won't be tried again ● Modern CSP solvers gain in efficiency through use of constraint learning
Method Resolution Order
If a subclass method name is in multiple parent classes, use the order in the subclass statement
Heuristic function h(n) = estimated cost from node n to goal node
If n is goal then h(n) = 0. Evaluation function g(N) >= h(n). Otherwise h is not a good heuristic
String
Immutable, like a tuple with different syntax. Character encoding versus data storage and transmission.
Goal-based reflex agent
Implicit or explicit notion of planning. Agent's next action depends on transition model + sensor model+ goal
Utility-based agent
Implicit or explicit preference ordering of different plans for same goal
Rudimentary Profiling
Import the time module, retrieve the before and after times, subtract
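A minimal sketch:
import time

start = time.time()                       # before
result = sum(range(1_000_000))            # the code being timed
elapsed = time.time() - start             # after minus before, in seconds
print(f"{elapsed:.4f} seconds")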
General idea behind informed Search
Improves upon Best-First Search. (Best-first uses a priority queue ordered by an evaluation function f(n); f(n) is the path cost function on the path from the start node to the current node.) What matters is the path from start to goal. (Replace f(n) with a new function g(n) that includes an estimate h(n) of the cost from the current node to the goal: g(n) = f(n) + h(n). Computation of h(n) uses heuristic knowledge to get a good estimate.)
Pros of Propositional Logic
In contrast to database languages or programming languages, PL is: ● Declarative rather than procedural ● Supports partial information: disjunction, negation ● Compositional ○ Meaning of a complex statement is a function of the meanings of the parts (atomic statements, operators) ● Meaning is context-independent ○ Contrasts with natural languages, where meaning is context dependent
Solving Partially Observable Problems
In place of the successor function from fully observable deterministic search, we now have: ○ A PERCEPT function to produce possible observations in successor belief state ○ A RESULTS function to update the belief state ● Given the above, the AND-OR search algorithm can be applied to the belief state space to find a solution ● The solution is a conditional plan
Developing the Parameters for an HMM
In previous meetings, we assumed that the parameters of the HMM model are known ○ Often these parameters are estimated on annotated training data ○ Annotation is often difficult and/or expensive ○ Training data might be different from a new dataset of interest ● Parameters can be learned from unlabeled data ○ In principle, the goal is to maximize the parameters with respect to the current data
Theory versus Practice
In theory, Baum-Welch can do completely unsupervised learning of the parameters ● In practice, the initialization is very important
Mutual Information
In the training set, choose k words which best predict the categories: words X with maximal Mutual Information with the class Y
Markov Blanket
Independent of the global network structure: ● For node A, the only part of the network to consider for predicting the behavior of A and its children ○ Parents of A, children of A, and the other parents of A's children ● This Markov Blanket property is exploited by inference algorithms that use local and distributed stochastic sampling processes
Estimating the Priors for Each State
Initial state distribution: πi is the probability that qi is a start state (t = 1)
EM Algorithm: Input and Initialization
Input ● An integer m for the number of classes ● Training examples di for i = 1 . . . D where each di ∈ ● A parameter T specifying the number of iterations Initialization: set γ 0 (qm ) and γ j 0 (v|qm ) to some initial (e.g., random) values satisfying the constraints
Unification Increases Efficiency
Instead of instantiation, find a substitution for variables in the KB ○ Make the conjuncts in an implication match atomic statements in the KB ○ Find substitutions for quantified statements
Beam Search
Keep track of k states Find successors of all k Take top k successors across all k beams (shares information across beams, only the computation of successor states is pursued in parallel.) Stochastic variant: Pick a successor with probability proportional to the successor's value.
Iterative Re-Estimation of Parameters
Key Idea: parameter re-estimation by hill-climbing ● Iteratively re-estimate the HMM parameters until convergence ○ Initialize with some parameter values ○ With these initial parameters, compute the expected counts of states, and expected state transition counts, given the observation sequences ○ Using these expectations, which approximate MLE counts over the data (were the state labels available), we then recompute the transition probabilities and emission probabilities
DFS
LIFO; usually implemented as a tree-like search. While the goal state has not been found, search finds each next successor, proceeding to maximum depth. The frontier holds the next deepest node.
A heuristic is consistent if, for node n, its successor n', cost function c, and action a: h(n) <= c(n,a,n') + h(n')
Lemma: If h is consistent, f(n) is non-decreasing on any path (where f(n) is the estimated total path cost). f(n') = g(n') + h(n') = g(n) + c(n,a,n') + h(n') >= g(n) + h(n) = f(n)
Inference: Queries about Probabilities
Let X be a variable to query, E be the list of evidence variables, e be the list of observed values for the evidence, and Y the remaining unobserved values, then we can formulate a query P(X|e) ● Compute the probability of X conditioned on e by summing over all combinations of values of the unobserved variables ● Theoretically, this general query can be addressed for any conditioning context of any variable using a full joint probability distribution ● In practice, full joint probability distributions are impractical for large sets of variables
Semantics
Local semantics: given its parents, each node is conditionally independent of its other ancestors Local semantics give rise to global semantics
Solving a Univariate Regression
Loss is minimized when the partial derivatives of the loss function with respect to w1 and w0 are both 0
Maximum Expected Utility (MEU) Principle
MEU defines a rational agent as one that chooses its next action to be the one that maximizes the expected utility: ● Implementation requires computational solutions to perception, learning, causal knowledge about outcomes of actions, and inference ● Instead of a retrospective performance measure, a decision theoretic agent incorporates it into the agent's utility function, thus allowing it to anticipate how to achieve the highest performance
Formalizing Naive Bayes
MLE Parameter Estimates for Multinomial Naive Bayes ○ Generalization of MLE estimates for binary Naive Bayes ● EM for Estimating NB parameters π q and ρn for M classes
Marginalization and Conditioning
Marginalizing (summing out) for a variable sums over all the values of another variable in the joint distribution ● Conditioning, derived from applying the product rule to the rule for marginalizing
MAX AND MIN
Max moves first: all play is computed from MAX's vantage point ● When MAX moves, MAX attempts to MAXimize MAX's outcome ● MAX assumes that when MIN moves, MIN attempts to MINimize MAX's outcome
Hill Climbing
Maximize an objective function (global maximum)
Optimization
Maximize or minimize a real function: choose input values from the domain. Compute the value of the objective function.
Cons of PL
Meaning is context-independent ○ Contrasts with natural languages, where many expressions are context dependent ○ PL cannot express much of the meaning natural languages convey Limitations on the expressivity of propositional logic: ○ Cannot state generic truths, only specific ones ■ Specific truth (fact): B1,1 ⇔ (P1,2 ∨ P2,1) ■ Generic truth: Squares adjacent to pits are breezy
Gradient Descent
Minimize a loss function (global minimum)
Classes support
Modularity for easier troubleshooting Reuse of code through inheritance Flexibility through polymorphism
Inference Rules
Modus Ponens: if A is true, and A ⇒ B is true, then B is true
Iterator
Mutable objects with a __next__() method (called via the built-in next()). Keeps track of how much of the iterator remains. Throws StopIteration when done. Use memory efficiently. Implemented as classes. Objects for a data stream with a next method
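A minimal iterator class (the countdown behavior is illustrative):
class Countdown:
    def __init__(self, n):
        self.n = n                 # tracks how much of the stream remains
    def __iter__(self):
        return self
    def __next__(self):
        if self.n <= 0:
            raise StopIteration    # signals that the stream is exhausted
        self.n -= 1
        return self.n + 1

for i in Countdown(3):
    print(i)                       # 3, 2, 1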
Details on UCS
Nearly the same as Dijkstra's algorithm, which finds lowest cost paths. Similar to BFS but applies a cost function to every step. (Orders the frontier by f. The first node on the frontier is therefore on the current lowest cost path.) BFS is optimal if all step costs are equal. UCS is optimal for any cost function.
BFS Cost optimality
Necessarily finds the shortest solution first. Optimal if all actions have the same cost.
Probability of a Sentence S: Bigram Version
Next guess: products of bigrams: ● Given a lattice for word hypotheses in a 3-word sequence: ● Bigram probabilities are more predictive, but very low and sparse!
Graph search
No state occurs in more than one path.
Decision Tree Algorithm
Nodes consist of tests on attributes. Edges consist of attribute values. Leaves consist of output values.
VPI Properties
Non-negative: one can ignore information that is not useful ● Non-additive, because the value depends on the current belief state ● Order-independence of sensing actions as distinct from other actions
DFS evaluation
Not complete; therefore not optimal. Time complexity = O(b^m) Space complexity = O(bm) - Linear space complexity
Efficiency of CSP Local Search
Not counting the initial placement of the queens in n-queens, the run-time of min-conflicts is independent of the problem size!
BFS Space and time complexity
O(b^d): the optimal solution is at depth d. Nodes remain in memory, so time complexity = space complexity. Memory requirements quickly become a problem. BFS cannot solve problems with a large state space.
Belief Updating
Observations cannot increase uncertainty. Sensing can be deterministic or non-deterministic ○ Deterministic sensing leads to disjoint belief states for the possible percepts, thus a partition of the predicted belief state
Uninformed Search
Only the information available in the problem definition is used. Different algorithms use different kinds of tree traversal.
Gaussian Mixture Model
Parameters of a mixture of Gaussians are ○ The weight of each component ○ The mean of each component ○ The covariance of each component
Manually Derived Parameters
Parameters that provide a good fit to the data (with smoothing), where good fit means that these parameters predict the data
Parts of Speech
Parts of speech: traditional grammatical categories like "noun," "verb," "adjective," "adverb" . . . (and many more) ● Functions: ○ Help distinguish different word meanings: N(oun) vs. V(erb) ■ EG: river bank (N) vs. she banks (V) at a credit union ■ EG: a bear (N) will go after honey vs. bear (V) with me ○ Preprocessing for many other natural language processing tasks
POS Tagging
Pervasive in Natural Language Applications ● Machine Translation ○ Translations of nouns and verbs have different forms (prefixes, suffixes) ● Speech synthesis ○ "Close" as an adjective versus verb ○ see table ● Sense disambiguation of word meanings ○ "Close" (adjective) as in near something ○ "Close" (verb) as in shut something
Design Process for a rational Agent
Precondition: PEAS specification Design: Construct a function f to maximize the value of the performance measures Implementation: Write and test an agent program that implements f on a particular architecture.
Transition Model in Partially Observable Environments
Prediction stage computes the hypothesized belief that results from taking action a in belief state b: Possible Percepts stage computes the possible observations in the predicted state: Update stage computes the belief state resulting from the percepts
Pros and Cons of Generators
Pro: avoids storing an entire sequence in memory. Con: bad if you need to inspect the individual values
Marginalize Out the Labels for Posterior of the Data
Probability of each example document di is the sum over all classes of the joint probability of the document di and the class Q ● Probability of each document di is thus given by the product of the prior probability of the class with the products of the conditional probabilities of each attribute in the document
Search in Continuous spaces: Brief introduction
Problem: a continuous action space has an infinite branching factor. (Many local search methods developed for discrete action spaces would not generalize to continuous action spaces. Search methods with random selection of successor states will work. Or, in convex spaces, follow the gradient of the evaluation function.)
Newton-Raphson Method
Produces successively better approximations to the root (zero) of a real-valued function: x_n+1 = x_n - g(x_n)/g'(x_n). A more direct route to the root.
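A short sketch of the update rule (g and its derivative dg are supplied by the caller):
def newton_raphson(g, dg, x, iterations=20):
    for _ in range(iterations):
        x = x - g(x) / dg(x)       # x_n+1 = x_n - g(x_n) / g'(x_n)
    return x

# illustrative use: the positive root of g(x) = x^2 - 2 is sqrt(2)
print(newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, 1.0))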
Deriving Naïve Bayes
1. Product rule applied to a joint probability distribution 2. Conditional Independence
Nodes in search tree represent
Progress in the search, not states in the state space. Different paths to the same state create distinct nodes. Therefore, no backtracking in a search tree, only lookahead
Motivations for Simulated Annealing
Pros and cons of hill climbing. (If landscape is convex, it can be very fast. Real problems are rarely convex. If downhill moves are not allowed, cannot escape local maxima. Stochastic hill climbing allows downhill moves with low probability.) Random restart is complete but with very low probability. Simulated Annealing makes HC both efficient and complete. (Combines completeness of random restart with efficiency of stochastic methods. Basic idea: Diminish the randomness as search progresses.)
HMM Pos Tagger Parameters
Q = qm ∈ {q1 , q2 , . . . , qn } (|Q| = n = 36 for Penn TreeBank) ● A = aij transition probabilities for all 1,296 tag pairs qim qjn s. t. ∑j aij = 1 ● O = o1 o2 . . . oT sequences of T observations from a vocabulary V (words w, arranged in sentences length T): training corpus ● B = bi (ot ) for the 36×|V| observation likelihoods of observations o t generated from states i ● π = π1 , π2 , . . . , πn where πi is the probability that the Markov chain will start with state i s. t. ∑i πi= 1
Continuous Variables, i.e., Infinitely Many Values
Range of a random variable could be all real numbers ● P(NoonTemp = x) ○ Range is defined in terms of a probability density function (pdf) ○ Parameterized function of x, e.g., Uniform(x; 18C, 26C) ■ 100% probability x falls in the 8C range 18C - 26C ■ 50% probability x falls in a 4C range within [18,26] ○ Intuitively, P(x) is the probability that X falls within a small region beginning at x, divided by the width of the region
Representation of the Data
Real world data is converted to learning examples by defining random variables that occur in the data ○ Values for the random variables are derived from observations ○ The values of random variables for an example represent the example ● Assumptions: ○ The random variables are conditionally independent, given the class ○ The training and test data are drawn from the same distribution (stationarity)
Filtering Exemplified
Recursive estimation: for some function f, where the agent needs to compute the new state Xt+1 based on the new evidence e t+1, recursively add in evidence at each new time step to get the subsequent state
Forward or Backward Chaining
Require Horn Form ○ Conjunction of Horn clauses ○ Horn clauses: literal, or (conjunction of literals) ⇒literal ○ E.g., C ∧ (B ⇒ A) ∧ (C ∧ D ⇒ B) ● Modus Ponens (for Horn Form): complete for Horn KBs ● Forward chaining, linear time ● Backward chaining potentially much less than linear time
Depth Limited search
Set a limit l on search depth. Prevents infinite growth of early paths that do not contain goal node. Succeeds if d<=l, fails otherwise
Iterative Deepening
Set an initial limit l. At each next depth with no goal increase l
Visualizing an Ngram Language Model
Shannon/Miller/Selfridge method: ● To generate a sequence of n words given a bigram language model: ○ Fix an ordering of the vocabulary v1 v2 ...vk and specify a sentence length n ○ Choose a random value ri between 0 and 1 ○ Select the first vj such that P(vj) ≥ ri ○ For each remaining position in the sentence ■ Choose a random value ri between 0 and 1 ■ Select the first vk such that P(vk|vj) ≥ ri
Space complexity is analogous to time complexity
Size of memory instead of size of input. Units of space are arbitrary
LaPlace Smoothing
Smoothing is necessary in many applications of NB or other probabilistic models, especially to text classification ○ Avoid zero counts using one of many smoothing methods ○ LaPlace: Add mp to numerator, m to denominator, of all parameter estimates, where p is a prior estimate for an unseen value (e.g., 1/t for t values of Xi ), and m is a weight on the prior
Best First Search
Starts with a problem definition and an evaluation function. Initialize node to the start state; initialize frontier to a queue containing node; initialize reached to a Python dict. Loop while the frontier is non-empty: pop the first node on the frontier and check if it contains the goal state. If the most recently popped node does not contain the goal state, a for loop expands that node. (Adds a child to the frontier if the child state has not been reached, or if there is a new way to reach the state with a lower path cost.) Failure if the while loop reaches an empty frontier.
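A sketch of that loop (the problem interface, with initial, is_goal, and successors, is hypothetical; ordering the frontier by path cost makes this uniform cost search):
import heapq, itertools

def best_first_search(problem):
    counter = itertools.count()                    # tie-breaker so states are never compared
    frontier = [(0, next(counter), problem.initial, [])]
    reached = {problem.initial: 0}
    while frontier:
        cost, _, state, path = heapq.heappop(frontier)
        if problem.is_goal(state):                 # goal test when the node is popped
            return path
        for action, child, step_cost in problem.successors(state):
            new_cost = cost + step_cost
            if child not in reached or new_cost < reached[child]:
                reached[child] = new_cost          # new state, or a cheaper path to it
                heapq.heappush(frontier, (new_cost, next(counter), child, path + [action]))
    return None                                    # empty frontier: failure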
Infrastructure for Tree Search
State Sn: the current state of the search that a node n represents. Parent Pn: the parent node that generated n. Action An: the action from Pn to Sn. Path cost: g(n), the cost from root to n
Search States, Initial State and Actions
States: Belief state space ○ Every possible subset of the physical states P ○ Where |P| = N, there are 2^N belief states (every physical state is either in or out of a given belief state) Initial state: P in the absence of prior knowledge Actions: Two cases ○ All actions are safe: ○ Some actions cause disaster:
Hill climbing variants
Stochastic: choose at random from uphill moves. When would this improve over choosing the best-valued successor? Random-restart: trivially complete: if at first you don't succeed, try again. Where each search has a probability of success p, there is a high probability of success after 1/p trials. Works very well with few local maxima and few plateaux.
Depth-first search for CSPs with single-variable assignments is called backtracking search ● In CSP, variable assignments are commutative, meaning it does not matter what order the assignments are made in ○ [step 1: WA = red; step 2: NT = blue] = [step 1: NT = blue; step 2: WA = red] ● Given the commutativity of any sequence of assignments, the number of leaves for a CSP with n variables of domain size d is d^n ○ In other words, one solution path p of length n is equivalent to all n! permutations of p
T
Do not evaluate a branch ● From a MAX node, given a value v ≥ β ○ MIN will never select that MAX node ● From a MIN node, given a value v ≤ α ○ MAX will never select that MIN node
T
For games whose payoffs are not win/lose ([0,1]), the expectiminimax values of chance nodes must be a positive linear transformation of the expected utilities
T
Given an HMM λ: ● Computing the likelihood of a sequence of observations P(o1 ,o2 ,o3 ) relies on the forward algorithm ○ The trellis entries carry forward all paths to each state that can emit o i ○ The likelihood involves summing over combinations of state sequences that could produce a given observation sequence ● Computing the most likely sequence of states given the observations (decoding) P(Q1 ,Q2 ,Q3 |o1 ,o2 ,o3 ) relies on the Viterbi algorithm ○ The trellis entries carry forward all paths to each state that can emit o i ○ Decoding finds the single maximum probability state sequence
T
Prune below a MAX node when alpha ≥ beta of its (MIN) ancestors ○ MAX nodes update alpha based on children's returned values ○ MIN at MAX's parent node will choose the action leading to beta ● Prune below a MIN node when beta ≤ alpha of its (MAX) ancestors ○ MIN nodes update beta based on children's returned values ○ MAX at MIN's parent node will choose the action leading to alpha
T
Reliance on independence and conditional independence reduces the number of relevant cases to consider, relative to the full joint probability distribution
T
Statistical modeling assumes even though x is observed, the values could have been different
T
The backward probability βt (i) is symmetrical to αt (i) in the Forward Algorithm
T
The belief states resulting from an action a in belief state b and the observations o resulting from the resulting possible percepts
T
The expression ξ t (i,j) can be formulated as the joint probability of the states at i and j conditioned on our observation sequences and the parameters:
T
● NB can be used to "classify" even when there is no causal relationship
T
● Random variables take on values in an experiment, e.g., a set of measurements
T
Sensor Markov Assumption
The agent's observations or evidence Et at time t depend only on the state Xt at time t
Solving Sensorless Problems with Search
The belief state-space can become too large for efficient search. Methods to handle search in belief state spaces ○ Prune the belief space: e.g., if the belief state space at node ni is a superset of the belief state space at node nj , then prune node ni ○ Use a more compact representation of belief ○ Incremental search: ■ A solution to an initial belief state S that contains {s1 , s2 , . . . sn } must work for each state s i ∈ S ■ So, find a solution to s1 ; test the solution for each next state; iterate
Most General Unifier (MGU)
The first unifier is more general than the second (less restrictive) ● There is a single most general unifier (MGU) that is unique up to renaming of variables
Syntax ctd
The nodes and edges represent the topology (or structure) of the network ● In the simplest case, the conditional distribution of each random variable is represented as a conditional probability table (CPT) ● A CPT gives the distribution over Xi for each combination of parent values
To-Move(s):
The player whose turn it is to move in state s
Syntax of Propositional Logic
The proposition symbols P1 , P2 etc are sentences ● If S is a sentence, ¬S is a sentence (negation) ● If S1 and S2 are sentences, S1 ∧ S2 is a sentence (conjunction) ● If S1 and S2 are sentences, S1 ∨ S2 is a sentence (disjunction) ● If S1 and S2 are sentences, S1 ⇒ S2 is a sentence (implication) ● If S1 and S2 are sentences, S1 ⇔ S2 is a sentence (biconditional)
Actions(s):
The set of legal moves in state s
Maximum Likelihood Estimation
The statistical inference problem is to estimate the parameters θ given observations x i for the classes y j ● Maximum likelihood estimation (MLE) chooses the estimate θ* that maximizes the likelihood function, i.e.: ● That is, the MLE parameters are those that maximize the probability of the observed data distribution in the classes
Result(a, s):
The transition model defining the result of taking action a in state s
Latent Variable Models
The validity of the hidden variable (e.g., part-of-speech tag; disease) depends on empirical evidence ○ Explanatory power of the hidden variable for multiple questions ○ Ability of trained judges to agree on its presence or absence ● Given that it can be difficult to get (enough) labeled data, EM can be used to estimate the model parameters
Value of Perfect Information (VPI)
The value of discovering Ej is the average, over all possible values ej using the current belief state, of the expected utility of the best action given ej, less the expected utility of the best action without the information
Parameter Re-estimation
Three sets of HMM parameters need to be iteratively estimated ○ Initial state distribution: πi ○ Transition probabilities: a i,j ○ Emission probabilities: bi (ot )
Paths with repeated states are non-optimal
Three solutions: 1.)Update a list of reached/visited states: practical when the set of all states fits easily in memory (aka graph search) 2.)Ignore: practical when the likelihood of revisiting a state is very low (tree-like search) 3.)Compromise and check for cycles for a limited number of steps (parent, grandparent): keeps the additional memory cost constant
Evaluating algorithms
Time Complexity, Completeness, Space Complexity, Cost Optimality
Benefits of Memoization
Trades complexity of a function for complexity of a lookup. When a memoized function is evaluated, result is stored in a memoization cache. Calling a recursive function is much faster if memoized. Otherwise, python recursion is very slow
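A minimal sketch; the manual cache and functools.lru_cache do the same job:
from functools import lru_cache

memo = {}
def fib(n):
    if n in memo:                  # result already in the memoization cache
        return memo[n]
    result = n if n < 2 else fib(n - 1) + fib(n - 2)
    memo[n] = result               # store the result for later lookups
    return result

@lru_cache(maxsize=None)           # built-in memoization decorator
def fib2(n):
    return n if n < 2 else fib2(n - 1) + fib2(n - 2)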
HMMS: Reasoning about Unobserved States
Transition model: How states change over time ● Sensor model: How the state at a given time t affects what evidence the agent perceives at t ● The distribution of the random variable Xt of states, and the random variable Et of evidence form the agent's belief state
Model-based reflex Agent
Transition model: what can happen Sensor model: what the state might be, given a percept Agent's next action depends on transition model + sensor model
Optimality of A*
Tree search version is optimal if h(n) is admissible. Graph search version is optimal if h(n) is consistent. Lemma: A* expands nodes on frontier in order of increasing f. Gradually adds f contours of nodes. A* is a variant of UCS
A PEAS specification of the task environment provides a way for the designer to determine if the pre-conditions for a rational agent are met
True
A function can be returned as the value of another function
True
A node has been expanded if its children have been identified
True
A problem with fewer restrictions on the actions than the original is called a relaxed problem
True
A search node is a data structure used during search
True
A solution is optimal if no solution has a lower path cost.
True
A state represents a physical configuration
True
AI exploits various types of computing methods
True
AI relies on any combination of Heterogeneous Technologies
True
AI solves problems rationally
True
Agent= architecture +program
True
As input approaches infinity O(n) is necessarily better than O(n^2)
True
Assume both players play optimally ● Max prefers next state to have maximum value ● Min prefers next state to have minimum value
True
Belief update is central to agents that operate in partially observable worlds
True
Big O ignores constant multiplicative factors
True
Can Store functions in data structures
True
Can assign functions to variables
True
Default Parameter values are evaluated once when the def statement they belong to is first executed
True
For smaller input sizes, depending on the algorithm O(n^2) could be better
True
Frontier can be represented as a queue
True
Functions can be passed as arguments to other functions
True
Graph search algorithm is the same as tree search with the addition of the explored set.
True
Minimax serves as the basis for the mathematical analysis of games
True
Most real-world problems involve partial knowledge of the state of the world
True
Precedence: *args must precede **kwargs; a specific (named) argument must precede *args
True
The rate of growth of runtime is measured relative to the input size
True
Search in non-deterministic worlds must consider alternative outcomes
True
The cost of an optimal solution to a relaxed problem is an admissible heuristic for the original problem
True
The function uses the same mutable object each call in the recursion.
True
The search() method is a generator that finds all the nodes that can be reached from a given node
True
To define a subclass, put the name of the superclass in parentheses after the subclass's name on the first line of the definition
True
To redefine a method inherited from the parent class, add a new definition of the same name to the subclass. The object class determines which definition to use. If object is in parent class then parent method is used. If object is in subclass then subclass method is used
True
Unlike Java, a Python function is specified by its name alone
True
User can also give a class the ability to use [] notation like an array or () notation like a function call
True
User can specify class-specific behavior for comparison operators
True
8-Puzzle cannot be solved if the environment is non-observable/sensorless
True
O() notation summarizes the large scale performance
True; Easier to use than assessing the actual number of operations. Less precise than the alternative
Nodes that have been generated but not expanded are referred to as the frontier.
True; The frontier is used to guide the direction of search.
Machine Learning in General
Types of ML ○ Replicate a pattern given by a supervision signal (supervised learning) ○ Discover new patterns (unsupervised learning) ■ Infer a pattern given an indirect supervision signal (semi-supervised) ○ Learn through trial and error by interacting with the environment and receiving a reinforcement signal (reward) ● Supervised machine learning types: ○ Classification (e.g., Naive Bayes) ○ Regression ○ Sequence prediction
Tips about Quantifiers
Typically, the connective in a universally quantified sentence is ⇒ ○ Everyone in CMPSC 442 is smart: ∀x In(x, CMPSC 442) ⇒ Smart(x) ○ In contrast: ∀x In(x, CMPSC 442) ∧ Smart(x) means: Everyone is in CMPSC 442 and everyone is smart ● Typically, the connective in an existentially quantified sentence is ∧ ○ Someone in CMPSC 442 is smart: ∃x In(x, CMPSC 442) ∧ Smart(x) ○ In contrast: ∃x In(x, CMPSC 442) ⇒Smart(x) means: Anyone taking CMPSC 442 is smart (possibly no one)
Learning the HMM Parameters
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ such that P(O | λ) is maximized ● It is possible to find a local maximum ● Theorem: Given an initial model λ, we can always find a model λ′ such that P(O | λ′) ≥ P(O | λ)
MAC Algorithm: Maintaining Arc Consistency
Unlike AC-3, forward checking does not recursively propagate constraints when domains of variables are changed ● Solution: combine AC-3 and forward checking ○ After making an assignment to X_i ■ Form the set Y of all arcs (X_j, X_i) for the unassigned neighbors X_j of X_i ■ Call AC-3 on Y
Least Constraining Value Heuristic
Used for ordering the domain values (see backtracking pseudo-code) ● Intuition: ensure maximum flexibility for remaining assignments
Empirical gradient methods
Used when the equation grad(f) = 0 has no closed form solution. Search progress depends on comparing the values of the objective function f for the current state x and the successor state x'. Progress is measured by the change in value of f.
Evolutionary Algorithms
Variants of stochastic beam search, natural selection as a metaphor. Many varieties
Back Tracing
Viterbi recursion computes the maximum probability path to state j at time T given the observations o_1, . . . , o_T ● Viterbi must also identify the complete single path that gives this maximum probability ○ Keep backpointers ○ Find the most probable final state at time T ○ Trace backpointers from that state at time T to find the state sequence from T back to 1
Viterbi Recursion
Viterbi recursion computes the maximum probability path to state j at time t given the observation o1 , . . . , ot
A solution to a non-deterministic problem assumes that at execution time, the agent's percepts can resolve the outcome of the action
What if the agent's percepts do not provide enough information? ○ The environment is partially observable or not observable ● Sensorless (or conformant) problems: states are not observable ○ Sensors can be time consuming, unreliable, or can suffer damage ○ Example: in manufacturing, placing parts in the correct location could rely on constraints/physics rather than sensing ○ Example: in medicine, a broad-spectrum antibiotic can treat many infections, so there is no need to wait for test results to identify the pathogen
Belief States
When actions are non-deterministic, the agent must maintain a state representation that includes all the possible action outcomes. A state representation with alternative states that might exist is referred to as a belief state.
AND node
When an action has alternative outcomes, the search tree must consider the paths from s to each possible outcome s_i, s_i+1, and so on
Local search
When path to goal doesn't matter. Iterate (Remember only current state, move to best neighboring state; forget the past). Idea: Incrementally improve an initial guess.
Yield
When we call next(), the method runs until it encounters a yield statement, and then it returns the value that was yielded. On the next call, next() resumes the data stream object where it left off.
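A small sketch of a generator using yield; countdown is a hypothetical example function:

    def countdown(n):
        # Execution pauses at each yield and resumes here on the next call to next()
        while n > 0:
            yield n
            n -= 1

    gen = countdown(3)
    print(next(gen))   # 3
    print(next(gen))   # 2
    print(list(gen))   # [1] -- the remaining values in the data stream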
When to Prune:
Whenever Alpha ≥ Beta
Inference during Search, using Forward Checking
Whenever a variable X_i is assigned, for each variable X_j connected to X_i by a constraint, delete from the domain of X_j any value that is inconsistent with the value chosen for X_i
Minimax Algorithm Semi-Formally
While game not over: 1. Start with the current position as a MAX node 2. Expand the game tree a fixed number of ply (amount of look-ahead) 3. Apply the evaluation function to the leaf positions at look-ahead depth 4. Back-up the values all the way to the root 5. Pick the move assigned to MAX at the root 6. Wait for MIN to respond
Performance of Naïve Bayes for Text Classification
Words in running text are not independent, so application of NB for text classification is naïve ● Nevertheless, NB works well on many text classification tasks ○ The observed probability may be quite different from the NB estimate ○ The NB classification can be correct even if the probabilities are incorrect, as long as it gets the relative conditional frequency of each class value correct
Frequently used magic methods
__init__ , __len__, __copy__ , etc.
**kwargs
a dict of keyword arguments, unspecified length
Profile
a set of statistics that describes how often and for how long various parts of the program executed
Monte Carlo
are based on computing an expectation from repeated random simulations, i.e., chance
A rational agent is designed to achieve the_______ outcome, where the best is relative to the explicit performance criteria
best
Iterable
can be iterated over. All sequences are iterable. Has an __iter__ method, which returns an iterator object
Sequences
can be mutable or immutable; have similar syntax; are iterable
Percept Sequences and actions
can be organized in a table; the table can be restricted to a finite one by restricting the length of the percept sequences
Acceptance probability (ap)
close to 1 if the new solution is better; decreases as the new solution is increasingly worse; decreases as T decreases when cost(new) > cost(old). ap = e^((cost(old) - cost(new))/T). Can accept mildly bad but not terrible next moves. Accepts any bad jumps earlier rather than later (see the sketch below).
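A minimal sketch of the acceptance test in Python; the function name accept and the example costs and temperatures are illustrative assumptions:

    import math
    import random

    def accept(cost_old, cost_new, T):
        # Better (or equal) solutions are always accepted
        if cost_new <= cost_old:
            return True
        # Worse solutions are accepted with probability ap = e^((cost(old) - cost(new)) / T)
        ap = math.exp((cost_old - cost_new) / T)
        return random.random() < ap

    print(accept(10.0, 11.0, T=5.0))   # mildly worse move at high T: often accepted
    print(accept(10.0, 50.0, T=0.1))   # terrible move at low T: almost never accepted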
Probability Density Function (PDF)
for a continuous random variable, the probability for the value to be within some range
Cumulative Distribution Function (CDF)
for a continuous random variable, the probability of having a value ≤ n
Probability Mass Function (PMF)
for a discrete random variable, the probability of each value
Generators
functions that evaluate to next item in an iterator object, with a yield keyword
An agent
is a function that perceives and acts
range()
is its own class of immutable, iterable objects. Attributes: start, stop, step. Methods: count, index. A range object can be turned into an iterator
Operator overloading
is possible using special methods on various classes
Operations on frontier
isEmpty(): tests if frontier is empty Top(): returns first node on frontier Pop(): removes first node on frontier and returns it Add: Inserts a node into the frontier
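A minimal sketch of a priority-queue frontier in Python using heapq, ordered by path cost as in UCS; the class and method names mirror the operations above but are otherwise an assumption:

    import heapq
    import itertools

    class Frontier:
        def __init__(self):
            self._heap = []
            self._count = itertools.count()  # tie-breaker for equal costs

        def is_empty(self):
            return not self._heap

        def top(self):
            # Return (without removing) the node with the lowest cost
            return self._heap[0][2]

        def pop(self):
            # Remove and return the node with the lowest cost
            return heapq.heappop(self._heap)[2]

        def add(self, node, cost):
            heapq.heappush(self._heap, (cost, next(self._count), node))

    f = Frontier()
    f.add("A", 3); f.add("B", 1); f.add("C", 2)
    print(f.pop(), f.pop(), f.pop())  # B C A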
reduce
iteratively applies func to an accumulated value and the next member of a sequence, folding the sequence into a single value (see the example below)
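A short example of functools.reduce; the lambdas and data are illustrative:

    from functools import reduce

    # func is applied to an accumulator and the next item, folding the sequence into one value
    total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])   # ((1 + 2) + 3) + 4 = 10
    longest = reduce(lambda a, b: a if len(a) >= len(b) else b, ["cat", "moose", "ox"])
    print(total, longest)   # 10 moose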
Keyword Arguments
means you do not have to remember the linear order of args to a function, but you do need to know the keyword names.
Cromwell's Rule
In Bayesian approaches ● Use of prior probabilities of 0 or 1 should be avoided ● Because: if the prior probability is 0 or 1, then so is the posterior, and no evidence can be considered
Simple Search Problems:
single agent, episodic, fully observable, deterministic, static, discrete, known
Optimal Solution
the solution with the lowest path cost (with unit step costs, the smallest number of actions)
Problem Formulation determines:
the combinatorics of the search space; the efficiency/complexity of the search algorithm
Search is a way to solve problems when
to achieve its goal, the agent needs to execute a sequence of actions, and must look ahead to choose among multiple possible actions at the next step. The state of the world is represented atomically. (discrete, no internal structure)
Python uses a ___________system 0 to len(sequence)-1
zero-based indexing
Recap: Optimal Quantities
▪ The utility of a state s: U*(s) = expected utility starting in s and acting optimally ▪ The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally ▪ The optimal policy: π*(s) = optimal action from state s
● For chance nodes, e.g., C:
○ Can be pruned if we can find the upper bound of its value ○ Consider the bounds on the utility function, e.g., [-1, 2] ○ Then there are upper and lower bounds on the expectation at C
8-Puzzle can be solved if just one cell c i can be observed:
○ If cell c_i is empty, move an adjacent tile into c_i and observe its value v_i ○ Else record the value v_i of the tile in c_i ○ For every successor belief state s_k ■ Keep track of the new location of v_i in s_k ■ Every time a new tile is moved into c_i, observe its value v_j
Degrees of belief
○ P(A) = 1: Agent completely believes A is true. ○ P(A) = 0: Agent completely believes A is false. ○ 0.5 < P(A) < 1: Agent believes A is more likely to be true than false.
Probability of a Sentence S: Unigram Version
● A crucial step in speech recognition, language modeling, etc. ● First guess: product of unigram probabilities: P(S) ≈ P(w_1)P(w_2) · · · P(w_n) ● Given a lattice for word hypotheses in a 3-word sequence, the above formulation is not quite right
Models
● A logical model m is a formally structured world with respect to which truth can be evaluated ○ m is a model of a sentence α if α is true in m ○ M(α) is the set of all models of α ● Entailment: ○ KB ⊨ α iff M(KB) ⊆ M(α)
A Perceptron is a Classifier
● A neural net: neurons linked by directed arcs ● A sigmoid perceptron = logistic regression classifier
Decision Trees Learn Decision Rules
● A path from root to leaf is a decision rule ○ EG: If Patrons == none → No ● This tree has 13 leaves (paths, or decision rules) ○ The root attribute (Patrons) has 3 values ■ None (n=2) ■ Some (n=4) ■ Full (n=6) ● A tree with fewer paths would be more compact (simpler), thus preferred
Values of Random Variables
● A random variable V can take on one of a set of different values ○ Each value has an associated probability ○ The value of V at a particular time is subject to random variation ○ Discrete random variables have a discrete (often finite) range of values ○ Domain values must be exhaustive and mutually exclusive ● For us, random variables will have a discrete, countable (usually finite) domain of arbitrary values ○ Here we will use categorical or Boolean variables
Markov Decision Processes: Decisions over Time
● A set of states s ∈ S ● A set of actions a ∈ A ● A transition function T(s, a, s') ○ Probability that a from s leads to s', i.e., P(s'| s, a) ○ Also called the model or the dynamics ● A reward function R(s, a, s') ○ The reward function (per time step) ○ Figures into the agent's utility function (over time) ● A start state s0 ● Possibly a terminal state
Tree-structured CSPs
● A tree-structured CSP can be solved in time linear in the # of variables ● A constraint graph is a tree when any pair of variables is connected by only one path ● Do a topological sort of the CSP graph to create a tree ○ A linear ordering of the nodes where for every directed edge Xi , Xj , Xi precedes Xj
Bias Variance Tradeoff
● AIMA defines bias in terms of the selected hypothesis space (e.g., linear functions versus sinusoidal) ● AIMA defines variance as arising from the choice of training data (variance across possible training sets)
Conceptual Basis for Decision Theoretic Agent
● Ability to reason about an uncertain world ○ Probabilistic models of agent's beliefs ○ Factored state representations ● Ability to reason about conflicting goals ○ Axioms of utility: constraints on a rational agent's preferences ○ Decision networks: nodes for belief states, actions, utilities ○ Value of information in different settings
Policy Iteration
● Alternative approach for optimal values: ○ Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence ○ Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values ○ Repeat steps until policy converges ● This is policy iteration ○ It's still optimal! ○ Can converge (much) faster under some conditions
Markov Chain versus HMM
● An HMM is a non-deterministic Markov Chain: cannot uniquely identify a state sequence ● States are partially observed (sensor model)
Baum-Welch
● An example of Expectation Maximization ● E-Step: Compute expected values of the states j at times t using γ_t(j) and of the transitions i,j from t to t+1 using ξ_t(i,j) ● M-Step: From these expected values, compute new estimates of the transition parameters a_i,j and emission parameters b_i(o_t) ● Iterate until convergence
Utilities for a Fixed Policy
● Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy ● Define the utility of a state s, under a fixed policy π: U^π(s) = expected total discounted rewards starting in s and following π ● Recursive relation (one-step look-ahead / Bellman equation) using a fixed policy: U^π(s) = Σ_s' P(s' | s, π(s)) [R(s, π(s), s') + γ U^π(s')]
Issues with Chess Lead to Hybrid Approaches
● Apply Eval() only to quiescent positions, where there is no pending move that shifts the game wildly (e.g., capturing the queen) ● ProbCut, a probabilistic cut algorithm (Buro, 1995) ○ Uses forward pruning with Alpha-Beta ○ Estimates the probability that a node can be safely pruned based on statistical knowledge of game states ● Table lookup for openings and endgames, which have fewer variations
Maximum a Posteriori Decision Rule (MAP)
● Approximately Bayesian ● Foundation for Naïve Bayes classifiers ● Find the most probable hypothesis h_i, given the data d: h_MAP = argmax_h P(h | d) = argmax_h P(d | h) P(h)
How do We Apply Cromwell's Rule
● Assume we know how many types never occur in the data ● Steal probability mass from types that occur at least once ● Distribute this probability mass over the types that never occur
Efficient Model Checking Algorithms for PL
● Backward chaining (Horn Clauses) ● Forward chaining (Horn Clauses) ● DPLL Algorithm (Davis, Putnam, Logemann, Loveland) ○ Efficient and complete backtracking ○ Can efficiently handle tens of millions of variables ○ Applications include hardware verification ● WalkSAT ○ Local search, thus very efficient ○ Incomplete
Evaluation functions for board position: f(n)
● Based on static features of that board alone ● Zero-sum assumption lets us use one function to describe goodness for both players ○ f(n) > 0 if MAX is winning in position n ○ f(n) = 0 if position n is tied ○ f(n) < 0 if MIN is winning in position n ● Define using expert knowledge
Batch Gradient Descent for Univariate Case
● Batch: sum over all data points (one epoch) ● Translates into one update rule for each weight (see the sketch below) ○ α is the learning rate, with the 2 folded into α ○ Guaranteed to converge if α is small ○ Increasingly slow as N increases
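A minimal sketch of one possible batch gradient descent loop for the univariate model h(x) = w0 + w1·x; the data, learning rate, and epoch count are illustrative assumptions (here the 1/N normalization is folded into the effective step size):

    def batch_gd(xs, ys, alpha=0.1, epochs=1000):
        # Univariate linear regression h(x) = w0 + w1*x trained by batch gradient descent
        w0, w1 = 0.0, 0.0
        n = len(xs)
        for _ in range(epochs):
            # One epoch: accumulate the error over all data points, then update once
            err = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
            w0 -= alpha * sum(err) / n
            w1 -= alpha * sum(e * x for e, x in zip(err, xs)) / n
        return w0, w1

    xs = [0, 1, 2, 3, 4]
    ys = [1, 3, 5, 7, 9]        # generated by y = 2x + 1
    print(batch_gd(xs, ys))     # weights close to (1.0, 2.0)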
Baum-Welch
● Baum-Welch algorithm uses Expectation Maximization to iteratively re-estimate the parameters, yielding a new model λ′ at each iteration ○ Initializes λ to a random set of values, then for each iteration: ○ Calculates the forward probabilities α (left to right) and the backward probabilities β (right to left) to get the probability of the states i and j at times t and t+1 ■ Every state transition from t to t+1 occurs as part of a sequence ■ For all transitions, we compute the forward probability from the sequence start to t, and the backward probability from the sequence end back to t+1 ○ Re-estimates the transition and emission parameters ● Requires an algorithm for backward computation through the trellis
Reasoning about Cause and Effect
● Bayes' Rule provides a way to reason from causes to effects ● Note that normalization of probabilities to sum to one means only two kinds of knowledge are needed ○ Prior probability of the cause P(c) ○ Likelihood of the effect given the cause P(e|c)
Handling Uncertainty over Time
● Builds on search in partially observable worlds ○ Belief states + transition model define how agent predicts how the world might be at each next time step ○ Sensor model defines how to update the belief state ● Probability is used to quantify degrees of belief in elements of the belief state ● Time is handled by considering a set of random variables at each next point in time
Motivation for Dynamic Programming
● Calculation of the likelihood of the observations ○ Sum the probabilities of all possible state sequences in the HMM ○ The probability of each state sequence is the product of the state transition and emission probabilities ● Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences. ○ For T=10 and N=10, 10 billion different paths! ● Solution: linear time dynamic programming ○ DP: uses a table (trellis) to store intermediate computations
Non-descendants Property
● Capturing conditional independence relations where the conditional probabilities at a node (random variable) capture the dependence on the parent nodes, and independence from all other nodes ● A random variable is conditionally independent of its non-descendants, given its parents
Chain Rule to Construct Bayesian Networks
● Chain rule: a joint probability can be expressed as a product of conditional probabilities in the illustrated way ● Given the global semantics from the preceding slide, this gives a general assertion about variables in the network, and a construction procedure ● Generalization of Naive Bayes
Finding a Hypothesis Function for a Dataset
● Choose a model (meaning type of model) ○ In this context, choosing a model means choosing a hypothesis space, e.g., linear function, polynomial function ○ In other contexts, model can mean model + hyperparameters (degree-2 polynomial), or a specific model (e.g., y = 5x² + 3x + 2) ● Optimize (or train the model) ○ Find the best hypothesis (instantiated model) ○ Training relies on a training set and a smaller validation (or dev) set for developing the model
Cycle Cutset Algorithm
● Choose some cutset S ● For each possible assignment to the variables in S that satisfies all constraints on S ○ Remove any values from the domains of the remaining variables that are not consistent with S ○ If the remaining CSP has a solution, then you are done ● For graph size n, domain size d ● Time complexity for a cycle cutset of size c: ○ O(d^c · d^2(n−c)) = O(d^(c+2)(n−c))
Generalizing Bayes' Rule
● Conditionalize Bayes' rule on background evidence e ● The evidence e can stand for the set of other variables in a joint probability distribution besides X and Y
FOL Vocabulary
● Constants: Richard, John, 2, . . . ● Connectives: ¬ ∧∨⇒⇔ ● Variables x, y, a, b,... ● Predicates: True/1, False/1, Person/1, >/2, give/3, sell/4. . . ○ Person(John) ○ KingOf(John, a) ● Equality (a special predicate) ● Functions: Sqrt, LeftLegOf, . . . ● Quantifiers: ∀, ∃
Decision Theory
● Decision Theory develops methods to make optimal decisions in the presence of uncertainty ● Decision Theory = utility theory + probability theory ● Utility theory is used to represent and infer preferences ○ Every state has a degree of usefulness ○ An agent is rational if and only if it chooses an action A that yields the maximum expected utility (expected usefulness)
Backward Chaining
● Depth-first recursive proof search: space is linear in size of proof ● Incomplete due to infinite loops ○ Fix: checking current goal against every goal on stack ● Inefficient due to repeated subgoals (both success and failure) ○ Fix: use caching of previous results (extra space) ● Widely used for logic programming
Extension to Multivariate Case
● Each example x_j is an n-dimensional vector ● The linear equation sums over all x_j,i and adds a bias weight ● The weights are therefore an (n+1)-dimensional vector, so we define a dummy input attribute x_j,0 = 1
Loss Functions: An Objective to Minimize
● Error rate can be due to different error types (e.g., one class) ● A loss function can be used as a training objective to minimize error for all classes ● Most generally, loss should take x into account ● Usually, x is ignored: the loss is a function L(y, ŷ) of the true value y and the prediction ŷ only
Learning the Multivariate Regression
● Essentially the same update rule ● Need to regularize in the multivariate case
Evaluation Functions for H-Minimax
● Estimation of the expected utility of state s to player p ○ If Is-terminal(s), Eval(s, p) = Utility(s, p) ○ Else Utility(loss, p) ≤ Eval(s, p) ≤ Utility(win, p) ● Evaluation functions should be ○ Fast (Heuristic Alpha-Beta intended to improve Alpha-Beta performance) ○ Informed by expert knowledge ○ Often based on features that form equivalence classes of game positions ■ For a given class, experience may indicate the proportion of times games end in win (utility=1.0), lose (utility=0) or draw (utility=0.5) ■ Then use expected value: e.g., if feature A leads to win 82% of the time, loss 2%, and draw 16%, expected value = 82% x 1 + 16% x 0.5 = 0.90
Policy Iteration ctd
● Evaluation: For fixed current policy π, find values with policy evaluation: ○ Iterate until values converge: U_k+1^π(s) ← Σ_s' P(s' | s, π(s)) [R(s, π(s), s') + γ U_k^π(s')] ● Improvement: For fixed values, get a better policy using policy extraction ○ One-step look-ahead: π_new(s) = argmax_a Σ_s' P(s' | s, a) [R(s, a, s') + γ U^π(s')]
Reduction to Propositional Inference
● Every FOL KB can be propositionalized so as to preserve entailment ○ A ground sentence α is entailed by new KB' iff entailed by original KB ● Idea: propositionalize KB and query, apply resolution, return result ● Problem: with function symbols, an infinite number of ground terms can be generated
Probabilities of Elementary Events
● Every ω_i ∈ Ω is assigned a probability (elementary event in the sample space) P(ω_i) ○ 0 ≤ P(ω_i) ≤ 1 ● Assuming Ω is finite (ω_1, ..., ω_n), we require ○ P(Ω) = Σ_i P(ω_i) = 1
Stochastic Games
● Examples of stochastic games ○ Backgammon: includes rolls of the dice ● To extend minimax to handle chance: ○ The search tree must include a new ply for chance nodes (green circles) after every MAX or MIN node ○ Minimax(n) has to include an expectation of the value of a position, taking into account the probabilities of the chance events from green nodes
Another View of Bias Variance Tradeoff
● Expected prediction error (EPE) for a new observation with value x is given by: EPE(x) = σ² + bias² + estimation variance ● σ² is the irreducible error (noise) apart from bias and estimation variance ● Bias is the result of misspecifying the statistical model f ● Estimation variance is the result of using a sample to estimate f ● Modeling goals: ○ Explanatory modeling attempts to minimize bias, meaning to find the same theoretical explanation for some phenomenon, e.g., across categories of datasets ○ Predictive modeling aims for empirical precision (minimize bias and estimation variance)
Backtracking Search Heuristics
● Exploits domain-independent heuristics (in contrast to the domain-dependent heuristics of informed search algorithms) ● Demonstrates the advantages of a factored state representation ● Four kinds of heuristics ○ Which variable to assign next: Select-Unassigned-Variable() ○ What inferences to perform at each step: Inference() ○ How far to backtrack: Backtrack() ○ When to save and re-use partial results
Correspondence of FOL and Natural Language
● FOL expressivity is closer than PL to natural language ○ Objects denote real-world entities, which can be referred to with noun phrases ○ Logical relations correspond to real-world relations, which can be expressed as adjectives and verbs ● FOL statements are context independent and unambiguous, while natural language phrases are context-dependent and ambiguous ○ Two FOL statements can have different forms and the identical semantic interpretation ○ Natural language statements and meanings are many-to-many ○ Natural language meaning is broader than a way to encapsulate "knowledge" (opinions/attitudes/social conventions/bias . . .)
Assertions and Queries
● FOL statements (assertions) can be added to a KB ○ Same as in PL ○ TELL(KB, Brother(Richard,John)) ● Two types of queries can be made ○ ASK(KB, saturated statement) returns true or false, depending on truth evaluation of statement (must not have unbound variables) ○ ASKVARS(KB, unsaturated statement) returns bindings for the unbound variables ∀, ∃ ■ AskVars(KB, ∀x evil(x)) ■ AskVars(KB, ∃x evil(x))
Types of Neural Networks
● Feedforward network: a directed acyclic graph ○ Information propagates in one direction ○ Output is a function of the input ● Recurrent network: has cycles ○ Outputs can recur as inputs ○ Output is a function of the initial state, dependent on previous inputs ○ Dependence on earlier inputs amounts to short-term memory ● Single layer versus multi-layer
What is Markov about MDPs?
● For Markov decision processes, action outcomes depend only on the current state: P(S_t+1 = s' | S_t, A_t, S_t-1, A_t-1, ...) = P(S_t+1 = s' | S_t, A_t) ● This is like search: ○ Search: the successor function uses the current state to return successors ○ MDP: the successor state depends only on the current state and action, via the transition model, together with the reward
Forward Checking versus Backjumping
● Forward checking can build the conflict set ○ When forward checking from an assignment X = v deletes v from the domain of Y, add X = v to the conflict set for Y ○ If the last value is deleted from the domain of Y, then the assignments in the conflict set for Y are added to the conflict set of X (since the assignment X = v leads to a contradiction in Y, try a new assignment for X) ● Notice that backjumping finds the same conflicts that forward checking does
Unsupervised Clustering of Continuous Data
● Gaussian distribution ○ Many things in nature, e.g., attributes of Iris species: sepal length, sepal width, petal length, petal width ○ Iris-Setosa dataset ● Mixture of Gaussians ○ Three species of iris ○ A Gaussian Mixture Model can identify three distinct clusters in an unsupervised manner using EM
Resolution
● Generalization of unit resolution ● Two clauses can be combined to produce a new clause as follows ○ If the first clause contains a literal a, and the second clause contains its negation ¬a ○ Then inference step is to produce a new single clause that includes all the literals from both clauses except a and ¬a
Supervised Statistical Machine Learning
● Given a set of independent and identically distributed (i.i.d.) training examples (x^(1), y^(1)), . . ., (x^(N), y^(N)) ○ Assume each pair was generated by an unknown function y = f(x) ○ Discover a function y = h(x) where h approximates f ○ The labeled examples (x^(j), y^(j)) represent the ground truth ● Test the generalization ability of h on labeled examples that are not in the training set
Seismic Data Can be Classified
● Given the weight vector for the seismic data, and the 2-D vectors of examples, a classification decision boundary can be operationalized as a threshold on the weighted sum w · x
Performance of Alpha-Beta Pruning
● Guaranteed to compute the same root value as Minimax ○ Recall: the root value tells MAX which action to take ● Worst case complexity: no pruning, same as Minimax, O(b^d) ● Best case complexity: when each player's best move is the first option examined, examines only O(b^(d/2)) nodes, allowing search twice as deep!
Heuristic Alpha-Beta
● H-Minimax function will treat non-terminal nodes as if they were terminal ○ Replace the terminal test with a cutoff test (e.g., use iterative deepening) ○ Replace the utility function with a heuristic evaluation function
Linear Classification with Logistic Regression
● Hard threshold decision rule is non-differentiable, cannot use SGD ● Replacing hard threshold with sigmoid function gives a differentiable decision function ● Because the values of the sigmoid function are in [0,1] they are interpreted as the probability of the class for each example
Two Layer XOR Gate
● Hidden unit in the middle with a threshold of 1.5 goes on only if both inputs are 1 ● The three weights on the inputs to the final output ensure: ○ If input is 1, 1 then sum of weights is 0 ○ If input is 1, 0 then sum of weights is 1 ○ If input is 0, 1 then sum of weights is 1 ○ If input is 0, 0 then sum of weights is 0
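A minimal sketch of the two-layer network described above, with hard-threshold units; the weights and thresholds follow the description (the hidden unit is an AND gate), the rest is illustrative:

    def step(weighted_input, threshold):
        # Hard-threshold unit: fires (1) iff the weighted input reaches the threshold
        return 1 if weighted_input >= threshold else 0

    def xor_net(x1, x2):
        # Hidden unit with threshold 1.5: fires only if both inputs are 1 (an AND gate)
        hidden = step(1.0 * x1 + 1.0 * x2, threshold=1.5)
        # Output unit: weights +1, +1 on the inputs and -2 on the hidden unit
        return step(1.0 * x1 + 1.0 * x2 - 2.0 * hidden, threshold=0.5)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0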
Extensions to Bayesian Networks
● Hidden variables ● Decision (action) variables ● Random variables with continuous distributions ● "Plate" models ○ Latent Dirichlet Allocation (LDA), a generative statistical model that allows sets of observations to be explained by unobserved groups ● Dynamical Belief Nets (DBNs): Change over time ○ Hidden Markov Models (HMMs): a special case of DBNs in which the entire state of the world is represented by a single hidden state variable
Policy Evaluation
● How do we calculate the U's for a fixed policy π? ● Idea 1: Turn recursive Bellman equations into updates (like value iteration) ● Efficiency: O(s^2 ) per iteration ● Idea 2: Without the maxes, the Bellman equations are just a linear system ○ Solve with your favorite linear system solver
Convergence*
● How do we know the V_k vectors are going to converge? ● Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values ● Case 2: If the discount is less than 1 ○ For any state, V_k and V_k+1 can be viewed as depth-(k+1) results computed over nearly identical search trees ○ The difference is that on the bottom layer, V_k+1 has actual rewards while V_k has zeros ○ The last layer is at best all R_MAX ○ It is at worst R_MIN ○ But everything is discounted by γ^k that far out ○ V_k and V_k+1 are at most γ^k R_MAX different ○ So as k increases, the values converge
Discounting ctd
● How to discount? ○ Each time we descend a level, we multiply in the discount once ● Why discount? ○ Sooner rewards probably do have higher utility than later rewards ○ Also helps our algorithms converge ● Example: discount of 0.5 ○ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 ○ U([1,2,3]) < U([3,2,1])
Unit Resolution: A First Step towards Completeness
● If (a ∨ ¬b) ∧ ¬a, then ¬b ● A disjunction of literals conjoined with the negation of one of the disjuncts proves the other disjunct ● Called unit resolution because one of the two inputs is a single literal (e.g., c), i.e., a "unit clause", which is resolved with another clause (e.g., (¬c ∨ d))
A CSP can easily be expressed as a search problem ○ Initial State: the empty assignment {} ○ Successor function: Assign a value to any unassigned variable provided that there is not a constraint conflict ○ Goal test: the current assignment is complete ○ Path cost: a constant cost for every step
● If a solution exists, it is necessarily at depth n, given n variables ○ Depth First Search can be used
Summary of pruning
● If opponent's actions from node m or m' are better for Player than those from node n, the Player will never allow the game to proceed to n
Expectiminimax
● If the next node s is a chance node, then: ○ Sum over all the observed chance outcomes r at s, weighted by the probability P(r) of each chance action (e.g. dice roll)
Generalization Loss versus Empirical Loss
● If the set ε of all possible pairs (x,y) is known, then generalization loss can be used (this could be used for simulated data) ● Otherwise, empirical loss for a dataset E can be used, where N = |E| (this is far more typical)
Deterministic Environments
● In a deterministic environment (as in game playing, e.g., minimax), a preference ranking on states is sufficient, exact quantities for preferences are not needed ● Such preference rankings are called value functions
Search Strategies versus Decision Policies
● In deterministic single-agent search problems, find an optimal sequence of actions, from start to a goal ● For MDPs, find an optimal policy π*: S → A ○ A policy π gives an action for each state ○ An optimal policy is one that maximizes expected utility over time
Role of Conditional Independence
● In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n. ● Conditional independence is much more common in the real world than complete independence. ● Conditional independence is our most basic and robust form of knowledge about uncertain environments.
Definition of Random Variable
● In the probability space (Ω, Ρ): ○ Ω is a sample space (set of all outcomes) ○ Ρ is the probability measure ● Random variables are functions ● Given some probability space (Ω, Ρ), a random variable X: Ω →R is a function defined from the probability space to the real line ● In other words, Ρ attaches probabilities to events, which are subsets of Ω
Efficiency of Forward Chaining
● Incremental forward chaining: no need to match a rule on iteration k if a premise wasn't added on iteration k-1 ○ Match each rule whose premise contains a newly added positive literal ○ Problem: matching can be expensive: polynomial ○ Solution: database indexing allows linear time retrieval of known facts ● Forward chaining is widely used in deductive databases
Uninformed Search to Derive Proofs
● Initial State: The sentences in initial knowledge base ● Actions: Apply inference rules where a KB sentence matches the l.h.s. ● Result: Add r.h.s. of matched rules to KB ● Goal: A state containing the sentence to prove
EM for Naive Bayes
● Initialization: Assign initial values to the parameters ● Expectation step: Calculate the posterior probability of the class given the observed attribute values and the current parameters θ from initialization or from the M-step ● Maximization step: Calculate the new parameter values based on the class assignments from the E-step ● Continue for T iterations
Structure of the Forward Algorithm
● Initialization: all the ways to start ● Induction: all the ways to go from any given state at time t to any subsequent state at time t+1 ○ Given the probability for state qi at time t, induction carries forward the probability to each next qj at time t+1 ● Termination: all the ways to end
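A minimal sketch of the forward algorithm in Python; the example states, observations, and probabilities are illustrative assumptions, not values from the course:

    def forward(obs, states, pi, A, B):
        # pi[i]: initial probability of state i; A[i][j]: transition i -> j; B[i][o]: emission of o from i
        # Initialization: all the ways to start
        alpha = [{i: pi[i] * B[i][obs[0]] for i in states}]
        # Induction: carry probability forward from every state at t to every state at t+1
        for t in range(1, len(obs)):
            alpha.append({j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
                          for j in states})
        # Termination: sum over all the ways to end
        return sum(alpha[-1][i] for i in states)

    states = ["Hot", "Cold"]
    pi = {"Hot": 0.6, "Cold": 0.4}
    A = {"Hot": {"Hot": 0.7, "Cold": 0.3}, "Cold": {"Hot": 0.4, "Cold": 0.6}}
    B = {"Hot": {1: 0.2, 2: 0.4, 3: 0.4}, "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}
    print(forward([3, 1, 3], states, pi, A, B))   # likelihood of the observation sequence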
Biological Neuron
● Input ○ A neuron's dendritic tree is connected to a thousand neighboring neurons. When one fires, a positive or negative charge is received ○ The strengths of all the received charges are added together ● Output ○ If the aggregate input is greater than the axon hillock's threshold value, then the neuron fires ○ The physical and neurochemical characteristics of each synapse determine the strength and polarity of the new signal
Discounting
● It's reasonable to maximize the sum of rewards ● It's also reasonable to prefer rewards now to rewards later ● One solution: values of rewards decay exponentially
Using Resolution in FOL
● KB must be in CNF ● Yields a complete inference procedure ● Efficient inference strategies exist ● Similar to conversion of PL to CNF; differences due to quantifiers ○ Implication elimination ○ Move ¬ inwards ○ Standardize variables ○ Skolemize: remove existential quantification ○ Drop universal quantifiers ○ Distribute ∨ over ∧
Agents with Explicit Knowledge Representation
● Knowledge base = set of sentences (declarations) in a formal language ● Adding to the KB ○ Agent Tells the KB what it perceives: S_i ○ Inference: derive new statements from KB + S_i ● Using the KB ○ Agent Asks the KB what action to take ○ Declaration of the action to take leads to action
Minimax Setup
● Label the root MAX ● Alternate MAX/MIN at each next level of the tree (ply) ○ Minimax(node) is the utility of any node (for MAX) ○ Even levels represent turns for MAX ○ Odd levels represent turns for MIN
Perceptron Learning Rule
● Learning the weights for the classifier with a hard threshold ○ Cannot use gradient descent because the gradients of the values of the threshold function are either zero or undefined ○ Can use the following update rule (for a single example), which converges to a solution provided the data are linearly separable: w_i ← w_i + α (y − h_w(x)) x_i ● This is called the perceptron learning rule (see the sketch below) ● It has the same form as the update rule for linear regression
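A minimal sketch of the perceptron learning rule with a hard threshold; the function name, data (the OR function), learning rate, and epoch count are illustrative assumptions:

    def perceptron_train(examples, alpha=0.1, epochs=100):
        # examples: list of (x_vector, label) with label in {0, 1}; w[0] is the bias weight (dummy x0 = 1)
        n = len(examples[0][0])
        w = [0.0] * (n + 1)
        for _ in range(epochs):
            for x, y in examples:
                xi = [1.0] + list(x)
                h = 1 if sum(wi * v for wi, v in zip(w, xi)) >= 0 else 0
                # The update changes w only when the prediction is wrong: w_i <- w_i + alpha*(y - h)*x_i
                w = [wi + alpha * (y - h) * v for wi, v in zip(w, xi)]
        return w

    # Linearly separable data: the logical OR function
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
    print(perceptron_train(data))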
Computing Actions from Q-Value
● Let's imagine we have the optimal Q-values: ● How should we act? ○ Completely trivial to decide! ● Important lesson: actions are easier to select from Q-values than values!
Computing Actions from Values (aka Utilities)
● Let's imagine we have the optimal values V*(s) ● How should we act? It's not obvious! ● We need to do a mini-expectimax (one step) ● This is called policy extraction: it gets the policy implied by the values
One Neuron (Unit)
● A link from neuron input a_i to output a_j propagates through the network ● Each input link has a weight w_i,j for the strength and sign of activation ● A neuron's input is a weighted sum over the input links (including a dummy input a_0 = 1 for the bias weight w_0,j) ● A neuron's output activation a_j is a function g over the input in_j ● A neural network: neurons linked by directed arcs
Local Search for CSPs
● Local search algorithms such as hill-climbing and simulated annealing can apply to CSPs using complete state formulation ○ Each state assigns a value to every variable ○ The search changes one variable at a time ● Variable selection: randomly select any conflicted variable ● Value selection by min-conflicts heuristic: ○ Choose a value that violates the fewest constraints ○ Apply hill-climb with h(n) = total number of violated constraints
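A minimal sketch of min-conflicts local search for a CSP whose only constraints are that neighboring variables differ (map coloring); the instance is the usual Australia example and the step limit is an arbitrary assumption:

    import random

    def min_conflicts(variables, domains, neighbors, max_steps=10000):
        # Complete-state formulation: start with a random value for every variable
        assignment = {v: random.choice(domains[v]) for v in variables}
        conflicts = lambda v, val: sum(assignment[n] == val for n in neighbors[v])
        for _ in range(max_steps):
            conflicted = [v for v in variables if conflicts(v, assignment[v]) > 0]
            if not conflicted:
                return assignment
            var = random.choice(conflicted)             # randomly select a conflicted variable
            # Min-conflicts value selection: the value violating the fewest constraints
            assignment[var] = min(domains[var], key=lambda val: conflicts(var, val))
        return None

    variables = ["WA", "NT", "SA", "Q", "NSW", "V"]
    domains = {v: ["red", "green", "blue"] for v in variables}
    neighbors = {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW", "V"],
                 "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"]}
    print(min_conflicts(variables, domains, neighbors))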
General Properties of a Logical Formalism
● Logic: a way to formulate statements, interpret them, and derive conclusions ○ Syntax: vocabulary, rules of combination to make statements ○ Semantics: truth of statements in possible worlds, or models ○ Entailment: a way to evaluate consistency of models with one another
Logical Agents and Uncertainty
● Logical agents have belief states ● Probability theory can be incorporated into logical agents ○ To change epistemological commitments from truth values to degrees of belief in truth ○ Ontological commitments (what is believed to be in the world) remain the same
Perplexity as a Language Model Metric
● Lower perplexity on the test set means higher likelihood of the observed sentences (less perplexed): PP(W) = P(w_1 w_2 ... w_N)^(−1/N) ● The Nth root normalizes the inverse probability by the number of words to get a per-word perplexity ● Equivalent to the weighted average branching factor
Connection to Reinforcement Learning
● MCTS uses a fixed metric as a "selection policy" to choose the next move ● Reinforcement learning iterates through courses of action to train a flexible decision policy that maximizes a long term reward ● Similarities ○ Future moves are simulated ○ Exploration involves acquiring knowledge about unknowns, sometimes through failure ○ Exploitation involves re-using what has been learned through trial-and-error
Why a Naive Bayes Model is Generative
● ML models that rely on joint probability distributions are generative ○ Given a full joint probability distribution, P(X1 , . . ., Xn, ,Y), the hypothesis P(Y|X) depends on known priors and conditional probabilities: it's all probabilities ○ Conceptually, new data can be generated from the model ■ P(X|Y) = P(X,Y) / P(Y) ● ML models can be discriminative rather than generative ○ Discriminative models do not use full joint probability distributions ○ P(Y|X) depends on features X that discriminate among the outcomes Y ○ Logistic regression is a discriminative modeling method
Smoothing
● Many words are relatively rare ○ A relatively infrequent word might not occur in one of the text categories ○ A word in the test data might not have been seen in the training data ● If any word is assigned a zero conditional probability for any category, then the product of conditional probabilities is zero ● Ensure that a conditional probability is assigned for every word in every class ○ Smooth all conditional probabilities: add a small number to all counts ○ Use a special UNKNOWN token during training and testing; all words not in the training data can be treated as instances of UNKNOWN
Recap: MDPs
● Markov decision processes: ○ States S ○ Actions A ○ Transitions P(s'|s,a) (or T(s,a,s')) ○ Rewards R(s,a,s') (and discount γ) ○ Start state s0 ● Quantities: ○ Policy = map of states to actions ○ Utility= sum of discounted rewards ○ Q-Value = expected future utility from a q-state
Recall Minimax
● Max is the agent ● Min is the opponent ● Look ahead to get future utilities ● Back them up to the root decision node ○ Assume worst case for Max (Min plays optimally)
Learning the Naïve Bayes Parameters
● Maximum Likelihood estimates ○ Use counts from a training set of data ○ P̂ stands for the probability estimate
Information Gain
● Measures the relative reduction in entropy for all attributes ● Start with a measure of the entropy of the dataset with respect to the desired classes C = {c_1, c_2, . . . c_n} ○ One class: H(C) = 0 ○ Two equal classes: H(C) = 1 ○ Three classes with p(c_1) = ⅙, p(c_2) = ⅓, p(c_3) = ½: H(C) = 1.46 ○ Three equal classes: H(C) = 1.58 ○ Ten equal classes: H(C) = 3.32 ● Information gain of an attribute A: Gain(A) = H(C) − Remainder(A), the dataset entropy less the expected entropy after splitting on A (see the sketch below)
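A minimal sketch of entropy and information gain in Python, reproducing the example values above; the split in the last line is a hypothetical illustration:

    import math

    def entropy(probs):
        # H(C) = -sum p * log2(p), ignoring zero-probability classes
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def information_gain(parent_probs, splits):
        # splits: list of (weight, child class probabilities); Gain = H(parent) - remainder
        remainder = sum(w * entropy(ps) for w, ps in splits)
        return entropy(parent_probs) - remainder

    print(round(entropy([0.5, 0.5]), 2))          # 1.0  (two equal classes)
    print(round(entropy([1/6, 1/3, 1/2]), 2))     # 1.46
    print(round(entropy([0.1] * 10), 2))          # 3.32 (ten equal classes)
    # Hypothetical split: half the data becomes pure, half stays 50/50 -> gain of 0.5
    print(round(information_gain([0.5, 0.5], [(0.5, [1.0]), (0.5, [0.5, 0.5])]), 2))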
Inference Task: Learning
● Model parameters are unknown ○ Transition model ○ Sensor model ● Model parameters can be learned from the data ○ Maximum likelihood parameter estimation ○ Expectation maximization (EM)
Size of the Trellis
● N nodes per column, where N is the number of states ● S columns, where S is the length of the sequence ● E edges between consecutive columns, one for each transition ● Total trellis size is approximately S(N+E) ○ For N=10, S=10: ■ E = N × N {edges from S_n to S_n+1} = 10² ■ S(N+E) = 10(10+100) = 1,100 << 10^10
Proof Methods: Roughly Two Kinds
● Natural Deduction: Application of inference rules ○ Legitimate (sound) generation of new sentences from old ○ Proof = a sequence of inference rule applications ■ Use inference rules as operators in a standard search algorithm ○ Typically requires transformation of sentences into a normal form ● Model checking ○ Truth table enumeration: for n propositional symbols, O(2^n) ○ Backtracking search, e.g., Davis-Putnam-Logemann-Loveland (DPLL) ○ Heuristic search in model space (sound but incomplete) ■ E.g., min-conflicts-like hill-climbing algorithms
Games as Search
● New material specific to games ○ There is an opponent to keep track of ○ The search tree includes the adversary's possible moves with alternating plies (levels) for the player and opponent ○ A utility function (or payoff function) of the payoff in points to each player
Ockham's Razor, and Underfitting versus Overfitting
● Ockham's razor: choose the simplest model that "works" ● Underfitting: the model fails to find a pattern that exists in the data ○ Not common; training usually continues until a good fit is achieved ○ Solution: choose a more complex model, and find more data to enable learning the more complex model ● Overfitting: the model finds a pattern that exists in the sample, but the pattern does not generalize across samples ○ Fairly common ○ Solution: simplify the model, and sample the data more rigorously
What Makes Naïve Bayes Naïve?
● Often used when the effects are not conditionally independent ● Can often work well in cases that violate the conditional independence assumption ○ Text classification ○ Why use a more complicated model that seems more correct if a simpler (more naïve) model performs as well or better?
Limitations of Game Search
● Optimal search for complex 2-person zero-sum games is intractable due to the large branching factor b and average depth of game d ● Tradeoff between different kinds of algorithms ○ MCTS is best if b is high and/or an evaluation function is hard to construct ○ Alpha-Beta can be more precise, but heuristic alpha-beta is very sensitive to the evaluation function, whose average error could cause bad choices ● Biggest limitation is the focus on individual moves in a game; people reason about games at a higher level of abstraction that breaks the overall goal of winning down into component sub-goals, such as trapping the opponent's queen in chess
Assumptions for Multinomial Naive Bayes
● PQ is the set of all distributions over Q ● For each such distribution, π is a vector with components π q for each q ∈ Q corresponding to the probability of q occurring
Naive Bayes as a Machine Learning Algorithm
● Performs classification: given an example, produces a hypothesis as to which class the example belongs in ○ Relies on Maximum a Posteriori decision rule (Mtg 21, slide 16) ● NB is a form of supervised statistical machine learning: a training dataset that has already been classified provides the supervision ○ The parameters of the model are the prior probabilities of each class value, and the conditional probabilities of each conditionally independent variable ○ The parameter values are determined empirically, using maximum likelihood estimation ○ A model with learned parameters is tested on previously unseen data
Prior (Unconditional) versus Conditional Probabilities
● Prior probability: probability of an event, apart from conditioning evidence ● Conditional (or posterior) probability: probability of an event conditioned on the occurrence of an earlier event
Pros and Cons of La Place Smoothing
● Pro ○ Very simple technique ○ Addresses the key idea that smoothing compensates for not having enough data: Cromwell's rule ● Cons: ○ Probability of frequent words is underestimated ○ Probability of rare (or unseen) words is overestimated ○ Therefore, too much probability mass is shifted towards unseen words ○ All unseen words are smoothed in the same way ● Many more sophisticated methods for smoothing exist; for this class use La Place
Propositions and Random Variables
● Probabilistic propositions are factored representations consisting of variables and values (combines elements of PL and CSP) ● Variables in probability theory are called random variables ○ Uppercase names for the variables, e.g., P(A=true) ○ Lowercase names for the values, e.g., P(a) is an abbreviation for A=true ● A random variable is a function from a domain of possible worlds Ω to a range of values
Infinite Utilities?!
● Problem: What if the game lasts forever? Do we get infinite rewards? ● Solutions: ○ Finite horizon: (similar to depth-limited search) ■ Terminate episodes after a fixed T steps (e.g. life) ■ Gives nonstationary policies (π depends on time left) ○ Discounting: use 0 < γ < 1 ■ Smaller γ means smaller "horizon" = shorter term focus ○ Absorbing state: guarantees that for every policy, a terminal state will eventually be reached (like "overheated" for racing)
Convert FOL to PL Then do Inference
● Propositionalize the FOL ○ Eliminate quantifiers ○ Skolemize ● Semidecidability ○ Theorem: any sentence entailed by FOL KB is entailed by a finite subset of the propositionalized KB ○ Problem: a sentence not entailed by FOL cannot be recognized as unprovable in the propositionalized KB
Stochastic Gradient Descent (SGD): Univariate Case
● Randomly select m data points at a time, m << N ● Comparison to batch: assume N = 10,000 and m = 100 ○ Each SGD step is 100 times faster than batch gradient descent ○ Increase in standard error is proportional to the square root of the number of examples, or a factor of 10
Language Modeling: Current Methods
● Recurrent neural networks (RNNs) ○ Avoids exponential increase in computation time with statistical LMs ○ Weight parameters are shared across the network ○ Therefore, there is a linear increase in computation time ● Still requires smoothing
Description of Minimax Algorithm
● Recurse down the game tree ○ Search proceeds to some depth d (number of moves to look ahead) ○ Expand to leaf nodes at depth d ● Pass the minimax values back up through the tree ○ Compute the minimax() utility function at the depth-d leaves ○ Pass value back up tree to the parent nodes ● Backed-up values ○ At a MAX node: the maximum of MAX's descendants (the best for MAX) ○ At a MIN node: the minimum of MIN's descendants (the best for MIN)
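A minimal sketch of depth-limited minimax in Python; the toy game tree, function names, and depth are illustrative assumptions:

    def minimax(state, depth, maximizing, successors, evaluate, is_terminal):
        # Recurse down the game tree to the depth limit, then back values up to the root
        if depth == 0 or is_terminal(state):
            return evaluate(state)
        values = [minimax(s, depth - 1, not maximizing, successors, evaluate, is_terminal)
                  for s in successors(state)]
        # A MAX node takes the maximum of its children; a MIN node takes the minimum
        return max(values) if maximizing else min(values)

    # Toy game tree: "A" is a MAX node, "B" and "C" are MIN nodes, integers are leaf utilities
    tree = {"A": ["B", "C"], "B": [3, 12], "C": [2, 4]}
    succ = lambda s: tree.get(s, [])
    term = lambda s: isinstance(s, int)
    print(minimax("A", 2, True, succ, lambda s: s, term))   # max(min(3,12), min(2,4)) = 3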
Regularization
● Regularization aims to minimize the complexity of the model ● Choice of regularization function depends on the hypothesis space ○ For example, a regularization with a linear regression (weights on each attribute) often uses one of the two following regularizers ■ Shrink all small weights to zero (eliminates attributes; Lasso or L1) ■ Prevent very large weights (smooths the weights, uses all features; Ridge or L2)
Universal and Existential Instantiation
● Replace quantified sentences with instantiated sentences ○ Variables are replaced with constants ○ Not with the "objects" of a model ● Inferentially equivalent to the original quantified sentences ● New KB' is satisfiable whenever KB is satisfiable ● Can use propositional inference on KB'
Evaluation of Game States
● Represent the game problem space by a tree: ○ Nodes represent board positions (states) ○ Edges represent legal moves (actions) ○ Root node is the first position in which a decision must be made ● Evaluation function f assigns real-number scores to board positions without reference to the search path ● A terminal node represents a possible game end, labeled with its utility (e.g. win/lose/draw, etc.)
Design issues
● Representing the 'board' and its successor boards ● Evaluating positions ● Looking ahead (search)
Proof by Resolution
● Resolution: an inference rule that if coupled with a complete search algorithm yields a complete inference algorithm ○ Inference rules covered above are all sound ○ Adding resolution yields completeness, if using a complete search algorithm
Differences from Previous Search Methods
● Search goal is to make one move; playing the game has many moves ● No cost on arcs - costs derive from backed-up static evaluation ● MAX can't be sure how MIN will respond to his moves
Regression versus Classification
● Seismic data (1982-1990, Asia & Middle East): body wave magnitude (x1 ) and surface wave magnitude (x2 ) for earthquakes (orange circles) and nuclear explosions (green circles) ● Decision boundary for a linearly separable subset of the data (left) ● All of the data: not linearly separable
Minimum Remaining Values (MRV) Heuristic
● Select a variable to assign with the fewest legal values, meaning the most constrained variable ● Identifies a potential conflict early, if no value can be assigned
Degree Heuristic
● Select the variable to assign that participates in the largest number of constraints with other variables (highest degree in the constraint graph) ○ Reduces branching factor on remaining variables ○ Very effective for picking which state to color first: SA
Monte Carlo Tree Search Balances Explo(r it)ation
● Selection: ○ Apply a metric (selection policy) to rank next moves ○ Take each next move in a (known) playout (path in the MC tree) to a leaf ● Expansion: Add one or more new children below the leaf ● Simulation: ○ Perform a playout simulation from the new node(s) to a game end ○ Note: The simulation is not part of the tree ● Back-Propagation: ○ Incorporate the game result of the simulation into the tree by updating all the nodes back to the root
Truth in FOL
● Sentences are true with respect to a model and an interpretation (grounding) ● A model contains objects (domain entities) and relations among them ● Interpretation specifies referents for ○ constant symbols → objects ○ predicate symbols → relations ○ function symbols → functional relations ● An atomic sentence (predicate(term1 ,...,termn )) is true iff the objects referred to by term1 ,...,termn are in the relation referred to by predicate()
Propositional Logic
● Simple sentence symbols are atomic, non-decomposable: A, B ● Logical operators combine simple sentences into complex sentences ○ ¬A (not A) ○ A ∧ B (A and B) ○ A ∨ B (A or B) ○ A ⇒ B (if A then B) ○ A ⇔ B (A if and only if B) ● Sentences are true or false
Smoothing Can Cause Underflow
● Smoothing, especially with low α, leads to values close to 0 ● Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow ● Mitigation: calculate in log space ○ Given that log(xy) = log(x) + log(y): perform all computations by summing logs of probabilities rather than multiplying probabilities
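A small sketch of scoring one class in log space; the prior, conditional probabilities, and words are hypothetical smoothed parameters:

    import math

    def nb_log_score(prior, cond_probs, words):
        # log P(c) + sum_i log P(w_i | c): sum logs of probabilities instead of multiplying them
        return math.log(prior) + sum(math.log(cond_probs[w]) for w in words)

    prior = 0.4
    cond = {"free": 0.01, "offer": 0.005, "now": 0.02}
    print(nb_log_score(prior, cond, ["free", "offer", "now"]))
    # Multiplying the raw probabilities over a long document could underflow to 0.0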
Value Iteration ctd
● Solving the Bellman equations: ○ For n states, there are n Bellman equations with unknown utilities ○ Systems of linear equations can be solved easily using linear algebra, but the max operator is not a linear operator ● Value iteration is one solution ○ Initialize with arbitrary values for the utilities of every state ○ Calculate the right hand side of the equation, then use the new value as an update for the left hand side, applied simultaneously to all states
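A minimal sketch of value iteration; the tiny two-state MDP, discount, and tolerance are illustrative assumptions:

    def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
        # T[s][a]: list of (prob, next_state); R(s, a, s2): reward for that transition
        U = {s: 0.0 for s in states}                 # arbitrary initial utilities
        while True:
            # Bellman update applied simultaneously to all states
            U_new = {s: (max(sum(p * (R(s, a, s2) + gamma * U[s2]) for p, s2 in T[s][a])
                             for a in actions(s)) if actions(s) else 0.0)
                     for s in states}
            if max(abs(U_new[s] - U[s]) for s in states) < eps:
                return U_new
            U = U_new

    # Two-state chain: from "a" the only action moves to terminal "b" with reward 1
    states = ["a", "b"]
    T = {"a": {"go": [(1.0, "b")]}, "b": {}}
    actions = lambda s: list(T[s].keys())
    R = lambda s, a, s2: 1.0 if (s == "a" and s2 == "b") else 0.0
    print(value_iteration(states, actions, T, R))    # {'a': 1.0, 'b': 0.0}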
Forward Chaining
● Sound and complete for Datalog ● Datalog = first-order definite clauses + no functions ● Forward chaining terminates for Datalog in finite number of iterations ● May not terminate in general if α is not entailed ○ Recall: Entailment with definite clauses is semidecidable
Inference Task: Computing the Belief State (Filtering)
● State estimation: the posterior distribution over the most recent state, given all the evidence to date ● For example, what is the probability it will rain today, given all the evidence to date ● Referred to as filtering from early work on signal processing to filter out noise by estimating the underlying signal ● When extended to the posterior over a sequence of states, it is referred to as computing likelihood
Deriving the Equation for the Global Semantics
● Step 1: by definition ● Step 2: where y is all the other variables in the full joint probability ● Step 3: proof that the network's local parameters θ are the conditional probabilities of the full joint probability distribution
Decision Trees
● Supervised learning: ○ Usually of a classifier ○ Can learn a regression ○ CART (classification and regression trees) can learn either one ● Basic idea, using a binary decision goal (e.g., ± C) and n binary attributes A: ○ From your training data, find the one attribute (e.g., A1 ) that best splits the data into two largely equal disjoint sets corresponding to + C and −C ○ If each set is "pure" (has only + C or only −C), the algorithm terminates ○ Else, for each of the 2 sets, find the attribute from A\{A1 } that best divides each set into two maximally pure classes ○ Iterate
FOL Expressions
● Terms ○ Constant ○ Variable ○ Function(Term, . . .) ● Atomic sentences ○ Predicate ○ Predicate(Term, . . .) ○ Term = Term ● Complex sentences ○ Atomic sentences combined with logical connectives ○ Sentences with quantification
Cost Combines Model Selection and Optimization
● The cost of a hypothesis can be considered as the sum of the loss and the regularization: ○ Lower loss is equivalent to reducing all the error types ○ Lower regularization term is equivalent to a simpler model ● Cost formalizes Ockham's razor, in an empirical way ○ Ockham: "A plurality of entities should not be posited without necessity" ○ If necessity is interpreted as empirically predictive, then cost directly formalizes Ockham's razor ○ If necessity is interpreted as providing a more useful explanation with respect to a theory of the world, then cost is only part of the story
Training a Logistic Regression
● The derivative of the logistic function g satisfies g′(z) = g(z)(1 − g(z)) ● Thus the update rule has a somewhat different form than for multivariate regression with a hard threshold
Semantics ctd
● The full joint distribution is defined as the product of the local conditional distributions: P(x_1, . . ., x_n) = ∏_i P(x_i | parents(X_i))
Resolution Requires Conjunctive Normal Form (CNF)
● The input to resolution consists of two clauses (implicit conjunction) that are each disjunctions of literals ● To apply resolution, convert all propositions to conjunctive normal form (CNF) ○ Apply bi-conditional elimination if applicable ○ Apply implication elimination if applicable ○ Move all ¬ inward to literals (double-negation elimination; de Morgan) ○ Apply distributivity of ∨ over ∧
Log Likelihood of the Parameters θ
● The log-likelihood function L(θ) is the sum over all documents of the logs of their probabilities (given their attribute values) ● The parameters are those that maximize the log-likelihood function, subject to the constraints that the probabilities of the classes sum to one, and the conditional probabilities of the words in each class sum to one
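● A hedged sketch for the multinomial Naive Bayes case (notation assumed here: d ranges over documents, cd is document d's class, and w ranges over the word tokens in d): L(θ) = Σd [ log P(cd) + Σw∈d log P(w | cd) ], maximized subject to Σc P(c) = 1 and, for each class c, Σw P(w | c) = 1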
Choosing the Optimal Action in State s
● The optimal policy chooses the action leading to the successor states with the maximum expected utility ● The utility of a state U(s) can then be defined in terms of the expected utilities of the state sequences starting from s, which yields the Bellman equation (below)
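● Bellman equation, in the standard form for reward R(s), discount γ, and transition model P(s′ | s, a): U(s) = R(s) + γ maxa∈A(s) Σs′ P(s′ | s, a) U(s′)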
Sample Spaces
● The set of all possible worlds (e.g., for a given logical agent) is called the sample space (specifiable) ○ The sample space consists of an exhaustive set of mutually exclusive possibilities ○ Each member ωi of a sample space Ω is called an elementary event
Analytic Solution: Univariate Case
● The univariate linear model can be easily solved using the above equations, based on finding the values of the weights where the partial derivatives equal zero ● An alternative method can be applied that relies on hill-climbing
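● The resulting closed-form weights for the univariate model hw(x) = w1x + w0, given N training examples (xj, yj): w1 = (N Σ xjyj − Σ xj Σ yj) / (N Σ xj² − (Σ xj)²) and w0 = (Σ yj − w1 Σ xj) / N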
Semi-decidability of Propositionalized KB
● Theorem (Herbrand, 1930): If a sentence α is entailed by an FOL KB, it is entailed by a finite subset of the propositionalized KB ● For n = 0 to ∞ do ○ create a propositional KB by instantiating with depth-n terms ○ see if α is entailed by this KB ● Problem: this works if α is entailed, but loops forever if α is not entailed ● Theorem (Turing, 1936; Church, 1936): Entailment for FOL is semi-decidable ○ Algorithms exist that prove every entailed sentence ○ No algorithm exists that also disproves every non-entailed sentence
Stationary Preferences
● Theorem: assume stationary preferences, i.e., if two state sequences [s0, s1, s2, ...] and [s0, s1′, s2′, ...] begin with the same state, then the preference between them is the same as between [s1, s2, ...] and [s1′, s2′, ...] ● Then there are only two ways to define utilities over state sequences ○ Additive utility: U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ... ○ Discounted utility, for discount factor γ: U([s0, s1, s2, ...]) = R(s0) + γR(s1) + γ²R(s2) + ...
Convergence Properties of EM
● Theorem: the log-likelihood function of the parameters is non-decreasing across iterations ● In the limit, as the number of iterations goes to ∞, EM converges to a local optimum
Probability Summarizes the Unknown
● Theoretical ignorance: ○ Often we have no complete theory of the domain, e.g., medicine ● Poor cost-benefit tradeoff even if we have fairly complete theories: ○ It is difficult to formulate knowledge and inference rules about a domain that handle all (important) cases ● Unavoidable uncertainty (partial observability): ○ Even when we know all the implication relations (rules), we might be uncertain about the premises
Likelihood of the Observations
● To compute the likelihood of o1, o2, ..., oT as P(O | λ) we would want all the paths through the trellis and their associated probabilities ● Given that we do not have λ, we use an approximation λ′ to compute an expectation for each sequence of observations ● Given the Markov assumptions, we need only consider the probability of every state transition i → j for every observation at every time t ● Given all our sequences o1, o2, ..., oT, the probability of every state transition with an emission at t+1 is derivable from the forward probability from 1 to t and the backward probability from T back to t+1
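● In the usual Baum-Welch notation (αt(i) the forward probability, βt+1(j) the backward probability, aij the transition probability, bj(ot+1) the emission probability), this quantity is: ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ′)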
Execution of the Conditional Plan
● To execute an if-then-else expression in the conditional plan ○ Agent receives percept, then executes the appropriate branch of the condition ○ Agent updates its beliefs after each action ● Similar to, but simpler than, the search ○ Percepts are actual observations in the environment, rather than possible observations maintained for all ways the belief state space could evolve
Logic versus Natural Language
● One function of natural language is to express information ○ A logical formalism is a language for expressing truth-conditional meaning ■ More rigorous (rule-governed, consistent) than natural language ■ Can characterize reasoning (inference) of various forms ○ Natural language, by contrast, is ambiguous: most words have many meanings (semantic ambiguity) and most sentences have multiple syntactic analyses (syntactic ambiguity) ● Language has many other functions besides conveying information (which can be true or false) ○ Word choices can reflect which group one identifies with ○ Saying one has an emotion is information, but the emotion itself is not information (one cannot evaluate the truth conditions of "sadness")
Optimization: Choosing Model Hyperparameters
● Training-set error tends to decrease as the complexity of the model increases ● Because training error can often be reduced nearly to zero, the validation set serves as a check on overfitting
Hard Threshold Function for Classification
● Turns a linear regression into a classifier by applying a hard threshold function ● The decision boundary is at zero, where the decision switches to the other class ○ Values of the function above 0 are in one class ○ Values of the function below 0 are in the other class
Two Layer Feed Forward Network
● Two inputs ● One hidden layer with two neurons ● Two output neurons ○ Output is a 2D vector: (a5, a6) ● Fully connected feed-forward network ● Output of a network with m output nodes is a length-m vector ● Given a loss function that is additive ○ Learning decomposes into m learning problems ● Loss must be back-propagated ○ Each node j in the current hidden layer contributes to the error Errm of every output node it connects to ○ A node's share of the error depends on its weights ● The activation function must be differentiable
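● A minimal forward-pass sketch of this 2-2-2 network in Python with numpy (the random weights and the sigmoid activation are illustrative choices; any differentiable activation would do):
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input layer -> hidden layer
    W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden layer -> output layer

    def forward(x):
        h = sigmoid(W1 @ x + b1)     # hidden activations (a3, a4)
        return sigmoid(W2 @ h + b2)  # output vector (a5, a6)

    print(forward(np.array([1.0, 0.0])))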
Machine Learning in General
● Types of ML ○ Replicate a pattern given by a supervision signal ○ Discover new patterns (unsupervised learning) ○ Learn through trial and error by interacting with environment and receiving a reinforcement signal (reward; learn a Markov Decision Process model) ● Supervised machine learning types: ○ Classification, e.g., Naive Bayes ○ Sequence prediction, e.g., HMM ○ Regression
Upper Confidence Bounds for Trees Selection Policy
● U(n) is the total utility of all playouts through node n ● N(n) is the number of playouts through node n ● N(PARENT(n)) is the number of playouts through the parent node of n ● U(n)/N(n) is the average utility of n, i.e., the exploitation term ● The square-root term is the exploration term; because N(n) is in the denominator and log N(PARENT(n)) is in the numerator, the exploration term starts high and goes to zero as the counts increase ● C is a constant that balances exploitation and exploration, with values around √2
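● The selection policy in symbols: UCB1(n) = U(n)/N(n) + C · √( log N(PARENT(n)) / N(n) )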
Utility of a State Sequence
● U(s): the expected cumulative reward over time starting from state s ○ Additive rewards for a utility function on histories: U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ... ○ Discounted rewards for a utility function on histories, for discount γ ∈ [0,1]: U([s0, s1, s2, ...]) = R(s0) + γR(s1) + γ²R(s2) + ... ● If γ = 1, the discounted sum of rewards is the same as the additive sum of rewards ● The best policy is the policy that produces the state sequence with the highest expected reward over time
Utilities of States Over Time
● U(s): the expected cumulative reward over time starting from state s ● Finite versus infinite horizons ○ Finite: the agent is restricted to a limited number of actions ○ Infinite: the agent can take any number of actions until it reaches a goal ● Additive versus discounted rewards ○ Additive: sum the rewards over time ○ Discounted: apply a discount to prefer immediate rewards over future rewards
Operator Precedence
● Unary operators (¬) precede binary operators (∧∨⇒⇔) ● Conjunction and disjunction (∧∨) precede conditionals (⇒⇔) ● Conjunction (∧) precedes disjunction (∨) ● Implication (⇒) precedes biconditional (⇔)
Using Alpha-Beta Pruning
● Use iterative deepening search, ordering moves by their values from the previous iteration ● Expand captures first, then threats, then forward moves
Feature Selection
● Use terms above some frequency threshold ○ No particular theoretical foundation ○ Can work well in practice ● Feature selection using Mutual Information (MI) ○ Clear information-theoretic interpretation ○ The amount of information (bits) obtained about one random variable from another random variable (see next slide) ○ May select rare uninformative terms ● Other methods: the Bayes factor is good at selecting informative rare terms
Comparison: Two Dynamic Programming Approaches
● Value iteration and policy iteration compute the same thing (all optimal values) ● In value iteration: ○ Every iteration updates both the values and (implicitly) the policy ○ We don't track the policy, but taking the max over actions implicitly recomputes it ● In policy iteration: ○ Several passes that update utilities with a fixed policy (each pass is fast because one action is considered) ○ After the policy is evaluated, a new policy is chosen (slow like a value iteration pass) ○ The new policy will be better (or we're done)
Problems with Value Iteration
● Value iteration repeats the Bellman updates: Uk+1(s) ← R(s) + γ maxa Σs′ P(s′ | s, a) Uk(s′) ● Problem 1: it is slow, O(S²A) per iteration ● Problem 2: the "max" at each state rarely changes ● Problem 3: the policy often converges long before the values do
Search Tree
● We're doing way too much work! ● Problem: States are repeated ○ Idea: Only compute needed quantities once ● Problem: Tree goes on forever ○ Idea: Do a depth-limited computation, but with increasing depths until change is small ○ Note: deep parts of the tree eventually don't matter if γ < 1
Gradient Descent Hill Climbing
● Where w is the weight vector (including the bias) and α is the step size or learning rate, apply the following update rule: wi ← wi − α ∂Loss(w)/∂wi ● For univariate regression the loss is quadratic, so the partial derivative is linear
General Form of EM
● Where x is all the observed values in all the examples, Z is all the hidden variables for all the examples, and θ is all the parameters ○ E-step is the summation over P(Z=z | x, θ (k)), which is the posterior of the hidden variables given the data ○ M-step is the maximization of the expected log likelihood L(x,Z = z|θ ) ● For mixtures of Gaussians, the hidden variables are the Zijs, where Zij=1 if example j was generated by component i ● For Bayes nets, Zij is the value of unobserved variable Xi in example j ● For HMMs, Zjt is the state of the sequence in example j at time t ● Many improvements exist, and many other applications of EM are possible
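● In one equation, following the slide's notation (L is the expected log likelihood being maximized): θ^(k+1) = argmaxθ Σz P(Z = z | x, θ^(k)) · L(x, Z = z | θ); the E-step computes the posterior weights P(Z = z | x, θ^(k)), and the M-step performs the argmax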
POS Tags: Hidden States, Inherently Sequential
● Words are observed; part-of-speech tags are hidden ○ Cf. observed umbrellas vs. hidden weather states in the umbrella world ○ Cf. observed ice creams vs. hidden weather states in the ice cream world ● Likely POS sequences ○ JJ NN (delicious food, large pot) ○ NNS VBD (people voted, planes landed) ● Unlikely POS sequences ○ NN JJ (food delicious, pot large) ○ NNS VBZ (people votes, planes lands)
Formalizing the KB
● Wumpus world illustrates a logical agent: the agent takes an action, which results in a new percept, leading to new facts to add to the KB ○ Facts can follow directly from percepts ○ Facts can follow from other facts ● How could such an agent be implemented? ○ A logic language provides a way to represent and reason about facts ○ Predicate logic is introduced next as a first step
Generalization of Conditional Independence
● X and Y are conditionally independent given Z when P(X, Y | Z) = P(X | Z) P(Y | Z) ● Any number of random variables can be conditionally independent given a single random variable distinct from the rest: the Cause
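● In symbols, this generalization is the naive Bayes model: P(Cause, Effect1, ..., Effectn) = P(Cause) Πi P(Effecti | Cause)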