CMPSC442


Task Environment

A problem specification for which the agent is a solution

Two ways to use parent method to define subclass method

1.)Explicitly call the parent class method in the redefinition 2.) use super()
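A minimal sketch of both approaches, using hypothetical Agent classes (not from the course materials):

```python
# Two ways to reuse a parent method in a subclass (hypothetical example classes).
class Agent:
    def describe(self):
        return "agent"

class ReflexAgent(Agent):
    def describe(self):
        # 1.) Explicitly call the parent class method in the redefinition
        return "reflex " + Agent.describe(self)

class GoalAgent(Agent):
    def describe(self):
        # 2.) Use super() to reach the parent method
        return "goal-based " + super().describe()

print(ReflexAgent().describe())  # reflex agent
print(GoalAgent().describe())    # goal-based agent
```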

Random Variables Indexed over Time

Assume: fixed, constant, discrete time steps t ● Notation: X_{a:b} = X_a, X_{a+1}, ..., X_{b-1}, X_b ● Markov assumption: random variable X_t depends on a bounded subset of X_{0:t-1}

UCS Evaluation

Complete, like BFS. Optimal for any step cost function. Time complexity contingent on step costs. Space complexity = time complexity.

Database Semantics

DB semantics for a fragment of FOL with two constants (R, J), one binary relation ○ Unique-names assumption: every constant refers to a distinct object (R ≠ J) ○ Closed world assumption: an atomic sentence not known to be true is false ○ Domain closure: bijective relation of domain elements to constant symbols

Existential Instantiation (EI)

For any sentence σ, variable v, and constant symbol k that does not appear elsewhere in the knowledge base: from ∃v σ, infer Subst({v/k}, σ)

UCS in words

Store the frontier as a priority queue ordered by f. Expand the node n from the frontier that has the lowest path cost f(n). Apply the goal test when a node is selected for expansion. Save any goal node in reached. Select the goal node in reached with the lowest path cost. (If there are multiple paths to the goal, the lowest cost path is found first.)

Each time step Xt is conditioned on the preceding states Xt-2, Xt-1: Second Order Markov Process

T

Single Feed Forward Neuron Cannot be an XOR Gate

T

● Connected components of a constraint graph constitute independent problems ○ If assignment Si is a solution of CSPi, then ∪Si is a solution of ∪CSPi ○ Reduction of complexity: if each CSPi has c variables from a total of n, then there are n/c subproblems, each with complexity d^c, where d is the size of the domain, giving O(d^c · n/c), which is linear in n, instead of O(d^n), which is exponential in n ● Connected components of CSPs are rare

T

Two different functions cannot have the same name, even if they have different numbers, order, or name of arguments

True

α: lower bound on MAX's outcome

β: upper bound on MIN's outcome

Majority Class Baseline for POS Tagging

● Assign each word its most frequent tag in the training data

Bayesian Networks

● Concisely represent any full joint probability distribution ○ Graphically represents conditional independence ○ Defined by the topology and the local probability information ● Also known as belief nets

Entropy

● Entropy measures the uncertainty of a random variable; the more uncertainty in the variable, the more information is gained by observing it ● Recall the use of mutual information for feature selection for NB classifiers ○ MI presented there as a measure of how independent X and Y are ○ It can also be seen as a measure of how much information X and Y convey about each other; it is the normalized negative joint entropy

Resolution-based Theorem Prover

● For any sentences A and B in propositional logic, resolution can decide whether A ⊨ B ● Step 1: Put the statements in conjunctive normal form (CNF) ● Step 2: Proof by contradiction

Discounted Rewards with Infinite Horizon MDP

● Optimal policy π* is independent of the start state ● Utility of a state s is the expected sum of discounted rewards over time (t=0 . . . t=∞) ● Optimal policy is the policy that gives the maximum utility

Game Search, for Two-Player, Zero-Sum Games

● Two players: MAX and MIN ● MAX moves first ● MAX and MIN take turns until the game is over ● Winner gets reward, loser gets penalty

Expectimax for "Environment" Adversary

● Uncertain outcomes of agent's actions

Utilities of Sequences

● What preferences should an agent have over reward sequences? ● More or less? ● Now or later?

Semantics ctd

● Where the network contains n random variables Xi, each entry in the joint probability distribution is P(x1, . . . , xn) ● In a Bayesian network, each entry P(x1, . . . , xn) in the joint probability distribution is defined in terms of the parameters θ: P(x1, . . . , xn) = ∏i θ(xi | parents(Xi))

Expectimax Search

● Why wouldn't we know what the result of an action will be? ○ Explicit randomness: rolling dice ○ Actions can fail: when moving a robot, wheels might slip ● Values should reflect average-case (expectimax) outcomes, not worst-case (minimax) outcomes ● Expectimax search: use average score under optimal play ○ Max nodes as in minimax search ○ Chance nodes are like min nodes, represent uncertain outcomes ○ Calculate their expected utilities, i.e. take weighted average (expectation) of the children

Convergence Properties

● With fixed α, no guarantee of convergence ● With decreasing α, convergence is guaranteed ● What previous topic is this similar to?

Deriving ξ t (i,j)

1. By laws of probability, form the joint probability ε_t(i,j) of being in state i at time t, in state j at time t+1, and seeing the whole observation sequence 2. Compute P(O|λ): the probability of the whole observation sequence 3. Divide ε_t(i,j) by P(O|λ) to get ξ_t(i,j)

Knowledge Engineering using FOL

1. Identify the task 2. Assemble the relevant knowledge 3. Decide on a vocabulary of predicates, functions, and constants 4. Encode general knowledge about the domain 5. Encode a description of the specific problem instance 6. Pose queries to the inference procedure and get answers 7. Debug the knowledge base

Action Selection

1. Instantiate all evidence 2. Set action node(s) each possible way; for each action value: a. Calculate the posterior for the parents of the utility node, given the evidence b. Calculate the utility for each action 3. Choose the action with the highest utility

Generalized Search Algorithm

1.)Enumerate all paths from the initial state through the state space 2.)Find the subset of paths that end in the goal state 3.)Order the solution paths by cost, ascending 4.)Select the lowest cost solution

During graph search, states are in one of three disjoint subsets

1.)States associated with expanded nodes 2.) States associated with nodes in the frontier 3.)States associated with nodes that have not been reached

Six properties of task environments

1.)States can be fully, partially, or not observable 2.)Agency involves a single or multiple agents who might co-operate, compete or confront. 3.)Successor states can be deterministic, non-deterministic or stochastic 4.)Agent decisions can be episodic or sequential 5.)The world can be static or dynamic 6.)Time and space can be discrete or continuous

Initial state S0 :

: How the game is set up at the start

Constructing a Bayesian Network

A Bayesian network is a correct representation of a domain only if each node is conditionally independent of its other predecessors in the node ordering, given its parents 1. Determine the variables to model the domain, and order them such that causes precede effects 2. Choose a minimal set of parents for Xi to satisfy the equation for the local & global semantics, and insert an edge from every Parent(Xi ) to Xi 3. Add the CPT at each node. This procedure guarantees the network is acyclic, and with no redundant probability values: it cannot violate the axioms of probability.

Is-Terminal(s):

A Boolean to indicate when the game is finished

Compactness of a Bayesian Network

A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values ● Each row requires one probability value p for Xi = true (the number for Xi = false is 1-p) ● If no Boolean variable has more than k Boolean parents, the complete network requires O(n ⋅ 2^k) numbers ● Grows linearly with n versus O(2^n) for the full joint distribution

PL versus FOL Expressivity

A Logical Formalism Can be a Tool to Investigate Information in Language ● Propositional logic (PL) assumes the world contains statements ○ Atomic terms standing for statements ○ Operators to combine with atomic statements ● First-order logic (FOL) assumes the world contains ○ Objects: people, houses, numbers, colors, baseball games, . . . ○ Relations: red, round, prime, brother of, bigger than, part of, . . . ○ Functions: father of, best friend, one more than, plus, . . . ○ Quantification: all wumpuses smell bad, some squares are breezy ○ Statements about objects, relations, functions and their quantification

Markov Assumption for Language Modeling

A bigram language model is a Markov model: ● S, a set of states, one for each word wi ● A, a transition matrix where a(i,j) is the probability of going from state wi to state wj ○ Where the probability a(i,j) can be estimated by the relative frequency count(wi wj) / count(wi) ● π, a vector of initial state probabilities, where π(i) is the probability of the first word being wi

Subclasses

A class can extend the definition of another class. Allows use of methods and attributes already defined in the previous one. New class: subclass. Original: parent, ancestor, or superclass

object-oriented programming

A computer programming model that organizes software design around data, or objects, rather than functions and logic. An object can be defined as a data element with characteristic attributes and behavior defined by the class it represents

Convex Functions

A function is convex when, for any two points (x, f(x)) and (y,f(y)), a line segment connecting the two points lies above the curve f. Use of an arbitrary function g as your objective function means to use g as the criterion for achieving the problem goal. If the objective function g is convex, then the goal can be identified as z such that g(z) =0

A Naïve Bayes Model is a Bayesian Network

A graphical representation has one node for each random variable ● Directed arrows represent the conditioning effect of a parent node on its children

Generative Models Support Parameter Estimation

A joint probability distribution supports queries about any one or more random variables by marginalizing over the other variables ● Therefore the same model can be used in multiple directions: ○ To estimate the posterior probability of labels given the data ○ To estimate the posterior probability of the data given the labels

Likelihood Function

A likelihood function is a probability function (e.g., probability mass function) ● Likelihood L is considered to be a function of the parameters L(θ) for fixed x and y ○ where θ is a vector of parameters ○ x is fixed data (e.g., Naïve Bayes features) ○ y are prior probabilities of the classes ● Bayes rule: Posterior = Likelihood x Priors

List

A mutable ordered sequence of mixed types

Validity and Satisfiability

A sentence is valid if it is true in all models (tautologies; necessarily true) ○ e.g., A ∨¬A, A ⇒ A, (A ∧ (A ⇒ B)) ⇒ B ● Validity is connected to inference via the Deduction Theorem: ○ A ⊨ B if and only if (A ⇒ B) is valid ● A sentence is satisfiable if it is true in some model ○ e.g., A ∨ B ○ Determining satisfiability (the SAT problem) is NP-complete ● A sentence is unsatisfiable if it is true in no models ○ e.g., A∧¬A ● Satisfiability is connected to inference via reductio ad absurdum: A ⊨ B if and only if (A ∧¬ B) is unsatisfiable

Solution

A sequence of actions from the initial state to the goal state

Syntax

A set of nodes, one per random variable (discrete or continuous) ● A directed, acyclic graph (DAG) ○ Directed arcs for parent node as conditioning context point to the child nodes ● A conditional distribution for each node Xi given its parents that quantifies the effect of the parents on the node using a finite number of parameters θ

Tuple

A simple immutable ordered sequence of items (cannot be modified) Items can be of mixed types, including collection types

Default Dict

A subclass of Dict. Does not raise a KeyError if a key lacks a value; assigns a default value. Optional default_factory arg: a function to return a default value. Can use a built-in or define a function.
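A small sketch of defaultdict with built-in default_factory functions (the word counts are invented):

```python
from collections import defaultdict

counts = defaultdict(int)            # built-in factory: missing keys start at 0
for word in ["the", "cat", "the"]:
    counts[word] += 1                # no KeyError on the first occurrence
print(counts["the"], counts["dog"])  # 2 0

groups = defaultdict(list)           # another built-in factory: empty list
groups["vowels"].append("a")
print(dict(groups))                  # {'vowels': ['a']}
```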

Defining the state space

Abstract over the real world details. State space abstraction. Action abstraction: pairs of states where the first can be succeeded by the second.

Evaluation Metrics

Accuracy Information Retrieval Metrics ● A confusion matrix is a square matrix of n classes by n hypotheses ● The same confusion matrix is used for accuracy, and IR metrics ● Accuracy is preferred if all cell values can be determined ● Recall and precision are often used when TN is not known

Add-One Laplace Smoothing

Add pseudo counts for unobserved class members where the missing word could have occurred (e.g., if only you had more data): ○ Add 1 to the numerator in the conditional probabilities for every word ○ Increment the denominator total count of words in the class by the size of the vocabulary (the class size is expanded by all the pseudo counts)
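A minimal sketch of the add-one computation, assuming invented per-class word counts and a tiny vocabulary:

```python
# Add-one (Laplace) smoothed conditional word probabilities for one class.
class_counts = {"ball": 3, "goal": 2}          # hypothetical counts in the class
vocab = {"ball", "goal", "referee", "score"}   # hypothetical vocabulary, |V| = 4
total = sum(class_counts.values())

def p_word_given_class(word):
    # add 1 to the numerator; add |V| to the denominator
    return (class_counts.get(word, 0) + 1) / (total + len(vocab))

print(p_word_given_class("ball"))      # (3 + 1) / (5 + 4)
print(p_word_given_class("referee"))   # unseen word gets (0 + 1) / (5 + 4), not 0
```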

Two criteria for Good Heuristic functions

Admissibility- A good heuristic never overestimates the cost to the goal, i.e., h(n) is at most the true cost from n to the goal. Consistency- Given a node n, a successor n', and the goal node G, a heuristic function is consistent if n, n', and G obey the triangle inequality

Search as Optimization

Advantages: Low memory usage; often finds reasonable solutions in large or infinite state spaces. Useful for pure optimization problems: Find or approximate the best state according to some objective function. Optimal if the space to be searched is convex.

General Description of Search

Agent formulates a goal. Agent formulates an abstract representation of the problem. Agent simulates sequences of actions until it finds a best path to the goal, or exhausts the search space. If it found the best path, it executes the path and achieves the goal.

How does a rational agent choose outcomes?

Agent cannot perfectly predict all outcomes. Agent relies on expected outcomes.

Rational Behavior

Agent is assessed by its performance, meaning the consequences of its actions

Goal Test

Agent possibly achieves the goal if any state s in the belief state satisfies the goal test. Agent necessarily achieves the goal if all states s in the belief state satisfy the goal test

Motivation for Alpha-Beta Pruning

Among the possible actions at a given MIN node, MIN will always choose the one that results in MAX's lowest score

*args

An iterable for a variable number of arguments
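A short sketch of *args collecting positional arguments into a tuple (the example function is made up):

```python
# *args gathers extra positional arguments into a tuple, which is iterable.
def average(*args):
    return sum(args) / len(args)

print(average(1, 2, 3))    # 2.0
print(average(*[4, 6]))    # unpacking an iterable into *args -> 5.0
```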

Performance Measure

An objective criterion for success of an agent's behavior, given the evidence provided by the percept sequence.

Simulated Annealing

Annealing: the process by which a metal cools slowly and as a result freezes into a minimum energy crystalline structure. Adopt a control parameter T, which by analogy with metallurgy is known as the system temperature. T controls the amount of randomness in picking a successor. T starts out high to promote escape from local maxima early on. T gradually decreases towards 0 over time to avoid getting thrown off track late in the search

Lambda

Anonymous functions. Used to create a function with no name. Any number of arguments, a single expression. Useful when a function is required only rarely.
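A quick sketch of a lambda as a one-off, single-expression function (example data is invented):

```python
square = lambda x: x * x              # any number of arguments, one expression
print(square(4))                      # 16

# typical use: a throwaway key function for sorting
pairs = [("b", 2), ("a", 3), ("c", 1)]
print(sorted(pairs, key=lambda p: p[1]))   # [('c', 1), ('b', 2), ('a', 3)]
```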

Using Search to Solve Sensorless Problems

Applying search algorithms to sensorless problems ○ So far, we have used search algorithms to search the state space ○ The same algorithms can search the belief state ● Why should this work? ○ Percepts are irrelevant: they are always empty ○ The solution to a sensorless problem is a single sequence of actions ○ The belief state is deterministic: the agent knows its own beliefs by definition

Or-nodes

As in the deterministic search methods, search trees will contain state nodes with one or more possible action arcs

Expectations for the Transition Probabilities

Assume we have an estimate of the transition probability aij at a particular point in time t in the observation sequence ● Then the estimate of aij sums the expected counts over all t ○ Formulate an expression for the expected joint probability ξ_t(i,j) of state i at time t and state j at time t+1, based on observation sequences ○ For the numerator of the estimate, sum ξ_t(i,j) over all t ○ For the denominator, sum ξ_t(i,k) over all t and over all transitions i,k

Axioms versus Theorems

Axioms: foundational statements taken as given. Theorems: statements entailed by the axioms.

Given a finite branching factor and finite state space

BFS is complete. Each node branches at most b times.

Four uninformed search algorithms

BFS: FIFO queue; Optimal if step costs are the same. Uniform Cost Search (Best First): Priority queue ordered by a cost function; Optimal for any step cost function. DFS: LIFO queue; Not optimal. Iterative Deepening: Adopts benefits of BFS and DFS without their limitations

First-order Markov Process

Bayesian network over time ○ Random variables . . . Xt-2 , Xt-1 , Xt , Xt+1 , Xt+2 . . . ○ Directed edges for conditional independence ● Each state Xt is conditioned on the preceding state Xt-1

Viterbi Decoding for POS Tagging

Because many of these counts are small or do not occur, smoothing is necessary for best results ● HMM taggers typically achieve about 95-96% accuracy, for the standard 36-42 set of POS tags

Value Iteration

Bellman equations characterize the optimal values: V*(s) = max_a ∑_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')] ● Value iteration computes them by iterating the update: V_{k+1}(s) ← max_a ∑_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')]
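A minimal value-iteration sketch over a tiny made-up MDP (the states, transitions, and rewards are invented, not from the course):

```python
# transitions[s][a] is a list of (probability, next_state, reward) triples.
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.5)],
           "go":   [(1.0, "s0", 0.0)]},
}
gamma = 0.9                                   # discount factor

V = {s: 0.0 for s in transitions}
for _ in range(100):                          # repeat the Bellman update
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values())
         for s in transitions}
print(V)                                      # approximate optimal state values
```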

Two Ways to compare algorithm efficiency

Benchmarking: empirical measurement on benchmark tasks; very specific. Mathematical analysis with big O notation: asymptotic analysis of how computation time changes with the length of the input.

A* search

Best known form of heuristic best-first search. Key ideas: avoid expanding expensive paths, expand most promising first. Evaluation function f(n) = p(n) + h(n), where p(n) = the cost to reach the node and h(n) = the estimated cost to get from the node to the goal; f(n) = estimated total path cost through n to the goal. Implementation: frontier as a priority queue ordered by increasing f(n)

Components of O() notation

Branching factor: number of successors of a search node. Depth: number of actions in the optimal solution Maximum Depth: Maximum actions on any path

Dict

Built in container. Python dictionaries of key: value pairs are unordered and work by hashing so keys must be immutable.

Limitations of Propositional Logic

Can only state specific truths. Cannot state generic truths

Action cost

Case one: if all actions have same cost from any state, then the cost is the same as in the state space problem Case two: if the same action can have different costs, depending on the state, then the cost is a function of the belief state

Decision Networks

Chance nodes (ovals, as in Bayesian Networks) ○ Parents can be chance nodes or decision nodes ● Decisions (rectangles; no parents, treated as observed evidence) ● Utility nodes (diamonds, depend on action and chance nodes)

Concept of a Utility Function

Choosing among actions based on the desirability of their outcomes ○ Each action a in state s results in a new state s' with some probability P(result(a)=s') ○ The transition model gives the probabilities of action outcomes ● Given a utility function U(s) that quantifies the desirability of a state s ○ Expected Utility EU(a) of an action a is the sum of utilities over the outcomes, weighted by their probabilities

Evaluation of A*

Complete. Time: O(b^d). Space: all nodes are stored; runs out of space before time. A good heuristic can reduce the complexity by orders of magnitude. Optimal.

Iterative Deepening Evaluation

Complete. Optimal if all step costs are equal (e.g., step cost = 1). Time complexity = O(b^d). Space complexity = O(d).

Evaluation of Greedy Best First

Complete: No - can get stuck in loops Time: O(b^m) worst case Space: O(b^m)- priority queue Optimal: No

Inference Task: Prediction

Compute posterior over a future state, based on all the evidence to date

Inference Task: Smoothing

Compute the posterior distribution over a past state ● Smoothing gives a better estimate of Xk , k ≤ t, than was available at time tk ○ More evidence is incorporated for the state Xk - evidence preceding, concurrent with, and following Xk

HMM POS Tagger as a Bayesian Network

Condition the hidden POS tag p at time t on the POS tag at time t-1: P(pi |pi-1 ) ● Condition the word w at time t on the POS tag at time t: P(wi |pi )

Benefits of Crossover

Crossover can be beneficial given an advantageous pattern (schema) on one side of the crossover point.

Simple Reflex Agent

Current percept determines agent's next action

Alpha-Beta Algorithm Description

DFS ● Pass current values of α, β down to children during search ● Update values of α and β during search: ○ Update α at MAX nodes ○ Update β at MIN nodes ● Prune remaining branches at a node whenever α ≥ β

Methods to Handle Uncertainty for Logical Agents

Default or nonmonotonic logic. Fuzzy logic: truth values in [0,1] ○ Can handle different interpretations of the same predicate ● Subjectivist (Bayesian) Probability ○ Model agent's degree of belief ○ Estimate probabilities from experience (e.g., expectation as average) ○ Probabilities have a clear calculus of combination

Scope of Peano Axioms

Define multiplication as repeated addition Define exponentiation as repeated multiplication Define division similarly All of Number Theory is defined from: ● One constant (zero) ● One function (Successor) ● One predicate (+) ● Nine axioms

M-Step for Naive Bayes

Define γ^t(q_m) to be the new parameter value at iteration t for the prior probability of class q_m ● Define γ_j^t(v|q_m) to be the new parameters at iteration t for the conditional probabilities of each jth word v given the class q_m

E-Step for Naive Bayes

Define δ(qm |di ) to be the conditional probability of the class qm given di and the parameter values θ t-1 from iteration t−1 ● In each E-step, calculate the values δ(qm |di )

The Bellman Equations

Definition of "optimal utility" gives a simple one-step lookahead relationship amongst optimal utility value

Minimax Performance

Depth-first search (DFS) with fixed number of ply m as the limit. ● O(b^m) time complexity ● O(bm) space complexity if algorithm computes all moves at once ● O(m) space complexity if algorithm computes moves one at a time Performance will depend on ● The quality of the static evaluation function (expert knowledge) ● Depth of search (computing power and search algorithm)

Transition model: Two cases

Deterministic actions: Non-deterministic actions

Diagnostic Knowledge versus Causal Knowledge

Diagnostic network is less compact than the causal network: 1 + 2 + 4 + 2 + 4 = 13 CPT entries instead of 10 ● Causal models and conditional independence seem hardwired for humans!

Expectation Maximization and Naive Bayes

EM learns the parameters of a probabilistic model ● Meeting 28 presented EM for learning the parameters of an HMM ● Naive Bayes is one of the simplest probabilistic models ○ If the training data has class labels, the MLE parameter estimates can be computed from the observed distribution in the data ○ If the training data lacks class labels, the Naive Bayes parameters can be learned through EM

Computing Values of Each Trellis Entry

Each cell of the forward algorithm trellis αt ( j ) represents the probability of being in state j after seeing the first t observations, given the HMM λ . The value of each cell is computed by summing over the probabilities of every path that leads to this cell

Markov Assumption

Each state depends on a fixed finite number of prior states ● Future is conditionally independent of the past ● A Markov chain is a Bayesian network that incorporates time (temporal sequences of states)

Universal Instantiation (UI)

Every instantiation of a universally quantified sentence is entailed by it

Structure of CSP Problems

Examination of the CSP graph can be used to solve problems more quickly ○ Independent subproblems ○ Tree-structured CSPs ● Patterns of values can also help solve problems

Utility Based on Rational Preferences

Existence of Utility Function: if an agent's preferences obey the axioms of utility, then ○ There exists a function U such that U(A) > U(B) iff A ≻ B ○ U(A) = U(B) iff A ∼ B ● Expected Utility of a Lottery

Magic Methods

Existing method names with preceding/following underscores. Built-ins that can be modified for user's classes. Best to make them intuitive, like duck-typing. Adds magic to user classes that builds on Python syntax for built-ins.

Constructing a Search Tree

Expand each next node by applying actions

Fixed Policies

Expectimax trees max over all actions to compute the optimal values ● If we fixed a policy π(s), then the tree will be simpler, with one action per state ○ Though the tree's value would depend on which policy we fixed

Binomial Distribution: x Successes in y Trials

Experiment: n repeated trials where each trial has one of two outcomes, success or failure (Bernoulli trials) ● P (probability of success) is the same on every trial ● Trials are independent ● x = number of successful trials out of n ● Binomial Probability Mass Function

Types of Search queues

FIFO: first in first out; used in breadth first tree traversal LIFO: last in first out; used in depth first tree traversal Priority Queue: Orders nodes in queue by an evaluation function; used in best first search
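A compact sketch of the three frontier data structures in Python (toy values only):

```python
from collections import deque
import heapq

fifo = deque(["A", "B"])                 # FIFO queue for breadth-first traversal
fifo.append("C")
print(fifo.popleft())                    # A: oldest node comes out first

lifo = ["A", "B"]                        # LIFO queue (stack) for depth-first traversal
lifo.append("C")
print(lifo.pop())                        # C: newest node comes out first

pq = []                                  # priority queue ordered by an evaluation value
heapq.heappush(pq, (5, "B"))
heapq.heappush(pq, (2, "A"))
print(heapq.heappop(pq))                 # (2, 'A'): lowest evaluation value first
```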

Problems with DFS

Fails in infinite-depth search spaces. Can be very inefficient if m>>d

Inference Task: Most Likely Explanation

Find the most likely sequence of states that could have generated a set of observations

Forward Probabilities

For a given HMM λ, given that the state q is i at time t, what is the probability that the partial observation o1 ... ot has been generated? ● Forward algorithm computes α_t(i) for 1 ≤ i ≤ N, 1 ≤ t ≤ T in time O(N²T) using the trellis, where T is the number of observations and N is the number of hidden states
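A small sketch of the forward recursion on a toy HMM; the parameters below are invented for illustration:

```python
import numpy as np

pi = np.array([0.6, 0.4])                # initial state probabilities
A = np.array([[0.7, 0.3],                # transition probabilities a(i, j)
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],                # emission probabilities b_i(o)
              [0.1, 0.9]])
obs = [0, 1, 1]                          # an observation sequence (indices into B)

alpha = pi * B[:, obs[0]]                # alpha_1(i) = pi_i * b_i(o_1)
for o in obs[1:]:
    # each trellis cell sums over all paths into state j, then emits o
    alpha = (alpha @ A) * B[:, o]
print(alpha.sum())                       # P(o_1 ... o_T | lambda)
```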

Rational Agent

For each possible percept sequence P, a rational agent should select an action a that is expected to maximize its performance measure.

● Monte Carlo Tree Search can be used instead of minimax for games ○ More efficiency for games with a high branching factor ○ No need for heuristics to inform the game state evaluation function

Formalizes the tradeoff between exploration versus exploitation

General Framework Monte Carlo

From a given position s_i, simulate m move sequences to game end ● Each simulation is called a rollout or playout ● Value of s_i is given by the average utility over the m simulations ● Random playouts work only for simple games, so we need ○ A playout policy (how to bias moves towards good ones) rather than randomly picking moves ○ What start positions to use for playouts, how many to do? ■ Pure Monte Carlo: do N simulations, find the next move with highest win % ■ Selection policy: balances exploration and exploitation

GMP and FOL

GMP is a lifted Modus Ponens ● Raises Modus Ponens from variable-free propositional logic to FOL ● Inference in FOL - Lifted versions of ○ Forward Chaining ○ Backward Chaining ○ Resolution

How to Play a Game by Searching

General Scheme 1. Consider all legal successors to the current state ('board position') 2. Evaluate each successor board position 3. Pick the move which leads to the best board position 4. After your opponent's best move(s), repeat.

Gradient Descent (Alternative)

Given a locally correct formula for the gradient, perform steepest ascent hill climbing to move in the direction of grad(f) = 0

Greedy Best First

Given f(n) = p(n) +h(n) for evaluation function f(n), path cost so far p(n) and estimate to goal h(n). Greedy Best first ignores p(n) so f(n) = h(n)

Reduction in Complexity

Given n random variables that are all conditionally independent on a single causal variable, probabilistic inference goes from O(2^n ) to O(n) ● Basis for naïve bayes ● Allows probabilistic inference to scale to many variables ● Conditional independence is very common, in contrast to full independence, which is not

Backward Algorithm

Given that we are in state i at time t, the probability β of seeing the observations from time t + 1 to T, given λ, is: ● The Forward and Backward algorithms give same total probabilities over the observations: P(O)=P(o1 ,o2 ,...,oT )

Hill Climbing Details

Given the current state n, an evaluation function f(n), and s equal to n's successor state that maximizes f(s). If f(s) >= f(n) then move to s. Else halt at n. Terminates when a peak is reached. Has no look-ahead beyond the immediate neighbors of the current state. Chooses a random best successor, if there is more than one. Cannot backtrack, since it doesn't remember where it's been
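A bare-bones hill-climbing sketch over integer states with an invented single-peak objective:

```python
import random

def f(x):
    return -(x - 7) ** 2                 # invented objective with one peak at x = 7

def hill_climb(start):
    current = start
    while True:
        successors = [current - 1, current + 1]
        best = max(f(s) for s in successors)
        if best < f(current):            # no successor is at least as good: halt here
            return current
        # choose a random best successor if there is more than one
        current = random.choice([s for s in successors if f(s) == best])

print(hill_climb(0))                     # 7
```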

Cycle Cutsets

Given the efficiency of tree-structured CSPs, turn other CSP graphs into trees ● A cycle cutset S of a graph G: any subset of vertices of G that, if removed, leaves G a tree ○ EG: Assign a value v to SA, then remove v from domains of remaining variables; SA is a cycle cutset that can be removed, leaving a forest of trees

Formulation for Estimating the Transition Probabilities

Given the expression ξ t (i,j), we can now sum over all t ● Numerator sums over all probabilities of states i at t and states j at t+1 ● Denominator sums over all probabilities of states i at t and states k at t+1 for all k in full set of states Q

Non-deterministic problems

Given the states s and actions A Current state + action = belief state consisting of alternative possible successors. Solution to search is a strategy for taking actions, rather than a specific action sequence. The strategy execution is contingent on detecting results of actions.

Deterministic Problems

Given the states s and actions A Current state + action = resulting successor state

Unification: Systematic Substitution

Given two logical statements p, q ● Find a substitution θ (unifier) that make p and q look identical Unify(p, q) = θ where Subst(θ,p) = Subst(θ,q) ● A key component of first order inference algorithms

Utility(s, p):

Gives a numerical value to player p at the terminal state

The Value of Information

How an agent chooses what information to acquire: values of any of the potentially observable chance variables in the model ○ Observation actions affect the agent's belief state ○ Value of any observation derives from the potential effect on the agent's actions

PEAS:

How to specify the task environment: Performance measure, Environment description, Actuators, Sensors

Expectimax: Average Case Instead of Worst Case

Idea: Uncertain outcomes controlled by chance, not an agent adversary

Improved Backtracking: Backjumping

Identify a variable's conflict set ○ Variable assignments that are in conflict. Jump back to reassign the most recently assigned variable in the conflict set

Constraint Learning

If a partial assignment is encountered that leads to failure (e.g., {WA = red, NT = green, Q = blue}) during a search, save the information ○ NoGood({WA = red, NT = green, Q = blue}) ● Then it won't be tried again ● Modern CSP solvers gain in efficiency through use of constraint learning

Method Resolution Order

If a subclass method name is in multiple parent classes, use the order in the subclass statement

Heuristic function h(n) = estimated cost from node n to goal node

If n is the goal then h(n) = 0. Evaluation function g(n) >= h(n). Otherwise h is not a good heuristic

String

Immutable, like a tuple with different syntax. Character encoding versus data storage and transmission.

Goal-based reflex agent

Implicit or explicit notion of planning. Agent's next action depends on transition model + sensor model+ goal

Utility-based agent

Implicit or explicit preference ordering of different plans for same goal

Rudimentary Profiling

Import the time module, retrieve the before and after times, subtract
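A minimal timing sketch following that recipe (the workload is arbitrary):

```python
import time

before = time.time()                       # time before
total = sum(i * i for i in range(1_000_000))
after = time.time()                        # time after
print(f"elapsed: {after - before:.4f} seconds")   # subtract
```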

General idea behind informed Search

Improves upon Best-First Search. (Best-first uses a priority queue ordered by an evaluation function f(n); f(n) is the path cost on the path from the start node to the current node.) What matters is the path from start to goal. (Replace f(n) with a new function g(n) that includes an estimate h(n) of the cost from the current node to the goal: g(n) = f(n) + h(n). Computation of h(n) uses heuristic knowledge to get a good estimate.)

Pros of Propositional Logic

In contrast to database languages or programming languages, PL is Declarative rather than procedural Supports partial information: disjunction, negation Compositional ○ Meaning of a complex statement is a function of the meanings of the parts (atomic statements, operators) Meaning is context-independent ○ Contrasts with natural languages, where meaning is context dependent

Solving Partially Observable Problems

In place of the successor function from fully observable deterministic search, we now have: ○ A PERCEPT function to produce possible observations in successor belief state ○ A RESULTS function to update the belief state ● Given the above, the AND-OR search algorithm can be applied to the belief state space to find a solution ● The solution is a conditional plan

Developing the Parameters for an HMM

In previous meetings, we assumed that the parameters of the HMM model are known ○ Often these parameters are estimated on annotated training data ○ Annotation is often difficult and/or expensive ○ Training data might be different from a new dataset of interest ● Parameters can be learned from unlabeled data ○ In principle, the goal is to maximize the parameters with respect to the current data

Theory versus Practice

In theory, Baum-Welch can do completely unsupervised learning of the parameters ● In practice, the initialization is very important

Mutual Information

In the training set, choose the k words which best predict the categories: words X with maximal Mutual Information with the class Y

Markov Blanket

Independent of the global network structure: ● For node A, the only part of the network to consider for predicting the behavior of A and its children ○ Parents of A, children of A, and the other parents of A's children ● This Markov Blanket property is exploited by inference algorithms that use local and distributed stochastic sampling processes

Estimating the Priors for Each State

Initial state distribution: π_i is the probability that q_i is a start state (t = 1)

EM Algorithm: Input and Initialization

Input ● An integer m for the number of classes ● Training examples d_i for i = 1 . . . D where each d_i ∈ ● A parameter T specifying the number of iterations Initialization: set γ^0(q_m) and γ_j^0(v|q_m) to some initial (e.g., random) values satisfying the constraints

Unification Increases Efficiency

Instead of instantiation, find a substitution for variables in the KB ○ Make the conjuncts in an implication match atomic statements in the KB ○ Find substitutions for quantified statements

Beam Search

Keep track of k states Find successors of all k Take top k successors across all k beams (shares information across beams, only the computation of successor states is pursued in parallel.) Stochastic variant: Pick a successor with probability proportional to the successor's value.

Iterative Re-Estimation of Parameters

Key Idea: parameter re-estimation by hill-climbing ● Iteratively re-estimate the HMM parameters until convergence ○ Initialize with some parameter values ○ With these initial parameters, compute the expected counts of states, and expected state transition counts, given the observation sequences ○ Using these expectations, which approximate MLE counts over the data (were the state labels available), we then recompute the transition probabilities and emission probabilities

DFS

LIFO; Usually implemented as a tree-like search. While the goal state has not been found- search finds each next successor proceeding to maximum depth. Frontier follows the next deepest node.

A heuristic is consistent, if for node n and its successor n' cost function c, and action a: h(n) <= c (n,a,n') +h(n')

Lemma: If h is consistent, f(n) is non-decreasing on any path (where f(n) is the estimated total path cost): f(n') = g(n') + h(n') = g(n) + c(n,a,n') + h(n') ≥ g(n) + h(n) = f(n)

Inference: Queries about Probabilities

Let X be a variable to query, E be the list of evidence variables, e be the list of observed values for the evidence, and Y the remaining unobserved values, then we can formulate a query P(X|e) ● Compute the probability of X conditioned on e by summing over all combinations of values of the unobserved variables ● Theoretically, this general query can be addressed for any conditioning context of any variable using a full joint probability distribution ● In practice, full joint probability distributions are impractical for large sets of variables

Semantics

Local semantics: given its parents, each node is conditionally independent of its other ancestors Local semantics give rise to global semantics

Solving a Univariate Regression

Loss is minimized when the partial derivatives of the loss function with respect to w1 and w0 are both 0

Maximum Expected Utility (MEU) Principle

MEU defines a rational agent as one that chooses its next action to be the one that maximizes the expected utility: ● Implementation requires computational solutions to perception, learning, causal knowledge about outcomes of actions, and inference ● Instead of a retrospective performance measure, a decision theoretic agent incorporates it into the agent's utility function, thus allowing it to anticipate how to achieve the highest performance

Formalizing Naive Bayes

MLE Parameter Estimates for Multinomial Naive Bayes ○ Generalization of MLE estimates for binary Naive Bayes ● EM for estimating NB parameters π_q and ρ_n for M classes

Marginalization and Conditioning

Marginalizing (summing out) for a variable sums over all the values of another variable in the joint distribution ● Conditioning, derived from applying the product rule to the rule for marginalizing

MAX AND MIN

Max moves first: all play is computed from MAX's vantage point ● When MAX moves, MAX attempts to MAXimize MAX's outcome ● MAX assumes that when MIN moves, MIN attempts to MINimize MAX's outcome

Hill Climbing

Maximize an objective function (global maximum)

Optimization

Maximize or minimize a real function: choose input values from the domain. Compute the value of the objective function.

Cons of PL

Meaning is context-independent ○ Contrasts with natural languages, where many expressions are context dependent ○ PL cannot express much of the meaning natural languages convey Limitations on the expressivity of propositional logic: ○ Cannot state generic truths, only specific ones ■ Specific truth (fact): B1,1 ⇔ (P1,2 ∨ P2,1) ■ Generic truth: Squares adjacent to pits are breezy

Gradient Descent

Minimize a loss function (global minimum)

Classes support

Modularity for easier troubleshooting Reuse of code through inheritance Flexibility through polymorphism

Inference Rules

Modus Ponens: if A is true, and A ⇒ B is true, then B is true

Iterator

Mutable objects with a next() method. Keeps track of how much of the iterator remains. Throws StopIteration when done. Uses memory efficiently. Implemented as classes. Objects for a data stream with a next method
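A sketch of a user-defined iterator class (a hypothetical countdown, not a course example):

```python
class Countdown:
    """Iterator that tracks how much remains and raises StopIteration when done."""
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

for i in Countdown(3):
    print(i)          # 3, 2, 1
```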

Details on UCS

Nearly the same as Dijkstra's algorithm, which finds lowest cost paths. Similar to BFS but applies a cost function to every step. (Orders the frontier by f; the first node on the frontier is therefore on the current lowest cost path.) BFS is optimal if all step costs are equal. UCS is optimal for any cost function.

BFS Cost optimality

Necessarily finds the shortest solution first. Optimal if all actions have the same cost.

Probability of a Sentence S: Bigram Version

Next guess: products of bigrams ● Given a lattice for word hypotheses in a 3-word sequence (e.g., "form subsidy for") ● Bigram probabilities are more predictive, but very low and sparse!

Graph search

No state occurs in more than one path.

Decision Tree Algorithm

Nodes consist of tests on attributes Edges consist of attribute values Leaves consist of output values

VPI Properties

Non-negative: one can ignore information that is not useful ● Non-additive, because the value depends on the current belief state ● Order-independence of sensing actions as distinct from other actions

DFS evaluation

Not complete; therefore not optimal. Time complexity = O(b^m) Space complexity = O(bm) - Linear space complexity

Efficiency of CSP Local Search

Not counting the initial placement of n-queens, the run-time of min-conflicts is independent of the size of n!

BFS Space and time complexity

O(b^d): the optimal solution is at depth d. Nodes remain in memory so time complexity = space complexity. Memory requirements quickly become a problem. BFS cannot solve problems with a large state space.

Belief Updating

Observations cannot increase uncertainty Sensing can be deterministic or non-deterministic ○ Deterministic sensing leads to disjoint belief states for the possible percepts, thus a partition of the predicted belief state

Uninformed Search

Only the information available in the problem definition is used. Different algorithms use different kinds of tree traversal.

Gaussian Mixture Model

Parameters of a mixture of Gaussians are ○ The weight of each component ○ The mean of each component ○ The covariance of each component

Manually Derived Parameters

Parameters that provide a good fit to the data (with smoothing), where good fit means that these parameters predict the data

Parts of Speech

Parts of speech: traditional grammatical categories like "noun," "verb," "adjective," "adverb" . . . (and many more) ● Functions: ○ Help distinguish different word meanings: N(oun) vs. V(erb) ■ EG: river bank (N) vs. she banks (V) at a credit union ■ EG: a bear (N) will go after honey vs. bear (V) with me ○ Preprocessing for many other natural language processing tasks

POS Tagging

Pervasive in Natural Language Applications ● Machine Translation ○ Translations of nouns and verbs have different forms (prefixes, suffixes) ● Speech synthesis ○ "Close" as an adjective versus verb ○ see table ● Sense disambiguation of word meanings ○ "Close" (adjective) as in near something ○ "Close" (verb) as in shut something

Design Process for a rational Agent

Precondition: PEAS specification Design: Construct a function f to maximize the value of the performance measures Implementation: Write and test an agent program that implements f on a particular architecture.

Transition Model in Partially Observable Environments

Prediction stage computes the hypothesized belief that results from taking action a in belief state b: Possible Percepts stage computes the possible observations in the predicted state: Update stage computes the belief state resulting from the percepts

Pros and Cons of Generators

Pro: avoids storing an entire sequence in memory. Con: bad if you need to inspect the individual values
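A short generator sketch illustrating the memory trade-off (the function is invented):

```python
def squares(n):
    for i in range(n):
        yield i * i                     # values are produced one at a time

gen = squares(5)
print(next(gen))                        # 0: only one value exists at a time
print(sum(squares(1_000_000)))          # never holds the whole sequence in memory
```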

Marginalize Out the Labels for Posterior of the Data

Probability of each example document di is the sum over all classes of the joint probability of the document di and the class Q ● Probability of each document di is thus given by the product of the prior probability of the class with the products of the conditional probabilities of each attribute in the document

Search in Continuous spaces: Brief introduction

Problem: a continuous action space has an infinite branching factor. (Many local search methods developed for discrete action spaces would not generalize to continuous action spaces. Search methods with random selection of successor states will work; or, in convex spaces, follow the gradient of the evaluation function.)

Newton-Raphson Method

Produces successively better approximations to the root (zero) of a real-valued function: x_{n+1} = x_n - g(x_n) / g'(x_n). A more direct route.
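A minimal Newton-Raphson sketch applying that update rule (g and its derivative are chosen only for illustration):

```python
def newton_raphson(g, g_prime, x, iterations=10):
    for _ in range(iterations):
        x = x - g(x) / g_prime(x)        # x_{n+1} = x_n - g(x_n) / g'(x_n)
    return x

# Example: the root of g(x) = x^2 - 2 is sqrt(2)
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x=1.0))   # ~1.414213562
```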

Deriving Naïve Bayes

1. Product rule applied to a joint probability distribution 2. Conditional independence

Nodes in search tree represent

Progress in the search, not states in the state space. Different paths to the same state create distinct nodes. Therefore, no backtracking in a search tree, only lookahead

Motivations for Simulated Annealing

Pros and cons of hill climbing. (If landscape is convex, it can be very fast. Real problems are rarely convex. If downhill moves are not allowed, cannot escape local maxima. Stochastic hill climbing allows downhill moves with low probability.) Random restart is complete but with very low probability. Simulated Annealing makes HC both efficient and complete. (Combines completeness of random restart with efficiency of stochastic methods. Basic idea: Diminish the randomness as search progresses.)

HMM Pos Tagger Parameters

Q = q_m ∈ {q_1, q_2, . . . , q_n} (|Q| = n = 36 for Penn TreeBank) ● A = a_ij transition probabilities for all 1,296 tag pairs q_i, q_j s.t. ∑_j a_ij = 1 ● O = o_1 o_2 . . . o_T sequences of T observations from a vocabulary V (words w, arranged in sentences of length T): training corpus ● B = b_i(o_t) for the 36×|V| observation likelihoods of observations o_t generated from states i ● π = π_1, π_2, . . . , π_n where π_i is the probability that the Markov chain will start with state i s.t. ∑_i π_i = 1

Continuous Variables, i.e., Infinitely Many Values

Range of a random variable could be all real numbers ● P(NoonTemp = x) ○ Range is defined in terms of a probability density function (pdf) ○ Parameterized function of x, e.g., Uniform(x; 18C, 26C) ■ 100% probability x falls in the 8C range 18C - 26C ■ 50% probability x falls in a 4C range within [18, 26] ○ Intuitively, P(x) is the probability that X falls within a small region beginning at x, divided by the width of the region

Representation of the Data

Real world data is converted to learning examples by defining random variables that occur in the data ○ Values for the random variables are derived from observations ○ The values of random variables for an example represent the example ● Assumptions: ○ The random variables are conditionally independent, given the class ○ The training and test data are drawn from the same distribution (stationarity)

Filtering Exemplified

Recursive estimation: for some function f, where the agent needs to compute the new state Xt+1 based on the new evidence e t+1, recursively add in evidence at each new time step to get the subsequent state

Forward or Backward Chaining

Require Horn Form ○ Conjunction of Horn clauses ○ Horn clauses: literal, or (conjunction of literals) ⇒literal ○ E.g., C ∧ (B ⇒ A) ∧ (C ∧ D ⇒ B) ● Modus Ponens (for Horn Form): complete for Horn KBs ● Forward chaining, linear time ● Backward chaining potentially much less than linear time

Depth Limited search

Set a limit l on search depth. Prevents infinite growth of early paths that do not contain goal node. Succeeds if d<=l, fails otherwise

Iterative Deepening

Set an initial limit l. At each next depth with no goal increase l

Visualizing an Ngram Language Model

Shannon/Miller/Selfridge method: ● To generate a sequence of n words given a bigram language model: ○ Fix an ordering of the vocabulary v1 v2 ...vk and specify a sentence length n ○ Choose a random value ri between 0 and 1 ○ Select the first vj such that P(vj) ≥ ri ○ For each remaining position in the sentence ■ Choose a random value ri between 0 and 1 ■ Select the first vk such that P(vk | vj) ≥ ri

Space complexity is analogous to time complexity

Size of memory instead of size of input. Units of space are arbitrary

Laplace Smoothing

Smoothing is necessary in many applications of NB or other probabilistic models, especially to text classification ○ Avoid zero counts using one of many smoothing methods ○ Laplace: add mp to the numerator, m to the denominator, of all parameter estimates, where p is a prior estimate for an unseen value (e.g., 1/t for t values of Xi), and m is a weight on the prior

Best First Search

Starts with the problem definition and an evaluation function. Initialize node to the start state; initialize the frontier to a queue containing node; initialize reached to a Python dict. A while loop runs while the frontier is non-empty. Pop the first node on the frontier and check whether it contains the goal state. If the most recently popped node does not contain the goal state, a for loop expands that node. (Adds each child to the frontier when the child's state has not been reached, or when there is a new way to reach the state with a lower path cost.) Failure if the while loop reaches an empty frontier.
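A compact sketch of that loop, with f(n) taken as the path cost (which makes it uniform cost search); the toy graph is invented:

```python
import heapq

def best_first(graph, start, goal):
    frontier = [(0, start, [start])]           # priority queue ordered by f
    reached = {start: 0}                       # dict: state -> best known path cost
    while frontier:                            # while loop over a non-empty frontier
        cost, state, path = heapq.heappop(frontier)
        if state == goal:                      # goal test on the popped node
            return cost, path
        for nxt, step in graph.get(state, {}).items():   # expand the node
            new_cost = cost + step
            if nxt not in reached or new_cost < reached[nxt]:
                reached[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt, path + [nxt]))
    return None                                # failure: frontier is empty

graph = {"S": {"A": 1, "B": 4}, "A": {"G": 5}, "B": {"G": 1}}
print(best_first(graph, "S", "G"))             # (5, ['S', 'B', 'G'])
```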

Infrastructure for Tree Search

State Sn: The current state of the search that a node n represents Parent Pn: the parent node that generated n Action An: the action from Pn to Sn Path cost: g(n), the cost from root to n

Search States, Initial State and Actions

States: Belief state space ○ Every possible subset of physical states P ○ Where |P| = N, the number of belief states equals 2^N (each physical state is either in or out of a given subset) Initial state: P in the absence of prior knowledge Actions: Two cases ○ All actions are safe: ○ Some actions cause disaster:

Hill climbing variants

Stochastic- Choose at random from uphill moves. When would this improve over choosing the best-valued successor? Random-restart- Trivially complete: if at first you don't succeed, try again. Where each search has a probability of success p, there is a high probability of success after 1/p trials. Works very well with few local maxima and few plateaux.

Depth-first search for CSPs with single-variable assignments is called backtracking search ● In CSP, variable assignments are commutative, meaning it does not matter what order the assignments are made ○ [step 1: WA = red; step 2: NT = blue] = [step 1: NT = blue; step 2: WA = red] ● Given the commutativity of any sequence of assignments, the number of leaves for a CSP with n variables of domain size d is d^n ○ In other words, one solution path p of length n is equivalent to all n! permutations of p

T

Do not evaluate a branch ● From a MAX node, given a value v ≥ β ○ MIN will never select that MAX node ● From a MIN node, given a value v ≤ α ○ MAX will never select that MIN node

T

For games whose payoffs are not win/lose ([0,1]), the expectiminimax values of chance nodes must be a positive linear transformation of the expected utilities

T

Given an HMM λ: ● Computing the likelihood of a sequence of observations P(o1, o2, o3) relies on the forward algorithm ○ The trellis entries carry forward all paths to each state that can emit o_i ○ The likelihood involves summing over combinations of state sequences that could produce a given observation sequence ● Computing the most likely sequence of states given the observations (decoding) P(Q1, Q2, Q3 | o1, o2, o3) relies on the Viterbi algorithm ○ The trellis entries carry forward all paths to each state that can emit o_i ○ Decoding finds the single maximum probability state sequence

T

Prune below a MAX node when alpha ≥ beta of its (MIN) ancestors ○ MAX nodes update alpha based on children's returned values ○ MIN at MAX's parent node will choose the action leading to beta ● Prune below a MIN node when beta ≤ alpha of its (MAX) ancestors ○ MIN nodes update beta based on children's returned values ○ MAX at MIN's parent node will chose the action leading to alpha

T

Reliance on independence and conditional independence reduces the number of relevant cases to consider, relative to the full joint probability distribution

T

Statistical modeling assumes even though x is observed, the values could have been different

T

The backward probability βt (i) is symmetrical to αt (i) in the Forward Algorithm

T

The belief states resulting from an action a in belief state b and the observations o resulting from the resulting possible percepts

T

The expression ξ t (i,j) can be formulated as the joint probability of the states at i and j conditioned on our observation sequences and the parameters:

T

● NB can be used to "classify" even when there is no causal relationship

T

● Random variables take on values in an experiment, e.g., a set of measurements

T

Sensor Markov Assumption

The agent's observations or evidence Et at time t depend only on the state Xt at time t

Solving Sensorless Problems with Search

The belief state-space can become too large for efficient search. Methods to handle search in belief state spaces ○ Prune the belief space: e.g., if the belief state space at node ni is a superset of the belief state space at node nj , then prune node ni ○ Use a more compact representation of belief ○ Incremental search: ■ A solution to an initial belief state S that contains {s1 , s2 , . . . sn } must work for each state s i ∈ S ■ So, find a solution to s1 ; test the solution for each next state; iterate

Most General Unifier (MGU)

The first unifier is more general than the second (less restrictive) ● There is a single most general unifier (MGU) that is unique up to renaming of variables

Syntax ctd

The nodes and edges represent the topology (or structure) of the network ● In the simplest case, the conditional distribution of each random variable is represented as a conditional probability table (CPT) ● A CPT gives the distribution over Xi for each combination of parent values

To-Move(s):

The player whose turn it is to move in state s

Syntax of Propositional Logic

The proposition symbols P1 , P2 etc are sentences ● If S is a sentence, ¬S is a sentence (negation) ● If S1 and S2 are sentences, S1 ∧ S2 is a sentence (conjunction) ● If S1 and S2 are sentences, S1 ∨ S2 is a sentence (disjunction) ● If S1 and S2 are sentences, S1 ⇒ S2 is a sentence (implication) ● If S1 and S2 are sentences, S1 ⇔ S2 is a sentence (biconditional)

Actions(s):

The set of legal moves in state s

Maximum Likelihood Estimation

The statistical inference problem is to estimate the parameters θ given observations x i for the classes y j ● Maximum likelihood estimation (MLE) chooses the estimate θ* that maximizes the likelihood function, i.e.: ● That is, the MLE parameters are those that maximize the probability of the observed data distribution in the classes

Result(a, s):

The transition model defining the result of taking action a in state s

Latent Variable Models

The validity of the hidden variable (e.g., part-of-speech tag; disease) depends on empirical evidence ○ Explanatory power of the hidden variable for multiple questions ○ Ability of trained judges to agree on its presence or absence ● Given that it can be difficult to get (enough) labeled data, EM can be used to estimate the model parameters

Value of Perfect Information (VPI)

The value of discovering Ej is the average over all possible values ej, using the current belief state, less the expected utility of the best action given the current information

Parameter Re-estimation

Three sets of HMM parameters need to be iteratively estimated ○ Initial state distribution: πi ○ Transition probabilities: a i,j ○ Emission probabilities: bi (ot )

Paths with repeated states are non-optimal

Three solutions: 1.) Update a list of reached/visited states: practical when the set of all states fits easily in memory (aka graph search) 2.) Ignore: practical when the likelihood of revisiting a state is very low (tree-like search) 3.) Compromise and check for cycles for a limited number of steps (parent, grandparent): keeps memory needs constant

Evaluating algorithms

Time Complexity, Completeness, Space Complexity, Cost Optimality

Benefits of Memoization

Trades complexity of a function for complexity of a lookup. When a memoized function is evaluated, the result is stored in a memoization cache. Calling a recursive function is much faster if memoized. Otherwise, Python recursion is very slow
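A small sketch using functools.lru_cache as the memoization cache:

```python
from functools import lru_cache

@lru_cache(maxsize=None)         # results are stored in a lookup cache
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))                  # fast; the un-memoized recursion is exponential
```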

HMMS: Reasoning about Unobserved States

Transition model: How states change over time ● Sensor model: How the state at a given time t affects what evidence the agent perceives at t ● The distribution of the random variable Xt of states, and the random variable Et of evidence form the agent's belief state

Model-based reflex Agent

Transition model: what can happen Sensor model: what the state might be, given a percept Agent's next action depends on transition model + sensor model

Optimality of A*

Tree search version is optimal if h(n) is admissible. Graph search version is optimal if h(n) is consistent. Lemma: A* expands nodes on frontier in order of increasing f. Gradually adds f contours of nodes. A* is a variant of UCS

A PEAS specification of the task environment provides a way for the designer to determine if the pre-conditions for a rational agent are met

True

A function can be returned as the value of another function

True

A node has been expanded if its children have been identified

True

A problem with fewer restrictions on the actions than the original is called a relaxed problem

True

A search node is a data structure used during search

True

A solution is optimal if no solution has a lower path cost.

True

A state represents a physical configuration

True

AI exploits various types of computing methods

True

AI relies on any combination of Heterogeneous Technologies

True

AI solves problems rationally

True

Agent= architecture +program

True

As input approaches infinity O(n) is necessarily better than O(n^2)

True

Assume both players play optimally ● Max prefers next state to have maximum value ● Min prefers next state to have minimum value

True

Belief update is central to agents that operate in partially observable worlds

True

Big O ignores constant multiplicative factors

True

Can Store functions in data structures

True

Can assign functions to variables

True

Default Parameter values are evaluated once when the def statement they belong to is first executed

True

For smaller input sizes, depending on the algorithm O(n^2) could be better

True

Frontier can be represented as a queue

True

Functions can be passed as arguments to other functions

True

Graph search algorithm is the same as tree search with the addition of the explored set.

True

Minimax serves as the basis for the mathematical analysis of games

True

Most real-world problems involve partial knowledge of the state of the world

True

Precedence: args must precede kwargs; a specific argument must precede args

True

Rate of growth of runtime grows relative to input size

True

Search in non-deterministic worlds must consider alternative outcomes

True

The cost of an optimal solution to a relaxed problem is an admissible heuristic for the original problem

True

The function uses the same mutable object each call in the recursion.

True

The search() method is a generator that finds all the nodes that can be reached from a given node

True

To define a subclass, put the name of the superclass in parentheses after the subclass's name on the first line of the definition

True

To redefine a method inherited from the parent class, add a new definition of the same name to the subclass. The object's class determines which definition is used: if the object is an instance of the parent class, the parent method is used; if it is an instance of the subclass, the subclass method is used

True

Unlike Java, a Python function is specified by its name alone

True

User can also give a class the ability to use [] notation like an array or () notation like a function call

True

User can specify class-specific behavior for comparison operators

True

● 8-Puzzle cannot be solved if the environment is non-observable/sensorless

True

O() notation summarizes the large scale performance

True; Easier to use than assessing the actual number of operations. Less precise than the alternative

Nodes that have been generated but not expanded are referred to as the frontier.

True; The frontier is used to guide the direction of search.

Machine Learning in General

Types of ML ○ Replicate a pattern given by a supervision signal ○ Discover new patterns (unsupervised learning) ■ Infer a pattern given an indirect supervision signal (semi-supervised) ○ Learn through trial and error by interacting with environment and receiving a reinforcement signal (reward) ● Supervised machine learning types: ○ Classification (e.g., Naive Bayes) ○ Regression ○ Sequence prediction

Tips about Quantifiers

Typically, the connective in a universally quantified sentence is ⇒ ○ Everyone in CMPSC 442 is smart: ∀x In(x, CMPSC 442) ⇒ Smart(x) ○ In contrast: ∀x In(x, CMPSC 442) ∧ Smart(x) means: Everyone is in CMPSC 442 and everyone is smart ● Typically, the connective in an existentially quantified sentence is ∧ ○ Someone in CMPSC 442 is smart: ∃x In(x, CMPSC 442) ∧ Smart(x) ○ In contrast: ∃x In(x, CMPSC 442) ⇒Smart(x) means: Anyone taking CMPSC 442 is smart (possibly no one)

Learning the HMM Parameters

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ* such that P(O|λ*) is maximized over all models ● It is possible to find a local maximum ● Theorem: Given an initial model λ, we can always find a model λ' such that P(O|λ') ≥ P(O|λ)

MAC Algorithm: Maintaining Arc Consistency

Unlike AC-3, forward checking does not recursively propagate constraints when domains of variables are changed ● Solution: combine AC-3 and forward checking ○ After making an assignment to Xi ■ Find subset Y = all arcs (Xi ,Xj ) ● Call AC-3 on Y

Least Constraining Value Heuristic

Used for ordering the domain values (see backtracking pseudo-code) ● Intuition: ensure maximum flexibility for remaining assignments

Empirical gradient methods

Used when the equation grad(f) = 0 has no closed form solution. Search progress depends on comparing the values of the objective function f for the current state x and the successor state x'. Progress is measured by the change in value of f.

Evolutionary Algorithms

Variants of stochastic beam search, natural selection as a metaphor. Many varieties

Back Tracing

Viterbi recursion computes the maximum probability path to state j at time T given the observation o1, . . . , oT ● Viterbi must also identify the complete single path that gives this maximum probability ○ Keep backpointers ○ Find the most probable final state (the state j at time T with the highest path probability) ○ Trace backpointers from state j at time T to find the state sequence from T back to 1

Viterbi Recursion

Viterbi recursion computes the maximum probability path to state j at time t given the observation o1 , . . . , ot

A solution to a non-deterministic problem assumes that at execution time, the agent's percepts can resolve the outcome of the action

What if the agent's percepts do not provide enough information? ○ The environment is partially observable or not observable ● Sensorless (or conformant) problems: states are not observable ○ Sensors can be time consuming, unreliable or can suffer damage ○ Example: in manufacturing, placing parts in the correct location could rely on constraints/physics rather than sensing ○ Example: in medicine, a broad-spectrum antibiotic can treat many infections, so no need to wait for test results to identify the pathogen

Belief States

When actions are non-deterministic, the agent must maintain a state representation that includes all the possible action outcomes. A state representation with alternative states that might exist is referred to as a belief state.

AND-node

When an action has alternative outcomes, the search tree must consider the path to each possible outcome s_i, s_i+1, and so on

Local search

When path to goal doesn't matter. Iterate (Remember only current state, move to best neighboring state; forget the past). Idea: Incrementally improve an initial guess.

Yield

When we call next(), the method runs until it encounters a yield statement, and then it returns the value that was yielded. On the next call, next() resumes the data stream object where it left off.
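A minimal sketch of a generator and next(); countdown is a hypothetical example, not from the slides.

def countdown(n):
    while n > 0:
        yield n        # execution pauses here and resumes on the next call to next()
        n -= 1

g = countdown(3)
print(next(g), next(g), next(g))   # 3 2 1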

When to Prune:

Whenever Alpha ≥ Beta

Inference during Search, using Forward Checking

Whenever a variable Xi is assigned, for each variable Xj connected to Xi by a constraint, delete from the domain of Xj any value that is inconsistent with the value chosen for Xi

Minimax Algorithm Semi-Formally

While game not over: 1. Start with the current position as a MAX node 2. Expand the game tree a fixed number of ply (amount of look-ahead) 3. Apply the evaluation function to the leaf positions at look-ahead depth 4. Back-up the values all the way to the root 5. Pick the move assigned to MAX at the root 6. Wait for MIN to respond

Performance of Naïve Bayes for Text Classification

Words in running text are not independent, so application of NB for text classification is naïve ● Nevertheless, NB works well on many text classification tasks ○ The observed probability may be quite different from the NB estimate ○ The NB classification can be correct even if the probabilities are incorrect, as long as it gets the relative conditional frequency of each class value correct

Frequently used magic methods

__init__ , __len__, __copy__ , etc.

**kwargs

a dict of keyword arguments, unspecified length

Profile

a set of statistics that describes how often and for how long various parts of the program executed

Monte Carlo

are based on computing an expectation from repeated random simulations, aka chance

A rational agent is designed to achieve the_______ outcome, where the best is relative to the explicit performance criteria

best

Iterable

can be iterated over. All sequences are iterable. Has __iter__ method. returns an iterator object

Sequences

can be mutable or immutable; have similar syntax; are iterable

Percept Sequences and actions

can be organized in a table; the table can be restricted to a finite size by restricting the length of a percept sequence

Acceptance probability (ap)

Close to 1 if the new solution is better; decreases as the new solution is increasingly worse; decreases as T decreases if cost(new) > cost(old). ap = e^((cost(old) - cost(new))/T). Can accept mildly bad but not terrible next moves; accepts bad jumps earlier (high T) rather than later (low T)
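A minimal Python sketch of this acceptance test; the function and parameter names are illustrative assumptions.

import math, random

def accept(cost_old, cost_new, T):
    """Simulated-annealing acceptance: always accept improvements; accept
    worse moves with probability e^((cost_old - cost_new) / T)."""
    if cost_new <= cost_old:
        return True
    ap = math.exp((cost_old - cost_new) / T)   # < 1 when the new state is worse
    return random.random() < ap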

Probability Density Function (PDF)

for a continuous random variable, the probability for the value to be within some range

Cumulative Distribution Function (CDF)

for a continuous random variable, the probability of having a value ≤ n

● Probability Mass Function (PMF)

for a discrete random variable, the probability of each value

Generators

functions that evaluate to next item in an iterator object, with a yield keyword

An agent

is a function that perceives and acts

range()

is its own class of immutable, iterable objects. Attributes: start, stop, step. Methods: count, index. A range object can be turned into an iterator

Operator overloading

is possible using special methods on various classes
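A minimal sketch of operator overloading with special (magic) methods; the Vector class is a hypothetical example.

class Vector:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __add__(self, other):            # overloads +
        return Vector(self.x + other.x, self.y + other.y)
    def __lt__(self, other):             # class-specific behavior for <
        return (self.x, self.y) < (other.x, other.y)
    def __getitem__(self, i):            # enables [] notation like an array
        return (self.x, self.y)[i]
    def __call__(self):                  # enables () notation like a function call
        return (self.x, self.y)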

Operations on frontier

isEmpty(): tests if the frontier is empty; Top(): returns the first node on the frontier; Pop(): removes the first node on the frontier and returns it; Add(): inserts a node into the frontier

reduce

iteratively applies func to the accumulated result and the next member of a sequence, returning a single value
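A minimal sketch with functools.reduce from the standard library.

from functools import reduce

total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])   # ((1 + 2) + 3) + 4 = 10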

Keyword Arguments

means you do not have to remember the linear order of args to a function, but you do need to know the keyword names.
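A minimal sketch of positional, *args, and **kwargs parameters; the report function is a hypothetical example.

def report(name, *args, **kwargs):
    print(name, args, kwargs)

report("run", 1, 2, verbose=True, seed=42)
report("run", 1, 2, seed=42, verbose=True)   # keyword order does not matter, names do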

Cromwell's Rule

In Bayesian approaches ● Use of prior probabilities of 0 or 1 should be avoided ● Because: if the prior probability is 0 or 1, then so is the posterior, and no evidence can be considered

Simple Search Problems:

single agent, episodic, fully observable, deterministic, static, discrete, known

Optimal Solution

smallest number of actions in solution

Problem Formulation determines:

the combinatorics of the search space; the efficiency/complexity of the search algorithm

Search is a way to solve problems when

to achieve its goal, the agent needs to execute a sequence of actions, and must look ahead to choose among multiple possible actions at the next step. The state of the world is represented atomically. (discrete, no internal structure)

Python uses a ___________system 0 to len(sequence)-1

zero-based indexing

Recap: Optimal Quantities

▪ The utility of a state s: U*(s) = expected utility starting in s and acting optimally ▪ The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally ▪ The optimal policy: π*(s) = optimal action from state s

● For chance nodes, e.g., C:

○ Can be pruned if we can find the upper bound of its value ○ Consider the bounds on the utility function, e.g., [-1, 2] ○ Then there are upper and lower bounds on the expectation at C

8-Puzzle can be solved if just one cell c i can be observed:

○ If cell c_i is empty, move an adjacent tile into c_i and observe its value v_i ○ Else record the value v_i of the tile in c_i ○ For every successor belief state s_k ■ Keep track of the new location of v_i in s_k ■ Every time a new tile is moved into c_i, observe its value v_j

Degrees of belief

○ P(A) = 1: Agent completely believes A is true. ○ P(A) = 0: Agent completely believes A is false. ○ 0.5 < P(A) < 1: Agent believes A is more likely to be true than false.

Probability of a Sentence S: Unigram Version

● A crucial step in speech recognition, language modeling, etc. ● First guess: product of unigram probabilities: P(w1, ..., wn) ≈ P(w1)P(w2)...P(wn) ● Given a lattice for word hypotheses in a 3-word sequence, the above formulation is not quite right

Models

● A logical model m is a formally structured world with respect to which truth can be evaluated ○ m is a model of a sentence α if α is true in m ○ M(α) is the set of all models of α ● Entailment: ○ KB ⊨ α iff M(KB) ⊆ M(α)

A Perceptron is a Classifier

● A neural net: neurons linked by directed arcs ● A sigmoid perceptron = logistic regression classifier

Decision Trees Learn Decision Rules

● A path from root to leaf is a decision rule ○ EG: If Patrons == none → No ● This tree has 13 leaves (paths, or decision rules) ○ The root attribute (Patrons) has 3 values ■ None (n=2) ■ Some (n=4) ■ Full (n=6) ● A tree with fewer paths would be more compact (simpler), thus preferred

Values of Random Variables

● A random variable V can take on one of a set of different values ○ Each value has an associated probability ○ The value of V at a particular time is subject to random variation ○ Discrete random variables have a discrete (often finite) range of values ○ Domain values must be exhaustive and mutually exclusive ● For us, random variables will have a discrete, countable (usually finite) domain of arbitrary values ○ Here we will use categorical or Boolean variables

Markov Decision Processes: Decisions over Time

● A set of states s ∈ S ● A set of actions a ∈ A ● A transition function T(s, a, s') ○ Probability that a from s leads to s', i.e., P(s'| s, a) ○ Also called the model or the dynamics ● A reward function R(s, a, s') ○ The reward function (per time step) ○ Figures into the agent's utility function (over time) ● A start state s0 ● Possibly a terminal state

Tree-structured CSPs

● A tree-structured CSP can be solved in time linear in the # of variables ● A constraint graph is a tree when any pair of variables is connected by only one path ● Do a topological sort of the CSP graph to create a tree ○ A linear ordering of the nodes where for every directed edge Xi , Xj , Xi precedes Xj

Bias Variance Tradeoff

● AIMA defines bias in terms of the selected hypothesis space (e.g., linear functions versus sinusoidal) ● AIMA defines variance as arising from the choice of training data (variance across possible training sets)

Conceptual Basis for Decision Theoretic Agent

● Ability to reason about an uncertain world ○ Probabilistic models of agent's beliefs ○ Factored state representations ● Ability to reason about conflicting goals ○ Axioms of utility: constraints on a rational agent's preferences ○ Decision networks: nodes for belief states, actions, utilities ○ Value of information in different settings

Policy Iteration

● Alternative approach for optimal values: ○ Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence ○ Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values ○ Repeat steps until policy converges ● This is policy iteration ○ It's still optimal! ○ Can converge (much) faster under some conditions

Markov Chain versus HMM

● An HMM is a non-deterministic Markov Chain: cannot uniquely identify a state sequence ● States are partially observed (sensor model)

Baum-Welch

● An example of Expectation Maximization ● E-Step: Compute expected values of the states j at times t using γ_t(j) and of the transitions i,j from t to t+1 using ξ_t(i,j) ● M-Step: From these expected values, compute new estimates of the transition and emission parameters ● Iterate until convergence

Utilities for a Fixed Policy

● Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy ● Define the utility of a state s, under a fixed policy π: U^π(s) = expected total discounted rewards starting in s and following π ● Recursive relation (one-step look-ahead / Bellman equation) using a fixed policy: U^π(s) = Σ_s' P(s'|s, π(s)) [R(s, π(s), s') + γ U^π(s')]

Issues with Chess Lead to Hybrid Approaches

● Apply Eval() only to quiescent positions, where there is no pending move that shifts the game wildly (e.g., capturing the queen) ● ProbCut, a probabilistic cut algorithm (Buro, 1995) ○ Uses forward pruning with Alpha-Beta ○ Estimates the probability that a node can be safely pruned based on statistical knowledge of game states ● Table lookup for openings and endgames, which have fewer variations

Maximum a Posteriori Decision Rule (MAP)

● Approximately Bayesian ● Foundation for Naïve Bayes classifiers ● Find the most probable hypothesis hi, given the data d: h_MAP = argmax_h P(d|h) P(h)

How do We Apply Cromwell's Rule

● Assume we know how many types never occur in the data ● Steal probability mass from types that occur at least once ● Distribute this probability mass over the types that never occur

Efficient Model Checking Algorithms for PL

● Backward chaining (Horn Clauses) ● Forward chaining (Horn Clauses) ● DPLL Algorithm (Davis, Putnam, Logemann, Loveland) ○ Efficient and complete backtracking ○ Can efficiently handle tens of millions of variables ○ Applications include hardware verification ● WalkSAT ○ Local search, thus very efficient ○ Incomplete

Evaluation functions for board position: f(n)

● Based on static features of that board alone ● Zero-sum assumption lets us use one function to describe goodness for both players ○ f(n) > 0 if MAX is winning in position n ○ f(n) = 0 if position n is tied ○ f(n) < 0 if MIN is winning in position n ● Define using expert knowledge

Batch Gradient Descent for Univariate Case

● Batch: sum over all data points (one epoch) ● Translates into one update rule for each weight ○ α is the learning rate, with the 2 folded into α ○ Guaranteed to converge if α is small ○ Increasingly slow as N increases
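A minimal sketch of batch gradient descent for the univariate linear model y = w0 + w1·x; the function and variable names are illustrative assumptions.

def batch_gd(xs, ys, alpha=0.01, epochs=1000):
    """Batch gradient descent for univariate linear regression."""
    w0 = w1 = 0.0
    n = len(xs)
    for _ in range(epochs):               # each epoch sums over all data points
        dw0 = sum((w0 + w1 * x) - y for x, y in zip(xs, ys)) / n
        dw1 = sum(((w0 + w1 * x) - y) * x for x, y in zip(xs, ys)) / n
        w0 -= alpha * dw0                 # alpha is the learning rate
        w1 -= alpha * dw1
    return w0, w1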

Baum-Welch

● Baum-Welch algorithm uses Expectation Maximization to iteratively re-estimate the parameters, yielding at each iteration a new model λ' ○ Initializes λ to a random set of values, then for each iteration: ○ Calculates the forward probabilities (left to right) and the backward probabilities (right to left) to get the probability of the states i and j at times t and t+1 ■ Every state transition from t to t+1 occurs as part of a sequence ■ For all transitions, we compute the forward probability from the sequence start to t, and the backward probability from the sequence end back to t+1 ○ Re-estimates the parameters ● Requires an algorithm for backward computation through the trellis

Reasoning about Cause and Effect

● Bayes' Rule provides a way to reason from causes to effects ● Note that normalization of probabilities to sum to one means only two kinds of knowledge are needed ○ Prior probability of the cause P(c) ○ Likelihood of the effect given the cause P(e|c)

Handling Uncertainty over Time

● Builds on search in partially observable worlds ○ Belief states + transition model define how agent predicts how the world might be at each next time step ○ Sensor model defines how to update the belief state ● Probability is used to quantify degrees of belief in elements of the belief state ● Time is handled by considering a set of random variables at each next point in time

Motivation for Dynamic Programming

● Calculation of the likelihood of the observations P(O|λ) ○ Sum the probabilities of all possible state sequences in the HMM ○ The probability of each state sequence is the product of the state transitions and emission probabilities ● Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences. ○ For T=10 and N=10, 10 billion different paths! ● Solution: linear time dynamic programming ○ DP: uses a table (trellis) to store intermediate computations

Non-descendants Property

● Capturing conditional independence relations where the conditional probabilities at a node (random variable) capture the dependence on the parent nodes, and independence from all other nodes ● A random variable is conditionally independent of its non-descendants, given its parents

Chain Rule to Construct Bayesian Networks

● Chain rule: a joint probability can be expressed as a product of conditional probabilities in the illustrated way ● Given the global semantics from the preceding slide, this gives a general assertion about variables in the network, and a construction procedure ● Generalization of Naive Bayes

Finding a Hypothesis Function for a Dataset

● Choose a model (meaning type of model) ○ In this context, choosing a model means choosing a hypothesis space, e.g., linear function, polynomial function ○ In other contexts, model can mean model + hyperparameters (degree-2 polynomial), or a specific model (e.g., y = 5x^2 + 3x + 2) ● Optimize (or train the model) ○ Find the best hypothesis (instantiated model) ○ Training relies on a training set and a smaller validation (or dev) set for developing the model

Cycle Cutset Algorithm

● Choose some cutset S ● For each possible assignment to the variables in S that satisfies all constraints on S ○ Remove any values for the domains of the remaining variables that are not consistent with S ○ If the remaining CSP has a solution, then you are done ● For graph size n, domain size d ● Time complexity for cycle cutset of size c: ○ O(d^c · d^2(n−c)) = O(d^(c+2)(n−c))

Generalizing Bayes' Rule

● Conditionalize Bayes' rule on background evidence e ● The evidence e can stand for the set of other variables in a joint probability distribution besides X and Y

FOL Vocabulary

● Constants: Richard, John, 2, . . . ● Connectives: ¬ ∧∨⇒⇔ ● Variables x, y, a, b,... ● Predicates: True/1, False/1, Person/1, >/2, give/3, sell/4. . . ○ Person(John) ○ KingOf(John, a) ● Equality (a special predicate) ● Functions: Sqrt, LeftLegOf, . . . ● Quantifiers: ∀, ∃

Decision Theory

● Decision Theory develops methods to make optimal decisions in the presence of uncertainty ● Decision Theory = utility theory + probability theory ● Utility theory is used to represent and infer preferences ○ Every state has a degree of usefulness ○ An agent is rational if and only if it chooses an action A that yields the maximum expected utility (expected usefulness)

Backward Chaining

● Depth-first recursive proof search: space is linear in size of proof ● Incomplete due to infinite loops ○ Fix: checking current goal against every goal on stack ● Inefficient due to repeated subgoals (both success and failure) ○ Fix: use caching of previous results (extra space) ● Widely used for logic programming

Extension to Multivariate Case

● Each example x_j is an n-dimensional vector ● The linear equation sums over all x_j,i and adds a bias weight ● The weights are therefore an (n+1)-dimensional vector, so we define a dummy input attribute x_j,0 = 1

Loss Functions: An Objective to Minimize

● Error rate can be due to different error types (e.g., one class) ● A loss function can be used as a training objective to minimize error for all classes ● Most generally, loss should take x into account ● Usually, x is ignored:

Learning the Multivariate Regression

● Essentially the same update rule ● Need to regularize in the multivariate case

Evaluation Functions for H-Minimax

● Estimation of the expected utility of state s to player p ○ If Is-terminal(s), Eval(s, p) = Utility(s, p) ○ Else Utility(loss, p) ≤ Eval(s, p) ≤ Utility(win, p) ● Evaluation functions should be ○ Fast (Heuristic Alpha-Beta intended to improve Alpha-Beta performance) ○ Informed by expert knowledge ○ Often based on features that form equivalence classes of game positions ■ For a given class, experience may indicate the proportion of times games end in win (utility=1.0), lose (utility=0) or draw (utility=0.5) ■ Then use expected value: e.g., if feature A leads to win 82% of the time, loss 2%, and draw 16%, expected value = 82% x 1 + 16% x 0.5 = 0.90

Policy Iteration ctd

● Evaluation: For fixed current policy π, find values with policy evaluation: ○ Iterate until values converge: ● Improvement: For fixed values, get a better policy using policy extraction ○ One-step look-ahead:

Reduction to Propositional Inference

● Every FOL KB can be propositionalized so as to preserve entailment ○ A ground sentence α is entailed by new KB' iff entailed by original KB ● Idea: propositionalize KB and query, apply resolution, return result ● Problem: with function symbols, an infinite number of ground terms can be generated

Probabilities of Elementary Events

● Every ωi ∈ Ω is assigned a probability (elementary event in the sample space) P(ωi) ○ 0 ≤ P(ωi) ≤ 1 ● Assuming Ω is finite (ω1, ..., ωn) we require ○ P(Ω) = ∑_i P(ωi) = 1

Stochastic Games

● Examples of stochastic games ○ Backgammon: includes rolls of the dice ● To extend minimax to handle chance: ○ The search tree must include a new ply for chance nodes (green circles) after every MAX or MIN node ○ Minimax(n) has to include an expectation of the value of a position, taking into account the probabilities of the chance events from green nodes

Another View of Bias Variance Tradeoff

● Expected prediction error (EPE) for a new observation with value x is given by: EPE(x) = σ² + Bias² + Variance ● σ² is the irreducible error (noise) apart from bias and estimation variance ● Bias is the result of misspecifying the statistical model f ● Estimation variance is the result of using a sample to estimate f ● Modeling goals: ○ Explanatory modeling attempts to minimize bias, meaning to find the same theoretical explanation for some phenomenon, e.g., across categories of datasets ○ Predictive modeling aims for empirical precision (minimize bias and estimation variance)

Backtracking Search Heuristics

● Exploits domain-independent heuristics (in contrast to the domain-dependent heuristics of informed search algorithms) ● Demonstrates the advantages of a factored state representation ● Four kinds of heuristics ○ Which variable to assign next: Select-Unassigned-Variable() ○ What inferences to perform at each step: Inference() ○ How far to backtrack: Backtrack() ○ When to save and re-use partial results

Correspondence of FOL and Natural Language

● FOL expressivity is closer than PL to natural language ○ Objects denote real-world entities, which can be referred to with noun phrases ○ Logical relations correspond to real-world relations, which can be expressed as adjectives and verbs ● FOL statements are context independent and unambiguous, while natural language phrases are context-dependent and ambiguous ○ Two FOL statements can have different forms and the identical semantic interpretation ○ Natural language statements and meanings are many-to-many ○ Natural language meaning is broader than a way to encapsulate "knowledge" (opinions/attitudes/social conventions/bias . . .)

Assertions and Queries

● FOL statements (assertions) can be added to a KB ○ Same as in PL ○ TELL(KB, Brother(Richard,John)) ● Two types of queries can be made ○ ASK(KB, saturated statement) returns true or false, depending on truth evaluation of statement (must not have unbound variables) ○ ASKVARS(KB, unsaturated statement) returns bindings for the unbound variables ∀, ∃ ■ AskVars(KB, ∀x evil(x)) ■ AskVars(KB, ∃x evil(x))

Types of Neural Networks

● Feedforward network: a directed acyclic graph ○ Information propagates in one direction ○ Output is a function of the input ● Recurrent network: has cycles ○ Outputs can recur as inputs ○ Output is a function of the initial state, dependent on previous inputs ○ Dependence on earlier inputs amounts to short-term memory ● Single layer versus multi-layer

What is Markov about MDPs?

● For Markov decision processes, action outcomes depend only on the current state: ● This is like search: ○ Search: successor function uses the current state to return successors ○ MDP: search process includes successor state based on current state, and the transition model, and the reward

Forward Checking versus Backjumping

● Forward checking can build the conflict set ○ When forward checking from an assignment X = v deletes v from the domain of Y, add X = v to the conflict set for Y ○ If the last value is deleted from the domain of Y, then the assignments in the conflict set for Y are added to the conflict set of X (since the assignment X = v leads to a contradiction in Y, try a new assignment for X) ● Notice that backjumping finds the same conflicts that forward checking does

Unsupervised Clustering of Continuous Data

● Gaussian distribution ○ Many things in nature, e.g., attributes of Iris species: sepal length, sepal width, petal length, petal width ○ Iris-Setosa dataset ● Mixture of Gaussians ○ Three species of iris ○ Gaussian Mixture Model can identify three distinct clusters in an unsupervised manner using EM

Resolution

● Generalization of unit resolution ● Two clauses can be combined to produce a new clause as follows ○ If the first clause contains a literal a, and the second clause contains its negation ¬a ○ Then inference step is to produce a new single clause that includes all the literals from both clauses except a and ¬a

Supervised Statistical Machine Learning

● Given a set of independent and identically distributed (i.i.d.) training examples (x^(1), y^(1)), ..., (x^(n), y^(n)), where each x^(j) is a vector of m attribute values ○ Assume each pair was generated by an unknown function y = f(x) ○ Discover a function y = h(x) where h approximates f ○ The labeled data represents the ground truth ● Test the generalization ability of h on labeled examples that are not in the training set

Seismic Data Can be Classified

● Given the weight vector for the seismic data, and the 2-D vectors of examples, a classification decision boundary can be operationalized as follows

Performance of Alpha-Beta Pruning

● Guaranteed to compute the same root value as Minimax ○ Recall: the root value tells MAX which action to take ● Worst case complexity: no pruning, same as Minimax, O(b^d) ● Best case complexity: when each player's best move is the first option examined, examines only O(b^(d/2)) nodes, allowing the search to go twice as deep!

Heuristic Alpha-Beta

● H-Minimax function will treat non-terminal nodes as if they were terminal ○ Replace the terminal test with a cutoff test (e.g., use iterative deepening) ○ Replace the utility function with a heuristic evaluation function

Linear Classification with Logistic Regression

● Hard threshold decision rule is non-differentiable, cannot use SGD ● Replacing hard threshold with sigmoid function gives a differentiable decision function ● Because the values of the sigmoid function are in [0,1] they are interpreted as the probability of the class for each example

Two Layer XOR Gate

● Hidden unit in the middle with a threshold of 1.5 goes on only if both inputs are 1 ● The three weights on the inputs to the final output ensure: ○ If input is 1, 1 then sum of weights is 0 ○ If input is 1, 0 then sum of weights is 1 ○ If input is 0, 1 then sum of weights is 1 ○ If input is 0, 0 then sum of weights is 0
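A minimal sketch of this construction with threshold units; the output threshold of 0.5 is an assumption, since the slide gives only the weight sums.

def step(z, threshold):
    return 1 if z > threshold else 0

def xor(x1, x2):
    h = step(x1 + x2, 1.5)                 # hidden unit fires only when both inputs are 1
    return step(x1 + x2 - 2 * h, 0.5)      # weights (1, 1, -2) give sums 0, 1, 1, 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))             # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0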

Extensions to Bayesian Networks

● Hidden variables ● Decision (action) variables ● Random variables with continuous distributions ● "Plate" models ○ Latent Dirichlet Allocation (LDA), a generative statistical model that allows sets of observations to be explained by unobserved groups ● Dynamical Belief Nets (DBNs): Change over time ○ Hidden Markov Models (HMMs): a special case of DBNs in which the entire state of the world is represented by a single hidden state variable

Policy Evaluation

● How do we calculate the U's for a fixed policy π? ● Idea 1: Turn recursive Bellman equations into updates (like value iteration) ● Efficiency: O(s^2 ) per iteration ● Idea 2: Without the maxes, the Bellman equations are just a linear system ○ Solve with your favorite linear system solver

Convergence*

● How do we know the Vk vectors are going to converge? ● Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values ● Case 2: If the discount is less than 1 ○ For any state, Vk and Vk+1 can be viewed as depth-(k+1) results computed over nearly identical search trees ○ The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros ○ The last layer is at best all RMAX ○ It is at worst RMIN ○ But everything is discounted by γ^k that far out ○ Vk and Vk+1 are at most γ^k RMAX different ○ So as k increases, the values converge

Discounting ctd

● How to discount? ○ Each time we descend a level, we multiply in the discount once ● Why discount? ○ Sooner rewards probably do have higher utility than later rewards ○ Also helps our algorithms converge ● Example: discount of 0.5 ○ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 ○ U([1,2,3]) < U([3,2,1])

Unit Resolution: A First Step towards Completeness

● If (a ∨ ¬b) ∧ ¬a, then ¬b ● A disjunction of literals conjoined with the negation of one of the disjuncts proves the other disjunct ● Called unit resolution because one of the two inputs is a single literal (a "unit clause"), e.g., resolving the unit clause c with the clause (¬c ∨ d) yields d

A CSP can easily be expressed as a search problem ○ Initial State: the empty assignment {} ○ Successor function: Assign a value to any unassigned variable provided that there is not a constraint conflict ○ Goal test: the current assignment is complete ○ Path cost: a constant cost for every step

● If a solution exists, it is necessarily at depth n, given n variables ○ Depth First Search can be used

Summary of pruning

● If opponent's actions from node m or m' are better for Player than those from node n, the Player will never allow the game to proceed to n

Expectiminimax

● If the next node s is a chance node, then: ○ Sum over all the observed chance outcomes r at s, weighted by the probability P(r) of each chance action (e.g. dice roll)

Generalization Loss versus Empirical Loss

● If the set ε of all possible pairs (x,y) is known, then generalization loss can be used (this could be used for simulated data) ● Otherwise, empirical loss for a dataset E can be used, where N = |E| (this is far more typical)

Deterministic Environments

● In a deterministic environment (as in game playing, e.g., minimax), a preference ranking on states is sufficient, exact quantities for preferences are not needed ● Such preference rankings are called value functions

Search Strategies versus Decision Policies

● In deterministic single-agent search problems, find an optimal sequence of actions, from start to a goal ● For MDPs, find an optimal policy π*: S → A ○ A policy π gives an action for each state ○ An optimal policy is one that maximizes expected utility over time

Role of Conditional Independence

● In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n. ● Conditional independence is much more common in the real world than complete independence. ● Conditional independence is our most basic and robust form of knowledge about uncertain environments.

Definition of Random Variable

● In the probability space (Ω, Ρ): ○ Ω is a sample space (set of all outcomes) ○ Ρ is the probability measure ● Random variables are functions ● Given some probability space (Ω, Ρ), a random variable X: Ω →R is a function defined from the probability space to the real line ● In other words, Ρ attaches probabilities to events, which are subsets of Ω

Efficiency of Forward Chaining

● Incremental forward chaining: no need to match a rule on iteration k if a premise wasn't added on iteration k-1 ○ Match each rule whose premise contains a newly added positive literal ○ Problem: matching can be expensive: polynomial ○ Solution: database indexing allows linear time retrieval of known facts ● Forward chaining is widely used in deductive databases

Uninformed Search to Derive Proofs

● Initial State: The sentences in initial knowledge base ● Actions: Apply inference rules where a KB sentence matches the l.h.s. ● Result: Add r.h.s. of matched rules to KB ● Goal: A state containing the sentence to prove

EM for Naive Bayes

● Initialization: Assign initial values to the parameters ● Expectation step: Calculate the posterior probability of the class given the observed attribute values and the current parameters θ from initialization or from the M-step ● Maximization step: Calculate the new parameter values based on the class assignments from the E-step ● Continue for T iterations

Structure of the Forward Algorithm

● Initialization: all the ways to start ● Induction: all the ways to go from any given state at time t to any subsequent state at time t+1 ○ Given the probability for state qi at time t, induction carries forward the probability to each next qj at time t+1 ● Termination: all the ways to end
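A minimal sketch of this structure; pi, A, B, and obs are hypothetical model parameters (initial, transition, and emission probabilities, and a sequence of observation indices).

def forward(pi, A, B, obs):
    """Likelihood of the observation sequence via the forward trellis."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]            # initialization
    for o in obs[1:]:                                           # induction, t -> t+1
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                           # termination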

Biological Neuron

● Input ○ A neuron's dendritic tree is connected to a thousand neighboring neurons. When one fires, a positive or negative charge is received ○ The strengths of all the received charges are added together ● Output ○ If the aggregate input is greater than the axon hillock's threshold value, then the neuron fires ○ The physical and neurochemical characteristics of each synapse determines the strength and polarity of the new signal

Discounting

● It's reasonable to maximize the sum of rewards ● It's also reasonable to prefer rewards now to rewards later ● One solution: values of rewards decay exponentially

Using Resolution in FOL

● KB must be in CNF ● Yields a complete inference procedure ● Efficient inference strategies exist ● Similar to conversion of PL to CNF; differences due to quantifiers ○ Implication elimination ○ Move ¬ inwards ○ Standardize variables ○ Skolemize: remove existential quantification ○ Drop universal quantifiers ○ Distribute ∨ over ∧

Agents with Explicit Knowledge Representation

● Knowledge base = set of sentences (declarations) in a formal language ● Adding to the KB ○ Agent Tells the KB what it perceives: Si ○ Inference: derive new statements from KB + Si ● Using the KB ○ Agent Asks the KB what action to take ○ Declaration of the action to take leads to action

Minimax Setup

● Label the root MAX ● Alternate MAX/MIN at each next level of the tree (ply) ○ Minimax(node) is the utility of any node (for MAX) ○ Even levels represent turns for MAX ○ Odd levels represent turns for MIN

Perceptron Learning Rule

● Learning the weights for the classifier with a hard threshold ○ Cannot use gradient descent because the gradients of the values of the threshold function are either zero or undefined ○ Can use the following update rule (for a single example) that converges to a solution, provided the data are linearly separable ● This is called the perceptron learning rule ● It has the same form as the update rule for linear regression
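A minimal sketch of the perceptron update w_i ← w_i + α(y − ŷ)x_i for a single example; the names and the dummy bias input are illustrative assumptions.

def perceptron_update(w, x, y, alpha=0.1):
    """One perceptron update for one example; x[0] = 1 is the dummy bias input."""
    y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0   # hard threshold
    return [wi + alpha * (y - y_hat) * xi for wi, xi in zip(w, x)]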

Computing Actions from Q-Value

● Let's imagine we have the optimal Q-values: ● How should we act? ○ Completely trivial to decide! ● Important lesson: actions are easier to select from Q-values than values!

Computing Actions from Values (aka Utilities)

● Let's imagine we have the optimal values V*(s) ● How should we act? It's not obvious! ● We need to do a mini-expectimax (one step) ● This is called policy extraction: it gets the policy implied by the values

One Neuron (Unit)

● Link from neuron input ai to output aj propagates through the network ● Each input link has a weight wi,j for the strength and sign of activation ● A neuron's input is a weighted sum over the input links (including dummy input a0=1 for bias weight w0j ) ● A neuron's output activation aj is a function g over the input inj ● A neural network: neurons linked by directed arcs

Local Search for CSPs

● Local search algorithms such as hill-climbing and simulated annealing can apply to CSPs using complete state formulation ○ Each state assigns a value to every variable ○ The search changes one variable at a time ● Variable selection: randomly select any conflicted variable ● Value selection by min-conflicts heuristic: ○ Choose a value that violates the fewest constraints ○ Apply hill-climb with h(n) = total number of violated constraints

General Properties of a Logical Formalism

● Logic: a way to formulate statements, interpret them, and derive conclusions ○ Syntax: vocabulary, rules of combination to make statements ○ Semantics: truth of statements in possible worlds, or models ○ Entailment: a way to evaluate consistency of models with one another

Logical Agents and Uncertainty

● Logical agents have belief states ● Probability theory can be incorporated into logical agents ○ To change epistemological commitments from truth values to degrees of belief in truth ○ Ontological commitments (what is believed to be in the world) remain the same

Perplexity as a Language Model Metric

● Lower perplexity on test set means higher likelihood of observed sentences (less perplexed) ● Nth root normalizes the inverse probability by the number of words to get a per word perplexity ● Equivalent to the weighted average branching factor

Connection to Reinforcement Learning

● MCTS uses a fixed metric as a "selection policy" to choose the next move ● Reinforcement learning iterates through courses of action to train a flexible decision policy that maximizes a long term reward ● Similarities ○ Future moves are simulated ○ Exploration involves acquiring knowledge about unknowns, sometimes through failure ○ Exploitation involves re-using what has been learned through trial-and-error

Why a Naive Bayes Model is Generative

● ML models that rely on joint probability distributions are generative ○ Given a full joint probability distribution, P(X1 , . . ., Xn, ,Y), the hypothesis P(Y|X) depends on known priors and conditional probabilities: it's all probabilities ○ Conceptually, new data can be generated from the model ■ P(X|Y) = P(X,Y) / P(Y) ● ML models can be discriminative rather than generative ○ Discriminative models do not use full joint probability distributions ○ P(Y|X) depends on features X that discriminate among the outcomes Y ○ Logistic regression is a discriminative modeling method

Smoothing

● Many words are relatively rare ○ A relatively infrequent word might not occur in one of the text categories ○ A word in the test data might not have been seen in the training data ● If any word is assigned a zero conditional probability for any category, then the product of conditional probabilities is zero ● Ensure that a conditional probability is assigned for every word in every class ○ Smooth all conditional probabilities: add a small number to all counts ○ Use a special UNKNOWN token during training and testing; all words not in the training data can be treated as instances of UNKNOWN

Recap: MDPs

● Markov decision processes: ○ States S ○ Actions A ○ Transitions P(s'|s,a) (or T(s,a,s')) ○ Rewards R(s,a,s') (and discount γ) ○ Start state s0 ● Quantities: ○ Policy = map of states to actions ○ Utility= sum of discounted rewards ○ Q-Value = expected future utility from a q-state

Recall Minimax

● Max is the agent ● Min is the opponent ● Look ahead to get future utilities ● Back them up to the root decision node ○ Assume worst case for Max (Min plays optimally)

Learning the Naïve Bayes Parameters

● Maximum Likelihood estimates ○ Use counts from a training set of data ○ P̂ stands for the probability estimate

Information Gain

● Measures the relative reduction in entropy for all attributes ● Start with a measure of the entropy of the dataset with respect to the desired classes C = {c1, c2, . . . cn} ○ One class: H(C) = 0 ○ Two equal classes: H(C) = 1 ○ Three classes with p(c1) = ⅙, p(c2) = ⅓, p(c3) = ½: H(C) = 1.46 ○ Three equal classes: H(C) = 1.58 ○ Ten equal classes: H(C) = 3.32 ● Information gain of an attribute A: Gain(A) = H(C) minus the expected entropy remaining after splitting on A
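A minimal sketch of the entropy computation behind these numbers.

import math

def entropy(probs):
    """H(C) = -sum p * log2(p); terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # 1.0  (two equal classes)
print(entropy([1/6, 1/3, 1/2]))      # ~1.46
print(entropy([0.1] * 10))           # ~3.32 (ten equal classes)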

Inference Task: Learning

● Model parameters are unknown ○ Transition model ○ Sensor model ● Model parameters can be learned from the data ○ Maximum likelihood parameter estimation ○ Expectation maximization (EM)

Size of the Trellis

● N nodes per column, where N is the number of states ● S columns, where S is the length of the sequence ● E edges, one for each transition between adjacent columns ● Total trellis size is approximately S(N+E) ○ For N=10, S=10: ■ E = (N × S) {edges from S_n to S_n+1} = 10^2 = 100 ■ S(N+E) = 10(10+100) = 1,100 << 10^10

Proof Methods: Roughly Two Kinds

● Natural Deduction: Application of inference rules ○ Legitimate (sound) generation of new sentences from old ○ Proof = a sequence of inference rule applications ■ Use inference rules as operators in a standard search algorithm ○ Typically requires transformation of sentences into a normal form ● Model checking ○ Truth table enumeration: for n propositional symbols, O(2^n) ○ Backtracking search, e.g., Davis-Putnam-Logemann-Loveland (DPLL) ○ Heuristic search in model space (sound but incomplete) ■ E.g., min-conflicts-like hill-climbing algorithms

Games as Search

● New material specific to games ○ There is an opponent to keep track of ○ The search tree includes the adversary's possible moves with alternating plies (levels) for the player and opponent ○ A utility function (or payoff function) of the payoff in points to each player

Ockham's Razor, and Underfitting versus Overfitting

● Ockham's razor: choose the simplest model that "works" ● Underfitting: the model fails to find a pattern that exists in the data ○ Not common; training usually continues until a good fit is achieved ○ Solution: choose a more complex model, and find more data to enable learning the more complex model ● Overfitting: the model finds a pattern that exists in the sample, but the pattern does not generalize across samples ○ Fairly common ○ Solution: simplify the model, and sample the data more rigorously

What Makes Naïve Bayes Naïve?

● Often used when the effects are not conditionally independent ● Can often work well in cases that violate the conditional independence assumption ○ Text classification ○ Why use a more complicated model that seems more correct if a simpler (more naïve) model performs as well or better?

Limitations of Game Search

● Optimal search for complex 2-person zero-sum games is intractable due to large branching factor b, and average depth of game d ● Tradeoff between different kinds of algorithms ○ MCTS is best if b is high and/or evaluation function is hard to construct ○ Alpha Beta can be more precise, but heuristic alpha beta is very sensitive to the evaluation function, whose average error could cause bad choices ● Biggest limitation is the focus on individual moves in a game; people reason about games at a higher level of abstraction that breaks the overall goal of winning down into component sub-goals, such as trapping the opponent's queen in chess

Assumptions for Multinomial Naive Bayes

● PQ is the set of all distributions over Q ● For each such distribution, π is a vector with components π q for each q ∈ Q corresponding to the probability of q occurring

Naive Bayes as a Machine Learning Algorithm

● Performs classification: given an example, produces a hypothesis as to which class the example belongs in ○ Relies on Maximum a Posteriori decision rule (Mtg 21, slide 16) ● NB is a form of supervised statistical machine learning: a training dataset that has already been classified provides the supervision ○ The parameters of the model are the prior probabilities of each class value, and the conditional probabilities of each conditionally independent variable ○ The parameter values are determined empirically, using maximum likelihood estimation ○ A model with learned parameters is tested on previously unseen data

Prior (Unconditional) versus Conditional Probabilities

● Prior probability: probability of an event, apart from conditioning evidence ● Conditional (or posterior) probability of an event: probability conditioned on the occurrence of an earlier event

Pros and Cons of La Place Smoothing

● Pro ○ Very simple technique ○ Addresses the key idea that smoothing compensates for not having enough data: Cromwell's rule ● Cons: ○ Probability of frequent words is underestimated ○ Probability of rare (or unseen) words is overestimated ○ Therefore, too much probability mass is shifted towards unseen words ○ All unseen words are smoothed in the same way ● Many more sophisticated methods for smoothing exist; for this class use La Place
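A minimal sketch of an add-alpha (La Place) estimate of a word's conditional probability; the names are illustrative assumptions.

def laplace(count_word_in_class, total_words_in_class, vocab_size, alpha=1):
    """Add-alpha smoothed estimate of P(word | class)."""
    return (count_word_in_class + alpha) / (total_words_in_class + alpha * vocab_size)

print(laplace(0, 1000, 5000))   # an unseen word still gets a small nonzero probability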

Propositions and Random Variables

● Probabilistic propositions are factored representations consisting of variables and values (combines elements of PL and CSP) ● Variables in probability theory are called random variables ○ Uppercase names for the variables, e.g., P(A=true) ○ Lowercase names for the values, e.g., P(a) is an abbreviation for A=true ● A random variable is a function from a domain of possible worlds Ω to a range of values

Infinite Utilities?!

● Problem: What if the game lasts forever? Do we get infinite rewards? ● Solutions: ○ Finite horizon: (similar to depth-limited search) ■ Terminate episodes after a fixed T steps (e.g. life) ■ Gives nonstationary policies (π depends on time left) ○ Discounting: use 0 < γ < 1 ■ Smaller γ means smaller "horizon" = shorter term focus ○ Absorbing state: guarantees that for every policy, a terminal state will eventually be reached (like "overheated" for racing)

Convert FOL to PL Then do Inference

● Propositionalize the FOL ○ Eliminate quantifiers ○ Skolemize ● Semidecidability ○ Theorem: any sentence entailed by FOL KB is entailed by a finite subset of the propositionalized KB ○ Problem: a sentence not entailed by FOL cannot be recognized as unprovable in the propositionalized KB

Stochastic Gradient Descent (SGD): Univariate Case

● Randomly select m data points at a time, m << N ● Comparison to batch: assume N = 10,000 and m = 100 ○ Each SGD step is 100 times faster than batch gradient descent ○ Increase in standard error is proportional to the square root of the number of examples, or a factor of 10

Language Modeling: Current Methods

● Recurrent neural networks (RNNs) ○ Avoids exponential increase in computation time with statistical LMs ○ Weight parameters are shared across the network ○ Therefore, there is a linear increase in computation time ● Still requires smoothing

Description of Minimax Algorithm

● Recurse down the game tree ○ Search proceeds to some depth d (number of moves to look ahead) ○ Expand to leaf nodes at depth d ● Pass the minimax values back up through the tree ○ Compute the minimax() utility function at the depth-d leaves ○ Pass value back up tree to the parent nodes ● Backed-up values ○ At a MAX node: the maximum of MAX's descendants (the best for MAX) ○ At a MIN node: the minimum of MIN's descendants (the best for MIN)
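A minimal recursive sketch of this procedure; successors, evaluate, and is_terminal are hypothetical callables supplied by the game.

def minimax(state, depth, is_max, successors, evaluate, is_terminal):
    """Depth-limited minimax: back up values from depth-d leaves to the root."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)                    # static evaluation at the leaves
    values = [minimax(s, depth - 1, not is_max, successors, evaluate, is_terminal)
              for s in successors(state)]
    return max(values) if is_max else min(values)   # best for MAX / best for MIN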

Regularization

● Regularization aims to minimize the complexity of the model ● Choice of regularization function depends on the hypothesis space ○ For example, a regularization with a linear regression (weights on each attribute) often uses one of the two following regularizers ■ Shrink all small weights to zero (eliminates attributes; Lasso or L1) ■ Prevent very large weights (smooths the weights, uses all features; Ridge or L2)

Universal and Existential Instantiation

● Replace quantified sentences with instantiated sentences ○ Variables are replaced with constants ○ Not with the "objects" of a model ● Inferentially equivalent to the original quantified sentences ● New KB' is satisfiable whenever KB is satisfiable ● Can use propositional inference on KB'

Evaluation of Game States

● Represent the game problem space by a tree: ○ Nodes represent board positions (states) ○ Edges represent legal moves (actions) ○ Root node is the first position in which a decision must be made ● Evaluation function f assigns real-number scores to board positions without reference to the search path ● A terminal node represents a possible game end, labeled with its utility (e.g. win/lose/draw, etc.)

Design issues

● Representing the 'board' and its successor boards ● Evaluating positions ● Looking ahead (search)

Proof by Resolution

● Resolution: an inference rule that if coupled with a complete search algorithm yields a complete inference algorithm ○ Inference rules covered above are all sound ○ Adding resolution yields completeness, if using a complete search algorithm

Differences from Previous Search Methods

● Search goal is to make one move; playing the game has many moves ● No cost on arcs - costs derive from backed-up static evaluation ● MAX can't be sure how MIN will respond to his moves

Regression versus Classification

● Seismic data (1982-1990, Asia & Middle East): body wave magnitude (x1 ) and surface wave magnitude (x2 ) for earthquakes (orange circles) and nuclear explosions (green circles) ● Decision boundary for a linearly separable subset of the data (left) ● All of the data: not linearly separable

Minimum Remaining Values (MRV) Heuristic

● Select a variable to assign with the fewest legal values, meaning the most constrained variable ● Identifies a potential conflict early, if no value can be assigned

Degree Heuristic

● Select the variable to assign that participates in the largest number of constraints with other variables (highest degree in the constraint graph) ○ Reduces branching factor on remaining variables ○ Very effective for picking which state to color first: SA

Monte Carlo Tree Search Balances Exploration and Exploitation

● Selection: ○ Apply a metric (selection policy) to rank next moves ○ Take each next move in a (known) playout (path in the MC tree) to a leaf ● Expansion: Add one or more new children below the leaf ● Simulation: ○ Perform a playout simulation from the new node(s) to a game end ○ Note: The simulation is not part of the tree ● Back-Propagation: ○ Incorporate the game result of the simulation into the tree by updating all the nodes back to the root

Truth in FOL

● Sentences are true with respect to a model and an interpretation (grounding) ● A model contains objects (domain entities) and relations among them ● Interpretation specifies referents for ○ constant symbols → objects ○ predicate symbols → relations ○ function symbols → functional relations ● An atomic sentence (predicate(term1 ,...,termn )) is true iff the objects referred to by term1 ,...,termn are in the relation referred to by predicate()

Propositional Logic

● Simple sentence symbols are atomic, non-decomposable: A, B ● Logical operators combine simple sentences into complex sentences ○ ¬A (not A) ○ A ∧ B (A and B) ○ A ∨ B (A or B) ○ A ⇒ B (if A then B) ○ A ⇔ B (A if and only if B) ● Sentences are true or false

Smoothing Can Cause Underflow

● Smoothing, especially with low α, leads to values close to 0 ● Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow ● Mitigation: calculate in log space ○ Given that log(xy) = log(x) + log(y): perform all computations by summing logs of probabilities rather than multiplying probabilities
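A minimal sketch of the log-space mitigation; a hypothetical example, not from the slides.

import math

def log_score(probs):
    """Sum the logs of probabilities instead of multiplying the probabilities."""
    return sum(math.log(p) for p in probs)

# Compare classes by log score; the largest log score corresponds to the largest product.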

Value Iteration ctd

● Solving the Bellman equations: ○ For n states, there are n Bellman equations with unknown utilities ○ Systems of linear equations can be solved easily using linear algebra, but the max operator is not a linear operator ● Value iteration is one solution ○ Initialize with arbitrary values for the utilities of every state ○ Calculate the right hand side of the equation, then use the new value as an update for the left hand side, applied simultaneously to all states
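A minimal sketch of the simultaneous update; states, actions, T, and R are hypothetical placeholders for the MDP components.

def value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """T(s, a) -> list of (prob, s2); R(s, a, s2) -> reward; actions(s) -> list."""
    U = {s: 0.0 for s in states}                  # arbitrary initial utilities
    for _ in range(iters):
        U = {s: max((sum(p * (R(s, a, s2) + gamma * U[s2]) for p, s2 in T(s, a))
                     for a in actions(s)), default=0.0)
             for s in states}                     # update applied to all states at once
    return U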

Forward Chaining

● Sound and complete for Datalog ● Datalog = first-order definite clauses + no functions ● Forward chaining terminates for Datalog in finite number of iterations ● May not terminate in general if α is not entailed ○ Recall: Entailment with definite clauses is semidecidable

Inference Task: Computing the Belief State (Filtering)

● State estimation: the posterior distribution over the most recent state, given all the evidence to date ● For example, what is the probability it will rain today, given all the evidence to date ● Referred to as filtering from early work on signal processing to filter out noise by estimating the underlying signal ● When extended to the posterior over a sequence of states, it is referred to as computing likelihood

Deriving the Equation for the Global Semantics

● Step 1: by definition ● Step 2: where y is all the other variables in the full joint probability ● Step 3: proof that the network's local parameters θ are the conditional probabilities of the full joint probability distribution

Decision Trees

● Supervised learning: ○ Usually of a classifier ○ Can learn a regression ○ CART (classification and regression trees) can learn either one ● Basic idea, using a binary decision goal (e.g., ± C) and n binary attributes A: ○ From your training data, find the one attribute (e.g., A1 ) that best splits the data into two largely equal disjoint sets corresponding to + C and −C ○ If each set is "pure" (has only + C or only −C), the algorithm terminates ○ Else, for each of the 2 sets, find the attribute from A\{A1 } that best divides each set into two maximally pure classes ○ Iterate

FOL Expressions

● Terms ○ Constant ○ Variable ○ Function(Term, . . .) ● Atomic sentences ○ Predicate ○ Predicate(Term, . . .) ○ Term = Term ● Complex sentences ○ Atomic sentences combined with logical connectives ○ Sentences with quantification

Cost Combines Model Selection and Optimization

● The cost of a hypothesis can be considered as the sum of the loss and the regularization: ○ Lower loss is equivalent to reducing all the error types ○ Lower regularization term is equivalent to a simpler model ● Cost formalizes Ockham's razor, in an empirical way ○ Ockham: "A plurality of entities should not be posited without necessity" ○ If necessity is interpreted as empirically predictive, then cost directly formalizes Ockham's razor ○ If necessity is interpreted as providing a more useful explanation with respect to a theory of the world, then cost is only part of the story

Training a Logistic Regression

● The derivative g' of a function g satisfies: ● Thus the update rule has a somewhat different form than for multivariate regression with a hard threshold
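The missing equations are presumably the standard ones: for the logistic function g(z) = 1 / (1 + e^(−z)), the derivative satisfies g'(z) = g(z)(1 − g(z)), so (for squared loss) the weight update becomes

    wi ← wi + α (y − hw(x)) · hw(x)(1 − hw(x)) · xi

where the extra hw(x)(1 − hw(x)) factor is what makes it differ from the update used with a hard threshold.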

Semantics ctd

● The full joint distribution is defined as the product of the local conditional distributions

Resolution Requires Conjunctive Normal Form (CNF)

● The input to resolution consists of two clauses (implicit conjunction) that are each disjunctions of literals ● To apply resolution, convert all propositions to conjunctive normal form (CNF) ○ Apply bi-conditional elimination if applicable ○ Apply implication elimination if applicable ○ Move all ¬ inward to literals (double-negation elimination; de Morgan) ○ Apply distributivity of ∨ over ∧
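A small worked example of the conversion (an arbitrary sentence, not one from the slides): starting from B ⇔ (A ∨ C),

    1. Biconditional elimination:  (B ⇒ (A ∨ C)) ∧ ((A ∨ C) ⇒ B)
    2. Implication elimination:    (¬B ∨ A ∨ C) ∧ (¬(A ∨ C) ∨ B)
    3. Move ¬ inward (de Morgan):  (¬B ∨ A ∨ C) ∧ ((¬A ∧ ¬C) ∨ B)
    4. Distribute ∨ over ∧:        (¬B ∨ A ∨ C) ∧ (¬A ∨ B) ∧ (¬C ∨ B)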

Log Likelihood of the Parameters θ

● The log-likelihood function L(θ) is the sum over all documents of the logs of their probabilities (given their attribute values) ● The parameters are those that maximize the log-likelihood function, subject to the constraints that the class probabilities sum to one and the conditional probabilities of the words within each class sum to one

Choosing the Optimal Action in State s

● The optimal policy should choose the action leading to a successor state with the maximum utility ● The utility of a given state U(s) can then be defined in terms of the expected utilities of all state sequences from s: Bellman equation
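In the usual notation (the equation itself is not on the card), the Bellman equation and the corresponding optimal action are

    U(s) = R(s) + γ · max_a Σ_s' P(s' | s, a) U(s')
    π*(s) = argmax_a Σ_s' P(s' | s, a) U(s')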

Sample Spaces

● The set of all possible worlds (e.g., for a given logical agent) is called the sample space (specifiable) ○ The sample space consists of an exhaustive set of mutually exclusive possibilities ○ Each member ωi of a sample space Ω is called an elementary event

Analytic Solution: Univariate Case

● The univariate linear model can be easily solved using the above equations, based on finding the values of the weights where the partial derivatives equal zero ● An alternative method can be applied that relies on hill-climbing
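The equations referred to above are presumably the standard least-squares solution for hw(x) = w1·x + w0: setting the partial derivatives of the squared loss to zero gives

    w1 = (N Σ xj·yj − Σ xj · Σ yj) / (N Σ xj² − (Σ xj)²)
    w0 = (Σ yj − w1 Σ xj) / N

where N is the number of training examples.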

Semi-decidability of Propositionalized KB

● Theorem (Herbrand, 1930): If a sentence α is entailed by an FOL KB, it is entailed by a finite subset of the propositionalized KB ● For n = 0 to ∞ do ○ create a propositional KB by instantiating with depth-n terms ○ see if α is entailed by this KB ● Problem: works if α is entailed, loops if α is not entailed at a finite n ● Theorem (Turing, 1936; Church, 1936) Entailment for FOL is semi-decidable ○ Algorithms exist that prove every entailed sentence ○ No algorithm exists that also disproves every non-entailed sentence

Stationary Preferences

● Theorem: if we assume stationary preferences: ● Then: there are only two ways to define utilities ○ Additive utility: ○ Discounted utility

Convergence Properties of EM

● Theorem: the log-likelihood function of the parameters is non-decreasing ● In the limit as T goes to ∞, EM converges to a local optimum

Probability Summarizes the Unknown

● Theoretical ignorance: ○ Often, we have no complete theory of the domain, e.g. medicine ● Poor cost-benefit tradeoff even when we have fairly complete theories: ○ It is difficult to formulate knowledge and inference rules about a domain that handle all (important) cases ● Unavoidable uncertainty (partial observability): ○ When we know all the implication relations (rules), we might be uncertain about the premises

Likelihood of the Observations

● To compute the likelihood for o1, o2, ..., oT as P(O|λ) we would want all the paths through the trellis and the associated probabilities ● Given that we do not have λ, we use an approximation λ' to compute an expectation for each sequence of observations ● Given the Markov assumptions, we need only consider the probability of every state transition i, j for every observation at every t ● Given all our sequences o1, o2, ..., oT, the probability of every state transition with an emission at t+1 is derivable from the forward probability from 1 to t and the backward probability from T to t+1

Execution of the Conditional Plan

● To execute an if-then-else expression in the conditional plan ○ Agent receives percept, then executes the appropriate branch of the condition ○ Agent updates its beliefs after each action ● Similar to, but simpler than, the search ○ Percepts are actual observations in the environment, rather than possible observations maintained for all ways the belief state space could evolve

Logic versus Natural Language

● To express information is one of the functions of natural language ○ A logical formalism is a language to express truth-conditional meaning ■ More rigorous (rule-governed, consistent) than natural language ■ In natural language, most words have many meanings (semantic ambiguity) and most sentences have multiple syntactic analyses (syntactic ambiguity) ○ Can characterize reasoning (inference) of various forms ● Language has many other functions besides conveying information (which can be true or false) ○ Word choices can be a reflection of what group one identifies with ○ Saying one has an emotion is information, but the emotion itself is not information (cannot evaluate the truth conditions of "sadness")

Optimization: Choosing Model Hyperparameters

● Training set error tends to decrease as complexity of model increases ● Because in many cases, error can be reduced nearly to zero, the validation set serves as a check on overfitting

Hard Threshold Function for Classification

● Turns a linear regression into a classifier that uses a hard threshold function ● At zero, the decisions switch to the other class ○ Values of the function above 0 are in one class ○ Values of the function below 0 are in the other class

Two Layer Feed Forward Network

● Two inputs ● One hidden layer with two neurons ● Two output neurons ○ Output is a 2-D vector: (a5, a6) ● Fully connected feed forward network ● Output of a network with m output nodes is a length-m vector ● Given a loss function that is additive ○ Learning decomposes into m learning problems ● Loss must be back-propagated ○ Each node j in the current hidden layer contributes to the error Errm of each output node m ○ Node error depends on the weights ● Activation function must be differentiable
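A forward-pass sketch in Python for the architecture described above (2 inputs, one hidden layer of 2 sigmoid units, 2 outputs); the weight values are arbitrary placeholders:

    import math

    def sigmoid(z):                      # a differentiable activation, as required
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x, W_hidden, W_out):
        # each weight row is (bias, weight_from_unit_1, weight_from_unit_2)
        hidden = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in W_hidden]
        return [sigmoid(w[0] + w[1] * hidden[0] + w[2] * hidden[1]) for w in W_out]

    a5, a6 = forward([0.5, -1.0],
                     W_hidden=[[0.1, 0.4, -0.2], [0.0, 0.3, 0.8]],
                     W_out=[[0.2, -0.5, 0.7], [-0.1, 0.6, 0.1]])   # length-2 output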

Machine Learning in General

● Types of ML ○ Replicate a pattern given by a supervision signal ○ Discover new patterns (unsupervised learning) ○ Learn through trial and error by interacting with environment and receiving a reinforcement signal (reward; learn a Markov Decision Process model) ● Supervised machine learning types: ○ Classification, e.g., Naive Bayes ○ Sequence prediction, e.g., HMM ○ Regression

Upper Confidence Bounds for Trees Selection Policy

● U(n) is the total utility of all playouts through node n ● N(n) is the number of playouts through node n ● N(PARENT(n)) is the number of rollouts at the parent node of n ● U(n)/N(n) is the average utility of n, i.e., the exploitation term ● The square root term is the exploration term; because N(n) is in the denominator and the log of N(PARENT(n)) is in the numerator, the exploration term starts high and goes to zero as the counts increase ● C is a constant that balances exploitation and exploration, with values around √2
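Written out, the selection value described above is the standard UCB1 formula applied to the tree:

    UCB1(n) = U(n) / N(n) + C · √( log N(PARENT(n)) / N(n) )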

Utility of a State Sequence

● U(s): the expected cumulative reward over time from state s ○ Additive rewards for a utility function on history ○ Discounted rewards for a utility function on history, for discount γ ∈ [0,1] (both written out below) ● If γ = 1, the discounted sum of rewards is the same as the additive sum of rewards ● The best policy is the policy that produces the state sequence with the highest rewards over time
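The two utility functions on history in their standard forms (the formulas themselves are missing from the card):

    Additive:    U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
    Discounted:  U([s0, s1, s2, ...]) = R(s0) + γ·R(s1) + γ²·R(s2) + ...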

Utilities of States Over Time

● U(s): the expected cumulative reward over time from state s ● Finite versus infinite horizons ○ Finite: the agent is restricted to a limited number of actions ○ Infinite: the agent can take any number of actions until it reaches a goal ● Additive versus discounted rewards ○ Additive: sum the rewards over time ○ Discounted: apply a discount to prefer immediate rewards over future rewards

Operator Precedence

● Unary operators (¬) precede binary operators (∧∨⇒⇔) ● Conjunction and disjunction (∧∨) precede conditionals (⇒⇔) ● Conjunction (∧) precedes disjunction (∨) ● Implication (⇒) precedes biconditional (⇔)

Using Alpha-Beta Pruning

● Use iterative deepening search, sorting moves by their value from the previous iteration ● Expand captures first, then threats, then forward moves

Feature Selection

● Use terms above some frequency threshold ○ No particular foundation ○ Can work well in practice ● Feature selection using Mutual Information (MI) ○ Clear information-theoretic interpretation ○ The amount of information (bits) obtained about one random variable based on another random variable ○ May select rare uninformative terms ● Other methods: the Bayes factor is good at selecting informative rare terms

Comparison: Two Dynamic Programming Approaches

● Value iteration and policy iteration compute the same thing (all optimal values) ● In value iteration: ○ Every iteration updates both the values and (implicitly) the policy ○ We don't track the policy, but taking the max over actions implicitly recomputes it ● In policy iteration: ○ Several passes that update utilities with a fixed policy (each pass is fast because one action is considered) ○ After the policy is evaluated, a new policy is chosen (slow like a value iteration pass) ○ The new policy will be better (or we're done)

Problems with Value Iteration

● Value iteration repeats the Bellman updates: ● Problem 1: It's slow - O(S²A) per iteration ● Problem 2: The "max" at each state rarely changes ● Problem 3: The policy often converges long before the values

Search Tree

● We're doing way too much work! ● Problem: States are repeated ○ Idea: Only compute needed quantities once ● Problem: Tree goes on forever ○ Idea: Do a depth-limited computation, but with increasing depths until change is small ○ Note: deep parts of the tree eventually don't matter if γ < 1

Gradient Descent Hill Climbing

● Where w is a vector for the weights (including bias), and α is the step size or learning rate, apply the following update rule ● For univariate regression, the loss is quadratic so the partial derivative will be linear
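The update rule referred to above, in its usual form (applied to each weight wi, including the bias weight):

    wi ← wi − α · ∂Loss(w) / ∂wi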

General Form of EM

● Where x is all the observed values in all the examples, Z is all the hidden variables for all the examples, and θ is all the parameters ○ E-step is the summation over P(Z=z | x, θ (k)), which is the posterior of the hidden variables given the data ○ M-step is the maximization of the expected log likelihood L(x,Z = z|θ ) ● For mixtures of Gaussians, the hidden variables are the Zijs, where Zij=1 if example j was generated by component i ● For Bayes nets, Zij is the value of unobserved variable Xi in example j ● For HMMs, Zjt is the state of the sequence in example j at time t ● Many improvements exist, and many other applications of EM are possible
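The general form referred to above, written out in the usual notation (reconstructed here since the equation is not on the card):

    θ^(k+1) = argmax_θ Σ_z P(Z = z | x, θ^(k)) · L(x, Z = z | θ)

The summation over z is the E-step expectation, and the argmax over θ is the M-step.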

POS Tags: Hidden States, Inherently Sequential

● Words are observed; part-of-speech tags are hidden ○ Cf. observed umbrellas vs. hidden weather states in the umbrella world ○ Cf. observed ice creams vs. hidden weather states in the ice cream world ● Likely POS sequences ○ JJ NN (delicious food, large pot) ○ NNS VBD (people voted, planes landed) ● Unlikely POS sequences ○ NN JJ (food delicious, pot large) ○ NNS VBZ (people votes, planes lands)

Formalizing the KB

● Wumpus world illustrates a logical agent: the agent takes an action, which results in a new percept, leading to new facts to add to the KB ○ Facts can follow directly from percepts ○ Facts can follow from other facts ● How could such an agent be implemented? ○ A logic language provides a way to represent and reason about facts ○ Predicate logic is introduced next as a first step

Generalization of Conditional Independence

● X and Y are conditionally independent given Z ● Any number of random variables can be conditionally independent given a single random variable distinct from the rest: the Cause

