Cogsci 200 Final
the expected utility maximization framework
"rational choice theory" -at heart of economics, important in psych, soc, polisci, phil, comp sci, and cog sci
implications of computational theory of mind
-Philosophy: implication that we have solved the mind-body problem -Biology and neuroscience: implication that the fundamental function of neurons is computation
historic achievements in AI using "search"
-1956: theorem proving -1997: chess (Deep Blue) -2015: Texas hold'em poker -2016: Go (AlphaGo)
the inverse optics problem
-2d pattern on the retina --> 3d representation of scene -P(scene | retinal pattern) = [P(retinal pattern | scene) P(scene)] / P(retinal pattern)
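A minimal sketch of this Bayesian inversion for a discrete set of candidate scenes (the scene names and probabilities are made-up illustrative numbers, not from the course):

```python
# Hypothetical numbers: infer which scene most likely produced the observed retinal pattern.
prior = {"3d cube": 0.6, "flat drawing of a cube": 0.4}        # P(scene)
likelihood = {"3d cube": 0.9, "flat drawing of a cube": 0.7}   # P(retinal pattern | scene)

evidence = sum(likelihood[s] * prior[s] for s in prior)        # P(retinal pattern)
posterior = {s: likelihood[s] * prior[s] / evidence for s in prior}  # P(scene | retinal pattern)
print(posterior)  # the percept corresponds to the scene with the highest posterior
```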
the first serious design for a computing machine
-Charles Babbage: "Difference Engine" -designed in 1820s, constructed after his death
approx # of atoms in observable universe
10^80
Turing's result #1: formalizing the minimal essence of computation
-a function is "computable" if and only if there is a corresponding machine in this class that computes the function -also called Church-Turing thesis
digital circuit design theory
-a key link from mathematical theory to physical machines -developed by Claude Shannon in 1937
multiple realizability claim
-a single algorithm can be realized (aka implemented) in multiple distinct physical substrates -if all mentation is computation --> mental states and processes are also multiply realizable --> a single mental state or process can be implemented in multiple different physical substrates
why do we believe Turing's result #1?
-bc every formalism proposed as the basis of computation since 1936 has been shown to be mathematically equivalent to Turing machines in computational power -bc large classes of mathematical/logical functions have been proven computable by Turing machines -bc increasingly complex computational functions continue to be implemented on physical computing machines that are essentially mathematically equivalent to Turing machines
why believe Turing's claim #2?
-bc he formally constructed one in 1936 --> the existence of such a machine is a mathematical truth
why does result #1 matter for cog sci?
-bc we have no alternative to computation as the explanation for how neurons (or any machine) can implement the complex functions of vision, perception, cognition, decision making, learning, motor control, etc
what the computational TOM claim actually is
-brains and (some) computers embody intelligence for some of the same reasons -they both embody abstract, general principles of computation -they both compute -ALL MENTATION IS COMPUTATION
Alan Turing
-British mathematician -deepest theoretical results about computation in the 1930s -key part in the secret effort that led to the breaking of Germany's "Enigma machine" coding scheme in WWII
how computational theory of mind claim is misunderstood
-claim is NOT that modern computer is metaphor for mind and brain -not the same as the "despised" computer metaphor
all reward is internally generated -- Singh, Lewis, & Barto
-classic reinforcement learning framework (environment, critic, agent) vs revised framework (external environment, internal environment, critic, RL agent) that is better for understanding biological organisms and can learn to control internal processes as well as external actions
Gary Kasparov vs Deep Blue
-ex of the effectiveness of search-based algorithms -deeper search = higher chess rating
ENIAC
-first general-purpose electronic digital computer (1948) -Arthur Burks was one of its original designers
two aspects of reinforcement learning
-functional: what is the problem being solved by reinforcement (reward-based) learning? -algorithmic: what algorithms solve this problem?
humor is rewarding
-humor activates subcortical reward system -modulates the mesolimbic reward centers
when we try to increase the power of the Turing machine...
-if we allow it to jump to any cell on the tape instead of moving one cell at a time --> NO (no added power) -if we increase the set of symbols, e.g. from {0, 1} to 100,000+ symbols --> NO -if we give it additional read/write heads so that it can operate in parallel --> NO -if we allow it to operate nondeterministically or probabilistically, i.e. the next state it enters is not deterministically specified --> NO
limitations of fixed stimulus-response mappings
-internal search is only useful if you have already learned something about the world and how it works -won't help a rat in a maze
the palindrome algorithm
-it is realized (aka implemented) in the physical states of the Mac, PC, and Golden Mac, but the algorithm is not identical to any of its realizers
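For concreteness, here is one possible realization of the palindrome algorithm as a short Python sketch; the same abstract procedure could equally be written in another language or built directly into hardware:

```python
def is_palindrome(s: str) -> bool:
    """Return True if the string of symbols reads the same forwards and backwards."""
    return s == s[::-1]

print(is_palindrome("racecar"))  # True
print(is_palindrome("maize"))    # False
```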
properties of a computational procedure
-it maps one set of symbols into another set of symbols--aka it calculates a function -it is finitely specifiable -its execution doesn't itself require "intelligence"
what does it mean to learn how to behave?
-it means to learn state-->action mappings AKA a function -such mappings/functions are called policies -learning how to behave can be thought of as search in a space of possible policies -must mix external and internal search (exploration)
function
-maps each member of one set of symbols to a (single) member of another set of symbols -to each argument of the function, a unique value is assigned -examples: the addition function, the palindrome function
take 1: utility maximization
-maximize objective value -problem: our rule can't handle probabilities
neurons that anticipate reward before actions
-neurons in the striatum (part of the basal ganglia) fire before movements, but only rewarded movements -they participate in representing a value function that maps from (state, action) pairs to future reward
involvement w/ subcortical structures
-not cortex -action selection, cognitive control, and reward-based learning depend on circuits involving subcortical structures and their connections to the frontal lobes
neurons that respond to reward
-positive hedonic "liking" -negative aversive "disliking" -opioid hedonic hotspots -cerebral cortex, nucleus accumbens, amygdala, hippocampus
Babbage machine
-precursor to Turing's machine, designed in the 1820s-1840s -2 types: the "Difference Engine" and the "Analytical Engine"
relationship between prediction error and change in firing rate -Bayer and Glimcher
-a quantitative relationship between reward prediction error and change in dopamine neuron firing rate
temporal credit assignment problem
-refers to the fact that rewards, especially in fine-grained state-action spaces, can arrive long after the states and actions that led to them -such reward signals will only very weakly affect all the temporally distant states that preceded them -almost as if the influence of a reward gets more and more diluted over time, which can lead to bad convergence properties of the RL mechanism -many steps must be performed by any iterative reinforcement-learning algorithm to propagate the influence of delayed reinforcement to all the states and actions that have an effect on that reinforcement
why 3) add knowledge?
-requires a way to program or learn knowledge about the domain -so the sys can evaluate a node (ex: game board) w/o searching to the end (addresses the depth problem) -so the sys can be more selective in the actions it explores in search (addresses the breadth problem)
what is the difference btwn reward and value?
-reward: quantity associated with states that defines how "intrinsically desirable" a state is. rewards define the goals of a reinforcement learning agent -values: expected (discounted) sums of future rewards, and are also quantities associated with states, or states and actions ((state, action) values are called Q values) -rewards are sometimes called primary signals -values are called secondary signals
The "Turing Machine"
-self-described as a model of a "computer" -minimalist formalization of the intuitive notion of computation -many mathematically equivalent notations for specifying Turing Machines
reward and reinforcement learning: the big idea//implicitly defining an optimization problem
-separate the goodness of states of the organism from the behaviors required to attain those states -accomplish that by providing the organism w/ a reward sys that maps organism states to some quantitative signal -and a learning sys that uses that signal along w/ experience to adjust behaviors so as to attain more of those good states
neurons that encode reward prediction error
-some midbrain dopamine neurons seem to encode this reward prediction error --> may participate in ERROR-DRIVEN LEARNING -habitual (familiar) reward: little change in firing at the time of reward -delayed (surprising) reward: suppressed firing at the expected time, followed by increased firing after the reward arrives -early (surprising) reward: increased firing after the reward
algorithmic level
-specifies procedures and mechanisms that enable the problem to be solved
biological/physical level
-specifies the neural/chemical substrates in which the algorithm/procedures are implemented
functional level
-specifies what problem the capacity is supposed to solve
what is a good choice of action?
-the agent is at time step t and needs to pick its next action a_t. what should it do? -best action = the action that maximizes expected cumulative reward
4) evolutionary basis
-the beginnings of a plausible evolutionary theory can be provided for the origins of specific reward/motivational functions through the formulation of an optimal reward problem that asks: what is the BEST REWARD FUNCTION TO PROVIDE THIS LIMITED AGENT IN ORDER TO MAXIMIZE ITS FITNESS in some distribution of environments?
Pinker's claim: how the mind works
-the mind is what the brain does -the brain processes information by computation -thinking is a type of computation
Turings result #2: the existence of universal computing machines
-there are single Turing Machines that can compute EVERY computable function, by taking as input a description (program) of the function to compute, and its input
Ada Lovelace
-translated an Italian memoir on Babbage's Analytical Engine -created a method for calculating a sequence of Bernoulli #s w/ the engine that would've run correctly if it had been built
thinking as computation via internal search
-we can create algorithms that process symbolic patterns representing possible states of the game, aka possible states of any aspect of the world -the algorithms encode the 'rules of the game' (aka how the world works) and can explore (search) the consequences of taking diff actions in the game (world) by 'looking ahead' (predicting what might happen)
"bucket brigade" algorithm- John Holland
-wrote the book "Adaptation in Natural and Artificial Systems"
Alan Turing's 2 breakthroughs
1) a minimal formalization of computation 2) universal computation
two ways to represent value function
1) a table 2) a neural net that outputs a value given a feature vector representing the state
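A rough sketch of the two options in Python (the states, features, and weights are hypothetical stand-ins; the "net" here is just a linear approximator standing in for a real neural net):

```python
import random

# 1) a table: one stored value per state (or per (state, action) pair)
value_table = {"start": 0.0, "hallway": 0.5, "goal": 1.0}

# 2) a function approximator: maps a feature vector describing the state to a value
weights = [random.uniform(-0.1, 0.1) for _ in range(3)]

def approx_value(features):
    return sum(w * x for w, x in zip(weights, features))

print(value_table["hallway"])         # table lookup
print(approx_value([1.0, 0.2, 0.7]))  # value computed from features
```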
Parts of a Turing Machine
1) an infinite tape divided into cells which are either blank or have single symbols on them 2) a finite alphabet of tape symbols 3) a read/write head that is positioned at a single cell on the tape and can read the symbol at that cell & erase or write a symbol 4) a state memory that stores the single current state of the Turing machine, one of a finite set of states 5) a finite transition table of instructions that determines the control of the machine. each entry in the table tells the machine what to do based on its current state and the symbol currently under the read/write head. the actions indicate the new symbol to write, the direction to move the head, and the next state to enter.
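A minimal sketch of those parts as a Python simulator (the example machine and its transition table are hypothetical; it just flips 0s and 1s, moving right until it reaches a blank):

```python
# transition table: (current state, symbol under head) -> (symbol to write, head move, next state)
transitions = {
    ("flip", "0"): ("1", +1, "flip"),
    ("flip", "1"): ("0", +1, "flip"),
    ("flip", "_"): ("_", 0, "halt"),   # "_" = blank cell
}

def run(tape_string, state="flip"):
    tape = {i: c for i, c in enumerate(tape_string)}  # tape cells; missing cells count as blank
    head = 0                                          # position of the read/write head
    while state != "halt":
        symbol = tape.get(head, "_")
        write, move, state = transitions[(state, symbol)]
        tape[head] = write
        head += move
    return "".join(tape[i] for i in sorted(tape))

print(run("0110"))  # "1001_" -- the input bits flipped, plus the trailing blank it halted on
```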
what is computation?
1) execution of algorithms that implement functions 2) physical processes transforming physical symbols 3) what Alan Turing said
summary: reinforcement learning
1) functional problem 2) algorithmic solution 3) neural implementation 4) evolutionary basis
solution to delayed reward (Sutton and Barto, Holland)
1) keep track of value functions Q(s,a) 2) treat the current value function as a prediction, and gradually adjust it when there is an error in the prediction (in the direction that will reduce the error)
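one standard way to write this update (a Q-learning-style rule; alpha is the learning rate, gamma the discount factor): Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [ r_t + gamma * max_a Q(s_t+1, a) - Q(s_t, a_t) ] -the bracketed term is the prediction error ("delta")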
how to find neural correlates of reward-based learning?
1) neurons that detect the presence of reward (REWARD FUNCTION) 2) neurons that anticipate reward (the VALUES OF STATES and the VALUES OF (STATE,ACTION) PAIRS) 3) neurons that encode a reward prediction error (the ERROR SIGNAL- "DELTA")
secrets of AI (Turing)
1) search 2) reinforcement learning--what is the problem being solved by reward-based learning?
three ways to make search perform better
1) search more deeply 2) search more broadly *both 1 and 2 require additional computational resources (more, faster processors) 3) add knowledge
finding the actual framework of rational choice theory
1) start w/ a simple choice rule 2) notice problems 3) propose a better rule 4) notice problems 5) etc. -when done, we're convinced that a rational chooser maximizes expected utility
intelligent thought
= knowledge + search -one of the deepest principles about thought to emerge from artificial intelligence
the first "computer programmer"
Ada Lovelace
value functions
mappings from states or (state, action) pairs to expected future cumulative discounted reward
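written out (standard definitions, with gamma as the discount factor): V(s) = E[ r_t+1 + gamma*r_t+2 + gamma^2*r_t+3 + ... | s_t = s ], and Q(s, a) = E[ sum over k of gamma^k * r_t+k+1 | s_t = s, a_t = a ]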
3) neural implementation
a frontal/midbrain/striatal circuit involving the DOPAMINERGIC SYSTEM underlies the implementation of this algorithm -evidence comes from neural recordings that reveal specialized representations of the diff quantities implicated in the algorithm, including the prediction error
Tesauro- "TD-Gamma"
an RL system that learned how to play Backgammon -reward function was: +100 is win -100 if lose 0 for all other states -trained by playing 1.5 million games against itself -became as good as best human players
3 lvls of explanation in cog sci: David Marr
any cognitive capacity can be described at three levels: 1) functional 2) algorithmic 3) biological/physical -all 3 lvls are indispensable -no competition between these levels of explanation
why does claim #2 matter to psychology and neuroscience?
bc we have no alternative to universal computation as the explanation for how a single computing machine (mind/brain) can implement the APPARENTLY UNBOUNDED VARIETY OF COMPLEX FUNCTIONS that are within human capacity
what is the best action to take in a given state?
best action is the one that maximizes EXPECTED CUMULATIVE FUTURE DISCOUNTED REWARD
what makes thought (& intelligence) possible?
computation
"the analytical engine"
designed to take paper cards that specified diff functions for it to compute -inspired by the paper cards of Jacquard loom weaving machines
discount factor in reinforcement learning
determines how much the organism cares about future reward relative to immediate reward
what shows us connections among brain regions
diffusion tensor imaging of white-matter tracts
2) algorithmic solution
an effective computational algorithm for solving this learning problem is based on: -separating reward functions (presumably an innate part of the organism) from value functions that estimate expected future rewards for taking particular actions in particular states, and -updating these estimates based on reward prediction errors
1) functional problem
even if agents are endowed w/ reward functions telling them what things (states) are good, they need a mechanism for learning behaviors that are effective in obtaining those good states --> learning mechanisms must solve the temporal credit assignment problem
extrinsic rewards
an external source of motivation
Ada Lovelace's notes
foreshadowed the fundamental theory of computation developed 100 yrs later by Turing
DeepMind
had an artificial intelligence breakthrough and Google bought it for $617 million
intrinsic motivation and cognitive rewards
higher mammals (humans) are motivated to play and explore -internal reward functions may reward learning and exploration itself --> it is rewarding to experience "error signals" as we make predictions in the world--not just about reward but about how the world works in general
we can use optimal Q value to select best action
if there are n actions available in the current state s and we have this value for each action, we can just pick the action with the highest Q value
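A tiny sketch, assuming the Q values live in a table keyed by (state, action) pairs (hypothetical states and actions):

```python
actions = ["left", "right", "stay"]
Q = {("s", "left"): 0.1, ("s", "right"): 0.9, ("s", "stay"): 0.4}

best_action = max(actions, key=lambda a: Q[("s", a)])  # pick the action with the highest Q value
print(best_action)  # "right"
```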
Q-learning
a kind of error-driven learning; the update combines: 1) the current estimate 2) the actual reward received at time t 3) what the estimate should be according to this sample of experience and our estimate of future reward from s_t+1 4) subtracting out the current estimate 5) a learning rate (from 0 to 1) that says how fast to learn
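A minimal sketch of one Q-learning update step in Python, labeling the five pieces above (the states, actions, and parameter values are hypothetical):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Nudge Q(s, a) toward the sampled estimate of future reward."""
    current = Q.get((s, a), 0.0)                                        # 1) current estimate
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)  # 2) + 3) reward plus discounted estimate of future reward from s_next
    delta = target - current                                            # 4) prediction error (target minus current estimate)
    Q[(s, a)] = current + alpha * delta                                 # 5) learning rate scales how fast we learn
    return Q

Q = {}
q_update(Q, s="s0", a="right", r=1.0, s_next="s1", actions=["left", "right"])
print(Q[("s0", "right")])  # 0.1 after one update
```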
way of behaving
mapping of states to actions
is every conceivable function computable?
no
value functions vs reward functions
not the same, although both map states to quantities
how should we (as engineers) or evolution design an organism that acts effectively in the world?
one possibility: "wire-in" innate stimulus-action mappings -when appendage pain sensors activate, withdraw appendage immediately -swim in direction of most shrimp -when shadow overhead, run like crazy into a hole
quantity value
a particular way of formalizing subjective value or utility in the context of sequential decision making--making multiple choices over time -we want algorithms that learn what the good actions are
algorithm
a procedure that generates the specified mapping relation -can be physically realized in multiple ways -ex: the palindrome algorithm
computation
refers to the execution of a computational procedure, called an "algorithm" -physical processes transforming physical symbols (patterns coding information) -doesn't depend on special properties specific to neurons or electronic digital circuits
what is reward?
reward function is a mapping from states to quantities
reinforcement learning (RL) is concerned with ?
sequential decision making--making good decisions over time--in uncertain (probabilistic) environments -emphasized the sequential part, "made up" reward functions
what quantity is the Q-learning algorithm (and all reinforcement learners) trying to maximize?
sum of (discounted) rewards
problem that search algorithms face
the combinatorial, exponential explosion of possible futures -search spaces can grow exponentially -ex: the tree of lunch
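a rough worked example (illustrative numbers): with ~35 legal moves per chess position, looking ahead just 10 half-moves means ~35^10 (about 2.8 x 10^15) positions to consider, and each additional level multiplies that by ~35 again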
what is the prediction error?
the difference between the current value estimate and a value estimate that takes into account the reward actually received
objective value
things that have publicly shared standards for how valuable they are -ex: money
Computational theory of mind
thinking/all mentation is a type of computation
reward-based learning system in the brain
tracing out the neural circuitry that might underlie these computations
computations for classical conditioning
updating a value function using reward prediction error - with actions taken out
How did TD-Gammon represent the value function?
used a neural net with a single hidden layer
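A minimal sketch of a network with that shape in Python/NumPy (the layer sizes and weights here are stand-ins for illustration, not TD-Gammon's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 198, 40                     # assumed sizes for illustration only
W1 = rng.normal(0, 0.1, (n_hidden, n_features))    # input -> hidden weights
W2 = rng.normal(0, 0.1, (n_hidden,))               # hidden -> output weights

def value(board_features):
    """Map a feature vector describing the board state to a scalar value estimate."""
    hidden = 1.0 / (1.0 + np.exp(-(W1 @ board_features)))  # sigmoid hidden layer
    return float(1.0 / (1.0 + np.exp(-(W2 @ hidden))))     # sigmoid output in (0, 1)

print(value(rng.random(n_features)))
```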
intrinsic rewards
we are infovores--the rewards of learning
what does it mean to learn how to behave? what is a "way of behaving?"
we formalize ways of behaving as MAPPINGS FROM STATES TO ACTIONS --> CALLED POLICIES
optimal Q-value function
we have solved the problem of learning how to behave -we replace the problem of learning the optimal policy w/ the problem of learning the optimal Q-value function -the expected future discounted reward of taking different actions in different states and then behaving optimally (following the optimal policy) thereafter
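in symbols, the optimal Q-value function satisfies Q*(s, a) = E[ r + gamma * max over a' of Q*(s', a') ], and the optimal policy is then pi*(s) = argmax over a of Q*(s, a)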
what is the VALUE of a state, or state and action?
we want algorithms to learn what the good actions are
computations for operant conditioning
with actions included-updating a value function using reward prediction error