16. Dynamic Programming
Approximate Dynamic Programming
- Approximate the value function using a function approximator v̂(s, w)
- Apply dynamic programming to v̂(·, w)
- Fitted value iteration repeats at each iteration k:
  -- sample a set of states S̃ ⊆ S
  -- for each sampled state s ∈ S̃, estimate the target value using the Bellman optimality equation:
     ṽk(s) = max_a ( R^a_s + Y ∑_s' P^a_ss' v̂(s', wk) )
  -- train the next value function v̂(·, wk+1) using targets {<s, ṽk(s)>}
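A minimal sketch of a single fitted value iteration step, assuming a tabular MDP represented by arrays P (transition probabilities) and R (expected rewards), and a generic regressor with a fit method (scikit-learn-style; all names here are illustrative, not from the lecture):

```python
import numpy as np

def fitted_value_iteration_step(P, R, gamma, v_hat, sampled_states, regressor, features):
    """One fitted value iteration step (sketch).

    P: (n_actions, n_states, n_states) transition probabilities P[a, s, s']
    R: (n_states, n_actions) expected immediate rewards R[s, a]
    v_hat: callable mapping a state index to its current approximate value
    sampled_states: iterable of sampled state indices
    regressor: any model with .fit(X, y) (assumption, e.g. scikit-learn style)
    features: callable mapping a state index to a feature vector
    """
    X, targets = [], []
    for s in sampled_states:
        # Bellman optimality target: max_a ( R(s,a) + gamma * sum_s' P(s'|s,a) v_hat(s') )
        q = [R[s, a] + gamma * sum(P[a, s, s2] * v_hat(s2) for s2 in range(P.shape[1]))
             for a in range(P.shape[0])]
        X.append(features(s))
        targets.append(max(q))
    regressor.fit(np.array(X), np.array(targets))   # train v_hat(., w_{k+1}) on the targets
    return regressor
```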
Planning by Dynamic Programming
- DP assumes full knowledge of the MDP
- It is used for planning in an MDP
- For prediction:
  -- Input: MDP <S, A, R, P, Y> and policy π, or MRP <S, P^π, R^π, Y>
  -- Output: value function vπ
- For control:
  -- Input: MDP <S, A, R, P, Y>
  -- Output: optimal value function v* and optimal policy π*
Asynchronous Dynamic Programming
- The DP methods described so far used synchronous backups, i.e. all states are backed up in parallel
- Asynchronous DP backs up states individually, in any order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
- Three simple ideas for asynchronous dynamic programming:
  -- in-place dynamic programming
  -- prioritized sweeping
  -- real-time dynamic programming
Full width backups
- DP uses full-width backups
- For each backup (synchronous or asynchronous):
  -- every successor state and action is considered
  -- using full knowledge of the MDP transitions and reward function
- DP is effective for medium-sized problems (millions of states)
- For large problems DP suffers from Bellman's curse of dimensionality:
  -- the number of states n = |S| grows exponentially with the number of state variables
  -- even one backup can be too expensive
What is Dynamic Programming?
- Dynamic: sequential or temporal component to the problem
- Programming: optimising a "program", i.e. a policy (c.f. linear programming)
- A method for solving complex problems by breaking them down into subproblems:
  -- solve the subproblems
  -- combine the solutions to the subproblems
Iterative Policy Evaluation
- Problem: evaluate a given policy π
- Solution: iterative application of the Bellman expectation backup
- v1 --> v2 --> ... --> vπ
- Using synchronous backups:
  -- at each iteration k+1
  -- for all states s ∈ S
  -- update vk+1(s) from vk(s'), where s' is a successor state of s
- Backup (one-step lookahead from s over actions a, rewards r, successors s'):
  vk+1(s) = ∑_a π(a|s) [ R^a_s + Y ∑_s' P^a_ss' vk(s') ]
  vk+1 = R^π + Y P^π vk
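A minimal sketch of synchronous iterative policy evaluation in Python; the array representation of P, R and the policy pi is an assumption about how the MDP is stored, not part of the lecture:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Evaluate policy pi by repeated Bellman expectation backups.

    P:  (n_actions, n_states, n_states) transition probabilities P[a, s, s']
    R:  (n_states, n_actions) expected immediate rewards R[s, a]
    pi: (n_states, n_actions) action probabilities pi[s, a]
    """
    v = np.zeros(R.shape[0])
    while True:
        # v_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        q = R + gamma * np.einsum('asn,n->sa', P, v)   # q[s, a]
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < theta:          # stop when the sweep changes little
            return v_new
        v = v_new
```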
Policy improvement
- Consider a deterministic policy a = π(s)
- We can improve the policy by acting greedily: π'(s) = argmax_a qπ(s, a)
- This improves the value from any state s over one step:
  qπ(s, π'(s)) = max_a qπ(s, a) >= qπ(s, π(s)) = vπ(s)
- It therefore improves the value function: vπ'(s) >= vπ(s)
- If improvements stop,
  qπ(s, π'(s)) = max_a qπ(s, a) = qπ(s, π(s)) = vπ(s)
- then the Bellman optimality equation has been satisfied: vπ(s) = max_a qπ(s, a)
- Therefore vπ(s) = v*(s) for all s ∈ S, so π is an optimal policy
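A matching sketch of greedy policy improvement, under the same illustrative P, R representation used in the evaluation sketch above:

```python
import numpy as np

def greedy_policy_improvement(P, R, v, gamma):
    """Return a deterministic policy that acts greedily with respect to v.

    P: (n_actions, n_states, n_states) transition probabilities
    R: (n_states, n_actions) expected immediate rewards
    v: (n_states,) current value function estimate
    """
    # q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) v(s')
    q = R + gamma * np.einsum('asn,n->sa', P, v)
    return np.argmax(q, axis=1)   # pi'(s) = argmax_a q(s, a)
```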
Value Function Space
- Consider the vector space V over value functions
- There are |S| dimensions
- Each point in this space fully specifies a value function v(s)
- What does a Bellman backup do to points in this space?
  -- We will show it brings value functions closer together
  -- Therefore the backups must converge on a unique solution
Bellman expectation backup is a contraction
- Define the Bellman expectation backup operator T^π:
  T^π(v) = R^π + Y P^π v
- This operator is a Y-contraction, i.e. it makes value functions closer by at least a factor of Y:
  || T^π(u) - T^π(v) ||∞ = || (R^π + Y P^π u) - (R^π + Y P^π v) ||∞
                         = || Y P^π (u - v) ||∞
                         <= || Y P^π ||u - v||∞ ||∞   (each row of P^π sums to 1)
                         <= Y ||u - v||∞
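A small numerical check of this contraction property on a randomly generated MRP; the random P^π, R^π and value functions u, v are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# Random row-stochastic transition matrix P^pi and reward vector R^pi
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)

def T(v):
    # Bellman expectation backup operator: T^pi(v) = R^pi + gamma * P^pi v
    return R + gamma * P @ v

u, v = rng.random(n), rng.random(n)
lhs = np.max(np.abs(T(u) - T(v)))      # || T(u) - T(v) ||_inf
rhs = gamma * np.max(np.abs(u - v))    # gamma * || u - v ||_inf
print(lhs <= rhs + 1e-12)              # True: the backup is a gamma-contraction
```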
Bellman Optimality Backup is a contraction
- Define the Bellman optimality backup operator T*:
  T*(v) = max_a ( R^a + Y P^a v )
- This operator is a Y-contraction, i.e. it makes value functions closer by at least a factor of Y (similar to the previous proof):
  || T*(u) - T*(v) ||∞ <= Y ||u - v||∞
Deterministic Value Iteration
- If we know the solution to the subproblems v*(s')
- Then the solution v*(s) can be found by one-step lookahead:
  v*(s) <-- max_a ( R^a_s + Y ∑_s' P^a_ss' v*(s') )
- The idea of value iteration is to apply these updates iteratively
- Intuition: start with the final rewards and work backwards
- Still works with loopy, stochastic MDPs
Sample Backups
- In subsequent lectures we will consider sample backups
- Using sample rewards and sample transitions <S, A, R, S'>
- Instead of the reward function R and transition dynamics P
- Advantages:
  -- model-free: no advance knowledge of the MDP required
  -- breaks the curse of dimensionality through sampling
  -- cost of a backup is constant, independent of n = |S|
Value iteration
- Problem: find the optimal policy π
- Solution: iterative application of the Bellman optimality backup
- v1 --> v2 --> ... --> v*
- Using synchronous backups:
  -- at each iteration k+1
  -- for all states s ∈ S
  -- update vk+1(s) from vk(s')
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
- Backup (one-step lookahead from s over actions a, rewards r, successors s'):
  vk+1(s) = max_a [ R^a_s + Y ∑_s' P^a_ss' vk(s') ]
  vk+1 = max_a ( R^a + Y P^a vk )
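A minimal sketch of synchronous value iteration under the same illustrative tabular representation as the earlier sketches:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Find v* by repeated Bellman optimality backups.

    P: (n_actions, n_states, n_states) transition probabilities
    R: (n_states, n_actions) expected immediate rewards
    Returns (v_star, greedy_policy).
    """
    v = np.zeros(R.shape[0])
    while True:
        # v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new, q.argmax(axis=1)
        v = v_new
```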
Other Applications of Dynamic Programming
- scheduling algorithms
- string algorithms
- graph algorithms
- graphical models
- bioinformatics
Generalised Policy Iteration
- Policy evaluation: estimate vπ, using any policy evaluation algorithm
- Policy improvement: generate π' >= π, using any policy improvement algorithm
- The two processes alternate, starting from some (v, π) and converging to v* and π*:
  π --(evaluation: v --> vπ)--> v
  v --(improvement: π --> greedy(v))--> π
- At convergence, v = vπ and π = greedy(v), so the value function and policy are consistent and optimal
In-Place Dynamic Programming
- Synchronous value iteration stores two copies of the value function:
  for all s in S:
    vnew(s) <-- max_a ( R^a_s + Y ∑_s' P^a_ss' vold(s') )
  vold <-- vnew
- In-place value iteration only stores one copy of the value function:
  for all s in S:
    v(s) <-- max_a ( R^a_s + Y ∑_s' P^a_ss' v(s') )
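A sketch of the in-place variant, keeping a single value array and reusing the freshest estimates within a sweep (same illustrative P, R representation as above):

```python
import numpy as np

def in_place_value_iteration(P, R, gamma, n_sweeps=1000, theta=1e-8):
    """In-place value iteration: one value array, updated state by state."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(n_sweeps):
        delta = 0.0
        for s in range(n_states):
            # Uses the latest v(s') values, including ones updated earlier in this sweep
            backup = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_actions))
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup
        if delta < theta:
            break
    return v
```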
Convergence of Iterative Policy Evaluation and Policy Iteration
- The Bellman expectation operator T^π has a unique fixed point
- vπ is a fixed point of T^π (by the Bellman expectation equation)
- By the contraction mapping theorem:
  -- iterative policy evaluation converges on vπ
  -- policy iteration converges on v*
Evaluating a Random Policy in a Small Gridworld
- Undiscounted episodic MDP (Y = 1)
- Nonterminal states 1, ..., 14
- One terminal state (shown twice, as the shaded squares)
- Actions leading out of the grid leave the state unchanged
- Reward is -1 until the terminal state is reached
- The agent follows the uniform random policy:
  π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25
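A runnable sketch of this evaluation on the 4x4 gridworld; the indexing and step function below are one plausible encoding of the grid described above:

```python
import numpy as np

# 4x4 gridworld: the two shaded corner cells (indices 0 and 15) both stand for the
# single terminal state; states 1..14 are nonterminal.
N = 4
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # north, south, west, east

def step(s, a):
    """Deterministic transition: moves off the grid leave the state unchanged."""
    if s in (0, 15):
        return s, 0.0
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c
    return nr * N + nc, -1.0                    # reward -1 on every step until termination

v = np.zeros(N * N)
for _ in range(1000):                           # iterate v_k -> v_{k+1} until approx. v_pi
    v_new = np.zeros_like(v)
    for s in range(N * N):
        if s in (0, 15):
            continue
        for a in actions:                       # uniform random policy, pi(a|s) = 0.25
            s2, r = step(s, a)
            v_new[s] += 0.25 * (r + v[s2])      # gamma = 1 (undiscounted)
    v = v_new
print(v.reshape(N, N).round(1))                 # converges to 0, -14, -20, -22, ...
```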
Prioritized sweeping
- Use the magnitude of the Bellman error to guide state selection:
  | max_a ( R^a_s + Y ∑_s' P^a_ss' v(s') ) - v(s) |
- Back up the state with the largest remaining Bellman error
- Update the Bellman error of affected states after each backup
- Requires knowledge of the reverse dynamics (predecessor states)
- Can be implemented efficiently by maintaining a priority queue
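A sketch of prioritized sweeping for the Bellman optimality backup, assuming a tabular MDP and using Python's heapq as a max-priority queue via negated priorities:

```python
import heapq
import numpy as np

def bellman_error(P, R, v, gamma, s):
    """Return (|max_a backup - v(s)|, max_a backup) for state s."""
    backup = max(R[s, a] + gamma * P[a, s] @ v for a in range(P.shape[0]))
    return abs(backup - v[s]), backup

def prioritized_sweeping(P, R, gamma, n_backups=10000, theta=1e-8):
    n_actions, n_states, _ = P.shape
    # Reverse dynamics: predecessors[s'] = states s that can reach s' under some action
    predecessors = [set() for _ in range(n_states)]
    for a in range(n_actions):
        for s in range(n_states):
            for s2 in np.nonzero(P[a, s])[0]:
                predecessors[s2].add(s)

    v = np.zeros(n_states)
    heap = []
    for s in range(n_states):
        err, _ = bellman_error(P, R, v, gamma, s)
        heapq.heappush(heap, (-err, s))              # max-priority via negated error

    for _ in range(n_backups):
        if not heap:
            break
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:
            break
        _, v[s] = bellman_error(P, R, v, gamma, s)   # back up the highest-error state
        for p in predecessors[s]:                    # re-prioritize affected states
            err, _ = bellman_error(P, R, v, gamma, p)
            if err > theta:
                heapq.heappush(heap, (-err, p))
    return v
```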
Real-time Dynamic programming
- Idea: only back up states that are relevant to the agent
- Use the agent's experience to guide the selection of states
- After each time-step St, At, Rt+1, back up the state St
Policy iteration
- Given a policy π, alternate:
  -- Policy evaluation: estimate vπ (iterative policy evaluation)
  -- Policy improvement: generate π' >= π (greedy policy improvement)
- Cycle:
  π --(evaluation: v --> vπ)--> v
  v --(improvement: π --> greedy(v))--> π
- This process converges to v* and π*, i.e. it stops when v = vπ and π = greedy(v)
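A minimal policy iteration loop composed from the evaluation and improvement sketches above (iterative_policy_evaluation and greedy_policy_improvement refer to those earlier illustrative routines, not to anything in the lecture):

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: turn the deterministic policy into pi(a|s) and evaluate it
        pi = np.eye(n_actions)[policy]
        v = iterative_policy_evaluation(P, R, pi, gamma, theta)
        # Policy improvement: act greedily with respect to v_pi
        new_policy = greedy_policy_improvement(P, R, v, gamma)
        if np.array_equal(new_policy, policy):       # improvement stopped => policy is optimal
            return v, policy
        policy = new_policy
```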
Convergence of Value Iteration
- The Bellman optimality operator T* has a unique fixed point
- v* is a fixed point of T* (by the Bellman optimality equation)
- By the contraction mapping theorem, value iteration converges on v*
Value function infinity-norm
- We will measure the distance between state-value functions u and v by the ∞-norm
- i.e. the largest difference between state values:
  || u - v ||∞ = max_s | u(s) - v(s) |
Requirements for Dynamic Programming
Dynamic programming is a very general solution method for problems which have two properties:
- Optimal substructure:
  -- the principle of optimality applies
  -- the optimal solution can be decomposed into subproblems
- Overlapping subproblems:
  -- subproblems recur many times
  -- solutions can be cached and reused
- Markov decision processes satisfy both properties:
  -- the Bellman equation gives a recursive decomposition
  -- the value function stores and reuses solutions
Synchronous DP algorithms
Problem    | Bellman Equation                                          | Algorithm
Prediction | Bellman Expectation Equation                              | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration
Control    | Bellman Optimality Equation                               | Value Iteration

- Algorithms are based on the state-value function vπ(s) or v*(s)
- Complexity O(m n^2) per iteration, for m actions and n states
- Could also apply to the action-value function qπ(s,a) or q*(s,a)
- Complexity O(m^2 n^2) per iteration
Contraction Mapping Theorem
Theorem: for any metric space V that is complete (i.e. closed) under an operator T(v), where T is a Y-contraction:
- T converges to a unique fixed point
- at a linear convergence rate of Y
Principle of optimality
Any optimal policy can be subdivided into two components:
- an optimal first action A*
- followed by an optimal policy from the successor state S'
A policy π(a|s) achieves the optimal value from state s, vπ(s) = v*(s), if and only if
- for any state s' reachable from s
- π achieves the optimal value from state s', vπ(s') = v*(s')
How to improve a policy
Given a policy π:
- Evaluate the policy π:
  vπ(s) = E[ Rt+1 + Y Rt+2 + ... | St = s ]
- Improve the policy by acting greedily with respect to vπ:
  π' = greedy(vπ)
- In the small gridworld the improved policy was already optimal, π' = π*
- In general, more iterations of improvement/evaluation are needed
- But this process of policy iteration always converges to π*