Reinforcement Learning
What does the optimal policy try to maximize?
- tries to maximize the long-term expected reward
There is at least one optimal policy for an MDP.
True
Q-Learning
Build a lookup table (Q-table) where we estimate the maximum expected future reward for each action at each state. 1. Start with all 0s in the Q-table 2. Choose an action 3. Measure the reward 4. Update the Q-table. As we explore the environment, the Q-function gives us better estimates by continuously updating the Q-values in the table.
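A minimal sketch of that tabular update, assuming a small discrete environment; the table sizes, alpha, and gamma below are placeholder values, not from the cards.

```python
import numpy as np

n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99                 # learning rate and discount factor (assumed values)
Q = np.zeros((n_states, n_actions))      # step 1: start with all 0s in the Q-table

def q_update(s, a, r, s_next):
    """Steps 2-4 for one experience tuple (s, a, r, s'): update the Q-table."""
    td_target = r + gamma * Q[s_next].max()      # measured reward + best estimated future reward
    Q[s, a] += alpha * (td_target - Q[s, a])     # move Q(s, a) toward the target
```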
How can you define an MDP?
defined by - States - Actions - Rewards - Transition Model
Any MDP converges after the 1st value iteration for a discount factor of ?
gamma = 0, since all converged values will just be the immediate rewards
If we knew the optimal Q(s,a) table, we would have the optimal policy
True
DP assumes full knowledge of MDP
we know the rules (dynamics) of the game - used for planning in an MDP
Exploration and Exploitation Dilemma
we need a way to balance exploring and exploiting the environment. - Exploration: give up some reward you know about in order to learn more about the environment (needed for learning) - Exploitation: use the information you have already found to maximize reward (stop learning and maximize reward)
Optimal Policy tells you
which action to take at any state to maximize the expected utility - the optimal way to behave
Does policy iteration always converge?
Yes, because we act greedily with respect to the value function: the new policy is at least as good as the previous policy, never worse. If improvement stops, then we have reached the optimal policy.
Rewards in MDP
scalar value you get for being in a state: R(s), R(s,a), or R(s,a,s')
What is Reinforcement Learning (RL)?
- ML method where an agent learns to behave by acting on the environment and observing the rewards. There is no supervisor; the agent collects training examples through trial and error as it attempts its task, with the goal of maximizing long-term reward.
Do minor changes to the reward matter?
- Yes. Small negative rewards everywhere encourage the agent to end the game faster - the reward is like a teaching signal, part of domain knowledge (it tells you how important it is to get to the end) - Large negative rewards will encourage the agent to take the shortest path no matter what the result is, just to end the negative experience. Determining the rewards is a form of domain knowledge.
Methods to choose actions
- Always choose a single specific action: violates the requirement of exploring every (s,a) pair - Choose randomly: we don't learn the optimal policy - Use the Q estimates to choose actions: poor initialization can cause a specific action to be chosen repeatedly - Random restarts: will take a lot of time to converge - Simulated annealing: facilitates a random but faster approach while still exploring the whole space
What is the Existential Dilemma of Immortality?
- If you live forever, why should you care about anything? (the sum of all rewards is infinite). We deal with this by using the discount factor gamma.
Epsilon Greedy Strategy
- In the beginning, the epsilon rate is high, allowing the agent to explore the environment and randomly choose actions (more exploration). As the agent explores the environment, the epsilon rate decreases and the agent starts exploiting the environment. During the exploration process, the agent becomes more and more confident in its Q-values.
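A minimal sketch of epsilon-greedy selection with a decaying epsilon; the Q-table shape and the decay constants are assumptions for illustration.

```python
import random
import numpy as np

epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995   # assumed decay schedule

def choose_action(Q, s):
    global epsilon
    if random.random() < epsilon:
        action = random.randrange(Q.shape[1])    # explore: random action
    else:
        action = int(np.argmax(Q[s]))            # exploit: best known action
    epsilon = max(eps_min, epsilon * eps_decay)  # explore less as confidence grows
    return action
```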
How do you solve the Bellman equations?
- iterative algorithms: Value Iteration, Policy Iteration, Q-Learning, Sarsa
RL properties
- mapping states to actions -> maximize long-term rewards - sequential process, so time really matters - the agent's actions affect future decisions - the agent learns from experience
Why do we discount by factor gamma?
- mathematically convenient because the return converges to a finite value (geometric series) - avoids infinite returns; lets us "go an infinite distance in finite time" because the geometric series is bounded by R_max / (1 - gamma)
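Written out, assuming only that every reward is bounded by R_max:

```latex
\sum_{t=0}^{\infty} \gamma^{t} R_t \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1
```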
What is the temporal credit assignment problem?
- the problem that arises when receiving delayed rewards. Because you get a reward only after many actions and states, you don't know how to assign blame/credit to any single action. Ex: chess - you don't know you made a mistake until the end.
What is the Transition Model in an MDP?
- the rules (dynamics) of the world. T(s, a, s') is the probability of ending up in state s' when taking action a in state s. This function gives the probability of a new state given your current state and the action you take.
Policy Iteration
- start with some arbitrary policy - evaluate the policy = figure out its value function (utility) - improve the policy by acting greedily with respect to the value function (utility) - iteratively evaluate and improve until it converges (see the sketch below)
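A sketch of policy iteration for a known tabular MDP (planning setting). The arrays T[s, a, s'] (transition probabilities) and R[s] (rewards) are assumed given, and eval_sweeps approximates exact policy evaluation with repeated backups.

```python
import numpy as np

def policy_iteration(T, R, gamma=0.99, eval_sweeps=100):
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # 1. Policy evaluation: estimate V^pi with repeated Bellman backups
        V = np.zeros(n_states)
        for _ in range(eval_sweeps):
            V = R + gamma * T[np.arange(n_states), policy] @ V
        # 2. Policy improvement: act greedily with respect to V
        Q = R[:, None] + gamma * T @ V                # Q[s, a]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):        # no change -> converged
            return policy, V
        policy = new_policy
```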
How can we use the value function to find the optimal policy?
- take the maximum of the value function over all policies; this gives the optimal value function. Then, from the optimal value function, we can find the optimal policy, which solves the MDP.
Which converges faster: value iteration or policy iteration?
- value iteration tends to converge faster overall even though it takes more iterations, since each iteration is cheaper (no full policy evaluation step).
Is value iteration guaranteed to converge?
- Yes, because at each step you are adding R(s), which is a true value, so even if we start with very wrong estimates of utility, we keep adding true values that eventually dominate the wrong estimates.
Why do we use DP for solving MDPs?
Because MDPs satisfy both - optimal substructure - and overlapping subproblems. The Bellman equation gives the recursive decomposition, and the value function stores and reuses solutions (caches the optimal way to behave).
In an infinite horizon MDP with S states and A actions, how many deterministic policies are there? How many stochastic policies are there?
Deterministic: A^S. Stochastic: uncountably many.
An MDP with N states and no stochastic actions converges with N value iterations for any 0 <= gamma <= 1
False; consider a situation where there are no absorbing goal states.
For an MDP if we only change the reward function R, the optimal policy is guaranteed to remain the same
False; small changes in rewards matter. They can help find the optimal solution or drive the agent to reach a terminal state quickly even if the result is not optimal.
Policies found by value iteration are superior to policies found by policy iteration
False, there is no superiority; the optimal policy found by value iteration is the same as the one found by policy iteration.
If the only difference between 2 MDPs is the value of the discount factor, must they have the same optimal policy?
False. As a counterexample, take an MDP with 2 terminal states: terminal A gives reward 1 and is 1 step away from the start state, and B gives reward 10 and is 2 steps away. All other transition rewards are 0, and actions always succeed. For discount factor gamma < 0.1 the optimal policy takes the agent to A; for gamma > 0.1 it takes the agent to B.
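A quick arithmetic check of this counterexample, assuming the convention that the terminal reward is discounted by the number of steps taken before it is received:

```python
# Sample gammas on either side of the 0.1 threshold (values chosen for illustration).
for gamma in (0.05, 0.5):
    value_A = gamma**0 * 1     # reward 1, received after 1 step
    value_B = gamma**1 * 10    # reward 10, received after 2 steps
    best = "A" if value_A > value_B else "B"
    print(f"gamma={gamma}: optimal policy goes to {best} (A={value_A}, B={value_B})")
```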
If an MDP has a transition model T that assigns non-zero probability to all triples (s, a, s'), then Q-learning will fail
False. Q-learning does not need the model T. It can learn the optimal policy by interacting with the environment and gathering experience tuples.
Sequences of Reward in Infinite vs Finite Horizons
For a finite horizon - we try to maximize the sum of rewards over the next N steps. For an infinite horizon - we maximize the sum of all future rewards - using the discount factor gamma - since the return may be infinite, we discount with gamma < 1 so that the expected return converges to a finite value.
Compare Model based vs Model Free algorithms?
Model Based: - have a model. If we know T and R, then we know the rules of the game (or we can build a model of T by looking statistically at the transitions) - use the policy and value function to look ahead in the model and see the optimal way to behave - we can solve model-based problems with DP algorithms: policy iteration and value iteration, which find the optimal policy. Model Free: - no model. The algorithms don't know the transition or reward function and therefore need to interact with the environment and collect experience tuples to learn an optimal policy - you can solve these problems with RL algorithms like Q-Learning.
Planning vs RL
Planning - given the rules of the game (dynamics) - the agent computes internally with the model (without external interaction) to improve its policy. RL - the environment is initially unknown (we don't know the rules of the game; we don't have T and R) - the agent interacts with the environment and collects experience to improve its policy.
Why is an MDP a process, a decision, and stochastic?
Process: because it evolves over time (sequential). Decision: because the agent selects actions to take. Stochastic: because transitions are random.
For a finite horizon MDP with horizon H, we require at most H+1 iterations of value iteration to compute the optimal policy
True
In a deterministic MDP, Q-Learning converges even with learning rate = 1; this is the optimal learning rate.
True
In a deterministic, undiscounted (gamma = 1) MDP, the optimal value of a state is the maximum return from that state
True
In a stochastic MDP with learning rate = 1, the Q(s,a) values are always equal to the most recent samples for those state-action pairs. The Q(s,a) values will cycle among the possible samples and never converge
True
Q-Learning converges to the optimal Q-value function if the states are fully explored and the learning rate is set correctly
True
Q-learning will converge if all the (s,a) pairs are visited infinitely often
True
A stationary optimal policy is guaranteed if the state and action spaces are finite
True
T/F You can make anything a Markovian process by making the current state remember everything from the past
True
Value iteration is guaranteed to converge if the discount factor is 0 < gamma < 1
True
With discount factor = 1, any future step is just as valuable as any other step.
True
Without an infinite horizon, do you lose the notion of stationarity?
True
For an infinite horizon MDP with a finite number of states and with a discount factor 0 < gamma < 1, value iteration is guaranteed to converge.
True, because gamma discounting lets us "go an infinite distance in finite time": the geometric series is bounded by R_max / (1 - gamma).
T/F Any acyclic MDP with N states converges after N value iterations for any 0 <= gamma <= 1
True, since there are no cycles; at each iteration, at least one state whose value is not yet optimal is guaranteed to have its value set to the optimal value.
Q-learning can learn the optimal function Q* without ever executing the optimal policy
True.
T/F In a discounted infinite horizon MDP, V*(s) >= V^pi(s) for all states s and policies pi.
True.
What is the difference between utility and reward?
Utility: the expected reward for that state plus all the rewards thereafter (long term). Reward: the immediate reward (short term).
What do we need to know in order to solve an MDP?
The value function and the policy. The policy tells you which action to take, while the value function tells you how good it is to be in a particular state or to take an action in a particular state.
When is one policy better than another?
When its value function is at least as good as the value function of the other policy for all states
Can the policy change in a finite horizon even if you're in the same state?
Yes
Does value iteration update the policy at every iteration?
Yes
Does changing the discount factor affect optimal policy?
Yes; for example, setting the discount factor to 0 will make the agent choose the action that gives the highest immediate reward.
What is the solution to an MDP ?
a policy, which tells us what action to take from any particular state
Value Iteration
The idea is to recursively compute the value function by combining one update of policy evaluation with policy improvement until we reach the optimal value function. Then, from the optimal value function, we can derive the optimal policy. Start with the final rewards and work backward through the MDP: 1. start with arbitrary utilities 2. update utilities based on neighbors 3. repeat until convergence. We update the estimate of utility by calculating the immediate reward plus the discounted future reward. See the sketch below.
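A sketch of value iteration, using the same assumed T[s, a, s'] and R[s] arrays as the policy-iteration sketch above; tol is an assumed convergence threshold.

```python
import numpy as np

def value_iteration(T, R, gamma=0.99, tol=1e-6):
    V = np.zeros(T.shape[0])                   # 1. start with arbitrary utilities
    while True:
        Q = R[:, None] + gamma * T @ V         # immediate reward + discounted future value
        V_new = Q.max(axis=1)                  # 2. update utilities from neighbors
        if np.abs(V_new - V).max() < tol:      # 3. repeat until convergence
            return Q.argmax(axis=1), V_new     # derive the greedy (optimal) policy
        V = V_new
```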
Markov Decision Process (MDP)
is a framework for modeling sequential decision making in an environment that satisfies the Markov property
What does a larger gamma indicate? A low value of gamma?
A larger gamma (close to 1) indicates we value future rewards more (magnifies future rewards) - far-sighted. A low gamma (close to 0) indicates we value future rewards less (shrinks future rewards) - short-sighted.
Bellman Equation
A recursive equation that defines the utility of being in a particular state: the immediate reward plus the discounted value function at the next state. For every state, the Bellman equation takes the action that maximizes the expected utility.
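Written out with the R(s) and T(s, a, s') notation from these cards (the optimality form, as a sketch):

```latex
V(s) \;=\; R(s) \;+\; \gamma \max_{a} \sum_{s'} T(s, a, s')\, V(s')
```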
What are actions in an MDP?
the set of all possible actions the agent can take in a particular state, A(s). Ex: up, down
What are states in an MDP?
set of all possible states that the world can be in (ex: grid position)
What is the Markov property?
States that: 1) the future is independent of the past given the present (only the present matters); the current state contains all relevant information from previous states (the history - sequences of states, actions, rewards). 2) The rules don't change (the world is stationary - time independent; the only thing that matters is the state you're in, not the time step).
What is Utility of sequences
stationary preferences - if I prefer one sequence of events today, then I prefer the same sequence tomorrow
Control in an MDP
finds the best way to behave - outputs the optimal value function and the optimal policy.
Prediction in an MDP
given the transition and reward function and a policy, evaluates that policy - outputs a value function