Reinforcement Learning


What does the optimal policy try to maximize?

- tries to maximize the long-term expected reward

There is at least one optimal policy for an MDP.

True

Q-Learning

build a lookup table (Q-table) where we calculate the maximum expected future reward for each action at each state: 1- start with all 0s in the Q-table 2- choose an action 3- measure the reward 4- update the Q-table. As we explore the environment, the Q-function gives us better estimates by continuously updating the Q-values in the table
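
A minimal sketch of this tabular update in Python; the sizes, hyperparameters, and names here are illustrative assumptions, not part of the cards:

```python
import numpy as np

# Assumed sizes and hyperparameters for a small discrete environment
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99               # learning rate, discount factor

Q = np.zeros((n_states, n_actions))    # 1) start with all 0s in the Q-table

def q_update(s, a, r, s_next, done):
    """One Q-learning step: after choosing action a in state s and measuring reward r,
    move Q(s, a) toward the immediate reward plus the discounted best future value."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```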

How can you define an MDP?

defined by - States - Actions - Rewards - Transition Model
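
One way those four components could be written down as a plain data structure; a sketch only, with the two-state example purely illustrative:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                      # S
    actions: List[str]                                     # A
    rewards: Dict[str, float]                              # R(s)
    transitions: Dict[Tuple[str, str], Dict[str, float]]   # T(s, a, s') as P(s' | s, a)

# Tiny illustrative example: two states, one deterministic action
toy = MDP(
    states=["s0", "s1"],
    actions=["move"],
    rewards={"s0": 0.0, "s1": 1.0},
    transitions={("s0", "move"): {"s1": 1.0}, ("s1", "move"): {"s1": 1.0}},
)
```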

Any MDP converges after the 1st value iteration for what discount factor?

gamma = 0, since all converged values will just be the immediate rewards

If we knew the optimal Q(s,a) table, we would have the optimal policy

True
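
A sketch of how the greedy policy would be read off such a table (reusing the illustrative Q array shape from the Q-learning example above):

```python
import numpy as np

Q = np.zeros((16, 4))                   # stand-in for a learned optimal Q-table
optimal_policy = np.argmax(Q, axis=1)   # pi*(s) = argmax_a Q*(s, a) for each state s
```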

DP assumes full knowledge of the MDP

we know the rules (dynamics) of the game - used for planning in an MDP

Exploration and Exploitation Dilemma

we need a way to balance between exploring and exploiting the environment. - Exploration: choose to give up some reward you know about in order to learn more about the environment (need for learning) - Exploitation: exploit the information you have already found to maximize reward (stop learning and maximize reward)

Optimal Policy tells you

which action to take at any state to maximize the expected utility - the optimal way to behave

Does policy iteration always converge?

yes, because we act greedily with respect to the value function. Each new policy is at least as good as the previous policy, never worse. If improvement stops, then we have reached the optimal policy

Rewards in MDP

scalar value you get for being in a state: R(s), R(s,a), or R(s,a,s')

What is Reinforcement Learning (RL)?

- ML method where an agent learns to behave by acting on the environment and observing the rewards. There is no supervisor; the agent collects training examples through trial and error as it attempts its task, with the goal of maximizing long-term reward.

Do minor changes to the reward matter?

- Yes. Small negative rewards everywhere encourage the agent to end the game faster. - The reward is like a teaching signal, part of the domain knowledge (it tells the agent how important it is to get to the end). - Large negative rewards will encourage the agent to take the shortest path no matter what the result is, so that it ends the negative outcome. Determining the rewards is a form of domain knowledge

Methods to choose actions

- always choose a single specific action: violates the rule of exploring every (s,a) pair - choose randomly: we don't learn the optimal policy - use the Q estimates to choose actions: poor initialization can cause a specific action to be chosen repeatedly - random restarts: will take a lot of time to converge - simulated annealing: can facilitate a random but faster approach while exploring the whole space

What is the Existential Dilemma of Immortality?

- if you live forever, why should you care about anything (the sum of all rewards is infinite)? We deal with this by using the discount factor gamma

Epsilon Greedy Strategy

- in the beginning the epsilon rate is higher, which lets the agent explore the environment and choose actions randomly. This allows for more exploration. As the agent explores the environment, the epsilon rate decreases and the agent starts exploiting the environment. During this process, the agent becomes more and more confident in its Q-values
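
A sketch of epsilon-greedy action selection with a decaying epsilon, as described above; the decay schedule and constants are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995   # start exploratory, decay toward exploitation

def choose_action(Q, s):
    """Explore with probability epsilon, otherwise exploit the current Q estimates."""
    global epsilon
    if rng.random() < epsilon:
        a = int(rng.integers(Q.shape[1]))   # explore: random action
    else:
        a = int(np.argmax(Q[s]))            # exploit: best known action
    epsilon = max(eps_min, epsilon * eps_decay)
    return a
```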

How do you solve the Bellman equations?

- iterative algorithms: Value Iteration, Policy Iteration, Q-Learning, SARSA

RL properties

- mapping states to actions -> maximize long-term rewards - sequential process, so time really matters - the agent's actions affect future decisions - the agent learns from experience

Why do we discount by factor gamma?

- mathematically convenient because the return converges to a finite solution (geometric series) - avoids infinite returns; it allows us to go an infinite distance in finite time because of the geometric series bound Rmax / (1 - gamma)
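
The bound comes from the geometric series: if every reward is at most R_max, then

```latex
\sum_{t=0}^{\infty} \gamma^{t} R_{\max} = \frac{R_{\max}}{1-\gamma}
\quad\text{for } 0 \le \gamma < 1 .
```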

What is the temporal credit assignment problem?

- the problem that arises when receiving delayed rewards. Because you get a reward only after many actions and states, you don't know how to assign blame/credit to any single action. Ex: chess - you don't know you made a mistake until the end

What is the Transition Model in an MDP?

- the rules (dynamics) of the world, T(s, a, s') - the transition probability of being in state s, taking action a, and ending up in state s'. This function produces the probability of a new state given your current state and the action you take

Policy Iteration

- start with some arbitrary policy - evaluate the policy = figure out its value function (utility) - improve the policy by acting greedily with respect to the value function (utility) - iteratively evaluate and improve until it converges
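
A minimal sketch of policy iteration on a small tabular MDP; the input layout (per-action transition matrices `P[a]` of shape (S, S) and a NumPy reward vector `R`) is an assumption made for illustration:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P: dict mapping action -> (S, S) transition matrix; R: length-S reward vector R(s)."""
    S, actions = len(R), list(P.keys())
    pi = {s: actions[0] for s in range(S)}                # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R + gamma * P_pi V exactly
        P_pi = np.array([P[pi[s]][s] for s in range(S)])
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to V
        new_pi = {s: max(actions, key=lambda a: R[s] + gamma * P[a][s] @ V)
                  for s in range(S)}
        if new_pi == pi:                                  # improvement stopped -> optimal policy
            return pi, V
        pi = new_pi
```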

How can we use value function to find the optimal policy ?

- take the maximum value function over all policies; this gives the optimal value function. Then, from the optimal value function we can find the optimal policy, which solves the MDP

Which converges faster: value iteration or policy iteration?

- value iteration tends to converge faster even though it takes more iterations, since each iteration is much cheaper (there is no full policy-evaluation step).

Is value iteration guaranteed to converge?

- yes, because at each step you are adding R(s), which is a true value, so even if we start with very wrong estimates of the utility, we keep adding the true value and it eventually dominates the wrong estimate

Why do we use DP for solving MDPs?

Because MDPs satisfy both - optimal substructure - and overlapping subproblems. The Bellman equation gives the recursive decomposition and the value function stores and reuses solutions (caches the optimal way to behave)

In an infinite horizon MDP with S states and A actions, how many deterministic policies are there? How many stochastic policies are there?

Deterministic: |A|^|S| Stochastic: uncountably many
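
The deterministic count is just the number of ways to assign one of |A| actions to each of |S| states, for example:

```latex
|\Pi_{\text{det}}| = |A|^{|S|},
\qquad\text{e.g. } |S| = 3,\ |A| = 2 \;\Rightarrow\; 2^{3} = 8 \text{ policies}.
```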

An MDP with N states and no stochastic actions converges within N value iterations for any 0 <= gamma <= 1

False, consider a situation where there are no absorbing goal states

For an MDP if we only change the reward function R, the optimal policy is guaranteed to remain the same

False, small changes in rewards matter. They can help find the optimal solution or drive the agent to quickly reach a terminal state even if it is not optimal

Policies found by value iteration are superior to policies found by policy iteration

False, there is no superiority. The optimal policy from value iteration is the same as the one from policy iteration

If the only difference between 2 MDPs is the value of the discount factor, then they must have the same optimal policy?

False. A counterexample: an MDP with 2 terminal states. Terminal A gives reward 1 and is 1 step away from the start state; B gives reward 10 and is 2 steps away from the start state. All other transition rewards = 0. Assume actions always succeed. For discount factor gamma < 0.1 the optimal policy takes the agent to A; for gamma > 0.1 it takes the agent to B.
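
The arithmetic behind the threshold, assuming each reward is received on arrival and discounted per step:

```latex
V_A = \gamma \cdot 1, \qquad V_B = \gamma^{2} \cdot 10
\quad\Rightarrow\quad
\text{prefer } A \iff \gamma > 10\,\gamma^{2} \iff \gamma < 0.1 .
```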

If an MDP has a transition model T that assigns non-zero probability to all triples (s, a, s'), then Q-learning will fail

False. Q-learning does not need the model T. It can learn the optimal policy by interacting with the environment and gathering experience tuples

Sequences of Reward in Infinite vs Finite Horizons

For a finite horizon, we try to maximize the sum of rewards over the next N steps. For an infinite horizon, we maximize the sum of all rewards over the entire future, using the discount factor gamma. Since the return may be infinite, we use discounting with gamma < 1 so that the expected value converges to a finite solution
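
The two objectives written out, with G denoting the return from time t:

```latex
\text{Finite horizon: } G_t = \sum_{k=0}^{N-1} R_{t+k+1},
\qquad
\text{Infinite horizon: } G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\quad 0 \le \gamma < 1 .
```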

Compare model-based vs model-free algorithms

Model-based: - we have a model. If we know T and R then we know the rules of the game, and we can build a model of T by looking statistically at the transitions. - use the policy and value function to look ahead in the model and see the optimal way to behave - we can solve model-based problems using DP algorithms: policy iteration and value iteration, which find the optimal policy. Model-free: - no model. These algorithms don't know the transition or reward function and therefore need to interact with the environment and collect experience tuples to learn an optimal policy. - You can solve these problems with RL algorithms like Q-Learning

Planning vs RL

Planning - given the rules of the game (dynamics) - the agent internally computes with its model (without external interaction) to improve its policy. RL - the environment is initially unknown (we don't know the rules of the game; we don't have T and R) - the agent interacts with the environment and collects experience to improve its policy

Why is an MDP a process, a decision, and stochastic?

Process: because it evolves over time (sequential). Decision: because the agent selects actions to take. Stochastic: because it is random

For a finite horizon MDP with horizon H, we require at most H+1 iterations of value iteration to compute the optimal policy

True

In a deterministic MDP, Q-Learning converges even with learning rate = 1, and this is the optimal learning rate.

True

In a deterministic, undiscounted (gamma = 1) MDP, the optimal value of a state is the maximum return from that state

True

In a stochastic MDP with learning rate = 1, the Q(s,a) values are always equal to the most recent samples for the state-action pairs. The Q(s,a) values will cycle among the possible samples and never converge

True

Q-Learning converges to the optimal Q-value function if the states are fully explored and the learning rate is set correctly

True

Q-learning will converge if all (s,a) pairs are visited infinitely often

True

A stationary optimal policy is guaranteed if the state and action spaces are finite

True

T/F You can make anything a Markovian process by making the current state remember everything from the past

True

Value iteration is guaranteed to converge if the discount factor is 0 < gamma < 1

True

With discount factor = 1, any future step is just as valuable as any other step.

True

Without an infinite horizon, do you lose the notion of stationarity?

True

For an infinite horizon MDP with a finite number of states and with a discount factor 0 < gamma < 1, value iteration is guaranteed to converge.

True, because gamma discounting allows us to move an infinite distance in finite time, thanks to the geometric series bound Rmax / (1 - gamma)

T/F Any acyclic MDP with N states converges after N value iterations for any 0 <= gamma <= 1

True, since there are no cycles, and therefore at each iteration at least one state whose value is not yet optimal is guaranteed to have its value set to the optimal value

Q-learning can learn the optimal function Q* without ever executing the optimal policy

True.

T/F In a discounted infinite horizon MDP, V*(s) >= V^pi(s) for all states s and policies pi.

True.

What is the difference between utility and reward?

Utility: the expected reward for that state plus all the rewards thereafter (long term). Reward: the immediate reward (short term)

What do we need to know in order to solve an MDP?

The value function and the policy function. The policy function tells you which action to take, while the value function tells you how good it is to be in a particular state or to take an action in a particular state

When is one policy better than another?

When its value function is better than the value function of the other policy for all states

Can the policy change in a finite horizon even if you're in the same state?

Yes

Does value iteration update the policy at every iteration?

Yes

Does changing the discount factor affect optimal policy?

Yes, for example setting the discount factor to 0 will make the agent choose the action that gives the highest immediate reward

What is the solution to an MDP ?

a policy, which tells us what action to take from any particular state

Value Iteration

the idea is to recursively compute the value function by combining one update of policy evaluation and policy improvement until we reach the optimal value function. Then, from the optimal value function, we can derive the optimal policy. Start with the final rewards and work backward through the MDP: 1. start with arbitrary utilities 2. update utilities based on neighbors 3. repeat until convergence. We update the estimate of the utility by calculating the immediate reward plus the discounted reward.
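
A sketch of tabular value iteration under the same illustrative layout as the policy-iteration example (per-action transition matrices `P[a]`, NumPy reward vector `R`); the names and tolerance are assumptions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until the utilities stop changing."""
    S, actions = len(R), list(P.keys())
    V = np.zeros(S)                                    # 1) start with arbitrary utilities
    while True:
        # 2) update each state from its neighbors: immediate reward + discounted best successor value
        V_new = np.array([max(R[s] + gamma * P[a][s] @ V for a in actions)
                          for s in range(S)])
        if np.max(np.abs(V_new - V)) < tol:            # 3) repeat until convergence
            V = V_new
            break
        V = V_new
    # Derive the optimal policy from the optimal value function
    pi = {s: max(actions, key=lambda a: R[s] + gamma * P[a][s] @ V) for s in range(S)}
    return V, pi
```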

Markov Decision Process (MDP)

a framework for modeling sequential decision making in an environment that satisfies the Markov property

What does a larger gamma indicate? A low gamma?

a larger gamma (close to 1) indicates that we value future rewards more (magnifies future rewards) - far-sighted; a low gamma (close to 0) indicates that we value future rewards less (lowers future rewards) - short-sighted

Bellman Equation

a recursive equation that defines the utility of being in a particular state: the immediate reward plus the discounted value function at the next state. For every state, the Bellman equation returns the action that maximizes the expected utility
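
Written out in the R(s), T(s, a, s') notation used in these cards, the Bellman optimality equation is:

```latex
V^{*}(s) = R(s) + \gamma \max_{a \in A} \sum_{s'} T(s, a, s')\, V^{*}(s') .
```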

What are actions in an MDP?

the set of all possible actions the agent can take in a particular state, A(s). Ex: up, down

What are states in an MDP?

set of all possible states that the world can be in (ex: grid position)

What is Markov property?

states that 1) the future is independent of the past given the present (only the present matters) - the current state contains all relevant information from previous states (the history - sequences of states, actions, rewards) 2) the rules don't change (the world is stationary - time independent; the only thing that matters is the state you're in, not the time step)

What is the utility of sequences?

stationary preferences - if I prefer one sequence of events today, then I prefer the same sequence tomorrow

Control in an MDP

tells us the best way to behave; outputs the optimal value function and the optimal policy.

Prediction in an MDP

given the transition and reward functions and a policy, outputs the value function for that policy

