CS 7642 - Reinforcement Learning - Study Set

What does the folk theorem/ 2 player plot tell us? What are all the Nash equilibria in this new version of rock paper scissors?

-

Grim Trigger Strategy

A punishment strategy in which a player who defects against (or otherwise harms) a second player is punished by that player for the remainder of the game, no matter how the first player behaves thereafter.

Eligibility Traces

A temporary, decaying record of the occurrence of an event, such as visiting a state or taking an action (think of the exponentially decaying weights (1-λ)λ^(n-1) that TD(λ) places on the n-step returns).
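As an illustration (not from the original card), here is a minimal sketch of tabular TD(λ) with accumulating eligibility traces; the episode format, gamma, alpha, and lam values are assumptions.

import numpy as np
# Sketch: tabular TD(lambda) with accumulating eligibility traces.
# The episode format, gamma, alpha, and lam values are illustrative assumptions.
def td_lambda_episode(episode, V, gamma=0.9, alpha=0.1, lam=0.8):
    """episode: list of (state, reward, next_state, done); V: NumPy float array of state values."""
    e = np.zeros(len(V))                 # one eligibility trace per state
    for s, r, s_next, done in episode:
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]            # TD error for this transition
        e[s] += 1.0                      # the visited state becomes (more) eligible
        V += alpha * delta * e           # every eligible state shares in the update
        e *= gamma * lam                 # traces decay exponentially
    return V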

In Sutton 88, experiment 2, what kind of figure 4 would be expected if there were infinite sequences instead of 10? Why?

The best alpha and the best error values for each λ would both decrease (except for alpha = 0, which always stays the same). A smaller alpha is needed to stabilize training when learning from a larger amount of data. The best λ should stay roughly the same (not entirely sure, but the best λ should depend on the size of the MDP rather than on the number of sequences).

For the below general sum game, what is the coco value? [[(1,2), (2,0)], [(3,2), (4,4)]]

COCO(U, Ū) = maxmax((U + Ū)/2) + minmax((U − Ū)/2). Here (U + Ū)/2 = [[1.5, 1], [2.5, 4]] and (U − Ū)/2 = [[-0.5, 1], [0.5, 0]], so the value is 4 + 0 = 4.

Value Iteration

An iterative dynamic-programming algorithm that repeatedly applies the Bellman optimality backup, V(s) ← max_a Σ_s' T(s, a, s')[R(s, a, s') + γV(s')], sweeping over all states until the values converge; the optimal policy is then the greedy policy with respect to the converged values.
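A minimal sketch of that loop, assuming a tabular model with transition tensor P[s, a, s'] and reward matrix R[s, a] (both illustrative names):

import numpy as np
# Sketch: value iteration on an assumed tabular model P[s, a, s'] and R[s, a].
def value_iteration(P, R, gamma=0.95, tol=1e-6):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)            # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        V_new = Q.max(axis=1)              # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1) # converged values and the greedy policy
        V = V_new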

Markov Decision Process

Consists of a set of states, a set of actions, a transition function giving the probability of the next state given the current state and action, and an immediate reward function of the current state, current action, and next state (usually together with a discount factor γ).

In HW3, we saw SARSA is a 'control' algorithm. What is a 'control' task? Are any TD algorithms a 'control' algorithm?

A control task is one in which actions are involved: the agent estimates the action-value function Q(s,a) and uses it to improve its policy, rather than just predicting state values for a fixed policy. TD methods can be used for either prediction or control (e.g., SARSA and Q-learning are TD control algorithms), but not every TD method is a control algorithm.

COCO Values

Cooperative/Competitive values. COCO(U, Ū) = maxmax((U + Ū)/2) + minmax((U − Ū)/2), i.e., a cooperative part plus a competitive part. Properties: efficiently computable; utility maximizing; decomposes a game into the sum of two games; unique; can be extended to stochastic games (Coco-Q is a non-expansion); not necessarily an equilibrium, but a best binding response.

Dec-POMDP

Decentralized POMDP. This is a model for coordination and decision-making among multiple agents in MARL. The difference from a POMDP is that the transition, observation, and reward functions depend on the joint actions (and joint observations) of all cooperating agents.

T/F: Since SARSA is an online algorithm, it will never learn the optimal policy for an environment.

False. Being online does not matter; SARSA converges to the optimal policy in the limit given infinite exploration (and a policy that becomes greedy in the limit).

T/F: LP is the only way to efficiently solve an MDP for its optimal policy. Why?

False. Many different methods are guaranteed to converge to an optimal policy, e.g., value iteration and policy iteration (dynamic programming), as well as Q-learning and SARSA under the usual conditions.

T/F: Policy shaping requires a completely correct oracle to give the RL agent advice. Why?

False (tentatively). Policy shaping requires a certain amount of confidence in the advice, but that does not mean it needs a completely correct oracle.

T/F: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. Why?

False. They are closely related: a contraction mapping is a special case of a non-expansion (Lipschitz constant strictly less than 1 versus at most 1), and both are used in convergence proofs.

T/F: An optimal pure strategy does not necessarily exist for a two-player, zero-sum finite deterministic game with perfect information. Why?

False. A two-player, finite, deterministic game of perfect information always has a pure optimal strategy. Also, maxmin = minmax in this game type.

T/F: There are no non-expansions that converge.

False. Some non-expansions do converge: every contraction is a non-expansion, and contractions converge to a unique fixed point. (Not every non-expansion converges, e.g., a rotation, but the claim that none do is false.)

T/F: KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems. Why?

False. A KWIK learner can be converted into a mistake-bound learner (by guessing whenever it would answer "I don't know"), but as the examples in lecture showed, the KWIK bound can be much larger than the mistake bound for some hypothesis classes, so there is no guarantee of the same or better complexity for all problems.

T/F: If following the repeated game strategy "Pavlov", we will cooperate until our opponent defects. Once an opponent defects we defect forever. Why?

False. That describes the grim trigger strategy. Pavlov is win-stay/lose-shift: cooperate if both players chose the same action last round, otherwise defect. So after the opponent defects, Pavlov defects; once both have defected, it switches back to cooperation.

T/F: Potential-based shaping will always find an optimal policy faster than an unshaped MDP. Why?

False. It depends on the shaping potential chosen. Potential-based shaping will still find an optimal policy, but learning may be slowed down if the selected potential is a poor one.

T/F: The objective of the dual LP presented in lecture is minimization of "policy flow" (the minimization is because we are aiming to find an upper bound on "policy flow"). Why?

False. The dual LP maximizes (not minimizes) the reward-weighted "policy flow", subject to flow-conservation constraints.

T/F: With a classic update using Linear Function approximation, we will always converge to some values, but they may not be optimal. Why?

False. It may not even converge; consider Baird's star counterexample.

T/F: Offline algorithms are generally superior to online algorithms. Why?

False. Neither is generally superior; it depends on the setting. Online algorithms, for instance, update values as soon as new information arrives, making efficient use of each experience.

T/F: The only algorithms that work in POMDPs are planning algorithms. Why?

False. Exact planning in POMDPs is actually undecidable in general. Non-planning (RL) algorithms are hard but do work in some cases.

T/F: Rmax will always find the optimal policy for a properly tuned learning function. Why?

False. Rmax is not guaranteed to find the optimal policy, but it can obtain near-optimal results with high probability.

T/F: Nash equilibria can only be pure, not mixed.

False. Some Nash equilibria require mixed (probabilistic) strategies, others are pure. Every finite game has at least one mixed-strategy Nash equilibrium (a pure strategy is just a mixed strategy that puts probability 1 on one action), but a pure-strategy Nash equilibrium need not exist.

T/F: Any optimal policy found with reward shaping is the optimal policy for the original MDP. Why?

False. Some reward shaping functions could result in a sub-optimal policy with a positive loop that distracts the learner from finding the optimal policy. Only potential-based reward shaping functions are guaranteed to preserve the consistency with the optimal policy for the original MDP.

T/F: "Sub-game perfect" means that every stage of a multistage game has a Nash equilibrium. Why?

False. Subgame perfect means the strategy profile is a Nash equilibrium in every subgame of the multistage game, i.e., the players' strategies remain best responses no matter what history of play has occurred, not merely that every stage has some Nash equilibrium.

T/F: In RL, recent moves influence outcomes more than moves further in the past. Why?

False. The temporal credit assignment problem exists precisely because a move in the distant past can have a dominant influence on the outcome.

T/F: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations. Why?

False. TD(1) is equivalent to Monte Carlo and propagates information all the way back in each presentation, so it does relatively better with single presentations. TD(0) propagates information only one step per presentation, so it is the one that benefits from repeated presentations.

T/F: The "folk theorem" states that the notion of threats can stabilize payoff profiles in one-shot games. Why?

False. The folk theorem states that the threat of retaliation opens the door to cooperation in repeated games, not one-shot games: threats are used to stabilize payoff profiles only when the game is repeated, because in a one-shot game there is no future in which to carry out a threat.

T/F: In TD learning, the sum of the learning rates used must converge for the value function to converge. Why?

False. The sum of the learning rates must diverge for the value function to converge (Σ_t α_t = ∞), while the sum of the squared learning rates must converge (Σ_t α_t² < ∞).
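As a standard worked example (not from the original card), the schedule α_t = 1/t satisfies both conditions, whereas a constant learning rate α_t = c > 0 satisfies only the first:

Σ_{t=1}^∞ 1/t = ∞ (diverges), but Σ_{t=1}^∞ 1/t² = π²/6 < ∞ (converges).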

T/F: RL with linear function approximation will only work on toy problems.

False. It can be applied well beyond toy problems; with good (possibly non-linear) features, a linear function approximator over those features can handle a wide range of problems.

T/F: Problems that can be represented as POMDPs cannot be represented as MDPs. Why?

False. You can go from a POMDP over some state-space to a larger MDP over a belief-space.

T/F: The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options. Why?

False. Even though we can sample all options, we still must decide how much to explore each one before exploiting. We explore the arms until we reach some confidence threshold on their estimated values, then exploit what we have learned to maximize reward; that is exactly the exploration/exploitation trade-off.

T/F: Given a model (T, R) we can also sample in, we should first try TD learning. Why?

False. If we already have a model (T, R), model-based methods such as value iteration or policy iteration have better convergence properties, so we should try those before TD learning.

T/F: A policy that is greedy, with respect to the optimal value function, is not necessarily an optimal policy. Why?

False. Acting greedily with respect to the optimal value function is optimal by definition: the optimal value of a state already accounts for all future rewards, so the one-step greedy choice cannot get stuck in a local maximum the way generic greedy algorithms can.

T/F: If we know the optimal Q values, we can get the optimal V values only if we know the environment's transition function/matrix. Why?

False. We do not need the transition function: Q-values do not need to be combined with transition probabilities to recover the optimal state values, since V*(s) = max_a Q*(s,a).

T/F: The Value of the returned policy is the only way to evaluate a learner. Why?

False. There are also computational, memory, sample efficiency, and other metrics by which to evaluate a learner.

T/F: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP. Why?

False. We can always convert a finite horizon MDP to an infinite horizon MDP, e.g., by adding an absorbing terminal state with a zero-reward self-loop (and, if needed, folding the time step into the state).

T/F: Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks. Why?

False. Even though MC is unbiased, it has a much higher variance. Therefore, TD methods may outperform MC in terms of learning performance.

What is the formula to view TD(λ) as a function of n step estimators?

G_t^λ = (1 − λ) Σ_{n=1}^∞ λ^(n−1) G_t^(n)
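A small sketch of that weighting (illustrative, assuming the n-step returns for a finite episode are already collected in a list, with the last entry being the full Monte Carlo return):

# Sketch: lambda-return as the (1 - lam) * lam^(n-1) weighted sum of n-step returns.
def lambda_return(n_step_returns, lam=0.8):
    """n_step_returns[n-1] holds G_t^(n); the final entry (the Monte Carlo
    return) absorbs the remaining lam^(N-1) weight for a finite episode."""
    G_lam = 0.0
    N = len(n_step_returns)
    for n, G_n in enumerate(n_step_returns, start=1):
        if n < N:
            G_lam += (1.0 - lam) * (lam ** (n - 1)) * G_n
        else:
            G_lam += (lam ** (n - 1)) * G_n   # tail weight on the full return
    return G_lam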

We learned about reward shaping in class. Could it be useful for solving Lunar Lander? If so, why and how?

I believe the environment already provides enough reward signal for a successful agent. However, reward shaping could be used, for example, to penalize high speed at certain positions, and if done well that could shorten the agent's learning curve.

T/F: An update rule which is not a non-expansion will not converge without exception. Why?

The intuition is true in general, but strictly this is False: some update rules that are not non-expansions still converge. Being a non-expansion is a condition that helps guarantee convergence, but it is not an "if and only if".

Why does the "folk theorem" help us to solve the prisoner's dilemma?

In the iterated prisoner's dilemma, strategies such as tit-for-tat form a Nash equilibrium of the repeated game, and that equilibrium sustains cooperation.

What is the KWIK framework? What are the strengths of this framework and its algorithms?

KWIK ("Knows What It Knows") is a learning framework in which the learner must either make an accurate prediction or explicitly say "I don't know", and the number of "I don't know" answers is bounded for KWIK-learnable classes. This makes it useful for knowing when an agent still needs to explore and when it can stop.

Using the coco value, we calculate the side payments as P = coco(u, u_bar). But why does it make sense for u to receive a side payment of P?

It makes sense because, by cooperating, neither party loses more than it would by not cooperating. In the coco equation, the maxmax term is what the players can earn by fully cooperating, and the minmax term captures the competitive advantage one agent holds over the other; the side payment compensates for that advantage so that cooperating is a best binding response. The side payments are always inverse: if one agent pays one (-1), the other receives one (+1).

T/F: RL with linear function approximation will not work on environments having a continuous state space. Why?

False. Linear function approximation is used precisely because it works in continuous state spaces: the state is mapped to a feature vector (e.g., via tile coding) and the value is a linear function of those features, so states never need to be enumerated.

Compare and contrast MC and TD estimators.

MC: high variance, zero bias; good convergence properties; not sensitive to the initial value; simple to understand and use; effective in non-Markov environments. TD: low variance, some bias; usually more efficient than MC; TD(0) converges to v_π(s), but is not guaranteed to converge with function approximation; more sensitive to the initial value; exploits the Markov property.

In Sutton 88, is the TD algorithm used in the paper the same as the TD we saw in class?

No. The Sutton 88 paper implemented the TD update rules in the forward view, whereas in class we saw the backward view.

What are the benefits of an off-policy algorithm?

Off-policy methods allow the agent to evaluate and improve a policy that is different from the policy used for action selection (target policy ≠ behavior policy). This allows for continuous exploration, learning from demonstration, and parallel learning.

POMDP

Partially Observable Markov Decision Process. The agent does not see the state directly; it receives observations (via an observation function) that reveal only part of the underlying MDP, and it maintains conditional probabilities (a belief) over the hidden states.

Markov Chain

Process where a system changes its state in a way that depends only on its current state. (No actions)

In Sutton 88, what is P_t? What is ∇_w P_t?

P_t is the prediction at time step t (the learner's estimate of the eventual outcome of the sequence); ∇_w P_t is the gradient of that prediction with respect to the weights w. In the linear case used in the paper, ∇_w P_t is simply the feature vector x_t.

Is it possible that Q-learning doesn't converge? Is it possible that the point it converges to is not optimal?

Q-learning comes with the guarantee that the estimated Q-values converge to the true optimal Q-values, provided all state-action pairs are sampled infinitely often and the learning rate is decayed appropriately. If those conditions are not met, Q-learning may fail to converge; when they are met, the point it converges to is optimal.

One perspective of SARSA is that it looks a bit like policy iteration. Can you tell us which part of SARSA is policy evaluation and which part is policy improvement?

SARSA is on-policy TD control. The TD update Q(S,A) ← Q(S,A) + α[R_{t+1} + γQ(S',A') − Q(S,A)] is the policy-evaluation part (it estimates Q for the current policy), while choosing the next action A' from S' ε-greedily with respect to the current Q is the policy-improvement part.

Why is SARSA on policy? How does this compare to an off-policy method like Q-learning by following a random policy to generate data?

SARSA is on-policy because the action A' used in the update is the action actually taken from S' under the behavior (ε-greedy) policy. Q-learning is off-policy: it takes an action under the behavior policy (which could even be random), observes R and S', but updates toward the greedy value max_a Q(S',a): Q(S,A) ← Q(S,A) + α[R_{t+1} + γ max_a Q(S',a) − Q(S,A)].
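The two update rules side by side, as an illustrative sketch (Q is assumed to be an array of action values indexed by state; alpha and gamma are assumed hyperparameters):

# Sketch: the only difference is which next-state value gets bootstrapped.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy: uses the action a_next actually chosen (e.g., epsilon-greedily) in s_next
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy: bootstraps from the greedy (max) action value in s_next
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])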

Explain the two types of tasks modeled by MDPs; unify the notation used to describe these two tasks and describe the type of state that allows you to unify them.

The two types are episodic and continuing tasks. Their returns can be written in the same notation by treating episode termination as entering an absorbing state with a zero-reward self-loop, so an episodic task becomes a continuing one and both returns are (possibly discounted) infinite sums (S&B 3.4).

Markov Reward Process

Stochastic process which extends either a Markov Chain or continuous-time Markov chain by adding a reward to each state.

What are some advantages of TD methods compared to MC or DP methods?

TD methods combine advantages of both: like MC they are model-free and learn directly from experience (unlike DP, which requires a model), and like DP they bootstrap, so they can learn online from incomplete episodes instead of waiting for the end of an episode as MC must.

In Sutton 88, why doesn't TD(0) do well in Fig 5?

TD(0) propagates information backward only one step per presentation, so with a single presentation of each training set (as in Fig 5) it cannot spread the terminal outcome information through the sequence; intermediate values of λ do better.

Markov Property

The conditional probability distribution of future states of the process depends only upon the present state, not on the sequence of events that preceded it.

What is the meaning of the bellman equations? What is it saying? What is your way of visualizing this equation in your head?

The long-term return for an action is the immediate reward plus the sum of all possible future rewards, each discounted by γ^n (where n is the number of steps past the next state and γ is between 0 and 1). I visualize it as a recursive one-step look-ahead: the value of a state is the reward received now plus the discounted value of wherever you end up next.

In Sutton 88, if there is way more data than 100 sequences for each training set in experiment 1, what kind of graph of Fig3 would be expected? Why?

The shapes of the curves should be similar, but the error values should all decrease, because the impact of random noise goes down as the number of sequences increases.

What is the value of the 5-step estimator of the terminal state? What about other n-step estimators of the terminal state?

The value of the terminal state is 0 for the 5-step estimator and for every other n-step estimator, since the episode has ended and no further reward can be collected.

In Sutton 88, experiment 1, what is the convergence criterion? What effect does it have on the TD algorithm?

The weight vector was not updated after each sequence; the accumulated updates were applied only after the complete presentation of a training set, and each training set was presented repeatedly until the weights no longer changed significantly. This batch-style criterion smooths out the stochasticity of the individual sequences and lets the procedure converge despite the small number of sequences per training set.

In Sutton 88, why are all the curves in Fig 4 convex?

Each curve in Fig 4 plots error against the learning rate α for a fixed λ, and is convex because an intermediate α is optimal: too small an α means the predictions barely move given the limited data, while too large an α makes the updates overshoot and increases error.

Why can you find an optimal policy despite finding sub-optimal Q-values (via Q-learning)?

Because the policy is obtained by taking the argmax over the Q-values: as long as the relative ordering of the actions in each state matches that of the true Q*, the greedy policy is optimal even though the Q-values themselves are still inaccurate.

T/F: An MDP given a fixed policy is a Markov Chain with rewards. Why?

True. A fixed policy means the action taken in each state is determined by the policy, so the action choice can be folded into the transition probabilities and the process reduces to a Markov chain with rewards (a Markov reward process).

T/F: Applying generalization with an "averager" on an MDP results in another MDP. Why?

True. An averager computes each value as a fixed convex combination of anchor-point values, and those convex weights can be folded into the transition probabilities, so the result is another (derived) MDP.

T/F: Backward and Forward TD(λ) can be applied to the same problems.

True. Both backward and forward TD(λ) will converge on the same problem. Backward TD(λ) typically is easier to compute.

T/F: MDPs are a type of Markov game. Why?

True. An MDP can be considered a single-agent Markov game, or equivalently a multi-agent Markov game in which only one agent's actions affect the rewards and the transition function.

T/F: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?

True. The Markov property says the "future is independent of the past given the present". All that needs to be known about past states is assumed to be contained within the present state, so nothing essential is lost, only redundant history is discarded. But since the question targets the agent specifically: the agent does not remember anything beyond its value/policy estimates and the current state.

T/F: In TD(λ), we should see the same general curve for the best learning rate (lowest error) regardless of λ value. Why?

True. See Figure 4 in Sutton's TD(λ) paper: for every λ the error-versus-α curve has the same general shape, with some intermediate learning rate giving the lowest error. Generally TD(1) has the highest error and lower λ values do better, with the optimal in-between λ being problem dependent.

T/F: In the gridworld MDP in "Smoov and Curly's Bogus Journey", if we add 10 to each state's reward (terminal and non-terminal) the optimal policy will not change. Why?

True. The relative (differential) rewards across states do not change when 10 is added to every reward, so the ordering of policies, and hence the optimal policy, stays the same.

T/F: The optimal policy for any MDP can be found in polynomial time. Why?

True. Any finite MDP can be solved by linear programming in polynomial time (in the number of states and actions), for example via interior-point methods. For very large state spaces LP may not be practical, but it is still polynomial.

Bellman Equations

Vπ(s) = Eπ[R_{t+1} + γVπ(S_{t+1}) | S_t = s];  Qπ(s,a) = Eπ[R_{t+1} + γQπ(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

Let's say you want to use a function approximator like we learned in class. What function(s) are you approximating? What's the input of that function and what's the output of that function?

We are approximating the value function, v̂(s, w) (or the action-value function q̂(s, a, w)). The input is a feature vector describing the state (or state-action pair) and the output is the estimated value; the weights w are updated (e.g., by gradient descent on the TD error) so that the approximate values move toward their targets.
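A minimal sketch of that input/output relationship with a linear approximator and a semi-gradient TD(0) update; the feature map phi and the step sizes are assumptions for illustration:

import numpy as np
# Sketch: linear value-function approximation, v_hat(s, w) = w . phi(s),
# trained with a semi-gradient TD(0) update. phi(s) is an assumed feature map.
def v_hat(w, phi_s):
    return np.dot(w, phi_s)                  # output: estimated value of the state
def td0_update(w, phi_s, r, phi_s_next, done, alpha=0.01, gamma=0.99):
    target = r + (0.0 if done else gamma * v_hat(w, phi_s_next))
    delta = target - v_hat(w, phi_s)         # TD error
    return w + alpha * delta * phi_s         # gradient of v_hat w.r.t. w is phi(s)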

Does Policy iteration or value iteration converge faster? Why?

While both algorithms are guaranteed to converge to an optimal policy, policy iteration typically converges in fewer iterations, although each of its iterations is more expensive because it includes a full policy evaluation. As a result, policy iteration is often reported to conclude faster than value iteration.

Is it possible to have multiple terminal states in an MDP?

Yes. There is no restriction on the number of terminal states an MDP can have.

KWIK is a learning framework like PAC (probably approximately correct) learning, why do we even need a learning framework like PAC or KWIK?

You can use them to know when you know! In the context of RL, if you already have good estimates of the Q-values for a state, then you don't need to explore it further. However, if you don't know, then you may want to experience that state to gather a better estimate of its Q-values.

In Sutton 88, what is the value of ∇_w P_t in the paper if x_t = [0,0,1,0,0], under the context of the random walk example? Why?

∇_w P_t = x_t = [0,0,1,0,0]. The predictions in the paper are linear in the weights, P_t = wᵀx_t, so the gradient with respect to w is just the feature vector itself. (The vector [1/6, 1/3, 1/2, 2/3, 5/6] is the set of ideal predictions, i.e., the true expected outcomes of the non-terminal states in the random walk, not the gradient.)

What is Monte-Carlo return? What is the n-step estimator? What is the prediction problem? What are model-free methods?

Monte Carlo return: the total (discounted) reward accumulated from time t until the end of the episode, G_t = R_{t+1} + γR_{t+2} + ... The n-step estimator: the first n rewards plus a bootstrapped estimate of the remainder, G_t^(n) = R_{t+1} + ... + γ^(n-1)R_{t+n} + γ^n V(S_{t+n}). The prediction problem: estimating the value function of a fixed policy from experience. Model-free methods: methods that learn values or policies directly from sampled experience, without learning (or being given) the transition and reward functions.
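A minimal sketch of the first two of these (Monte Carlo return and n-step return), with the reward list, value table, and state index as illustrative assumptions:

# Sketch: Monte Carlo return and n-step return for a single time step t.
def mc_return(rewards, gamma=0.99):
    """Full discounted return G_t given rewards r_{t+1}, ..., r_T."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G
def n_step_return(rewards, V, s_t_plus_n, n, gamma=0.99):
    """First n rewards plus the bootstrapped estimate gamma^n * V[s_{t+n}]."""
    G = sum((gamma ** i) * r for i, r in enumerate(rewards[:n]))
    return G + (gamma ** n) * V[s_t_plus_n]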

