CS 7642 - Reinforcement Learning Final Exam

What are policies?

A policy is a mapping from states to probabilities of selecting each possible action.

What are value functions?

A value function estimates, for states (or state-action pairs), how good it is for the agent to be in a given state (or to take a given action in a given state), measured in terms of expected return.

What is TD(0)?

Also known as one-step TD; it does not use an eligibility trace, unlike TD(1) and TD(λ). After each transition, TD(0) only updates the value of the state it just left.
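
A minimal sketch of the tabular TD(0) prediction update, assuming a hypothetical environment with reset()/step(action) methods and a fixed policy function (both placeholders, not tied to any specific library):

    import collections

    def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
        """Tabular TD(0): after each transition, update only the state just left."""
        V = collections.defaultdict(float)             # state-value estimates
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                action = policy(state)                 # follow the policy being evaluated
                next_state, reward, done = env.step(action)
                target = reward + gamma * V[next_state] * (not done)
                V[state] += alpha * (target - V[state])    # one-step TD update
                state = next_state
        return V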

Policy Iteration drawback?

Each iteration of policy iteration involves policy evaluations, which may itself be a protracted iterative computation requiring multiple sweeps through the state set.

What is a function approximation?

Function approximation is a method that maps inputs to outputs with a parameterized function, for example mapping a (possibly continuous) state to an estimated value or to a discretized representation. It lets the agent generalize across large or continuous state and action spaces instead of storing a separate table entry for every state.

What is Folk Theorem?

In repeated games, the possibility of retaliation opens the door for cooperation.

What is eligibility trace?

An eligibility trace is a short-term memory vector z that parallels the long-term weight vector w. When a component of w participates in producing an estimated value, the corresponding component of z is bumped up and then begins to fade away, so that component remains eligible for learning updates for a while.

What is TD(λ)?

TD(λ) is an algorithm that updates based on differences between temporally successive predictions. It interpolates between TD(0) and TD(1): λ shifts the algorithm's behavior between one-step updates (λ = 0) and Monte-Carlo-like updates (λ = 1).
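
A minimal tabular sketch of TD(λ) with accumulating eligibility traces, using the same hypothetical env/policy placeholders as the TD(0) sketch above; setting lam=0 recovers TD(0), while lam=1 behaves like an every-visit Monte Carlo update:

    import collections

    def td_lambda(env, policy, episodes=1000, alpha=0.1, gamma=0.99, lam=0.9):
        """Tabular TD(lambda) with accumulating eligibility traces."""
        V = collections.defaultdict(float)
        for _ in range(episodes):
            z = collections.defaultdict(float)     # eligibility trace per state
            state, done = env.reset(), False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                delta = reward + gamma * V[next_state] * (not done) - V[state]
                z[state] += 1.0                    # bump the trace for the visited state
                for s in z:                        # every traced state shares in the update
                    V[s] += alpha * delta * z[s]
                    z[s] *= gamma * lam            # traces decay toward zero
                state = next_state
        return V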

What is reward shaping?

Reward shaping means modifying the rewards so that the agent behaves in the desired manner with respect to those rewards. A classic example is gradually shaping the rewards given to a dolphin until the dolphin is driven to jump through a flaming hoop.

What is Value Iteration?

Value iteration combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement, using the Bellman optimality backup to update each state's value directly.
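
A minimal sketch of value iteration on a finite MDP, assuming a hypothetical model representation P[s][a] = list of (prob, next_state, reward) tuples, with every reachable state appearing as a key of P:

    def value_iteration(P, gamma=0.99, theta=1e-8):
        """P[s][a] is a list of (prob, next_state, reward) tuples for a finite MDP."""
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                # one Bellman-optimality backup: evaluation and improvement in a single sweep
                best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                           for a in P[s])
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:
                break
        # extract the greedy (deterministic) policy from the converged values
        pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in P}
        return V, pi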

Does Policy Iteration or Value Iteration converge faster? If so, why?

Policy iteration typically converges in fewer iterations, since each iteration produces a strictly better policy, but each iteration is more expensive because it contains a full policy evaluation. Value iteration performs cheaper sweeps but may need many more of them, so which converges faster in practice depends on the problem being solved.

What is an MDP?

A Markov Decision Process (MDP) is a classical formalization of sequential decision making, in which actions influence not just immediate rewards but also subsequent situations (states), and through those, future rewards.

What is a n-step estimator?

An n-step estimator computes its target from the next n rewards plus the discounted value estimate of the state reached n steps later, rather than from a single reward (as in TD(0)) or the full return (as in Monte Carlo).
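
A minimal sketch of computing the n-step return G_{t:t+n} from a stored trajectory; `rewards` and `values` are hypothetical lists indexed by time step:

    def n_step_return(rewards, values, t, n, gamma=0.99):
        """n-step return: the next n rewards plus the discounted value
        estimate of the state reached n steps later.
        rewards[k] is the reward received after step k; values[k] estimates V(S_k)."""
        T = len(rewards)                      # episode length
        G = 0.0
        for k in range(t, min(t + n, T)):
            G += gamma ** (k - t) * rewards[k]
        if t + n < T:                         # bootstrap only if the episode hasn't ended
            G += gamma ** n * values[t + n]
        return G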

What is a Nash Equilibrium?

A Nash equilibrium is a set of strategies, one per player, such that no player has an incentive to unilaterally change their own strategy.

What is Policy Iteration?

Policy iteration works by iteratively improving the current policy. Each policy is guaranteed to be a strict improvement over the previous one (unless already optimal).
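
A minimal sketch of policy iteration, using the same hypothetical P[s][a] -> list of (prob, next_state, reward) model representation as the value-iteration sketch above:

    def policy_iteration(P, gamma=0.99, theta=1e-8):
        """Alternates full policy evaluation with greedy policy improvement."""
        pi = {s: next(iter(P[s])) for s in P}      # arbitrary initial deterministic policy
        V = {s: 0.0 for s in P}
        while True:
            # policy evaluation: iterate until V converges for the current policy
            while True:
                delta = 0.0
                for s in P:
                    v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
                if delta < theta:
                    break
            # policy improvement: act greedily with respect to V
            stable = True
            for s in P:
                best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
                if best != pi[s]:
                    pi[s], stable = best, False
            if stable:
                return V, pi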

What is PAC learning?

Probably Approximately Correct: a learning framework in which, with high probability ("probably"), the learner outputs a hypothesis with small error ("approximately correct"), using a number of samples polynomial in the relevant problem parameters.

Is it possible for Q learning to converge to a non optimal policy?

Under the standard conditions (a tabular representation, every state-action pair visited infinitely often, and appropriately decaying learning rates), Q-learning converges to Q*, from which the optimal policy is derived, so in that setting it does not converge to a non-optimal policy.

Is it possible that Q Learning does not converge?

Tabular Q-learning converges to the optimal action-value function given sufficient exploration and appropriately decaying learning rates; with function approximation or insufficient exploration, convergence is not guaranteed.
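
A minimal sketch of tabular Q-learning with epsilon-greedy exploration, using the same hypothetical env placeholder as earlier sketches plus a finite `actions` list; the convergence guarantee above assumes every state-action pair is visited infinitely often and the step sizes decay appropriately:

    import random
    import collections

    def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
        """Tabular Q-learning (off-policy TD control)."""
        Q = collections.defaultdict(float)             # keyed by (state, action)
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # epsilon-greedy behavior policy
                if random.random() < eps:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # bootstrap from the greedy action in the next state (off-policy)
                best_next = max(Q[(next_state, a)] for a in actions) if not done else 0.0
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q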

What is the difference between the reward of state S, R(S) and the utility of state S, U(S)?

R(S) is the immediate reward and the utility (U(S)) is the long term reward.

What is SARSA?

SARSA is an on-policy TD control algorithm named after the tuple it updates from: State, Action, Reward, next State, next Action (S, A, R, S', A').
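
A minimal sketch of tabular SARSA with the same hypothetical env/actions placeholders as the Q-learning sketch above; unlike Q-learning, the update bootstraps from the action actually selected in the next state:

    import random
    import collections

    def sarsa(env, actions, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
        """Tabular SARSA (on-policy TD control)."""
        def eps_greedy(Q, state):
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        Q = collections.defaultdict(float)
        for _ in range(episodes):
            state, done = env.reset(), False
            action = eps_greedy(Q, state)
            while not done:
                next_state, reward, done = env.step(action)
                next_action = eps_greedy(Q, next_state)
                target = reward + gamma * Q[(next_state, next_action)] * (not done)
                Q[(state, action)] += alpha * (target - Q[(state, action)])   # S, A, R, S', A'
                state, action = next_state, next_action
        return Q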

What are some advantages of TD Prediction Methods?

TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions. TD methods have an advantage over Monte Carlo methods in that they are naturally implemented in an online, fully incremental fashion. With Monte Carlo methods you must wait until the end of an episode, because only then is the return known, whereas TD methods need to wait only one time step.

What is TD(1)?

TD(1) uses full (undecayed-by-λ) eligibility traces, so each update is propagated back to all previously visited states in the episode; it behaves like a Monte Carlo method that credits every earlier state with the eventual return.

What is TD?

Temporal Difference (TD) learning is a combination of Monte Carlo ideas and Dynamic Programming ideas. Like Monte Carlo methods, TD can learn directly from raw experience without a model of the environment's dynamics. Like Dynamic Programming, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

How does Folk Theorem help solve the prisoner's dilemma?

The Folk Theorem helps address the prisoner's dilemma because, when the game is repeated, each prisoner can condition future play on the other's behavior. The threat of retaliation in later rounds gives each prisoner an incentive to cooperate rather than defect for a one-shot gain.

What is KWIK framework?

The KWIK framework stands for Knows What It Knows: the learner must either make an accurate prediction or answer "I don't know," and the number of "I don't know" responses must be polynomially bounded. This makes it useful for exploration, since the learner explicitly flags where it lacks knowledge.

What are the inputs and outputs to function approximation?

The inputs and outputs are problem specific. Typically the input is a representation of the state (or state-action pair), containing the relevant information about the environment, and the output is the quantity being approximated, such as an estimated value. For example, a continuous state can be mapped to a discrete representation so that an algorithm such as Q-learning, which requires discrete states, can be applied.

What are the inputs and outputs of TD?

The input to TD is a policy and the output is the estimated value of that policy.

What are the inputs and outputs of Value Iteration?

The input to Value Iteration is an MDP; the output is the optimal value function, from which a deterministic (greedy) policy is extracted.

What is the Markov Property?

The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.

What is the meaning of the Bellman equation?

The value of a state is the expected immediate reward plus the discounted expected value of the successor state.

What are the components of an MDP?

There are five basic parts to an MDP:
- States: S
- Model: T(S, A, S') -> Pr(S' | S, A)
- Actions: A(S), A
- Rewards: R(S), R(S, A), R(S, A, S')
- Policy: π(S) -> A

What is the Bellman Equation?

V(s) = max_a [ R(s, a) + γ * Σ_s' T(s, a, s') V(s') ]

How does Value Iteration converge?

The greedy policy derived from the value estimates converges in a finite number of iterations: once the greedy policy stops changing, value iteration has effectively converged, even if the value estimates themselves are still being refined.

What are some good methods to handle environments with continuous states? What are their Pros and Cons?

One good method is function approximation, such as a neural network or a discretization scheme. Pros: continuous states can be handled without enumerating every state, and the approximator generalizes to unseen states. Cons: added computational cost and possible loss of information (approximation error) introduced by approximating the function.
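
A minimal sketch of semi-gradient TD(0) prediction with a linear function approximator, one common way to handle continuous states; `features(state)` is a hypothetical function returning a fixed-length NumPy feature vector, and env/policy are the same placeholders as in earlier sketches:

    import numpy as np

    def semi_gradient_td0(env, policy, features, n_features, episodes=1000,
                          alpha=0.01, gamma=0.99):
        """Semi-gradient TD(0) with a linear approximator: V(s) ~ w . features(s)."""
        w = np.zeros(n_features)
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                x, x_next = features(state), features(next_state)
                v_next = 0.0 if done else w @ x_next
                delta = reward + gamma * v_next - w @ x   # TD error
                w += alpha * delta * x                    # gradient of w.x w.r.t. w is x
                state = next_state
        return w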

