
Q-Learning and SARSA: Off-Policy and On-Policy RL
- Tanya Mishra
CS 5368 Intelligent Systems
Fall 2010
Brief Recap:
• What is a Markov Decision Process?
MDPs provide a mathematical framework for modeling
decision-making in situations where outcomes are
partly random and partly under the control of a
decision maker.
For instance, at each time step the process is in some
state s, and the decision maker may choose any action
a that is available in state s. The process then randomly
moves into a new state s', giving the decision maker a
corresponding reward R_a(s, s').
The probability that the process moves to s' as its new
state is influenced by the chosen action. Specifically, it
is given by the state transition function T_a(s, s'). Thus,
the next state s' depends on the current state s and the
decision maker's action a.
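As a concrete sketch (not from the slides), a small MDP can be written down directly as its transition function T_a(s, s') and reward function R_a(s, s'); the states, actions, and numbers below are made up purely for illustration:

# A made-up two-state MDP: transition probabilities T_a(s, s') and
# rewards R_a(s, s') stored as plain dictionaries (illustrative only).
T = {
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},   # "go" in s0 usually reaches s1
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
    ("s1", "stay"): {"s1": 1.0},
}
R = {
    ("s0", "go", "s1"): 1.0,    # reward for moving s0 -> s1 under "go"
    ("s1", "go", "s0"): -0.5,
}

def reward(s, a, s_next):
    # R_a(s, s'): reward for moving from s to s' under action a (default 0).
    return R.get((s, a, s_next), 0.0)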
• What is Reinforcement Learning?
Reinforcement learning is an area concerned
with how an agent ought to take actions in an
environment so as to maximize some notion of
reward.
In machine learning, the environment is
typically formulated as an MDP.
An optimal policy is a policy that maximizes
the expected reward (reinforcement/feedback)
from a state.
Thus, the task of RL is to use observed rewards
to find an optimal policy for the environment.
• The Three Agents:
1. Utility-based agent: A utility-based agent learns a
utility function on states and selects actions that
maximize the expected utility of the outcome. (Model-based)
2. Q-learning agent: It learns an action-utility function, or
Q-function, giving the expected utility of taking a
given action in a given state. (Model-free)
3. Reflex agent: The agent function is based on the
condition-action rule: if condition then action.
A model-based agent can handle partially observable
environments: its current state is stored inside the
agent, which maintains some kind of structure describing
the part of the world that cannot be seen.
• Passive Learning:
The agent's policy is fixed, and the task is to learn how
good that policy is, i.e. to learn the utility function U.
• Active Learning:
Here the agent must also learn what actions to take.
The principal issue is the trade-off between "exploration",
where the agent must experience as much of the
environment as possible in order to learn how to behave
in it, and "exploitation", where the agent maximizes its
reward as reflected in its current utility estimates.
(A common scheme for balancing the two is sketched below.)
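One standard way to trade off exploration against exploitation (the slides do not name a specific scheme) is epsilon-greedy action selection: with probability epsilon the agent tries a random action, otherwise it takes the action its current estimates rate best. A minimal Python sketch, with illustrative names:

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Explore: with probability epsilon, pick an action uniformly at random.
    if random.random() < epsilon:
        return random.choice(actions)
    # Exploit: otherwise pick the action with the highest current Q-value.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))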
• Passive RL-Temporal Difference Learning:
TD resembles a Monte Carlo method because it learns
by sampling the environment according to some policy.
TD is related to dynamic programming techniques
because it approximates its current estimate based on
previously learned estimates.
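For passive policy evaluation, the standard TD(0) update (as in Russell and Norvig [1]) nudges the utility of the state just left toward the observed reward plus the discounted estimate of the successor: U(s) <- U(s) + α [R(s) + γ U(s') - U(s)]. A minimal sketch, with illustrative names and parameters:

def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    # Passive TD(0): move U(s) toward the sampled target r + gamma * U(s').
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U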
Active RL: Learning the Q-Function
• Q(s,a) - the value of doing action a in state s.
Q-values are related to utility values as follows:
U(s) = max_a Q(s,a)
• A TD agent that learns a Q-function does not
need a model of the form P(s’|s,a), either for
learning or for action selection. For this reason,
Q-Learning is Model-Free.
• The update equation for TD Q-learning is as
follows:
• Q(s,a) Q(s,a)+a(R(s)+gmaxa’Q(s’,a’)-Q(s,a)),
which is calculated whenever action a is executed
in state s leading to state s’.
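To make the update concrete, here is one step with made-up numbers (α = 0.1, γ = 0.9, old Q(s,a) = 0.5, reward R(s) = 1.0, best next value max_a' Q(s',a') = 0.8):

alpha, gamma = 0.1, 0.9                 # illustrative learning rate and discount
q_old, r, max_q_next = 0.5, 1.0, 0.8    # made-up current values
q_new = q_old + alpha * (r + gamma * max_q_next - q_old)
print(q_new)                            # 0.5 + 0.1 * (1.72 - 0.5) = 0.622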
Algorithm:
• The problem model: an agent, a set of states S, and a
set of actions A available in each state.
• By performing an action a, where a is in A ,
the agent can move from state to state.
• Each state provides the agent a reward (a
positive reward) or punishment (a negative
reward).
• The goal of the agent is to maximize its total
reward.
• It does this by learning which action is
optimal for each state.
• The algorithm therefore has a function which
calculates the Quality of a state-action
combination:
Q: S × A → R
• Before learning has started, Q returns a fixed
value, chosen by the designer.
• Then, each time the agent is given a reward (i.e. the
state has changed), a new value is calculated for the
combination of the state s and action a that was just
used.
• The core of the algorithm is a simple "value
iteration update". It takes the old value and
makes a correction based on the new
information.
• Q(s,a) Q(s,a)+a(R(s)+gmaxa’Q(s’,a’)-Q(s,a)),
where,
Q(s,a) on RHS- Old value
a- Learning Rate
R(s)- Reward
g- Discount Factor
maxa’Q(s’,a’)- maximum future value
R(s)+gmaxa’Q(s’,a’)- expected discount reward
• Learning rate:
The learning rate determines to what extent the
newly acquired information will override the old
information. A factor of 0 will make the agent not
learn anything, while a factor of 1 would make the
agent consider only the most recent information.
• Discount factor:
The discount factor determines the importance of
future rewards. A factor of 0 will make the agent
"opportunistic" by only considering current
rewards, while a factor approaching 1 will make it
strive for a long-term high reward.
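Putting the pieces together, the sketch below runs tabular Q-learning with an epsilon-greedy behaviour policy on a tiny made-up chain environment; the environment, state names, and parameter values are illustrative, not from the slides:

import random

# Made-up 4-state chain: states 0..3, two actions, reaching state 3
# yields reward +1 and ends the episode.
N_STATES, ACTIONS, GOAL = 4, ("left", "right"), 3

def step(s, a):
    s_next = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    r = 1.0 if s_next == GOAL else 0.0
    return s_next, r, s_next == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy (ties broken at random)
            if random.random() < epsilon:
                a = random.choice(ACTIONS)
            else:
                best = max(Q[(s, x)] for x in ACTIONS)
                a = random.choice([x for x in ACTIONS if Q[(s, x)] == best])
            s_next, r, done = step(s, a)
            # Off-policy backup: uses the best next Q-value, regardless of
            # which action the behaviour policy will actually take next.
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

print({k: round(v, 2) for k, v in q_learning().items()})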
SARSA
• The SARSA (state-action-reward-state-action) update rule
is very similar to the equation we just learned:
Q(s,a) ← Q(s,a) + α [R(s) + γ Q(s',a') - Q(s,a)],
where a' is the action actually taken in state s'.
• The rule is applied at the end of each (s, a, r, s', a') transition.
• Difference with Q learning:
Q-learning backs up the best Q-value from the
state reached while SARSA waits until an action is
taken and then backs up the Q-value from that
action.
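A minimal sketch of the SARSA backup, mirroring the Q-learning update above but bootstrapping from the action a' that the policy actually takes in s' (names and parameters are illustrative):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy backup: uses Q(s', a') for the action actually taken,
    # instead of max_a' Q(s', a') as in Q-learning.
    Q.setdefault((s, a), 0.0)
    Q.setdefault((s_next, a_next), 0.0)
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q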
On-Policy vs. Off-Policy
• Q-learning does not pay attention to what
policy is being followed. Instead, it just uses
the best Q-value. Thus, it is an off-policy
learning algorithm.
• It is called off-policy because the policy
being learned can be different from the policy
being executed.
• SARSA is an on-policy learning Algorithm. It
updates value functions strictly on the basis of
the experience gained from executing some
(possibly non-stationary) policy.
Can Q-Learning Be Done On-Policy?
• No. But we can use similar techniques. I can
provide two ways of thinking to support this
claim:
1. SARSA is identical to Q-learning except that
it waits for an action to actually happen, and
then backs up the Q-value for that action. For
a greedy agent that always takes the action
with the best Q-value, the two are identical.
2. We can use QV-learning to implement on-policy
Q-learning. QV-learning is a natural extension
of Q-learning and SARSA (see the sketch below).
• It is an on-policy learning algorithm that
learns Q-values. Instead of taking the
maximum future value, it uses the state-value
function V of the state at time t+1.
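Following the description above (and Wiering's paper [2]), QV-learning keeps both a state-value function V, learned by TD, and a Q-function whose backup bootstraps from V(s') rather than from max_a' Q(s',a'). A minimal one-step sketch, with illustrative names and parameters:

def qv_update(Q, V, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.9):
    # One QV-learning step: V is updated by TD, and Q bootstraps from V(s').
    V.setdefault(s, 0.0)
    V.setdefault(s_next, 0.0)
    Q.setdefault((s, a), 0.0)
    target = r + gamma * V[s_next]            # shared target based on V(s')
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    V[s] += beta * (target - V[s])
    return Q, V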
References
• [1] Stuart Russell and Peter Norvig, Artificial
Intelligence: A Modern Approach (Third Edition).
• [2] Marco A. Wiering, QV(λ)-learning: A New
On-policy Reinforcement Learning Algorithm.
• [3] www.wikipedia.com
Thank You!