Reinforcement learning
MDPs and the RL Problem
CMSC 471 – Spring 2014
Class #25 – Thursday, May 1
Russell & Norvig, Sections 21.1-21.3
Thanks to Rich Sutton and Andy Barto for the use of their slides
(modified with additional slides and in-class exercise)
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Learning Without a Model
Last time, we saw how to learn a value function and/or a
policy from a transition model
What if we don't have a transition model?
Idea #1:
– Explore the environment for a long time
– Record all transitions
– Learn the transition model (see the sketch below)
– Apply value iteration/policy iteration
Slow and requires a lot of exploration! No intermediate learning!
Idea #2: Learn a value function (or policy) directly from
interactions with the environment, while exploring
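For Idea #1, the "learn the transition model" step is essentially counting: estimate P(s' | s, a) from the recorded transitions and hand the result to value/policy iteration. A minimal sketch, with made-up state and action names:

```python
from collections import defaultdict

# Hypothetical log of (state, action, next_state) transitions recorded while exploring.
transitions = [
    ("s1", "right", "s2"),
    ("s1", "right", "s2"),
    ("s1", "right", "s1"),
    ("s2", "left",  "s1"),
]

# Count outcomes per (state, action) pair, then normalize counts into probabilities.
counts = defaultdict(lambda: defaultdict(int))
for s, a, s_next in transitions:
    counts[(s, a)][s_next] += 1

T = {
    sa: {s2: n / sum(outcomes.values()) for s2, n in outcomes.items()}
    for sa, outcomes in counts.items()
}

print(T[("s1", "right")])  # roughly {'s2': 0.67, 's1': 0.33}
```

The drawback noted above still holds: nothing useful is learned until the exploration phase is over.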
Simple Monte Carlo
V(s_t) ← V(s_t) + α [R_t − V(s_t)]
where R_t is the actual return following state s_t.
[Backup diagram: a complete episode from s_t to a terminal state T; the update waits for the full observed return R_t.]
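A minimal sketch of this every-visit Monte Carlo update in code, assuming each episode arrives as a list of (s_t, r_{t+1}) pairs (the function and variable names are mine, not from the slides):

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo: V(s_t) <- V(s_t) + alpha * (R_t - V(s_t)).

    episode: list of (s_t, r_{t+1}) pairs for one complete episode.
    """
    # Work backwards so G accumulates the discounted return following each state.
    G = 0.0
    returns = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        _, r = episode[t]
        G = r + gamma * G
        returns[t] = G
    # Nudge each visited state's value toward the return actually observed.
    for (s, _), R_t in zip(episode, returns):
        v = V.get(s, 0.0)
        V[s] = v + alpha * (R_t - v)
    return V
```

Note that no value changes until the episode reaches a terminal state, which is the key contrast with the TD methods on the next slides.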
TD Prediction
Policy Evaluation (the prediction problem):
for a given policy π, compute the state-value function V^π
Recall:
Simple every-visit Monte Carlo method:
V(s_t) ← V(s_t) + α [R_t − V(s_t)]
target: the actual return after time t
The simplest TD method, TD(0):
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
target: an estimate of the return
γ: a discount factor in [0,1] (relative value of future rewards)
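A minimal sketch of a single TD(0) update, applied after every transition rather than at the end of an episode (names and default step sizes are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    v, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
    # r + gamma * v_next is the bootstrapped target; the difference is the TD error.
    V[s] = v + alpha * (r + gamma * v_next - v)
    return V
```

Because the target bootstraps from the current estimate V(s_{t+1}), learning proceeds during the episode instead of waiting for the actual return.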
Simplest TD Method
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
[Backup diagram: a single transition from s_t to s_{t+1} with reward r_{t+1}; the update uses one step of experience plus the current estimate V(s_{t+1}).]
Temporal Difference Learning
TD-learning:
U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') − U^π(s))
or equivalently:
U^π(s) ← α [ R(s) + γ U^π(s') ] + (1 − α) [ U^π(s) ]
Here α is the learning rate, γ is the discount rate, R(s) is the observed reward, U^π(s) is the previous utility estimate, and U^π(s') is the previous utility estimate for the successor state.
General idea: Iteratively update utility values, assuming
that current utility values for other (local) states are correct
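A quick numerical check of the equivalence claimed above, using made-up values for the learning rate, discount rate, reward, and utility estimates:

```python
alpha, gamma = 0.5, 0.9           # learning rate and discount rate (assumed values)
U_s, U_next, R = 0.2, 0.6, -0.04  # current utility estimates and observed reward (assumed)

form1 = U_s + alpha * (R + gamma * U_next - U_s)          # incremental form
form2 = alpha * (R + gamma * U_next) + (1 - alpha) * U_s  # weighted-average form
assert abs(form1 - form2) < 1e-12
print(round(form1, 4), round(form2, 4))  # both 0.35
```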
Exploration vs. Exploitation
Problem with naive reinforcement learning:
What action to take?
Best apparent action, based on learning to date
– Greedy strategy
– Often prematurely converges to a suboptimal policy!
Random action
– Will cover entire state space
– Very expensive and slow to learn!
– When to stop being random?
Balance exploration (try random actions) with exploitation (use best action so far); one standard way to strike this balance, the ε-greedy rule, is sketched below.
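An ε-greedy rule acts randomly with probability ε, otherwise acts greedily with respect to the current estimates, and typically decays ε over time. A minimal sketch, assuming Q-values are stored in a dict keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Return a random action with probability epsilon, else the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

Gradually lowering ε answers the "when to stop being random?" question: explore heavily early on, then rely more and more on what has been learned.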
Q-Learning
Q-value: Value of taking action A in state S
(as opposed to V = value of state S)
Estimate Q^π for the current behavior policy π.
After every transition from a nonterminal state s_t, do this:
Q(s_t, a_t) ← Q(s_t, a_t) + α [R_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
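A minimal sketch of this update as code, with the Q-table stored as a dict keyed by (state, action); the function name and defaults are mine:

```python
def q_update(Q, s, a, r, s_next, a_next, alpha=0.9, gamma=0.9, terminal=False):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    q = Q.get((s, a), 0.0)
    q_next = 0.0 if terminal else Q.get((s_next, a_next), 0.0)  # Q(terminal, *) = 0
    Q[(s, a)] = q + alpha * (r + gamma * q_next - q)
    return Q
```

Note that the target uses the action a_{t+1} actually taken next, which is exactly the form applied in the exercise that follows.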
Q-Learning Exercise
[Diagram: a small grid world with states A and B and a terminal goal state G.]
Q-learning reminder:
Q(s_t, a_t) ← α [R_{t+1} + γ Q(s_{t+1}, a_{t+1})] + (1 − α) Q(s_t, a_t)
Starting state: A
Reward function: in A yields -1 (at time t+1!); in B yields +1;
all other actions yield -0.1; G is a terminal state
Action sequence:
All Q-values are initialized to zero (including Q(G, *))
Fill in the following table for the six Q-learning updates:
t   a_t   s_t   R_{t+1}   s_{t+1}   Q'(s_t, a_t)
0         A
1
2
3
4
5
Q-Learning Exercise (Solution)
[Diagram: a small grid world with states A and B and a terminal goal state G.]
Q-learning reminder:
Q(s_t, a_t) ← α [R_{t+1} + γ Q(s_{t+1}, a_{t+1})] + (1 − α) Q(s_t, a_t)
Starting state: A
Reward function: in A yields -1 (at time t+1!); in B yields +1;
all other actions yield -0.1; G is a terminal state
Action sequence:
All Q-values are initialized to zero (including Q(G, *)); α and γ are 0.9
Fill in the following table for the six Q-learning updates:
t   a_t   s_t   R_{t+1}   s_{t+1}   Q'(s_t, a_t)
0         A     -1        A         -0.9
1         A     -1        A         -0.99
2         A     -0.1      B         -0.09
3         B     -0.1      A         -0.162
4         A     -0.1      B         -0.099
5         B     +1        G         0.9
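As a sanity check, the table can be reproduced by replaying the episode through the update rule. The "left"/"right" action labels below are inferred from the state transitions (the slide shows the action sequence only graphically), so treat them as assumptions; the printed values agree with the table up to rounding:

```python
alpha = gamma = 0.9
# (s_t, a_t, R_{t+1}, s_{t+1}) for the six steps; action labels are inferred, not given.
episode = [
    ("A", "left",  -1.0, "A"),
    ("A", "left",  -1.0, "A"),
    ("A", "right", -0.1, "B"),
    ("B", "left",  -0.1, "A"),
    ("A", "right", -0.1, "B"),
    ("B", "right",  1.0, "G"),  # G is terminal, so Q(G, *) stays 0
]

Q = {}  # all Q-values start at zero
for t, (s, a, r, s_next) in enumerate(episode):
    # Q(s_{t+1}, a_{t+1}) is the value of the next step's state-action pair.
    if t + 1 < len(episode):
        next_s, next_a = episode[t + 1][0], episode[t + 1][1]
        q_next = Q.get((next_s, next_a), 0.0)
    else:
        q_next = 0.0
    Q[(s, a)] = alpha * (r + gamma * q_next) + (1 - alpha) * Q.get((s, a), 0.0)
    print(t, s, a, r, s_next, round(Q[(s, a)], 3))
```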