Reinforcement learning


MDPs and the RL Problem
CMSC 471 – Spring 2014
Class #25 – Thursday, May 1
Russell & Norvig Chapter 21.1-21.3
Thanks to Rich Sutton and Andy Barto for the use of their slides
(modified with additional slides and in-class exercise)
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Learning Without a Model
 Last time, we saw how to learn a value function and/or a
policy from a transition model
 What if we don't have a transition model?
 Idea #1 (a sketch of the model-estimation step follows after this list):
  Explore the environment for a long time
  Record all transitions
  Learn the transition model
  Apply value iteration/policy iteration
  Slow and requires a lot of exploration! No intermediate learning!
 Idea #2: Learn a value function (or policy) directly from interactions with the environment, while exploring
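A minimal sketch of the model-estimation step in Idea #1, assuming tabular states and actions and a log of recorded (s, a, r, s') transitions; the function and variable names here are illustrative, not from the slides. Once P and R have been estimated this way, value iteration or policy iteration can be applied as in the previous lecture.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Maximum-likelihood tabular model from recorded (s, a, r, s_next) tuples:
    P(s'|s,a) = count(s,a,s') / count(s,a), R(s,a) = mean observed reward."""
    next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s_next: count}
    reward_sum = defaultdict(float)                      # (s, a) -> summed reward
    visits = defaultdict(int)                            # (s, a) -> visit count

    for s, a, r, s_next in transitions:
        next_counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1

    P = {sa: {s2: c / visits[sa] for s2, c in nexts.items()}
         for sa, nexts in next_counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P, R

# Example with two logged transitions
P, R = estimate_model([("A", "right", -0.1, "B"), ("B", "right", 1.0, "G")])
print(P[("A", "right")], R[("A", "right")])  # {'B': 1.0} -0.1
```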
Simple Monte Carlo
$V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$
where $R_t$ is the actual return following state $s_t$.
[Backup diagram: the Monte Carlo target for $V(s_t)$ is the return of the complete episode, backed up from $s_t$ all the way to the terminal state T.]
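A minimal sketch of the every-visit Monte Carlo update above, assuming a tabular (dict-based) value function; names are illustrative, not from the slides. The episode must be complete before the update, because the target $R_t$ is the actual return that followed each state.

```python
def mc_update(V, episode, alpha=0.1, gamma=0.9):
    """Every-visit Monte Carlo: V(s_t) <- V(s_t) + alpha * (R_t - V(s_t)),
    where R_t is the actual discounted return following s_t.
    `episode` is a list of (s_t, r_{t+1}) pairs in time order; V is a dict."""
    G = 0.0
    # Walk backwards so the return following each state accumulates naturally.
    for state, reward in reversed(episode):
        G = reward + gamma * G              # R_t = r_{t+1} + gamma * R_{t+1}
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)      # nudge V(s_t) toward the observed return
    return V

# Example: a three-step episode ending with reward +1 at the terminal transition
V = mc_update({}, [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)], alpha=0.5, gamma=0.9)
print(V)  # {'s2': 0.5, 's1': 0.45, 's0': 0.405}
```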
TD Prediction
Policy Evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$.
Recall the simple every-visit Monte Carlo method:
$V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$
target: the actual return after time $t$
The simplest TD method, TD(0):
$V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma\,V(s_{t+1}) - V(s_t)]$
target: an estimate of the return
$\gamma$: a discount factor in [0,1] (relative value of future rewards)
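A minimal sketch of one TD(0) update as defined above (tabular dict-based V; names are illustrative, not from the slides). Unlike the Monte Carlo update, it can be applied after every single transition, because the target uses the current estimate $V(s_{t+1})$ in place of the rest of the return.

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One TD(0) step: V(s_t) <- V(s_t) + alpha*(r_{t+1} + gamma*V(s_{t+1}) - V(s_t)).
    The target r_{t+1} + gamma*V(s_{t+1}) is an *estimate* of the return."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)   # no future value after a terminal state
    v = V.get(s, 0.0)
    V[s] = v + alpha * ((r_next + gamma * v_next) - v)
    return V

# Example: a single observed transition s0 --(r = +1)--> s1
V = td0_update({"s0": 0.0, "s1": 0.5}, "s0", 1.0, "s1", alpha=0.5, gamma=0.9)
print(V["s0"])  # 0.5 * (1 + 0.9*0.5 - 0) = 0.725
```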
Simplest TD Method
$V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma\,V(s_{t+1}) - V(s_t)]$
[Backup diagram: the TD(0) target for $V(s_t)$ is backed up from the single transition $s_t \xrightarrow{r_{t+1}} s_{t+1}$, rather than from the whole episode down to the terminal state T.]
Temporal Difference Learning
 TD-learning:
$U^\pi(s) \leftarrow U^\pi(s) + \alpha\,(R(s) + \gamma\,U^\pi(s') - U^\pi(s))$
or equivalently:
$U^\pi(s) \leftarrow \alpha\,[R(s) + \gamma\,U^\pi(s')] + (1-\alpha)\,U^\pi(s)$
where $\alpha$ is the learning rate, $\gamma$ the discount rate, $R(s)$ the observed reward, $U^\pi(s)$ the previous utility estimate, and $U^\pi(s')$ the previous utility estimate for the successor state.
 General idea: Iteratively update utility values, assuming that current utility values for other (local) states are correct
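As a quick numeric check that the two forms above agree (illustrative values, not from the slides): take $\alpha = 0.5$, $\gamma = 0.9$, $R(s) = 1$, $U^\pi(s) = 0.5$, and $U^\pi(s') = 2$. Then

$U^\pi(s) \leftarrow 0.5 + 0.5\,(1 + 0.9 \cdot 2 - 0.5) = 0.5 + 0.5 \cdot 2.3 = 1.65$
$U^\pi(s) \leftarrow 0.5\,[1 + 0.9 \cdot 2] + (1 - 0.5) \cdot 0.5 = 1.4 + 0.25 = 1.65$

so both forms produce the same updated estimate.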
Exploration vs. Exploitation
 Problem with naive reinforcement learning:
 What action to take?
 Best apparent action, based on learning to date
– Greedy strategy
– Often prematurely converges to a suboptimal policy!
 Random action
– Will cover entire state space
– Very expensive and slow to learn!
– When to stop being random?
 Balance exploration (try random actions) with
exploitation (use the best action found so far); one common scheme, ε-greedy, is sketched below
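A minimal sketch of ε-greedy action selection, assuming tabular Q-values (the function name and the fixed ε are illustrative, not from the slides):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit (best action under the current Q estimates)."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

# Example: usually picks 'right' (higher estimated Q), but occasionally explores
Q = {("A", "left"): -0.9, ("A", "right"): -0.09}
action = epsilon_greedy(Q, "A", ["left", "right"], epsilon=0.1)
```

In practice ε is often decayed over time, so the agent explores heavily at first and becomes greedier as its estimates improve.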
Q-Learning
Q-value: the value of taking action $a$ in state $s$ (as opposed to $V$ = the value of state $s$).
Estimate $Q^\pi$ for the current behavior policy $\pi$.
After every transition from a nonterminal state $s_t$, do this:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[R_{t+1} + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.
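A minimal sketch of the update rule exactly as written on this slide, which bootstraps from $Q(s_{t+1}, a_{t+1})$, the value of the next action actually taken (tabular dict representation; names are illustrative, not from the slides):

```python
def q_update(Q, s, a, r_next, s_next, a_next, alpha=0.9, gamma=0.9, terminal=False):
    """Q(s_t,a_t) <- Q(s_t,a_t) + alpha*[R_{t+1} + gamma*Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)].
    If s_{t+1} is terminal, Q(s_{t+1}, a_{t+1}) is taken to be 0."""
    q_next = 0.0 if terminal else Q.get((s_next, a_next), 0.0)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * ((r_next + gamma * q_next) - q)
    return Q
```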
Q-Learning Exercise
[Grid: states A, B, G in a row; the agent moves with ← and →; G is terminal.]
Q-learning reminder:
$Q(s_t, a_t) \leftarrow \alpha\,[R_{t+1} + \gamma\,Q(s_{t+1}, a_{t+1})] + (1-\alpha)\,Q(s_t, a_t)$
 Starting state: A
 Reward function: ← in A yields -1 (at time t+1!); → in B yields +1; all other actions yield -0.1; G is a terminal state
 Action sequence: ←, ←, →, ←, →, →
 All Q-values are initialized to zero (including Q(G, *))
 Fill in the following table for the six Q-learning updates:
t    a_t   S_t   R_{t+1}   S_{t+1}   Q'(s_t, a_t)
0    ←     A
1    ←
2    →
3    ←
4    →
5    →
Q-Learning Exercise
[Grid: states A, B, G in a row; the agent moves with ← and →; G is terminal.]
Q-learning reminder:
$Q(s_t, a_t) \leftarrow \alpha\,[R_{t+1} + \gamma\,Q(s_{t+1}, a_{t+1})] + (1-\alpha)\,Q(s_t, a_t)$
 Starting state: A
 Reward function: ← in A yields -1 (at time t+1!); → in B yields +1; all other actions yield -0.1; G is a terminal state
 Action sequence: ←, ←, →, ←, →, →
 All Q-values are initialized to zero (including Q(G, *)); α and γ are 0.9
 Fill in the following table for the six Q-learning updates:
t    a_t   S_t   R_{t+1}   S_{t+1}   Q'(s_t, a_t)
0    ←     A     -1        A         -0.9
1    ←     A     -1        A         -0.99
2    →     A     -0.1      B         -0.09
3    ←     B     -0.1      A         -0.162
4    →     A     -0.1      B         -0.099
5    →     B     +1        G         0.9
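A short script reproducing the six updates in the table above (the left/right action labels are inferred from the grid and the reward description; the rewards, states, and α = γ = 0.9 come directly from the exercise):

```python
# Q(s,a) <- alpha*[R + gamma*Q(s',a')] + (1 - alpha)*Q(s,a), with alpha = gamma = 0.9
alpha, gamma = 0.9, 0.9
Q = {}

# (s_t, a_t, R_{t+1}, s_{t+1}, a_{t+1}); a_{t+1} is unused at the terminal state G
steps = [
    ("A", "left",  -1.0, "A", "left"),
    ("A", "left",  -1.0, "A", "right"),
    ("A", "right", -0.1, "B", "left"),
    ("B", "left",  -0.1, "A", "right"),
    ("A", "right", -0.1, "B", "right"),
    ("B", "right",  1.0, "G", None),
]

for t, (s, a, r, s_next, a_next) in enumerate(steps):
    q_next = 0.0 if s_next == "G" else Q.get((s_next, a_next), 0.0)   # Q(G, *) = 0
    Q[(s, a)] = alpha * (r + gamma * q_next) + (1 - alpha) * Q.get((s, a), 0.0)
    print(t, s, a, round(Q[(s, a)], 4))
# Prints -0.9, -0.99, -0.09, -0.1629 (~ -0.162 in the table), -0.099, 0.9
```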