Ch. 10d: Reinforcement Learning


10d
Machine Learning: Symbol-based
10.0 Introduction
10.1 A Framework for Symbol-based Learning
10.2 Version Space Search
10.3 The ID3 Decision Tree Induction Algorithm
10.4 Inductive Bias and Learnability
10.5 Knowledge and Learning
10.6 Unsupervised Learning
10.7 Reinforcement Learning
10.8 Epilogue and References
10.9 Exercises
Additional references for the slides:
Thomas Dean, James Allen, and Yiannis Aloimonos,
Artificial Intelligence: Theory and Practice,
Addison Wesley, 1995, Section 5.9.
1
Reinforcement Learning
• A form of learning where the agent can explore and learn through interaction with the environment
• The agent learns a policy, which is a mapping from states to actions. The policy tells what the best move is in a particular state.
• It is a general methodology: planning, decision making, and search can all be viewed as some form of reinforcement learning.
2
Tic-tac-toe: a different approach
• Recall the minimax approach: the agent knows its current state, generates a two-layer search tree taking into account all the possible moves for itself and the opponent, backs up values from the leaf nodes, and takes the best move assuming that the opponent will do the same.
• An alternative is to start playing directly against an opponent (who does not have to be perfect, but could well be). Assume no prior knowledge or lookahead. Assign "values" to states:
1 is a win
0 is a loss or a draw
0.5 is anything else
3
Notice that 0.5 is arbitrary; it cannot differentiate between good moves and bad moves, so the learner has no guidance initially. It engages in playing. When a game ends, if it is a win, the value 1 is propagated backwards; if it is a draw or a loss, the value 0 is propagated backwards. Eventually, earlier states will be labeled to reflect their "true" value. After several plays, the learner will have learned the best move for a given state (a policy).
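As a concrete illustration of this value-backup idea, here is a minimal Python sketch (not from the slides; the state labels, the step size ALPHA, and the chained update rule are illustrative choices). After each game, the final outcome is propagated backwards through the states that were visited:

```python
# Minimal sketch of backing up the outcome of a finished game.
# States are abstract labels here; a real learner would use board positions.
values = {}        # state -> estimated value
DEFAULT = 0.5      # the arbitrary initial value for "anything else"
ALPHA = 0.1        # step size: how far a value moves toward the backed-up target

def value(state):
    return values.get(state, DEFAULT)

def backup(visited_states, outcome):
    """Propagate the final outcome (1 for a win, 0 for a loss or draw)
    backwards through the sequence of states visited during the game."""
    target = outcome
    for state in reversed(visited_states):
        values[state] = value(state) + ALPHA * (target - value(state))
        target = values[state]          # earlier states chase the later ones

# Example: one game that ended in a win.
backup(["start", "mid", "late", "winning_position"], outcome=1.0)
print(values)
```

Over many games, the values of states that reliably lead to wins drift toward 1, which gives the learner a policy: in any state, pick the move leading to the successor with the highest value.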
Issues in generalizing this approach
• How will the state values be initialized or
propagated backwards?
• What if there is no end to the game
(infinite horizon)?
• This is an optimization problem, which suggests that it is hard. How can an optimal policy be learned?
5
A simple robot domain
[Figure: four offices 0, 1, 2, and 3 connected in a ring; arcs labeled +, -, and @ show the possible moves from each office.]
The robot is in one of the states 0, 1, 2, 3. Each one represents an office; the offices are connected in a ring.
Three actions are available:
+ moves to the "next" state
- moves to the "previous" state
@ remains at the same state
6
The robot domain (cont’d)
• The robot can observe the label of the state it
is in and perform any action corresponding to
an arc leading out of its current state.
• We assume that there is a clock governing the
passage of time, and that at each tick of the
clock the robot has to perform an action.
• The environment is deterministic: there is a unique state resulting from any initial state and action.
• Each state has a reward:
10 for state 3, 0 for the others.
7
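The four-office domain is small enough to write out in full. The sketch below (names such as next_state and REWARD are mine, not the slides') encodes the deterministic transition function and the reward of 10 for office 3:

```python
# The four-office ring: states, actions, transitions, and rewards.
STATES = [0, 1, 2, 3]
ACTIONS = ['+', '-', '@']              # next office, previous office, stay put

def next_state(j, a):
    """Deterministic state-transition function: the offices form a ring."""
    if a == '+':
        return (j + 1) % 4             # move to the "next" office
    if a == '-':
        return (j - 1) % 4             # move to the "previous" office
    return j                           # '@' remains in the same office

REWARD = {0: 0, 1: 0, 2: 0, 3: 10}     # 10 for office 3, 0 for the others

print(next_state(3, '+'), next_state(0, '-'))   # 0 3 (the ring wraps around)
```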
The reinforcement learning problem
• Given information about the environment
 States
 Actions
 State-transition function (or diagram)
• Output a policy π: states → actions, i.e., find the best action to execute at each state
• Assumes that the state is completely
observable (the agent always knows which
state it is in)
8
Compare three policies
a. Every state is mapped to @
The value of this policy is 0, because the robot will never get to office 3.
b. Every state is mapped to + (call this policy 0)
The value of this policy is unbounded, because the robot will end up in office 3 infinitely often.
c. Every state except 3 is mapped to +; 3 is mapped to @ (call this policy 1)
The value of this policy is also unbounded, because the robot will end up in (and stay in) office 3 infinitely often.
9
Compare three policies (cont'd)
So, it is easy to rule out case a, but how can we show that policy 1 is better than policy 0? One way would be to compute the average reward per tick:
Policy 0: the average reward per tick for state 0 is 10/4.
Policy 1: the average reward per tick for state 0 is 10.
Another way would be to assign higher values to immediate rewards and apply a discount to future rewards.
10
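The two averages above can be checked by simulation. A short sketch, using the same kind of encoding as before (the helper names are illustrative):

```python
# Sketch: estimate a policy's average reward per tick by running it for a while.
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}

def next_state(j, a):
    return (j + 1) % 4 if a == '+' else (j - 1) % 4 if a == '-' else j

policy_0 = {0: '+', 1: '+', 2: '+', 3: '+'}    # every office mapped to +
policy_1 = {0: '+', 1: '+', 2: '+', 3: '@'}    # like policy 0, but stay in office 3

def average_reward(policy, start=0, ticks=10_000):
    j, total = start, 0
    for _ in range(ticks):
        total += REWARD[j]                     # reward of the office occupied at this tick
        j = next_state(j, policy[j])           # then perform the policy's action
    return total / ticks

print(average_reward(policy_0))   # approaches 10/4 = 2.5
print(average_reward(policy_1))   # approaches 10
```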
Discounted cumulative reward
Assume that the robot associates a higher value with more immediate rewards and therefore discounts future rewards.
The discount rate (γ) is a number between 0 and 1 used to discount future rewards.
The discounted cumulative reward for a particular state with respect to a given policy is the sum, for n from 0 to infinity, of γ^n times the reward associated with the state reached after the n-th tick of the clock.
Policy 0: the discounted cumulative reward for state 0 is 1.33.
Policy 1: the discounted cumulative reward for state 0 is 2.5.
11
Discounted cumulative reward (cont’d)
Take γ = 0.5.
For state 0 with respect to policy 0:
0.5^0 x 0 + 0.5^1 x 0 + 0.5^2 x 0 + 0.5^3 x 10 +
0.5^4 x 0 + 0.5^5 x 0 + 0.5^6 x 0 + 0.5^7 x 10 + …
= 1.25 + 0.078 + … = 1.33 in the limit
For state 0 with respect to policy 1:
0.5^0 x 0 + 0.5^1 x 0 + 0.5^2 x 0 + 0.5^3 x 10 +
0.5^4 x 10 + 0.5^5 x 10 + 0.5^6 x 10 + 0.5^7 x 10 + …
= 2.5 in the limit
12
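The same two sums can be reproduced numerically. A sketch (truncating the infinite sum; by 40 terms the remaining tail is negligible, and the first term is the start state's own reward, matching the sums above):

```python
# Sketch: discounted cumulative reward of a policy from a start state.
GAMMA = 0.5
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}

def next_state(j, a):
    return (j + 1) % 4 if a == '+' else (j - 1) % 4 if a == '-' else j

def discounted_return(policy, start=0, horizon=40):
    j, total = start, 0.0
    for n in range(horizon):
        total += (GAMMA ** n) * REWARD[j]   # gamma^n times the reward at tick n
        j = next_state(j, policy[j])
    return total

policy_0 = {0: '+', 1: '+', 2: '+', 3: '+'}
policy_1 = {0: '+', 1: '+', 2: '+', 3: '@'}
print(round(discounted_return(policy_0), 2))   # 1.33
print(round(discounted_return(policy_1), 2))   # 2.5
```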
Discounted cumulative reward (cont’d)
Let j be a state,
R(j) be the reward for ending up in state j,
π be a fixed policy,
π(j) be the action dictated by π in state j,
f(j,a) be the next state given that the robot starts in state j and performs action a,
Vi(j) be the estimated value of state j with respect to the policy π after the i-th iteration of the algorithm.
Using a dynamic programming algorithm, one can obtain a good estimate of Vπ, the value function for policy π, as i → ∞.
13
A dynamic programming algorithm to
compute values for states for a policy π
1. For each j, set V0(j) to 0.
2. Set i to 0.
3. For each j, set Vi+1(j) to R(j) + γ Vi(f(j, π(j))).
4. Set i to i + 1.
5. If i is equal to the maximum number of iterations, then return Vi; otherwise, return to step 3.
14
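A direct transcription of this algorithm might look like the following sketch (the names evaluate_policy, next_state, etc. are mine). Running it reproduces the iteration tables on the next few slides; note that iterations=5 here corresponds to the slides' iteration 4, since the slides count from 0:

```python
# Sketch of the dynamic-programming algorithm for evaluating a fixed policy.
GAMMA = 0.5
STATES = [0, 1, 2, 3]
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}

def next_state(j, a):
    return (j + 1) % 4 if a == '+' else (j - 1) % 4 if a == '-' else j

def evaluate_policy(policy, iterations):
    V = {j: 0.0 for j in STATES}               # step 1: V0(j) = 0 for every j
    for _ in range(iterations):                # steps 2-5: repeat the backup
        V = {j: REWARD[j] + GAMMA * V[next_state(j, policy[j])] for j in STATES}
    return V

policy_0 = {0: '+', 1: '+', 2: '+', 3: '+'}
policy_1 = {0: '+', 1: '+', 2: '+', 3: '@'}
print(evaluate_policy(policy_0, 5))   # {0: 1.25, 1: 2.5, 2: 5.0, 3: 10.625}
print(evaluate_policy(policy_1, 5))   # {0: 1.875, 1: 4.375, 2: 9.375, 3: 19.375}
```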
Values of states for policy 0
• initialize
 V(0) = 0
 V(1) = 0
 V(2) = 0
 V(3) = 0
• iteration 0
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 0 = 0
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 0 = 0
 For office 3: R(3) + γ V(0) = 10 + 0.5 x 0 = 10
 (iteration 0 essentially initializes values of states to their
immediate rewards)
15
Values of states for policy 0 (cont’d)
• iteration 0
V(0) = V(1) = V(2) = 0
V(3)=10
• iteration 1
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 0 = 0
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
 For office 3: R(3) + γ V(0) = 10 + 0.5 x 0 = 10
• iteration 2
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
 For office 3: R(3) + γ V(0) = 10 + 0.5 x 0 = 10
16
Values of states for policy 0 (cont’d)
• iteration 2
V(0) = 0 V(1) = 2.5 V(2) = 5 V(3) = 10
• iteration 3
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 2.5 = 1.25
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
 For office 3: R(3) + γ V(0) = 10 + 0.5 x 0 = 10
• iteration 4
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 2.5 = 1.25
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
 For office 3: R(3) + γ V(0) = 10 + 0.5 x 1.25 = 10.625
17
Values of states for policy 1
• initialize
 V(0) = 0
 V(1) = 0
 V(2) = 0
 V(3) = 0
• iteration 0
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 0 = 0
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 0 = 0
 For office 3: R(3) + γ V(3) = 10 + 0.5 x 0 = 10
18
Values of states for policy 1 (cont’d)
• iteration 0
V(0) = V(1) = V(2) = 0
V(3) = 10
• iteration 1
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 0 = 0
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
 For office 3: R(3) + γ V(3) = 10 + 0.5 x 10 = 15
• iteration 2
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 15 = 7.5
 For office 3: R(3) + γ V(3) = 10 + 0.5 x 15 = 17.5
19
Values of states for policy 1 (cont’d)
• iteration 2
V(0) = 0 V(1) = 2.5 V(2) = 7.5 V(3) = 17.5
• iteration 3
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 2.5 = 1.25
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 7.5 = 3.75
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 17.5 = 8.75
 For office 3: R(3) + γ V(3) = 10 + 0.5 x 17.5 = 18.75
• iteration 4
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 3.75 = 1.875
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 8.75 = 4.375
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 18.75 = 9.375
 For office 3: R(3) + γ V(3) = 10 + 0.5 x 18.75 = 19.375
20
Compare policies
• Policy 0 after iteration 4
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 2.5 = 1.25
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
 For office 3: R(3) + γ V(0) = 10 + 0.5 x 1.25 = 10.625
• Policy 1 after iteration 4
 For office 0: R(0) + γ V(1) = 0 + 0.5 x 3.75 = 1.875
 For office 1: R(1) + γ V(2) = 0 + 0.5 x 8.75 = 4.375
 For office 2: R(2) + γ V(3) = 0 + 0.5 x 18.75 = 9.375
 For office 3: R(3) + γ V(3) = 10 + 0.5 x 18.75 = 19.375
• Policy 1 is better because every state has a higher value than it does under policy 0
21
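Letting the backups run much longer confirms the comparison in the limit. A compact, self-contained sketch (the approximate limiting values are shown in the comments):

```python
# Sketch: evaluate both policies to (near) convergence and compare office by office.
GAMMA = 0.5
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}

def step(j, a):
    return (j + 1) % 4 if a == '+' else (j - 1) % 4 if a == '-' else j

def evaluate(policy, iterations=100):
    V = {j: 0.0 for j in REWARD}
    for _ in range(iterations):
        V = {j: REWARD[j] + GAMMA * V[step(j, policy[j])] for j in REWARD}
    return V

V0 = evaluate({0: '+', 1: '+', 2: '+', 3: '+'})   # about {0: 1.33, 1: 2.67, 2: 5.33, 3: 10.67}
V1 = evaluate({0: '+', 1: '+', 2: '+', 3: '@'})   # about {0: 2.5, 1: 5.0, 2: 10.0, 3: 20.0}
assert all(V1[j] >= V0[j] for j in REWARD)        # policy 1 is at least as good in every office
```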
Temporal credit assignment problem
• It is the problem of assigning credit or blame
to the actions in a sequence of actions where
feedback is available only at the end of the
sequence.
• When you lose a game of chess or checkers,
the blame for your loss cannot necessarily be
attributed to the last move you made, or even
the next-to-the-last move.
• Dynamic programming solves the temporal
credit assignment problem by propagating
rewards backwards to earlier states and hence
to actions earlier in the sequence of actions
determined by a policy.
22
Computing an optimal policy
Given a method for estimating the value of
states with respect to a fixed policy, it is
possible to find an optimal policy. We would like
to maximize the discounted cumulative reward.
Policy iteration [Howard, 1960] is an algorithm
that uses the algorithm for computing the value
of a state as a subroutine.
23
Policy iteration algorithm
1. Let 0 be an arbitrary policy.
2. Set i to 0.
3. Compute V0 (j) for each j.
4. Compute a new policy i+1 so that i+1 (j) is the
action a maximizing R(j) +  Vi( f(j,) ) .
5. If i+1 = i , then return i; otherwise,
set i to i + 1, and go to step 3.
24
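Here is a sketch of policy iteration for the four-office domain (again with illustrative names), using the policy-evaluation routine as its subroutine. Starting from the all-@ policy, it settles after a couple of improvement steps on a policy that keeps office 3 at @ and sends every other office toward office 3 by a shortest route; office 0 goes backwards, since the ring makes office 3 its neighbour, and office 1's two directions are equally good, so that tie is broken by action order:

```python
# Sketch of the policy iteration algorithm on the four-office domain.
GAMMA = 0.5
STATES = [0, 1, 2, 3]
ACTIONS = ['+', '-', '@']
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}

def f(j, a):                                    # deterministic transition function f(j, a)
    return (j + 1) % 4 if a == '+' else (j - 1) % 4 if a == '-' else j

def evaluate(policy, iterations=100):           # value of each state under a fixed policy
    V = {j: 0.0 for j in STATES}
    for _ in range(iterations):
        V = {j: REWARD[j] + GAMMA * V[f(j, policy[j])] for j in STATES}
    return V

def policy_iteration():
    policy = {j: '@' for j in STATES}           # step 1: an arbitrary initial policy
    while True:
        V = evaluate(policy)                    # step 3: V under the current policy
        improved = {j: max(ACTIONS, key=lambda a: REWARD[j] + GAMMA * V[f(j, a)])
                    for j in STATES}            # step 4: pick the greedy action in each state
        if improved == policy:                  # step 5: stop when the policy no longer changes
            return policy, V
        policy = improved

print(policy_iteration())   # ({0: '-', 1: '+', 2: '+', 3: '@'}, values near {0: 10, 1: 5, 2: 10, 3: 20})
```

Note that this improves on policy 1: because the offices form a ring, office 0 can reach office 3 in a single step by moving backwards.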
Policy iteration algorithm (cont’d)
A policy  is said to be the optimal policy if
there is no other policy ’ and state j such that
V’ (j) > V (j) and for all k  j V’ (j) > V (j) .
The policy iteration algorithm is guaranteed to
terminate in a finite number of steps with an
optimal policy.
25
Comments on reinforcement learning
• A general model where an agent can learn to
function in dynamic environments
• The agent can learn while interacting with the
environment
• No prior knowledge except the (probabilistic)
transitions is assumed
• Can be generalized to stochastic domains (an
action might have several different probabilistic
consequences, i.e., the state-transition function
is not deterministic)
• Can also be generalized to domains where the
reward function is not known
26
Famous example: TD-Gammon (Tesauro, 1995)
• Learns to play Backgammon
• Immediate reward:
+100 if win
-100 if lose
0 for all other states
• Trained by playing 1.5 million games against
itself (several weeks)
• Now approximately equal to the best human players (won the World Cup of Backgammon in 1992; among the top 3 since 1995)
• Predecessor: NeuroGammon [Tesauro and
Sejnowski, 1989] learned from examples of
labeled moves (very tedious for human expert)
27
Other examples
• Robot learning to dock on battery charger
• Pole balancing
• Elevator dispatching [Crites and Barto, 1995]:
better than industry standard
• Inventory management [Van Roy et al.]:
10-15% improvement over industry standards
• Job-shop scheduling for NASA space
missions [Zhang and Dietterich, 1997]
• Dynamic channel assignment in cellular
phones [Singh and Bertsekas, 1994]
• Robotic soccer
28
Common characteristics
• delayed reward
• opportunity for active exploration
• possibility that the state is only partially observable
• possible need to learn multiple tasks with
same sensors/effectors
• there may not be an adequate teacher
29