Structure and Synthesis of Robot Motion Introduction

Reinforcement Learning
Dynamic Programming I
Subramanian Ramamoorthy
School of Informatics
31 January, 2012
Continuing from last time…
MDPs and Formulation of RL Problem
Returns
Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ...
What do we want to maximize?
In general, we want to maximize the expected return, E{R_t}, for each step t.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a
game, trips through a maze.
R_t = r_{t+1} + r_{t+2} + ... + r_T,
where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate.

shortsighted 0 ← γ → 1 farsighted
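To make the formula concrete, here is a minimal Python sketch of computing a discounted return from a finite reward sequence (the reward values and γ below are illustrative assumptions):

def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite list of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Illustrative rewards r_{t+1}, r_{t+2}, ... and an assumed discount rate.
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))   # 1 + 0 + 0.81*2 + 0.729*1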
Example – Setting Up Rewards
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where the episode ends upon failure:
reward = +1 for each step before failure
⇒ return = number of steps before failure

As a continuing task with discounted return:
reward = −1 upon failure; 0 otherwise
⇒ return = −γ^k, for k steps before failure
In either case, return is maximized by
avoiding failure for as long as possible.
Another Example
Get to the top of the hill
as quickly as possible.
reward = −1 for each step when not at top of hill
⇒ return = −(number of steps before reaching top of hill)
Return is maximized by minimizing
number of steps to reach the top of the hill.
A Unified Notation
• In episodic tasks, we number the time steps of each
episode starting from zero.
• We usually do not have to distinguish between episodes,
so we write s_t instead of s_{t,j} for the state at step t of
episode j.
• Think of each episode as ending in an absorbing state that
always produces reward of zero.
• We can cover all cases by writing

R_t = Σ_{k=0}^∞ γ^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.
Markov Decision Processes
• If a reinforcement learning task has the Markov Property, it is
basically a Markov Decision Process (MDP).
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
– state and action sets
– one-step “dynamics” defined by transition probabilities:
P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a } for all s, s' ∈ S, a ∈ A(s).
– reward probabilities:

R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' } for all s, s' ∈ S, a ∈ A(s).
The RL Problem
Main Elements:
• States, s
• Actions, a
• State transition dynamics: often stochastic & unknown
• Reward (r) process: possibly stochastic
Objective: Policy π_t(s, a)
– probability distribution over actions given current state
Assumption:
Environment defines
a finite-state MDP
An Example Finite MDP
Recycling Robot
• At each step, robot has to decide whether it should (1) actively
search for a can, (2) wait for someone to bring it a can, or (3) go to
home base and recharge.
• Searching is better but runs down the battery; if it runs out of
power while searching, it has to be rescued (which is bad).
• Decisions made on basis of current energy level: high, low.
• Reward = number of cans collected
Recycling Robot MDP
S = {high, low}

A(high) = {search, wait}
A(low) = {search, wait, recharge}

R^search = expected no. of cans while searching
R^wait = expected no. of cans while waiting
R^search > R^wait
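As a data-structure sketch, this finite MDP can be written down directly in Python. The state and action sets come from this slide; the transition probabilities and numerical rewards (alpha, beta, r_search, r_wait, and the rescue penalty) are placeholder assumptions standing in for the entries enumerated on the next slide, not values given here.

# Sketch of the recycling-robot MDP as plain Python data. The numerical
# parameters (alpha, beta, r_search, r_wait, rescue_penalty) are assumptions.
alpha, beta = 0.9, 0.6            # Pr[stay high | search in high], Pr[stay low | search in low]
r_search, r_wait = 2.0, 1.0       # expected cans: R^search > R^wait
rescue_penalty = -3.0             # assumed penalty for running flat and being rescued

states = ["high", "low"]
actions = {"high": ["search", "wait"],
           "low":  ["search", "wait", "recharge"]}

# P[(s, a)] lists (next_state, probability, expected_reward) triples.
P = {
    ("high", "search"):   [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
    ("high", "wait"):     [("high", 1.0, r_wait)],
    ("low",  "search"):   [("low", beta, r_search), ("high", 1 - beta, rescue_penalty)],
    ("low",  "wait"):     [("low", 1.0, r_wait)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}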
Enumerated in Tabular Form
Given such an enumeration of transitions and
corresponding costs/rewards, what is the best
sequence of actions?
We know we want to maximize:

R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
So, what must one do?
The Shortest Path Problem
Finite-State Systems and Shortest Paths
– the state space S_k at each stage k is a finite set
– a_k can get you from s_k to f_k(s_k, a_k) at a cost g_k(s_k, a_k)

Length ≈ Cost ≈ sum of the lengths of the arcs

Solve the tail subproblem (the final stage) first, then work backwards:

J_k(i) = min_j [ a^k_{ij} + J_{k+1}(j) ]
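A minimal Python sketch of this backward recursion on a staged graph (the stage structure and arc costs below are illustrative assumptions):

# Backward dynamic programming for a staged shortest-path problem.
# arc_cost[k][i][j] is the (assumed, illustrative) cost of the arc from node i
# at stage k to node j at stage k+1; terminal costs are taken to be zero.
def shortest_path_dp(arc_cost, terminal_nodes):
    K = len(arc_cost)                       # number of stages with outgoing arcs
    J = {n: 0.0 for n in terminal_nodes}    # J_K(i) = 0 at the final stage
    for k in reversed(range(K)):            # solve the last stage first, then work backwards
        J_next, J = J, {}
        for i, arcs in arc_cost[k].items():
            # Bellman recursion: J_k(i) = min_j [ a^k_ij + J_{k+1}(j) ]
            J[i] = min(cost + J_next[j] for j, cost in arcs.items())
    return J                                # optimal cost-to-go from each initial node

# Illustrative 2-stage example: "A" -> {"B", "C"} -> {"T"}
arc_cost = [
    {"A": {"B": 1.0, "C": 4.0}},
    {"B": {"T": 5.0}, "C": {"T": 1.0}},
]
print(shortest_path_dp(arc_cost, terminal_nodes=["T"]))   # {'A': 5.0}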
Value Functions
• The value of a state is the expected return starting from that
state; depends on the agent’s policy:
State-value function for policy π:

V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }
• The value of taking an action in a state under policy π is the
expected return starting from that state, taking that action,
and thereafter following π:

Action-value function for policy π:

Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }
Recursive Equation for Value
The basic idea:
R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ... )
    = r_{t+1} + γ R_{t+1}
So:
V^π(s) = E_π{ R_t | s_t = s }
       = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }
Optimality in MDPs – Bellman Equation
More on the Bellman Equation

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
This is a set of equations (in fact, linear), one for each state.
The value function for π is its unique solution.
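Because the system is linear, V^π can be computed directly with linear algebra. A minimal NumPy sketch, assuming the transition matrix and expected rewards have already been averaged over the policy π (the numbers are illustrative):

# V^pi as the solution of the linear Bellman system
#   V = R + gamma * P V   =>   (I - gamma * P) V = R
# P_pi and R_pi are assumed to be already averaged over pi(s, a); values are illustrative.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])        # P_pi[s, s'] = sum_a pi(s,a) * P^a_{ss'}
R_pi = np.array([1.0, 0.5])          # R_pi[s]    = sum_a pi(s,a) * sum_s' P^a_{ss'} R^a_{ss'}

V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)                             # value of each state under pi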
Backup diagrams: for V^π and for Q^π
Bellman Equation for Q
Gridworld
• Actions: north, south, east, west; deterministic.
• If it would take agent off the grid: no move but reward = –1
• Other actions produce reward = 0, except actions that move
agent out of special states A and B as shown.
State-value function for the equiprobable random policy; γ = 0.9
Golf
• State is ball location
• Reward of –1 for each stroke
until the ball is in the hole
• Value of a state?
• Actions:
– putt (use putter)
– driver (use driver)
• putt succeeds anywhere on
the green
Optimal Value Functions
• For finite MDPs, policies can be partially ordered:
π ≥ π′ if and only if V^π(s) ≥ V^{π′}(s) for all s ∈ S
• There are always one or more policies that are better than or
equal to all the others. These are the optimal policies. We
denote them all p*.
• Optimal policies share the same optimal state-value function:
V*(s) = max_π V^π(s) for all s ∈ S
• Optimal policies also share the same optimal action-value
function:

Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
This is the expected return for taking action a in state s and thereafter
following an optimal policy.
Optimal Value Function for Golf
• We can hit the ball farther with driver than with
putter, but with less accuracy
• Q*(s, driver) gives the value of using the driver first, then
using whichever actions are best
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal
the expected return for the best action from that state:
V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
      = max_{a ∈ A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]
The relevant backup diagram:
V* is the unique solution of this system of nonlinear equations.
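In practice these equations are solved iteratively. A minimal value-iteration sketch for a generic finite MDP, assuming the same P[(s, a)] = [(next_state, probability, reward), ...] format as the earlier recycling-robot sketch:

# Value iteration: repeatedly apply
#   V(s) <- max_a sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V(s') ]
# The MDP is assumed to be given as P[(s, a)] = [(next_state, prob, reward), ...].
def value_iteration(states, actions, P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

# Usage (with the recycling-robot data sketched earlier):
# V_star = value_iteration(states, actions, P, gamma=0.9)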
Recursive Form of V*
Bellman Optimality Equation for Q*

Q*(s, a) = E{ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a }
         = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a′} Q*(s′, a′) ]
The relevant backup diagram:
Q* is the unique solution of this system of nonlinear equations.
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy.
Therefore, given V* , one-step-ahead search produces the
long-term optimal actions.
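A minimal sketch of that one-step-ahead greedy selection from V*, again assuming the P[(s, a)] transition format used in the sketches above:

# Greedy (one-step-ahead) action choice from V*; it needs the one-step model P.
def greedy_action_from_v(s, actions, P, V_star, gamma=0.9):
    return max(actions[s],
               key=lambda a: sum(p * (r + gamma * V_star[s2]) for s2, p, r in P[(s, a)]))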
E.g., back to the gridworld (from your S+B book):
π*
What About Optimal Action-Value Functions?
Given Q*, the agent does not even
have to do a one-step-ahead search:
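With Q*, the greedy choice is just an argmax over the actions available in the current state. A minimal sketch (the dictionary format of Q_star is an assumption):

# Acting greedily from Q* is a pure table lookup: no model, no one-step search.
def greedy_action_from_q(s, actions, Q_star):
    return max(actions[s], key=lambda a: Q_star[(s, a)])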
Solving the Bellman Optimality Equation
• Finding an optimal policy by solving the Bellman
Optimality Equation requires the following:
– accurate knowledge of environment dynamics;
– we have enough space and time to do the computation;
– the Markov Property.
• How much space and time do we need?
– polynomial in number of states (via dynamic programming),
– BUT, number of states is often huge (e.g., backgammon has
about 10^20 states)
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately
solving the Bellman Optimality Equation.