Structure and Synthesis of Robot Motion Introduction
Reinforcement Learning
Dynamic Programming I
Subramanian Ramamoorthy
School of Informatics
31 January, 2012
Continuing from last time…
MDPs and Formulation of RL Problem
31/01/2012
2
Returns
Suppose the sequence of rewards after step t is:

$r_{t+1}, r_{t+2}, r_{t+3}, \ldots$

What do we want to maximize? In general, we want to maximize the expected return, $E\{R_t\}$, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,

where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.

Discounted return:

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.

shortsighted $0 \leftarrow \gamma \rightarrow 1$ farsighted
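The discounted return is straightforward to compute for a finite reward sequence; a minimal sketch (the function name is ours, not from the lecture):

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} over a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Four rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 + 0.125 = 1.875,
# approaching 1 / (1 - gamma) = 2 as the sequence grows.
print(discounted_return([1, 1, 1, 1], 0.5))
```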
Example – Setting Up Rewards
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where the episode ends upon failure:
reward = +1 for each step before failure
⇒ return = number of steps before failure

As a continuing task with discounted return:
reward = −1 upon failure; 0 otherwise
⇒ return = $-\gamma^k$, for k steps before failure

In either case, the return is maximized by avoiding failure for as long as possible.
Another Example
Get to the top of the hill as quickly as possible.

reward = −1 for each step where not at top of hill
⇒ return = −(number of steps before reaching top of hill)

The return is maximized by minimizing the number of steps taken to reach the top of the hill.
A Unified Notation
• In episodic tasks, we number the time steps of each episode starting from zero.
• We usually do not have to distinguish between episodes, so we write $s_t$ instead of $s_{t,j}$ for the state at step t of episode j.
• Think of each episode as ending in an absorbing state that always produces a reward of zero.
• We can cover all cases by writing

$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

where $\gamma$ can be 1 only if a zero-reward absorbing state is always reached.
Markov Decision Processes
• If a reinforcement learning task has the Markov Property, it is
basically a Markov Decision Process (MDP).
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
– state and action sets
– one-step “dynamics” defined by transition probabilities:
$P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$ for all $s, s' \in S$, $a \in A(s)$.
– reward probabilities:
$R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$ for all $s, s' \in S$, $a \in A(s)$.
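A finite MDP can be stored directly as the tables above; the following sketch uses invented states, actions, and numbers purely to show the shape of $P^a_{ss'}$ and $R^a_{ss'}$:

```python
# A finite MDP as plain tables: P[s][a] maps next states s' to
# probabilities, R[s][a][s'] gives the expected one-step reward.
# All states, actions, and values here are illustrative.
P = {
    "s0": {"go": {"s1": 1.0}},
    "s1": {"go": {"s0": 0.3, "s1": 0.7}},
}
R = {
    "s0": {"go": {"s1": 5.0}},
    "s1": {"go": {"s0": 0.0, "s1": 1.0}},
}

def expected_reward(s, a):
    """E[r_{t+1} | s_t = s, a_t = a] = sum_{s'} P^a_{ss'} R^a_{ss'}."""
    return sum(p * R[s][a][s2] for s2, p in P[s][a].items())

print(expected_reward("s1", "go"))  # 0.3 * 0.0 + 0.7 * 1.0
```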
The RL Problem
Main Elements:
• States, s
• Actions, a
• State transition dynamics – often stochastic & unknown
• Reward (r) process – possibly stochastic

Objective: Policy $\pi_t(s, a)$
– probability distribution over actions given the current state
Assumption:
Environment defines
a finite-state MDP
An Example Finite MDP
Recycling Robot
• At each step, robot has to decide whether it should (1) actively
search for a can, (2) wait for someone to bring it a can, or (3) go to
home base and recharge.
• Searching is better but runs down the battery; if it runs out of
power while searching, it has to be rescued (which is bad).
• Decisions made on basis of current energy level: high, low.
• Reward = number of cans collected
Recycling Robot MDP
$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{search}, \text{wait}\}$
$A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$

$R^{\text{search}}$ = expected no. of cans while searching
$R^{\text{wait}}$ = expected no. of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$
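The same MDP can be written out as a transition table. This sketch follows the standard Sutton & Barto version of the example; the parameters α and β, the rescue penalty of −3, and the numeric reward values are assumptions, not given on the slide:

```python
# Recycling-robot transitions in tabular form. alpha/beta and the
# rescue penalty (-3) follow the textbook version of this example;
# the numeric values are assumed for illustration.
alpha, beta = 0.9, 0.6        # P(stay high | search), P(stay low | search)
R_search, R_wait = 2.0, 1.0   # expected cans per step (assumed values)

# (state, action) -> list of (prob, next_state, expected_reward)
mdp = {
    ("high", "search"):  [(alpha, "high", R_search), (1 - alpha, "low", R_search)],
    ("high", "wait"):    [(1.0, "high", R_wait)],
    ("low", "search"):   [(beta, "low", R_search), (1 - beta, "high", -3.0)],
    ("low", "wait"):     [(1.0, "low", R_wait)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: outgoing probabilities sum to one for every (s, a).
for outcomes in mdp.values():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```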
Enumerated in Tabular Form
Given such an enumeration of transitions and
corresponding costs/rewards, what is the best
sequence of actions?
We know we want to maximize:

$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

So, what must one do?
The Shortest Path Problem
Finite-State Systems and Shortest Paths
– the state space $S_k$ is a finite set for each k
– action $a_k$ takes you from $s_k$ to $s_{k+1} = f_k(s_k, a_k)$ at a cost $g_k(s_k, a_k)$

Length ≈ Cost ≈ sum of the lengths of the arcs

Solve this first:
$J_k(i) = \min_j \left[ a^k_{ij} + J_{k+1}(j) \right]$
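The backward recursion $J_k(i) = \min_j [a^k_{ij} + J_{k+1}(j)]$ can be sketched in a few lines; the staged arc costs below are invented for illustration:

```python
# Backward DP for a shortest path on a staged graph:
# J_k(i) = min_j [ a_ij^k + J_{k+1}(j) ], with J fixed at the final stage.
# cost[k][i][j]: cost of the arc from node i at stage k to node j at stage k+1.
cost = [
    [[1, 4], [2, 1]],   # stage 0 -> stage 1
    [[3, 2], [1, 5]],   # stage 1 -> stage 2
]
J = [0.0, 0.0]          # terminal costs at the final stage
for stage in reversed(cost):
    # One backward sweep: best arc + best cost-to-go from its endpoint.
    J = [min(arc + Jn for arc, Jn in zip(arcs, J)) for arcs in stage]

print(J)  # optimal cost-to-go from each start node
```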
Value Functions
• The value of a state is the expected return starting from that
state; depends on the agent’s policy:
State-value function for policy $\pi$:

$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s \right\}$

• The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$:

Action-value function for policy $\pi$:

$Q^{\pi}(s, a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s, a_t = a \right\}$
Recursive Equation for Value
The basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots$
$\;\; = r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots \right)$
$\;\; = r_{t+1} + \gamma R_{t+1}$

So:

$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}$
Optimality in MDPs – Bellman Equation
More on the Bellman Equation
$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi}(s') \right]$

This is a set of equations (in fact, linear), one for each state. The value function for $\pi$ is its unique solution.

Backup diagrams for $V^{\pi}$ and $Q^{\pi}$: [figures omitted]
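Because the equations are linear, $V^\pi$ can be obtained by a direct linear solve; a sketch on a made-up two-state chain (the matrices are illustrative, not from the slides):

```python
# Policy evaluation as a linear solve: the Bellman equations read
# V = R_pi + gamma * P_pi V  =>  (I - gamma * P_pi) V = R_pi.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],     # state-to-state transitions under pi
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])      # expected one-step reward under pi

V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)
```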
Bellman Equation for Q

$Q^{\pi}(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi(s', a')\, Q^{\pi}(s', a') \right]$
Gridworld
• Actions: north, south, east, west; deterministic.
• If an action would take the agent off the grid: no move, but reward = –1
• Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.

State-value function for the equiprobable random policy; $\gamma = 0.9$ [figure omitted]
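The quoted value function can be reproduced by iterative policy evaluation. A sketch, assuming the reward values from the Sutton & Barto figure the slide refers to (+10 for leaving A, +5 for leaving B):

```python
# Iterative policy evaluation for the 5x5 gridworld under the
# equiprobable random policy, gamma = 0.9. The special-state rewards
# (+10 from A, +5 from B) follow the standard textbook figure, which
# the slide refers to but does not reproduce; they are assumptions here.
import numpy as np

gamma, n = 0.9, 5
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)

def step(s, move):
    if s == A: return A2, 10.0     # any action from A jumps to A' with +10
    if s == B: return B2, 5.0      # any action from B jumps to B' with +5
    r, c = s[0] + move[0], s[1] + move[1]
    if 0 <= r < n and 0 <= c < n:
        return (r, c), 0.0
    return s, -1.0                 # off-grid: stay put, reward -1

V = np.zeros((n, n))
for _ in range(1000):              # sweep until (approximately) converged
    V_new = np.zeros_like(V)
    for r in range(n):
        for c in range(n):
            for move in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                (r2, c2), rew = step((r, c), move)
                V_new[r, c] += 0.25 * (rew + gamma * V[r2, c2])
    V = V_new

print(np.round(V, 1))  # V at A should come out near 8.8, as in the book
```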
Golf
• State is ball location
• Reward of –1 for each stroke
until the ball is in the hole
• Value of a state?
• Actions:
– putt (use putter)
– driver (use driver)
• putt succeeds anywhere on
the green
Optimal Value Functions
• For finite MDPs, policies can be partially ordered:
$\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$

• There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all $\pi^*$.

• Optimal policies share the same optimal state-value function:

$V^{*}(s) = \max_{\pi} V^{\pi}(s)$ for all $s \in S$

• Optimal policies also share the same optimal action-value function:

$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ for all $s \in S$ and $a \in A(s)$
This is the expected return for taking action a in state s and thereafter
following an optimal policy.
Optimal Value Function for Golf
• We can hit the ball farther with driver than with
putter, but with less accuracy
• Q*(s, driver) gives the value of using the driver first, then
using whichever actions are best
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal
the expected return for the best action from that state:
$V^{*}(s) = \max_{a \in A(s)} Q^{*}(s, a)$
$\qquad\; = \max_{a \in A(s)} E\{\, r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \,\}$
$\qquad\; = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{*}(s') \right]$

The relevant backup diagram: [figure omitted]
V* is the unique solution of this system of nonlinear equations.
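Applying this backup repeatedly as an update rule is value iteration; a sketch on an invented two-state MDP (all names and numbers are illustrative):

```python
# Value iteration: repeatedly apply the Bellman optimality backup
# V(s) <- max_a sum_{s'} p * (r + gamma * V(s')).
gamma = 0.9
# P[s][a] = list of (prob, next_state, reward); a made-up 2-state MDP.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "jump": [(0.8, 1, 2.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 1.0)]},
}

V = {s: 0.0 for s in P}
for _ in range(500):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                for outs in P[s].values())
         for s in P}

# Greedy policy with respect to the converged V*
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(V, policy)
```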
Recursive Form of V*
Bellman Optimality Equation for Q*
$Q^{*}(s, a) = E\left\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s, a_t = a \right\}$
$\qquad\;\;\, = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \right]$

The relevant backup diagram: [figure omitted]

Q* is the unique solution of this system of nonlinear equations.
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy.
Therefore, given V* , one-step-ahead search produces the
long-term optimal actions.
E.g., back to the gridworld (from your S+B book):
$\pi^*$ [figure omitted]
What About Optimal Action-Value Functions?
Given Q*, the agent does not even
have to do a one-step-ahead search:
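With a table of Q* values in hand, acting optimally reduces to an argmax over actions; a minimal sketch with an invented Q-table:

```python
# Given Q*, the greedy action needs no model and no one-step lookahead:
# pi*(s) = argmax_a Q*(s, a). The Q-table below is illustrative.
Q_star = {
    ("s0", "left"): 1.2, ("s0", "right"): 3.4,
    ("s1", "left"): 0.7, ("s1", "right"): 0.2,
}

def greedy_action(Q, s):
    """pi*(s) = argmax_a Q*(s, a)."""
    actions = [a for (s2, a) in Q if s2 == s]
    return max(actions, key=lambda a: Q[(s, a)])

print(greedy_action(Q_star, "s0"))  # right
```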
Solving the Bellman Optimality Equation
• Finding an optimal policy by solving the Bellman
Optimality Equation requires the following:
– accurate knowledge of environment dynamics;
– enough space and time to do the computation;
– the Markov Property.
• How much space and time do we need?
– polynomial in number of states (via dynamic programming),
– BUT, the number of states is often huge (e.g., backgammon has
about $10^{20}$ states)
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately
solving the Bellman Optimality Equation.