Transcript Slide 1

Reinforcement Learning
Slides for this part are adapted from
those of Dan Klein@UCB
Does self-learning through a simulator.
[Infants don't get to "simulate" the world,
since they have neither T(.) nor R(.) of their world]
Objective(s) of Reinforcement Learning
• Given
– your effectors and perceptors
• Assume full observability of the state as well as the reward
– The world (raw in tooth and claw)
– (sometimes) a simulator [so you get ergodicity and can repeat futures]
• Learn how to perform well
– This may involve
• Learning state values
– State rewards have to be learned too; but this is easy
• Learning action values
– (q-function; so we can pick the right action)
• Learning transition model
– (representation; so we can combine the rewards and transitions via the Bellman equations to
learn values and a policy)
• Learning policy directly
– So we can short-circuit and go directly to what is the right thing to do
Dimensions of Variation
of RL Algorithms
Passive vs. Active
Model-based vs. Model-free
• Passive vs. Active
• Model-based vs. Model-free
– Passive: Assume the agent is
already following a policy (so
there is no action choice to be
made; you just need to learn
the state values and maybe the
action model)
– Active: Need to learn both
the optimal policy and the
state values (and maybe the
action model)
– Model-based: Have/learn
action models (i.e., transition
probabilities)
• E.g., Approximate DP
– Model-free: Skip them and
directly learn what action to
do when (without necessarily
finding out the exact model of
the action)
• E.g. Q-learning
Dimensions of variation (Contd)
Extent of Backup
• Full DP
– Adjust value based on values
of all the neighbors (as
predicted by the transition
model)
– Can only be done when
transition model is present
• Temporal difference
– Adjust value based only on
the actual transitions
observed
Generalization
• Learn Tabular
representations
• Learn feature-based
(factored) representations
– Online inductive learning
methods..
When you were a kid, your policy was mostly
dictated by your parents (if it is 6AM, wake up and
go to school). You however did “learn” to detest
Mondays and look forward to Fridays..
Inductive Learning over direct
estimation
• States are represented in terms of features
• The long term cumulative rewards
experienced from the states become their
labels
• Do inductive learning (regression) to find the
function that maps features to values
– This generalizes the experience beyond the
specific states we saw
We are basically doing EMPIRICAL Policy Evaluation!
But we know this will be wasteful
(since it misses the correlation between values of neighboring states!)
Do DP-based policy
evaluation!
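To make the idea concrete, here is a minimal sketch (not from the slides) of empirical policy evaluation with inductive learning: collect (features(s), observed return) pairs from traces of a fixed policy and fit a linear regression. The feature map and trace format are illustrative assumptions.

```python
import numpy as np

def features(state):
    # Hypothetical feature map: here a state is an (x, y) grid cell.
    x, y = state
    return np.array([1.0, x, y])  # bias feature plus the two coordinates

def empirical_policy_evaluation(traces, gamma=0.9):
    """Fit a linear value function from complete traces of a fixed policy.

    traces: list of traces, each a list of (state, reward) pairs obtained
    by following the policy. Returns weights w with V(s) ~= w . features(s).
    """
    X, y = [], []
    for trace in traces:
        G, returns = 0.0, []
        for _, reward in reversed(trace):   # discounted return from each state onward
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        for (state, _), ret in zip(trace, returns):
            X.append(features(state))
            y.append(ret)
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return w
```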
Passive
Robustness in the face of
Model Uncertainty
• Suppose you ran through a red light a couple of times, and
reached home faster
– Should we learn that running through red lights is a good
action?
• General issue with maximum-likelihood learning
– If you tossed a coin thrice and it came heads twice, can you say
that the probability of heads is 2/3?
– General solution: Bayesian Learning
• Keep a prior on the hypothesis space; and compute posterior given
the examples
• Bayesian Reinforcement Learning
– Risk Averse solution
• Suppose your model is one of K, do the action that is least harmful
across the K models
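As a worked version of the coin example (my numbers, assuming a uniform Beta(1,1) prior, which the slide does not specify): after 2 heads in 3 tosses the posterior over the heads probability p is

P(p \mid D) \propto p^{2}(1-p)^{1} \cdot 1 \;\Rightarrow\; p \mid D \sim \mathrm{Beta}(3, 2), \qquad \mathbb{E}[p \mid D] = \frac{3}{3+2} = 0.6

so the Bayesian estimate is pulled toward the prior mean 1/2 instead of committing to the maximum-likelihood value 2/3 after only three tosses.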
Active
Model Completeness issue
Greedy in the
Limit of
Infinite
Exploration
Must try all state-action
combinations infinitely
often; but must become
greedy in the limit
(e.g., set it to f(1/t))
Idea: Keep track of the number of times a state/action pair
has been explored; below a threshold, boost the value of
that pair (optimism for exploration)
U+ is set to R+ (max optimistic reward) as long as N(s,a) is below a threshold
Qn: What if a very unlikely negative (or positive) transition
biases the estimate?
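A minimal sketch of the optimistic exploration function described above (the names `R_plus` and `N_e` are my placeholders for the optimistic reward and the visit-count threshold):

```python
def exploration_value(u, n, R_plus=10.0, N_e=5):
    """Optimistic value used in the backup for a (state, action) pair.

    u: current utility/Q estimate for the pair
    n: visit count N(s, a)
    Returns the optimistic reward R+ until the pair has been tried
    N_e times, after which the learned estimate u is trusted.
    """
    return R_plus if n < N_e else u
```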
Temporal Difference won‘t directly work for
Active Learning
SARSA
(State-Action-Reward-State-Action)
• Q-learning is not as fully dependent on the experience as you might
think
– You are assuming that the best action a’ will be done from s’
(where best action is computed by maxing over Q values)
– Why not actually see what action actually got done?
• SARSA—wait to see what action actually is chosen (no maxing)
• SARSA is on-policy (it watches the policy) while Q-learning is off-policy (it predicts what action will be done)
– SARSA is more realistic and thus better when, let us say, the agent is in a
multi-agent world where it is being "led" from action to action..
• E.g. A kid passing by a candy store on the way to school and expecting to stop there, but
realizing that his mom controls the steering wheel.
– Q-learning is more flexible (it will learn the actual values even when it is
being guided by a random policy)
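To highlight the one-line difference the slide describes, here is a hedged sketch of the two update rules side by side (a tabular Q dictionary, learning rate `alpha`, and discount `gamma` are assumed names):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy: back up the best action we *could* take from s'
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: back up the action a' that was *actually* chosen in s'
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```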
Learning/Planning/Acting
What you miss in the absence of a model is the
ability to “simulate in your mind”
You can’t draw an RTDP tree if all you have are
Q* values—since Q* tells you what action you should do in a state
but won't tell you where that would lead you…
--For that latter bit, you need to actually ACT in the world
(If you have an external simulator, you can use that in lieu of the world
but you still can’t do the RTDP tree in your mind)
Relating TD and Monte Carlo
• Both Monte Carlo and TD learn from samples
(traces)
– Monte Carlo waits until the trace hits a sink state,
and then (discount-)adds all the rewards of the
trace
– TD on the other hand considers the current state
s and the next experienced state s'
• You can think of what TD is doing as "truncating"
the experience and summarizing the aggregated
reward of the entire trace starting from s' in terms
of the current value estimate of s'
– Why truncate at the very first state s'? How about
going from s → s1 → s2 → … → sk and truncating
the remaining trace (by assuming that its
aggregate reward is just the current value of sk)?
• (sort of like how deep down you go in game trees
before applying the evaluation function)
– In this generalized view, TD corresponds to k=0
and Monte Carlo corresponds to k=infinity
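A minimal sketch of the truncated target this view describes (the reward and value conventions are my assumptions):

```python
def n_step_target(rewards, next_values, n, gamma=0.9):
    """n-step truncated return for the state that precedes rewards[0].

    rewards:     [r_1, ..., r_T], rewards observed along the rest of the trace
    next_values: [V(s_1), ..., V(s_T)], current estimates of the states visited
    n = 1   -> the usual one-step TD target (the slide's k = 0)
    n >= T  -> the full Monte Carlo return  (the slide's k = infinity)
    """
    T = len(rewards)
    n = min(n, T)
    G = sum((gamma ** t) * rewards[t] for t in range(n))
    if n < T:  # truncate: summarize the rest of the trace by a value estimate
        G += (gamma ** n) * next_values[n - 1]
    return G
```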
Generalizing TD to TD(λ)
• TD(λ) can be thought of
as doing 1, 2, …, k-step
predictions of the value
of the state, and taking
their weighted average
– Weighting is done in
terms of λ such that
– λ=0 corresponds to TD
– λ=1 corresponds to
Monte Carlo
• Note that the last
backup doesn't have the
(1-λ) factor…
Reason:
After the Tth state, the
remaining infinite
# of states will all have
the same aggregated
backup, but each
is discounted in λ.
So we have a
1/(1-λ) factor that
cancels out the (1-λ)
No (1-λ) factor!
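For reference, the weighted average the slide describes is the λ-return (standard form, not written out on the slide), where G_t^{(k)} is the k-step truncated return and G_t is the full Monte Carlo return of an episode ending at time T:

G_t^{\lambda} = (1-\lambda)\sum_{k=1}^{T-t-1} \lambda^{k-1} G_t^{(k)} + \lambda^{T-t-1} G_t

The final (Monte Carlo) term carries weight λ^{T-t-1} with no (1-λ) factor, because it absorbs the weights of all the terms beyond the end of the episode: (1-λ)(λ^{T-t-1} + λ^{T-t} + …) = λ^{T-t-1}, which is exactly the 1/(1-λ) cancellation mentioned above.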
Dimensions of Reinforcement Learning
Large State Spaces
• When a problem has a large state space we
can no longer represent the V or Q functions
as explicit tables
• Even if we had enough memory
– Never enough training data!
– Learning takes too long
• What to do??
[Slides from Alan Fern]
Function Approximation
• Never enough training data!
– Must generalize what is learned from one situation to other
“similar” new situations
• Idea:
– Instead of using large table to represent V or Q, use a
parameterized function
• The number of parameters should be small compared to
number of states (generally exponentially fewer parameters)
– Learn parameters from experience
– When we update the parameters based on observations in one
state, then our V or Q estimate will also change for other similar
states
• I.e. the parameterization facilitates generalization of
experience
Linear Function Approximation
• Define a set of state features f1(s), …, fn(s)
– The features are used as our representation of states
– States with similar feature values will be considered to be similar
• A common approximation is to represent V(s) as a weighted sum
of the features (i.e. a linear approximation)
\hat{V}_\theta(s) = \theta_0 + \theta_1 f_1(s) + \theta_2 f_2(s) + \ldots + \theta_n f_n(s)
• The approximation accuracy is fundamentally limited by the
information provided by the features
• Can we always define features that allow for a perfect linear
approximation?
– Yes. Assign each state an indicator feature. (I.e., the i'th feature is 1 iff the i'th state
is present, and \theta_i represents the value of the i'th state)
– Of course this requires far too many features and gives no generalization.
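As a side note, a one-line sketch of the weighted-sum form above (the array layout, with the bias weight at index 0, is my assumption):

```python
import numpy as np

def v_hat(theta, f_s):
    """theta_0 + theta_1 * f_1(s) + ... + theta_n * f_n(s), with theta[0] the bias."""
    return theta[0] + np.dot(theta[1:], f_s)
```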
Example
• Consider grid problem with no obstacles, deterministic actions
U/D/L/R (49 states)
• Features for state s=(x,y): f1(s)=x, f2(s)=y (just 2 features)
• \hat{V}_\theta(s) = \theta_0 + \theta_1 x + \theta_2 y
• Is there a good linear
approximation?
– Yes.
– \theta_0 = 10, \theta_1 = -1, \theta_2 = -1
– (note upper right is origin)
[Figure: 7x7 grid, coordinates 0..6 on each axis, with the goal reward 10 at the upper-right corner (the origin)]
• V(s) = 10 - x - y
subtracts Manhattan dist.
from goal reward
But What If We Change Reward …
• \hat{V}_\theta(s) = \theta_0 + \theta_1 x + \theta_2 y
• Is there a good linear approximation?
– No.
[Figure: the same grid, but the reward of 10 is now at an interior (center) cell rather than the corner]
But What If…
• \hat{V}_\theta(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z
• Include new feature z
– z = |3-x| + |3-y|
– z is the Manhattan distance to the goal location (3,3)
• Does this allow a
good linear approx?
– \theta_0 = 10, \theta_1 = \theta_2 = 0, \theta_3 = -1
[Figure: the same grid with the reward of 10 at the center cell (3,3)]
Feature Engineering….
Linear Function Approximation
• Define a set of features f1(s), …, fn(s)
– The features are used as our representation of states
– States with similar feature values will be treated
similarly
– More complex functions require more complex features
Vˆ ( s ) =  0   1 f 1 ( s )   2 f 2 ( s )  ...   n f n ( s )
• Our goal is to learn good parameter values (i.e.
feature weights) that approximate the value
function well
– How can we do this?
– Use TD-based RL and somehow update parameters
based on each experience.
TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit
policy
(should converge to greedy policy, i.e. GLIE)
3. Update estimated model
4. Perform TD update for each parameter:
\theta_i \leftarrow \; ?
5. Goto 2
What is a “TD update” for a parameter?
Aside: Gradient Descent
• Given a function f(\theta_1, \ldots, \theta_n) of n real values \theta = (\theta_1, \ldots, \theta_n), suppose we
want to minimize f with respect to \theta
• A common approach to doing this is gradient descent
• The gradient of f at point \theta, denoted by \nabla_\theta f(\theta), is an
n-dimensional vector that points in the direction where f
increases most steeply at point \theta
• Vector calculus tells us that \nabla_\theta f(\theta) is just a vector of partial
derivatives
\nabla_\theta f(\theta) = \left[ \frac{\partial f(\theta)}{\partial \theta_1}, \ldots, \frac{\partial f(\theta)}{\partial \theta_n} \right]
where
\frac{\partial f(\theta)}{\partial \theta_i} = \lim_{\epsilon \to 0} \frac{f(\theta_1, \ldots, \theta_{i-1}, \theta_i + \epsilon, \theta_{i+1}, \ldots, \theta_n) - f(\theta)}{\epsilon}
We can decrease f by moving in the negative gradient direction.
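A minimal sketch of gradient descent on an arbitrary differentiable function (the quadratic objective and step size are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, theta0, alpha=0.1, steps=100):
    """Minimize f by repeatedly stepping in the negative gradient direction."""
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= alpha * grad_f(theta)
    return theta

# Example: f(theta) = ||theta - c||^2 has gradient 2 * (theta - c),
# so gradient descent converges to theta = c.
c = np.array([3.0, -1.0])
theta_star = gradient_descent(lambda th: 2 * (th - c), theta0=[0.0, 0.0])
```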
Aside: Gradient Descent for Squared Error
• Suppose that we have a sequence of states and target values for
each state
\langle s_1, v(s_1) \rangle, \langle s_2, v(s_2) \rangle, \ldots
– E.g. produced by the TD-based RL loop
• Our goal is to minimize the sum of squared errors between our
estimated function and each target value:
E_j(\theta) = \frac{1}{2} \left( \hat{V}_\theta(s_j) - v(s_j) \right)^2
(the squared error of example j, where \hat{V}_\theta(s_j) is our estimated value for the j'th state
and v(s_j) is the target value for the j'th state)
• After seeing the j'th state, the gradient descent rule tells us that we
can decrease the error by updating the parameters by:
\theta_i \leftarrow \theta_i - \alpha \frac{\partial E_j}{\partial \theta_i},
\qquad
\frac{\partial E_j}{\partial \theta_i} = \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)} \cdot \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}
(\alpha is the learning rate)
Aside: continued
\theta_i \leftarrow \theta_i - \alpha \frac{\partial E_j}{\partial \theta_i}
= \theta_i - \alpha \left( \hat{V}_\theta(s_j) - v(s_j) \right) \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}
(the factor \partial \hat{V}_\theta(s_j) / \partial \theta_i depends on the form of the approximator)
• For a linear approximation function:
\hat{V}_\theta(s) = \theta_0 + \theta_1 f_1(s) + \theta_2 f_2(s) + \ldots + \theta_n f_n(s)
\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} = f_i(s_j)
• Thus the update becomes:
\theta_i \leftarrow \theta_i + \alpha \left( v(s_j) - \hat{V}_\theta(s_j) \right) f_i(s_j)
• For linear functions this update is guaranteed to converge
to the best approximation for a suitable learning rate schedule
TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit policy
(should converge to greedy policy, i.e. GLIE),
transitioning from s to s'
3. Update estimated model
4. Perform TD update for each parameter:
\theta_i \leftarrow \theta_i + \alpha \left( v(s) - \hat{V}_\theta(s) \right) f_i(s)
5. Goto 2
What should we use for "target value" v(s)?
• Use the TD prediction based on the next state s':
v(s) = R(s) + \gamma \hat{V}_\theta(s')
– this is the same as the previous TD method, only with approximation
TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit policy
(should converge to greedy policy, i.e. GLIE)
3. Update estimated model
4. Perform TD update for each parameter:
\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \hat{V}_\theta(s') - \hat{V}_\theta(s) \right) f_i(s)
5. Goto 2
• Step 2 requires a model to select the greedy action
• For applications such as Backgammon it is easy to get a
simulation-based model
• For others it is difficult to get a good model
• But we can do the same thing for model-free Q-learning
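A hedged sketch of one step of this update for the linear case (the feature vectors are assumed to include a constant bias feature so that \theta_0 is learned too):

```python
import numpy as np

def td_linear_update(theta, f_s, f_s_next, reward, alpha=0.05, gamma=0.9):
    """One TD(0) update of the weights of a linear value approximator.

    f_s, f_s_next: feature vectors for s and s' (with a constant bias feature).
    """
    td_error = reward + gamma * np.dot(theta, f_s_next) - np.dot(theta, f_s)
    return theta + alpha * td_error * f_s   # theta_i += alpha * delta * f_i(s)
```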
Q-learning with Linear Approximators
\hat{Q}_\theta(s,a) = \theta_0 + \theta_1 f_1(s,a) + \theta_2 f_2(s,a) + \ldots + \theta_n f_n(s,a)
Features are a function of states and actions.
1. Start with initial parameter values
2. Take action a according to an explore/exploit policy
(should converge to greedy policy, i.e. GLIE), transitioning from s
to s'
3. Perform TD update for each parameter:
\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}_\theta(s',a') - \hat{Q}_\theta(s,a) \right) f_i(s,a)
4. Goto 2
• For both Q and V, these algorithms converge to the closest
linear approximation to optimal Q or V.
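A corresponding sketch for the Q-learning update (the feature function `f(s, a)` over state-action pairs and the action set are assumed inputs):

```python
import numpy as np

def q_hat(theta, f, s, a):
    """Linear action-value estimate: dot product of weights and features f(s, a)."""
    return np.dot(theta, f(s, a))

def q_learning_linear_update(theta, f, s, a, reward, s_next, actions,
                             alpha=0.05, gamma=0.9):
    """One weight update, using the max over next actions as the target."""
    best_next = max(q_hat(theta, f, s_next, a2) for a2 in actions)
    td_error = reward + gamma * best_next - q_hat(theta, f, s, a)
    return theta + alpha * td_error * f(s, a)
```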
Policy Gradient Ascent
• Let \rho(\theta) be the expected value of policy \pi_\theta.
– \rho(\theta) is just the expected discounted total reward for a trajectory of \pi_\theta.
– For simplicity assume each trajectory starts at a single initial state.
• Our objective is to find a \theta that maximizes \rho(\theta)
• Policy gradient ascent tells us to iteratively update the
parameters via:
\theta \leftarrow \theta + \alpha \nabla_\theta \rho(\theta)
• Problem: \rho(\theta) is generally very complex and it is rare that we
can compute a closed form for the gradient of \rho(\theta).
• We will instead estimate the gradient based on experience
Gradient Estimation
• Concern: Computing or estimating the gradient of
discontinuous functions can be problematic.
• For our example parametric policy
\pi_\theta(s) = \arg\max_a \hat{Q}_\theta(s,a)
is \rho(\theta) continuous?
• No.
– There are values of \theta where arbitrarily small changes
cause the policy to change.
– Since different policies can have different values, this
means that changing \theta can cause a discontinuous jump in
\rho(\theta).
Example: Discontinuous \rho(\theta)
\pi_\theta(s) = \arg\max_a \hat{Q}_\theta(s,a), \quad \hat{Q}_\theta(s,a) = \theta_1 f_1(s,a)
• Consider a problem with initial state s and two actions a1 and a2
– a1 leads to a very large terminal reward R1
– a2 leads to a very small terminal reward R2
• Fixing \theta_2 to a constant, we can plot the ranking assigned to each action by \hat{Q}_\theta
and the corresponding value \rho(\theta)
[Figure: \hat{Q}_\theta(s,a1) and \hat{Q}_\theta(s,a2) plotted against \theta_1, together with \rho(\theta); there is a discontinuity in \rho(\theta), jumping between R2 and R1, where the ordering of a1 and a2 changes]
Probabilistic Policies
• We would like to avoid policies that drastically change with small
parameter changes, leading to discontinuities
• A probabilistic policy \pi takes a state as input and returns a
distribution over actions
[aka a mixed policy; not needed for optimality…]
– Given a state s, \pi(s,a) returns the probability that \pi selects action a in s
• Note that \rho(\theta) is still well defined for probabilistic policies
– Now the uncertainty of trajectories comes from the environment and the policy
– Importantly, if \pi_\theta(s,a) is continuous relative to changing \theta, then \rho(\theta) is also continuous
relative to changing \theta
• A common form for probabilistic policies is the softmax function
or Boltzmann exploration function
\pi_\theta(s,a) = \Pr(a \mid s) = \frac{\exp\left( \hat{Q}_\theta(s,a) \right)}{\sum_{a' \in A} \exp\left( \hat{Q}_\theta(s,a') \right)}
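A small sketch of this softmax (Boltzmann) policy over a linear \hat{Q} (the feature function and action list are assumed; a temperature parameter is a common addition but is omitted to match the formula above):

```python
import numpy as np

def softmax_policy(theta, f, s, actions):
    """Return a dict mapping each action to its selection probability in state s."""
    q_values = np.array([np.dot(theta, f(s, a)) for a in actions])
    q_values -= q_values.max()           # subtract the max for numerical stability
    probs = np.exp(q_values)
    probs /= probs.sum()
    return dict(zip(actions, probs))
```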
Empirical Gradient Estimation
• Our first approach to estimating \nabla_\theta \rho(\theta) is to simply compute
empirical gradient estimates
• Recall that \theta = (\theta_1, \ldots, \theta_n) and
\nabla_\theta \rho(\theta) = \left[ \frac{\partial \rho(\theta)}{\partial \theta_1}, \ldots, \frac{\partial \rho(\theta)}{\partial \theta_n} \right]
so we can compute the gradient by empirically estimating
each partial derivative
\frac{\partial \rho(\theta)}{\partial \theta_i} = \lim_{\epsilon \to 0} \frac{\rho(\theta_1, \ldots, \theta_{i-1}, \theta_i + \epsilon, \theta_{i+1}, \ldots, \theta_n) - \rho(\theta)}{\epsilon}
• So for small \epsilon we can estimate the partial derivatives by
\frac{\rho(\theta_1, \ldots, \theta_{i-1}, \theta_i + \epsilon, \theta_{i+1}, \ldots, \theta_n) - \rho(\theta)}{\epsilon}
• This requires estimating n+1 values:
\rho(\theta), \quad \left\{ \rho(\theta_1, \ldots, \theta_{i-1}, \theta_i + \epsilon, \theta_{i+1}, \ldots, \theta_n) \mid i = 1, \ldots, n \right\}
Empirical Gradient Estimation
• How do we estimate the quantities
\rho(\theta), \quad \left\{ \rho(\theta_1, \ldots, \theta_{i-1}, \theta_i + \epsilon, \theta_{i+1}, \ldots, \theta_n) \mid i = 1, \ldots, n \right\}
• For each set of parameters, simply execute the policy for N
trials/episodes and average the values achieved across the
trials
[doable without permanent damage if there is a simulator]
• This requires a total of N(n+1) episodes to get a gradient
estimate
– For stochastic environments and policies the value of N must be
relatively large to get good estimates of the true value
– Often we want to use a relatively large number of parameters
– Often it is expensive to run episodes of the policy
• So while this can work well in many situations, it is often
not a practical approach computationally
• Better approaches try to use the fact that the stochastic
policy is differentiable.
– Can get the gradient by just running the current policy multiple times
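A hedged sketch of this finite-difference estimator (the `evaluate_policy(theta, N)` routine, which would run N episodes in a simulator and average their returns, is an assumed black box):

```python
import numpy as np

def empirical_policy_gradient(evaluate_policy, theta, epsilon=0.01, N=100):
    """Estimate the gradient of rho(theta) one parameter at a time.

    evaluate_policy(theta, N) is assumed to run N episodes with the policy
    parameterized by theta and return the average total reward.
    Uses N * (len(theta) + 1) episodes in total.
    """
    theta = np.array(theta, dtype=float)
    base = evaluate_policy(theta, N)            # estimate of rho(theta)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        perturbed = theta.copy()
        perturbed[i] += epsilon
        grad[i] = (evaluate_policy(perturbed, N) - base) / epsilon
    return grad
```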
Applications of Policy Gradient Search
• Policy gradient techniques have
been used to create controllers
for difficult helicopter
maneuvers
• For example, inverted
helicopter flight.
• A planner called FPG also
“won” the 2006 International
Planning Competition
– If you don’t count FF-Replan
Slides beyond this not discussed
Policy Gradient Recap
• When policies have much simpler representations than
the corresponding value functions, direct search in policy
space can be a good idea
– Allows us to design complex parametric controllers and optimize details of
parameter settings
• For the baseline algorithm the gradient estimates are
unbiased (i.e. they will converge to the right value) but
have high variance
– Can require a large N to get reliable estimates
• OLPOMDP can trade off bias and variance via the
discount parameter [Baxter & Bartlett, 2000]
• Can be prone to finding local maxima
– Many ways of dealing with this, e.g. random restarts.
Gradient Estimation: Single Step Problems
• For stochastic policies it is possible to estimate \nabla_\theta \rho(\theta) directly from trajectories of
just the current policy \pi_\theta
– Idea: take advantage of the fact that we know the functional form of the policy
• First consider the simplified case where all trials have length 1
– For simplicity assume each trajectory starts at a single initial state and the reward
only depends on the action choice
– \rho(\theta) is just the expected reward of the action selected by \pi_\theta:
\rho(\theta) = \sum_a \pi_\theta(s_0, a) R(a)
where s_0 is the initial state and
R(a) is the reward of action a
• The gradient of this becomes
\nabla_\theta \rho(\theta) = \nabla_\theta \sum_a \pi_\theta(s_0, a) R(a) = \sum_a \left( \nabla_\theta \pi_\theta(s_0, a) \right) R(a)
• How can we estimate this by just observing the execution of \pi_\theta?
Gradient Estimation: Single Step Problems
• Rewriting:
\nabla_\theta \rho(\theta) = \sum_a \left( \nabla_\theta \pi_\theta(s_0, a) \right) R(a)
= \sum_a \pi_\theta(s_0, a) \frac{\nabla_\theta \pi_\theta(s_0, a)}{\pi_\theta(s_0, a)} R(a)
= \sum_a \pi_\theta(s_0, a) \left( \nabla_\theta \log \pi_\theta(s_0, a) \right) R(a)
(we can get a closed form for g(s_0, a) = \nabla_\theta \log \pi_\theta(s_0, a))
• The gradient is just the expected value of g(s_0,a)R(a) over
execution trials of \pi_\theta
– Can estimate by executing \pi_\theta for N trials and averaging the samples
\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} g(s_0, a_j) R(a_j)
where a_j is the action selected by the policy on the j'th episode
– Only requires executing \pi_\theta for a number of trials that need not depend on the
number of parameters
Gradient Estimation: General Case
• So for the case of length-1 trajectories we got:
\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} g(s_0, a_j) R(a_j)
• For the general case, where trajectories have length greater
than one and the reward depends on the state, we can do some work
and get:
\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{t=1}^{T_j} g(s_{j,t}, a_{j,t}) R_j(s_{j,t})
where N is the number of trajectories of the current policy, T_j is the length of trajectory j,
and R_j(s_{j,t}) is the observed total reward in trajectory j from step t to the end
• s_{j,t} is the t'th state of the j'th episode, a_{j,t} is the t'th action of episode j
• The derivation of this is straightforward but messy.
How to interpret gradient expression?
g(s,a) = \nabla_\theta \log \pi_\theta(s,a)
is the direction to move the parameters in order to
increase the probability that the policy selects
a_{j,t} in state s_{j,t}
\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{t=1}^{T_j} g(s_{j,t}, a_{j,t}) R_j(s_{j,t})
where R_j(s_{j,t}) is the total reward observed after taking a_{j,t} in state s_{j,t}
• So the overall gradient is a reward-weighted combination of
individual gradient directions
– A large R_j(s_{j,t}) will increase the probability of a_{j,t} in s_{j,t}
– A negative R_j(s_{j,t}) will decrease the probability of a_{j,t} in s_{j,t}
• Intuitively this increases the probability of taking actions that
typically are followed by good reward sequences
Basic Policy Gradient Algorithm
• Repeat until stopping condition
1. Execute \pi_\theta for N trajectories while storing the state, action, reward
sequences
2. \Delta\theta \leftarrow \frac{1}{N} \sum_{j=1}^{N} \sum_{t=1}^{T_j} g(s_{j,t}, a_{j,t}) R_j(s_{j,t})
3. \theta \leftarrow \theta + \alpha \, \Delta\theta
• One disadvantage of this approach is the small number of
updates per amount of experience
– Also requires a notion of trajectory rather than an infinite sequence
of experience
• Online policy gradient algorithms perform updates after each
step in the environment (often learn faster)
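A hedged sketch of this batch loop (the `rollout` routine that executes the current policy for one episode and the `grad_log(theta, s, a)` helper, i.e. g(s,a) = \nabla_\theta \log \pi_\theta(s,a), are assumed inputs; a concrete grad_log for the Boltzmann policy appears after the next slide):

```python
import numpy as np

def policy_gradient_step(theta, rollout, grad_log, N=20, alpha=0.01):
    """One batch update: run N trajectories with the current policy, then move
    theta along the reward-weighted grad-log directions (the rule above)."""
    grad = np.zeros_like(theta)
    for _ in range(N):
        states, actions, rewards = rollout(theta)   # one stored trajectory
        for t in range(len(states)):
            reward_to_go = sum(rewards[t:])         # R_j(s_{j,t}): reward from step t on
            grad += grad_log(theta, states[t], actions[t]) * reward_to_go
    return theta + alpha * grad / N
```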
Computing the Gradient of Policy
• Both algorithms require computation of
g(s,a) = \nabla_\theta \log \pi_\theta(s,a)
• For the Boltzmann distribution with linear approximation we
have:
\pi_\theta(s,a) = \frac{\exp\left( \hat{Q}_\theta(s,a) \right)}{\sum_{a' \in A} \exp\left( \hat{Q}_\theta(s,a') \right)}
where
\hat{Q}_\theta(s,a) = \theta_0 + \theta_1 f_1(s,a) + \theta_2 f_2(s,a) + \ldots + \theta_n f_n(s,a)
• Here the partial derivatives needed for g(s,a) are:
\frac{\partial \log \pi_\theta(s,a)}{\partial \theta_i} = f_i(s,a) - \sum_{a'} \pi_\theta(s,a') f_i(s,a')
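A sketch of g(s,a) for this Boltzmann/linear case (the feature function `f(s, a)` returning a numpy vector and the action set are assumed):

```python
import numpy as np

def grad_log_boltzmann(theta, f, s, a, actions):
    """Gradient of log pi_theta(s, a): f(s, a) minus the policy-weighted average
    of the feature vectors, matching the partial derivatives above."""
    q = np.array([np.dot(theta, f(s, a2)) for a2 in actions])
    probs = np.exp(q - q.max())
    probs /= probs.sum()
    expected_f = sum(p * f(s, a2) for p, a2 in zip(probs, actions))
    return f(s, a) - expected_f
```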