Reinforcement Learning:
Tutorial +
Rethinking State, Action & Reward
Satinder Singh
Computer Science & Engineering
University of Michigan
Rich Sutton & Doina Precup
Michael Littman
Richard Lewis
Andrew Barto
Michael Kearns
Matthew Rudary
Erik Talvitie
Britton Wolfe
David Wingate
Michael James
Jonathan Sorg
Outline
• RL Tutorial
– Markov Decision Processes (MDPs)
– Policy Evaluation
– Optimal Control
• Options (Rethinking Actions)
• Predictions (Rethinking States)
• Internal Rewards (Rethinking Rewards)
• Random Thought(s)
Markov Decision Process (MDP)
• S: finite state space
• A: finite action space
• P: transition probabilities P(i|j,a)
• R: payoff function R(i)
• π: deterministic policy
• Vπ(i): return for policy π when started in state i
Vπ(i) = E{r0 + γ r1 + γ² r2 + γ³ r3 + … | s0 = i}
• π*: optimal policy is argmaxπ Vπ



Planning (Policy Evaluation)
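A minimal sketch of policy evaluation by iterating the Bellman backup for a fixed policy, assuming tabular transition probabilities P[a][i][j] = P(j|i,a), payoffs R[i], a deterministic policy pi, and discount γ; illustrative only, not the slide's code:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Return V with V[i] ~= E{ r0 + gamma*r1 + gamma^2*r2 + ... | s0 = i, policy pi }."""
    n_states = len(R)
    V = np.zeros(n_states)
    while True:
        # Bellman backup under the fixed policy pi: V(i) <- R(i) + gamma * sum_j P(j|i,pi(i)) V(j)
        V_new = np.array([R[i] + gamma * P[pi[i]][i] @ V for i in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```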

Planning (Optimal Control)
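A minimal sketch of Q-value iteration for optimal control under the same assumed tabular inputs; the optimal policy is read off as the argmax of the returned Q:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate Q(i,a) <- R(i) + gamma * sum_j P(j|i,a) * max_b Q(j,b) to convergence."""
    n_actions, n_states = len(P), len(R)
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)  # greedy value of each successor state
        Q_new = np.array([[R[i] + gamma * P[a][i] @ V for a in range(n_actions)]
                          for i in range(n_states)])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new   # optimal policy: argmax_a Q[i, a]
        Q = Q_new
```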

So far…
• MDP definition
• Policy Evaluation Problem
– Planning (Iteration)
– Learning (TD)
• Optimal Control Problem
– Planning (Q-value Iteration; Linear Programming)
– Learning (Q-learning)
Q-learning was the first provably convergent adaptive
optimal control algorithm (Chris Watkins!)
So why aren’t we done?
Exploration-Exploitation
• Epsilon-Greedy
– With small probability explore, else exploit
– Suitably decreasing epsilon guarantees asymptotic
convergence (GLIE: Singh, Jaakkola, Szepesvari & Littman)
• Optimism Under Uncertainty
– For arbitrary MDPs, guarantees convergence to a near-optimal
policy with sample and time complexity polynomial in reasonable
size-parameters of the MDP (PAC style)
• E3 (Kearns & Singh), Rmax (Brafman & Tennenholtz), Metric-E3 (Kakade,
Kearns & Langford), MBIE (Strehl, Li, Littman), others
• Bayesian Methods
– The state space is the state of knowledge (posteriors). In principle,
the MDP methods above (including exploration) extend to this setting (BAMDPs).
So what is holding us back?
So far…
• All Lookup tables
– Asymptotic (& rates in some cases) convergence
• Practical Problems (like very large state
spaces; scaling!)
– Function Approximation
– Sampling Trees
Gradient Descent
e.g., Q(s,a) ≈ wᵀ φ(s,a)
Sparse Coarse Coding
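A minimal sketch of the gradient-descent idea above, assuming a hand-built feature function phi(s, a) (for example one produced by sparse coarse coding / tile coding) and a semi-gradient Q-learning update; illustrative only:

```python
import numpy as np

def semi_gradient_q_update(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One semi-gradient Q-learning step for Q(s,a) ~= w^T phi(s,a); returns updated weights."""
    q_sa = w @ phi(s, a)
    q_next = max(w @ phi(s_next, b) for b in actions)  # greedy bootstrap target
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * phi(s, a)            # gradient of w^T phi(s,a) w.r.t. w is phi(s,a)
```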
Sampling Trees Approach
Sparse Sampling
Monte-Carlo Sampling
UCT (Kocsis & Szepesvari; others)
In time independent of the size of the state space (polynomial in the horizon),
return a near-optimal action with high probability in the root state
(Kearns, Ng, Mansour)
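A minimal sketch of the sampling idea, closer to flat Monte-Carlo rollouts than to full Sparse Sampling or UCT, assuming a generative model sample(s, a) -> (s_next, r); it illustrates how a near-optimal root action can be chosen in time that depends on horizon and rollout count, not on the size of the state space:

```python
import random

def rollout_value(sample, s, actions, horizon, gamma=0.9):
    """Discounted return of one random rollout of fixed horizon from state s."""
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = sample(s, random.choice(actions))
        ret += discount * r
        discount *= gamma
    return ret

def mc_root_action(sample, s0, actions, horizon=20, n_rollouts=100, gamma=0.9):
    """Pick a root action by averaging sampled rollout returns for each action."""
    def q_estimate(a):
        total = 0.0
        for _ in range(n_rollouts):
            s1, r = sample(s0, a)
            total += r + gamma * rollout_value(sample, s1, actions, horizon - 1, gamma)
        return total / n_rollouts
    return max(actions, key=q_estimate)
```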
So far…
• Basics of RL
– Lots of nice results and applications
• But crucially given
– States
– Actions
– Rewards
• For flexible AI, these are rarely given or easily
determined
States?
• In many engineering/OR/game problems, this
is clear.
• In many narrow-domains this might be clear.
• But what about for a dog or a person? Or a
future android?
• Simultaneously
– Too many sensors
– Too few sensors
Approaches
• Bayesian methods / Graphical Models /
Relational models / Probabilistic Relational
Models
– Fruitful endeavors that you have heard much
about in this summer school
• An alternative approach based on predictions
as representations of state
– PSRs (Littman, Sutton & Singh; Singh, James &
Rudary; Jaeger’s OOMs)
In this part…
• Systems that are
– Discrete time
– Finite observations
– Controlled (finite actions) or uncontrolled
POMDPs…
[Graphical-model figure: hidden nominal-states st, st+1, st+2 evolving under actions at, at+1 via transition matrix T; observations ot, ot+1, ot+2 emitted via observation matrix O. Belief-states are distributions over the hidden nominal-states.]
Predictions / futures
• What is a future?
Uncontrolled system: a future is a sequence of observations t = o1 o2 … ok
Controlled system: a future is a sequence of observations for a sequence of actions t = a1 o1 a2 o2 … ak ok
• What is a prediction for a future?
Uncontrolled system: p(t) = prob(O1 = o1, …, Ok = ok)
Controlled system: p(t) = prob(O1 = o1, …, Ok = ok | A1 = a1, …, Ak = ak)
System Dynamics Vector
All possible futures!
[Figure: an infinite vector indexed by futures t1, t2, t3, t4, …, ti, … with entries p(t1), …, p(ti), …; there may be 0's in here…]
Mathematical construct that IS the system (not a model)
Any exact model of a system should be able to generate this vector
For both controlled and uncontrolled systems…
Lots of constraints on the entries of this vector
A “System” is a distribution over all futures…
System Dynamics Matrix
[Figure: a matrix with rows indexed by histories h1 (= ∅, the empty history), h2, h3, …, hj, … and columns indexed by tests t1, t2, t3, t4, …, ti, …; the entry in row hj, column ti is p(ti|hj), and the first row holds p(ti).]
Uncontrolled system: ti = o1 o2 … ok, hj = o1 o2 … on
p(ti|hj) = prob(On+1 = o1, …, On+k = ok | o1 o2 … on)
Controlled system: ti = a1 o1 … ak ok, hj = a1 o1 … an on
p(ti|hj) = prob(On+1 = o1, …, On+k = ok | a1 o1 … an on, An+1 = a1, …, An+k = ak)
tests or experiments…
Again, this construct IS the system (not a model)
System Dynamics Matrix
[Same matrix: rows are histories (only those histories that can happen), columns are tests ti, entries p(ti|hj).]
All rows are determined uniquely by the first row
Linear dimension of a dynamical system is the rank (say N) of its
system dynamics matrix (only consider finite-rank systems here)
Any model must be able to generate this matrix
System Dynamics Matrix
Core tests: Q = {q1 q2 … qN}
[Same matrix, with the columns for the core tests q1 … qN highlighted; for any history h these give the prediction vector p(Q|h).]
For any test t and history h: p(t|h) = p(Q|h)ᵀ mt ; note that mt is independent of h!
Prediction for any test is a linear combination of the
predictions of the core tests
nth-order Markov Models
[Same matrix view: the model parameters are the columns for all length-one tests, and there are at most as many unique rows as there are possible n-length histories (let's say K).]
Theorem: All K-history (nth-order) Markov models are dynamical
systems of linear-dimension <= K
K-history Markov models…
• Theorem: there exist dynamical systems of linear-dimension N that cannot be modeled by any finite-order Markov model
Consider a system in which the first observation
determines which of two sub-systems is entered…
POMDPs…
[Same graphical-model figure as before: hidden nominal-states st, st+1, st+2, actions at, at+1, transition matrix T, observations ot, ot+1, ot+2, observation matrix O; belief-states are distributions over hidden nominal-states.]
Learning POMDP models from data (EM)
does not work very well; almost no applications
POMDPs…
• n underlying or nominal-states
• State representation for any history h (belief-state): b(h), a probability distribution over nominal-states
• Update parameters
– Transition probabilities Ta (one for every a); observation probabilities Oao (one for every a,o)
– Initial belief state b(h0)
• b(hao) = b(h) Ta Oao / Z = b(h) Bao / Z
• For t = a1o1…akok: p(t|h) = b(h) Ta1 Oa1o1 … Tak Oakok = b(h) Bt
predictions are linear in belief-states
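A minimal numpy sketch of the belief update b(hao) = b(h) Ta Oao / Z above, assuming T[a] is the transition matrix for action a and O[a][o] is the diagonal matrix of observation probabilities; names are illustrative:

```python
import numpy as np

def belief_update(b, T, O, a, o):
    """b(hao) = b(h) T_a O_ao / Z, with B_ao = T_a O_ao."""
    unnormalized = b @ T[a] @ O[a][o]          # row vector b(h) times B_ao
    return unnormalized / unnormalized.sum()   # Z is the normalizer prob(o | h, a)
```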
POMDPs
[System dynamics matrix D for a POMDP: the row for history h0 has entries p(t1|h0) = b(h0)Bt1, …, p(ti|h0) = b(h0)Bti; the row for history hj has entries p(t1|hj) = b(hj)Bt1, …, p(ti|hj) = b(hj)Bti.]
The row for any history h (e.g., h2) is a linear combination, with weights b(h), of the n rows whose belief-states are unit-basis vectors
• Theorem: Every POMDP with ‘n’ nominal states is a
dynamical system of linear-dimension ≤ n
POMDPs
[Same matrix D, with entries b(h)Bt.]
Theorem: there exist dynamical systems of finite linear-dimension
that cannot be modeled by any finite nominal-state POMDP
Intuition: a POMDP is restricted to positive linear combinations…
PSRs
[Same system dynamics matrix, with the columns for the core tests q1 q2 . . . qN highlighted.]
Core tests Q = {q1 q2 . . . qN}
State representation: p(Q|h) = [p(q1|h) . . . p(qN|h)]
PSRs
[Same matrix: every entry is a linear function of the core-test predictions for its row, e.g. p(t1|h1) = p(Q|h1) mt1, p(t1|hj) = p(Q|hj) mt1, and in general p(t|h) = p(Q|h) mt. The “?” in the figure asks how to get from the row for history h to the row for history hao.]
p(Q|h) is a sufficient statistic for history h
Updating Linear PSRs
• Update core test qi on taking action a and observing o in history h:
p(qi|hao) = p(aoqi|h) / p(ao|h) = p(Q|h) maoqi / p(Q|h) mao
• Note: one only needs parameters (the m vectors) for the 1-step tests ao and the
one-step extensions aoqi of the core tests!
m’s can have negative entries!!
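A minimal numpy sketch of that update, assuming the model parameters are stored as a vector m_ao for each 1-step test ao and a matrix M_ao whose columns are the vectors m_aoq1, …, m_aoqN for the 1-step extensions of the core tests; names are illustrative:

```python
import numpy as np

def psr_update(pQh, m_ao, M_ao):
    """p(Q|hao) from p(Q|h): p(q_i|hao) = (p(Q|h) . m_aoq_i) / (p(Q|h) . m_ao)."""
    numerator = pQh @ M_ao            # predictions of the extended tests ao q_i
    denominator = pQh @ m_ao          # prediction of the 1-step test ao
    return numerator / denominator    # the m's may contain negative entries, yet this stays a prediction vector
```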
Update Parameters…
[Same matrix: the model parameters are the m vectors for all 1-step tests (a1o1, …, ajoj, …) and for all 1-step extensions of the core tests (a1o1q1, …, ajojqN, …).]
Linear PSRs
• Theorem: Every discrete-time dynamical
system of linear-dimension ‘n’ is equivalent to
a linear PSR with ‘n’ core tests
Actions…
Actions?
• MDPs: Assume uniform time-scale of actions
• We can clearly plan at all kinds of temporal
scales including variable temporal scales
simultaneously
• Macro-Actions
– Options (Sutton, Precup & Singh; picked up by
many others)
– MAXQ (Dietterich; picked up by many others)
– HAMs (Parr & Russell; picked up by others)

Options
A generalization of actions to include courses of action
An option is a triple o = ⟨I, π, β⟩
• I ⊆ S is the set of states in which o may be started
• π : S × A → [0,1] is the policy followed during o
• β : S → [0,1] is the probability of terminating in each state
Option execution is assumed to be call-and-return
Example: docking
I : all states in which charger is in sight
π : hand-crafted controller
β : terminate when docked or charger not visible
Options can take a variable number of steps
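A minimal sketch of the option triple and call-and-return execution, assuming an environment interface env_step(s, a) -> (s_next, r); all names are illustrative:

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation: Set         # I: states in which the option may be started
    policy: Callable        # pi(s) -> action followed while the option runs
    termination: Callable   # beta(s) -> probability of terminating in s

def run_option(env_step, s, option, gamma=0.9):
    """Call-and-return execution; accumulates the discounted reward along the way."""
    assert s in option.initiation
    total, discount, k = 0.0, 1.0, 0
    while True:
        s, r = env_step(s, option.policy(s))
        total += discount * r
        discount *= gamma
        k += 1
        if random.random() < option.termination(s):
            return s, total, k   # termination state, discounted return, duration (variable number of steps)
```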
Rooms Example
[Gridworld figure: 4 rooms connected by 4 hallways; multi-step options O1, O2 lead to hallways; candidate goal locations marked G?]
4 unreliable primitive actions (up, down, left, right) that fail 33% of the time
8 multi-step options (to each room's 2 hallways)
Given goal location, quickly plan shortest route
Goal states are given a terminal value of 1
All rewards zero
γ = .9
Options define a Semi-Markov Decision Process (SMDP)
[Figure: three time lines — MDP: discrete time, homogeneous discount; SMDP: continuous time, discrete events, interval-dependent discount; options over an MDP: discrete time, overlaid discrete events, interval-dependent discount.]
A discrete-time SMDP overlaid on an MDP
Can be analyzed at either level
MDP + Options = SMDP
Theorem:
For any MDP,
and any set of options,
the decision process that chooses among the options,
executing each to termination,
is an SMDP.
Thus all Bellman equations and DP results extend for value
functions over options and models of options
(cf. SMDP theory).
What does the SMDP connection
give us?
• Policies over options: μ : S × O → [0,1]
• Value functions over options: Vμ(s), Qμ(s,o), VO*(s), QO*(s,o)
• Learning methods: Bradtke & Duff (1995), Parr (1998)
• Models of options
• Planning methods: e.g. value iteration, policy iteration, Dyna...
• A coherent theory of learning and planning with courses of
action at variable time scales, yet at the same level
A theoretical foundation for what we really need!
But the most interesting issues are beyond SMDPs...
Models of Options
Knowing how an option is executed is not enough for reasoning about
it, or planning with it. We need information about its consequences
The model of the consequences of starting option o in state s has:
• a reward part
r_s^o = E{ r_1 + γ r_2 + ... + γ^{k-1} r_k | s_0 = s, o taken in s_0, lasts k steps }
• a next-state part
p^o_{ss'} = E{ γ^k δ_{s_k s'} | s_0 = s, o taken in s_0, lasts k steps }
where δ_{s_k s'} = 1 if s' = s_k is the termination state, 0 otherwise
This form follows from SMDP theory. Such models can be used
interchangeably with models of primitive actions in Bellman equations.
Room Example
[Same rooms gridworld as before: 4 rooms, 4 hallways, options O1, O2, candidate goals G?]
4 unreliable primitive actions (up, down, left, right) that fail 33% of the time
8 multi-step options (to each room's 2 hallways)
Given goal location, quickly plan shortest route
Goal states are given a terminal value of 1
All rewards zero
γ = .9
Example: Synchronous Value Iteration
Generalized to Options
Initialize: V_0(s) = 0 for all s ∈ S
Iterate: V_{k+1}(s) = max_{o∈O} [ r_s^o + Σ_{s'∈S} p^o_{ss'} V_k(s') ] for all s ∈ S
The algorithm converges to the optimal value function, given the options:
lim_{k→∞} V_k = VO*
Once VO* is computed, πO* is readily determined.
If O = A, the algorithm reduces to conventional value iteration
If A ⊆ O, then VO* = V*
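A minimal sketch of this iteration, assuming option models stored as r_opt[o][s] (the reward part r_s^o) and p_opt[o][s][s'] (the next-state part p^o_{ss'}, which already folds in the γ^k discounting); indexing and names are illustrative:

```python
import numpy as np

def smdp_value_iteration(r_opt, p_opt, n_states, options, tol=1e-8):
    """V_{k+1}(s) = max_o [ r_s^o + sum_{s'} p^o_{ss'} V_k(s') ]."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([max(r_opt[o][s] + p_opt[o][s] @ V for o in options)
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new  # converges to V*_O; with one-step options this is ordinary value iteration
        V = V_new
```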
Rooms Example
Example with Goal ≠ Subgoal,
both primitive actions and options
What does the SMDP connection
give us?
• Policies over options: μ : S × O → [0,1]
• Value functions over options: Vμ(s), Qμ(s,o), VO*(s), QO*(s,o)
• Learning methods: Bradtke & Duff (1995), Parr (1998)
• Models of options
• Planning methods: e.g. value iteration, policy iteration, Dyna...
• A coherent theory of learning and planning with courses of
action at variable time scales, yet at the same level
A theoretical foundation for what we really need!
But the most interesting issues are beyond SMDPs...
Advantages of Dual MDP/SMDP View
At the SMDP level
Compute value functions and policies over options
with the benefit of increased speed / flexibility
At the MDP level
Learn how to execute an option for achieving a
given goal
Between the MDP and SMDP level
Improve over existing options (e.g. by terminating early)
Learn about the effects of several options in parallel, without
executing them to termination
Between MDPs and SMDPs
• Termination Improvement
Improving the value function by changing the termination
conditions of options
• Intra-Option Learning
Learning the values of options in parallel, without executing them
to termination
Learning the models of options in parallel, without executing
them to termination
• Tasks and Subgoals
Learning the policies inside the options
Termination Improvement
Idea: We can do better by sometimes interrupting ongoing options
- forcing them to terminate before β says to
Theorem: For any policy over options μ : S × O → [0,1],
suppose we interrupt its options one or more times, when
Qμ(s,o) < Qμ(s, μ(s)), where s is the state at that time
and o is the ongoing option,
to obtain μ' : S × O' → [0,1].
Then μ' ≥ μ (it attains more or equal reward everywhere)
Application: Suppose we have determined QO* and thus πO*.
Then μ' is guaranteed better than πO*
and is available with no additional computation.
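A minimal sketch of the interruption rule, assuming Q-values over options Q[s][o] and a function giving the option the policy would choose in s; illustrative only:

```python
def maybe_interrupt(Q, s, ongoing_option, policy_option):
    """Interrupt the ongoing option whenever continuing it looks worse than re-choosing."""
    return Q[s][ongoing_option] < Q[s][policy_option(s)]
```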
Landmarks Task
Task: navigate from S to G as
fast as possible
4 primitive actions, for taking
tiny steps up, down, left, right
7 controllers for going straight
to each one of the landmarks,
from within a circular region
where the landmark is visible
In this task, planning at the level of primitive actions is
computationally intractable; we need the controllers
Termination Improvement for
Landmarks Task
Allowing early termination based on models
improves the value function at no additional
cost!
Intra-Option Learning Methods
for Markov Options
Idea: take advantage of each fragment of experience
SMDP Q-learning:
• execute option to termination, keeping track of reward along
the way
• at the end, update only the option taken, based on reward and
value of state in which option terminates
Intra-option Q-learning:
• after each primitive action, update all the options that could have
taken that action, based on the reward and the expected value
from the next state on
Proven to converge to correct values, under same assumptions
as 1-step Q-learning
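A minimal sketch of the intra-option Q-learning update after one primitive step (s, a, r, s'), assuming a tabular Q over (state, option) pairs and Markov options with deterministic internal policies (reusing the illustrative Option objects sketched earlier):

```python
def intra_option_q_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Update every option whose policy would have taken action a in state s."""
    for o in options:
        if o.policy(s) != a:
            continue
        # target: continue o with prob (1 - beta), or terminate and re-choose greedily
        beta = o.termination(s_next)
        u = (1 - beta) * Q[s_next][o] + beta * max(Q[s_next][op] for op in options)
        Q[s][o] += alpha * (r + gamma * u - Q[s][o])
    return Q
```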
Summary: Benefits of Options
• Transfer
– Solutions to sub-tasks can be saved and reused
– Domain knowledge can be provided as options and subgoals
• Potentially much faster learning and planning
– By representing action at an appropriate temporal scale
• Models of options are a form of knowledge
representation
– Expressive
– Clear
– Suitable for learning and planning
• Much more to learn than just one policy, one set of values
– A framework for “constructivism” – for finding models of
the world that are useful for rapid planning and learning
Rewards…
Rewards?
• Where do rewards come from?
– In narrow engineering tasks this may be clear
– But even in such tasks, it need not be (Andrew Ng)
• Internal rewards!
– Breaking the confound (Singh, Lewis, Sorg, Barto)
– Specific internal rewards
• Prediction errors (Schmidhuber & many others)
• Surprise/Novelty/Curiosity (Oudeyer, Kaplan,
Schmidhuber, Singh, Barto & many others)
Power and generality of RL
One reason RL is powerful
• Reward functions permit specification of what the
agent is to do, independently of how it is to do it
• RL theory and algorithms are insensitive to the source
of rewards – hence their generality
CogSci: This generality also defers questions about
the nature of reward functions: RL is focused on
post-reward algorithms
ML/AI: This generality comes at the cost of a
detrimental preferences-parameters confound
Preferences-Parameters Confound
• The reward function confounds two roles
simultaneously in RL agents
1. (Preferences) It expresses the agent-designer’s
preferences over behaviors
2. (Parameters) Through the reward hypothesis it
expresses the RL agent’s goals/purposes and
becomes parameters of actual agent behavior
These roles are distinct; there is no a priori
reason for them to be confounded
Example - Rewards as Parameters
• Q-learning
– (Usual) Parameters
• Initial Q-value function Q0: S×A -> scalars
• Learning rate, α; exploration rate, ε
– Loop
• Current state st
• Choose action at (greedy with prob. 1 – ε; random else)
• Next state st+1; receive reward rt+1
• Update: Qt+1(st,at) = (1 - α)Qt(st,at) + α(rt+1 + γ maxb Qt(st+1,b))
Reward function is also a parameter
(perhaps the most influential parameter!)
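A minimal sketch of that loop, assuming an environment interface env_step(s, a) -> (s_next, r); note that the reward returned by env_step enters the update exactly as a parameter of behavior:

```python
import random
from collections import defaultdict

def q_learning(env_step, s0, actions, n_steps, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                        # Q0: initial Q-values (here all zero)
    s = s0
    for _ in range(n_steps):
        if random.random() < epsilon:             # explore with probability epsilon
            a = random.choice(actions)
        else:                                     # else exploit (greedy)
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next, r = env_step(s, a)                # reward r acts as a parameter of behavior
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s_next
    return Q
```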
Disentangling the PP Confound
• There are two reward functions
1) Agent-designer's reward U : behavior -> scalar
2) Agent's reward: RI (in additive form)
• Sets up a meta-optimization problem
– Agent G(RI; Θ); environment E; agent-designer's U
– V(RI) = E [ U(behavior) | G(RI; Θ) acts in E ]
– R*I = argmax_{RI ∈ Rewards} V(RI)
– (E could be a distribution over environments)
Sanity Check
• Theorem: (proof by definition ♪♫♪♬…)
Provided the search space of rewards includes the
agent-designer’s reward, the agent using the best
internal reward can do no worse than the agent
using the agent-designer’s reward
• Disentangling the PP confound cannot hurt!
Natural Agents
• Where do Rewards come from?
– Agent designer is evolution
– Evolution’s utility function is the fitness function
– Natural agents are RL agents
• Neuroscience evidence backing this up in terms of dopamine as
RL-reward
– Evolution designs the natural agent’s reward function.
– Agent learns a secondary reward function or value function
using RL to adapt to its specific environment.
– Two time scales: evolution & within-lifetime
Natural Agents
• Reward functions are an important locus of
adaptation in adaptive agents; they are a key
mechanism for converting distal pressures on fitness
into proximal pressures on behavior.
• It is possible to precisely formulate this adaptation as
a search over possible reward functions, in which
reward functions are evaluated in terms of their
fitness-conferring abilities
Implicit in above: reward is not fitness – reward captures fitness
pressures but is simultaneously a locus of knowledge about
interactions of environment regularities and agent structure
Computational Experiments
Solving the Meta-Problem
• Loop (Select a next reward to explore)
– Loop (Sample an environment from distribution over
environments)
• Initialize agent with selected reward as parameter
• Generate behavior and compute fitness/agent-designer’s
utility
– Average across samples to get value of reward to
agent-designer/evolution
• Return best reward found
• More generally, a mapping from
multidimensional agent rewards to scalar value
for agent-designer/evolution
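A minimal sketch of this meta-optimization loop; candidate_rewards, sample_environment, build_agent, and run_lifetime_fitness are hypothetical helpers standing in for the pieces described above, not the experimental code:

```python
def optimize_internal_reward(candidate_rewards, sample_environment,
                             build_agent, run_lifetime_fitness, n_samples=30):
    """Return the internal reward with the highest mean designer utility / fitness."""
    best_reward, best_value = None, float("-inf")
    for reward in candidate_rewards:                   # outer loop: rewards to explore
        total = 0.0
        for _ in range(n_samples):                     # inner loop: sampled environments
            env = sample_environment()
            agent = build_agent(reward)                # reward enters the agent as a parameter
            total += run_lifetime_fitness(agent, env)  # designer's utility of the behavior
        value = total / n_samples
        if value > best_value:
            best_reward, best_value = reward, value
    return best_reward
```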
Experiment: Fish-or-Bait
E: Fixed locations for fish and bait
A: (actions) movement actions, eat, carry
A: (observations) location; food/bait when at those locations; hunger-level; carrying-status
Bait can be carried or eaten
Fish can be eaten only if bait is carried by the agent
Eat fish -> not-hungry for 1 step
Eat bait -> med-hungry for 1 step; else hungry
Fitness: F(h) increments by 1.0 for each fish eaten and by 0.04 for each bait eaten
Reward Space
Reward features: hunger-level (3 values)
Multiple experiments: for varying lifetimes/horizons
[Plots (annotations): lifetime length at which the agent has enough time to learn to eat fish with the designer's reward; lifetime length at which the agent has enough time to learn to eat fish with the internal reward ("> by 3.0"); one plot marked "small mitigation effect", the other "full mitigation effect".]
More results
Time it takes the best reward at 26,000
to learn to start to fish
Change in Agent Rewards
Agent rewards are not “nice” transformations of the agent-designer’s reward!
ML: The PP Confound
• Conjecture:
– Stronger Theorem (cartoon form here)
For Bounded agents, the best internal reward
obtained via solving the meta-optimization
problem will strictly outperform the confounded
reward.
• Corollary: Confounding is fine in unbounded
agents.
• Internal rewards mitigate agent boundedness
– Empirical work thus far with one theoretical result
Mitigation?
[Figure, axis: increasing agent-designer utility — unbounded agent with confounded reward; bounded agent with best internal reward; bounded agent with confounded reward.]
Experiment: Foraging
E: A worm, when eaten, disappears; a new worm appears at a random location
A: (actions) movement, eat
A: (observations) location; whether it is hungry; it cannot see where the worm is unless at the worm's location
A: is not-hungry for 1 step on eating a worm
Model-based learning agent: builds an MDP model from observation experience and always acts greedily
Fitness: F(h) increments by one for every worm eaten
Mitigating Agent-State Boundedness
• Bound: Agent has limited state information
• Contrary to most RL tasks, the agent has to
persistently explore (not converge to a policy)
• Reward space: linear function of two features (see the sketch below)
– (Real) Inverse-recency, i.e., the inverse of how long
ago the agent last executed the action in the state
– (Binary) Hunger-level
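A minimal sketch of such a linear internal reward over the two features; the specific feature encodings (a 0/1 hunger indicator, 1/steps for inverse-recency) are illustrative assumptions, and the β weights are what the meta-search over rewards selects:

```python
def internal_reward(beta_hunger, beta_recency, hungry, steps_since_action_in_state):
    """Linear internal reward over the two features described above (illustrative encodings)."""
    inverse_recency = 1.0 / max(1, steps_since_action_in_state)  # assumed inverse-recency encoding
    hunger_feature = 1.0 if hungry else 0.0                      # assumed binary hunger encoding
    return beta_hunger * hunger_feature + beta_recency * inverse_recency
```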
Mitigating Agent-State Boundedness

Reward Type      βhunger   βrecency   Asymptotic Agent-Designer Utility per 10,000 time steps
Random           0         0          98
Agent-designer   1         0          0.16
Best Agent       0.0123    0.999      754
Bayes-Optimal    –         –          1543
Experiment: Foraging
• Same foraging domain
• Agent can see worm’s location (and so no
agent state boundedness)
• Agent learns an MDP model (recall worm
appears in random locations)
• Agent can only do depth-limited planning
• Separate experiment for each value of depth
Mitigating Depth-Limited Planning
Internal Reward (Lessons so far)
1. RL (and indeed decision-theoretic approaches to AI) suffers from the Preferences-Parameters confound. In fact, there are two reward functions: a) the agent-designer's reward, and b) the agent's reward
2. Disentangling the confound creates a specific meta-optimization problem
3. Solving the meta-problem mitigates agent-boundedness
4. The agent's reward function need not be "nicely" related to the agent-designer's reward (e.g., non-monotonic transformations)
5. Good agent rewards are sensitive to details (e.g., type) of agent-boundedness as well as the distribution over environments (capturing invariances across environments).
6. Small changes in the agent's reward can lead to large changes in agent behavior and thus large changes in the utility obtained by the agent-designer.

Distribution over MDPs
• MDP = ⟨S, A, R_θ, P_θ, γ⟩
• Beliefs b(θ)
• Bayesian Planning (Intractable)
Q(⟨s,b⟩, a) = R_b(s,a) + γ Σ_{s'} P_b(s'|s,a) V(⟨s', b'⟩), where b' is the posterior updated after (s,a,s')
• Mean-MDP Planning
Q_b(s,a) = R_b(s,a) + γ Σ_{s'} P_b(s'|s,a) V_b(s')
Mean-MDP + Reward Bonus
R̃_b(s,a) = R_b(s,a) + R̂_b(s,a)
Q̃_b(s,a) = R̃_b(s,a) + γ Σ_{s'} P_b(s'|s,a) Ṽ_b(s')
Candidates for reward bonus
• 1/n (Kolter & Ng )
• 1/√n (MBIE)
• Variance of Posterior
Variance of Posterior
• Variance of transitions and rewards
σ²_{R_b}(s,a) = Σ_θ R_θ(s,a)² b(θ) − R_b(s,a)²
σ²_{P_b}(s'|s,a) = Σ_θ P_θ(s'|s,a)² b(θ) − P_b(s'|s,a)²
• Reward bonus
R̂_b(s,a) = β_R σ_{R_b}(s,a) + β_P Σ_{s'} σ_{P_b}(s'|s,a)
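A minimal sketch of computing this variance-based bonus when the posterior over P(·|s,a) is, for instance, a Dirichlet with counts α; the Dirichlet choice and passing in the reward-posterior's standard deviation sigma_R are illustrative assumptions, not the slide's code:

```python
import numpy as np

def variance_bonus(alpha_counts, sigma_R, beta_R=1.0, beta_P=1.0):
    """R_hat(s,a) = beta_R * sigma_R(s,a) + beta_P * sum_{s'} sigma_P(s'|s,a),
    with an assumed Dirichlet(alpha) posterior over the transition distribution P(.|s,a)."""
    a = np.asarray(alpha_counts, dtype=float)
    a0 = a.sum()
    var_P = a * (a0 - a) / (a0 ** 2 * (a0 + 1.0))  # Dirichlet marginal variances of each P(s'|s,a)
    return beta_R * sigma_R + beta_P * np.sqrt(var_P).sum()
```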
Variance of Posterior as Reward
Theorem: A (staged version) of the algorithm
will follow an ε-optimal policy from its current
state on all but a number of steps polynomial in
1/ε, 1/δ, |S|, and |A|, with probability at least 1 – δ.
In essence, variance-of-posterior based additive
internal reward mitigates the limitation of
mean-MDP planning.
Hunt-the-Wumpus World
Wumpus Results

Method     Parameter   Objective Reward / Episode
Variance   0.24        0.508
1/n        0.012       0.293
1/√n       0.012       0.291
BOSS       K = 20      0.183
Mean-MDP   N/A         0.266
Random Thought
• The focus here has been on inference (and
there is tremendous progress)
• Often cognitive science tasks require actions.
• To the extent that agents are bounded
(relative to the task?), the emphasis on inference
may be misplaced
– Focusing computational resources on good
inference may not be the best use for good decision-making