Forward Models in Reinforcement Learning

Reinforcement Learning
Kenji Doya
[email protected]
ATR Human Information Science Laboratories
CREST, Japan Science and Technology Corporation
Outline
Introduction to Reinforcement Learning (RL)
Markov decision process (MDP)
Current topics
RL in Continuous Space and Time
Model-free and model-based approaches
Learning to Stand Up
Discrete plans and continuous control
Modular Decomposition
Multiple model-based RL (MMRL)
Learning to Walk (Doya & Nakano, 1985)
Action: cycle of 4 postures
Reward: speed sensor output
Multiple solutions: creeping, jumping,…
Markov Decision Process (MDP)
Environment
reward r
dynamics P(s’|s,a)
reward P(r|s,a)
action a
agent
environment
Agent
state s
policy P(a|s)
Goal: maximize cumulative future rewards
E[ r(t+1) + g r(t+2) + …]
0≤g≤1: discount factor
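As a quick illustration of the discounted return above, here is a minimal Python sketch (the reward sequence and the value of g are made-up placeholders):

    # Discounted return E[ r(t+1) + g r(t+2) + ... ] for one sample reward sequence.
    # The rewards and the discount factor g are illustrative only.
    rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # r(t+1), r(t+2), ...
    g = 0.9                               # discount factor, 0 <= g <= 1
    discounted_return = sum(g**k * r for k, r in enumerate(rewards))
    print(discounted_return)              # 0.9**2 * 1.0 + 0.9**4 * 2.0 = 2.1222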
Value Function and TD error
State value function
V(s) = E[ r(t+1) + g r(t+2) + …| s(t)=s, P(a|s)]
0≤g≤1: discount factor
Consistency condition
d(t) = r(t) + g V(s(t)) - V(s(t-1)) = 0
new estimate - old estimate
Dual role of temporal difference (TD) error d(t)
Reward prediction: d(t) → 0 on average
Action selection: d(t) > 0 means better than average
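A minimal Python sketch of the TD error defined above; the reward and value estimates are placeholders:

    # TD error d(t) = r(t) + g V(s(t)) - V(s(t-1)); all numbers are placeholders.
    def td_error(r_t, v_curr, v_prev, g=0.9):
        # d > 0: outcome better than predicted; d < 0: worse; d -> 0 once V is accurate
        return r_t + g * v_curr - v_prev

    d = td_error(r_t=0.2, v_curr=1.0, v_prev=0.5)
    print(d)   # 0.2 + 0.9*1.0 - 0.5 = 0.6 > 0, i.e. better than the current prediction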
Example: Navigation
Reward field
Value function
[Value function surfaces over the 2-D state space for g = 0.9 and g = 0.5]
Actor-Critic Architecture
critic: V(s)
reward r
TD error d
actor: P(a|s)
action a
environment
state s
Critic: future reward prediction
update value: ΔV(s(t-1)) ∝ d(t)
Actor: action reinforcement
increase P(a(t-1)|s(t-1)) if d(t) > 0 (see the sketch below)
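A tabular actor-critic sketch of the two updates above; the table sizes, step sizes, and the softmax parameterization of P(a|s) are illustrative assumptions, not the original implementation:

    import numpy as np

    # Critic: V(s) updated by the TD error; actor: action preferences define P(a|s).
    n_states, n_actions = 5, 2
    V = np.zeros(n_states)                   # critic
    prefs = np.zeros((n_states, n_actions))  # actor preferences (softmax gives P(a|s))
    alpha_v, alpha_p, g = 0.1, 0.1, 0.9

    def policy(s):
        p = np.exp(prefs[s] - prefs[s].max())
        return p / p.sum()                   # P(a|s)

    def actor_critic_step(s_prev, a_prev, r, s_curr):
        d = r + g * V[s_curr] - V[s_prev]     # TD error d(t)
        V[s_prev] += alpha_v * d              # critic: update value in proportion to d(t)
        prefs[s_prev, a_prev] += alpha_p * d  # actor: reinforce a(t-1) when d(t) > 0
        return d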
Q Learning
Action value function
Q(s,a) = E[ r(t+1) + g r(t+2) + …|
s(t)=s, a(t)=a, P(a|s)]
= E[ r(t+1) + g V(s(t+1))| s(t)=s, a(t)=a]
Action selection
a(t) = argmax_a Q(s(t),a) with prob. 1−ε (ε-greedy)
Update (both rules are sketched in code below)
Q-learning: Q(s(t),a(t)) := r(t+1) + g max_a Q(s(t+1),a)
SARSA: Q(s(t),a(t)) := r(t+1) + g Q(s(t+1),a(t+1))
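A sketch of the two update rules; a learning rate alpha is added (the slide writes ':=' as a full replacement) and the table size is an arbitrary placeholder:

    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    alpha, g = 0.1, 0.9

    def q_learning_update(s, a, r, s_next):
        target = r + g * Q[s_next].max()       # off-policy: best next action
        Q[s, a] += alpha * (target - Q[s, a])

    def sarsa_update(s, a, r, s_next, a_next):
        target = r + g * Q[s_next, a_next]     # on-policy: action actually taken
        Q[s, a] += alpha * (target - Q[s, a])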
Dynamic Programming and RL
Dynamic Programming
given models P(s’|s,a) and P(r|s,a)
off-line solution of Bellman equation
V*(s) = max_a [ Σ_r r P(r|s,a) + g Σ_s' V*(s') P(s'|s,a) ]
Reinforcement Learning
on-line learning with TD error
d(t) = r(t) + g V(s(t)) − V(s(t-1))
ΔV(s(t-1)) = α d(t)
ΔQ(s(t-1),a(t-1)) = α d(t)
Model-free and Model-based RL
Model-free: e.g., learn action values
Q(s,a) := r(s,a) + g Q(s’,a’)
a = argmax_a Q(s,a)
Model-based: forward model P(s’|s,a)
action selection:
a = argmax_a E[ R(s,a) + g Σ_s' V(s') P(s'|s,a) ]
simulation: learn V(s) and/or Q(s,a) off-line
dynamic programming: solve Bellman eq.
V(s) = max_a E[ R(s,a) + g Σ_s' V(s') P(s'|s,a) ]
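A minimal sketch of the model-based route: with P(s'|s,a) and expected rewards R(s,a) in hand, the Bellman equation can be solved off-line by value iteration. The random P and R below are placeholders standing in for a learned forward model:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, g = 4, 2, 0.9
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s,a)
    R = rng.normal(size=(n_states, n_actions))                        # E[r|s,a]

    V = np.zeros(n_states)
    for _ in range(1000):
        Q = R + g * P @ V           # Q(s,a) = R(s,a) + g sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)       # V(s) = max_a Q(s,a)
        if np.abs(V_new - V).max() < 1e-8:
            break
        V = V_new
    policy = Q.argmax(axis=1)       # greedy action selection from the model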
Current Topics
Convergence proofs
with function approximators
Learning with hidden states: POMDP
estimate belief states
reactive, stochastic policy
parameterized finite-state policies
Hierarchical architectures
learn to select fixed sub-modules
train sub-modules
both
Partially Observable Markov
Decision Process (POMDP)
Update the belief state
observation P(o|s): not identity
belief state b = (P(s1), P(s2), …): real-valued
P(s_k|o) ∝ P(o|s_k) Σ_i P(s_k|s_i, a) P(s_i)
Tiger Problem
(Kaelbing et al., 1998)
state: a tiger is in {left,right}
action: {left, right, listen}
observation with 15% error
policy tree
finite state policy
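A minimal sketch of the belief update applied to a two-state tiger problem with the 15% observation error from the slide; the dictionary layout and the restriction to the 'listen' action are simplifying assumptions:

    import numpy as np

    # Belief update: b'(s') ∝ P(o|s',a) * sum_s P(s'|s,a) b(s)
    def belief_update(b, a, o, T, O):
        predicted = T[a].T @ b                  # sum_s P(s'|s,a) b(s)
        unnormalized = O[a][:, o] * predicted   # times P(o|s',a)
        return unnormalized / unnormalized.sum()

    T = {"listen": np.eye(2)}                   # listening does not move the tiger
    O = {"listen": np.array([[0.85, 0.15],      # P(hear-left | tiger-left), etc.
                             [0.15, 0.85]])}
    b = np.array([0.5, 0.5])                    # (P(tiger-left), P(tiger-right))
    b = belief_update(b, "listen", 0, T, O)     # observation 0 = "hear left"
    print(b)                                    # [0.85, 0.15]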
Outline
Introduction to Reinforcement Learning (RL)
Markov decision process (MDP)
Current topics
RL in Continuous Space and Time
Model-free and model-based approaches
Learning to Stand Up
Discrete plans and continuous control
Modular Decomposition
Multiple model-based RL (MMRL)
Why Continuous?
Analog control problems
discretization → poor control performance
how to discretize?
Better theoretical properties
differential algorithms
use of local linear models
Continuous TD learning
Dynamics
dx/dt = f(x, u)
Value function
V(x(t)) = ∫_t^∞ e^−((s−t)/τ) r(x(s), u(s)) ds
TD error
d(t) = r(t) − (1/τ) V(t) + dV/dt
Discount factor
g = 1 − Δt/τ (equivalently τ = Δt/(1 − g)), so g → 1 as Δt → 0
Gradient Policy
u(t) = g( (∂V/∂x)(∂f/∂u) ) evaluated at x(t), with g(·) a saturating output function
On-line Learning of State Value
state x=(angle, angular vel.)
V(x)
Example: Cart-pole Swing up
Reward: height of the tip
Punishment: crashing into the wall
Fast Learning by Internal Models
Pole balancing (Stefan Schaal, USC)
Forward model of pole dynamics
Inverse model of arm dynamics
Internal Models for Planning
Devil sticking (Chris Atkeson, CMU)
Outline
Introduction to Reinforcement Learning (RL)
Markov decision process (MDP)
Current topics
RL in Continuous Space and Time
Model-free and model-based approaches
Learning to Stand Up
Discrete plans and continuous control
Modular Decomposition
Multiple model-based RL (MMRL)
Need for Hierarchical Architecture
Performance of control
Many high-precision sensors and actuators
Prohibitively long time for learning
Speed of learning
Search in low-dimensional, low-resolution space
Learning to Stand up
(Morimoto & Doya, 1998)
Reward: height of the head
Punishment: tumble
State: pitch and joint angles, their derivatives
Simulation → many thousands of trials to learn
Hierarchical Architecture
Upper level
discrete state/time
kinematics
action: subgoals
reward: total task
Lower level
continuous state/time
dynamics
action: motor torque
reward: achieving subgoals
[Diagram: the upper level learns Q(S,A) over a sequence of subgoals; the lower level learns V(s) and a continuous policy a = g(s)]
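A structural sketch of the two levels, assuming a discrete upper level that selects subgoal postures with Q(S,A) and a lower level rewarded for reaching the chosen subgoal; the class layout, PD-style controller, and all constants are illustrative assumptions, not the controllers used in the experiments:

    import numpy as np

    class UpperLevel:
        # discrete state/time: Q(S,A) over subgoal postures, rewarded by the total task
        def __init__(self, n_states, n_subgoals, alpha=0.1, g=0.9):
            self.Q = np.zeros((n_states, n_subgoals))
            self.alpha, self.g = alpha, g

        def choose_subgoal(self, S, eps=0.1):
            if np.random.rand() < eps:
                return np.random.randint(self.Q.shape[1])
            return int(self.Q[S].argmax())

        def update(self, S, A, R, S_next):
            target = R + self.g * self.Q[S_next].max()
            self.Q[S, A] += self.alpha * (target - self.Q[S, A])

    class LowerLevel:
        # continuous state/time: motor torque that drives the posture toward the subgoal
        def control(self, angle, velocity, subgoal, kp=5.0, kd=0.5):
            return kp * (subgoal - angle) - kd * velocity

        def subgoal_reward(self, angle, subgoal, tol=0.05):
            return 1.0 if abs(angle - subgoal) < tol else 0.0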
Learning in Simulation
Upper level
subgoals
Lower level
control
early learning
after ~700 trials
Learning with Real Hardware
(Morimoto & Doya, 2001)
after simulation
after ~100 physical trials
Adaptation by lower control modules
Outline
Introduction to Reinforcement Learning (RL)
Markov decision process (MDP)
Current topics
RL in Continuous Space and Time
Model-free and model-based approaches
Learning to Stand Up
Discrete plans and continuous control
Modular Decomposition
Multiple model-based RL (MMRL)
Modularity in Motor Learning
Fast De-adaptation and Re-adaptation
switching rather than re-learning
Combination of Learned Modules
serial/parallel/sigmoidal mixture
‘Soft’ Switching of Adaptive Modules
‘Hard’ switching based on prediction errors
(Narendra et al., 1995)
Can result in sub-optimal task decomposition with
initially poor prediction models.
‘Soft’ switching by ‘softmax’ of prediction errors
(Wolpert and Kawato, 1998)
Can use ‘annealing’ for optimal decomposition.
(Pawelzik et al., 1996)
Responsibility by Competition
[Diagram: modules 1…n each pair a state predictor (predicting the state change dx̂_i/dt) with an RL controller (value V_i(x), TD error d_i(t), policy π_i(x), output u_i(t)). A softmax over the prediction errors, λ_i(t) ∝ exp[−E_i(t)/2σ²], gives each module's responsibility signal, which gates both its output and its learning. The combined action u(t) acts on the environment, which returns the state x(t) and reward r(t).]
Predictor
state x(t), action u(t)
Predicted state change
dx̂_i/dt = f_i(x(t), u(t))
Responsibility
λ_i(t) = exp(−||dx̂_i/dt − dx/dt||² / 2σ²) / Σ_{j=1..n} exp(−||dx̂_j/dt − dx/dt||² / 2σ²)
       = softmax(−||dx̂_i/dt − dx/dt||² / 2σ²)
Weighting of output and learning
u(t) = Σ_{i=1..n} λ_i(t) g_i(x(t))
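A sketch of the responsibility computation and the responsibility-weighted action; the module predictions, observed state change, σ, and module outputs are placeholders:

    import numpy as np

    def responsibilities(pred_xdot, xdot, sigma=0.1):
        # softmax of the scaled, negated squared prediction error of each module
        err = np.sum((pred_xdot - xdot) ** 2, axis=1)   # ||dx̂_i/dt - dx/dt||^2
        logits = -err / (2.0 * sigma ** 2)
        w = np.exp(logits - logits.max())               # subtract max for stability
        return w / w.sum()                              # λ_i(t)

    def mixed_action(lam, module_actions):
        return lam @ module_actions                     # u(t) = Σ_i λ_i(t) g_i(x(t))

    pred_xdot = np.array([[0.9, 0.1], [0.2, 0.0]])      # predictions of two modules
    xdot = np.array([1.0, 0.1])                         # observed state change
    lam = responsibilities(pred_xdot, xdot)
    u = mixed_action(lam, module_actions=np.array([1.0, -1.0]))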
Multiple Linear Quadratic Controllers
Linear dynamic models
dx̂_i/dt = A_i (x(t) − x̄_i) + B_i u(t)
Quadratic reward models
r̂_i(x(t), u(t)) = r̄_i − ½ (x(t) − x̄_i)ᵀ Q_i (x(t) − x̄_i) − ½ u(t)ᵀ R_i u(t)
Value functions
V_i(x(t)) = ∫_t^∞ e^−((s−t)/τ) r(x(s), u(s)) ds ≅ −½ (x(t) − x̄_i)ᵀ P_i (x(t) − x̄_i) + const.
Riccati equation for P_i:
0 = P_i A_i + A_iᵀ P_i − P_i B_i R_i⁻¹ B_iᵀ P_i + Q_i − (1/τ) P_i
Action outputs
u(t) = −Σ_{i=1..n} λ_i(t) K_i (x(t) − x̄_i),  K_i = R_i⁻¹ B_iᵀ P_i
Swing-up control of a pendulum
Red: module 1
Green: module 2
[Video and phase plot of the swing-up trajectory (x1 vs. x2, up to t = 20.0), colored by the responsible module]
Non-linearity and Non-stationarity
Specialization by predictability in space and time
Swing-up control of an ‘Acrobot’
Reward: height of the center of mass
Linearized around four fixed points
Swing-up motions
R=0.001
R=0.002
Module switching
trajectories x(t) R=0.001
R=0.002
responsibility λ_i: symbol-like representation
1-2-1-2-1-3-4-1-3-4-3-4
1-2-1-2-1-2-1-3-4-1-3-4
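A small sketch of how a responsibility time course can be compressed into the symbol-like sequences above: take the most responsible module at each step and keep only the switches. The responsibility matrix here is a random placeholder:

    import numpy as np

    def segment_sequence(lam):
        # lam: (timesteps, modules) responsibilities; returns e.g. "1-2-1-3-4"
        winners = lam.argmax(axis=1) + 1                       # 1-based module index
        keep = np.concatenate(([True], winners[1:] != winners[:-1]))
        return "-".join(str(m) for m in winners[keep])

    lam = np.random.dirichlet(np.ones(4), size=200)            # placeholder data
    print(segment_sequence(lam))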
Stand Up by Multiple Modules
Seven locally linear models
[Videos of two stand-up trials with time courses of the module responsibilities (modules 1–7)]
Segmentation of Observed Trajectory
Predicted motor output
u_i^o(t) = g_i(x^o(t))
Predicted state change
dx̂_i^o/dt = f_i(x^o(t), u_i^o(t))
Predicted responsibility
λ_i^o(t) = exp(−||dx̂_i^o/dt − dx^o/dt||² / 2σ²) / Σ_{j=1..n} exp(−||dx̂_j^o/dt − dx^o/dt||² / 2σ²)
[Diagram: each module's predictor and controller are driven by the demonstrator's state x^o(t); a softmax over the prediction errors gives the predicted responsibilities λ_i^o(t)]
Imitation of Acrobot Swing-up
q1(0) = π/12
q1(0) = π/6
q1(0) = π/12 (imitation)
[Videos of the three swing-up trials]
Outline
Introduction to Reinforcement Learning (RL)
Markov decision process (MDP)
Current topics
RL in Continuous Space and Time
Model-free and model-based approaches
Learning to Stand Up
Discrete plans and continuous control
Modular Decomposition
Multiple model-based RL (MMRL)
Future Directions
Autonomous learning agents
Tuning of meta-parameters
Design of rewards
Selection of necessary/sufficient state coding
Neural mechanisms of RL
Dopamine neurons: encoding TD error
Basal ganglia: value-based action selection
Cerebellum: internal models
Cerebral cortex: modular decomposition
What is Reward for a robot?
Should be grounded in
Self-preservation: self-recharging
Self-reproduction: copying the control program
Cyber Rodent
The Cyber Rodent Project
Learning mechanisms under realistic constraints of
self-preservation and self-reproduction
acquisition of task-oriented internal representation
metalearning algorithms
constraints of finite time and energy
mechanisms for collaborative behaviors
roles of communication
abstract/emotional, concrete/symbolic
gene exchange rules for evolution
Input/Output
Sensory
CCD camera
range sensor
IR proximity x8
acceleration/gyro
microphone x2
Motor
two wheels
jaw
R/G/B LED
speaker
Computation/Communication
CPU: Hitachi SH-4
FPGA image processor
IO modules
Communication
IR port
wireless LAN
Software
learning/evolution
dynamic simulation