Ch 17. Optimal control theory
and the linear Bellman equation
HJ Kappen
BTSM Seminar
12.07.19.(Thu)
Summarized by Joon Shik Kim
Introduction
• Optimising a sequence of actions to attain some
future goal is the general topic of control theory.
• In an example of a human throwing a spear to kill
an animal, a sequence of actions can be assigned a
cost that consists of two terms.
• The first is a path cost that specifies the energy
consumption to contract the muscles.
• The second is an end cost that specifies whether the
spear will kill the animal, just hurt it, or miss it.
• The optimal control solution is a sequence of motor
commands that results in killing the animal by
throwing the spear with minimal physical effort.
Discrete Time Control (1/3)
• $x_{t+1} = x_t + f(t, x_t, u_t), \quad t = 0, 1, \ldots, T-1,$
where $x_t$ is an n-dimensional vector describing the
state of the system and $u_t$ is an m-dimensional vector
that specifies the control or action at time t.
• A cost function assigns a cost to each sequence
of controls:
$C(x_0, u_{0:T-1}) = \Phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t),$
where $R(t, x, u)$ is the cost associated with taking
action u at time t in state x, and $\Phi(x_T)$ is the cost
associated with ending up in state $x_T$ at time T.
Discrete Time Control (2/3)
• The problem of optimal control is to find
the sequence $u_{0:T-1}$ that minimises
$C(x_0, u_{0:T-1})$.
• The optimal cost-to-go
$J(t, x_t) = \min_{u_{t:T-1}} \left( \Phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)$
$\qquad = \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right).$
Discrete Time Control (3/3)
• The algorithm to compute the optimal
control, trajectory, and cost is given
by
• 1. Initialization: $J(T, x) = \Phi(x)$.
• 2. Backwards: For t = T-1, …, 0 and for all x
compute
$u_t^*(x) = \arg\min_u \{ R(t, x, u) + J(t+1, x + f(t, x, u)) \},$
$J(t, x) = R(t, x, u_t^*) + J(t+1, x + f(t, x, u_t^*)).$
• 3. Forwards: For t = 0, …, T-1 compute
$x_{t+1}^* = x_t^* + f(t, x_t^*, u_t^*(x_t^*)).$
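The three steps above can be sketched in a few lines of Python. This is only a minimal illustration on a finite integer state grid; the function names `solve_dp` and `rollout`, the toy dynamics f = u, the path cost u²/2 and the end cost 2|x - 3| are all assumptions chosen for the example and do not come from the slides.

```python
def solve_dp(states, controls, f, R, Phi, T):
    """Backward dynamic programming on a finite set of states.

    Moves that leave `states` are skipped (an assumption of this sketch).
    Returns J[t][x] (optimal cost-to-go) and u_star[t][x] (optimal control).
    """
    J = [dict() for _ in range(T + 1)]
    u_star = [dict() for _ in range(T)]
    # 1. Initialization: J(T, x) = Phi(x)
    for x in states:
        J[T][x] = Phi(x)
    # 2. Backwards: t = T-1, ..., 0
    for t in range(T - 1, -1, -1):
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls:
                x_next = x + f(t, x, u)
                if x_next not in J[t + 1]:
                    continue  # successor state is outside the grid
                cost = R(t, x, u) + J[t + 1][x_next]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            u_star[t][x] = best_u
            J[t][x] = best_cost
    return J, u_star


def rollout(x0, u_star, f, T):
    """3. Forwards: follow the optimal controls starting from x0."""
    xs = [x0]
    for t in range(T):
        x = xs[-1]
        xs.append(x + f(t, x, u_star[t][x]))
    return xs


# Hypothetical toy problem: move on the integers and try to end at x = 3.
states = range(-5, 6)
controls = (-1, 0, 1)
f = lambda t, x, u: u                # dynamics: x_{t+1} = x_t + u_t
R = lambda t, x, u: 0.5 * u * u      # path cost (effort)
Phi = lambda x: 2.0 * abs(x - 3)     # end cost (distance from target)
T = 4

J, u_star = solve_dp(states, controls, f, R, Phi, T)
path = rollout(0, u_star, f, T)
print(J[0][0], path)
```

The backward pass visits every state at every time step, so the cost of this sketch grows with the number of grid states times the number of controls times T.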
The HJB Equation (1/2)
• $J(t, x) = \min_u \left( R(t, x, u)\,dt + J(t + dt, x + f(x, u, t)\,dt) \right)$
$\qquad = \min_u \left( R(t, x, u)\,dt + J(t, x) + \partial_t J(t, x)\,dt + \partial_x J(t, x) f(x, u, t)\,dt \right),$
• $-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\,\partial_x J(t, x) \right)$
(Hamilton-Jacobi-Bellman equation).
• The optimal control at the current x, t is
given by
$u(x, t) = \arg\min_u \left( R(t, x, u) + f(x, u, t)\,\partial_x J(t, x) \right).$
• The boundary condition is
$J(T, x) = \Phi(x).$
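As a worked illustration of the minimisation over u, consider a special case that is assumed here and is not on the slides: one-dimensional dynamics f(x, u, t) = u and a cost R(t, x, u) = V(x) + u²/2, where V is a hypothetical state cost. The minimum can then be carried out in closed form:

```latex
\begin{align*}
-\partial_t J(t,x) &= \min_u \left( V(x) + \tfrac{1}{2}u^2 + u\,\partial_x J(t,x) \right), \\
u^*(x,t) &= -\partial_x J(t,x), \\
-\partial_t J(t,x) &= V(x) - \tfrac{1}{2}\left(\partial_x J(t,x)\right)^2,
\qquad J(T,x) = \Phi(x).
\end{align*}
```

In this assumed case the optimal control is the negative gradient of the cost-to-go, and the resulting partial differential equation is nonlinear in J.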
The HJB Equation (2/2)
Optimal control of a mass on a spring
Stochastic Differential Equations
(1/2)
• Consider the random walk on the line
$x_{t+1} = x_t + \xi_t, \quad \xi_t = \pm\sqrt{\nu},$
with $x_0 = 0$.
• In closed form, $x_t = \sum_{i=1}^{t} \xi_i$.
• $\langle x_t \rangle = 0, \quad \langle x_t^2 \rangle = \nu t$.
• In the continuous time limit we define
$dx_t = x_{t+dt} - x_t = d\xi$ (Wiener process).
• The conditional probability distribution is
$\rho(x, t \mid x_0, 0) = \frac{1}{\sqrt{2\pi\nu t}} \exp\left( -\frac{(x - x_0)^2}{2\nu t} \right).$

Stochastic Optimal Control Theory
(2/2)
• $dx = f(x(t), u(t), t)\,dt + d\xi$
• $d\xi$ is a Wiener process with
$\langle d\xi_i\, d\xi_j \rangle = \nu_{ij}(t, x, u)\,dt$.
• Since $\langle dx^2 \rangle$ is of order dt, we must
make a Taylor expansion up to order $dx^2$.
$-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\,\partial_x J(t, x) + \frac{1}{2} \nu(t, x, u)\,\partial_x^2 J(t, x) \right)$
(Stochastic Hamilton-Jacobi-Bellman equation)
$\langle dx \rangle = f(x, u, t)\,dt$ : drift
$\langle dx^2 \rangle = \nu(t, x, u)\,dt$ : diffusion
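The Taylor expansion step can be written out explicitly. The sketch below is restricted to one dimension for readability (an assumption made here) and uses only the drift and diffusion relations stated above:

```latex
\begin{align*}
J(t+dt, x+dx) &= J(t,x) + \partial_t J\,dt + \partial_x J\,dx
  + \tfrac{1}{2}\,\partial_x^2 J\,dx^2 + \dots \\
\langle J(t+dt, x+dx) \rangle &= J(t,x) + \partial_t J\,dt
  + f(x,u,t)\,\partial_x J\,dt
  + \tfrac{1}{2}\,\nu(t,x,u)\,\partial_x^2 J\,dt + \text{higher order in } dt.
\end{align*}
```

Inserting the second line into $J(t,x) = \min_u \left( R(t,x,u)\,dt + \langle J(t+dt, x+dx) \rangle \right)$, cancelling $J(t,x)$ and dividing by dt gives the stochastic Hamilton-Jacobi-Bellman equation. The second-derivative term survives because $\langle dx^2 \rangle$ is of order dt rather than dt².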
Path Integral Control (1/2)
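• In the problem where the dynamics are linear in the control and
the cost is quadratic in the control, the nonlinear HJB
equation can be transformed into a
linear equation by a log transformation
of the cost-to-go, $J(x, t) = -\lambda \log \psi(x, t)$.
The HJB equation becomes
$-\partial_t \psi(x, t) = \left( -\frac{V}{\lambda} + f^T \nabla + \frac{1}{2} \mathrm{Tr}\left( g \nu g^T \nabla^2 \right) \right) \psi.$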
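The effect of the log transformation can be followed in a one-dimensional sketch. The setup is an assumption of this example, since the slide does not spell it out: dynamics $dx = f\,dt + g\,(u\,dt + d\xi)$ with $\langle d\xi^2 \rangle = \nu\,dt$, and cost $R(t,x,u) = V(x,t) + \tfrac{1}{2} r u^2$, where r is a hypothetical name for the control cost weight.

```latex
\begin{align*}
&\text{Minimising over } u: \quad u^* = -\frac{g}{r}\,\partial_x J, \\
&-\partial_t J = V + f\,\partial_x J - \frac{g^2}{2r}\,(\partial_x J)^2
   + \tfrac{1}{2}\, g^2 \nu\,\partial_x^2 J. \\
&\text{With } J = -\lambda \log\psi: \quad
   \partial_x J = -\lambda\,\frac{\partial_x\psi}{\psi}, \quad
   \partial_x^2 J = -\lambda\,\frac{\partial_x^2\psi}{\psi}
   + \lambda\,\frac{(\partial_x\psi)^2}{\psi^2}. \\
&\text{The terms in } (\partial_x\psi)^2 \text{ sum to }
   \frac{g^2\lambda}{2}\left(\nu - \frac{\lambda}{r}\right)
   \frac{(\partial_x\psi)^2}{\psi^2},
   \text{ which vanishes when } \lambda = \nu r. \\
&\text{What remains is } \quad
   -\partial_t\psi = \left(-\frac{V}{\lambda} + f\,\partial_x
   + \tfrac{1}{2}\, g^2 \nu\,\partial_x^2\right)\psi.
\end{align*}
```

Under these assumptions the nonlinearity cancels only when the noise level and the control cost are tied together through λ = νr, which is the scalar form of the condition that makes the equation for ψ linear.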
 

Path Integral Control (2/2)
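• Let $\rho(y, \tau \mid x, t)$ describe a diffusion process
for $\tau > t$ defined by the Fokker-Planck equation
$\partial_\tau \rho = -\frac{V}{\lambda} \rho - \nabla^T (f \rho) + \frac{1}{2} \mathrm{Tr}\left( \nabla^2 (g \nu g^T \rho) \right). \qquad (1)$
$\psi(x, t) = \int dy\, \rho(y, T \mid x, t) \exp\left( -\Phi(y)/\lambda \right).$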
The Diffusion Process as a Path
Integral (1/2)
• Let's look at the first term in
equation (1) in the previous slide. The first
term describes a process that kills a
sample trajectory at a rate $V(x, t)/\lambda$.
• Sampling process and Monte Carlo estimate:
$dx = f(x, t)\,dt + g(x, t)\,d\xi,$
$x = x + dx$, with probability $1 - V(x, t)\,dt/\lambda$,
$x_i = \dagger$, with probability $V(x, t)\,dt/\lambda$; in this case, the path is killed.
$\psi(x, t) = \int dy\, \rho(y, T \mid x, t) \exp\left( -\Phi(y)/\lambda \right) \approx \frac{1}{N} \sum_{i \in \text{alive}} \exp\left( -\Phi(x_i(T))/\lambda \right).$
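The sampling procedure above can be sketched as a short Monte Carlo routine. This is a one-dimensional illustration with a simple Euler discretisation, not the presenter's implementation; the function name `estimate_psi`, its parameters, and the test problem (no drift, no state cost, quadratic end cost) are assumptions chosen so that the result can be compared with a closed-form Gaussian integral.

```python
import math
import random


def estimate_psi(x0, t0, T, f, g, V, Phi, lam, nu,
                 n_paths=5000, dt=0.02, seed=0):
    """Estimate psi(x0, t0) by sampling the uncontrolled diffusion with killing."""
    rng = random.Random(seed)
    n_steps = int(round((T - t0) / dt))
    total = 0.0
    for _ in range(n_paths):
        x, t, alive = x0, t0, True
        for _ in range(n_steps):
            # Kill the path with probability V(x, t) dt / lambda.
            if rng.random() < V(x, t) * dt / lam:
                alive = False
                break
            # Wiener increment with variance nu * dt.
            d_xi = rng.gauss(0.0, math.sqrt(nu * dt))
            x += f(x, t) * dt + g(x, t) * d_xi
            t += dt
        if alive:
            total += math.exp(-Phi(x) / lam)
    # Killed paths contribute zero, but still count in the average over N.
    return total / n_paths


# Hypothetical test problem: pure diffusion, no killing, quadratic end cost.
x0, t0, T, lam, nu = 0.5, 0.0, 1.0, 1.0, 1.0
psi = estimate_psi(x0, t0, T,
                   f=lambda x, t: 0.0,
                   g=lambda x, t: 1.0,
                   V=lambda x, t: 0.0,
                   Phi=lambda y: 0.5 * y * y,
                   lam=lam, nu=nu)

# Closed form for this case: Gaussian average of exp(-y^2 / (2 lam)).
s2 = nu * (T - t0)
exact = math.sqrt(lam / (lam + s2)) * math.exp(-x0 ** 2 / (2.0 * (lam + s2)))
print(psi, exact, -lam * math.log(psi))  # the last value estimates J(x0, t0)
```

The estimate carries statistical error of order 1/√N, and for nonzero f or V it also carries a discretisation error from the finite time step dt.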
The Diffusion Process as a Path
Integral (2/2)
• $p(x(t \to T) \mid x, t) = \frac{1}{\psi(x, t)} \exp\left( -\frac{1}{\lambda} S(x(t \to T)) \right),$
where ψ is a partition function, J is a free energy, S is the energy of a path, and λ
the temperature.
Discussion
• One can extend the path integral control
formalism to multiple agents that
jointly solve a task. In this case the
agents need to coordinate their actions
not only through time, but also among
each other to maximise a common
reward function.
• The path integral method has great
potential for application in robotics.