
F.L. Lewis
Supported by :
NSF - PAUL WERBOS
Moncrief-O’Donnell Endowed Chair
ARO – JIM OVERHOLT
Head, Controls & Sensors Group
Automation & Robotics Research Institute (ARRI)
The University of Texas at Arlington
ADP for Feedback Control
Talk available online at
http://ARRI.uta.edu/acs
2007 IEEE International Symposium on Approximate
Dynamic Programming and Reinforcement Learning
David Fogel, General Chair
Derong Liu, Program Chair
Remi Munos, Program Co-Chair
Jennie Si, Program Co-Chair
Donald C. Wunsch, Program Co-Chair
Automation & Robotics Research Institute (ARRI)
Relevance- Machine Feedback Control
High-Speed Precision Motion Control with unmodeled dynamics, vibration suppression,
disturbance rejection, friction compensation, deadzone/backlash control
[Figures: industrial machines; military land systems - tank turret (azimuth/elevation) with backlash and compliant drive train, barrel tip position, barrel flexible modes q_f, moving tank platform with terrain and vehicle vibration disturbances d(t); single-wheel/terrain vehicle suspension system with nonlinearities (vehicle mass m, series and parallel dampers, active damping u_c if used, forward speed, vertical motion z(t), surface roughness); aerospace flexible pointing systems.]
INTELLIGENT CONTROL TOOLS
Fuzzy Associative Memory (FAM): fuzzy logic rule base with input and output membership functions, mapping input x to output u.
Neural Network (NN), which includes adaptive control: mapping input x to output u.
Both FAM and NN define a function u= f(x) from inputs to outputs
FAM and NN can both be used for: 1. Classification and Decision-Making
2. Control
NN Includes Adaptive Control (Adaptive control is a 1-layer NN)
Neural Network Properties
• Learning
• Recall
• Function approximation
• Generalization
• Classification
• Association
• Pattern recognition
• Clustering
• Robustness to single node failure
• Repair and reconfiguration
Nervous system cell.
http://www.sirinet.net/~jgjohnso/index.html
Two-layer feedforward static neural network (NN)
[Diagram: inputs x_1, ..., x_n; hidden layer of L sigmoid units σ(.) with first-layer weights V^T and thresholds; outputs y_1, ..., y_m through second-layer weights W^T.]
Summation equations
y_i = \sigma\Big( \sum_{k=1}^{L} w_{ik}\, \sigma\Big( \sum_{j=1}^{n} v_{kj} x_j + v_{k0} \Big) + w_{i0} \Big), \qquad i = 1, \dots, m
Matrix equation
y = W^T \sigma(V^T x)
Have the universal approximation property
Overcome Barron’s fundamental accuracy limitation of 1-layer NN
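As a concrete illustration of the matrix equation above, here is a minimal NumPy sketch of the two-layer NN forward pass y = W^T σ(V^T x); the layer sizes, the sigmoid choice, and the random weights are assumptions for illustration only.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def two_layer_nn(x, V, W):
    """Forward pass of the static two-layer NN: y = W^T sigma(V^T x).

    x : (n,)   input vector (a constant 1 can be appended for thresholds)
    V : (n, L) first-layer weights;  W : (L, m) second-layer weights
    """
    hidden = sigmoid(V.T @ x)   # L hidden-layer activations sigma(V^T x)
    return W.T @ hidden         # m outputs

# Example with assumed sizes: n = 2 inputs, L = 10 hidden neurons, m = 1 output
rng = np.random.default_rng(0)
V = rng.standard_normal((2, 10))
W = rng.standard_normal((10, 1))
y = two_layer_nn(np.array([0.5, -1.0]), V, W)
```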
Neural Network Robot Controller
Universal Approximation Property
Feedback linearization
[Block diagram: nonlinear inner loop with NN estimate f̂(x), feedforward loop driven by the desired trajectory q_d and its derivatives, PD tracking loop with filtered error r and gain K_v, robust control term v(t), driving the robot system with joint variable q.]
Problem: the controller is nonlinear in the NN weights, so standard adaptive-control proof techniques do not work.
Easy to implement with a few more lines of code
Learning feature allows for on-line updates to NN memory as dynamics change
Handles unmodelled dynamics, disturbances, actuator problems such as friction
NN universal basis property means no regression matrix is needed
Nonlinear controller allows faster & more precise motion
Extension of Adaptive Control to nonlinear-in-parameters systems
No regression matrix needed
Theorem 1 (NN Weight Tuning for Stability)
Let the desired trajectory q_d(t) and its derivatives be bounded. Let the initial tracking error be within a certain allowable set U. Let Z_M be a known upper bound on the Frobenius norm of the unknown ideal weights Z.
Take the control input as
\tau = \hat W^T \sigma(\hat V^T x) + K_v r - v
with robustifying term
v(t) = -K_Z (\|\hat Z\|_F + Z_M)\, r
Let weight tuning be provided by
\dot{\hat W} = F \hat\sigma r^T - F \hat\sigma' \hat V^T x\, r^T - \kappa F \|r\| \hat W
\dot{\hat V} = G x (\hat\sigma'^T \hat W r)^T - \kappa G \|r\| \hat V
with any constant matrices F = F^T > 0, G = G^T > 0, and scalar tuning parameter \kappa > 0. Initialize the weight estimates as \hat W = 0, \hat V = random.
Then the filtered tracking error r(t) and NN weight estimates \hat W, \hat V are uniformly ultimately bounded. Moreover, arbitrarily small tracking error may be achieved by selecting large control gains K_v.
Notes: the first terms are backprop terms (Werbos); the \hat\sigma' \hat V^T x r^T term is a forward-prop term; the \kappa \|r\| terms are extra robustifying terms (Narendra's e-mod extended to NLIP systems). One can also use simplified Hebbian tuning, but the tracking error is larger.
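A minimal sketch of how the control law and tuning laws of Theorem 1 might be stepped forward in time with Euler integration; the dimensions, gains, basis choice, and sign conventions follow the reconstruction above and are illustrative assumptions, not values from the talk.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def nn_controller_step(x, r, W, V, F, G, kappa, Kv, KZ, ZM, dt):
    """One Euler step of the Theorem 1 controller and weight-tuning laws (sketch).

    x: (n,) NN input, r: (m,) filtered tracking error,
    W: (L, m), V: (n, L) weight estimates, F: (L, L), G: (n, n) tuning gains.
    """
    sig = sigmoid(V.T @ x)                 # sigma(V^T x)
    sig_prime = np.diag(sig * (1 - sig))   # diagonal Jacobian sigma'
    # Robustifying term v(t) = -K_Z (||Z_hat||_F + Z_M) r
    Zf = np.sqrt(np.linalg.norm(W, 'fro')**2 + np.linalg.norm(V, 'fro')**2)
    v = -KZ * (Zf + ZM) * r
    # Control input tau = W^T sigma(V^T x) + Kv r - v
    tau = W.T @ sig + Kv @ r - v
    # Weight tuning laws of Theorem 1, integrated over one step dt
    W_dot = (F @ np.outer(sig, r) - F @ sig_prime @ V.T @ np.outer(x, r)
             - kappa * np.linalg.norm(r) * (F @ W))
    V_dot = (G @ np.outer(x, sig_prime.T @ W @ r)
             - kappa * np.linalg.norm(r) * (G @ V))
    return tau, W + dt * W_dot, V + dt * V_dot
```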
More complex Systems?
Force Control
Flexible pointing systems
Vehicle active suspension
SBIR Contracts
Won 1996 SBA Tibbets Award
4 US Patents
NSF Tech Transfer to industry
Flexible & Vibratory Systems
Add an extra feedback loop
Two NN needed
Use passivity to show stability
Backstepping
[Block diagram: NN backstepping controller for a flexible-joint robot arm. An outer tracking loop uses NN#1 to estimate F̂_1(x) in a nonlinear feedback-linearization loop, with tracking error e, filtered error r, gain K_r, and a robust control term v_i(t); a backstepping loop uses NN#2 to estimate F̂_2(x), with gains 1/K_B1 and K_h, to generate the desired current i_d and the motor input u_e driving the robot system.]
Neural network backstepping controller for Flexible-Joint robot arm
Advantages over traditional Backstepping- no regression functions needed
Actuator Nonlinearities -
Deadzone, saturation, backlash
NN in Feedforward Loop- Deadzone Compensation
[Block diagram: NN deadzone precompensator ("little critic network") in the feedforward loop. The desired trajectory q_d and tracking error e drive a filtered-error PD loop with gain K_v; an estimate of the nonlinear function f̂(x) and an actor/critic NN pair produce the precompensated input w, which passes through the deadzone D(u) to the mechanical system with output q.]
Actor and critic tuning:
\dot{\hat W}_i = T \sigma_i(U_i^T w) r^T \hat W^T \sigma'(U^T u) U^T - k_1 T \|r\| \hat W_i - k_2 T \|r\| \|\hat W_i\| \hat W_i
\dot{\hat W} = S \sigma'(U^T u) U^T \hat W_i \sigma_i(U_i^T w) r^T - k_1 S \|r\| \hat W
Acts like a 2-layer NN with enhanced backprop tuning!
Needed when all states are not measured, i.e. output feedback
NN Observers
Recurrent NN observer (state estimates x̂_1, x̂_2, estimation error x̃_1, q = [x_1  x_2]^T):
\dot{\hat z}_1 = \hat x_2 + k_D \tilde x_1
\dot{\hat z}_2 = M^{-1}(x_1) \big[ \hat W_o^T \sigma_o(\hat x_1, \hat x_2) + \tau(t) \big] + K \tilde x_1
Tune the NN observer:
\dot{\hat W}_o = k_D F_o \sigma_o(\hat x) \tilde x_1^T - \kappa_o F_o \|\tilde x_1\| \hat W_o - \kappa_o F_o \hat W_o
Tune the action NN:
\dot{\hat W}_c = F_c \sigma_c(\hat x_1, \hat x_2) \hat r^T - \kappa_c F_c \|\hat r\| \hat W_c
[Block diagram: the NN observer provides state estimates x̂_1, x̂_2 to the NN controller; tracking errors e, filtered error r̂(t), outer-loop gains K_v, k_p, k_D, and the estimated functions h_c(x_1, x_2), h_o(x_1, x_2) generate the control v_c for the robot.]
Also Use CMAC NN, Fuzzy Logic systems
Fuzzy Logic System
= NN with VECTOR thresholds
Separable Gaussian activation
functions for RBF NN
Tune first layer weights, e.g. centroids and spreads - activation fns move around
Dynamic Focusing of Awareness
Separable triangular activation
functions for CMAC NN
Elastic Fuzzy Logic - c.f. P. Werbos
\Phi(z, a, b, c) = [\mu_B(z, a, b)]^{c^2}, \qquad \mu_B(z, a, b) = \tfrac{1}{2}\big(1 + \cos(a(z - b))\big)
The elasticity c^2 weights the importance of factors in the rules.
[Plots: effect of changing the membership function spread a; effect of changing the membership function elasticities c.]
Elastic Fuzzy Logic Control
Control:
u(t) = -K_v r + \hat g(x, x_d)
Tune membership functions:
\dot{\hat a} = K_a A^T \hat W r - k_a K_a \hat a \|r\|
\dot{\hat b} = K_b B^T \hat W r - k_b K_b \hat b \|r\|
Tune control representative values:
\dot{\hat c} = K_c C^T \hat W r - k_c K_c \hat c \|r\|
\dot{\hat W} = K_W (\hat\Phi - A \hat a - B \hat b - C \hat c) r^T - k_W K_W \hat W \|r\|
[Block diagram: fuzzy rule base with input and output membership functions producing \hat g(x, x_d); tracking error e(t) between x_d(t) and x(t) is filtered to give r(t), with outer loop gain K_v driving the controlled plant with state x(t).]
Better performance: start with a 5x5 uniform grid of MFs. After tuning, it builds its own basis set - Dynamic Focusing of Awareness.
Optimality in Biological Systems
Cell Homeostasis
The individual cell is a complex
feedback control system. It pumps
ions across the cell membrane to
maintain homeostasis, and has only
limited energy to do so.
Permeability control of the cell membrane
Cellular Metabolism
http://www.accessexcellence.org/RC/VL/GG/index.html
Optimality in Control Systems Design
Rocket Orbit Injection
R. Kalman 1960
Dynamics
\dot r = w
\dot w = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m} \sin\phi
\dot v = -\frac{w v}{r} + \frac{F}{m} \cos\phi
\dot m = -F_m
Objectives
Get to orbit in minimum time
Use minimum fuel
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf
2. Neural Network Solution of Optimal Design Equations
Nearly Optimal Control
Based on HJ Optimal Design Equations
Known system dynamics
Preliminary off-line tuning
1. Neural Networks for Feedback Control
Based on FB Control Approach
Unknown system dynamics
Extended adaptive control to NLIP systems
On-line tuning
No regression matrix
Murad Abu Khalaf
H-Infinity Control Using Neural Networks
System:
\dot x = f(x) + g(x) u + k(x) d , \qquad y = x
with control u, disturbance d, measured output y, and performance output z = z(x, u), where \|z\|^2 = h^T h + \|u\|^2 .
L2 Gain Problem
Find the control u = l(y) so that
\frac{\int_0^\infty \|z(t)\|^2 \, dt}{\int_0^\infty \|d(t)\|^2 \, dt} = \frac{\int_0^\infty (h^T h + \|u\|^2) \, dt}{\int_0^\infty \|d(t)\|^2 \, dt} \le \gamma^2
for all L2 disturbances and a prescribed gain \gamma^2 .
Zero-sum differential Nash game.
Cannot solve HJI !!
Successive Solution - Algorithm 1:
Murad Abu Khalaf
Let \gamma be prescribed and fixed. Let u^0 be a stabilizing control with region of asymptotic stability \Omega_0 , and take the initial disturbance d^0 = 0 .
1. Outer loop - update control
2. Inner loop - update disturbance
Solve the Value Equation (consistency equation for the value function V_i^j):
\Big(\frac{\partial V_i^j}{\partial x}\Big)^T \big( f + g u_j + k d_i \big) + h^T h + 2 \int_0^{u_j} \phi^{-T}(v)\, dv - \gamma^2 (d_i)^T d_i = 0
Inner loop: update the disturbance
d_{i+1} = \frac{1}{2\gamma^2} k^T(x) \frac{\partial V_i^j}{\partial x}
and go to 2. Iterate on i until convergence to d_\infty , V_\infty^j with RAS \Omega_j .
Outer loop: update the control action
u_{j+1} = -\frac{1}{2} g^T(x) \frac{\partial V_\infty^j}{\partial x}
and go to 1. Iterate on j until convergence to u_\infty , V_\infty , with RAS \Omega_\infty .
CT Policy Iteration for H-Infinity Control
Murad Abu Khalaf
Problem - cannot solve the Value Equation!
\Big(\frac{\partial V_i^j}{\partial x}\Big)^T \big( f + g u_j + k d_i \big) + h^T h + 2 \int_0^{u_j} \phi^{-T}(v)\, dv - \gamma^2 (d_i)^T d_i = 0
Neural Network Approximation for Computational Technique
Neural network to approximate V^{(i)}(x) (can use a 2-layer NN!):
V_L^{(i)}(x) = \sum_{j=1}^{L} w_j^{(i)} \sigma_j(x) = W_L^{(i)T} \sigma_L(x)
The value function gradient approximation is
\frac{\partial V_L^{(i)}}{\partial x} = \frac{\partial \sigma_L(x)^T}{\partial x} W_L^{(i)} = \nabla\sigma_L^T(x) W_L^{(i)}
Substitute into the Value Equation to get
0 = w_{ij}^T \nabla\sigma(x)\, \dot x + r(x, u_j, d_i) = w_{ij}^T \nabla\sigma(x) f(x, u_j, d_i) + h^T h + \|u_j\|^2 - \gamma^2 \|d_i\|^2
Therefore, one may solve for NN weights at iteration (i,j)
VFA converts partial differential equation into algebraic equation in terms of NN weights
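A sketch of how the resulting algebraic equation might be solved for the critic weights by least squares over a set of sample states; the scalar dynamics, polynomial basis, policies, and gain value are assumptions for illustration only.

```python
import numpy as np

# Scalar example system (assumed): xdot = f(x) + g(x) u + k(x) d
f = lambda x: -x + x**3 / 3.0
g = lambda x: 1.0
k = lambda x: 0.5

# Polynomial value basis sigma(x) = [x^2, x^4, x^6] and its gradient
def sigma_grad(x):
    return np.array([2*x, 4*x**3, 6*x**5])

def critic_weights(u_policy, d_policy, gamma, h, x_samples):
    """Least-squares solve of 0 = w^T grad_sigma(x)(f+gu+kd) + h^2 + u^2 - gamma^2 d^2."""
    A, b = [], []
    for x in x_samples:
        u, d = u_policy(x), d_policy(x)
        xdot = f(x) + g(x) * u + k(x) * d
        A.append(sigma_grad(x) * xdot)                   # row: w^T grad_sigma * xdot
        b.append(-(h(x)**2 + u**2 - gamma**2 * d**2))    # reward moved to the RHS
    w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return w

# Example call with an assumed stabilizing policy u = -x and zero disturbance
w = critic_weights(lambda x: -x, lambda x: 0.0, gamma=5.0,
                   h=lambda x: x, x_samples=np.linspace(-1, 1, 25))
```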
Murad Abu Khalaf
Neural Network Optimal Feedback Controller
Optimal Solution
d_\infty = \frac{1}{2\gamma^2} k^T(x) \nabla\sigma_L^T W_L
u_\infty = -\frac{1}{2} g^T(x) \nabla\sigma_L^T W_L
A NN feedback controller with nearly optimal weights
Finite Horizon Control
Cheng Tao
Fixed-Final-Time HJB Optimal Control
Optimal cost
V^*(x, t) = \min_{u(\tau)} \int_t^T L(x, u)\, d\tau
Optimal control
-\frac{\partial V^*(x, t)}{\partial t} = \min_{u(t)} \Big[ L(x, u) + \Big(\frac{\partial V^*(x, t)}{\partial x}\Big)^T \big( f(x) + g(x) u(x) \big) \Big]
u^*(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V^*(x, t)}{\partial x}
This yields the time-varying Hamilton-Jacobi-Bellman (HJB) equation
\frac{\partial V^*(x, t)}{\partial t} + \Big(\frac{\partial V^*(x, t)}{\partial x}\Big)^T f(x) + Q(x) - \frac{1}{4} \Big(\frac{\partial V^*(x, t)}{\partial x}\Big)^T g(x) R^{-1} g^T(x) \frac{\partial V^*(x, t)}{\partial x} = 0
Cheng Tao
HJB Solution by NN Value Function Approximation
V_L(x, t) = \sum_{j=1}^{L} w_j(t)\, \sigma_j(x) = w_L^T(t)\, \sigma_L(x)   (time-varying weights)
Note that
\frac{\partial V_L(x, t)}{\partial x} = \frac{\partial \sigma_L(x)^T}{\partial x} w_L(t) = \nabla\sigma_L^T(x)\, w_L(t)   (Irwin Sandberg)
where \nabla\sigma_L(x) is the Jacobian \partial\sigma_L(x)/\partial x , and
\frac{\partial V_L(x, t)}{\partial t} = \dot w_L^T(t)\, \sigma_L(x)
Policy iteration not needed!
Approximating V(x, t) in the HJB equation gives an ODE in the NN weights
\dot w_L^T(t)\, \sigma_L(x) + w_L^T(t)\, \nabla\sigma_L(x) f(x) - \frac{1}{4} w_L^T(t)\, \nabla\sigma_L(x) g(x) R^{-1} g^T(x) \nabla\sigma_L^T(x) w_L(t) + Q(x) = e_L(x)
Solve by least-squares - simply integrate backwards to find the NN weights.
The control is
u^*(x) = -\frac{1}{2} R^{-1} g^T(x)\, \nabla\sigma_L^T(x)\, w_L(t)
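The "integrate backwards" step can be made concrete with a small sketch: at each time step, solve a least-squares problem for the weight derivative over a set of sample states, then step the weights backward from the terminal condition. The scalar dynamics, basis, horizon, and step size below are illustrative assumptions.

```python
import numpy as np

# Scalar example (assumed): xdot = f(x) + g(x) u, Q(x) = x^2, R = 1, horizon T
f, g, Q, R, T, dt = (lambda x: -x), (lambda x: 1.0), (lambda x: x**2), 1.0, 1.0, 0.01
xs = np.linspace(-1.0, 1.0, 30)              # sample states for the least squares
sigma = lambda x: np.array([x**2, x**4])     # value basis sigma_L(x)
dsigma = lambda x: np.array([2*x, 4*x**3])   # its gradient

def w_dot(w):
    """Solve sigma(x_s)^T w_dot = -(w^T dsigma f - 1/4 w^T dsigma g R^-1 g dsigma^T w + Q)."""
    A, rhs = [], []
    for x in xs:
        Vx = w @ dsigma(x)                   # dV/dx at the sample state
        A.append(sigma(x))
        rhs.append(-(Vx * f(x) - 0.25 * Vx * g(x) / R * g(x) * Vx + Q(x)))
    sol, *_ = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)
    return sol

w = np.zeros(2)                              # terminal condition V(x, T) = 0
for _ in range(int(T / dt)):                 # integrate backwards in time
    w = w - dt * w_dot(w)                    # w(t - dt) = w(t) - dt * w_dot(t)
```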
ARRI Research Roadmap in Neural Networks
3. Approximate Dynamic Programming - 2006
Nearly Optimal Control
Based on recursive equation for the optimal value
Usually known system dynamics (except Q learning)
On-line tuning
Optimal Adaptive Control
The Goal - extend adaptive control to unknown dynamics and yield OPTIMAL controllers. No canonical form needed.
2. Neural Network Solution of Optimal Design Equations - 2002-2006
Nearly Optimal Control
Based on HJ Optimal Design Equations
Known system dynamics
Preliminary off-line tuning
Nearly optimal solution of controls design equations. No canonical form needed.
1. Neural Networks for Feedback Control - 1995-2002
Based on FB Control Approach
Unknown system dynamics
Extended adaptive control to NLIP systems
On-line tuning
No regression matrix
NN - FB lin., sing. pert., backstepping, force control, dynamic inversion, etc.
Four ADP Methods proposed by Werbos
Critic NN to approximate:
Heuristic dynamic programming (HDP): the value V(x_k)
Dual heuristic programming (DHP): the gradient \partial V / \partial x
AD heuristic dynamic programming (ADHDP) (Watkins Q learning): the Q function Q(x_k, u_k)
AD dual heuristic programming (ADDHP): the gradients \partial Q / \partial x , \partial Q / \partial u
Action NN to approximate the control.
Bertsekas - Neurodynamic Programming
Barto & Bradtke - Q-learning proof (imposed a settling time)
Discrete-Time Optimal Control
Cost
V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)
Value function recursion
V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})
u_k = h(x_k) = the prescribed control input function
Hamiltonian
H(x_k, V, h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)
Optimal cost
V^*(x_k) = \min_h \big( r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) \big)
Bellman's Principle
V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)
Optimal control
h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)
System dynamics does not appear
Solutions by Comp. Intelligence Community
Use System Dynamics
System
xk 1  f ( xk )  g ( xk )uk

V ( x0 )   xk Qxk  uk Ruk
k 0
DT HJB equation
V  ( xk )  min  xkT Qxk  ukT Ruk  V  ( xk 1 ) 
uk
 min  xkT Qxk  ukT Ruk  V   f ( xk )  g ( xk )uk  
uk

1 1
T dV ( xk 1 )
u ( xk )   R g ( xk )
2
dxk 1

Difficult to solve
Few practical solutions by Control Systems Community
Greedy Value Fn. Update- Approximate Dynamic Programming
ADP Method 1 - Heuristic Dynamic Programming (HDP)
Paul Werbos
Policy Iteration
V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)
For LQR this is the Lyapunov equation
(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0
with underlying Riccati-equation gain update
L_j = (I + B^T P_j B)^{-1} B^T P_j A
Hewer 1971. An initial stabilizing control is needed.
ADP Greedy Cost Update
V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})
h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)
Simple recursion. For LQR
P_{j+1} = (A - B L_j)^T P_j (A - B L_j) + Q + L_j^T R L_j
with underlying Riccati-equation gain update
L_j = (I + B^T P_j B)^{-1} B^T P_j A
Lancaster & Rodman proved convergence. An initial stabilizing control is NOT needed.
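A small NumPy sketch of the HDP greedy cost update for LQR as reconstructed above; the A, B, Q, R values are arbitrary illustrative choices, R is kept general rather than fixed to I, and P is started from zero so no stabilizing gain is assumed.

```python
import numpy as np

def hdp_greedy(A, B, Q, R, iters=200):
    """ADP greedy cost update for LQR: P_0 = 0, no initial stabilizing gain needed."""
    n = A.shape[0]
    P = np.zeros((n, n))
    for _ in range(iters):
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # L_j = (R + B'PB)^-1 B'PA
        Acl = A - B @ L
        P = Acl.T @ P @ Acl + Q + L.T @ R @ L               # P_{j+1} from P_j
    return P, L

# Illustrative system (assumed values)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
P, L = hdp_greedy(A, B, Q=np.eye(2), R=np.eye(1))
```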
DT HDP vs. Receding Horizon Optimal Control
Forward-in-time HDP:
P_{i+1} = A^T P_i A + Q - A^T P_i B (I + B^T P_i B)^{-1} B^T P_i A , \qquad P_0 = 0
Backward-in-time optimization (RHC):
P_k = A^T P_{k+1} A + Q - A^T P_{k+1} B (I + B^T P_{k+1} B)^{-1} B^T P_{k+1} A , \qquad P_N = Control Lyapunov Function
Q Learning - Action Dependent ADP
Define the Q function
Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})
Note: u_k is arbitrary and the policy h(.) is used after time k, so Q_h(x_k, h(x_k)) = V_h(x_k).
Recursion for Q:
Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))
Simple expression of Bellman's principle:
V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)
h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)
Q Learning does not need to know f(xk) or g(xk)
For LQR
V(x) = W^T \phi(x) = x^T P x   (V is quadratic in x)
Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \equiv \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}
Q is quadratic in x and u.
The control update is found by setting \partial Q / \partial u_k = 0 :
0 = \frac{\partial Q}{\partial u_k} = 2 \big[ B^T P A x_k + (R + B^T P B) u_k \big] = 2 \big[ H_{ux} x_k + H_{uu} u_k \big]
so
u_k = -(R + B^T P B)^{-1} B^T P A x_k = -H_{uu}^{-1} H_{ux} x_k = L_{j+1} x_k
Control found only from Q function
A and B not needed
Model-free policy iteration
Q Policy Iteration
Q j 1 ( xk , uk )  r ( xk , uk )  Q j 1 ( xk 1 , L j xk 1 )


Bradtke, Ydstie,
Barto
W jT1  ( xk , u k )   ( xk 1 , L j xk 1 )  r ( xk , L j xk )
Control policy update
Stable initial control needed
h j 1 ( xk )  arg min (Q j 1 ( xk , uk ))
uk
1
uk  H uu
H ux xk  L j 1 xk
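A sketch of a Bradtke-style Q-function policy iteration for LQR, using a quadratic basis in (x, u), probing noise for excitation, and batch least squares; the system matrices, noise level, and iteration counts are assumptions for illustration, and the initial gain must be stabilizing as the slide states.

```python
import numpy as np

def quad_basis(z):
    """Quadratic (Kronecker) basis: all products z_i z_j with i <= j."""
    q = len(z)
    return np.array([z[i] * z[j] for i in range(q) for j in range(i, q)])

def basis_to_H(w, q):
    """Unpack the weight vector into the symmetric kernel matrix H."""
    H, idx = np.zeros((q, q)), 0
    for i in range(q):
        for j in range(i, q):
            H[i, j] = H[j, i] = w[idx] if i == j else w[idx] / 2.0
            idx += 1
    return H

def q_policy_iteration(A, B, Q, R, L0, steps=400, iters=10, noise=0.1, seed=0):
    """Model-free Q-function policy iteration for LQR (sketch): A, B only simulate data."""
    rng = np.random.default_rng(seed)
    n, m = B.shape
    L = L0
    for _ in range(iters):
        Phi, rew = [], []
        x = rng.standard_normal(n)
        for _ in range(steps):
            u = -L @ x + noise * rng.standard_normal(m)    # probing noise for PE
            x1 = A @ x + B @ u
            u1 = -L @ x1                                   # current policy at x_{k+1}
            Phi.append(quad_basis(np.r_[x, u]) - quad_basis(np.r_[x1, u1]))
            rew.append(x @ Q @ x + u @ R @ u)
            x = x1
        w, *_ = np.linalg.lstsq(np.array(Phi), np.array(rew), rcond=None)
        H = basis_to_H(w, n + m)
        L = np.linalg.solve(H[n:, n:], H[n:, :n])          # u = -Huu^-1 Hux x
    return L, H

# Example (assumed stable system so L0 = 0 is stabilizing)
A = np.array([[0.9, 0.1], [0.0, 0.8]]); B = np.array([[0.0], [1.0]])
L, H = q_policy_iteration(A, B, np.eye(2), np.eye(1), L0=np.zeros((1, 2)))
```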
Greedy Q Fn. Update - Approximate Dynamic Programming
ADP Method 3. Q Learning
Action-Dependent Heuristic Dynamic Programming (ADHDP)
Greedy Q Update
Model-free ADP
Paul Werbos
Q j 1 ( xk , u k )  r ( xk , u k )  Q j ( xk 1 , h j ( xk 1 ))
W jT1 ( xk , u k )  r ( xk , L j xk )  W jT  ( xk 1 , L j xk 1 )  target j 1
Update weights by RLS or backprop.
Q learning actually solves the Riccati Equation
WITHOUT knowing the plant dynamics
Model-free ADP
Direct OPTIMAL ADAPTIVE CONTROL
Works for Nonlinear Systems
Proofs?
Robustness?
Comparison with adaptive control methods?
Asma Al-Tamimi
ADP for Discrete-Time H-infinity Control
Finding Nash Game Equilibrium
• HDP
• DHP
• AD HDP - Q learning
• AD DHP
Asma Al-Tamimi
ADP for DT H∞ Optimal Control Systems
System:
x_{k+1} = A x_k + B u_k + E w_k , \qquad y_k = x_k
with control u_k = L x_k, disturbance w_k, measured output y_k, and penalty output z_k, where
z_k^T z_k = x_k^T Q x_k + u_k^T u_k
Find the control u_k so that
\frac{\sum_{k=0}^{\infty} \big( x_k^T Q x_k + u_k^T u_k \big)}{\sum_{k=0}^{\infty} w_k^T w_k} \le \gamma^2
for all L2 disturbances and a prescribed gain \gamma^2, when the system is at rest, x_0 = 0.
Asma Al-Tamimi
Two known ways for Discrete-time
H-infinity iterative solution
Policy iteration for game solution (requires a stable initial policy):
P_{i+1} = \bar A^T P_{i+1} \bar A + Q + L_i^T R L_i - \gamma^2 K_i^T K_i , \qquad \bar A = A + E K_j + B L_i
A_i = A + E K_j , \qquad A_j = A + B L_i
L_i = -(I + B^T P_i B)^{-1} B^T P_i A_i
K_j = -(E^T P_i E - \gamma^2 I)^{-1} E^T P_i A_j
ADP Greedy iteration (does not require a stable initial policy):
P_{i+1} = A^T P_i A + Q - \begin{bmatrix} A^T P_i B & A^T P_i E \end{bmatrix} \begin{bmatrix} I + B^T P_i B & B^T P_i E \\ E^T P_i B & E^T P_i E - \gamma^2 I \end{bmatrix}^{-1} \begin{bmatrix} B^T P_i A \\ E^T P_i A \end{bmatrix}
Both require full knowledge of the system dynamics.
Asma Al-Tamimi
• DT Game Heuristic Dynamic Programming: Forward-in-time Formulation
An Approximate Dynamic Programming (ADP) scheme where one has the following incremental optimization
V_{i+1}(x_k) = \min_{u_k} \max_{w_k} \big( x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(x_{k+1}) \big)
which is equivalently written as
V_{i+1}(x_k) = x_k^T Q x_k + u_i^T(x_k) u_i(x_k) - \gamma^2 w_i^T(x_k) w_i(x_k) + V_i(x_{k+1})
Asma Al-Tamimi
HDP - Linear System Case
\hat V(x, p_i) = p_i^T \bar x , \qquad \bar x = (x_1^2, \dots, x_1 x_n, x_2^2, x_2 x_3, \dots, x_{n-1} x_n, x_n^2)
Value function update (solve by batch LS or RLS):
p_{i+1}^T \bar x_k = x_k^T Q x_k + (L_i x_k)^T (L_i x_k) - \gamma^2 (K_i x_k)^T (K_i x_k) + p_i^T \bar x_{k+1}
Control and disturbance updates:
\hat u(x, L_i) = L_i x , \qquad \hat w(x, K_i) = K_i x
Control gain:
L_i = \big( I + B^T P_i B - B^T P_i E (E^T P_i E - \gamma^2 I)^{-1} E^T P_i B \big)^{-1} \big( B^T P_i E (E^T P_i E - \gamma^2 I)^{-1} E^T P_i A - B^T P_i A \big)
Disturbance gain:
K_i = \big( E^T P_i E - \gamma^2 I - E^T P_i B (I + B^T P_i B)^{-1} B^T P_i E \big)^{-1} \big( E^T P_i B (I + B^T P_i B)^{-1} B^T P_i A - E^T P_i A \big)
A, B, E needed.
Showed that this is equivalent to iteration on the Underlying Game Riccati equation
1
Pi 1  AT Pi A  Q  [ AT Pi B
 I  B T Pi B
B T Pi E   B T Pi A
T
A Pi E ]  T
 

E T Pi E   2 I   E T Pi A
 E Pi A
Which is known to converge- Stoorvogel, Basar
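A direct NumPy transcription of the game Riccati recursion above, mainly to show the block structure; the gamma value and the illustrative matrices in the example call are assumptions.

```python
import numpy as np

def game_riccati_iteration(A, B, E, Q, gamma, iters=500):
    """Iterate P_{i+1} = A'P_iA + Q - [A'P_iB  A'P_iE] M^{-1} [B'P_iA; E'P_iA]
    with M = [[I + B'P_iB, B'P_iE], [E'P_iB, E'P_iE - gamma^2 I]], starting from P_0 = 0."""
    n, m = B.shape
    q = E.shape[1]
    P = np.zeros((n, n))
    for _ in range(iters):
        M = np.block([[np.eye(m) + B.T @ P @ B, B.T @ P @ E],
                      [E.T @ P @ B, E.T @ P @ E - gamma**2 * np.eye(q)]])
        N = np.vstack([B.T @ P @ A, E.T @ P @ A])
        P = A.T @ P @ A + Q - np.hstack([A.T @ P @ B, A.T @ P @ E]) @ np.linalg.solve(M, N)
    return P

# Illustrative call (assumed values); gamma must exceed the H-infinity optimal gain
P = game_riccati_iteration(np.array([[0.9, 0.1], [0.0, 0.8]]),
                           np.array([[0.0], [1.0]]),
                           np.array([[0.1], [0.0]]),
                           np.eye(2), gamma=5.0)
```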
Q-Learning for DT H-infinity Control:
Action Dependent Heuristic Dynamic
Programming
Asma Al-Tamimi
• Dynamic Programming: Backward-in-time
Q^*(x_k, u_k, w_k) = x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V^*(x_{k+1})
(u^*_k, w^*_k) = \arg \big\{ \min_{u_k} \max_{w_k} Q^*(x_k, u_k, w_k) \big\}
• Adaptive Dynamic Programming: Forward-in-time
Q_{i+1}(x_k, u_k, w_k) = x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k + \min_{u_{k+1}} \max_{w_{k+1}} Q_i(x_{k+1}, u_{k+1}, w_{k+1})
= x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(x_{k+1})
= x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(A x_k + B u_k + E w_k)
u_i(x_k) = L_i x_k , \qquad w_i(x_k) = K_i x_k
Linear quadratic case - V and Q are quadratic:
V^*(x_k) = x_k^T P x_k
Q learning for H-infinity Control
Q ( xk , uk , wk )  r ( xk , uk , wk )  V  ( xk 1 )
  xkT
ukT
wkT  H  xkT
ukT
wkT 
T
Q function update
Qi 1 ( xk , uˆi ( xk ), wˆ i ( xk ))  xkT Rx k  uˆi ( xk )T uˆi ( xk )   2 wˆ i ( xk )T wˆ i ( xk ) 
Qi ( xk 1 , uˆi ( xk 1 ), wˆ i ( xk 1 ))
[ xkT ukT wkT ]H i 1[ xkT ukT wkT ]T  xkT Rx k  ukT uk   2 wkT wk  [ xkT1 ukT1 wkT1 ]H i [ xkT1 ukT1 wkT1 ]T
Control Action and Disturbance updates
ui ( xk )  Li xk ,
wi ( xk )  Ki xk
i
i 1
i
i
i 1
i
Li  ( H uui  H uw
H ww
H wu
)1 ( H uw
H ww
H wx
 H uxi ),
i
i
i
i
i
Ki  ( H ww
 H wu
H uui 1 H uw
)1 ( H wu
H uui 1 H uxi  H wx
).
 H xx
H
 ux
 H wx
H xu
H uu
H wu
H xw 
H uw 
H ww 
A, B, E NOT needed
A quadratic basis set is used to allow on-line solution:
\hat Q(z, h_i) = z^T H_i z = h_i^T \bar z
where
z = [x^T\ u^T\ w^T]^T , \qquad \bar z = (z_1^2, \dots, z_1 z_q, z_2^2, z_2 z_3, \dots, z_{q-1} z_q, z_q^2)
Quadratic Kronecker basis
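A small helper showing one way the quadratic Kronecker basis z̄ might be formed from z = [x; u; w]; the numerical values in the example are purely illustrative.

```python
import numpy as np

def kron_quadratic_basis(z):
    """Return (z1^2, z1 z2, ..., z1 zq, z2^2, z2 z3, ..., zq^2) for z in R^q."""
    q = len(z)
    return np.array([z[i] * z[j] for i in range(q) for j in range(i, q)])

# Example: z = [x; u; w] stacked state, control, and disturbance (assumed values)
zbar = kron_quadratic_basis(np.array([1.0, -2.0, 0.5, 0.1]))
# len(zbar) = q(q+1)/2, matching the independent entries of the symmetric kernel H
```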
Q function update:
Q_{i+1}(x_k, \hat u_i(x_k), \hat w_i(x_k)) = x_k^T R x_k + \hat u_i(x_k)^T \hat u_i(x_k) - \gamma^2 \hat w_i(x_k)^T \hat w_i(x_k) + Q_i(x_{k+1}, \hat u_i(x_{k+1}), \hat w_i(x_{k+1}))
Solve for the 'NN weights' - the elements of the kernel matrix H - from
h_{i+1}^T \bar z(x_k) = x_k^T R x_k + \hat u_i(x_k)^T \hat u_i(x_k) - \gamma^2 \hat w_i(x_k)^T \hat w_i(x_k) + h_i^T \bar z(x_{k+1})
using batch LS or online RLS.
Control and disturbance updates:
\hat u_i(x) = L_i x , \qquad \hat w_i(x) = K_i x
Probing noise is injected to get persistence of excitation:
\hat u_{ei}(x_k) = L_i x_k + n_{1k} , \qquad \hat w_{ei}(x_k) = K_i x_k + n_{2k}
Proof: still converges to the exact result.
Asma Al-Tamimi
H-inf Q learning Convergence Proofs
• Convergence - H-inf Q learning is equivalent to solving
H_{i+1} = \begin{bmatrix} Q & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & -\gamma^2 I \end{bmatrix} + \begin{bmatrix} A & B & E \\ L_i A & L_i B & L_i E \\ K_i A & K_i B & K_i E \end{bmatrix}^T H_i \begin{bmatrix} A & B & E \\ L_i A & L_i B & L_i E \\ K_i A & K_i B & K_i E \end{bmatrix}
for the system x_{k+1} = A x_k + B u_k + E w_k , y_k = x_k , without knowing the system matrices.
• The result is a model-free Direct Adaptive Controller that converges to an H-infinity optimal controller.
• No requirement whatsoever on the model plant matrices.
Direct H-infinity Adaptive Control
Asma Al-Tamimi
Compare to the Q function for the H2 Optimal Control case:
Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}
H-infinity Game Q function:
\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\ H_{ux} & H_{uu} & H_{uw} \\ H_{wx} & H_{wu} & H_{ww} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}
Asma Al-Tamimi
ADP for Nonlinear Systems:
Convergence Proof
• HDP
Asma Al-Tamimi
Discrete-time Nonlinear Adaptive Dynamic Programming:
System dynamics
x_{k+1} = f(x_k) + g(x_k) u(x_k)
Value
V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)
Value function recursion
V(x_k) = x_k^T Q x_k + u_k^T R u_k + \sum_{i=k+1}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})
HDP
u_i(x_k) = \arg\min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)
V_{i+1}(x_k) = \min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big) = x_k^T Q x_k + u_i^T(x_k) R u_i(x_k) + V_i\big( f(x_k) + g(x_k) u_i(x_k) \big)
Asma Al-Tamimi
Proof of convergence of DT nonlinear HDP
Flavor of proofs
Standard Neural Network VFA for On-Line Implementation
NN for the value (critic), which can use a 2-layer NN:
\hat V_i(x_k, W_{Vi}) = W_{Vi}^T \phi(x_k)
NN for the control action:
\hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)
HDP
V_{i+1} = \min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big) = x_k^T Q x_k + u_i^T(x_k) R u_i(x_k) + V_i\big( f(x_k) + g(x_k) u_i(x_k) \big)
u_i(x_k) = \arg\min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)
Define the target cost function
d(\phi(x_k), W_{Vi}^T) = x_k^T Q x_k + \hat u_i^T(x_k) R \hat u_i(x_k) + \hat V_i(x_{k+1}) = x_k^T Q x_k + \hat u_i^T(x_k) R \hat u_i(x_k) + W_{Vi}^T \phi(x_{k+1})
Explicit equation for the cost - use LS for the critic NN update:
W_{Vi+1} = \arg\min_{W} \int \big| W^T \phi(x_k) - d(\phi(x_k), W_{Vi}) \big|^2 dx_k
W_{Vi+1} = \Big( \int \phi(x_k) \phi(x_k)^T dx \Big)^{-1} \int \phi(x_k)\, d(\phi(x_k), W_{Vi}^T, W_{ui}^T)\, dx
Implicit equation for the DT control - use gradient descent for the action update:
W_{ui} = \arg\min_{\beta} \Big[ x_k^T Q x_k + \hat u^T(x_k, \beta) R\, \hat u(x_k, \beta) + \hat V_i\big( f(x_k) + g(x_k) \hat u(x_k, \beta) \big) \Big]
W_{ui(j+1)} = W_{ui(j)} - \alpha \frac{\partial \big[ x_k^T Q x_k + \hat u_{i(j)}^T R\, \hat u_{i(j)} + \hat V_i(x_{k+1}) \big]}{\partial W_{ui(j)}}
W_{ui}^{j+1} = W_{ui}^{j} - \alpha\, \sigma(x_k) \Big( 2 R\, \hat u_{i(j)} + g(x_k)^T \frac{\partial \phi(x_{k+1})}{\partial x_{k+1}}^T W_{Vi} \Big)^T
Backpropagation - P. Werbos
Issues with Nonlinear ADP
LS solution for the critic NN update:
W_{Vi+1} = \Big( \int \phi(x_k) \phi(x_k)^T dx \Big)^{-1} \int \phi(x_k)\, d(\phi(x_k), W_{Vi}^T, W_{ui}^T)\, dx
Selection of the NN training set:
[Plots: sample states (x_1, x_2) taken over a region of state space vs. along a single trajectory in time.]
Integral over a region of state-space - approximate using a set of points - batch LS.
Take sample points along a single trajectory - recursive least-squares (RLS).
Set of points over a region vs. points along a trajectory:
For linear systems these are the same.
Conjecture - for nonlinear systems they are the same under a persistence of excitation condition (exploration).
Interesting Fact for HDP for Nonlinear Systems
Linear case:
h_j(x_k) = -L_j x_k = -(I + B^T P_j B)^{-1} B^T P_j A x_k
so one must know the system A and B matrices.
NN for the control action:
\hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)
Implicit equation for the DT control - use gradient descent for the action update:
W_{ui} = \arg\min_{\beta} \Big[ x_k^T Q x_k + \hat u^T(x_k, \beta) R\, \hat u(x_k, \beta) + \hat V_i\big( f(x_k) + g(x_k) \hat u(x_k, \beta) \big) \Big]
W_{ui(j+1)} = W_{ui(j)} - \alpha \frac{\partial \big[ x_k^T Q x_k + \hat u_{i(j)}^T R\, \hat u_{i(j)} + \hat V_i(x_{k+1}) \big]}{\partial W_{ui(j)}}
W_{ui}^{j+1} = W_{ui}^{j} - \alpha\, \sigma(x_k) \Big( 2 R\, \hat u_{i(j)} + g(x_k)^T \frac{\partial \phi(x_{k+1})}{\partial x_{k+1}}^T W_{Vi} \Big)^T
Note that state internal dynamics f(xk) is NOT needed in nonlinear case since:
1. NN Approximation for action is used
2. xk+1 is measured
Draguna Vrabie
ADP for Continuous-Time Systems
• Policy Iteration
• HDP
Continuous-Time Optimal Control
x  f ( x, u )
System
Cost


t
t
V ( x(t ))   r ( x, u ) dt   (Q( x)  u T Ru ) dt
c.f. DT value recursion,
where f(), g() do not appear
Vh ( xk )  r ( xk , h( xk ))  Vh ( xk 1 )
Hamiltonian
V
 V 
 V 
0  V  r ( x, u )  
, u)
 x  r ( x, u )  
 f ( x, u )  r ( x, u )  H ( x,

x

x

x




T
Optimal cost
Bellman
Optimal control
T
T
T



 V  
 V 



0  min r ( x, u )  
 x  min r ( x, u )  
 f ( x, u ) 

u (t ) 
 x   u (t ) 
 x 


T
T



*
 V *  


V


 x  min r ( x, u )  
 f ( x, u ) 
0  min  r ( x, u )  




 x   u (t ) 
u (t ) 
x 







*

V
h ( x(t ))   1 2 R g ( x)
x
1
*
T
HJB equation
V (0)  0
T
T
*
*
 dV * 


1 T dV
1 dV
 f  Q( x)  4 
 gR g
0  
dx
 dx 
 dx 
V (0)  0
Linear system, quadratic cost. System: \dot x = A x + B u
Utility: r(x, u) = x^T Q x + u^T R u ; \quad R > 0 , Q \ge 0
The cost is quadratic:
V(x(t)) = \int_t^{\infty} r(x, u)\, d\tau = x^T(t) P x(t)
Optimal control (state feedback):
u(t) = -R^{-1} B^T P x(t) = -L x(t)
The HJB equation is the algebraic Riccati equation (ARE):
0 = P A + A^T P + Q - P B R^{-1} B^T P
CT Policy Iteration
Utility
r(x, u) = Q(x) + u^T R u
Cost for any given u(t) (Lyapunov equation):
0 = \Big( \frac{\partial V}{\partial x} \Big)^T f(x, u) + r(x, u) = H\Big( x, \frac{\partial V}{\partial x}, u \Big)
Iterative solution:
Pick a stabilizing initial control.
Find the cost:
0 = \Big( \frac{\partial V_j}{\partial x} \Big)^T f(x, h_j(x)) + r(x, h_j(x)) , \qquad V_j(0) = 0
Update the control:
h_{j+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_j}{\partial x}
• Convergence proved by Saridis 1979 if the Lyapunov eq. is solved exactly
• Beard & Saridis used complicated Galerkin integrals to solve the Lyapunov eq.
• Abu Khalaf & Lewis used NN to approximate V for nonlinear systems and proved convergence
Full system dynamics must be known
LQR Policy iteration = Kleinman algorithm
1. For a given control policy u = -L_k x , solve for the cost (Lyapunov eq.):
0 = A_k^T P_k + P_k A_k + C^T C + L_k^T R L_k , \qquad A_k = A - B L_k
2. Improve the policy:
L_{k+1} = R^{-1} B^T P_k
• If started with a stabilizing control policy L_0, the matrix P_k monotonically converges to the unique positive definite solution of the Riccati equation.
• Every iteration step will return a stabilizing controller.
• The system has to be known.
Kleinman 1968
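A sketch of Kleinman's algorithm using SciPy's continuous Lyapunov solver; the plant matrices and iteration count are illustrative assumptions, and L0 must be stabilizing as stated above.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def kleinman(A, B, C, R, L0, iters=20):
    """Kleinman's policy iteration for CT LQR: Lyapunov solve + gain improvement."""
    L = L0
    for _ in range(iters):
        Ak = A - B @ L
        # Solve 0 = Ak' P + P Ak + C'C + L' R L for P
        P = solve_continuous_lyapunov(Ak.T, -(C.T @ C + L.T @ R @ L))
        L = np.linalg.solve(R, B.T @ P)      # L_{k+1} = R^{-1} B' P_k
    return P, L

# Illustrative example (assumed values); L0 = 0 works here because A is already stable
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
P, L = kleinman(A, B, C=np.eye(2), R=np.eye(1), L0=np.zeros((1, 2)))
```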
Policy Iteration Solution
Policy iteration
(A - B B^T P_i)^T P_{i+1} + P_{i+1} (A - B B^T P_i) + P_i B B^T P_i + Q = 0
This is in fact a Newton's method. With
Ric(P) = A^T P + P A + Q - P B B^T P
policy iteration is
P_{i+1} = P_i - \big( Ric'_{P_i} \big)^{-1} Ric(P_i) , \qquad i = 0, 1, \dots
where the Frechet derivative is
Ric'_{P_i}(P) = (A - B B^T P_i)^T P + P (A - B B^T P_i)
Synopsis on Policy Iteration and ADP
Discrete-time
Policy iteration:
V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1}) = r(x_k, h_j(x_k)) + V_{j+1}\big[ f(x_k) + g(x_k) h_j(x_k) \big]
h_j(x_k) = -L_j x_k = -(I + B^T P_j B)^{-1} B^T P_j A x_k
If x_{k+1} is measured, the value update does not need knowledge of f(x) or g(x); one needs to know f(x_k) AND g(x_k) for the control update.
ADP greedy cost update:
V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})
Continuous-time
Policy iteration:
0 = \Big( \frac{\partial V_j}{\partial x} \Big)^T \dot x + r(x, h_j(x)) = \Big( \frac{\partial V_j}{\partial x} \Big)^T \big[ f(x) + g(x) h_j(x) \big] + r(x, h_j(x))
h_{j+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_j}{\partial x}
One must either measure dx/dt or know f(x), g(x) for the value update; only g(x) is needed for the control update.
What is greedy ADP for CT systems??
Draguna Vrabie
Policy Iterations without Lyapunov Equations
• An alternative to using policy iterations with Lyapunov equations is the following form of policy iterations:
V_j(x_0) = \int_0^{\infty} \big[ Q(x) + W(u_j) \big]\, dt   (measure the cost)
u_{j+1}(x) = -\frac{1}{2} R^{-1} g^T \frac{d V_j}{d x}
• Note that in this case, to solve for the Lyapunov function, you do not need to know the information about f(x).
Murray, Saeks, and Lendaris
Methods to obtain the solution
• Dynamic programming - built on Bellman's optimality principle; an alternative form for CT systems [Lewis & Syrmos 1995]:
V^*(x(t)) = \min_{u(\tau),\ t \le \tau \le t + \Delta t} \Big[ \int_t^{t + \Delta t} r(x(\tau), u(\tau))\, d\tau + V^*(x(t + \Delta t)) \Big]
r(x(\tau), u(\tau)) = x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau)
Draguna Vrabie
Solving for the cost – Our approach
u  Lx
For a given control
The cost satisfies
t T
V ( x(t )) 
T
T
(
x
Qx

u
Ru )dt  V ( x(t  T ))

t
c.f. DT case
Vh ( xk )  r ( xk , h( xk ))  Vh ( xk 1 )
LQR case
t T
x(t )T Px(t ) 
T
T
T
(
x
Qx

u
Ru
)
dt

x
(
t

T
)
Px(t  T )

t
Optimal gain is
f(x) and g(x) do not appear
L  R 1BT P
Policy Evaluation – Critic update
Let K be any state feedback gain for the system (1). One can measure the associated cost over the infinite time horizon as
V(t, x(t)) = \int_t^{t+T} x(\tau)^T \big( Q + K^T R K \big) x(\tau)\, d\tau + W(t+T, x(t+T))
where W(t+T, x(t+T)) is an initial infinite-horizon cost to go.
What to do about the tail – issues in Receding Horizon Control
Draguna Vrabie
Now Greedy ADP can be defined for CT Systems
Solving for the cost – Our approach
CT ADP Greedy iteration
Control policy: u_k(t) = -L_k x(t)
Cost update (LQR):
V_{k+1}(x(t_0)) = \int_{t_0}^{t_0+T} \big( x^T Q x + u_k^T R u_k \big)\, dt + V_k(x(t_0+T))
x_0^T P_{k+1} x_0 = \int_{t_0}^{t_0+T} \big( x^T Q x + u_k^T R u_k \big)\, dt + x_1^T P_k x_1
A and B do not appear.
Control gain update:
L_{k+1} = R^{-1} B^T P_{k+1}
B is needed for the control update.
Implement using a quadratic basis set:
p_{i+1}^T \bar x(t) = \int_t^{t+T} x(\tau)^T \big( Q + P_i B R^{-1} B^T P_i \big) x(\tau)\, d\tau + p_i^T \bar x(t+T)
u(t+T) in terms of x(t+T) - OK.
No initial stabilizing control is needed.
Direct Optimal Adaptive Control for Partially Unknown CT Systems
Algorithm Implementation
Measure the cost increment by adding V as a state; then \dot V = x^T Q x + u_k^T R u_k .
The critic update
x^T(t) P_{i+1} x(t) = \int_t^{t+T} x^T(\tau) \big( Q + K_i^T R K_i \big) x(\tau)\, d\tau + x^T(t+T) P_i x(t+T)
can be set up, using the quadratic basis set \bar x, as
p_{i+1}^T \bar x(t) = \int_t^{t+T} x(\tau)^T \big( Q + K_i^T R K_i \big) x(\tau)\, d\tau + p_i^T \bar x(t+T) \equiv d(\bar x(t), K_i)
Evaluating d(\bar x(t), K_i) for n(n+1)/2 trajectory points, one can set up a least-squares problem to solve
p_{i+1} = (X X^T)^{-1} X Y
X = [\bar x^1(t)\ \ \bar x^2(t)\ \ \dots\ \ \bar x^N(t)] , \qquad Y = [d(\bar x^1, K_i)\ \ d(\bar x^2, K_i)\ \ \dots\ \ d(\bar x^N, K_i)]^T
Or use recursive Least-Squares along the trajectory.
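A sketch of the batch least-squares critic update above for the LQR case: simulate the closed loop over intervals of length T, accumulate the measured cost d(x̄(t), K_i), and solve the least-squares problem. The simulation, the example plant, and the Euler integration are assumptions; the plant is used only to generate data, mimicking measurements.

```python
import numpy as np

def xbar(x):
    """Quadratic basis of the state, matching the n(n+1)/2 entries of symmetric P."""
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

def critic_ls_update(A, B, Q, R, K, p_i, x0, T=0.05, dt=0.001, N=None):
    """One CT ADP critic iteration: p_{i+1} = (X X^T)^{-1} X Y (sketch)."""
    n = A.shape[0]
    N = N or n * (n + 1) // 2 + 2                 # at least n(n+1)/2 data points
    X, Y, x = [], [], np.array(x0, dtype=float)
    for _ in range(N):
        x_start, cost = x.copy(), 0.0
        for _ in range(int(T / dt)):              # integrate cost and state over [t, t+T]
            u = -K @ x
            cost += (x @ Q @ x + u @ R @ u) * dt
            x = x + dt * (A @ x + B @ u)          # Euler step, stands in for measurements
        X.append(xbar(x_start))
        Y.append(cost + p_i @ xbar(x))            # d(xbar(t), K_i)
    X = np.array(X).T                             # columns are the xbar(t) samples
    # Solve p_{i+1} = (X X^T)^{-1} X Y in a numerically robust way
    return np.linalg.lstsq(X.T, np.array(Y), rcond=None)[0]

# One critic iteration for an assumed plant; K is the current policy gain
A = np.array([[0.0, 1.0], [-1.0, -1.4]]); B = np.array([[0.0], [1.0]])
p_next = critic_ls_update(A, B, np.eye(2), np.eye(1), K=np.array([[0.5, 0.5]]),
                          p_i=np.zeros(3), x0=[1.0, -1.0])
```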
Draguna Vrabie
Analysis of the algorithm
u k   Lk x
For a given control policy
x  Ax  Bu ;
with Lk  R
1 T
B Pk
x(0)
Ak  A  BR1BT Pk
x  e Ak t x(0)
t T
Greedy update Vi 1 ( x(t ))  t
x Qx  u
T
T
i
Rui  d  Vi ( x(t  T )), V0  0 is equivalent to
t 0 T
Pk 1 

e
AkT t
(Q  Lk RLk )e
T
Ak t
dt  e
AkT (T  t 0 )
Pk e Ak (T t0 )
t0
a strange pseudo-discretized RE
c.f. DT RE

Pk 1  A Pk A  Q  A Pk B Pk  B PK B
T
T
T


1

BT Pk A
Pk 1  A Pk Ak  Q  L Pk  B PK B Lk
T
k
T
k
T
Draguna Vrabie
Analysis of the algorithm
This extra term means the initial control action need not be stabilizing.
Lemma 2. CT HDP is equivalent to
P_{k+1} = P_k + \int_0^{T} e^{A_k^T t} \big( P_k A + A^T P_k + Q - L_k^T R L_k \big) e^{A_k t}\, dt , \qquad A_k = A - B R^{-1} B^T P_k
ADP solves the CT ARE without knowledge of the system dynamics f(x)
Solve the Riccati Equation
WITHOUT knowing the plant dynamics
Model-free ADP
Direct OPTIMAL ADAPTIVE CONTROL
Works for Nonlinear Systems
Proofs?
Robustness?
Comparison with adaptive control methods?
[Figure: piecewise-constant gain (policy) updates L_k at iterations k = 0, 1, 2, 3, 4, 5, with continuous-time control u_k(t) = -L_k x(t) applied between updates.]
Sample periods need not be the same.
Continuous-time control with discrete gain updates.
Neurobiology
Higher Central Control of Afferent Input
Descending tracts from the brain influence not only motor neurons but
also the gamma-neurons which regulate sensitivity of the muscle spindle.
Central control of end-organ sensitivity has been demonstrated.
Many brain structures exert control of the first synapse in ascending
systems.
Role of cerebello rubrospinal tract and Purkinje Cells?
T.C. Ruch and H.D. Patton, Physiology and Biophysics, pp. 213, 497, Saunders,
London, 1966.
Small Time-Step Approximate Tuning for Continuous-Time Adaptive Critics
H\Big( x, \frac{\partial V}{\partial x}, u \Big) = \dot V(x) + r(x, u) \approx \frac{V_{t+1} - V_t}{\Delta t} + r(x_t, u_t)
Baird's Advantage function:
A_1(x_t, u_t) = \frac{ r(x_t, u_t) + V^*(x_{t+1}) - V^*(x_t) }{\Delta t}
Advantage learning is a sort of first-order approximation to our method.
Results comparing the performances
of DT-ADHDP and CT-HDP
Submitted to IJCNN’07 Conference
Asma Al-Tamimi and Draguna Vrabie
System, cost function, optimal solution
System - power plant (Wang, Y., R. Zhou, C. Wen - 1993):
\dot x = A x + B u , \qquad x \in R^n , \ u \in R^m
A = \begin{bmatrix} -0.0665 & 8 & 0 & 0 \\ 0 & -3.663 & 3.663 & 0 \\ -6.86 & 0 & -13.736 & -13.736 \\ 0.6 & 0 & 0 & 0 \end{bmatrix} , \qquad B = \begin{bmatrix} 0 & 0 & 13.736 & 0 \end{bmatrix}^T
Cost:
V^*(x_0) = \min_{u(t)} \int_0^{\infty} \big( x^T Q x + u^T R u \big)\, d\tau , \qquad Q = I_n , \ R = I_m
CARE:
A^T P + P A - P B R^{-1} B^T P + Q = 0
P_CARE =
[ 0.4750  0.4766  0.0601  0.4751
  0.4766  0.7831  0.1237  0.3829
  0.0601  0.1237  0.0513  0.0298
  0.4751  0.3829  0.0298  2.3370 ]
CT HDP results
V^*(x_0) = \min_{u(t)} \int_0^{\infty} \big( x^T Q_{CT} x + u^T R_{CT} u \big)\, d\tau
[Plot: convergence of the P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) vs. time (s) for CT HDP.]
The state measurements were taken at each 0.1 s time period.
A cost function update was performed at each 1.5 s.
For the 60 s duration of the simulation, 40 iterations (control policy updates) were performed.
P_CT-HDP =
[ 0.4753  0.4771  0.0602  0.4770
  0.4771  0.7838  0.1238  0.3852
  0.0602  0.1238  0.0513  0.0302
  0.4770  0.3852  0.0302  2.3462 ]
The discrete version was obtained by discretizing the continuous-time model using the zero-order-hold method with sample time T = 0.01 s.
V^*(x_k) = \min_{u_t,\ t \in [k, \infty)} \sum_{t=k}^{\infty} \big( x_t^T (Q_{CT} T) x_t + u_t^T (R_{CT} T) u_t \big)
DT ADHDP results
[Plot: convergence of the P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) vs. time (s) for DT ADHDP.]
The state measurements were taken at each 0.01 s time period.
A cost function update was performed at each 0.15 s.
For the 60 s duration of the simulation, 400 iterations (control policy updates) were performed.
Continuous-time used only 40 iterations!
P_DT-ADHDP =
[ 0.4802  0.4768  0.0603  0.4754
  0.4768  0.7887  0.1239  0.3834
  0.0603  0.1239  0.0567  0.0300
  0.4754  0.3843  0.0300  2.3433 ]
Comparison of CT and DT ADP
• CT HDP
– Partially model free (the system A matrix is not
required to be known)
• DT ADHDP – Q learning
– Completely model free
The DT ADHDP algorithm is computationally more intensive than the CT HDP since it uses a smaller sampling period.
4 US Patents
Sponsored by Paul Werbos
NSF
Call for Papers
IEEE Transactions on Systems, Man, & Cybernetics- Part B
Special Issue on
Adaptive Dynamic Programming and Reinforcement Learning
in Feedback Control
George Lendaris
Derong Liu
F.L. Lewis
Papers due 1 August 2007