L-BFGS and Dynamical System


L-BFGS and Delayed Dynamical Systems
Approach for Unconstrained Optimization
Xiaohui XIE
Supervisor: Dr. Hon Wah TAM
Outline
1. Problem background and introduction
2. Analysis for dynamical systems with time delay
   ● Introduction of dynamical systems
   ● Delayed dynamical systems approach
   ● Uniqueness property of dynamical systems
3. Numerical testing
4. Main stages of this research
APPENDIX
1. Problem background and introduction
Optimization problems are classified into four types; our
research focuses on unconstrained optimization problems.
$$\min_{x \in \mathbb{R}^n} f(x), \qquad f : \mathbb{R}^n \to \mathbb{R}^1 \tag{UP}$$
Descent direction
A common theme behind all these methods is to find a
direction $p(x) \in \mathbb{R}^n$ so that there exists an $\bar{\alpha} > 0$ such
that
$$f(x + \alpha p) < f(x) \quad \forall\, \alpha \in (0, \bar{\alpha}).$$
Steepest descent method
For (UP), $p$ is a descent direction at $x$ if
$$\nabla f(x)^T p < 0.$$
$p = -\nabla f(x)$ or $p = -\nabla f(x) / \|\nabla f(x)\|_2$ is a descent
direction for $f(x)$.
Method of Steepest Descent
Find $\alpha_k$ that solves
$$\min_{\alpha > 0} f\bigl(x_k - \alpha \nabla f(x_k)\bigr).$$
Then $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$.
Unfortunately, the steepest descent method
converges only linearly, and sometimes very
slowly.
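The iteration above can be sketched on a small example. The following is a minimal illustration, assuming a convex quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$ (not a problem from the slides), for which the exact step length has the closed form $\alpha_k = g_k^T g_k / g_k^T A g_k$. The helper names `dot`, `matvec` and `steepest_descent` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def steepest_descent(A, b, x0, tol=1e-8, max_iter=10_000):
    """Minimize f(x) = 0.5 x^T A x - b^T x (A symmetric p.d.) by steepest descent.

    Assumption: f is quadratic, so the exact line search has a closed form.
    Returns the final iterate and the number of iterations used.
    """
    x = list(x0)
    for k in range(max_iter):
        Ax = matvec(A, x)
        g = [Ax[i] - b[i] for i in range(len(x))]  # gradient A x - b
        gg = dot(g, g)
        if gg ** 0.5 < tol:
            return x, k
        alpha = gg / dot(g, matvec(A, g))  # exact minimizer along -g
        x = [x[i] - alpha * g[i] for i in range(len(x))]
    return x, max_iter


# Ill-conditioned example (condition number 10): many iterations are needed.
x_min, iters = steepest_descent([[1.0, 0.0], [0.0, 10.0]], [0.0, 0.0], [10.0, 1.0])
```

Even on this two-dimensional problem the iteration count grows quickly with the condition number of $A$, which is the slow linear convergence the slide refers to.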
Newton’s method
Newton’s direction:
$$-\bigl(\nabla^2 f(x_k)\bigr)^{-1} \nabla f(x_k)$$
Newton’s method: given $x_0$, compute
$$x_{k+1} = x_k - \bigl(\nabla^2 f(x_k)\bigr)^{-1} \nabla f(x_k), \qquad k = k + 1.$$
Although Newton’s method converges very fast,
the Hessian matrix is difficult to compute.
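To illustrate the trade-off, here is a minimal sketch of one Newton step in two dimensions. It assumes a quadratic test function with Hessian $A$ and gradient $Ax - b$ (an illustrative choice, not from the slides), for which a single step from any starting point lands on the minimizer. The names `solve2x2`, `newton_step`, `grad_q` and `hess_q` are hypothetical.

```python
def solve2x2(M, r):
    """Solve the 2x2 linear system M d = r by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * r[0] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]


def newton_step(grad, hess, x):
    """Return x - (hess(x))^{-1} grad(x) for a 2-dimensional problem."""
    d = solve2x2(hess(x), grad(x))
    return [x[0] - d[0], x[1] - d[1]]


# Assumed quadratic: f(x) = 0.5 x^T A x - b^T x with A = [[3, 1], [1, 2]], b = [1, 1].
def grad_q(x):
    return [3.0 * x[0] + x[1] - 1.0, x[0] + 2.0 * x[1] - 1.0]


def hess_q(x):
    return [[3.0, 1.0], [1.0, 2.0]]


x_new = newton_step(grad_q, hess_q, [5.0, -3.0])  # the minimizer is (0.2, 0.4)
```

The price of this speed is that every step needs the full Hessian and a linear solve, which is what the quasi-Newton methods avoid.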
Quasi-Newton method—BFGS


Instead of using the Hessian matrix, the quasi-Newton
methods approximate it.
In quasi-Newton methods, the inverse of the Hessian
matrix is approximated in each iteration by a positive
definite (p.d.) matrix, say $H_k$:
$$p_k = -H_k \nabla f(x_k)$$
$H_k$ being symmetric and p.d. implies the descent
property.
BFGS
The most important quasi-Newton formula: BFGS.
$$H_{k+1}^{BFGS} = H_k + \left(1 + \frac{y_k^T H_k y_k}{s_k^T y_k}\right) \frac{s_k s_k^T}{s_k^T y_k} - \frac{s_k y_k^T H_k + H_k y_k s_k^T}{s_k^T y_k} \tag{2}$$
where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k) = g_{k+1} - g_k$.
THEOREM 1 If $H_k$ is a p.d. matrix and $s_k^T y_k > 0$, then $H_{k+1}^{BFGS}$ in (2) is also positive definite.
(Hint: we can write $H_k = L L^T$, and let $a = L^T z$ and $b = L^T y_k$.)
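Formula (2) and the claim of Theorem 1 can be checked numerically on a small case. The sketch below assumes $H_k$ is symmetric and stored as a list of rows. The function name `bfgs_update` and the sample vectors are hypothetical. It verifies the secant condition $H_{k+1} y_k = s_k$ and positive definiteness of the 2x2 result via its leading minors.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def bfgs_update(H, s, y):
    """Apply formula (2) to a symmetric matrix H (list of rows)."""
    sy = dot(s, y)
    if sy <= 0.0:
        raise ValueError("curvature condition s^T y > 0 is violated")
    Hy = matvec(H, y)                     # H y (equals (y^T H)^T for symmetric H)
    c = (1.0 + dot(y, Hy) / sy) / sy      # coefficient of s s^T
    n = len(s)
    return [[H[i][j] + c * s[i] * s[j] - (s[i] * Hy[j] + Hy[i] * s[j]) / sy
             for j in range(n)] for i in range(n)]


H0 = [[1.0, 0.0], [0.0, 1.0]]
s0 = [1.0, 0.5]
y0 = [2.0, 0.3]          # s0^T y0 = 2.15 > 0
H1 = bfgs_update(H0, s0, y0)
```

For a 2x2 symmetric matrix, positive definiteness is equivalent to `H1[0][0] > 0` together with a positive determinant, which is what the theorem predicts here.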
Limited-Memory Quasi-Newton Methods: L-BFGS



Limited-memory quasi-Newton methods are useful for
solving large problems whose Hessian matrices cannot
be computed at a reasonable cost or are not sparse.
Various limited-memory methods have been proposed;
we focus mainly on an algorithm known as L-BFGS.
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T \tag{3}$$
where
$$\rho_k = \frac{1}{y_k^T s_k}, \qquad V_k = I - \rho_k y_k s_k^T, \qquad s_k = x_{k+1} - x_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k.$$
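The L-BFGS approximation $H_{k+1}$ satisfies the following formula:
for $k + 1 \le m$:
$$\begin{aligned}
H_{k+1} ={}& V_k^T V_{k-1}^T \cdots V_0^T H_0 V_0 \cdots V_{k-1} V_k \\
&+ V_k^T \cdots V_1^T \rho_0 s_0 s_0^T V_1 \cdots V_k \\
&+ \cdots \\
&+ V_k^T V_{k-1}^T \rho_{k-2} s_{k-2} s_{k-2}^T V_{k-1} V_k \\
&+ V_k^T \rho_{k-1} s_{k-1} s_{k-1}^T V_k \\
&+ \rho_k s_k s_k^T.
\end{aligned} \tag{6}$$
for $k + 1 > m$:
$$\begin{aligned}
H_{k+1} ={}& V_k^T V_{k-1}^T \cdots V_{k-m+1}^T H_0 V_{k-m+1} \cdots V_{k-1} V_k \\
&+ V_k^T \cdots V_{k-m+2}^T \rho_{k-m+1} s_{k-m+1} s_{k-m+1}^T V_{k-m+2} \cdots V_k \\
&+ \cdots \\
&+ V_k^T V_{k-1}^T \rho_{k-2} s_{k-2} s_{k-2}^T V_{k-1} V_k \\
&+ V_k^T \rho_{k-1} s_{k-1} s_{k-1}^T V_k \\
&+ \rho_k s_k s_k^T.
\end{aligned} \tag{7}$$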
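In practice the matrix in (6) and (7) is rarely formed. A common alternative, not shown on the slides, is the standard two-loop recursion, which computes the product $H_{k+1} g$ from the stored pairs using only vector operations. The sketch below assumes $H_0 = h_0 I$ and pairs stored oldest first. The name `lbfgs_apply` and the sample data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lbfgs_apply(g, s_list, y_list, h0=1.0):
    """Return H g, with H built from the stored (s_i, y_i) pairs as in (6)/(7).

    Assumptions: pairs are ordered oldest first, every y_i^T s_i > 0,
    and the initial matrix is H_0 = h0 * I.
    """
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = [0.0] * len(s_list)
    q = list(g)
    # First loop: newest pair to oldest pair.
    for i in reversed(range(len(s_list))):
        alphas[i] = rhos[i] * dot(s_list[i], q)
        q = [qj - alphas[i] * yj for qj, yj in zip(q, y_list[i])]
    r = [h0 * qj for qj in q]  # apply H_0
    # Second loop: oldest pair to newest pair.
    for i in range(len(s_list)):
        beta = rhos[i] * dot(y_list[i], r)
        r = [rj + (alphas[i] - beta) * sj for rj, sj in zip(r, s_list[i])]
    return r


# Sample pairs generated from a quadratic with Hessian [[3, 1], [1, 2]], so y = A s.
s_pairs = [[-3.0, 2.0], [-1.0, 1.5]]
y_pairs = [[-7.0, 1.0], [-1.5, 2.0]]
direction = lbfgs_apply([2.5, 1.0], s_pairs, y_pairs)  # H g; the search direction is its negative
```

Keeping only the last `m` pairs (for example `s_pairs[-m:]` and `y_pairs[-m:]`) gives the truncated case (7). Passing all pairs collected so far corresponds to (6).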
2. Analysis for dynamical systems with time delay
The unconstrained problem (UP) is reproduced here:
$$\min_{x \in \mathbb{R}^n} f(x), \qquad f : \mathbb{R}^n \to \mathbb{R}^1 \tag{8}$$
It is very important that the optimization problem
is posed in continuous form, i.e. x can be
changed continuously.
The conventional methods are formulated in
discrete form.

Dynamical system approach
The essence of this approach is to convert (UP) into a
dynamical system or an ordinary differential equation
(o.d.e.) so that the solution of this problem corresponds
to a stable equilibrium point of this dynamical system.

Neural network approach
The mathematical representation of a neural network is an
ordinary differential equation which is asymptotically
stable at any isolated solution point.
Consider the following simple dynamical system or ode:
$$\frac{dx(t)}{dt} = p(x) \tag{9}$$
DEFINITION 1. (Equilibrium point). A point $x^* \in \mathbb{R}^n$ is called an
equilibrium point of (9) if $p(x^*) = 0$.
DEFINITION 3. (Convergence). Let $x(t)$ be the solution of (9). An
isolated equilibrium point $x^*$ is convergent if there exists a $\delta > 0$ such
that if $\|x(t_0) - x^*\| < \delta$, then $x(t) \to x^*$ as $t \to \infty$.
Some Dynamical system versions
● Based on the steepest descent direction
$$\frac{dx}{dt} = -\nabla f(x(t))$$
● Based on Newton’s direction
$$\frac{dx(t)}{dt} = -\bigl(\nabla^2 f(x(t))\bigr)^{-1} \nabla f(x(t))$$
● Other dynamical systems
$$\frac{dx(t)}{dt} = s(t)\, p(x(t))$$
$$a(t) \frac{d^2 x(t)}{dt^2} + b(t)\, B(x(t)) \frac{dx(t)}{dt} = p(x(t))$$
Dynamical system approach can solve very large
problems.
How to find a “good” $p(x)$?
The dynamical system approach normally consists of the
following three steps:
● to establish an ode system;
● to study the convergence of the solution $x(t)$ of the ode as
  $t \to \infty$; and
● to solve the ode system numerically.
Even though the solutions of ode systems are
continuous, the actual computation has to be done
discretely.
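The third step, solving the ode numerically, can be sketched with the simplest possible integrator. The example below applies the explicit Euler method to the steepest descent system $dx/dt = -\nabla f(x(t))$, assuming the illustrative function $f(x) = \tfrac{1}{2}(x_1^2 + 10 x_2^2)$ and a fixed step size. The names `euler_gradient_flow` and `grad_f` are hypothetical. A production code would use an adaptive, possibly stiff, integrator instead.

```python
def euler_gradient_flow(grad, x0, h, steps):
    """Integrate dx/dt = -grad(x) with the explicit Euler method.

    Returns the list of computed states, starting with x0.
    """
    x = list(x0)
    path = [list(x)]
    for _ in range(steps):
        g = grad(x)
        x = [xi - h * gi for xi, gi in zip(x, g)]  # x(t + h) ~ x(t) - h * grad f(x(t))
        path.append(list(x))
    return path


def grad_f(x):
    """Gradient of the assumed test function f(x) = 0.5 * (x_1^2 + 10 x_2^2)."""
    return [x[0], 10.0 * x[1]]


path = euler_gradient_flow(grad_f, [1.0, 1.0], h=0.05, steps=400)
```

With a fixed step, explicit Euler on this system is the same recurrence as gradient descent with a constant step length, so the discrete computation inherits a stability limit on `h` (here `h < 0.2`).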
Delayed dynamical systems approach
● Steepest descent direction: slow convergence
● Newton’s direction: difficult to compute
● Aim of the delayed approach: fast convergence and easy to calculate
The delayed dynamical systems approach solves the
delayed o.d.e.
$$\frac{dx(t)}{dt} = -H\bigl(x(t), x(t - \tau_1(t)), \ldots, x(t - \tau_m(t))\bigr) \nabla f(x(t)). \tag{13}$$
For $t_{m-1} < t$, we use
$$\begin{aligned}
H\bigl(x(t), x(t_{m-1}), \ldots, x(t_0)\bigr) :={}& H_m\bigl(x(t), x(t_{m-1}), \ldots, x(t_1), x(t_0)\bigr) \\
:={}& V_{m-1}(t)^T V_{m-2}(t_{m-1})^T \cdots V_1(t_2)^T V_0(t_1)^T H_0 V_0(t_1) V_1(t_2) \cdots V_{m-2}(t_{m-1}) V_{m-1}(t) \\
&+ V_{m-1}(t)^T V_{m-2}(t_{m-1})^T \cdots V_1(t_2)^T \rho_0(t_1) s_0(t_1) s_0(t_1)^T V_1(t_2) \cdots V_{m-2}(t_{m-1}) V_{m-1}(t) \\
&+ \cdots \\
&+ V_{m-1}(t)^T \rho_{m-2}(t_{m-1}) s_{m-2}(t_{m-1}) s_{m-2}(t_{m-1})^T V_{m-1}(t) \\
&+ \rho_{m-1}(t) s_{m-1}(t) s_{m-1}(t)^T,
\end{aligned} \tag{13A}$$
where
$$y_{m-1}(t) = \nabla f(x(t)) - \nabla f(x(t_{m-1})), \qquad s_{m-1}(t) = x(t) - x(t_{m-1}),$$
$$\rho_{m-1}(t) = \frac{1}{y_{m-1}(t)^T s_{m-1}(t)}, \qquad V_{m-1}(t) = I - \rho_{m-1}(t)\, y_{m-1}(t)\, s_{m-1}(t)^T.$$
This formula is used to compute $x_m$ at $t_m$.
Beyond this point we save only m previous values of x. The
definition of H is now, for $m \le k$ and $t_k < t$,
$$\begin{aligned}
H\bigl(x(t), x(t_k), \ldots, x(t_{k-m+2}), x(t_{k-m+1})\bigr) :={}& H_{k+1}\bigl(x(t), x(t_k), \ldots, x(t_{k-m+2}), x(t_{k-m+1})\bigr) \\
:={}& V_k(t)^T V_{k-1}(t_k)^T \cdots V_{k-m+2}(t_{k-m+3})^T V_{k-m+1}(t_{k-m+2})^T H_0 V_{k-m+1}(t_{k-m+2}) V_{k-m+2}(t_{k-m+3}) \cdots V_{k-1}(t_k) V_k(t) \\
&+ V_k(t)^T V_{k-1}(t_k)^T \cdots V_{k-m+2}(t_{k-m+3})^T \rho_{k-m+1}(t_{k-m+2}) s_{k-m+1}(t_{k-m+2}) s_{k-m+1}(t_{k-m+2})^T V_{k-m+2}(t_{k-m+3}) \cdots V_{k-1}(t_k) V_k(t) \\
&+ \cdots \\
&+ V_k(t)^T \rho_{k-1}(t_k) s_{k-1}(t_k) s_{k-1}(t_k)^T V_k(t) \\
&+ \rho_k(t) s_k(t) s_k(t)^T,
\end{aligned} \tag{13B}$$
where
$$y_k(t) = \nabla f(x(t)) - \nabla f(x(t_k)), \qquad s_k(t) = x(t) - x(t_k),$$
$$\rho_k(t) = \frac{1}{y_k(t)^T s_k(t)}, \qquad V_k(t) = I - \rho_k(t)\, y_k(t)\, s_k(t)^T.$$
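To make the structure of (13B) concrete, the following sketch evaluates the right-hand side of (13) from a short history of saved states. It assumes $H_0 = h_0 I$, builds $H$ by applying the product-form update $V^T H V + \rho s s^T$ to consecutive state differences (oldest first), and uses an illustrative quadratic gradient. The names `product_form_update`, `delayed_rhs` and `grad_q` are hypothetical, and this is not the integration scheme used in the reported experiments.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def product_form_update(H, s, y):
    """Return V^T H V + rho s s^T with V = I - rho y s^T (H symmetric)."""
    rho = 1.0 / dot(y, s)
    Hy = matvec(H, y)
    c = rho * rho * dot(y, Hy) + rho
    n = len(s)
    return [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + c * s[i] * s[j]
             for j in range(n)] for i in range(n)]


def delayed_rhs(states, grad, h0=1.0):
    """Evaluate -H(...) grad f(x(t)) for states = [x(t_{k-m+1}), ..., x(t_k), x(t)]."""
    n = len(states[-1])
    H = [[h0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    grads = [grad(x) for x in states]
    for xa, ga, xb, gb in zip(states, grads, states[1:], grads[1:]):
        s = [b - a for a, b in zip(xa, xb)]   # difference of consecutive states
        y = [b - a for a, b in zip(ga, gb)]   # difference of their gradients
        if dot(y, s) <= 0.0:
            raise ValueError("y^T s > 0 is required for positive definiteness")
        H = product_form_update(H, s, y)
    return [-v for v in matvec(H, grads[-1])], H


def grad_q(x):
    """Gradient of an assumed quadratic with Hessian [[3, 1], [1, 2]] and b = [1, 1]."""
    return [3.0 * x[0] + x[1] - 1.0, x[0] + 2.0 * x[1] - 1.0]


history = [[5.0, -3.0], [2.0, -1.0], [1.0, 0.5]]   # two saved states plus the current one
rhs, H = delayed_rhs(history, grad_q)
```

Because $H$ is positive definite whenever every $y^T s > 0$, the returned vector is a descent direction for $f$ at the current state.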
Uniqueness property of dynamical systems
Lipschitz continuity:
$$\|F(x_1) - F(x_2)\| \le L \|x_1 - x_2\|$$
$$\|H(u, w) \nabla f(u) - H(\bar{u}, w) \nabla f(\bar{u})\| \le L_1 \|u - \bar{u}\|,$$
$$\|H(u, w) \nabla f(u) - H(u, \bar{w}) \nabla f(u)\| \le L_2 \|w - \bar{w}\|.$$
Lemma 2.6
Let $F : \mathbb{R}^n \to \mathbb{R}^m$ be continuously differentiable in the
open convex set $D \subset \mathbb{R}^n$, $x \in D$, and let $J = \partial F / \partial x$ be
Lipschitz continuous at $x$ in the neighborhood $D$,
using a vector norm and the induced matrix operator
norm and the Lipschitz constant $\gamma$. Then, for any
$x + p \in D$,
$$\|F(x + p) - F(x) - J(x) p\| \le \frac{\gamma}{2} \|p\|^2.$$
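A quick numerical check can make the bound tangible. The sketch below assumes the illustrative map $F(x) = (x_1^2, x_2^2)$, whose Jacobian $J(x) = \mathrm{diag}(2x_1, 2x_2)$ is Lipschitz continuous with constant $\gamma = 2$ in the 2-norm. The names `F`, `J_times` and `lemma_sides` are hypothetical.

```python
import math


def F(x):
    """Assumed example map F(x) = (x_1^2, x_2^2)."""
    return [x[0] ** 2, x[1] ** 2]


def J_times(x, p):
    """Product J(x) p for the Jacobian J(x) = diag(2 x_1, 2 x_2)."""
    return [2.0 * x[0] * p[0], 2.0 * x[1] * p[1]]


def norm2(v):
    return math.sqrt(sum(a * a for a in v))


def lemma_sides(x, p, gamma=2.0):
    """Return (||F(x+p) - F(x) - J(x)p||, (gamma/2) ||p||^2) in the 2-norm."""
    xp = [a + b for a, b in zip(x, p)]
    Fx, Fxp, Jp = F(x), F(xp), J_times(x, p)
    lhs = norm2([Fxp[i] - Fx[i] - Jp[i] for i in range(2)])
    rhs = 0.5 * gamma * norm2(p) ** 2
    return lhs, rhs


lhs, rhs = lemma_sides([0.5, -1.0], [0.1, -0.2])
```

For this map the remainder is exactly $(p_1^2, p_2^2)$, so the left side is $\sqrt{p_1^4 + p_2^4}$ and the right side is $p_1^2 + p_2^2$, with equality when one component of $p$ is zero.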
3. Numerical testing
Test problems
● Extended Rosenbrock function
● Penalty function Ⅰ
● Variable dimensioned function
● Linear function-rank 1
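As an illustration of the first test problem, here is a minimal sketch of the extended Rosenbrock function and its gradient. It assumes the usual definition from the standard unconstrained test collections, $f(x) = \sum_{i=1}^{n/2} \bigl[100 (x_{2i} - x_{2i-1}^2)^2 + (1 - x_{2i-1})^2\bigr]$ with $n$ even. The slides do not state the definition, and the function names are hypothetical.

```python
def extended_rosenbrock(x):
    """Extended Rosenbrock function for an even-length list x (assumed definition)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(0, len(x), 2))


def extended_rosenbrock_grad(x):
    """Analytic gradient of extended_rosenbrock."""
    g = [0.0] * len(x)
    for i in range(0, len(x), 2):
        r = x[i + 1] - x[i] ** 2
        g[i] = -400.0 * x[i] * r - 2.0 * (1.0 - x[i])
        g[i + 1] = 200.0 * r
    return g


f_start = extended_rosenbrock([-1.2, 1.0])       # value at the customary starting point
g_start = extended_rosenbrock_grad([-1.2, 1.0])
```

The minimizer is the vector of ones, where both the function value and the gradient vanish.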
Result of modified Rosenbrock problem
Method             t         value    step
L-BFGS             2         0        497
Steepest descent   23.2813   0.0006   53557
[Figure: Comparison of function value; panels m=2, m=4, m=6]
[Figure: Comparison of norm of gradient; panels m=2, m=4, m=6]
A new code: RADAR5
The code RADAR5 is for stiff problems, including
differential-algebraic and neutral delay equations with
constant or state-dependent (eventually vanishing)
delays.
$$M y'(t) = f\bigl(t, y(t), y(t - \tau_1(t, y(t))), \ldots, y(t - \tau_m(t, y(t)))\bigr),$$
$$y(t_0) = y_0, \qquad y(t) = g(t) \ \text{ for } t < t_0.$$
4. Main stages of this research
● Prove that the function H in (13) is positive definite (see APPENDIX).
● Prove that H is Lipschitz continuous.
● Show that the solution to (13) is asymptotically stable.
● Show that (13) has a better rate of convergence than the
  dynamical system based on the steepest descent direction.
● Perform numerical testing.
● Apply this new optimization method to practical problems.
APPENDIX
To show that H in (13) is positive definite
Property 1. If $H_0$ is positive definite, the matrix $H$ defined
by (13) is positive definite (provided $y_i^T s_i > 0$ for all $i$).
I proved this result by induction. Since the continuous
analog of the L-BFGS formula has two cases, the proof
needs to cater for each of them.
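Case $k + 1 > m$:
● When $m = 1$, $H_{k+1}$ is p.d. (Theorem 1).
● Assume that $H_{k+1}^{l}$ is p.d. when $m = l$:
$$\begin{aligned}
H_{k+1}^{l} ={}& V_k^T V_{k-1}^T \cdots V_{k-l+1}^T H_0 V_{k-l+1} \cdots V_{k-1} V_k \\
&+ \bigl\{ V_k^T \cdots V_{k-l+2}^T \rho_{k-l+1} s_{k-l+1} s_{k-l+1}^T V_{k-l+2} \cdots V_k \\
&\quad + V_k^T \cdots V_{k-l+3}^T \rho_{k-l+2} s_{k-l+2} s_{k-l+2}^T V_{k-l+3} \cdots V_k \\
&\quad + \cdots \\
&\quad + V_k^T V_{k-1}^T \rho_{k-2} s_{k-2} s_{k-2}^T V_{k-1} V_k \\
&\quad + V_k^T \rho_{k-1} s_{k-1} s_{k-1}^T V_k + \rho_k s_k s_k^T \bigr\}.
\end{aligned} \tag{*}$$
● If $m = l + 1$:
$$H_{k+1}^{l+1} = V_k^T V_{k-1}^T \cdots V_{k-l+1}^T \bigl[ V_{k-l}^T H_0 V_{k-l} + \rho_{k-l} s_{k-l} s_{k-l}^T \bigr] V_{k-l+1} \cdots V_{k-1} V_k + \{*\},$$
where $\{*\}$ denotes the braced sum in (*).
The bracketed matrix is a single BFGS update of $H_0$ and is therefore p.d. by Theorem 1, so it plays the role of $H_0$ in the case $m = l$, and $H_{k+1}^{l+1}$ is p.d. by the induction hypothesis.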
Case $k + 1 \le m$:
In this case $m$ does not appear in the formula:
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T.$$
By the assumption that $H_k$ is p.d., it follows that $H_{k+1}$ is also p.d.
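The last step can be written out explicitly. The following short derivation is a sketch of the standard argument, using only the recursion $H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T$ and the assumptions that $H_k$ is p.d. and $y_k^T s_k > 0$ (so $\rho_k > 0$). The vector $z$ is an arbitrary nonzero vector introduced for the argument.

```latex
\begin{aligned}
z^T H_{k+1} z &= (V_k z)^T H_k (V_k z) + \rho_k (s_k^T z)^2 \;\ge\; 0
  && \text{since } H_k \text{ is p.d. and } \rho_k > 0. \\[4pt]
\text{If } V_k z \ne 0 &: \quad (V_k z)^T H_k (V_k z) > 0 . \\[4pt]
\text{If } V_k z = 0 &: \quad z = \rho_k\, y_k\, (s_k^T z)
  \;\Rightarrow\; s_k^T z \ne 0 \ (\text{otherwise } z = 0)
  \;\Rightarrow\; \rho_k (s_k^T z)^2 > 0 . \\[4pt]
\text{Hence } z^T H_{k+1} z &> 0 \quad \text{for all } z \ne 0 .
\end{aligned}
```

The same two-case argument applies to each nested update in (13A) and (13B), which is why the condition $y_i^T s_i > 0$ is required for every stored pair.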