L-BFGS and Delayed Dynamical Systems
Approach for Unconstrained Optimization
Xiaohui XIE
Supervisor: Dr. Hon Wah TAM
Outline
Problem background and introduction
Analysis for dynamical systems with time delay
Introduction of dynamical systems
Delayed dynamical systems approach
Uniqueness property of dynamical systems
Numerical testing
Main stages of this research
APPENDIX
1. Problem background and introduction
Optimization problems are classified into four categories; our research focuses on unconstrained optimization problems:

$\min_{x \in \mathbb{R}^n} f(x), \qquad f: \mathbb{R}^n \to \mathbb{R}^1.$   (UP)
Descent direction
A common theme behind all these methods is to find a direction $p(x) \in \mathbb{R}^n$ so that there exists an $\bar{\alpha} > 0$ such that

$f(x + \alpha p) < f(x), \qquad \alpha \in (0, \bar{\alpha}].$
Steepest descent method
For (UP), $p$ is a descent direction at $x$ if and only if $\nabla f(x)^T p < 0$.
Hence $p = -\nabla f(x)$ or $p = -\nabla f(x) / \|\nabla f(x)\|_2$ is a descent direction for $f$ at $x$.
Method of Steepest Descent
Find $\alpha_k$ that solves $\min_{\alpha \ge 0} f(x_k - \alpha \nabla f(x_k))$.
Then $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$.
Unfortunately, the steepest descent method converges only linearly, and the linear rate can be very slow.
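The iteration above can be sketched in a few lines of Python. This is only an illustrative sketch: the exact line search over $\alpha$ is replaced by a backtracking (Armijo) search, the usual practical stand-in, and the ill-conditioned quadratic test problem is my own choice, not from the original.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=100000):
    """Minimize f by steepest descent with a backtracking line search."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        p = -g                       # steepest descent direction
        alpha, fx = 1.0, f(x)
        # backtracking (Armijo) search: a practical stand-in for the
        # exact minimization over alpha described on the slide
        while f(x + alpha * p) > fx - 1e-4 * alpha * (g @ g):
            alpha *= 0.5
        x = x + alpha * p
    return x, max_iter

# ill-conditioned quadratic f(x) = 0.5 x'Ax: shows the slow linear rate
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_star, iters = steepest_descent(f, grad, np.array([1.0, 1.0]))
print(iters, x_star)
```

On this problem the iteration count grows with the condition number of $A$, which is exactly the "sometimes very slow" linear convergence noted above.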
Newton’s method
Newton’s direction: $-[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$.
Newton’s method: given $x_0$, compute

$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k), \qquad k \leftarrow k+1.$

Although Newton’s method converges very fast, the Hessian matrix is difficult to compute.
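A minimal sketch of the Newton iteration follows. The strongly convex quartic test function is my own illustrative choice; in practice one solves the Newton linear system rather than inverting the Hessian.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: x_{k+1} = x_k - [hess(x_k)]^{-1} grad(x_k)."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        # solve the Newton system instead of forming the inverse explicitly
        x = x - np.linalg.solve(hess(x), g)
    return x, max_iter

# strongly convex test function f(x) = sum(x_i^4 + x_i^2), minimizer at 0
grad = lambda x: 4 * x**3 + 2 * x
hess = lambda x: np.diag(12 * x**2 + 2)
x_star, iters = newton(grad, hess, np.array([1.0, -2.0]))
print(iters, x_star)
```

The fast (superlinear) convergence shows up as a handful of iterations where steepest descent would need hundreds, at the cost of forming and factoring the Hessian.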
Quasi-Newton method: BFGS
Instead of using the Hessian matrix, the quasi-Newton methods approximate it.
In quasi-Newton methods, the inverse of the Hessian matrix is approximated in each iteration by a positive definite (p.d.) matrix, say $H_k$:

$p_k = -H_k \nabla f(x_k).$

$H_k$ being symmetric and p.d. implies the descent property.
BFGS
The most important quasi-Newton formula: BFGS.

$H_{k+1}^{BFGS} = H_k + \left(1 + \frac{y_k^T H_k y_k}{s_k^T y_k}\right) \frac{s_k s_k^T}{s_k^T y_k} - \frac{s_k y_k^T H_k + H_k y_k s_k^T}{s_k^T y_k}$   (2)

where $s_k = x_{k+1} - x_k$, $y_k = \nabla f(x_{k+1}) - \nabla f(x_k) = g_{k+1} - g_k$.

THEOREM 1. If $H_k$ is a p.d. matrix and $s_k^T y_k > 0$, then $H_{k+1}^{BFGS}$ in (2) is also positive definite.
(Hint: we can write $H_k = L L^T$, and let $a = L^T z$ and $b = L^T y_k$.)
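Formula (2) and Theorem 1 can be checked numerically. The sketch below implements the update exactly as written in (2) and verifies, on random data satisfying the curvature condition $s_k^T y_k > 0$ (the random test data are my own, for illustration), that the result stays positive definite and satisfies the secant condition $H_{k+1} y_k = s_k$.

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation, formula (2):
    H+ = H + (1 + y'Hy/s'y) ss'/s'y - (s y'H + H y s')/s'y."""
    sy = s @ y
    assert sy > 0, "curvature condition s'y > 0 violated"
    Hy = H @ y
    return (H
            + (1.0 + (y @ Hy) / sy) * np.outer(s, s) / sy
            - (np.outer(s, Hy) + np.outer(Hy, s)) / sy)

# Theorem 1 in action: start from a p.d. H and random data with s'y > 0
rng = np.random.default_rng(0)
H = np.eye(3)
s, y = rng.standard_normal(3), rng.standard_normal(3)
if s @ y <= 0:
    y = -y                      # enforce the curvature condition
H1 = bfgs_update(H, s, y)
print(np.linalg.eigvalsh(H1))  # all eigenvalues positive, per Theorem 1
```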
Limited-Memory Quasi-Newton Methods: L-BFGS
Limited-memory quasi-Newton methods are useful for solving large problems whose Hessian matrices cannot be computed at a reasonable cost or are not sparse.
Various limited-memory methods have been proposed; we focus mainly on an algorithm known as L-BFGS.

$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T$   (3)

where

$\rho_k = \frac{1}{y_k^T s_k}, \quad s_k = x_{k+1} - x_k, \quad y_k = \nabla f_{k+1} - \nabla f_k, \quad V_k = I - \rho_k y_k s_k^T.$
The L-BFGS approximation $H_{k+1}$ satisfies the following formula:
for $k+1 \le m$,

$H_{k+1} = (V_k^T V_{k-1}^T \cdots V_1^T V_0^T)\, H_0\, (V_0 V_1 \cdots V_{k-1} V_k)$
$\qquad + (V_k^T \cdots V_1^T)\, \rho_0 s_0 s_0^T\, (V_1 \cdots V_k)$
$\qquad + \cdots$
$\qquad + V_k^T V_{k-1}^T\, \rho_{k-2} s_{k-2} s_{k-2}^T\, V_{k-1} V_k$
$\qquad + V_k^T\, \rho_{k-1} s_{k-1} s_{k-1}^T\, V_k$
$\qquad + \rho_k s_k s_k^T;$   (6)

for $k+1 > m$,

$H_{k+1} = (V_k^T V_{k-1}^T \cdots V_{k-m+1}^T)\, H_0\, (V_{k-m+1} \cdots V_{k-1} V_k)$
$\qquad + (V_k^T \cdots V_{k-m+2}^T)\, \rho_{k-m+1} s_{k-m+1} s_{k-m+1}^T\, (V_{k-m+2} \cdots V_k)$
$\qquad + \cdots$
$\qquad + V_k^T V_{k-1}^T\, \rho_{k-2} s_{k-2} s_{k-2}^T\, V_{k-1} V_k$
$\qquad + V_k^T\, \rho_{k-1} s_{k-1} s_{k-1}^T\, V_k$
$\qquad + \rho_k s_k s_k^T.$   (7)
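In practice the expansions (6) and (7) are never formed as matrices; the product $H_{k+1} g$ is evaluated implicitly by the standard L-BFGS two-loop recursion. A minimal sketch follows, with $H_0$ taken as a scalar multiple of the identity (a common choice, and an assumption of this demo):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list, H0_diag=1.0):
    """Compute H_{k+1} g via the standard two-loop recursion, which
    evaluates products with the expansions (6)/(7) without ever
    forming the matrix H_{k+1}.  s_list/y_list are oldest-first."""
    alphas, rhos = [], []
    q = g.copy()
    # first loop: newest pair to oldest pair
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    r = H0_diag * q                  # apply the initial matrix H_0
    # second loop: oldest pair to newest pair
    for (s, y), rho, alpha in zip(zip(s_list, y_list),
                                  reversed(rhos), reversed(alphas)):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return r                          # r = H_{k+1} g
```

With a single stored pair this reproduces $(V_k^T H_0 V_k + \rho_k s_k s_k^T)\,g$ from (3), which is an easy sanity check on the recursion.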
2. Analysis for dynamical systems with time delay
The unconstrained problem (UP) is reproduced:

$\min_{x \in \mathbb{R}^n} f(x), \qquad f: \mathbb{R}^n \to \mathbb{R}^1.$   (8)

It is very important that the optimization problem is posed in the continuous form, i.e. $x$ can be changed continuously.
The conventional methods are addressed in the discrete form.
Dynamical system approach
The essence of this approach is to convert (UP) into a dynamical system or an ordinary differential equation (o.d.e.) so that the solution of this problem corresponds to a stable equilibrium point of this dynamical system.
Neural network approach
The mathematical representation of a neural network is an ordinary differential equation which is asymptotically stable at any isolated solution point.
Consider the following simple dynamical system, or o.d.e.,

$\frac{dx(t)}{dt} = p(x).$   (9)

DEFINITION 1 (Equilibrium point). A point $x^* \in \mathbb{R}^n$ is called an equilibrium point of (9) if $p(x^*) = 0$.

DEFINITION 3 (Convergence). Let $x(t)$ be the solution of (9). An isolated equilibrium point $x^*$ is convergent if there exists a $\delta > 0$ such that if $\|x(t_0) - x^*\| \le \delta$, then $x(t) \to x^*$ as $t \to \infty$.
Some dynamical system versions
Based on the steepest descent direction:

$\frac{dx(t)}{dt} = -\nabla f(x(t)).$

Based on Newton’s direction:

$\frac{dx(t)}{dt} = -[\nabla^2 f(x(t))]^{-1} \nabla f(x(t)).$

Other dynamical systems:

$\frac{dx(t)}{dt} = s(t)\, p(x(t)),$

$a(t)\frac{d^2 x(t)}{dt^2} + b(t)\, B(x(t))\, \frac{dx(t)}{dt} = p(x(t)).$
The dynamical system approach can solve very large problems.
How do we find a “good” $p(x)$?
The dynamical system approach normally consists of the following three steps:
to establish an o.d.e. system;
to study the convergence of the solution $x(t)$ of the o.d.e. as $t \to \infty$; and
to solve the o.d.e. system numerically.
Even though the solutions of o.d.e. systems are continuous, the actual computation has to be done discretely.
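The last step can be illustrated on the steepest-descent o.d.e.: discretizing $dx/dt = -\nabla f(x)$ by forward Euler with step $h$ gives exactly the fixed-step steepest-descent iteration. The integrator and the convex test function below are my own illustrative choices, not from the original.

```python
import numpy as np

def gradient_flow_euler(grad, x0, h=0.01, T=50.0):
    """Integrate dx/dt = -grad f(x) by forward Euler; each Euler step
    is one steepest-descent step with fixed step size h."""
    x = x0.astype(float)
    for _ in range(int(round(T / h))):
        x = x - h * grad(x)
    return x

# convex test problem f(x) = 0.5||x||^2: the minimizer 0 is the
# (asymptotically stable) equilibrium point of the flow
x_inf = gradient_flow_euler(lambda x: x, np.array([3.0, -4.0]))
print(x_inf)
```

The trajectory settles at the equilibrium point of the flow, which by construction is a minimizer of $f$.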
Delayed dynamical systems approach
The steepest descent direction gives slow convergence; Newton’s direction is difficult to compute. The aim is a direction with fast convergence that is also easy to calculate.
The delayed dynamical systems approach solves the delayed o.d.e.

$\frac{dx(t)}{dt} = -H(x(t), x(t-\tau_1(t)), \ldots, x(t-\tau_m(t)))\, \nabla f(x(t)).$   (13)

For $t_{m-1} \le t$, we use

$H(x(t), x(t_{m-1}), \ldots, x(t_1), x(t_0)) := H_m(x(t), x(t_{m-1}), \ldots, x(t_1), x(t_0))$
$:= V_{m-1}^T(t) V_{m-2}^T(t_{m-1}) \cdots V_1^T(t_2) V_0^T(t_1)\, H_0\, V_0(t_1) V_1(t_2) \cdots V_{m-2}(t_{m-1}) V_{m-1}(t)$
$\qquad + V_{m-1}^T(t) \cdots V_1^T(t_2)\, \rho_0(t_1) s_0(t_1) s_0^T(t_1)\, V_1(t_2) \cdots V_{m-2}(t_{m-1}) V_{m-1}(t)$
$\qquad + \cdots$
$\qquad + V_{m-1}^T(t)\, \rho_{m-2}(t_{m-1}) s_{m-2}(t_{m-1}) s_{m-2}^T(t_{m-1})\, V_{m-1}(t)$
$\qquad + \rho_{m-1}(t) s_{m-1}(t) s_{m-1}^T(t),$   (13A)

where

$s_{m-1}(t) = x(t) - x(t_{m-1}), \qquad y_{m-1}(t) = \nabla f(x(t)) - \nabla f(x(t_{m-1})),$
$\rho_{m-1}(t) = \frac{1}{y_{m-1}^T(t)\, s_{m-1}(t)}, \qquad V_{m-1}(t) = I - \rho_{m-1}(t)\, y_{m-1}(t)\, s_{m-1}^T(t).$

This is used to compute $x_m$ at $t_m$.
Beyond this point we save only $m$ previous values of $x$. The definition of $H$ is now, for $m \le k$:
for $t_k \le t$,

$H(x(t), x(t_k), \ldots, x(t_{k-m+2}), x(t_{k-m+1})) := H_{k+1}(x(t), x(t_k), \ldots, x(t_{k-m+2}), x(t_{k-m+1}))$
$:= V_k^T(t) V_{k-1}^T(t_k) \cdots V_{k-m+2}^T(t_{k-m+3}) V_{k-m+1}^T(t_{k-m+2})\, H_0\, V_{k-m+1}(t_{k-m+2}) V_{k-m+2}(t_{k-m+3}) \cdots V_{k-1}(t_k) V_k(t)$
$\qquad + V_k^T(t) \cdots V_{k-m+2}^T(t_{k-m+3})\, \rho_{k-m+1}(t_{k-m+2}) s_{k-m+1}(t_{k-m+2}) s_{k-m+1}^T(t_{k-m+2})\, V_{k-m+2}(t_{k-m+3}) \cdots V_{k-1}(t_k) V_k(t)$
$\qquad + \cdots$
$\qquad + V_k^T(t)\, \rho_{k-1}(t_k) s_{k-1}(t_k) s_{k-1}^T(t_k)\, V_k(t)$
$\qquad + \rho_k(t) s_k(t) s_k^T(t),$   (13B)

where

$s_k(t) = x(t) - x(t_k), \qquad y_k(t) = \nabla f(x(t)) - \nabla f(x(t_k)),$
$\rho_k(t) = \frac{1}{y_k^T(t)\, s_k(t)}, \qquad V_k(t) = I - \rho_k(t)\, y_k(t)\, s_k^T(t).$
Uniqueness property of dynamical systems
Lipschitz continuity:

$\|F(x_1) - F(x_2)\| \le L \|x_1 - x_2\|.$

$\|H(u, w)\nabla f(u) - H(\bar{u}, w)\nabla f(\bar{u})\| \le L_1 \|u - \bar{u}\|,$
$\|H(u, w)\nabla f(u) - H(u, \bar{w})\nabla f(u)\| \le L_2 \|w - \bar{w}\|.$
Lemma 2.6
Let $F: \mathbb{R}^n \to \mathbb{R}^m$ be continuously differentiable in the open convex set $D \subset \mathbb{R}^n$, $x \in D$, and let $J = \frac{\partial F}{\partial x}$ be Lipschitz continuous at $x$ in the neighborhood $D$, using a vector norm and the induced matrix operator norm and the Lipschitz constant $\gamma$. Then, for any $x + p \in D$,

$\|F(x+p) - F(x) - J(x)p\| \le \frac{\gamma}{2} \|p\|^2.$
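The bound in Lemma 2.6 is easy to sanity-check numerically. The quadratic map $F$ below and its constant $\gamma = \sqrt{5}$ (a valid Lipschitz constant for its Jacobian in the 2-norm) are my own illustrative choices, not from the original.

```python
import numpy as np

# F(x) = (x1^2, x1*x2), with Jacobian J(x) = [[2x1, 0], [x2, x1]];
# ||J(x) - J(y)|| <= sqrt(5) ||x - y|| in the 2-norm, so gamma = sqrt(5)
F = lambda x: np.array([x[0]**2, x[0] * x[1]])
J = lambda x: np.array([[2 * x[0], 0.0], [x[1], x[0]]])
gamma = np.sqrt(5.0)

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0])
for _ in range(1000):
    p = rng.standard_normal(2)
    # left-hand side of the lemma: the linearization error of F at x
    lhs = np.linalg.norm(F(x + p) - F(x) - J(x) @ p)
    assert lhs <= gamma / 2 * np.linalg.norm(p)**2 + 1e-12
print("Lemma 2.6 bound holds on all samples")
```

For this particular $F$ the linearization error is exactly $\|(p_1^2, p_1 p_2)\| \le \|p\|^2$, comfortably inside the lemma's bound.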
3. Numerical testing
Test problems
● Extended Rosenbrock function
● Penalty function Ⅰ
● Variable dimensioned function
● Linear function-rank 1
Result of modified Rosenbrock problem

Method            | t       | value  | step
L-BFGS            | 2       | 0      | 497
Steepest descent  | 23.2813 | 0.0006 | 53557
[Figure: comparison of function value for m = 2, 4, 6.]
[Figure: comparison of norm of gradient for m = 2, 4, 6.]
A new code: RADAR5
The code RADAR5 is for stiff problems, including differential-algebraic and neutral delay equations with constant or state-dependent (eventually vanishing) delays:

$M y'(t) = f(t, y(t), y(t - \tau_1(t, y(t))), \ldots, y(t - \tau_m(t, y(t)))),$
$y(t_0) = y_0, \qquad y(t) = g(t) \ \text{for} \ t < t_0.$
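RADAR5 itself is a sophisticated stiff solver; as a toy illustration of what a delay o.d.e. integrator does, here is a fixed-step forward-Euler scheme for a single constant delay with a history function. Everything in this sketch, including the classic test equation $y'(t) = -y(t-1)$, is my own illustration and not RADAR5's algorithm.

```python
import numpy as np

def dde_euler(f, g, t0, T, tau, h=0.001):
    """Integrate y'(t) = f(t, y(t), y(t - tau)) with constant delay tau
    and history y(t) = g(t) for t <= t0, by forward Euler.  The stored
    solution doubles as the delayed-value lookup table."""
    n = int(round((T - t0) / h))
    d = int(round(tau / h))          # the delay measured in steps
    ys = [g(t0)]
    for i in range(n):
        t = t0 + i * h
        # delayed value: from the history g before t0 + tau, then from
        # the already-computed part of the trajectory
        y_del = ys[i - d] if i - d >= 0 else g(t - tau)
        ys.append(ys[i] + h * f(t, ys[i], y_del))
    return np.array(ys)

# classic test equation y'(t) = -y(t-1), y(t) = 1 for t <= 0;
# on [0, 1] the exact solution is y(t) = 1 - t
ys = dde_euler(lambda t, y, yd: -yd, lambda t: 1.0, 0.0, 2.0, 1.0)
print(ys[1000], ys[2000])
```

On the first delay interval the scheme reproduces the exact linear solution; on later intervals it approximates the piecewise-polynomial continuation (the "method of steps").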
4. Main stages of this research
Prove that the function H in (13) is positive definite.
(APPENDIX)
Prove that H is Lipschitz continuous.
Show that the solution to (13) is asymptotically stable.
Show that (13) has a better rate of convergence than the
dynamical system based on the steepest descent
direction.
Perform numerical testing.
Apply this new optimization method to practical
problems.
APPENDIX
To show that H in (13) is positive definite
Property 1. If $H_0$ is positive definite, the matrix $H$ defined by (13) is positive definite (provided $y_i^T s_i > 0$ for all $i$).
I proved this result by induction. Since the continuous analog of the L-BFGS formula has two cases, the proof needs to cater for each of them.
For $k+1 \le m$:
When $m = 1$, $H_{k+1}$ is p.d. (Theorem 1).
Assume that $H_{k+1}^l$ is p.d. when $m = l$:

$H_{k+1}^l = V_k^T V_{k-1}^T \cdots V_{k-l+1}^T\, H_0\, V_{k-l+1} \cdots V_{k-1} V_k$
$\qquad + \{ V_k^T \cdots V_{k-l+2}^T\, \rho_{k-l+1} s_{k-l+1} s_{k-l+1}^T\, V_{k-l+2} \cdots V_k$
$\qquad + \cdots$
$\qquad + V_k^T V_{k-1}^T\, \rho_{k-2} s_{k-2} s_{k-2}^T\, V_{k-1} V_k$
$\qquad + V_k^T\, \rho_{k-1} s_{k-1} s_{k-1}^T\, V_k + \rho_k s_k s_k^T \}.$   (*)

If $m = l + 1$,

$H_{k+1}^{l+1} = V_k^T V_{k-1}^T \cdots V_{k-l+1}^T\, \left( V_{k-l}^T H_0 V_{k-l} + \rho_{k-l} s_{k-l} s_{k-l}^T \right)\, V_{k-l+1} \cdots V_{k-1} V_k + \{ \cdots \}.$
For $k+1 > m$:
In this case no further induction on $m$ is needed, since

$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T.$

By the assumption $H_k$ is p.d., so it is obvious that $H_{k+1}$ is also p.d.