Curved Trajectories towards Local Minimum of a Function


Curved Trajectories towards Local Minimum of a Function
…or How to Get All You Can from the Taylor Series
Al Jimenez
Mathematics Department
California Polytechnic State University
San Luis Obispo, CA 93407
Summer, 2007
Introduction and Notation
• The Problem: minimize $f(x)$ over $x \in \mathbb{R}^n$
• Derivatives of $f : \mathbb{R}^n \to \mathbb{R}$: $\nabla f(x)$, $\nabla^2 f(x)$, $\nabla^3 f(x)$, $f^{(4)}(x)$, etc.
• A local minimizer $x^*$ is a critical point: $\nabla f(x^*) = 0$
• Necessary condition: $\nabla^2 f(x^*) \geq 0$ (positive semidefinite)
Typical Iterative Methods
• Sequence x1, x2 ,..., xk , xk 1 is generated from x0
• Such that f ( xk 1 )  f ( xk  pk vk )  f ( xk )
• With vk a vector with property f ( xk )vk  0
a descent direction
• And pk > 0 typically approximates solution of
Minimize f ( xk  pvk )
p
called the line search or the scalar search
• Proven to converge for smooth functions
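Below is a minimal sketch (not from the talk) of this generic framework in Python: the direction is plain steepest descent and the scalar search is a simple backtracking loop with a sufficient-decrease test. Both of those choices, and the quadratic test function, are illustrative assumptions.

```python
import numpy as np

def backtracking_descent(f, grad, x0, iters=50, c=1e-4):
    """Generic descent iteration: pick a descent direction v_k with
    grad(x_k)^T v_k < 0, then choose p_k > 0 by a crude scalar search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        v = -g                        # steepest-descent direction (1st order)
        p = 1.0
        # backtracking scalar search: shrink p until f decreases sufficiently
        while f(x + p * v) > f(x) + c * p * g @ v:
            p *= 0.5
            if p < 1e-12:
                return x              # no acceptable step found
        x = x + p * v
    return x

# example: a simple convex quadratic with minimizer (1, -2)
f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])
print(backtracking_descent(f, grad, [5.0, 5.0]))   # approaches [1, -2]
```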
Current Methods
• Selecting $v_k$ has a huge effect on the convergence rate:
– Steepest descent (1st order): $v_k = -\nabla f(x_k)$
– Newton's direction (2nd order): $v_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$, but it may not be a descent direction when far from a minimum (see the check sketched below)
– Conjugate direction methods use $v_{k-1}, v_{k-2}, \ldots$
– Quasi-Newton / variable metric methods also use $v_{k-1}, v_{k-2}, \ldots$
– High-order tensor models fit prior iteration values
– The number of derivatives available affects the choice of method
• The scalar search
– Accuracy of the scalar minimization
– Quadratic models: "Trust Region"
The Basic Idea
• We seek a solution to $\nabla f(x) = 0$, calling it $x = x^*$
• By a change of variable $x = h(z)$ (we never really find $h(z)$ directly)
• That results in the new composition function $\nabla f(h(z)) = g(z)$
• Such that $g(z) = 0$ is easy to solve for a $z = z^*$
Easy-to-Solve Functions g(z) = 0
• Used for this talk: $g(z) = z^p$, with $z^* = 0$
• Other possibilities: $g(z) = e^z - 1$, with $z^* = 0$
• PhD work showed the potential of selecting an appropriate one, based on limited explorations of the function being minimized
Infinite Series of the Solution
$$x^* = h(z_k) - h'(z_k)\, z_k + \tfrac{1}{2}\, h''(z_k)\, z_k^2 - \tfrac{1}{6}\, h'''(z_k)\, z_k^3 + \cdots$$
$$h'(z_k) = p\, [\nabla^2 f(x_k)]^{-1}\, z_k^{p-1}$$
$$h''(z_k) = [\nabla^2 f(x_k)]^{-1}\left[\, p(p-1)\, z_k^{p-2} - \nabla^3 f(x_k)\, h'(z_k)\, h'(z_k) \,\right]$$
$$h'''(z_k) = [\nabla^2 f(x_k)]^{-1}\left[\, p(p-1)(p-2)\, z_k^{p-3} - 3\,\nabla^3 f(x_k)\, h'(z_k)\, h''(z_k) - f^{(4)}(x_k)\, h'(z_k)\, h'(z_k)\, h'(z_k) \,\right]$$
• These are matrix-vector products, but they are written with exponents to highlight the connection with the scalar Taylor series.
Infinite Series of the Solution…
• Define:
$$\nabla^2 f(x_k)\, d_2 = -\nabla f(x_k)$$
$$\nabla^2 f(x_k)\, d_3 = -\tfrac{1}{2}\, \nabla^3 f(x_k)\, d_2 d_2$$
$$\nabla^2 f(x_k)\, d_4 = -\nabla^3 f(x_k)\, d_2 d_3 - \tfrac{1}{6}\, f^{(4)}(x_k)\, d_2 d_2 d_2$$
• Then:
$$x^* = x_k + \left\{ p\, d_2 \right\} + \left\{ -\tfrac{1}{2}\, p(p-1)\, d_2 + p^2 d_3 \right\} + \left\{ \tfrac{1}{6}\, p(p-1)(p-2)\, d_2 - p^2(p-1)\, d_3 + p^3 d_4 \right\} + \cdots$$
• For p = 1 (a scalar sketch follows below):
$$x^* = x_k + d_2 + d_3 + d_4 + \cdots$$
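A one-dimensional sketch of the p = 1 series above: for a scalar function the defining linear systems become divisions, and each added correction term visibly improves on the plain Newton step. The test function and starting point below are illustrative choices, not from the talk.

```python
import math

# minimize f(x) = exp(x) - 2x; the unique minimizer is x* = ln 2
f1 = lambda x: math.exp(x) - 2.0        # f'
f2 = lambda x: math.exp(x)              # f''
f3 = lambda x: math.exp(x)              # f'''
f4 = lambda x: math.exp(x)              # f''''

def corrections(x):
    """Scalar versions of the defining equations for d2, d3, d4."""
    d2 = -f1(x) / f2(x)
    d3 = -0.5 * f3(x) * d2 * d2 / f2(x)
    d4 = -(f3(x) * d2 * d3 + f4(x) * d2 ** 3 / 6.0) / f2(x)
    return d2, d3, d4

xstar = math.log(2.0)
x = 1.0
d2, d3, d4 = corrections(x)
print(abs(x + d2 - xstar))             # Newton step: error ~ 4e-2
print(abs(x + d2 + d3 - xstar))        # adding d3: error ~ 8e-3
print(abs(x + d2 + d3 + d4 - xstar))   # adding d3 + d4: error ~ 2e-3
```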
Curved Trajectories Algorithm
• At the kth iteration, estimate a small expansion parameter $\tau$, then calculate:
$$\nabla^2 f(x_k)\, d_2 = -\nabla f(x_k)$$
$$\nabla^2 f(x_k)\, d_3 \approx -\frac{1}{\tau^2}\left[\, \nabla f(x_k + \tau d_2) + (\tau - 1)\,\nabla f(x_k) \,\right]$$
$$\nabla^2 f(x_k)\, d_4 \approx -\frac{1}{\tau^3}\left[\, \nabla f(x_k + \tau d_2 + \tau^2 d_3) + (\tau - 1)\,\nabla f(x_k) \,\right]$$
• Select the order, modify the $d_i$, and select $p_k$ (a sketch of one such iteration follows below):
2nd order: $x_{k+1} = x_k + d_2\, p$
3rd order: $x_{k+1} = x_k + \tfrac{3}{2}\, d_2\, p - \tfrac{1}{2}(d_2 - 2 d_3)\, p^2$
4th order: $x_{k+1} = x_k + \tfrac{11}{6}\, d_2\, p - (d_2 - 2 d_3)\, p^2 + \tfrac{1}{6}(d_2 - 6 d_3 + 6 d_4)\, p^3$
Results on Rosenbrock's Banana-Shaped Function
$$f(x, y) = 100\,(y - x^2)^2 + (1 - x)^2$$
$$x_0 = [x\;\; y]^T = [-1.2\;\; 1.0]^T, \qquad x^* = [1\;\; 1]^T, \qquad f(x_0) = 24.2000$$
$$\nabla f(x_0) = \begin{bmatrix} -215.600 \\ -88.00 \end{bmatrix}, \qquad \nabla^2 f(x_0) = \begin{bmatrix} 1330.00 & 480.0 \\ 480.0 & 200 \end{bmatrix}$$
(these starting values are re-derived symbolically below)
$$d_2 = \begin{bmatrix} -0.02472 \\ -0.3807 \end{bmatrix}, \qquad d_3 = \begin{bmatrix} -0.02444 \\ 0.05805 \end{bmatrix}, \qquad d_4 = \begin{bmatrix} -0.02420 \\ 0.05687 \end{bmatrix}$$
4th-order trajectory:
$$x_{k+1}(p) = \begin{bmatrix} -1.2 + p\,(0.04532 + p\,(0.02416 + 0.003879\, p)) \\ 1.0 + p\,(0.6979 - p\,(0.4968 - 0.06462\, p)) \end{bmatrix}$$
• The algorithm selects p = 5, giving $x_1 = [0.1156\;\; 0.1479]^T$ with f = 2.59
[Contour plot: the curved trajectory from x0 to x1 in the banana-shaped valley, with later iterates x2 and x3 and level curves f = 24.2, f = 4, and f = 0.5.]
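As a quick check, the starting values quoted above can be re-derived symbolically. The use of SymPy here is an illustrative choice, not something the talk prescribes.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = 100*(y - x**2)**2 + (1 - x)**2
g = sp.Matrix([sp.diff(f, v) for v in (x, y)])   # gradient
H = sp.hessian(f, (x, y))                        # Hessian

pt = {x: sp.Rational(-12, 10), y: 1}             # x0 = (-1.2, 1)
print(f.subs(pt))    # 121/5 = 24.2
print(g.subs(pt))    # Matrix([[-1078/5], [-88]]) = [-215.6, -88]
print(H.subs(pt))    # Matrix([[1330, 480], [480, 200]])
```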
3D View
[Surface plot: "Rosenbrock's banana-shaped valley"]
Program output for the Rosenbrock run (max iterations: 25):

iteration 0: Nfuns, Ngrads, Nhess = 1, 1, 0; norm x-x* = 2.20; x = [-1.2, 1.0]; f = 24.20; Gradient = [-216., -88.0]; gnorm = 233
iteration 1: Nfuns, Ngrads, Nhess = 7, 5, 1; order 4; p = 5.; h2D = 0; k2 = 0.000312; d3normerr = 0.152e-8; norm x-x* = 1.23; x = [0.11557530, 0.14785703]; f = 2.591; Gradient = [-7.99, 26.9]; gnorm = 28.1
iteration 2: Nfuns, Ngrads, Nhess = 13, 8, 2; order 3; p = 6.; h2D = 1; k2 = 0.000364; d3normerr = 0.871e-8; norm x-x* = 0.196; x = [1.08503530, 1.17623890]; f = 0.007344; Gradient = [0.631, -0.213]; gnorm = 0.666
iteration 3: Nfuns, Ngrads, Nhess = 17, 11, 3; order 4; p = 1.; h2D = 0; k2 = 0.000306; d3normerr = 0.0000115; norm x-x* = 0.000762; x = [1.00052460, 1.00055200]; f = 0.00002502; Gradient = [0.200, -0.0995]; gnorm = 0.223
iteration 4: Nfuns, Ngrads, Nhess = 20, 14, 4; order 4; p = 1.; h2D = 0; k2 = 0.00125; d3normerr = 0.239e-10; norm x-x* = 0.864e-6; x = [1.00000040, 1.00000080]; f = 0.1545e-12; Gradient = [0.306e-5, -0.114e-5]; gnorm = 0.327e-5
iteration 5: Nfuns, Ngrads, Nhess = 22, 16, 5; order 3; p = 1.; h2D = 0; k2 = 0.9; d3normerr = 0.637e-17; norm x-x* = 0.; x = [1., 1.]; f = 0.1351e-33; Gradient = [0.160e-15, -0.691e-16]; gnorm = 0.175e-15
Rosenbrock's Function
$$f(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2, \qquad x_0 = [-1.2\;\; 1]^T, \qquad x^* = [1\;\; 1]^T$$

k | #f | #G | #H | Order | p | ‖x_k - x*‖ | f(x_k) | ‖∇f(x_k)‖
0 | 1 | 1 | 0 | - | - | 2.2 | 24.2 | 233
1 | 7 | 5 | 1 | 4 | 5 | 1.23 | 2.59 | 28.1
2 | 13 | 8 | 2 | 3 | 6 | 0.196 | 0.007344 | 0.666
3 | 17 | 11 | 3 | 4 | 1 | 0.00076 | 2.5e-5 | 0.223
4 | 20 | 14 | 4 | 4 | 1 | 8.6e-7 | 1.5e-13 | 3.3e-6
5 | 22 | 16 | 5 | 3 | 1 | ~1e-16 | 1.4e-34 | 1.8e-16

Table 2. Summary of Curved Trajectories Algorithm performance on the banana-shaped valley function. #f is the number of function evaluations, #G the number of gradient evaluations, and #H the number of Hessian evaluations.
Fletcher and Powell's Helical Valley Function
$$f(x) = 100\left[(x_3 - 10\theta)^2 + (r - 1)^2\right] + x_3^2$$
$$r = \sqrt{x_1^2 + x_2^2}, \qquad \theta = \begin{cases} \dfrac{1}{2\pi}\tan^{-1}(x_2/x_1), & x_1 > 0 \\[4pt] 0.5 + \dfrac{1}{2\pi}\tan^{-1}(x_2/x_1), & x_1 < 0 \end{cases}$$
$$x_0 = [-1\;\; 0\;\; 0]^T, \qquad x^* = [1\;\; 0\;\; 0]^T$$
(a direct transcription of this definition in code follows below)
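A direct transcription of the definition above in Python; the branch on $x_1$ implements $\theta$, and the function is undefined on the line $x_1 = 0$ mentioned in the caption of the table that follows.

```python
import numpy as np

def helical_valley(x):
    """Fletcher and Powell's helical valley function (undefined when x1 = 0)."""
    x1, x2, x3 = x
    r = np.sqrt(x1**2 + x2**2)
    theta = np.arctan(x2 / x1) / (2.0 * np.pi)
    if x1 < 0:
        theta += 0.5
    return 100.0 * ((x3 - 10.0 * theta)**2 + (r - 1.0)**2) + x3**2

print(helical_valley([-1.0, 0.0, 0.0]))   # 2500.0, as in the k = 0 row below
print(helical_valley([1.0, 0.0, 0.0]))    # 0.0 at the minimizer x*
```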
k | #f | #G | #H | Order | p | ‖x_k - x*‖ | f(x_k) | ‖∇f(x_k)‖
0 | 1 | 1 | 0 | - | - | 2 | 2500 | 1880
1 | 3 | 3 | 1 | 3 | 1 | 5.34 | 24.85 | 16.6
2 | 15 | 6 | 2 | 2 | 0.009 | 4.22 | 21.77 | 73
3 | 19 | 9 | 3 | 3 | 1 | 2.75 | 10.04 | 43.8
4 | 23 | 12 | 4 | 4 | 1 | 1.85 | 3.01 | 17
5 | 27 | 15 | 5 | 4 | 1 | 0.643 | 2.748 | 35.1
6 | 34 | 19 | 6 | 4 | 1 | 0.133 | 0.2624 | 15.1
7 | 38 | 22 | 7 | 4 | 1 | 0.00165 | 0.000028 | 0.143
8 | 42 | 25 | 8 | 4 | 1 | 1.2e-8 | 1.3e-16 | 1.3e-7
9 | 44 | 27 | 9 | 3 | 1 | 1.5e-23 | 2.4e-46 | 2e-22

Table 3. Curved Trajectories Algorithm performance on the helical valley function, which is not defined at any point [0 0 s]^T, s ∈ R. This function is quite a challenge given the continuity requirements.
Wood's Saddle-Point Function
$$f(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2 + 90\,(x_4 - x_3^2)^2 + (1 - x_3)^2 + 10.1\left[(x_2 - 1)^2 + (x_4 - 1)^2\right] + 19.8\,(x_2 - 1)(x_4 - 1)$$
$$x_0 = [-3\;\; -1\;\; -3\;\; -1]^T, \qquad x^* = [1\;\; 1\;\; 1\;\; 1]^T$$
k | #f | #G | #H | Order | p | ‖x_k - x*‖ | f(x_k) | ‖∇f(x_k)‖
0 | 1 | 1 | 0 | - | - | 6.32 | 19192 | 16400
1 | 7 | 5 | 1 | 4 | 4 | 0.386 | 52.25 | 342
2 | 11 | 8 | 2 | 3 | 2 | 0.286 | 38.26 | 330
3 | 18 | 12 | 3 | 4 | 2 | 0.784 | 0.6307 | 9.57
4 | 24 | 16 | 4 | 3 | 0.68 | 0.0566 | 0.4228 | 27.4
5 | 30 | 20 | 5 | 4 | 2 | 0.00164 | 4e-6 | 0.0336
6 | 34 | 23 | 6 | 4 | 1 | 7e-10 | 3.5e-19 | 1.8e-8
7 | 35 | 24 | 7 | 2 | 1 | ~1e-16 | 2.3e-36 | 3.3e-17

Table 4. Curved Trajectories Algorithm performance on Wood's function, whose saddle point traps many algorithms.
Powell's Function with a Singular Hessian at the Solution
$$f(x) = (x_1 + 10 x_2)^2 + 5\,(x_3 - x_4)^2 + (x_2 - 2 x_3)^4 + 10\,(x_1 - x_4)^4$$
$$x_0 = [3\;\; -1\;\; 0\;\; 1]^T, \qquad x^* = [0\;\; 0\;\; 0\;\; 0]^T$$

k | #f | #G | #H | Order | p | ‖x_k - x*‖ | f(x_k) | ‖∇f(x_k)‖
0 | 1 | 1 | 0 | - | - | 3.32 | 215 | 459
1 | 7 | 5 | 1 | 4 | 3 | 0.82 | 1.99 | 16.8
2 | 19 | 10 | 2 | 4 | 2.06 | 0.0421 | 0.000014 | 0.00229
3 | 26 | 14 | 3 | 4 | 3 | 0.000327 | 5.1e-14 | 1e-9
4 | 29 | 17 | 4 | 4 | 3 | 4.15e-5 | 1.3e-17 | 2.2e-12
5 | 32 | 20 | 5 | 4 | 3 | 5.26e-6 | 3.4e-21 | 4.5e-15

Table 5. Curved Trajectories Algorithm performance on Powell's function with a singular Hessian at the solution, which means the solution has multiplicity greater than one. The p = 3 selection suggests a multiplicity of 3. Hessian at x* (its rank deficiency is checked numerically below):
$$\nabla^2 f(x^*) = \begin{bmatrix} 2 & 20 & 0 & 0 \\ 20 & 200 & 0 & 0 \\ 0 & 0 & 10 & -10 \\ 0 & 0 & -10 & 10 \end{bmatrix}$$
Cragg and Levy's Function (exponential, tangent, large exponents, and singular Hessians)
$$f(x) = (e^{x_1} - x_2)^4 + 100\,(x_2 - x_3)^6 + \tan^4(x_3 - x_4) + x_1^8 + (x_4 - 1)^2$$
$$x_0 = [1\;\; 2\;\; 2\;\; 2]^T, \qquad x^* = [0\;\; 1\;\; 1\;\; 1]^T$$

k | #f | #G | #H | Order | p | ‖x_k - x*‖ | f(x_k) | ‖∇f(x_k)‖
0 | 1 | 1 | 0 | - | - | 2 | 2.266 | 12.3
1 | 6 | 4 | 1 | 2 | 0.735 | 1.64 | 1.254 | 9.07
2 | 13 | 7 | 2 | 4 | 1 | 1.14 | 0.3391 | 4.12
3 | 19 | 11 | 3 | 4 | 2 | 0.396 | 0.002114 | 0.056
4 | 27 | 16 | 4 | 3 | 2.15 | 0.245 | 5.7e-5 | 0.00344
5 | 33 | 20 | 5 | 4 | 3 | 0.0176 | 3.4e-9 | 1.1e-5
6 | 36 | 23 | 6 | 4 | 3 | 0.00549 | 3.9e-13 | 2.09e-9
7 | 39 | 26 | 7 | 4 | 3 | 0.000755 | 7.9e-17 | 4.5e-12

Table 6. Curved Trajectories Algorithm performance on Cragg and Levy's steep-walls function with singular Hessians:
$$\nabla^2 f(x_0) = \begin{bmatrix} 103.2 & -16.81 & 0 & 0 \\ -16.81 & 6.186 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 2 \end{bmatrix}, \qquad \nabla^2 f(x^*) = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 2 \end{bmatrix}$$
Initial Comparisons

Function | Counter | FR | DFP | B | F | CTAn | CTA
Rosenbrock banana valley | Iterations | 27 | 19 | 35 | 39 | 4 | 4
Rosenbrock banana valley | #f | 155 | 96 | 51 | 47 | 47 | 26
Rosenbrock banana valley | #G | 28 | 20 | 36 | 47 | 27 | 14
Fletcher helical valley | Iterations | 36 | 20 | 21 | 35 | 9 | 8
Fletcher helical valley | #f | 202 | 141 | 140 | 42 | 108 | 41
Fletcher helical valley | #G | 37 | 21 | 22 | 42 | 69 | 28
Wood saddle | Iterations | 189 | 57 | 42 | 60 | 7 | 6
Wood saddle | #f | 3288 | 475 | 310 | 61 | 64 | 38
Wood saddle | #G | 190 | 58 | 43 | 61 | 92 | 23
Powell singular Hessian | Iterations | 104 | 36 | 38 | 60 | 4 | 3
Powell singular Hessian | #f | 624 | 434 | 374 | 68 | 38 | 25
Powell singular Hessian | #G | 105 | 37 | 39 | 68 | 71 | 13
Cragg-Levy singular Hessians | Iterations | 39 | 96 | 84 | 82 | 6 | 5
Cragg-Levy singular Hessians | #f | 221 | 424 | 350 | 91 | 207 | 34
Cragg-Levy singular Hessians | #G | 40 | 97 | 85 | 91 | 62 | 25
CUTEr Performance Profiles
[Chart: CPU-time profile over 127 problems with fewer than 500 variables; cumulative distribution (30% to 100%) versus normalized CPU-time per problem (1 to 12); algorithms: CTA, CTAn, CTAnn, CG Descent, Lancelot, Tenmin, L-BFGS, L-BFGS-B.]
CUTEr Performance Profiles
[Chart: CPU-time profile over 51 problems with at least 500 variables; cumulative distribution (0% to 100%) versus normalized CPU-time per problem (1 to 13); algorithms: CTA, CTAn, CG Descent, Lancelot, L-BFGS-B.]
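The slides do not spell out the normalization behind these profiles. A common construction (a performance profile in the style of Dolan and Moré) divides each solver's CPU time on a problem by the best time any solver achieved on that problem and reports, for each solver, the fraction of problems solved within a given ratio. The sketch below uses made-up placeholder data, not the CUTEr results.

```python
import numpy as np

def performance_profile(times, ratios):
    """times: (n_problems, n_solvers) CPU times, with np.inf marking failures.
    Returns, per solver, the fraction of problems whose ratio
    (time / best time on that problem) is at most each value in `ratios`."""
    best = times.min(axis=1, keepdims=True)   # best solver per problem
    r = times / best                          # performance ratios
    return np.array([[np.mean(r[:, s] <= t) for t in ratios]
                     for s in range(times.shape[1])])

# toy data: 4 problems x 3 solvers (seconds); np.inf marks a failure
times = np.array([[1.0, 2.0, 4.0],
                  [3.0, 3.0, np.inf],
                  [0.5, 1.5, 0.6],
                  [2.0, 8.0, 2.2]])
print(performance_profile(times, ratios=[1, 2, 4, 8]))
```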
Partial List of Research Pursuits
• Handle several functions to minimize simultaneously: Pareto optimal points.
• Combine with the Trust-Region method or other strategies.
• Explore the family of infinite series for combinations of composition functions.
• Handle constraint functions.
[Additional figure slides: changes when the Hessian is not positive definite ("Hessian < 0"), rotations, and rotations in 3D.]
Conclusions, what's new:
• Families of infinite series for the solution of a nonlinear vector function equation
• High-order terms accurately approximated from the gradient and the Hessian
• Scalar searches that may follow polynomial curved trajectories
• Testing shows considerable promise for problems even as large as 10,000 variables