Minimization or Maximization of Functions


Optimisation Methods
Minimization or Maximization of Functions
(Readings: 10.0–10.7 of NRC)
1
Introduction
You are given a single function f that depends
on one or more independent variables. You
want to find the value of those variables where
f takes on a maximum or a minimum value.
An extremum (maximum or minimum point)
can be either global (truly the highest or lowest
function value) or local (the highest or lowest
in a finite neighborhood and not on the
boundary of that neighborhood).
The unconstrained multi-variable problem is
written as
min_{x ∈ ℝ^N} f(x)
where x is the vector of decision variables.
2
Introduction
[Figure: Extrema of a function in an interval. Points A, C, and E are local, but not global,
maxima. Points B and F are local, but not global, minima. The global maximum occurs
at G, which is on the boundary of the interval, so the derivative of the function
need not vanish there. The global minimum is at D. At point E, derivatives higher than
the first vanish, a situation which can cause difficulty for some algorithms. The points
X, Y, and Z are said to “bracket” the minimum F, since Y is less than both X and Z.]
3
Contour Plots
A contour plot consists of contour lines, each
of which marks a constant value of the
function f(x1, x2).
4
Solution Methods
The solution methods are classified into 3 broad categories:
1. Direct (zero order) search methods:
a. Bisection Search
b. Golden Section Search
c. Parabolic Interpolation and Brent’s Method
d. Simplex Method
e. Powell’s Method
2. Gradient based (first order) methods:
a. Steepest descent
b. Conjugate gradient
3. Second order methods:
a. Newton
b. Modified Newton
c. Quasi-Newton
5
Direct (zero order) search methods
They require only function values.
They are computationally uncomplicated.
They converge slowly.
6
How Small is Tolerably Small
It is tempting to demand a bracketing interval so tight that
(1 − ε)b < b < (1 + ε)b, where ε is the machine
precision (about 3×10⁻⁸ for single and 10⁻¹⁵ for
double precision).
But near a minimum b, Taylor's theorem gives
f(x) ≈ f(b) + ½ f''(b)(x − b)²
The second term is negligible compared to the
first when
|x − b| < √ε · |b| · √(2|f(b)| / (b² f''(b)))
so the attainable fractional width is only of order √ε,
which is about 3×10⁻⁴ for single and 10⁻⁸ for double
precision.
7
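These figures can be checked directly; a minimal sketch using NumPy:

import numpy as np

# Machine precision and its square root, the practical minimisation tolerance.
for dtype in (np.float32, np.float64):
    eps = np.finfo(dtype).eps
    print(dtype.__name__, eps, np.sqrt(eps))
# float32: eps ≈ 1.2e-7,  sqrt(eps) ≈ 3.5e-4  (the ~3e-4 quoted for single)
# float64: eps ≈ 2.2e-16, sqrt(eps) ≈ 1.5e-8  (the ~1e-8 quoted for double)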
Bisection Method for Finding Roots of a Function
The bisection method finds roots of a function in
one dimension. The root is assumed to have
been bracketed in an interval (a,b).
Evaluate the function at an intermediate point
x and obtain a new, smaller bracketing
interval, either (a,x) or (x,b).
The process continues until the bracketing
interval is acceptably small.
It is optimal to choose x to be the midpoint of
(a,b) so that the decrease in the interval length
is maximized when the function is as
uncooperative as it can be, i.e., when the luck
of the draw forces you to take the bigger
bisected segment.
8
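A minimal sketch of the method in Python (an illustrative helper, not NRC's routine):

def bisect(f, a, b, tol=1e-10):
    # assumes f(a) and f(b) have opposite signs, so (a, b) brackets a root
    fa = f(a)
    while b - a > tol:
        mid = 0.5 * (a + b)
        if fa * f(mid) <= 0:      # sign change in left half: root in (a, mid)
            b = mid
        else:                     # otherwise the root is in (mid, b)
            a, fa = mid, f(mid)
    return 0.5 * (a + b)

print(bisect(lambda x: x**2 - 2.0, 0.0, 2.0))   # ~1.41421356, i.e. sqrt(2)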
Golden Section Search – 1D
Successive bracketing of a minimum. The minimum is originally
bracketed by points 1,3,2. The function is evaluated at 4, which replaces
2; then at 5, which replaces 1; then at 6, which replaces 4. The rule at
each stage is to keep a center point that is lower than the two outside
points. After the steps shown, the minimum is bracketed by points
5,3,6.
9
Golden Section Search – Discussion 1
[Diagram: a bracketing triplet x1 < x2 < x3 with f(x2) below f(x1) and f(x3);
a = x2 − x1, b = x3 − x2, and a new probe point x4 with c = x4 − x2.]
The new search interval will be either
between x1 and x4 with a length of a + c, or
between x2 and x3 with a length of b.
To ensure that b = a + c, the algorithm
should choose x4 = x1 − x2 + x3.
The question is where x2 should be placed in
relation to x1 and x3.
The golden section search chooses the
spacing between these points in such a way
that these points have the same proportion
of spacing as the subsequent
triple x1,x2,x4 or x2,x4,x3.
By maintaining the same proportion of
spacing throughout the algorithm, we avoid
a situation in which x2 is very close
to x1 or x3, and guarantee that the interval
width shrinks by the same constant
proportion in each step.
10
Golden Section Search – Discussion 1
• Mathematically, to ensure that the
spacing after evaluating f(x4) is
proportional to the spacing prior to
that evaluation: if f(x4) = f4a and
our new triplet of points is x1, x2,
and x4, then we want c/a = a/b.
• However, if f(x4) = f4b and our new
triplet of points is x2, x4,
and x3, then we want c/(b − c) = a/b.
• Eliminating c from these two
simultaneous equations yields
(b/a)² = (b/a) + 1, and solving gives
b/a = φ, the golden ratio, where:
φ = (1 + √5)/2 ≈ 1.618
11
Golden Section Search – Discussion 2
Given a bracketing triplet (a,b,c), suppose b is a
fraction w of the way between a and c, and the next
trial point x is an additional fraction z beyond b.
The next bracketing segment will either be of
length w + z or of length 1 − w. To minimise
the worst-case possibility these should be equal,
giving
z = 1 − 2w
Scale similarity implies that x should be the
same fraction of the way from b to c as b was from a to c,
giving
z/(1 − w) = w
Solving these gives w² − 3w + 1 = 0, so
w = (3 − √5)/2 ≈ 0.38197, the golden
mean / section.
12
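A sketch of the search in code, using the fraction w derived above (illustrative, not NRC's routine):

import math

def golden_section(f, a, b, tol=1e-8):
    w = (3.0 - math.sqrt(5.0)) / 2.0      # ~0.38197, the golden section
    x1 = a + w * (b - a)
    x2 = b - w * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                       # minimum bracketed by (a, x2)
            b, x2, f2 = x2, x1, f1
            x1 = a + w * (b - a)          # one new function evaluation per step
            f1 = f(x1)
        else:                             # minimum bracketed by (x1, b)
            a, x1, f1 = x1, x2, f2
            x2 = b - w * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)

print(golden_section(lambda x: (x - 2.0)**2, 0.0, 5.0))   # ~2.0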
Golden Section Search – Discussion 3
[Slides 13–21: image-only worked example of successive golden-section steps.]
Parabolic Interpolation
• The golden section search is designed to
handle the worst possible case of function
minimisation, where the function behaves
erratically.
• However most functions, if they are sufficiently
smooth, are nearly parabolic near a minimum.
• Given three points near a minimum, successively
fitting a parabola to these three points should
yield points ever closer to the minimum.
22
Parabolic Interpolation and Brent’s Method
The abscissa x of the minimum of a
parabola through the three points (a, f(a)), (b, f(b))
and (c, f(c)) is:
x = b − ½ · [(b − a)²(f(b) − f(c)) − (b − c)²(f(b) − f(a))] / [(b − a)(f(b) − f(c)) − (b − c)(f(b) − f(a))]
23
Parabolic Interpolation and Brent’s Method
The exacting task is to invent a scheme that
relies on a sure-but-slow technique, like golden
section search, when the function is not
cooperative, but that switches over to parabolic
interpolation when the function allows.
The task is nontrivial for several reasons,
including these:
 The housekeeping needed to avoid unnecessary
function evaluations in switching between the two
methods can be complicated.
 Careful attention must be given to the “endgame,”
where the function is being evaluated very near to
the round-off limit.
 The scheme for detecting a cooperative versus non-cooperative function must be very robust.
24
Brent’s Method
Keeps track of six points:
a and b bracket the minimum.
The least function value found is at x.
The second-least function value is at w.
v is the previous value of w.
u is the point at which the function was most recently evaluated.
Parabolic interpolation is attempted through x, v
and w.
To be acceptable, the parabolic step must fall between a
and b, and must imply a movement from x that is less than
half the movement of the step before last.
When the parabolic step fails these tests, Brent's method
falls back on golden-section steps.
25
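In practice this switching logic is taken from a library; a usage sketch with SciPy's packaged Brent method (assuming SciPy is available):

from scipy.optimize import minimize_scalar

# bracket is a triplet (a, b, c) with f(b) below both f(a) and f(c).
res = minimize_scalar(lambda x: (x - 2.0)**2 + 1.0,
                      bracket=(0.0, 1.0, 5.0),
                      method='brent')
print(res.x, res.fun)   # ~2.0, ~1.0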
Brent’s Method with First Derivatives
First derivatives can be used within Brent’s
Method as follows: The sign of the derivative at
the central point of the bracketing triplet
(a,b,c) indicates uniquely whether the next test
point should be taken in the interval (a,b) or in
the interval (b,c). The value of this derivative
and of the derivative at the second-best-so-far
point are extrapolated to zero by the secant
method (inverse linear interpolation).
We impose the same sort of restrictions on this
new trial point as in Brent’s method. If the trial
point must be rejected, we bisect the interval
under scrutiny.
26
Downhill Simplex Method in Multi-Dimensions
The bracketing methods above work only in one
dimension. The downhill simplex method, due to
Nelder and Mead, handles multidimensional
problems. The method requires only function
evaluations, not derivatives.
A simplex is the geometrical figure consisting,
in N dimensions, of N +1 points (or vertices)
and all their interconnecting line segments,
polygonal faces, etc.
In two dimensions, a simplex is a triangle. In
three dimensions it is a tetrahedron, not
necessarily the regular tetrahedron.
27
Downhill Simplex Method in Multi-Dimensions
After initialisation, the downhill
simplex method takes a series of
steps, most steps just moving the
point of the simplex where the
function is largest through the
opposite face of the simplex to a
lower point. These steps are called
reflections, and they are
constructed to conserve the
volume of the simplex (hence
maintain its non-degeneracy).
When it can do so, the method
expands the simplex in one or
another direction to take larger
steps.
When it reaches a “valley floor,”
the method contracts itself in the
transverse direction and tries to
ooze down the valley.
If the simplex is trying to “pass
through the eye of a needle,” it
contracts itself in all directions,
pulling itself in around its lowest
(best) point.
28
Downhill Simplex Method in Multi-Dimensions
• Let x_i be the location of the i-th vertex, ordered
so that f(x_1) > f(x_2) > … > f(x_{D+1}).
• The centre of the face of the simplex defined by all
vertices other than the one we are trying to
improve is
x_mean = (1/D) Σ_{i=2}^{D+1} x_i
• Since all of the other vertices have better function
values, they give a good direction to move in;
reflection:
x_1 → x_1^new = x_mean + (x_mean − x_1) = 2·x_mean − x_1
29
Downhill Simplex Method in Multi-Dimensions
• If the new position is better, it is worth checking
whether it is even better to double the size of
the step; expansion:
x_1 → x_1^new = x_mean + 2(x_mean − x_1) = 3·x_mean − 2·x_1
• If the new position is worse, it means we
overshot. Then reflect and shrink:
x_1 → x_1^new = x_mean + ½(x_mean − x_1) = (3/2)·x_mean − (1/2)·x_1
30
Downhill Simplex Method in Multi-Dimensions
• If after reflecting and shrinking the new position is still
worse, we can try just shrinking:
x_1 → x_1^new = x_mean − ½(x_mean − x_1) = (1/2)·x_mean + (1/2)·x_1
• If after shrinking the new position is still worse, give up
and shrink all of the vertices towards the best one:
x_i → x_i^new = x_i − ½(x_i − x_{D+1}) = (1/2)(x_i + x_{D+1})
• When the simplex reaches a minimum it will shrink
down around it, triggering a stopping decision when the
values are no longer improving.
31
Downhill Simplex Method in Multi-Dimensions
Solve min f(x) = 2x_1³ + 4x_1x_2² − 10x_1x_2 + x_2²
by applying 5 iterations of the simplex method, starting
with x_0 = [5, 2]^T.
[Plot: starting point (5, 2), f = 234.]
32
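For comparison, a sketch of the same problem run through SciPy's Nelder-Mead implementation (assuming SciPy is available; its iterates will differ from the hand-constructed simplex below):

import numpy as np
from scipy.optimize import minimize

# The example objective from the slide.
f = lambda x: 2*x[0]**3 + 4*x[0]*x[1]**2 - 10*x[0]*x[1] + x[1]**2

res = minimize(f, x0=np.array([5.0, 2.0]), method='Nelder-Mead')
print(res.x, res.fun)
# Should settle near the local minimum (1, 1), where f = -3.
# Note f is unbounded below as x1 -> -inf, so this is only a local minimum.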
Downhill Simplex Method in Multi-Dimensions
Iteration 1
[Plot: simplex vertices (5, 2), f = 234; (5.51, 4.63), f = 576.31; (6.8, 4.12), f = 851.91.]
33
Downhill Simplex Method in Multi-Dimensions
Iteration 2
[Plot: simplex vertices (5.51, 4.63), f = 576.31; (5, 2), f = 234; (3.63, 2.51), f = 102.88.]
34
34
Downhill Simplex Method in Multi-Dimensions
Iteration 3
5
4
3
2
(3.63, 2.517)
f = 102.88
1
(5, 2)
f = 234
0
-1
(3.12, -0.1204)
f = 64.71
-2
-3
0
1
2
3
4
5
6
7
35
Downhill Simplex Method in Multi-Dimensions
Iteration 4
[Plot: simplex vertices (3.638, 2.517), f = 102.88; (1.75, 0.397), f = 5.15; (3.12, −0.1204), f = 64.71.]
36
Downhill Simplex Method in Multi-Dimensions
Iteration 5
[Plot: simplex vertices (1.758, 0.3972), f = 5.15 (the solution after 5 iterations); (3.12, −0.12), f = 64.71; (1.24, −2.24), f = 61.877.]
37
Downhill Simplex Method in Multi-Dimensions
Rosenbrock's “banana”
function:
f(x) = 100(x_2 − x_1²)² + (x_1 − 1)²
38
Downhill Simplex Method in Multi-Dimensions
39
Direction Set Methods
General Scheme
Initial Step:
  set k = 0
  supply an initial guess x_k, within any specified constraints
Iterative Step:
  calculate a search direction p_k
  determine an appropriate step length l_k
  set x_{k+1} = x_k + l_k·p_k
Stopping Criteria:
  if the convergence criteria are reached, the optimum vector is x_{k+1}; stop
  else set k = k + 1 and repeat the Iterative Step
(A code sketch of this loop follows below.)
40
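A minimal sketch of this scheme as code; direction() and step_length() are hypothetical stand-ins to be supplied by a specific method:

import numpy as np

def iterative_minimize(f, x0, direction, step_length, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        p = direction(f, x)                  # search direction p_k
        l = step_length(f, x, p)             # step length l_k
        x_new = x + l * p
        if np.linalg.norm(x_new - x) < tol:  # a simple convergence criterion
            return x_new
        x = x_new
    return x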
Direction Set (Powell’s) Method
Sometimes it is not possible to
estimate the gradient ∇f needed to
obtain the direction in a steepest
descent method.
From a first guess, minimise along one
coordinate axis, then along the next,
and so on, cycling through the axes repeatedly.
This can be very slow to converge.
Conjugate directions: Directions which are independent
of each other so that minimizing along each one does
not move away from the minimum in the other
directions.
Powell introduced a method to obtain conjugate
directions without computing the derivative.
41
Direction Set (Powell’s) Method
If f is minimised along u, then ∇f must be perpendicular to u at the
minimum. The function may be expanded using the Taylor series around
the origin p as:
f(x) = f(p) + Σ_i (∂f/∂x_i)·x_i + ½ Σ_{i,j} (∂²f/∂x_i∂x_j)·x_i·x_j + … ≈ c + b·x + ½ x·H·x
By taking the gradient of the Taylor expansion:
∇f = b + H·x
The change in gradient when moving in one direction is:
δ(∇f) = H·(δx)
After f is minimised along u, the algorithm proposes a new direction v so
that minimisation along v does not spoil the minimum along u. For this
to be true, the function gradient must stay perpendicular to u:
0 = u·δ(∇f) = u·H·v
When this is true, u and v are said to be conjugate, and we get quadratic
convergence to the minimum.
42
Direction Set (Powell’s) Method
1. Initialise the set of directions u_i to the basis vectors.
2. Repeat until the function stops decreasing:
   a. Save the starting position as P_0.
   b. For i = 0, …, N−1: move P_i to the minimum along direction u_i and
      call this point P_{i+1}.
   c. For i = 0, …, N−2: set u_i = u_{i+1}.
   d. Set u_{N−1} = P_N − P_0.
   e. Move P_N to the minimum along direction u_{N−1} and call this
      point P_0.
Powell showed that, for a quadratic form, k iterations of the above
procedure produce a set of directions ui whose last k members are
mutually conjugate. Therefore, N iterations involving N(N+1) line
minimisations will exactly minimise a quadratic form.
43
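In practice Powell's method is usually taken from a library; a usage sketch with SciPy's implementation (assuming SciPy is available), applied to the Rosenbrock function shown earlier:

import numpy as np
from scipy.optimize import minimize

rosen = lambda x: 100.0*(x[1] - x[0]**2)**2 + (x[0] - 1.0)**2
res = minimize(rosen, np.array([-1.2, 1.0]), method='Powell')
print(res.x)   # close to [1, 1], the minimum, without any derivative evaluations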
Direction Set (Powell’s) Method
44
Gradient Based Methods
They employ gradient information.
They are iterative methods and employ the
iteration procedure
x^(k+1) = x^(k) + α^(k)·s(x^(k))
where
α^(k): step size
s(x^(k)): direction.
The methods differ in how s(x^(k)) is computed.
45
Steepest Descent Method
Let x^(k) be the current point.
The Taylor expansion of the objective function about
x^(k):
f(x^(k) + α^(k)s^(k)) ≈ f(x^(k)) + ∇f(x^(k))^T (α^(k)s^(k))
We need the next point to have a lower objective
function value than the current point:
f(x^(k) + α^(k)s^(k)) − f(x^(k)) ≈ ∇f(x^(k))^T (α^(k)s^(k)) < 0
That is equivalent to
∇f(x^(k))^T s^(k) < 0
Among directions of a fixed length, this product is most
negative when
s^(k) = −∇f(x^(k))
46
Steepest Descent Method
We call this direction the steepest descent direction.
Another way to see this is to recognize that the
gradient always points towards increasing values of
the objective function; taking the negative of the
gradient therefore leads to decreasing values.
Once the direction is determined, a single-variable
search is needed to determine the value of the step
size.
In every iteration of the method, the direction and step
size are computed.
47
Steepest Descent Method
48
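A minimal steepest-descent sketch with a crude backtracking search for the step size (illustrative, not the slides' code):

import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # gradient small: stop
            break
        s = -g                               # steepest descent direction
        alpha = 1.0
        while f(x + alpha * s) > f(x):       # halve the step until f decreases
            alpha *= 0.5
        x = x + alpha * s
    return x

f = lambda x: x[0]**2 + 10.0*x[1]**2
grad = lambda x: np.array([2.0*x[0], 20.0*x[1]])
print(steepest_descent(f, grad, [3.0, 1.0]))   # approaches [0, 0], zigzagging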
Steepest Descent Method
Notes
The good thing about the steepest descent
method is that it reliably converges (to a stationary point).
The bad thing is that its convergence slows markedly
as the minimum is approached.
49
Steepest Descent Method
The gradient ∇f(x^(k)) at a point is perpendicular
to the tangent of the contour line of the
function through that point.
50
Steepest Descent Method
The steepest descent method zigzags its
way towards the optimum point.
This is because each direction is orthogonal
to the previous direction.
[Figure: successive orthogonal steps from x^(1) through x^(2), x^(3), … zigzagging towards x*.]
51
Conjugate Gradient Method
Review
Two vectors u and v are said to be conjugate
with respect to a matrix C if u^T·C·v = 0.
For example, let
u = (1, 0)^T,  v = (1/2, 1)^T  and  C = [ 8  −4 ; −4  6 ]
Then u^T·C·v = 8·(1/2) − 4·1 = 0.
The two vectors are C-conjugate.
A set of conjugate vectors is called a conjugate set.
The eigenvectors of the matrix are conjugate with
respect to it.
52
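The arithmetic can be checked in a couple of lines (a sketch using NumPy):

import numpy as np

C = np.array([[8.0, -4.0], [-4.0, 6.0]])
u = np.array([1.0, 0.0])
v = np.array([0.5, 1.0])
print(u @ C @ v)    # 0.0: u and v are C-conjugate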
Conjugate Gradient Method
Conjugate directions are used to
find the minimum of a function.
The minimum of a quadratic function of N variables
can be found after exactly N exact line searches along
mutually conjugate directions.
53
Conjugate Gradient Method
The question now is: how can we conveniently generate
conjugate directions?
For a quadratic function f(x), the gradient is given by
∇f(x) = C·x + b = g(x)
Taking two points x^(0) and x^(1), the change in the gradient is
given by
Δg = g(x^(1)) − g(x^(0)) = C·(x^(1) − x^(0)) = C·Δx
The iteration procedure we will apply is
x^(k+1) = x^(k) + α^(k)·s(x^(k))
The search directions are calculated as
s^(k) = −g^(k) + γ^(k−1)·s^(k−1),  for k = 1, 2, …, N−1
with s^(0) = −g^(0).
54
Conjugate Gradient Method
If the steepest descent direction is used with exact line
searches, we know that successive gradients are orthogonal:
g^(k−1)T·g^(k) = 0
We want to choose γ^(k−1) such that s^(k) is C-conjugate
to s^(k−1).
Take the first direction:
s^(1) = −g^(1) + γ^(0)·s^(0) = −g^(1) − γ^(0)·g^(0)
We require s^(0) and s^(1) to be C-conjugate:
s^(1)T·C·s^(0) = 0
[g^(1) + γ^(0)·g^(0)]^T·C·s^(0) = 0
We know that s^(0) = Δx / α^(0).
55
Conjugate Gradient Method
Therefore,
[g^(1) + γ^(0)·g^(0)]^T · C·(Δx/α^(0)) = 0
From the quadratic property, C·Δx = Δg, so
[g^(1) + γ^(0)·g^(0)]^T · Δg = 0
After expansion (with Δg = g^(1) − g^(0)),
g^(1)T·g^(1) + γ^(0)·g^(0)T·g^(1) − g^(1)T·g^(0) − γ^(0)·g^(0)T·g^(0) = 0
From this, using the orthogonality of successive gradients,
γ^(0) = ‖g^(1)‖² / ‖g^(0)‖²
56
Conjugate Gradient Method
Therefore, the general iteration is given by
s^(k) = −∇f^(k) + (‖∇f^(k)‖² / ‖∇f^(k−1)‖²)·s^(k−1)
for k = 1, …, N−1.
If the function is not quadratic, more iterations may
be required.
57
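A sketch of the resulting (Fletcher-Reeves) algorithm for general functions, reusing the crude backtracking search from the steepest-descent sketch; with exact line minimisation it would terminate in N steps on a quadratic:

import numpy as np

def conjugate_gradient(f, grad, x0, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    s = -g                                    # first direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = 1.0
        while f(x + alpha * s) > f(x):        # crude backtracking line search
            alpha *= 0.5
        x = x + alpha * s
        g_new = grad(x)
        gamma = (g_new @ g_new) / (g @ g)     # ||grad_k||^2 / ||grad_{k-1}||^2
        s = -g_new + gamma * s                # deflect the new gradient direction
        if s @ g_new >= 0:                    # not a descent direction: restart
            s = -g_new
        g = g_new
    return x

f = lambda x: x[0]**2 + 10.0*x[1]**2
grad = lambda x: np.array([2.0*x[0], 20.0*x[1]])
print(conjugate_gradient(f, grad, [3.0, 1.0]))   # ~[0, 0]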
Conjugate Gradient Method
The steepest descent direction is deflected so that the
minimum is reached directly.
[Figure: from x^(0), a step along s^(0) reaches x^(1); there the steepest-descent
direction −∇f(x^(1)) is deflected into s^(1), which leads straight to the minimum x^(2).]
58
Conjugate Gradient Method
59
Newton’s Method
It is a second-order method.
Let x^(k) be the current point.
The Taylor expansion of the objective function about
x^(k):
f(x) = f(x^(k)) + ∇f(x^(k))^T·Δx + ½·Δx^T·∇²f(x^(k))·Δx + O(Δx³)
The quadratic approximation of f(x) is
f̃(x) = f(x^(k)) + ∇f(x^(k))^T·Δx + ½·Δx^T·∇²f(x^(k))·Δx
We need to find the critical point of the approximation:
∇f(x^(k)) + ∇²f(x^(k))·Δx = 0
⇒ Δx = −[∇²f(x^(k))]⁻¹·∇f(x^(k))
60
Newton’s Method
The Newton optimization method is
x^(k+1) = x^(k) − [∇²f(x^(k))]⁻¹·∇f(x^(k))
If the function f(x) is quadratic, the solution is
found in exactly one step.
61
Newton’s Method
62
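A minimal Newton iteration in code (a sketch; a linear solve replaces the explicit inverse):

import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # dx = -[Hessian]^-1 grad
    return x

# On a quadratic, one step suffices, as claimed above.
grad = lambda x: np.array([2.0*x[0], 20.0*x[1]])   # gradient of x1^2 + 10*x2^2
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton(grad, hess, [3.0, 1.0]))   # [0, 0]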
Modified Newton’s Method
Newton's method can be unreliable for non-quadratic
functions.
The Newton step will often be too large when x^(0) is far
from x*.
To solve this problem, we add a step length:
x^(k+1) = x^(k) − α^(k)·[∇²f(x^(k))]⁻¹·∇f(x^(k))
63
Quasi-Newton Method
Quasi-Newton methods use a Hessian-like matrix but
without calculating second-order derivatives.
Take the general formula:
x^(k+1) = x^(k) − A^(k)·∇f(x^(k))
Sometimes these methods are referred to as
variable metric methods, because A changes at each
iteration.
When A^(k) = I (the identity matrix), the formula becomes
the formula of the steepest descent method.
When A^(k) = [∇²f(x^(k))]⁻¹, the formula becomes the
formula of Newton's method.
Quasi-Newton methods are based primarily upon
properties of quadratic functions and are
designed to mimic Newton's method using only first-order information.
64
Quasi-Newton Method
Starting from a positive definite matrix, the quasi-Newton
methods gradually build up an approximate
(inverse) Hessian matrix by using gradient information from
the previous iterations.
The matrix A is kept positive definite; hence the
direction
s^(k) = −A^(k)·∇f(x^(k))
remains a descent direction.
There are several ways to update the matrix A, one
of which is
A^(k+1) = A^(k) + δ^(k)·δ^(k)T / (δ^(k)T·γ^(k)) − A^(k)·γ^(k)·γ^(k)T·A^(k) / (γ^(k)T·A^(k)·γ^(k))
where A^(0) = I, δ^(k) = x^(k+1) − x^(k) and
γ^(k) = ∇f(x^(k+1)) − ∇f(x^(k)).
65
Quasi-Newton Methods (DFP)
The DFP formula used to update the matrix A is
A^(k) = A^(k−1) + Δx^(k−1)·Δx^(k−1)T / (Δx^(k−1)T·Δg^(k−1)) − A^(k−1)·Δg^(k−1)·Δg^(k−1)T·A^(k−1) / (Δg^(k−1)T·A^(k−1)·Δg^(k−1))
where A^(0) = I, Δx^(k) = x^(k+1) − x^(k) and
Δg^(k) = g(x^(k+1)) − g(x^(k)) = ∇f(x^(k+1)) − ∇f(x^(k)).
The DFP formula preserves symmetry and positive
definiteness, so the sequence A^(1), A^(2), … will
also be symmetric and positive definite.
66
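A sketch of the DFP update in NumPy, driven here by a small fixed step instead of a proper line search (an illustrative simplification):

import numpy as np

def dfp_update(A, dx, dg):
    # dx = x_{k+1} - x_k,  dg = grad_{k+1} - grad_k
    Adg = A @ dg
    return (A
            + np.outer(dx, dx) / (dx @ dg)      # adds curvature along the step
            - np.outer(Adg, Adg) / (dg @ Adg))  # removes the stale estimate

grad = lambda x: np.array([2.0*x[0], 20.0*x[1]])  # gradient of x1^2 + 10*x2^2
x = np.array([3.0, 1.0])
A = np.eye(2)                                     # A(0) = I: first step is steepest descent
for _ in range(100):
    g = grad(x)
    x_new = x - 0.04 * (A @ g)                    # s = -A grad, small fixed step
    A = dfp_update(A, x_new - x, grad(x_new) - g)
    x = x_new
print(x)   # near [0, 0]; each update enforces the secant condition A @ dg == dx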
Quasi-Newton Methods (DFP)
67
Lagrange Multipliers – Introduction
[Slides 68–74 are image-only. The sub-topics covered: checking for maximality via the
Closed Interval Method, the First Derivative Test, and the Second Derivative Test.]
Lagrange Multipliers – Method
[Slides 75–77 are image-only.]
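Since the method slides survive only as images, here is a small illustrative instance of the method in the usual form (not one of the slide examples): maximize f(x, y) = x·y subject to g(x, y) = x² + y² = 2.
Setting ∇f = λ·∇g gives
y = 2λx,  x = 2λy,  x² + y² = 2
Substituting the second equation into the first yields y = 4λ²y, so λ = ±1/2 (for y ≠ 0).
λ = 1/2 gives x = y = ±1 with f = 1 (the constrained maxima);
λ = −1/2 gives x = −y = ±1 with f = −1 (the constrained minima).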
Lagrange Multipliers – Examples
[Slides 78–91 are image-only worked examples.]
The Kuhn-Tucker Conditions
• The Kuhn-Tucker conditions are used to solve
NLPs of the following type:
max (or min) f(x_1, x_2, …, x_n)
s.t. g_1(x_1, x_2, …, x_n) ≤ b_1
     g_2(x_1, x_2, …, x_n) ≤ b_2
     …
     g_m(x_1, x_2, …, x_n) ≤ b_m
• The Kuhn-Tucker conditions are necessary
for a point x̄ = (x̄_1, x̄_2, …, x̄_n) to solve the NLP.
92
The Kuhn-Tucker Conditions
• Suppose the NLP is a maximization problem. If
x̄ = (x̄_1, x̄_2, …, x̄_n) is an optimal solution to the NLP,
then x̄ must satisfy the m constraints
in the NLP, and there must exist multipliers λ_1,
λ_2, …, λ_m satisfying
∂f(x̄)/∂x_j − Σ_{i=1}^{m} λ_i·∂g_i(x̄)/∂x_j = 0   (j = 1, 2, …, n)
λ_i·[b_i − g_i(x̄)] = 0   (i = 1, 2, …, m)
λ_i ≥ 0   (i = 1, 2, …, m)
93
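As a small illustrative instance of these conditions (not from the slides): maximize f(x_1, x_2) = x_1·x_2 subject to x_1 + x_2 ≤ 4.
Stationarity: x_2 − λ_1 = 0 and x_1 − λ_1 = 0
Complementary slackness: λ_1·(4 − x_1 − x_2) = 0
Nonnegativity: λ_1 ≥ 0
Taking λ_1 > 0 forces x_1 + x_2 = 4, so x_1 = x_2 = λ_1 = 2, giving the candidate maximum f = 4 (λ_1 = 0 only yields the stationary point x_1 = x_2 = 0 with f = 0).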
The Kuhn-Tucker Conditions
• Suppose the NLP is a minimization problem. If
x̄ = (x̄_1, x̄_2, …, x̄_n) is an optimal solution to the NLP,
then x̄ must satisfy the m constraints
in the NLP, and there must exist multipliers λ_1,
λ_2, …, λ_m satisfying
∂f(x̄)/∂x_j + Σ_{i=1}^{m} λ_i·∂g_i(x̄)/∂x_j = 0   (j = 1, 2, …, n)
λ_i·[b_i − g_i(x̄)] = 0   (i = 1, 2, …, m)
λ_i ≥ 0   (i = 1, 2, …, m)
94
The Kuhn-Tucker Conditions
• Unless a constraint qualification or regularity
condition is satisfied at an optimal point x̄,
the Kuhn-Tucker conditions may fail to hold
at x̄.
• LINGO can be used to solve NLPs with
inequality (and possibly equality) constraints.
• If LINGO displays the message DUAL
CONDITIONS: SATISFIED, then it has
found a point satisfying the Kuhn-Tucker
conditions.
95