
Chapter 4
Multilayer Perceptrons
Instructor: 張傳育 博士 (Chuan-Yu Chang, Ph.D.)
E-mail: [email protected]
Tel: (05)5342601 ext. 4337
Office: ES709
資訊工程所 醫學影像處理實驗室(Medical Image Processing Lab. )
Graduate School of Computer Science & Information Engineering
Introduction
- Multilayer perceptrons
  - The network consists of a set of sensory units that constitute the input layer, one or more hidden layers of computation nodes, and an output layer of computation nodes.
  - Training is supervised, using the error back-propagation algorithm, which is based on error-correction learning.
- The algorithm consists of two passes: a forward pass and a backward pass.
  - Forward pass: an activity pattern is applied to the sensory nodes, and its effect propagates through the network layer by layer.
  - Backward pass: the synaptic weights are all adjusted in accordance with an error-correction rule.
    - The error signal is propagated backward through the network.
Introduction (cont.)
- A multilayer perceptron has three distinctive characteristics:
  - The model of each neuron in the network includes a nonlinear activation function.
    - Sigmoidal nonlinearity
  - The network contains one or more layers of hidden neurons.
    - These enable the network to learn complex tasks by extracting progressively more meaningful features from the input patterns.
  - The network exhibits a high degree of connectivity, determined by the synapses of the network.
Some preliminaries
The architectural graph of an MLP with two hidden layers and an output layer.
Some preliminaries (cont.)
- Two kinds of signals are identified:
  - Function signals: a function signal comes in at the input end of the network, propagates forward through the network, and emerges at the output end of the network as an output signal.
  - Error signals: an error signal originates at an output neuron of the network and propagates backward through the network.
Some preliminaries (cont.)
Each hidden or output neuron of a multilayer perceptron is designed to perform two computations:
- the computation of the function signal appearing at the output of the neuron, obtained by passing the weighted input signals through a nonlinear transformation;
- the computation of an estimate of the gradient vector, which is needed for the backward pass through the network.
Signal-flow graph highlighting
the details of output neuron j
[Figure: signal-flow graph of output neuron j, with fixed input y_0 = +1 and bias w_{j0}(n) = b_j(n); the neuron forms the induced local field v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n), applies the activation function to obtain y_j(n) = \varphi_j(v_j(n)), and the error signal e_j(n) = d_j(n) - y_j(n) is formed against the desired response d_j(n).]

By the chain rule,

\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}
Back-propagation algorithm (cont.)
The error signal at the output of neuron j at iteration n is defined by

e_j(n) = d_j(n) - y_j(n)    (4.1)

The instantaneous value of the total error energy is defined as

E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)    (4.2)

where the set C contains all the neurons in the output layer.

The average squared error energy is obtained by summing E(n) over all n and then normalizing with respect to the set size N (the number of training examples):

E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)    (4.3)
Back-propagation algorithm (cont.)
E_{av} represents the cost function as a measure of learning performance.
The objective of the learning process is to adjust the free parameters of the network to minimize E_{av}.
The weights are updated on a pattern-by-pattern basis until one epoch, that is, one complete presentation of the entire training set, has been dealt with.
The adjustments to the weights are made in accordance with the respective errors computed for each pattern presented to the network.
Back-propagation algorithm (cont.)
The induced local field v_j(n) is defined as

v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n)    (4.4)

The function signal y_j(n) appearing at the output of neuron j at iteration n is

y_j(n) = \varphi_j(v_j(n))    (4.5)

Following the LMS algorithm and the credit-assignment problem, back-propagation uses the chain rule to compute the correction for each synaptic weight. The sensitivity factor is

\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}    (4.6)
Back-propagation algorithm (cont.)
Differentiating both sides of Eq. (4.2) with respect to e_j(n):

\frac{\partial E(n)}{\partial e_j(n)} = e_j(n)    (4.7)

Differentiating both sides of Eq. (4.1) with respect to y_j(n):

\frac{\partial e_j(n)}{\partial y_j(n)} = -1    (4.8)

Differentiating Eq. (4.5) with respect to v_j(n):

\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'(v_j(n))    (4.9)

Differentiating Eq. (4.4) with respect to w_{ji}(n):

\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)    (4.10)
Back-propagation algorithm (cont.)
Combining Eqs. (4.7) through (4.10):

\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\,\varphi_j'(v_j(n))\,y_i(n)    (4.11)

The weights are corrected according to the delta rule:

\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)}    (4.12)

Substituting Eq. (4.11) into Eq. (4.12) gives

\Delta w_{ji}(n) = \eta\,\delta_j(n)\,y_i(n)    (4.13)

where the local gradient \delta_j(n) is defined as

\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n)\,\varphi_j'(v_j(n))    (4.14)
[Figure: the signal-flow graph of output neuron j again, annotated with the quantities defined above: E(n) = \frac{1}{2}\sum_{j\in C} e_j^2(n), e_j(n) = d_j(n) - y_j(n), v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n), y_j(n) = \varphi_j(v_j(n)), and the chain-rule factors \partial E(n)/\partial e_j(n) = e_j(n), \partial e_j(n)/\partial y_j(n) = -1, \partial y_j(n)/\partial v_j(n) = \varphi_j'(v_j(n)), \partial v_j(n)/\partial w_{ji}(n) = y_i(n).]
Back-propagation algorithm (cont.)
- Case 1: neuron j is an output node
  Use Eq. (4.1) to compute the error signal e_j(n) associated with this neuron, and then Eq. (4.14) to compute its local gradient \delta_j(n).
- Case 2: neuron j is a hidden node
  There is no specified desired response for that neuron.
  The error signal for a hidden neuron has to be determined recursively in terms of the error signals of all the neurons to which that hidden neuron is directly connected.
Signal-flow graph of output neuron k connected to hidden
neuron j
Fig. 4.4
Back-propagation algorithm (cont.)
Following Eq. (4.14), the local gradient \delta_j(n) of hidden neuron j may be defined as

\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\,\varphi_j'(v_j(n)),  neuron j is hidden    (4.15)

Because

E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n),  neuron k is an output node    (4.16)

differentiating Eq. (4.16) with respect to y_j(n) gives

\frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)}    (4.17)
Back-propagation algorithm (cont.)
Using the chain rule, Eq. (4.17) can be rewritten as

\frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}    (4.18)

From Fig. 4.4,

e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n)),  neuron k is an output node    (4.19)

Hence

\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi_k'(v_k(n))    (4.20)

Also from Fig. 4.4,

v_k(n) = \sum_{j=0}^{m} w_{kj}(n) y_j(n)    (4.21)
Back-propagation algorithm (cont.)
Differentiating Eq. (4.21) with respect to y_j(n):

\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)    (4.22)

Substituting Eqs. (4.20) and (4.22) into Eq. (4.18) gives

\frac{\partial E(n)}{\partial y_j(n)} = -\sum_{k} e_k(n)\,\varphi_k'(v_k(n))\,w_{kj}(n) = -\sum_{k} \delta_k(n)\,w_{kj}(n)    (4.23)

Finally, substituting Eq. (4.23) into Eq. (4.15) yields the back-propagation formula for the local gradient \delta_j(n):

\delta_j(n) = \varphi_j'(v_j(n)) \sum_{k} \delta_k(n)\,w_{kj}(n),  neuron j is hidden    (4.24)
[Figure: signal-flow graph of output neuron k connected to hidden neuron j, annotated with e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n)), \partial e_k(n)/\partial v_k(n) = -\varphi_k'(v_k(n)), v_k(n) = \sum_{j=0}^{m} w_{kj}(n) y_j(n), \partial v_k(n)/\partial y_j(n) = w_{kj}(n), \partial E(n)/\partial y_j(n) = -\sum_k e_k(n)\varphi_k'(v_k(n)) w_{kj}(n) = -\sum_k \delta_k(n) w_{kj}(n), and \delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n) w_{kj}(n).]
Back-propagation algorithm (cont.)
- Summary
  - The correction \Delta w_{ji}(n) applied to the synaptic weight connecting neuron i to neuron j is defined by the delta rule:

    (Weight correction \Delta w_{ji}(n)) = (learning-rate parameter \eta) \times (local gradient \delta_j(n)) \times (input signal of neuron j, y_i(n))    (4.25)

  - The local gradient \delta_j(n) depends on whether neuron j is an output node or a hidden node.
Back-propagation algorithm (cont.)
The two passes of Computation
- Forward pass
  The synaptic weights remain unaltered throughout the network, and the function signals of the network are computed on a neuron-by-neuron basis. The output of neuron j is

  y_j(n) = \varphi(v_j(n))    (4.26)

  where the induced local field of neuron j is defined by

  v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n)    (4.27)

- Backward pass
  Starts at the output layer by passing the error signals leftward through the network, layer by layer, and recursively computing the local gradient \delta for each neuron.
Back-propagation algorithm (cont.)
- Activation function
  In an MLP, \varphi(\cdot) must be continuous and differentiable. A commonly used activation function is the sigmoidal nonlinearity.
  - Logistic function

    \varphi_j(v_j(n)) = \frac{1}{1 + \exp(-a v_j(n))},  a > 0,  -\infty < v_j(n) < \infty    (4.30)

    Differentiating Eq. (4.30) gives

    \varphi_j'(v_j(n)) = \frac{a \exp(-a v_j(n))}{\left[1 + \exp(-a v_j(n))\right]^2}    (4.31)

    Since y_j(n) = \varphi_j(v_j(n)), Eq. (4.31) can be rewritten as

    \varphi_j'(v_j(n)) = a\,y_j(n)\,[1 - y_j(n)]    (4.32)
Back-propagation algorithm (cont.)
For a neuron j located in the output layer, y_j(n) = o_j(n); hence the local gradient of neuron j can be expressed as

\delta_j(n) = e_j(n)\,\varphi_j'(v_j(n)) = a\,[d_j(n) - o_j(n)]\,o_j(n)\,[1 - o_j(n)],  neuron j is an output node    (4.33)

For an arbitrary neuron j in a hidden layer, the local gradient can be expressed as

\delta_j(n) = \varphi_j'(v_j(n)) \sum_{k} \delta_k(n)\,w_{kj}(n) = a\,y_j(n)\,[1 - y_j(n)] \sum_{k} \delta_k(n)\,w_{kj}(n),  neuron j is hidden    (4.34)
Back-propagation algorithm (cont.)
- Hyperbolic tangent function

  \varphi_j(v_j(n)) = a \tanh(b v_j(n)),  a, b > 0    (4.35)

  Differentiating with respect to v_j(n):

  \varphi_j'(v_j(n)) = a b\,\mathrm{sech}^2(b v_j(n)) = a b\,[1 - \tanh^2(b v_j(n))] = \frac{b}{a}\,[a - y_j(n)]\,[a + y_j(n)]    (4.36)

  For a neuron j in the output layer, the local gradient is

  \delta_j(n) = e_j(n)\,\varphi_j'(v_j(n)) = \frac{b}{a}\,[d_j(n) - o_j(n)]\,[a - o_j(n)]\,[a + o_j(n)]    (4.37)

  For a neuron j in a hidden layer, the local gradient is

  \delta_j(n) = \varphi_j'(v_j(n)) \sum_{k} \delta_k(n)\,w_{kj}(n) = \frac{b}{a}\,[a - y_j(n)]\,[a + y_j(n)] \sum_{k} \delta_k(n)\,w_{kj}(n),  neuron j is hidden    (4.38)
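The closed-form derivatives (4.32) and (4.36) can be checked numerically; in the sketch below the parameter values a and b and the test points are arbitrary choices, and the finite-difference comparison is only a sanity check.

```python
import numpy as np

# Numerical check of Eqs. (4.32) and (4.36); a, b, and the test points are
# arbitrary values chosen for the example.
a, b, h = 1.7159, 2.0 / 3.0, 1e-6
v = np.linspace(-3.0, 3.0, 7)

# Logistic: phi(v) = 1 / (1 + exp(-a v)),  phi'(v) = a y (1 - y)
y_log = 1.0 / (1.0 + np.exp(-a * v))
d_log_closed = a * y_log * (1.0 - y_log)
d_log_numeric = (1.0 / (1.0 + np.exp(-a * (v + h))) - y_log) / h

# Hyperbolic tangent: phi(v) = a tanh(b v),  phi'(v) = (b/a)(a - y)(a + y)
y_tanh = a * np.tanh(b * v)
d_tanh_closed = (b / a) * (a - y_tanh) * (a + y_tanh)
d_tanh_numeric = (a * np.tanh(b * (v + h)) - y_tanh) / h

print(np.max(np.abs(d_log_closed - d_log_numeric)))    # on the order of 1e-6
print(np.max(np.abs(d_tanh_closed - d_tanh_numeric)))  # on the order of 1e-6
```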
Rate of Learning
The BP algorithm provides an approximation to the trajectory in weight space computed by the method of steepest descent.
- With a smaller learning-rate parameter, the changes to the synaptic weights from one iteration to the next are smaller and the trajectory in weight space is smoother, but convergence takes longer.
- With a larger learning-rate parameter, the weight changes are larger and convergence is faster, but the trajectory in weight space may become oscillatory and the network may become unstable.
- Introducing a momentum term can increase the learning speed while avoiding the danger of instability.
Rate of Learning
- BP approximates the trajectory of steepest descent.
- A smaller learning-rate parameter makes a smoother path.
- The rate of learning can be increased, while avoiding the danger of instability, by including a momentum term:

  \Delta w_{ji}(n) = \alpha\,\Delta w_{ji}(n-1) + \eta\,\delta_j(n)\,y_i(n)

  where \alpha is the momentum constant. Solving this difference equation as a time series gives

  \Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t}\,\delta_j(t)\,y_i(t) = -\eta \sum_{t=0}^{n} \alpha^{n-t}\,\frac{\partial E(t)}{\partial w_{ji}(t)}

- For the network to converge, the momentum constant must be strictly restricted to 0 \le |\alpha| < 1.
- (a) When the partial derivative \partial E/\partial w_{ji} has the same sign on consecutive iterations, \Delta w_{ji}(n) grows in magnitude (the effective \eta increases, accelerating the downhill descent; "riding along the wave").
- (b) When the partial derivative alternates in sign on consecutive iterations, \Delta w_{ji}(n) shrinks (the momentum term has a stabilizing effect against oscillation across a valley bottom).
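A minimal sketch of the generalized delta rule with momentum shown above; the array names and parameter values are illustrative only.

```python
import numpy as np

# Generalized delta rule with momentum; eta and alpha are example values, and
# W, prev_dW, delta, y_prev are assumed to exist with compatible shapes.
eta, alpha = 0.1, 0.9

def momentum_update(W, prev_dW, delta, y_prev):
    """Return the updated weights and the correction to reuse at the next step."""
    dW = alpha * prev_dW + eta * np.outer(delta, y_prev)
    return W + dW, dW
```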
Rate of Learning (cont.)
- Introducing the momentum term also has the benefit of preventing the learning process from terminating in a shallow local minimum.
- The learning rate may be connection-dependent.
- Some synaptic weights may be kept fixed while the others are adjusted.
Mode of Training
- Epoch: one complete presentation of the training data.
  - Randomize the order of presentation for each epoch.
- Sequential mode (also called on-line, pattern, or stochastic mode)
  - The synaptic weights are updated after each training example.
  - Requires less storage.
  - Converges much faster, particularly when the training data are redundant.
  - The random order of presentation makes trapping at a local minimum less likely.
- Batch mode
  - Weight updating is performed after the presentation of all the training examples that constitute an epoch.
  - At the end of one epoch, the synaptic weights are updated according to

    \Delta w_{ji}(n) = -\eta\,\frac{\partial E_{av}}{\partial w_{ji}} = -\frac{\eta}{N} \sum_{n=1}^{N} e_j(n)\,\frac{\partial e_j(n)}{\partial w_{ji}(n)}    (4.42-4.43)

  - May be robust with respect to outliers.
Sequential Mode vs. Batch Mode of Training

1. Pattern (sequential) mode: at each step, minimize the instantaneous error energy E(n) = \frac{1}{2}\sum_{j\in C} e_j^2(n). For each pattern, compute \Delta w_{ji}(n) and update the weights immediately; then present the next pattern, compute a new \Delta w_{ji}(n), update, and so on.

2. Batch mode: minimize E_{av} = \frac{1}{2N}\sum_{n=1}^{N}\sum_{j\in C} e_j^2(n). After all N patterns have been presented, the total correction is

   \Delta w_{ji} = \frac{1}{N}\sum_{n=1}^{N}\Delta w_{ji}(n) = -\eta\,\frac{\partial E_{av}(n)}{\partial w_{ji}(n)} = -\frac{\eta}{N}\sum_{n=1}^{N} e_j(n)\,\frac{\partial e_j(n)}{\partial w_{ji}(n)}

   That is, all patterns are processed first, the averaged update is computed, and only then are the weights changed.
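The two update schedules can be contrasted in a few lines; `grad_E` below stands for a hypothetical user-supplied routine returning the instantaneous gradient ∂E(n)/∂w for one pattern, which is an assumption of this sketch.

```python
import numpy as np

# Sketch contrasting the two update schedules. grad_E(w, x, d) is assumed to
# return the instantaneous gradient dE(n)/dw for a single pattern (x, d).
def sequential_epoch(w, patterns, grad_E, eta=0.1):
    """Pattern mode: update the weights immediately after every pattern."""
    for x, d in patterns:
        w = w - eta * grad_E(w, x, d)
    return w

def batch_epoch(w, patterns, grad_E, eta=0.1):
    """Batch mode: average the gradient over the whole epoch, then update once."""
    g = sum(grad_E(w, x, d) for x, d in patterns) / len(patterns)
    return w - eta * g
```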
Comparison of the Sequential Mode and the Batch Mode
- Sequential mode:
  - On-line operation is easier in the sequential mode, because it requires less storage.
  - The search in weight space is stochastic in nature, which makes it less likely to be trapped in a local minimum.
  - Because it is stochastic, however, it is difficult to establish theoretical conditions for convergence.
  - Because it is simple and effective on large and difficult problems, it remains popular.
- Batch mode:
  - Provides an accurate estimate of the gradient vector.
  - Convergence to a local minimum is thereby guaranteed.
Stopping Criteria
- There are no well-defined stopping criteria.
- Terminate when the gradient vector g(w) = 0 (the first derivative equals zero).
  - This can occur at a local or a global minimum.
  - The BP algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
    - Drawback: even for successful trials, learning times may be long.
- Terminate when the error measure is stationary.
  - The BP algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small.
    - Drawback: too large a threshold on the change in squared error causes the network to stop learning prematurely.
- Terminate if the network's generalization performance is adequate.
Summary of the BP algorithm
1. Initialization
   a. Make a good choice of the initial values.
   b. The weights should not be too large.
   Goal: keep the induced local field v_j of each neuron within the linear range of the activation function.
2. Presentations of training examples
   Present the network with an epoch of training examples.
3. Forward computation
   Compute the induced local fields and function signals of the network by proceeding forward through the network, layer by layer.
4. Backward computation
   Compute the local gradients (\delta's) of the network.
5. Iteration
   Present new training examples and repeat steps 3 and 4 until the stopping criterion is satisfied.
   The order of presentation of training examples should be randomized from epoch to epoch.
   The momentum and learning-rate parameters are typically decreased as the number of training iterations increases.
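The five steps above can be collected into a compact sequential-mode training loop. The sketch below uses a single hidden layer of logistic neurons and the local gradients of Eqs. (4.33)-(4.34); the layer sizes, toy data, and learning parameters are arbitrary choices for illustration, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example sizes and toy data (assumptions made for this sketch).
m0, m1, M, N = 2, 4, 1, 200
X = rng.normal(size=(N, m0))
D = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # toy targets

a, eta = 1.0, 0.5
W1 = rng.normal(scale=1.0 / np.sqrt(m0), size=(m1, m0)); b1 = np.zeros(m1)
W2 = rng.normal(scale=1.0 / np.sqrt(m1), size=(M, m1));  b2 = np.zeros(M)
phi = lambda v: 1.0 / (1.0 + np.exp(-a * v))               # logistic, Eq. (4.30)

for epoch in range(100):
    order = rng.permutation(N)                             # randomize presentation
    for n in order:
        x, d = X[n], D[n]
        # 3. Forward computation
        y1 = phi(W1 @ x + b1)
        o  = phi(W2 @ y1 + b2)
        # 4. Backward computation (Eqs. 4.33 and 4.34)
        delta2 = a * (d - o) * o * (1.0 - o)
        delta1 = a * y1 * (1.0 - y1) * (W2.T @ delta2)
        # Weight corrections, Eq. (4.25)
        W2 += eta * np.outer(delta2, y1); b2 += eta * delta2
        W1 += eta * np.outer(delta1, x);  b1 += eta * delta1
```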
Signal-flow graphical summary of back-propagation learning
Sensitivity graph: used to compute the local gradients.
XOR Problem
A single-layer perceptron has no hidden neurons and therefore cannot classify nonlinearly separable patterns.
The XOR problem can be solved by a network with a single hidden layer.
Each neuron is realized with the McCulloch-Pitts model (threshold model).

[Figure: the XOR network. Both hidden neurons receive the two inputs with weights +1; hidden neuron 1 has bias -1.5 and hidden neuron 2 has bias -0.5. The output neuron receives the hidden outputs with weights -2 and +1 and has bias -0.5.]
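The construction can be verified directly with McCulloch-Pitts threshold units; the sketch below hard-codes the weights and biases quoted in the figure and simply prints the resulting truth table.

```python
import numpy as np

# Checking the XOR network of the figure with McCulloch-Pitts threshold units.
# step(v) = 1 if v > 0, else 0.
step = lambda v: (v > 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    h1 = step(np.array([1.0, 1.0]) @ x - 1.5)   # hidden neuron 1, bias -1.5
    h2 = step(np.array([1.0, 1.0]) @ x - 0.5)   # hidden neuron 2, bias -0.5
    return step(-2.0 * h1 + 1.0 * h2 - 0.5)     # output neuron, bias -0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))          # reproduces the XOR truth table
```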
XOR Problem (cont.)
- The decision boundary constructed by the first hidden neuron.
- The decision boundary constructed by the second hidden neuron.
- The function of the output neuron is to construct a linear combination of the decision boundaries formed by the two hidden neurons.
- The decision boundary constructed by the complete network.
Some Hints for Making the BP Algorithm Perform Better
- Sequential vs. batch update
  - The sequential mode of BP learning is computationally faster than the batch mode.
- Maximizing information content
  - Use a pattern that is radically different from all those previously used (also called an emphasizing scheme).
  - But beware of outliers, and do not distort the input distribution.
- Activation function
  - An antisymmetric activation function learns faster than a nonsymmetric one.
  - A typical antisymmetric activation function is the hyperbolic tangent.
- Target values
  - The target values (desired responses) should be chosen within the range of the activation function.
Antisymmetric vs. nonsymmetric activation functions
An antisymmetric activation function satisfies \varphi(-v) = -\varphi(v).
Hyperbolic tangent activation function: \varphi(v) = a \tanh(bv), with a = 1.7159 and b = 2/3, so that \varphi(1) = 1 and \varphi(-1) = -1.
At the origin the slope of the activation function is close to unity: \varphi'(0) = ab = 1.7159 \times 2/3 = 1.1424.
The second derivative of \varphi(v) attains its maximum value at v = 1.
Some Hints for Making the BP Algorithm Perform Better
- Normalizing the inputs
  - Each input variable should be preprocessed to have a mean close to zero.
  - The input variables should be uncorrelated (this can be achieved by PCA).
  - The decorrelated input variables should be scaled so that their covariances are approximately equal.
[Figure: the operations of mean removal, decorrelation, and covariance equalization.]
Some Hints for Making the BP Algorithm Perform Better (cont.)
- Initialization
  - The weights should be relatively small, with zero mean and variance equal to the reciprocal of the number of synaptic connections of the neuron.
- Learning from hints
  - Known prior information about the mapping may be built into \varphi(\cdot) to accelerate learning.
- Learning rate
  - Neurons in the later (output-side) layers should have a smaller learning rate than those in the earlier layers.
  - Neurons with many inputs should be given a smaller learning rate.
  - LeCun (1993): the learning rate should be inversely proportional to the square root of the number of synaptic connections to the neuron (cf. Eq. 4.49).
Initialization
Consider a multilayer perceptron that uses the hyperbolic tangent function. With the biases set to zero, the induced local field of neuron j can be expressed as

v_j = \sum_{i=1}^{m} w_{ji} y_i

Assumptions:
1. The inputs have zero mean.
2. The inputs have unit variance.
3. The inputs are uncorrelated.
4. The weights are drawn from a zero-mean distribution.
Let the variance of the weights be denoted by \sigma_w^2.
Then we have

\mu_v = E[v_j] = E\!\left[\sum_{i=1}^{m} w_{ji} y_i\right] = \sum_{i=1}^{m} E[w_{ji}]\,E[y_i] = 0

\sigma_v^2 = E[(v_j - \mu_v)^2] = E[v_j^2] = E\!\left[\sum_{i=1}^{m}\sum_{k=1}^{m} w_{ji} w_{jk} y_i y_k\right] = \sum_{i=1}^{m}\sum_{k=1}^{m} E[w_{ji} w_{jk}]\,E[y_i y_k] = \sum_{i=1}^{m} E[w_{ji}^2] = m\,\sigma_w^2

(using E[y_i y_k] = 1 for i = k and 0 otherwise).

A good strategy for initializing the synaptic weights is to make the standard deviation of the induced local field of a neuron fall within the linear range of the activation function. Because the second derivative of the hyperbolic tangent attains its maximum at v = 1, we set \sigma_v = 1, so that

m\,\sigma_w^2 = 1 \quad\Rightarrow\quad \sigma_w = m^{-1/2}    (4.49)

where m is the number of synaptic connections of a neuron.
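A minimal sketch of the initialization rule σ_w = m^{-1/2} of Eq. (4.49); the use of a uniform distribution (any zero-mean distribution with this variance would do) and the layer sizes are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(m_in, m_out):
    """Zero-mean weights with variance 1/m_in, i.e. sigma_w = m_in**-0.5 (Eq. 4.49)."""
    sigma_w = 1.0 / np.sqrt(m_in)
    # A uniform distribution on [-sqrt(3)*sigma_w, sqrt(3)*sigma_w] has variance sigma_w**2.
    limit = np.sqrt(3.0) * sigma_w
    return rng.uniform(-limit, limit, size=(m_out, m_in))

W = init_weights(m_in=100, m_out=10)
print(W.std())   # close to 1/sqrt(100) = 0.1
```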
Some Hints for Making the BP Algorithm Perform Better
To make BP perform better:
1. Use an antisymmetric activation function.
2. The desired responses d_j should be offset by some amount \varepsilon away from the limiting values of the activation function.
3. The weights and biases should be limited to a small range, (-2.4/F_i, +2.4/F_i), where F_i is the fan-in of neuron i.
4. All neurons should learn at approximately the same rate:
   a. the larger the fan-in, the smaller \eta should be;
   b. the later the layer, the smaller \eta should be.
5. Use pattern-by-pattern updating.
6. The order of the training patterns should be randomized.
7. Prior information should be used in assigning \varphi(\cdot).
Output Representation and Decision Rule
In theory, an M-class classification problem requires M outputs to represent all possible classification decisions.
Let y_{k,j} denote the k-th output of the network in response to the input x_j:

y_{k,j} = F_k(\mathbf{x}_j),  k = 1, 2, \ldots, M    (4.50)

where F_k(\cdot) defines the mapping learned by the network from the input to the k-th output.
For convenience of presentation, let

\mathbf{y}_j = [y_{1,j}, y_{2,j}, \ldots, y_{M,j}]^T = [F_1(\mathbf{x}_j), F_2(\mathbf{x}_j), \ldots, F_M(\mathbf{x}_j)]^T = \mathbf{F}(\mathbf{x}_j)    (4.51)

[Figure: the input x_j applied to a multilayer perceptron with weight vector w produces the outputs y_{1,j}, y_{2,j}, ..., y_{M,j}.]
Output Representation and Decision Rule
- The question of interest: after a multilayer perceptron is trained, what should the optimum decision rule be for classifying the M outputs of the network?
- Any reasonable output decision rule should be based on the vector-valued function

  \mathbf{F}: \mathbb{R}^{m_0} \ni \mathbf{x} \mapsto \mathbf{y} \in \mathbb{R}^{M}    (4.52)

  that minimizes the empirical risk functional

  R = \frac{1}{2N} \sum_{j=1}^{N} \left\| \mathbf{d}_j - \mathbf{F}(\mathbf{x}_j) \right\|^2    (4.53)

- Assume the network is trained with binary target values:

  d_{kj} = \begin{cases} 1 & \text{when the prototype } \mathbf{x}_j \text{ belongs to class } C_k \\ 0 & \text{when the prototype } \mathbf{x}_j \text{ does not belong to class } C_k \end{cases}    (4.54)
Output Representation and Decision Rule
- Accordingly, class C_k is represented by the M-dimensional target vector

  [0, \ldots, 0, 1, 0, \ldots, 0]^T

  with the 1 in the k-th position.
- The appropriate output decision rule for an MLP classifier acts as an approximation to the Bayes rule:
  Classify the random vector x as belonging to class C_k if

  F_k(\mathbf{x}) > F_j(\mathbf{x})  for all  j \neq k    (4.55)

  where F_k(x) and F_j(x) are elements of the vector-valued mapping function

  \mathbf{F}(\mathbf{x}) = [F_1(\mathbf{x}), F_2(\mathbf{x}), \ldots, F_M(\mathbf{x})]^T

  (Alternatively, the vector x may be assigned membership in a particular class only if the corresponding output value is greater than some fixed threshold.)
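In code, the one-of-M target coding of Eq. (4.54) and the decision rule of Eq. (4.55) reduce to a one-hot vector and an argmax; a small sketch, assuming the network outputs are already available as an array.

```python
import numpy as np

def one_hot_target(k, M):
    """M-dimensional target vector with a 1 in the k-th position (Eq. 4.54)."""
    d = np.zeros(M)
    d[k] = 1.0
    return d

def classify(F_x):
    """Assign x to the class C_k with F_k(x) > F_j(x) for all j != k (Eq. 4.55)."""
    return int(np.argmax(F_x))

print(one_hot_target(2, 4))                    # [0. 0. 1. 0.]
print(classify(np.array([0.1, 0.7, 0.2])))     # 1
```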
Computer Experiment
- A computer experiment is used to study the behavior of a multilayer perceptron as a pattern classifier.
- Class C1:

  f_X(\mathbf{x} \mid C_1) = \frac{1}{2\pi\sigma_1^2} \exp\!\left(-\frac{1}{2\sigma_1^2}\|\mathbf{x} - \boldsymbol{\mu}_1\|^2\right)    (4.56)

  \boldsymbol{\mu}_1 = \text{mean vector} = [0, 0]^T,  \sigma_1^2 = \text{variance} = 1
- Class C2:

  f_X(\mathbf{x} \mid C_2) = \frac{1}{2\pi\sigma_2^2} \exp\!\left(-\frac{1}{2\sigma_2^2}\|\mathbf{x} - \boldsymbol{\mu}_2\|^2\right)    (4.57)

  \boldsymbol{\mu}_2 = [2, 0]^T,  \sigma_2^2 = 4
- The prior probabilities are p(C_1) = p(C_2) = 0.5.
Computer Experiment
- Plots of the conditional probability density functions of Eqs. (4.56) and (4.57).
Scatter diagrams of class C1, class C2, and the joint distribution
[Figure: 500 points drawn from class C1, 500 points drawn from class C2, and the joint scatter diagram of the two classes.]
Bayesian Decision Boundary
Let

\Lambda(\mathbf{x}) = \frac{f_X(\mathbf{x}\mid C_1)}{f_X(\mathbf{x}\mid C_2)}  and  \xi = \frac{p(C_2)}{p(C_1)}

- If \Lambda(\mathbf{x}) > \xi, assign x to C_1; if \Lambda(\mathbf{x}) \le \xi, assign x to C_2.
- If p(C_1) = p(C_2) = 0.5, then \xi = 1 and \Lambda(\mathbf{x}) = 1 defines the decision boundary.

Setting

\Lambda(\mathbf{x}) = \frac{\sigma_2^2}{\sigma_1^2} \exp\!\left(-\frac{1}{2\sigma_1^2}\|\mathbf{x} - \boldsymbol{\mu}_1\|^2 + \frac{1}{2\sigma_2^2}\|\mathbf{x} - \boldsymbol{\mu}_2\|^2\right) = 1

and taking logarithms gives

\frac{1}{2\sigma_2^2}\|\mathbf{x} - \boldsymbol{\mu}_2\|^2 - \frac{1}{2\sigma_1^2}\|\mathbf{x} - \boldsymbol{\mu}_1\|^2 = -\log\!\left(\frac{\sigma_2^2}{\sigma_1^2}\right)

Hence the optimum (Bayesian) decision boundary is a circle:

\|\mathbf{x} - \mathbf{x}_c\|^2 = r^2

The resulting probability of error is

P_e = P(e \mid C_1)\,p_1 + P(e \mid C_2)\,p_2 = 0.5 \times 0.1056 + 0.5 \times 0.2642 = 0.1849, \qquad P_c = 1 - P_e = 0.8151
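For the two Gaussian classes of Eqs. (4.56)-(4.57), the Bayes rule can be evaluated in closed form; the Monte-Carlo estimate of P_c below is only a sanity check against the value quoted above, and the sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class-conditional densities of Eqs. (4.56)-(4.57).
mu1, var1 = np.array([0.0, 0.0]), 1.0
mu2, var2 = np.array([2.0, 0.0]), 4.0

def log_density(x, mu, var):
    return -np.log(2 * np.pi * var) - np.sum((x - mu) ** 2, axis=-1) / (2 * var)

def bayes_classify(x):
    """Assign to C1 when Lambda(x) > xi = 1 (equal priors), i.e. log Lambda > 0."""
    return np.where(log_density(x, mu1, var1) > log_density(x, mu2, var2), 1, 2)

# Monte-Carlo estimate of the probability of correct classification P_c.
n = 100_000
x1 = mu1 + np.sqrt(var1) * rng.normal(size=(n, 2))
x2 = mu2 + np.sqrt(var2) * rng.normal(size=(n, 2))
p_c = 0.5 * np.mean(bayes_classify(x1) == 1) + 0.5 * np.mean(bayes_classify(x2) == 2)
print(p_c)   # close to the P_c = 0.8151 quoted above
```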
Parameter setting (1): determining the optimal number of hidden neurons
A single hidden layer is used and the network is trained with BP in sequential mode; the relevant parameters and their ranges are:
  Number of hidden neurons m1: (2, ∞)
  Learning-rate parameter η: (0, 1)
  Momentum constant α: (0, 1)
The goal is to use the fewest hidden neurons that achieve performance close to that of the Bayesian classifier; the search starts with 2 hidden neurons.
The mean-square error in Table 4.2 is obtained from Eq. (4.53).
A small MSE does not necessarily imply good generalization.
Parameter setting (1): determining the optimal number of hidden neurons (cont.)
Suppose the network has been trained on N examples and has converged. The probability of correct classification can then be expressed as

P(c, N) = p_1\,P(c, N \mid C_1) + p_2\,P(c, N \mid C_2),  where  p_1 = p_2 = 0.5

P(c, N \mid C_1) = \int_{\Omega_1(N)} f_X(\mathbf{x}\mid C_1)\,d\mathbf{x}, \qquad P(c, N \mid C_2) = 1 - \int_{\Omega_1(N)} f_X(\mathbf{x}\mid C_2)\,d\mathbf{x}

where \Omega_1(N) denotes the region of the input space that the MLP assigns to class C_1.
In practice, \Omega_1(N) results from training and is difficult to express analytically, so an experimental approach is used: another set of samples, drawn at random with equal probability from C_1 and C_2, is used to evaluate the trained MLP.
Parameter setting (1): determining the optimal number of hidden neurons (cont.)
Let A be the number of correctly classified samples among a set of N test samples drawn independently of the original training samples; then

\hat{p}(N) = \frac{A}{N}

Applying the Chernoff bound,

P(|\hat{p}(N) - p| \ge \varepsilon) \le 2\exp(-2\varepsilon^2 N) = \delta

- With \varepsilon = 0.01 and \delta = 0.01, this gives N \approx 26{,}500.
- A test set of 32,000 samples is therefore used.
- The P_c values reported in Table 4.2 are averages over 10 test runs.
Note: although the larger network attains a lower MSE than the two-hidden-neuron network, its classification error rate is higher.
Parameter setting (2): determining the optimal learning-rate and momentum constants
- A lower MSE over a training set does not necessarily imply good generalization.
- Heuristic and experimental procedures dominate the optimal selection of η and α.
- In this experiment the MLP has two hidden neurons, η ∈ {0.01, 0.1, 0.5, 0.9}, momentum α ∈ {0.0, 0.1, 0.5, 0.9}, a training set of 500 examples, and 700 training epochs.

[Figure: learning curves for η = 0.01, 0.1, 0.5, and 0.9, each with the four momentum constants.]
Best learning curves (MSE) selected from the four parts of the figure
Conclusions:
1. A small η gives slow convergence, but deeper local minima are more easily found.
2. η → 0 with α → 1 increases the speed of convergence; if η → 1, then α must tend to 0 for the system to remain stable.
3. The combinations {η = 0.5, α = 0.9} and {η = 0.9, α = 0.9} cause oscillations, and the MSE at convergence is higher.
Parameter setting (3): determining the optimal network design
Optimal configuration: m_{opt} = 2 hidden neurons, η_{opt} = 0.1, α_{opt} = 0.5; training set = 1,000 patterns; test set = 32,000 patterns.
20 independent runs are used to find the best, worst, and average behavior.

[Figure: MSE learning curves (fastest, average, and slowest of the 20 runs), with the MSE decreasing from about 0.34 toward 0.20 over 50 epochs; and the three best decision boundaries among the 20 runs.]
Generalization Capability
- A network is said to generalize well when the input-output mapping computed by the network is correct for test data never used in creating or training the network.
  - It is assumed that the test data are drawn from the same population used to generate the training data.
- The learning process may be viewed as a "curve-fitting" problem.
- A neural network that is designed to generalize well will produce a correct input-output mapping even when the input is slightly different from the examples used to train the network.
- If a neural network learns too many input-output examples, it may end up memorizing the training data (overfitting, or overtraining).
- When the network is overtrained, it loses the ability to generalize between similar input-output patterns.
Generalization Capability (cont.)
Generalization is affected by:
1. the size of training set
2. the architecture of the network
3. the physical complexity of the problem
Generalization Capability (cont.)
- Sufficient training set size for a valid generalization
  - With the network architecture fixed, how large a training set is needed to obtain good generalization?
  - With the training set fixed, what network architecture achieves good generalization?
  - Based on the VC dimension,

    N = O\!\left(\frac{W}{\varepsilon}\right)    (4.85)

    where W is the total number of free parameters (synaptic weights and biases) and ε denotes the fraction of classification errors permitted on test data.
  - For a 10% error, the number of training examples should be about 10 times the number of free parameters.
Generalization Capability (cont.)
Let d[-1,1] : classification
M: total number of hidden nodes
e: fraction of error permitted on the test
w: total number of weight
According to Baum & Haussler ( 1989 ) :
The network will almost certainly provide generalization if :
1: The fraction of errors on training set is less than e / 2
2: N  32w ln( 32 M )
e
e
training set size
資訊工程所 醫學影像處理實驗室(Medical Image Processing Lab. )
Graduate School of Computer Science & Information Engineering
60
Approximations of functions
- How many hidden layers are needed for an MLP to realize an approximation of any continuous function?
- How does an MLP provide the capability to approximate any continuous mapping? (Universal Approximation Theorem: Cybenko 1989, Funahashi 1989, Hornik 1989)

Let \varphi(\cdot) be a nonconstant, bounded, monotone-increasing, continuous function (this plays the role of the MLP's hidden-layer output), and let f \in C(I_{m_0}) be a continuous function on the m_0-dimensional unit hypercube I_{m_0} = [0,1]^{m_0}. Then, given any \varepsilon > 0, there exist an integer m_1 and real constants \alpha_i, b_i, and w_{ij} such that

F(x_1, \ldots, x_{m_0}) = \sum_{i=1}^{m_1} \alpha_i\, \varphi\!\left(\sum_{j=1}^{m_0} w_{ij} x_j + b_i\right)

(which corresponds to the MLP output) satisfies

|F(x_1, \ldots, x_{m_0}) - f(x_1, \ldots, x_{m_0})| < \varepsilon  for all  \{x_1, \ldots, x_{m_0}\} \in I_{m_0}

- Therefore, given sufficiently many hidden neurons (m_1 large enough), an MLP can approximate any continuous function on I_{m_0}.
- A single hidden layer is sufficient for an MLP to compute a uniform ε-approximation to a given training set.
Cross-validation
First, the available data set is randomly partitioned into a training set and a test set; the training set is then further partitioned into
a. an estimation subset, used to train the model, and
b. a validation subset, used to evaluate the performance of the model.
1. The test set can be used to check whether the network overfits.
2. Training and validation can proceed in parallel: if the validation performance stops improving and begins to rise, it is time to stop training.
3. The split can also be used to tune η: if the generalization MSE does not keep decreasing, reduce η.
Good generalization can be obtained either with a small network, or with a large network trained for fewer epochs.
Cross-validation (cont.)
- Periodic estimation-followed-by-validation
  - After a period of estimation (training), the synaptic weights and bias levels of the MLP are all fixed, and the network is operated in its forward mode; the validation error is then measured for each example in the validation subset.
  - When the validation phase is completed, the estimation is resumed for another period, and the process is repeated.
- When training proceeds past the minimum of the validation error, the training data contain noise, and training should be stopped at that point.
- If the training data are noise free, how should the early-stopping point be determined?
  - If the estimation and validation errors cannot both approach zero, the network does not have the capacity to model the function exactly; one can then only try to minimize the integrated squared error.
Cross-validation (cont.)
- A statistical theory of the overfitting phenomenon
  - Amari (1996) analyzed what must be watched for when the early-stopping method is used.
  - Batch learning, single hidden layer.
  - Two modes of behavior are identified, depending on the size of the training set.
  - Nonasymptotic mode (N < 30W):
    - The early-stopping method gives better generalization than exhaustive training.
    - Overfitting occurs when N < 30W.
    - The optimal split of the training samples into estimation and validation subsets is

      r_{opt} = 1 - \frac{\sqrt{2W - 1} - 1}{2(W - 1)}

      and for large W, r_{opt} \approx 1 - \frac{1}{\sqrt{2W}}.
    - For example, with W = 100, about 93% of the training samples are assigned to the estimation subset and 7% to the validation subset.
Cross-validation (cont.)
- Asymptotic mode (N > 30W)
  - Exhaustive learning is satisfactory when the size of the training sample is large compared to the number of network parameters.
- Multifold cross-validation
  - Useful when the number of labeled examples is small.
  - Divide the available set of N examples into K subsets.
  - Hold out one subset for validation and train on the remaining subsets; after K trials, average the MSE.
  - Leave-one-out method: N-1 examples are used to train the model, and the model is validated by testing it on the example left out.
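A minimal sketch of multifold cross-validation as described above; `train` and `mse` stand for hypothetical user-supplied routines (fit a model on the estimation subset; evaluate its mean-squared error), which are assumptions of this example.

```python
import numpy as np

def multifold_cv(X, D, K, train, mse):
    """K-fold cross-validation; with K = N this is the leave-one-out method."""
    N = len(X)
    folds = np.array_split(np.random.permutation(N), K)
    errors = []
    for held_out in folds:
        train_idx = np.setdiff1d(np.arange(N), held_out)
        model = train(X[train_idx], D[train_idx])              # estimation subset
        errors.append(mse(model, X[held_out], D[held_out]))    # validation subset
    return np.mean(errors)
```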
Network Pruning Techniques
- The goal is to minimize the size of the network while maintaining good performance.
- Network growing
  - Start with a small network; whenever the design specification cannot be met, add a neuron or a hidden layer.
- Network pruning
  - Start with a large network that solves the problem adequately, and then delete certain neurons or synaptic connections.
- Only network pruning is discussed here; two approaches are considered:
  - regularization of certain synaptic connections of the network, and
  - deletion of certain synaptic connections from the network.
Network Pruning Techniques
- Complexity Regularization
  - We need an appropriate tradeoff between reliability of the training data and goodness of the model.
  - This is achieved by minimizing the total risk

    R(\mathbf{w}) = E_s(\mathbf{w}) + \lambda\,E_c(\mathbf{w})    (4.94)

    where E_s(w) is the standard performance measure (for an MLP, the mean-squared error), E_c(w) is the complexity penalty, and λ is a regularization parameter.
  - The complexity penalty E_c(w) can be chosen as a smoothing term that makes the k-th derivative of F(x, w) with respect to the input vector x small, i.e., a penalty on the lack of smoothness of F(x, w).    (4.95)
Network Pruning Techniques
- Three complexity-regularization methods are considered.
- Weight decay
  - Forces some of the synaptic weights in the network toward zero, while permitting the others to retain relatively large values.
  - The weights are thus roughly divided into two groups: those with a large influence and those with little influence (excess weights).
  - The excess weights are driven toward zero to improve generalization:

    E_c(\mathbf{w}) = \|\mathbf{w}\|^2 = \sum_{i \in C_{total}} w_i^2    (4.96)
Network Pruning Techniques
- Weight elimination
  - The complexity penalty is defined as

    E_c(\mathbf{w}) = \sum_{i \in C_{total}} \frac{(w_i / w_0)^2}{1 + (w_i / w_0)^2}    (4.97)

    where w_0 is a preassigned parameter.
  - When |w_i| << w_0, the complexity penalty for w_i approaches zero; the weight of synapse i is then unreliable and should be eliminated from the network.
  - When |w_i| >> w_0, the complexity penalty for w_i approaches 1; the weight of synapse i is then important.
Network Pruning
[Figure: the weight-elimination penalty as a function of w_i / w_0; when |w_i| << w_0 the complexity penalty approaches 0.]
Network Pruning
- Approximate smoother
  - Applies to an MLP with a single hidden layer and a single output neuron:

    E_c(\mathbf{w}) = \sum_{j=1}^{M} w_{oj}^2\,\|\mathbf{w}_j\|^{p}    (4.98)

    where w_{oj} is the output-layer weight connected to hidden neuron j, and \mathbf{w}_j is the weight vector of the j-th hidden neuron.
  - This method is better than weight decay and weight elimination because:
    - it treats the hidden-layer and output-layer synapses separately;
    - it captures the interaction between the two sets of weights;
    - but it is more complicated to apply.
Hessian-based network pruning (newer formulation)
- Second-order information about the error surface is used to prune the network, the aim being a tradeoff between network complexity and training-error performance.
- A Taylor series is used to build a local approximation of the cost function E_{av} about the operating point w:

  E_{av}(\mathbf{w} + \Delta\mathbf{w}) = E_{av}(\mathbf{w}) + \mathbf{g}^T(\mathbf{w})\,\Delta\mathbf{w} + \frac{1}{2}\,\Delta\mathbf{w}^T \mathbf{H}\,\Delta\mathbf{w} + O(\|\Delta\mathbf{w}\|^3)

  where \Delta\mathbf{w} is a perturbation, \mathbf{g}(\mathbf{w}) is the gradient evaluated at w, and \mathbf{H} is the Hessian matrix.
- The goal is to identify the parameters whose deletion causes the least increase in the cost function. Two approximations are made:
  - Extremal approximation: parameters are deleted only after the network has been trained to convergence, so the gradient g is set to zero.
  - Quadratic approximation: the error surface around a local or global minimum is nearly quadratic, so the higher-order terms can be neglected.
- Hence, because g(w) = 0 when the network has converged,

  \Delta E_{av} = E_{av}(\mathbf{w} + \Delta\mathbf{w}) - E_{av}(\mathbf{w}) \approx \frac{1}{2}\,\Delta\mathbf{w}^T \mathbf{H}\,\Delta\mathbf{w}
Hessian Matrix of the Error Surface (older formulation)
After convergence, certain weights are removed while disturbing E as little as possible.
Expanding E in a Taylor series in the weight perturbations \delta w_i:

\delta E = \sum_i g_i\,\delta w_i + \frac{1}{2} \sum_i \sum_j h_{ij}\,\delta w_i\,\delta w_j + \text{higher-order terms}

where g_i = \frac{\partial E}{\partial w_i} and h_{ij} = \frac{\partial^2 E}{\partial w_i \partial w_j}.

1. At an equilibrium point (a local minimum), \frac{\partial E}{\partial w_i} = 0.
2. If E is a quadratic function, the higher-order terms \approx 0. Then

   \delta E \approx \frac{1}{2} \sum_i \sum_j h_{ij}\,\delta w_i\,\delta w_j = \frac{1}{2}\,\delta\mathbf{w}^T \mathbf{H}\,\delta\mathbf{w}

Removing a weight w_i is equivalent to adding a perturbation \Delta w_i such that the i-th weight becomes zero, i.e., imposing the constraint

\mathbf{1}_i^T \Delta\mathbf{w} + w_i = 0

where \mathbf{1}_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T is the unit vector with a 1 in the i-th position.
To minimize \delta E subject to the constraint \mathbf{1}_i^T \Delta\mathbf{w} + w_i = 0, form the Lagrangian

S = \frac{1}{2}\,\Delta\mathbf{w}^T \mathbf{H}\,\Delta\mathbf{w} + \lambda\,(\mathbf{1}_i^T \Delta\mathbf{w} + w_i)
Solving the Lagrangian S with respect to \Delta\mathbf{w}
1. Setting the derivatives of S with respect to the components of \Delta\mathbf{w} to zero:

   \frac{\partial S}{\partial \Delta w_i} = \sum_j h_{ij}\,\Delta w_j + \lambda = 0, \qquad \frac{\partial S}{\partial \Delta w_k} = \sum_j h_{kj}\,\Delta w_j = 0 \ \ \text{for } k \neq i

   In vector form, \mathbf{H}\,\Delta\mathbf{w} + \lambda\,\mathbf{1}_i = \mathbf{0}, hence

   \Delta\mathbf{w} = -\lambda\,\mathbf{H}^{-1}\mathbf{1}_i    (*)

2. Substituting \Delta\mathbf{w} into the constraint \mathbf{1}_i^T \Delta\mathbf{w} + w_i = 0:

   -\lambda\,[\mathbf{H}^{-1}]_{ii} + w_i = 0 \quad\Rightarrow\quad \lambda = \frac{w_i}{[\mathbf{H}^{-1}]_{ii}}

   where [\mathbf{H}^{-1}]_{ii} is the (i, i)-th element of \mathbf{H}^{-1}.

3. Substituting \lambda back into (*):

   \Delta\mathbf{w} = -\frac{w_i}{[\mathbf{H}^{-1}]_{ii}}\,\mathbf{H}^{-1}\mathbf{1}_i

   This is the change required in all the weights in order to remove w_i while disturbing the network as little as possible.
Substituting \Delta\mathbf{w} back into the Lagrangian S

Substituting \Delta\mathbf{w} = -\frac{w_i}{[\mathbf{H}^{-1}]_{ii}}\,\mathbf{H}^{-1}\mathbf{1}_i into S = \frac{1}{2}\,\Delta\mathbf{w}^T\mathbf{H}\,\Delta\mathbf{w} + \lambda\,(\mathbf{1}_i^T\Delta\mathbf{w} + w_i):

- First term:

  \frac{1}{2}\left(\frac{w_i}{[\mathbf{H}^{-1}]_{ii}}\right)^2 \mathbf{1}_i^T\mathbf{H}^{-1}\mathbf{H}\mathbf{H}^{-1}\mathbf{1}_i = \frac{1}{2}\,\frac{w_i^2}{[\mathbf{H}^{-1}]_{ii}}

- Second term: zero by the constraint, which can be verified from

  \mathbf{1}_i^T\left(-\frac{w_i}{[\mathbf{H}^{-1}]_{ii}}\,\mathbf{H}^{-1}\mathbf{1}_i\right) + w_i = -w_i + w_i = 0

  confirming that w_i is indeed removed after the update \Delta\mathbf{w}.

Hence the saliency of weight i is

S_i = \frac{1}{2}\,\frac{w_i^2}{[\mathbf{H}^{-1}]_{ii}}
Computing the inverse Hessian matrix H^{-1}
When the dimension is large, H^{-1} is difficult to compute directly, where \mathbf{H} = \frac{\partial^2 E_{av}}{\partial \mathbf{w}^2}.
Starting from

E_{av}(\mathbf{w}) = \frac{1}{2N}\sum_{n=1}^{N}\bigl(d(n) - o(n)\bigr)^2 = \frac{1}{2N}\sum_{n=1}^{N}\bigl(d(n) - F(\mathbf{w}, \mathbf{x})\bigr)^2

where N is the total number of examples, the gradient is

\frac{\partial E_{av}}{\partial \mathbf{w}} = -\frac{1}{N}\sum_{n=1}^{N}\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial \mathbf{w}}\bigl(d(n) - F(\mathbf{w}, \mathbf{x})\bigr)

and the Hessian is

\mathbf{H}(N) = \frac{\partial^2 E_{av}}{\partial \mathbf{w}^2} = \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial \mathbf{w}}\left(\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial \mathbf{w}}\right)^T - \frac{\partial^2 F(\mathbf{w}, \mathbf{x})}{\partial \mathbf{w}^2}\bigl(d(n) - F(\mathbf{w}, \mathbf{x})\bigr)\right]

When the network is fully trained, d(n) - o(n) \approx 0, so

\mathbf{H}(N) \approx \frac{1}{N}\sum_{n=1}^{N}\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial \mathbf{w}}\left(\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial \mathbf{w}}\right)^T
Let \boldsymbol{\xi}(n) = \frac{1}{\sqrt{N}}\,\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial \mathbf{w}}. Then

\mathbf{H}(n) = \sum_{k=1}^{n} \boldsymbol{\xi}(k)\,\boldsymbol{\xi}^T(k) = \mathbf{H}(n-1) + \boldsymbol{\xi}(n)\,\boldsymbol{\xi}^T(n), \qquad n = 1, 2, \ldots, N    (4.110)

By the matrix inversion lemma, if \mathbf{A} = \mathbf{B}^{-1} + \mathbf{C}\mathbf{D}\mathbf{C}^T, then

\mathbf{A}^{-1} = \mathbf{B} - \mathbf{B}\mathbf{C}\,(\mathbf{D}^{-1} + \mathbf{C}^T\mathbf{B}\mathbf{C})^{-1}\,\mathbf{C}^T\mathbf{B}    (*)

Comparing Eq. (4.110) with (*): \mathbf{A} = \mathbf{H}(n), \mathbf{B}^{-1} = \mathbf{H}(n-1), \mathbf{C} = \boldsymbol{\xi}(n), \mathbf{D} = 1. Then we have

\mathbf{H}^{-1}(n) = \mathbf{H}^{-1}(n-1) - \frac{\mathbf{H}^{-1}(n-1)\,\boldsymbol{\xi}(n)\,\boldsymbol{\xi}^T(n)\,\mathbf{H}^{-1}(n-1)}{1 + \boldsymbol{\xi}^T(n)\,\mathbf{H}^{-1}(n-1)\,\boldsymbol{\xi}(n)}

where the denominator is a scalar. Initializing \mathbf{H}^{-1}(0) = \delta^{-1}\mathbf{I} with a small constant \delta, \mathbf{H}^{-1}(n) can be computed iteratively.
Virtues and limitations of BP Training
- Properties of BP:
  - It is gradient descent, not an optimization technique.
  - It is simple to compute locally.
  - It performs stochastic gradient descent.
- Connectionism
- Feature detection
  - Hidden neurons play a critical role as feature detectors.
- Function approximation
  - An MLP trained with BP is a nested sigmoidal scheme.
- Computational efficiency
  - The computational complexity of each BP update is polynomial in the number of weights, so the BP algorithm is computationally efficient.
- Sensitivity analysis
  - BP permits a sensitivity analysis of the input-output relation realized by the network.
- Robustness
Virtues and limitations of BP Training
- Convergence
  - BP uses an instantaneous estimate of the gradient of the error surface in weight space; it is therefore stochastic in nature.
  - When the error surface is fairly flat, convergence is slow; when it is steeply curved, the algorithm may overshoot the minimum of the error surface.
  - The direction of the negative gradient vector may point away from the minimum (in a wrong direction).
- Local minima
  - BP learning is basically a hill-climbing technique, so it runs the risk of being trapped in a local minimum.
- Scaling
  - How well does the network behave as the computational task increases in size and complexity?
  - For example, to compute the parity function, the computation time was found to increase exponentially with the number of inputs.
  - Using a fully connected MLP is therefore unwise; the network architecture and the synaptic weights should incorporate prior information about the problem to be solved.
Accelerated Convergence
1. Every adjustable network parameter should have its own learning-rate parameter.
2. Every learning-rate parameter should be allowed to vary from one iteration to the next.
3. When the partial derivative \partial E / \partial w_i keeps the same sign over consecutive iterations, the learning rate of that weight should be increased.
4. When the sign of \partial E / \partial w_i alternates over consecutive iterations, the learning rate of that weight should be decreased.
In other words, use a different and time-varying \eta for each weight.
Supervised Learning Viewed as an Optimization Problem
Supervised learning of an MLP can be viewed as a numerical optimization problem.
A Taylor series is used to build a local approximation of the cost function E_{av} about the current point w(n):

E_{av}(\mathbf{w}(n) + \Delta\mathbf{w}(n)) \approx E_{av}(\mathbf{w}(n)) + \mathbf{g}^T(n)\,\Delta\mathbf{w}(n) + \frac{1}{2}\,\Delta\mathbf{w}^T(n)\,\mathbf{H}(n)\,\Delta\mathbf{w}(n) + (\text{higher-order terms})

where \Delta\mathbf{w}(n) is a perturbation, and the gradient and Hessian at w(n) are

\mathbf{g}(n) = \left.\frac{\partial E_{av}(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}(n)}, \qquad \mathbf{H}(n) = \left.\frac{\partial^2 E_{av}(\mathbf{w})}{\partial \mathbf{w}^2}\right|_{\mathbf{w} = \mathbf{w}(n)}
Supervised Learning Viewed as an Optimization Problem (cont.)
The standard steepest-descent weight correction of the BP algorithm is

\Delta\mathbf{w}(n) = -\eta\,\mathbf{g}(n)

Its advantage is ease of implementation, but its drawback is slow convergence on large problems; adding a momentum term helps, but introduces yet another parameter to tune.
To obtain a significant improvement in convergence, higher-order information must be used in the training process.
Assuming a quadratic approximation of the error surface around the current point w(n), the optimal correction of the weight vector is

\Delta\mathbf{w}^*(n) = -\mathbf{H}^{-1}(n)\,\mathbf{g}(n)

The problem is that \mathbf{H}^{-1} is difficult to compute. The conjugate-gradient method is therefore adopted: it accelerates convergence while avoiding the complexity of computing \mathbf{H}^{-1}.
Conjugate-Gradient Method
- A-conjugate vectors
  - Given the matrix A, a set of nonzero vectors s(0), s(1), ..., s(W-1) is said to be A-conjugate if

    \mathbf{s}^T(n)\,\mathbf{A}\,\mathbf{s}(j) = 0  for all  n \neq j

  - A-conjugate vectors are linearly independent.
  - <Proof sketch>: suppose they were linearly dependent, so that s(0) could be written as a combination of the other vectors. Multiplying that relation on the left by s^T(0)A and using A-conjugacy makes every term on the right vanish, leaving s^T(0)A s(0) = 0. Since A is positive definite (by assumption) and s(0) is nonzero, this expression cannot be zero, a contradiction.
Conjugate-Gradient Method
Consider the minimization of the quadratic function

f(\mathbf{x}) = \frac{1}{2}\,\mathbf{x}^T\mathbf{A}\,\mathbf{x} - \mathbf{b}^T\mathbf{x} + c    (4.122)

whose minimizer satisfies \mathbf{A}\mathbf{x} - \mathbf{b} = \mathbf{0}. Here A is a W-by-W symmetric, positive definite matrix, so the solution is

\mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}    (4.123)
Conjugate-Gradient Method
Example 4.1
[Figure: for \mathbf{x} = [x_0, x_1]^T and the transformed variable \mathbf{v} = \mathbf{A}^{1/2}\mathbf{x}, the A-conjugate direction vectors in x-space correspond to orthogonal direction vectors in v-space.]
Conjugate-Gradient Method
- For a given set of A-conjugate vectors s(0), s(1), ..., s(W-1), the conjugate-direction method for unconstrained minimization of f(x) is

  \mathbf{x}(n+1) = \mathbf{x}(n) + \eta(n)\,\mathbf{s}(n), \qquad n = 0, 1, \ldots, W-1    (4.125)

  where \eta(n) is a scalar defined by the line search

  f(\mathbf{x}(n) + \eta(n)\,\mathbf{s}(n)) = \min_{\eta} f(\mathbf{x}(n) + \eta\,\mathbf{s}(n))    (4.126)
Conjugate-Gradient Method
(1) Substitute Eq. (4.125) into Eq. (4.122) to express f(x) as a function of \eta.
(2) Solve \frac{\partial f(\mathbf{x})}{\partial \eta} = 0 to obtain \eta(n) in closed form.
(3) The conjugate-direction method therefore guarantees that, starting from an arbitrary x(0), the optimal solution \mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b} is reached after at most W iterations.
Accordingly, under the conjugate-direction method, x(n+1) progressively minimizes f(x) over the linear subspace D_n spanned by the directions s(0), ..., s(n).
Conjugate-Gradient Method
Note: the method above requires the A-conjugate vectors s(0), s(1), ..., s(W-1) to be known in advance. They can instead be generated sequentially as

\mathbf{s}(n) = \mathbf{r}(n) + \beta(n)\,\mathbf{s}(n-1), \qquad \mathbf{s}(0) = \mathbf{r}(0)

where \mathbf{r}(n) is the residual; the residual term by itself equals -\frac{\partial f}{\partial \mathbf{x}}, i.e., the steepest-descent (gradient-descent) direction.
Conjugate-Gradient Method
Derivation of \beta(n): s(n) and s(n-1) must be A-conjugate, so

\mathbf{s}^T(n-1)\,\mathbf{A}\,\mathbf{s}(n) = \mathbf{s}^T(n-1)\,\mathbf{A}\,[\mathbf{r}(n) + \beta(n)\,\mathbf{s}(n-1)] = 0

\Rightarrow \beta(n) = -\frac{\mathbf{s}^T(n-1)\,\mathbf{A}\,\mathbf{r}(n)}{\mathbf{s}^T(n-1)\,\mathbf{A}\,\mathbf{s}(n-1)}

The directions s(n) generated with this \beta(n) are mutually A-conjugate.
Conjugate-Gradient Method
For computational reasons, it is desirable to evaluate \beta(n) without explicit knowledge of A. Two such formulas have been proposed, both written in terms of the residuals r(n):
- Polak-Ribiere formula:

  \beta(n) = \frac{\mathbf{r}^T(n)\,[\mathbf{r}(n) - \mathbf{r}(n-1)]}{\mathbf{r}^T(n-1)\,\mathbf{r}(n-1)}

  (superior for non-quadratic optimization)
- Fletcher-Reeves formula:

  \beta(n) = \frac{\mathbf{r}^T(n)\,\mathbf{r}(n)}{\mathbf{r}^T(n-1)\,\mathbf{r}(n-1)}
Conjugate-Gradient Method
1. For non-quadratic problems, the Polak-Ribiere form is superior to the Fletcher-Reeves form.
   (a) During the search the directions may gradually deteriorate, causing the algorithm to jam: s(n) becomes nearly orthogonal to the residual r(n) and r(n) \approx r(n-1). The Polak-Ribiere \beta(n) then becomes \approx 0, so s(n) \approx r(n), breaking the jam; hence this form behaves better.
   (b) The Polak-Ribiere method may nevertheless sometimes cycle indefinitely. The remedy is to use

       \beta = \max\{\beta_{PR}, 0\}

       that is, whenever \beta_{PR} < 0, set \beta = 0, which is equivalent to restarting the search.
Conjugate-Gradient Method
When A is not known, \eta(n) is computed by a line search: minimize E_{av}(\mathbf{w} + \eta\,\mathbf{s}) with respect to \eta, where \mathbf{w} + \eta\,\mathbf{s} traces a line in the W-dimensional vector space of w.
- Bracketing phase: find a bracket (an interval of \eta values) that contains the minimum.
- Sectioning phase: generate the next sequence of \eta values, narrowing the bracket.
Conjugate-Gradient Method
- Example: suppose \eta_1 < \eta_2 < \eta_3 with E_{av}(\eta_1) \ge E_{av}(\eta_2) and E_{av}(\eta_3) \ge E_{av}(\eta_2) (that is, between \eta_1 and \eta_3 there is an \eta_2 whose E_{av}(\eta_2) is smaller than both E_{av}(\eta_1) and E_{av}(\eta_3)). Then the minimum lies in the interval [\eta_1, \eta_3].
Conjugate-Gradient Method
若以1 , 2 , 3 做一 parabolic,找到此 parabolic的頂點4 ,
測 e av 4 
資訊工程所 醫學影像處理實驗室(Medical Image Processing Lab. )
Graduate School of Computer Science & Information Engineering
95
Conjugate-Gradient Method
If E_{av}(\eta_4) < E_{av}(\eta_2), then we have \eta_1 < \eta_2 < \eta_4 < \eta_3 with E_{av}(\eta_1) > E_{av}(\eta_3) > E_{av}(\eta_2) > E_{av}(\eta_4), which indicates that the minimum lies between \eta_2 and \eta_3; the bracket is narrowed accordingly and the process is repeated.