Hierarchical Mixture of Experts
Presented by Qi An
Machine learning reading group
Duke University
07/15/2005
Outline
Background
Hierarchical tree structure
Gating networks
Expert networks
E-M algorithm
Experimental results
Conclusions
Background
The idea of mixture of experts: first presented by Jacobs and Hinton in 1988
Hierarchical mixture of experts: proposed by Jordan and Jacobs in 1994
Difference from previous mixture models: the mixing weights depend on both the input and the output
Example (ME)
One-layer structure
[Figure: a gating network with an ellipsoidal gating function maps the input x to weights g1, g2, g3; three expert networks map x to outputs μ1, μ2, μ3; these are combined into the overall output μ.]
Example (HME)
Hierarchical tree structure
[Figure: a tree with linear gating functions at the internal nodes and expert networks at the leaves.]
Expert network
At the leaves of the tree, for each expert:
$\eta_{ij} = U_{ij} x$ (linear predictor)
$\mu_{ij} = f(\eta_{ij})$ (output of the expert, where $f$ is the link function)
For example, $f$ is the logistic function for binary classification.
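As a minimal sketch of an expert's forward pass (the function and argument names here are illustrative, not from the paper; the logistic link is the binary-classification case mentioned above):

```python
import numpy as np

def expert_output(U, x, link=None):
    """One expert at a leaf of the tree: linear predictor
    eta_ij = U_ij x passed through a link function f, so that
    mu_ij = f(eta_ij)."""
    if link is None:
        link = lambda eta: 1.0 / (1.0 + np.exp(-eta))  # logistic link
    eta = U @ x          # linear predictor eta_ij
    return link(eta)     # expert output mu_ij
```

For regression the link is simply the identity, in which case the expert is a plain linear model.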
Gating network
At the nonterminal nodes of the tree.
Top layer: $\xi_i = v_i^T x$, $g_i = \frac{\exp(\xi_i)}{\sum_k \exp(\xi_k)}$
Other layers: $\xi_{ij} = v_{ij}^T x$, $g_{j|i} = \frac{\exp(\xi_{ij})}{\sum_k \exp(\xi_{ik})}$
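The gates above are a softmax over linear scores. A toy helper (assuming the gate weight vectors $v_i$ are stacked as the rows of a matrix `V`; names are illustrative):

```python
import numpy as np

def gating_probs(V, x):
    """Softmax gating: xi_i = v_i^T x, then
    g_i = exp(xi_i) / sum_k exp(xi_k)."""
    xi = V @ x
    xi = xi - xi.max()   # subtract the max for numerical stability
    e = np.exp(xi)
    return e / e.sum()   # the g_i sum to 1
```

The same helper computes the conditional gates $g_{j|i}$ at a lower node, just with that node's weight matrix.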
Output
At the non-leaf nodes.
Top node: $\mu = \sum_i g_i \mu_i$
Other nodes: $\mu_i = \sum_j g_{j|i} \mu_{ij}$
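Putting the gates and experts together, a two-level forward pass might look like the sketch below (scalar regression experts with an identity link for simplicity; the shapes and names are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hme_forward(V_top, V_low, U, x):
    """Two-level HME output: mu = sum_i g_i mu_i, where
    mu_i = sum_j g_{j|i} mu_ij.
    V_top: (I, d) top gate weights, V_low: (I, J, d) lower gate
    weights, U: (I, J, d) expert weights (identity link)."""
    g = softmax(V_top @ x)               # top-level gates g_i
    mu = 0.0
    for i in range(V_top.shape[0]):
        g_cond = softmax(V_low[i] @ x)   # conditional gates g_{j|i}
        mu_ij = U[i] @ x                 # expert means, shape (J,)
        mu += g[i] * (g_cond @ mu_ij)    # add g_i * mu_i
    return mu
```

A sanity check on the blending: if every expert computes the same function, the output is that function regardless of the gate values.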
Probability model
For each expert, assume the true output y is chosen from a distribution P with mean $\mu_{ij}$: $P(y \mid x, \theta_{ij})$.
Therefore, the total probability of generating y from x is given by
$P(y \mid x, \theta) = \sum_i g_i(x, v_i) \sum_j g_{j|i}(x, v_{ij}) P(y \mid x, \theta_{ij})$
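Numerically this mixture likelihood is just a gate-weighted sum of per-expert likelihoods. A minimal sketch (array layout is an assumption for illustration):

```python
import numpy as np

def total_likelihood(g, g_cond, P_y):
    """P(y|x,theta) = sum_i g_i sum_j g_{j|i} P_ij(y|x).
    g: (I,) top gates, g_cond: (I, J) conditional gates,
    P_y: (I, J) per-expert likelihoods P_ij(y|x)."""
    return float(np.sum(g[:, None] * g_cond * P_y))
```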
Posterior probabilities
Since the $g_{j|i}$ and $g_i$ are computed based only on the input x, we refer to them as prior probabilities.
We can define the posterior probabilities, with knowledge of both the input x and the output y, using Bayes' rule:
$h_{j|i} = \frac{g_{j|i} P_{ij}(y)}{\sum_j g_{j|i} P_{ij}(y)}$
$h_i = \frac{g_i \sum_j g_{j|i} P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i} P_{ij}(y)}$
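These two Bayes-rule formulas can be sketched directly in code (array layout as in the earlier likelihood computation; names are illustrative):

```python
import numpy as np

def posteriors(g, g_cond, P_y):
    """h_{j|i} = g_{j|i} P_ij(y) / sum_j g_{j|i} P_ij(y)
    h_i = g_i sum_j g_{j|i} P_ij(y) / sum_i g_i sum_j g_{j|i} P_ij(y).
    g: (I,), g_cond: (I, J), P_y: (I, J)."""
    branch = g_cond * P_y                               # g_{j|i} P_ij(y)
    h_cond = branch / branch.sum(axis=1, keepdims=True)  # h_{j|i}
    weighted = g * branch.sum(axis=1)                    # g_i sum_j (...)
    h = weighted / weighted.sum()                        # h_i
    return h, h_cond
```

By construction $\sum_i h_i = 1$ and $\sum_j h_{j|i} = 1$ for each branch $i$.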
E-M algorithm
Introduce auxiliary variables $z_{ij}$ which have an interpretation as the labels that correspond to the experts.
The probability model can be simplified with knowledge of the auxiliary variables:
$P(y^{(t)}, z_{ij}^{(t)} \mid x^{(t)}, \theta) = \prod_i \prod_j \left[ g_i^{(t)} g_{j|i}^{(t)} P_{ij}(y^{(t)}) \right]^{z_{ij}^{(t)}}$
E-M algorithm
Complete-data likelihood:
$l_c(\theta; Y) = \sum_t \sum_i \sum_j z_{ij}^{(t)} \left[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}(y^{(t)}) \right]$
The E-step:
$Q(\theta, \theta^{(p)}) = E_z[l_c(\theta; Y)] = \sum_t \sum_i \sum_j h_{ij}^{(t)} \left[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}(y^{(t)}) \right]$
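One sample's contribution to the Q function is a posterior-weighted sum of logs, which is short enough to sketch directly (here $h_{ij} = h_i h_{j|i}$ is passed in precomputed; names are illustrative):

```python
import numpy as np

def q_single_sample(h_joint, log_g, log_g_cond, log_P):
    """Q contribution of one sample:
    sum_i sum_j h_ij [ln g_i + ln g_{j|i} + ln P_ij(y)].
    h_joint: (I, J) joint posteriors h_ij,
    log_g: (I,), log_g_cond: (I, J), log_P: (I, J)."""
    return float(np.sum(h_joint * (log_g[:, None] + log_g_cond + log_P)))
```

Summing this over t gives the full $Q(\theta, \theta^{(p)})$ above.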
E-M algorithm
The M-step:
$\theta_{ij}^{(p+1)} = \arg\max_{\theta_{ij}} \sum_t h_{ij}^{(t)} \ln P_{ij}(y^{(t)})$
$v_i^{(p+1)} = \arg\max_{v_i} \sum_t \sum_k h_k^{(t)} \ln g_k^{(t)}$
$v_{ij}^{(p+1)} = \arg\max_{v_{ij}} \sum_t \sum_k h_k^{(t)} \sum_l h_{l|k}^{(t)} \ln g_{l|k}^{(t)}$
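For a linear expert with Gaussian noise, the first M-step problem reduces to weighted least squares with the posteriors $h_{ij}^{(t)}$ as weights. A sketch under that assumption (names are illustrative):

```python
import numpy as np

def expert_m_step(X, y, h):
    """Weighted least squares for one linear-Gaussian expert:
    maximizing sum_t h^(t) ln P(y^(t)) is equivalent to minimizing
    sum_t h^(t) (y^(t) - theta^T x^(t))^2, solved via the weighted
    normal equations X^T H X theta = X^T H y."""
    Xw = X * h[:, None]                    # weight each row by h^(t)
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)
```

The two gating problems are multinomial-logit fits and do not have a closed form; they are solved iteratively, which is where IRLS (next slide) comes in.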
IRLS
Iteratively reweighted least squares algorithm
An iterative algorithm for computing the maximum likelihood estimates of the parameters of a generalized linear model
A special case of the Fisher scoring method:
$\theta^{(r+1)} = \theta^{(r)} - \left[ E\!\left( \frac{\partial^2 l}{\partial \theta \, \partial \theta^T} \right) \right]^{-1} \frac{\partial l}{\partial \theta}$
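As a concrete instance, logistic regression is the canonical GLM where each IRLS iteration is exactly this Fisher-scoring update (a sketch, not the paper's code; all names are illustrative):

```python
import numpy as np

def irls_logistic(X, y, iters=20):
    """IRLS for logistic regression.  Each step applies
    theta <- theta - [E Hessian]^{-1} gradient, where
    gradient = X^T (y - p) and E Hessian = -X^T W X with
    W = diag(p (1 - p))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # current predictions
        W = p * (1.0 - p)                      # GLM working weights
        grad = X.T @ (y - p)
        H = -(X * W[:, None]).T @ X            # expected Hessian
        theta = theta - np.linalg.solve(H, grad)
    return theta
```

At convergence the gradient $X^T(y - p)$ vanishes, which is the maximum-likelihood condition.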
Algorithm
E-step
M-step
Online algorithm
This algorithm can be used for online regression.
For the expert network:
$U_{ij}^{(t+1)} = U_{ij}^{(t)} + h_i^{(t)} h_{j|i}^{(t)} (y^{(t)} - \mu_{ij}^{(t)}) x^{(t)T} R_{ij}^{(t)}$
where $R_{ij}$ is the inverse covariance matrix for EN(i,j), updated by
$R_{ij}^{(t)} = R_{ij}^{(t-1)} - \frac{R_{ij}^{(t-1)} x^{(t)} x^{(t)T} R_{ij}^{(t-1)}}{[h_{ij}^{(t)}]^{-1} + x^{(t)T} R_{ij}^{(t-1)} x^{(t)}}$
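This is a posterior-weighted recursive-least-squares step. A sketch for a scalar-output expert, where the weights form a vector $u$ and $x^{T}R$ becomes $Rx$ since $R$ is symmetric (names and the scalar-output simplification are assumptions):

```python
import numpy as np

def online_expert_step(u, R, x, y, h_i, h_cond):
    """One online update for expert (i,j):
    first update the inverse-covariance R by the rank-one
    recursion (a Sherman-Morrison step weighted by h_ij),
    then u += h_i h_{j|i} (y - mu) R x."""
    h_ij = h_i * h_cond
    Rx = R @ x
    R_new = R - np.outer(Rx, Rx) / (1.0 / h_ij + x @ Rx)
    mu = u @ x                                  # expert prediction
    u_new = u + h_i * h_cond * (y - mu) * (R_new @ x)
    return u_new, R_new
```

The recursion keeps $R$ equal to the inverse of the accumulated weighted covariance, so each sample costs $O(d^2)$ instead of a fresh matrix inversion.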
Online algorithm
For the gating network:
$v_i^{(t+1)} = v_i^{(t)} + S_i^{(t)} (\ln h_i^{(t)} - \xi_i^{(t)}) x^{(t)}$
where $S_i$ is the inverse covariance matrix, and
$S_i^{(t)} = S_i^{(t-1)} - \frac{S_i^{(t-1)} x^{(t)} x^{(t)T} S_i^{(t-1)}}{1 + x^{(t)T} S_i^{(t-1)} x^{(t)}}$
$v_{ij}^{(t+1)} = v_{ij}^{(t)} + S_{ij}^{(t)} h_i^{(t)} (\ln h_{j|i}^{(t)} - \xi_{ij}^{(t)}) x^{(t)}$
where $S_{ij}$ is the inverse covariance matrix, and
$S_{ij}^{(t)} = S_{ij}^{(t-1)} - \frac{S_{ij}^{(t-1)} x^{(t)} x^{(t)T} S_{ij}^{(t-1)}}{[h_i^{(t)}]^{-1} + x^{(t)T} S_{ij}^{(t-1)} x^{(t)}}$
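The gating updates have the same recursive-least-squares shape, with the log-posterior as the regression target. A sketch of the top-level gate step (the unweighted recursion; names are illustrative):

```python
import numpy as np

def online_gate_step(v, S, x, h_target):
    """One online update for a top-level gate weight vector:
    update S by the rank-one recursion, then
    v += S (ln h - xi) x, where xi = v^T x."""
    xi = v @ x                          # current gate score xi_i
    Sx = S @ x
    S_new = S - np.outer(Sx, Sx) / (1.0 + x @ Sx)
    v_new = v + (np.log(h_target) - xi) * (S_new @ x)
    return v_new, S_new
```

The lower-level version differs only by the extra factor $h_i^{(t)}$ on the step and $[h_i^{(t)}]^{-1}$ in the recursion's denominator.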
Results
Simulated data of a four-joint robot arm moving in three-dimensional space
Results
Conclusions
Introduces a tree-structured architecture for supervised learning
Much faster than the traditional backpropagation algorithm
Can be used for on-line learning
Thank you
Questions?