Hierarchical Mixture of Experts
Presented by Qi An
Machine learning reading group
Duke University
07/15/2005
Outline

- Background
- Hierarchical tree structure
- Gating networks
- Expert networks
- E-M algorithm
- Experimental results
- Conclusions
Background

- The idea of mixture of experts
  - First presented by Jacobs and Hinton in 1988
- Hierarchical mixture of experts
  - Proposed by Jordan and Jacobs in 1994
- Difference from previous mixture models
  - The mixing weights depend on both the input and the output
Example (ME)

One-layer structure with an ellipsoidal gating function.

[Figure: a gating network produces weights g1, g2, g3 from the input x; three expert networks produce outputs μ1, μ2, μ3 from x, which are blended into the overall output μ.]
Example (HME)

Hierarchical tree structure with linear gating functions.

[Figure: a tree of gating networks at the internal nodes, with expert networks at the leaves.]
Expert network

At the leaves of the tree, for each expert:

  $\xi_{ij} = U_{ij} x$   (linear predictor)
  $\mu_{ij} = f(\xi_{ij})$   (output of the expert)

where $f$ is the link function, for example the logistic function for binary classification.
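
As a concrete illustration, here is a minimal NumPy sketch of one expert network, assuming a scalar-output expert with a logistic link for binary classification; the names (`expert_output`, `U`, `logistic`) are illustrative, not taken from the original code.

```python
import numpy as np

def logistic(z):
    """Logistic link function f, as suggested on the slide for binary classification."""
    return 1.0 / (1.0 + np.exp(-z))

def expert_output(U_ij, x, link=logistic):
    """mu_ij = f(xi_ij) with linear predictor xi_ij = U_ij @ x."""
    xi_ij = U_ij @ x          # linear predictor
    return link(xi_ij)        # output of the expert

# Usage: one expert mapping a 3-d input to a probability in (0, 1).
U = np.array([[0.5, -0.2, 0.1]])
x = np.array([1.0, 2.0, -1.0])
print(expert_output(U, x))
```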
Gating network

At the nonterminal nodes of the tree.

Top layer:

  $\xi_i = v_i^T x$,   $g_i = \dfrac{\exp(\xi_i)}{\sum_k \exp(\xi_k)}$

Other layers:

  $\xi_{ij} = v_{ij}^T x$,   $g_{j|i} = \dfrac{\exp(\xi_{ij})}{\sum_k \exp(\xi_{ik})}$
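
A minimal sketch of one gating network using the softmax form above; `gating_weights` and the shapes are illustrative assumptions.

```python
import numpy as np

def gating_weights(V, x):
    """Rows of V are the v_i; returns g_i = exp(xi_i) / sum_k exp(xi_k)."""
    xi = V @ x            # xi_i = v_i^T x
    xi = xi - xi.max()    # shift for numerical stability (does not change the softmax)
    e = np.exp(xi)
    return e / e.sum()

# Usage: three gates on a 3-d input; the weights are positive and sum to one.
V = np.array([[ 0.2, 0.1, -0.3],
              [ 0.0, 0.5,  0.4],
              [-0.1, 0.2,  0.0]])
x = np.array([1.0, -1.0, 0.5])
g = gating_weights(V, x)
print(g, g.sum())
```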
Output

At the non-leaf nodes.

Top node:

  $\mu = \sum_i g_i \mu_i$

Other nodes:

  $\mu_i = \sum_j g_{j|i} \mu_{ij}$
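
For a two-level tree, this blending is just two weighted sums; a minimal sketch, with array shapes chosen for illustration:

```python
import numpy as np

def hme_output(g_top, g_nested, mu_experts):
    """g_top: (I,), g_nested: (I, J), mu_experts: (I, J, d) -> overall mean mu (d,)."""
    # mu_i = sum_j g_{j|i} mu_ij   (blend the experts under each branch)
    mu_branch = np.einsum('ij,ijd->id', g_nested, mu_experts)
    # mu = sum_i g_i mu_i          (blend the branches at the top node)
    return np.einsum('i,id->d', g_top, mu_branch)

# Usage: 2 branches x 2 experts with scalar outputs.
g_top = np.array([0.7, 0.3])
g_nested = np.array([[0.6, 0.4], [0.5, 0.5]])
mu_experts = np.array([[[1.0], [2.0]], [[3.0], [4.0]]])
print(hme_output(g_top, g_nested, mu_experts))
```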
Probability model

For each expert, assume the true output y is drawn from a distribution P with mean $\mu_{ij}$:

  $P(y \mid x, \theta_{ij})$

Therefore, the total probability of generating y from x is given by

  $P(y \mid x, \theta) = \sum_i g_i(x, v_i) \sum_j g_{j|i}(x, v_{ij}) \, P(y \mid x, \theta_{ij})$
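
A minimal sketch of this mixture likelihood for a two-level tree, assuming (for illustration only) Gaussian experts with unit variance:

```python
import numpy as np

def gaussian_expert_density(y, mu_ij, sigma=1.0):
    """P(y | x, theta_ij) for an assumed Gaussian expert with mean mu_ij."""
    return np.exp(-0.5 * ((y - mu_ij) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def total_likelihood(g_top, g_nested, mu_experts, y):
    """P(y | x, theta) = sum_i g_i sum_j g_{j|i} P(y | x, theta_ij)."""
    p_experts = gaussian_expert_density(y, mu_experts)   # P_ij(y), shape (I, J)
    return float(np.sum(g_top[:, None] * g_nested * p_experts))

# Usage with a small 2 x 2 tree.
g_top = np.array([0.7, 0.3])
g_nested = np.array([[0.6, 0.4], [0.5, 0.5]])
mu_experts = np.array([[1.0, 2.0], [3.0, 4.0]])
print(total_likelihood(g_top, g_nested, mu_experts, y=2.0))
```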
Posterior probabilities

Since $g_{j|i}$ and $g_i$ are computed based only on the input x, we refer to them as prior probabilities. With knowledge of both the input x and the output y, we can define the posterior probabilities using Bayes' rule:

  $h_{j|i} = \dfrac{g_{j|i} P_{ij}(y)}{\sum_j g_{j|i} P_{ij}(y)}$

  $h_i = \dfrac{g_i \sum_j g_{j|i} P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i} P_{ij}(y)}$
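
A minimal sketch of these two Bayes'-rule computations, given the prior gates and the expert densities $P_{ij}(y)$; the numeric values here are hypothetical:

```python
import numpy as np

def posteriors(g_top, g_nested, p_experts):
    """g_top: (I,), g_nested: (I, J), p_experts[i, j] = P_ij(y) -> (h_i, h_{j|i})."""
    branch = np.sum(g_nested * p_experts, axis=1)          # sum_j g_{j|i} P_ij(y)
    h_nested = (g_nested * p_experts) / branch[:, None]    # h_{j|i}
    h_top = g_top * branch / np.sum(g_top * branch)        # h_i
    return h_top, h_nested

# Usage: the posterior mass shifts toward the branch/expert that explains y best.
g_top = np.array([0.7, 0.3])
g_nested = np.array([[0.6, 0.4], [0.5, 0.5]])
p_experts = np.array([[0.40, 0.20], [0.05, 0.01]])   # hypothetical P_ij(y)
print(posteriors(g_top, g_nested, p_experts))
```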
E-M algorithm

Introduce auxiliary variables $z_{ij}$, which have an interpretation as the labels that correspond to the experts. The probability model can be simplified with knowledge of the auxiliary variables:

  $P(y^{(t)}, z^{(t)} \mid x^{(t)}, \theta) = \prod_i \prod_j \left[ g_i^{(t)} g_{j|i}^{(t)} P_{ij}(y^{(t)}) \right]^{z_{ij}^{(t)}}$
E-M algorithm

Complete-data likelihood:

  $l_c(\theta; y) = \sum_t \sum_i \sum_j z_{ij}^{(t)} \left[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}(y^{(t)}) \right]$

The E-step:

  $Q(\theta, \theta^{(p)}) = E_z\!\left[ l_c(\theta; y) \right] = \sum_t \sum_i \sum_j h_{ij}^{(t)} \left[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}(y^{(t)}) \right]$
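
A minimal sketch of evaluating Q, taking $h_{ij}^{(t)} = h_i^{(t)} h_{j|i}^{(t)}$ as the joint posterior; the array shapes (T samples, I branches, J experts per branch) and the sample values are illustrative assumptions:

```python
import numpy as np

def q_function(h_top, h_nested, g_top, g_nested, p_experts):
    """h_top, g_top: (T, I); h_nested, g_nested, p_experts: (T, I, J) -> scalar Q."""
    h_joint = h_top[:, :, None] * h_nested      # h_ij^(t) = h_i^(t) * h_{j|i}^(t)
    terms = (np.log(g_top)[:, :, None]          # ln g_i^(t)
             + np.log(g_nested)                 # ln g_{j|i}^(t)
             + np.log(p_experts))               # ln P_ij(y^(t))
    return float(np.sum(h_joint * terms))

# Usage: one sample (T = 1) on a 2 x 2 tree with made-up probabilities.
T, I, J = 1, 2, 2
g_top = np.full((T, I), 0.5); g_nested = np.full((T, I, J), 0.5)
h_top = np.array([[0.8, 0.2]]); h_nested = np.full((T, I, J), 0.5)
p_experts = np.full((T, I, J), 0.3)
print(q_function(h_top, h_nested, g_top, g_nested, p_experts))
```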
E-M algorithm

The M-step:

  $\theta_{ij}^{p+1} = \arg\max_{\theta_{ij}} \sum_t h_{ij}^{(t)} \ln P_{ij}(y^{(t)})$

  $v_i^{p+1} = \arg\max_{v_i} \sum_t \sum_k h_k^{(t)} \ln g_k^{(t)}$

  $v_{ij}^{p+1} = \arg\max_{v_{ij}} \sum_t \sum_k h_k^{(t)} \sum_l h_{l|k}^{(t)} \ln g_{l|k}^{(t)}$
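
For illustration, the first maximization has a closed form when the experts are assumed Gaussian with linear means: it reduces to weighted least squares with weights $h_{ij}^{(t)}$. A minimal sketch under that assumption (the gating updates instead require IRLS, introduced on the next slide):

```python
import numpy as np

def m_step_expert(X, y, h_ij):
    """Weighted least squares: argmax_u sum_t h_ij^(t) ln N(y^(t); u^T x^(t), 1)."""
    XtW = X.T * h_ij                 # X^T diag(h_ij)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Usage: fit one expert on samples weighted by hypothetical responsibilities.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
u_true = np.array([1.0, -2.0, 0.5])
y = X @ u_true + 0.1 * rng.normal(size=50)
h = rng.uniform(0.1, 1.0, size=50)   # hypothetical h_ij^(t)
print(m_step_expert(X, y, h))        # approximately recovers u_true
```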
IRLS

Iteratively reweighted least squares algorithm:

- An iterative algorithm for computing the maximum likelihood estimates of the parameters of a generalized linear model
- A special case of the Fisher scoring method:

  $\theta^{r+1} = \theta^{r} - \left[ E\!\left( \dfrac{\partial^2 l}{\partial \theta \, \partial \theta^T} \right) \right]^{-1} \dfrac{\partial l}{\partial \theta}$
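
A minimal, self-contained sketch of IRLS / Fisher scoring for a logistic-regression GLM, as a generic illustration of the update above rather than the paper's exact inner loop:

```python
import numpy as np

def irls_logistic(X, y, n_iter=20):
    """Fisher scoring for logistic regression: X (T, d), y in {0, 1} (T,)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))     # current fitted means
        W = p * (1.0 - p)                        # GLM iteration weights
        grad = X.T @ (y - p)                     # dl / dtheta
        fisher = X.T @ (W[:, None] * X)          # -E(d^2 l / dtheta dtheta^T)
        theta = theta + np.linalg.solve(fisher, grad)
    return theta

# Usage on synthetic data: the estimate approaches the generating weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
print(irls_logistic(X, y))
```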
Algorithm

- E-step
- M-step
Online algorithm

This algorithm can be used for online regression.

For the expert networks:

  $U_{ij}^{(t+1)} = U_{ij}^{(t)} + h_i^{(t)} h_{j|i}^{(t)} \left( y^{(t)} - \mu_{ij}^{(t)} \right) x^{(t)T} R_{ij}^{(t)}$

where $R_{ij}$ is the inverse covariance matrix for EN(i,j), updated as

  $R_{ij}^{(t)} = R_{ij}^{(t-1)} - \dfrac{R_{ij}^{(t-1)} x^{(t)} x^{(t)T} R_{ij}^{(t-1)}}{[h_{ij}^{(t)}]^{-1} + x^{(t)T} R_{ij}^{(t-1)} x^{(t)}}$
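
A minimal sketch of one such recursive least-squares step for a single expert, writing the product $h_i^{(t)} h_{j|i}^{(t)}$ as a single joint posterior `h_ij`; shapes and names are illustrative assumptions:

```python
import numpy as np

def online_expert_update(U_ij, R_prev, x, y, mu_ij, h_ij):
    """One online step: U_ij (m, d), R_prev (d, d), x (d,), y and mu_ij (m,), h_ij scalar."""
    # R_ij^(t) = R_ij^(t-1) - R x x^T R / (h_ij^-1 + x^T R x)
    Rx = R_prev @ x
    R_new = R_prev - np.outer(Rx, Rx) / (1.0 / h_ij + x @ Rx)
    # U_ij^(t+1) = U_ij^(t) + h_ij (y - mu_ij) x^T R_ij^(t)
    U_new = U_ij + h_ij * np.outer(y - mu_ij, R_new @ x)
    return U_new, R_new

# Usage: one update for a scalar-output expert on a 3-d input.
U, R = np.zeros((1, 3)), np.eye(3)
x = np.array([1.0, 0.5, -0.2])
y, mu = np.array([0.8]), np.array([0.1])
print(online_expert_update(U, R, x, y, mu, h_ij=0.6))
```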
Online algorithm

For the gating networks:

  $v_i^{(t+1)} = v_i^{(t)} + S_i^{(t)} \left( \ln h_i^{(t)} - \xi_i^{(t)} \right) x^{(t)}$

where $S_i$ is the inverse covariance matrix, updated as

  $S_i^{(t)} = S_i^{(t-1)} - \dfrac{S_i^{(t-1)} x^{(t)} x^{(t)T} S_i^{(t-1)}}{1 + x^{(t)T} S_i^{(t-1)} x^{(t)}}$

and

  $v_{ij}^{(t+1)} = v_{ij}^{(t)} + S_{ij}^{(t)} h_i^{(t)} \left( \ln h_{j|i}^{(t)} - \xi_{ij}^{(t)} \right) x^{(t)}$

where $S_{ij}$ is the inverse covariance matrix, updated as

  $S_{ij}^{(t)} = S_{ij}^{(t-1)} - \dfrac{S_{ij}^{(t-1)} x^{(t)} x^{(t)T} S_{ij}^{(t-1)}}{[h_i^{(t)}]^{-1} + x^{(t)T} S_{ij}^{(t-1)} x^{(t)}}$
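
A minimal sketch of the top-level gating step (the nested update follows the same pattern with the extra $h_i^{(t)}$ factor); names and shapes are illustrative assumptions:

```python
import numpy as np

def online_gate_update(v_i, S_prev, x, h_i, xi_i):
    """One online step: v_i (d,), S_prev (d, d), x (d,), h_i and xi_i scalars."""
    # S_i^(t) = S_i^(t-1) - S x x^T S / (1 + x^T S x)
    Sx = S_prev @ x
    S_new = S_prev - np.outer(Sx, Sx) / (1.0 + x @ Sx)
    # v_i^(t+1) = v_i^(t) + S_i^(t) (ln h_i^(t) - xi_i^(t)) x^(t)
    v_new = v_i + S_new @ ((np.log(h_i) - xi_i) * x)
    return v_new, S_new

# Usage: one step for a 3-d gating parameter vector.
v, S = np.zeros(3), np.eye(3)
x = np.array([1.0, 0.5, -0.2])
print(online_gate_update(v, S, x, h_i=0.7, xi_i=0.2))
```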
Results

Simulated data of a four-joint robot arm moving in three-dimensional space.

[Figures: experimental results on the robot-arm data.]
Conclusions

- Introduces a tree-structured architecture for supervised learning
- Much faster than the traditional backpropagation algorithm
- Can be used for on-line learning

Thank you
Questions?