
Hierarchical Dirichlet Process and
Infinite Hidden Markov Model
Paper by Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei,
NIPS 2004
Duke University Machine Learning Group
Presented by Kai Ni
February 17, 2006
Outline
• Motivation
• Dirichlet Processes (DP)
• Hierarchical Dirichlet Processes (HDP)
• Infinite Hidden Markov Model (iHMM)
• Results & Conclusions
Motivation
• Problem – “multi-task learning” in which the “tasks” are
clustering problems.
• Goal – Share clusters among multiple, related clustering problems. The number of clusters is open-ended and inferred automatically by the model.
• Application
– Genome pattern analysis
– Information retrieval over text corpora
Hierarchical Model
• A single clustering problem can be analyzed as a Dirichlet
process (DP).

– G ~ DP(α₀, G₀), with G = Σ_{k=1}^∞ β_k δ_{φ_k}
– Draws G from a DP are discrete, so values sampled from G are generally not distinct.
• For J groups, we consider a group-specific DP G_j for each j = 1,…,J:
– G_j ~ DP(α_{0j}, G_{0j}), with G_j = Σ_{k=1}^∞ π_{jk} δ_{φ_{jk}}
• To share information, we link the group-specific DPs:
– G_j ~ DP(α₀, G₀(τ)). If G₀(τ) is continuous, the draws G_j have no atoms in common with probability one.
– HDP solution: G₀ is itself a draw from a DP(γ, H).
Dirichlet Process &
Hierarchical Dirichlet Process
• Three different perspectives
– Stick-breaking
– Chinese restaurant
– Infinite mixture models
• Setup
– DP: G ~ DP(α₀, G₀)
– HDP: G₀ | γ, H ~ DP(γ, H); G_j | α₀, G₀ ~ DP(α₀, G₀)
• Properties of DP
– The base measure is the mean of the process: E[G(A)] = G₀(A) for any measurable set A.
– The concentration parameter α₀ controls the variability of G around G₀.
– The posterior given draws φ₁,…,φ_n from G is again a DP:
G | φ₁,…,φ_n ~ DP(α₀ + n, (α₀ G₀ + Σ_{i=1}^n δ_{φ_i}) / (α₀ + n))
Stick-breaking View
• An explicit mathematical form of the DP; it shows that draws from a DP are discrete.
• In the DP:
G = Σ_{k=1}^∞ β_k δ_{φ_k}, with β ~ Stick(α₀) and φ_k ~ G₀
• In the HDP:
G_j = Σ_{k=1}^∞ π_{jk} δ_{φ_k}, G₀ = Σ_{k=1}^∞ β_k δ_{φ_k}
with π_j ~ DP(α₀, β), β ~ Stick(γ), and φ_k ~ H
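To make the two constructions concrete, here is a minimal numerical sketch of both (the truncation level K, the Gaussian base measure H, and all parameter values are illustrative assumptions, not from the paper):

```python
import numpy as np

def stick_breaking(concentration, K, rng):
    """Truncated Stick(concentration) draw: beta_k = v_k * prod_{l<k} (1 - v_l)."""
    v = rng.beta(1.0, concentration, size=K)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

rng = np.random.default_rng(0)
K = 20                                    # truncation level (an assumption)
gamma, alpha0 = 1.0, 1.0

# Global level: beta ~ Stick(gamma), atoms phi_k ~ H (standard normal assumed)
beta = stick_breaking(gamma, K, rng)
beta /= beta.sum()                        # renormalise the truncated weights
phi = rng.normal(0.0, 1.0, size=K)

# Group level: pi_j ~ DP(alpha0, beta) reweights the SAME atoms phi_k.
# On K fixed atoms, a DP draw reduces to a Dirichlet with parameter alpha0 * beta.
J = 3
pi = rng.dirichlet(alpha0 * beta, size=J)   # shape (J, K): one weight vector per group
```

The key point, visible in the last line, is that every group reweights the same atoms φ_k; this is how the HDP shares clusters across groups.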
DP – Chinese Restaurant Process
• Exhibits the clustering property of the DP.
• Let φ₁,…,φ_{i−1} be i.i.d. random variables distributed according to G; let θ₁,…,θ_K be the distinct values taken on by φ₁,…,φ_{i−1}, and let n_k be the number of φ_{i'} equal to θ_k for 0 < i' < i. Integrating out G, the predictive distribution is
φ_i | φ₁,…,φ_{i−1} ~ Σ_{k=1}^K n_k/(i−1+α₀) δ_{θ_k} + α₀/(i−1+α₀) G₀
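A minimal sketch of this predictive rule as a sequential sampler (the concentration value and seed are arbitrary):

```python
import numpy as np

def crp_assignments(n, alpha0, rng):
    """Seat n customers one by one: existing table k with probability
    n_k / (i + alpha0), a new table with probability alpha0 / (i + alpha0),
    where i is the number of customers already seated."""
    counts = []                               # n_k: customers at each table
    seating = []
    for i in range(n):
        probs = np.array(counts + [alpha0], dtype=float) / (i + alpha0)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(0)                  # open a new table
        counts[k] += 1
        seating.append(k)
    return seating

rng = np.random.default_rng(0)
print(crp_assignments(20, alpha0=2.0, rng=rng))   # e.g. [0, 0, 1, 0, 2, ...]
```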
HDP – Chinese Restaurant Franchise
• First level: within each group, a DP mixture
– G_j ~ DP(α₀, G₀), φ_ji | G_j ~ G_j, x_ji | φ_ji ~ F(φ_ji)
– Let φ_j1,…,φ_j(i−1) be i.i.d. according to G_j; let ψ_j1,…,ψ_jT_j be the distinct values (tables) taken on by φ_j1,…,φ_j(i−1), and let n_jt be the number of φ_ji' equal to ψ_jt for 0 < i' < i.
• Second level: across groups, sharing clusters
– The base measure of each group is itself a draw from a DP:
ψ_jt | G₀ ~ G₀, G₀ ~ DP(γ, H)
– Let θ₁,…,θ_K be the distinct values (dishes) taken on by the ψ_jt, and let m_k be the number of ψ_jt equal to θ_k over all j, t.
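The two levels combine into the following sketch, in which customers choose tables by a CRP within each restaurant and each new table picks its dish by a CRP over the shared dishes (all parameter values are arbitrary):

```python
import numpy as np

def crf_sample(group_sizes, alpha0, gamma, rng):
    """Chinese restaurant franchise: customers choose tables within their own
    restaurant (first level); each new table chooses a shared dish (second level)."""
    m = []                                     # m_k: tables serving dish k, across all j
    dishes = []                                # dish label per group, per customer
    for n_j in group_sizes:
        table_counts = []                      # n_jt within this restaurant
        table_dish = []                        # dish k served at table t
        labels = []
        for i in range(n_j):
            p = np.array(table_counts + [alpha0], dtype=float)
            t = rng.choice(len(p), p=p / p.sum())
            if t == len(table_counts):         # new table: pick its dish
                q = np.array(m + [gamma], dtype=float)
                k = rng.choice(len(q), p=q / q.sum())
                if k == len(m):
                    m.append(0)                # brand-new dish theta_k ~ H
                m[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            labels.append(table_dish[t])
        dishes.append(labels)
    return dishes                              # shared dish indices per group

rng = np.random.default_rng(0)
print(crf_sample([8, 8, 8], alpha0=1.0, gamma=1.0, rng=rng))
```

Because the dish-level counts m_k are global, popular dishes recur across restaurants: the same cluster can explain data in several groups.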
HDP – CRF graph
• The values of the factors φ_ji are shared between groups, as well as within groups. This is a key property of the HDP.
Integrating out G₀
• Integrating out G₀ in the CRF yields the predictive distribution over dishes:
ψ_jt | ψ_11,…,ψ_j(t−1) ~ Σ_{k=1}^K m_k/(m_· + γ) δ_{θ_k} + γ/(m_· + γ) H, where m_· = Σ_k m_k.
DP Mixture Model
• One of the most important applications of the DP: a nonparametric prior distribution on the components of a mixture model.
• G can be viewed as an infinite mixture model:
G ~ DP(α₀, G₀), G = Σ_{k=1}^∞ β_k δ_{φ_k}
φ_i | G ~ G
x_i | φ_i ~ F(φ_i)
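A minimal generative sketch of this model (truncated stick-breaking, with Gaussian G₀ and Gaussian F assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha0, n = 50, 1.0, 200                # truncation level K is an assumption

# G = sum_k beta_k * delta_{phi_k}
v = rng.beta(1.0, alpha0, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
beta /= beta.sum()                         # renormalise the truncated weights
phi = rng.normal(0.0, 3.0, size=K)         # component parameters phi_k ~ G0 = N(0, 9)

z = rng.choice(K, size=n, p=beta)          # phi_i | G ~ G: select an atom
x = rng.normal(phi[z], 0.5)                # x_i | phi_i ~ F(phi_i) = N(phi_i, 0.25)
```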
HDP mixture model
• The HDP can be used as the prior distribution over the factors for nested, grouped data.
• We consider two levels of DPs: G₀ links the child DPs G_j and forces them to share components; the G_j are conditionally independent given G₀.
Infinite Hidden Markov Model
• The number of hidden states is allowed to be countably infinite.
• The transition probabilities in the i-th row of the transition matrix A can be interpreted as mixing proportions:
π_i = (a_i1, a_i2, …, a_ik, …)
• Thus each row of A in an HMM is a DP. These DPs must also be linked, because they should share the same set of “next states”. The HDP provides the natural framework for the infinite HMM.
iHMM via HDP
• Assign observations to groups, where the groups are indexed by the value of the previous state variable in the sequence. The current state and emission distribution then define a group-specific mixture model (see the sketch below).
• Multiple iHMMs can be linked by adding a further level of Bayesian hierarchy: a master DP couples the iHMMs, each of which is itself a set of DPs.
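A sketch of the resulting generative process, using truncated stick-breaking for the shared weights (the truncation level, Gaussian emissions, and all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha0, gamma, T = 30, 1.0, 1.0, 100    # truncation level K is an assumption

# Shared weights beta tie every row of A to the same countable state set.
v = rng.beta(1.0, gamma, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
beta /= beta.sum()

A = rng.dirichlet(alpha0 * beta, size=K)   # row i: pi_i ~ DP(alpha0, beta)
mu = rng.normal(0.0, 3.0, size=K)          # per-state Gaussian emission means

s = np.empty(T, dtype=int)
y = np.empty(T)
s[0] = rng.choice(K, p=beta)               # initial state from the shared weights
y[0] = rng.normal(mu[s[0]], 0.5)
for t in range(1, T):
    s[t] = rng.choice(K, p=A[s[t - 1]])    # "group" = previous state, per the HDP view
    y[t] = rng.normal(mu[s[t]], 0.5)
```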
HDP & iHMM
                 HDP (CRF aspect)                    iHMM
Group            Restaurant j, j = 1…J (fixed)       Indexed by s_{i−1} (random)
Data             Customer x_ji                       Observation y_i
Hidden factor    Table ψ_ji with dish θ_k ~ H,       State s_i = k with emission
                 k = 1…∞                             row B(s_i, :), k = 1…∞
DP weights       Popularity π_jk, k = 1…∞            Transition row A(s_{i−1}, :)
Likelihood       F(x_ji | φ_ji)                      B(s_i, y_i)
Non-trivialities in iHMM
• The HDP assumes a fixed partition of the data into groups, while the HMM is for time-series data in which the definition of the groups is itself random.
• In the CRF view of the HDP, the number of restaurants is infinite. Moreover, in the sampling scheme, changing s_t may affect the assignments of all subsequent data.
• The CRF is natural for describing the iHMM, but it is awkward for sampling; we need sampling algorithms based on the other representations of the iHMM.
HDP Results
iHMM Results
Conclusion
• The HDP is a hierarchical, nonparametric model for clustering problems involving multiple groups of data.
• The mixture components are shared across groups, and the appropriate number of components is determined automatically by the HDP.
• The HDP extends to the infinite HMM, providing an effective inference algorithm for it.
Reference
• Y.W. Teh, M.I. Jordan, M.J. Beal and D.M. Blei, “Sharing
Clusters among Related Groups: Hierarchical Dirichlet
Processes”, NIPS 2004.
• M.J. Beal, Z. Ghahramani and C.E. Rasmussen, “The Infinite Hidden Markov Model”, NIPS 2002.
• Y.W. Teh, M.I. Jordan, M.J. Beal and D.M. Blei,
“Hierarchical Dirichlet Processes”, Revised version to
appear in JASA, 2006.