
Sharing Clusters Among Related Groups:
Hierarchical Dirichlet Processes
Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei
NIPS 2004
Presented by Yuting Qi
ECE Dept., Duke Univ.
08/26/05
Overview

Motivation
Dirichlet Processes
Hierarchical Dirichlet Processes
Inference
Experimental results
Conclusions
Motivation

Multi-task learning: clustering.
Goal:
Share clusters among multiple related clustering problems (model-based).
Approach:
Hierarchical;
Nonparametric Bayesian;
DP mixture model: learn a generative model over the data, treating the classes as hidden variables.
Dirichlet Processes

Let $(\Theta, \mathcal{B})$ be a measurable space, $G_0$ be a probability measure on the space, and $\alpha$ be a positive real number.
A Dirichlet process is the distribution of a random probability measure $G$ over $(\Theta, \mathcal{B})$ such that, for all finite partitions $(A_1, \ldots, A_r)$ of $\Theta$,
$$(G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_r)).$$
We write $G \sim \mathrm{DP}(\alpha, G_0)$ if $G$ is a random probability measure with distribution given by the Dirichlet process.

Draws from $G$ are generally not distinct: $G$ is discrete with probability one,
$$G = \sum_{k=1}^{\infty} \beta_k \, \delta_{\theta_k}, \qquad \theta_k \sim G_0 \text{ i.i.d.},$$
where the weights $\beta_k$ are random and depend on $\alpha$.

Properties: $E[G(A)] = G_0(A)$, $\mathrm{Var}[G(A)] = \dfrac{G_0(A)\,(1 - G_0(A))}{\alpha + 1}$.
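A minimal sketch (my addition, not from the slides) of how such a draw looks: an approximate sample $G = \sum_k \beta_k \delta_{\theta_k}$ via truncated stick-breaking, with $G_0 = N(0, 1)$ and the truncation level K as illustrative assumptions.

```python
# Truncated stick-breaking draw from DP(alpha, G0):
# beta_k = v_k * prod_{l<k}(1 - v_l), v_k ~ Beta(1, alpha), theta_k ~ G0 i.i.d.
# G0 = N(0, 1) and K are illustrative assumptions.
import numpy as np

def sample_dp(alpha, sample_g0, K=100, seed=None):
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=K)                        # stick proportions
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    theta = sample_g0(K, rng)                               # atoms theta_k ~ G0
    return theta, beta

theta, beta = sample_dp(alpha=2.0, sample_g0=lambda k, r: r.normal(0.0, 1.0, k), seed=0)
print(beta[:5], beta.sum())                                 # weights nearly sum to 1
```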
Chinese Restaurant Processes

CRP (the Pólya urn scheme):
Let $\phi_1, \ldots, \phi_{i-1}$ be i.i.d. random variables distributed according to $G$; let $\theta_1, \ldots, \theta_K$ be the distinct values taken on by $\phi_1, \ldots, \phi_{i-1}$, and $n_k$ be the number of $\phi_{i'} = \theta_k$, $0 < i' < i$. Integrating out $G$,
$$\phi_i \mid \phi_1, \ldots, \phi_{i-1} \sim \sum_{k=1}^{K} \frac{n_k}{i - 1 + \alpha} \, \delta_{\theta_k} + \frac{\alpha}{i - 1 + \alpha} \, G_0.$$

This slide is from "Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process", NLP Group, Stanford, Feb. 2005.
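A minimal sampler for this predictive rule (an illustration, not the paper's code): customer i sits at occupied table k with probability $n_k/(i-1+\alpha)$, or at a new table with probability $\alpha/(i-1+\alpha)$.

```python
# Chinese restaurant process sampler: returns table assignments and counts.
import numpy as np

def crp(n, alpha, rng=None):
    rng = np.random.default_rng(rng)
    counts = []                                # n_k for each occupied table
    assignments = []
    for i in range(1, n + 1):
        # existing tables in proportion to n_k; a new table with weight alpha
        probs = np.array(counts + [alpha], dtype=float) / (i - 1 + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(0)                   # open a new table
        counts[k] += 1
        assignments.append(k)
    return assignments, counts

print(crp(20, alpha=1.0, rng=0))
```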
DP Mixture Model

One of the most important applications of the DP: a nonparametric prior distribution on the components of a mixture model.
$$G \sim \mathrm{DP}(\alpha_0, G_0), \qquad \phi_i \mid G \sim G, \qquad x_i \mid \phi_i \sim F(\phi_i).$$

Why no direct application to density estimation? Because a draw $G$ is discrete; placing $G$ on the component parameters and smoothing through $F$ yields a continuous density.
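A short generative sketch of this mixture (an illustration with assumed constants, reusing the crp() sampler above): labels come from the CRP, each cluster k gets $\theta_k \sim G_0$, and observations are drawn from $F(\theta_k)$, here taken to be $N(\theta_k, 1)$ with $G_0 = N(0, 3^2)$.

```python
# DP Gaussian mixture, generatively: labels via crp() above, then
# theta_k ~ G0 = N(0, 3^2) per cluster and x_i ~ F(theta_{z_i}) = N(theta_{z_i}, 1).
# G0, F, and all constants are illustrative assumptions.
import numpy as np

def dp_mixture_sample(n, alpha, seed=0):
    rng = np.random.default_rng(seed)
    z, _ = crp(n, alpha, rng)                      # cluster labels from the CRP
    theta = rng.normal(0.0, 3.0, size=max(z) + 1)  # theta_k ~ G0
    x = rng.normal(theta[np.array(z)], 1.0)        # x_i ~ F(theta_{z_i})
    return x, np.array(z)

x, z = dp_mixture_sample(200, alpha=1.0)
print(len(np.unique(z)), "clusters realized")
```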
HDP – Problem statement

We have J groups of data, {X_j}, j = 1, ..., J. For each group, X_j = {x_ji}, i = 1, ..., n_j.

In each group, X_j = {x_ji} is modeled with a mixture model. The mixing proportions are specific to the group.

Different groups share the same set of mixture components (the underlying clusters {θ_k}), but each group uses a different combination of the mixture components.

Goal:
Discover the distribution of clusters within a group;
Discover the distribution of clusters across groups.
HDP - General representation

G_0: the global probability measure, G_0 ~ DP(γ, H); γ: concentration parameter, H: the base measure.
G_j: the probability distribution for group j, G_j ~ DP(α_0, G_0).
φ_ji: the hidden parameter of the distribution F(φ_ji) corresponding to x_ji.

The overall model is:
$$G_0 \sim \mathrm{DP}(\gamma, H), \quad G_j \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0), \quad \phi_{ji} \mid G_j \sim G_j, \quad x_{ji} \mid \phi_{ji} \sim F(\phi_{ji}).$$

Two-level DPs.
HDP - General representation

G_0 places non-zero mass only on the atoms {θ_k}, which are i.i.d. random variables distributed according to H; thus
$$G_0 = \sum_{k=1}^{\infty} \beta_k \, \delta_{\theta_k}, \qquad G_j = \sum_{k=1}^{\infty} \pi_{jk} \, \delta_{\theta_k},$$
so every group reuses the same atoms θ_k, with group-specific weights π_j.
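A truncated sketch of this two-level construction (constants assumed): global weights β from stick-breaking over shared atoms $\theta_k \sim H$, and group weights $\pi_j \sim \mathrm{DP}(\alpha_0, \beta)$, which under truncation can be drawn as $\mathrm{Dirichlet}(\alpha_0 \beta)$.

```python
# Truncated HDP: beta ~ GEM(gamma) over shared atoms theta_k ~ H = N(0, 3^2);
# each group's weights pi_j ~ DP(alpha0, beta) ~= Dirichlet(alpha0 * beta),
# so all groups reuse the same atoms with different mixing proportions.
import numpy as np

def hdp_weights(J, gamma, alpha0, K=50, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, gamma, size=K)            # global stick-breaking
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    theta = rng.normal(0.0, 3.0, size=K)        # shared atoms theta_k ~ H
    pi = rng.dirichlet(alpha0 * beta + 1e-8, size=J)  # group-level weights
    return theta, beta, pi

theta, beta, pi = hdp_weights(J=3, gamma=1.0, alpha0=1.0)
print(pi.shape)                                 # (3, 50): one row per group
```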
HDP-CR franchise

First level: within each group, a DP mixture:
$$G_j \sim \mathrm{DP}(\alpha_0, G_0), \qquad \phi_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid \phi_{ji} \sim F(\phi_{ji}).$$
Let $\phi_{j1}, \ldots, \phi_{j,i-1}$ be i.i.d. random variables distributed according to $G_j$; let $\psi_{j1}, \ldots, \psi_{jT_j}$ be the values taken on by $\phi_{j1}, \ldots, \phi_{j,i-1}$, and $n_{jt}$ be the number of $\phi_{ji'} = \psi_{jt}$, $0 < i' < i$.

Second level: across groups, sharing components:
The base measure of each group is a draw from a DP: $\psi_{jt} \mid G_0 \sim G_0$, $G_0 \sim \mathrm{DP}(\gamma, H)$.
Let $\theta_1, \ldots, \theta_K$ be the values taken on by the $\psi_{jt}$, and $m_k$ be the number of $\psi_{jt} = \theta_k$ over all $j$, $t$.
HDP-CR franchise

Because every table's dish $\psi_{jt}$ is drawn from the shared $G_0$, values of $\phi_{ji}$ are shared among groups.
Integrating out $G_0$:
$$\psi_{jt} \mid \psi_{11}, \ldots, \psi_{j,t-1} \sim \sum_{k=1}^{K} \frac{m_k}{\sum_{k'} m_{k'} + \gamma} \, \delta_{\theta_k} + \frac{\gamma}{\sum_{k'} m_{k'} + \gamma} \, H.$$
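A generative sketch of the two levels together (the "franchise"; names and constants are my own, not the paper's code): customers pick tables within their restaurant at rate α_0, each new table orders a dish from the global menu at rate γ, and the dish counts m_k are shared across all groups.

```python
# Chinese restaurant franchise, generatively: customer i in restaurant j
# joins table t with prob n_jt/(i-1+alpha0) or a new table; each new table
# orders dish k with prob m_k/(sum(m)+gamma) or a brand-new dish.
import numpy as np

def crf(group_sizes, alpha0, gamma, seed=0):
    rng = np.random.default_rng(seed)
    m = []                 # m_k: number of tables serving dish k, all groups
    labels, dishes = [], []
    for n_j in group_sizes:
        n_jt, k_jt, z_j = [], [], []
        for i in range(1, n_j + 1):
            p = np.array(n_jt + [alpha0], dtype=float) / (i - 1 + alpha0)
            t = rng.choice(len(p), p=p)
            if t == len(n_jt):                 # new table: order a dish
                q = np.array(m + [gamma], dtype=float) / (sum(m) + gamma)
                k = rng.choice(len(q), p=q)
                if k == len(m):
                    m.append(0)                # brand-new dish on the menu
                m[k] += 1
                n_jt.append(0)
                k_jt.append(k)
            n_jt[t] += 1
            z_j.append(k_jt[t])                # dish label for this customer
        labels.append(z_j)
        dishes.append(k_jt)
    return labels, dishes, m

labels, dishes, m = crf([30, 30, 30], alpha0=1.0, gamma=1.0)
print(m)                                       # shared dish popularity
```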
Inference – MCMC

Gibbs sampling the posterior in the CR franchise:
Instead of dealing directly with $\phi_{ji}$ and $\psi_{jt}$ to get $p(\phi, \psi \mid X)$, we obtain $p(t, k, \theta \mid X)$ by sampling $t$, $k$, and $\theta$, where:
t = {t_ji}: t_ji is the index of the table that φ_ji is associated with, φ_ji = ψ_{j t_ji};
k = {k_jt}: k_jt is the index of the dish that ψ_jt takes its value on, ψ_jt = θ_{k_jt}.

Knowing the prior distribution, as shown in the CR franchise, the posterior is sampled iteratively:
Sampling t;
Sampling k;
Sampling θ.
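The slides leave the three conditionals blank; as a sketch, in the notation of Teh et al. (with $f_k^{-x}(\cdot)$ the predictive density of component $k$ given all other data assigned to it, $\theta_k$ integrated out under conjugacy):

```latex
% Sampling t: seat x_ji at an existing table or a new one
p(t_{ji}=t \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto
\begin{cases}
n_{jt}^{-ji}\, f^{-x_{ji}}_{k_{jt}}(x_{ji}), & t \text{ previously used},\\
\alpha_0\, p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji}=t^{new}, \mathbf{k}), & t = t^{new}.
\end{cases}

% Sampling k: reassign the dish of table jt, where x_jt is all data at that table
p(k_{jt}=k \mid \mathbf{t}, \mathbf{k}^{-jt}) \propto
\begin{cases}
m_{k}^{-jt}\, f^{-\mathbf{x}_{jt}}_{k}(\mathbf{x}_{jt}), & k \text{ previously used},\\
\gamma\, f^{-\mathbf{x}_{jt}}_{k^{new}}(\mathbf{x}_{jt}), & k = k^{new}.
\end{cases}

% Sampling theta: posterior of component k given all data assigned to it
p(\theta_k \mid \cdot) \propto h(\theta_k) \prod_{ji:\, k_{j t_{ji}} = k} f(x_{ji} \mid \theta_k)
```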
Experiments on the synthetic data

Data description:
We have three groups of data;
Each group is a Gaussian mixture;
Different groups can share the same clusters;
Each cluster has 50 2-D data points; the features are independent.
[Figure: original data. Scatter plot of x(2) vs. x(1) for Group1, Group2, and Group3, with clusters labeled 1-7.
Group 1: [1, 2, 3, 7]; Group 2: [3, 4, 5, 7]; Group 3: [5, 6, 1, 7].]
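For concreteness, a sketch that generates data with this structure; the cluster means are invented (the slide's actual means are not given), and only the group-to-cluster assignments follow the figure.

```python
# Synthetic data: 7 shared 2-D Gaussian clusters, 3 groups of 4 clusters
# each, 50 points per cluster. Means and spread are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
means = {k: rng.uniform([0, 0], [10, 6]) for k in range(1, 8)}   # assumed
groups = {1: [1, 2, 3, 7], 2: [3, 4, 5, 7], 3: [5, 6, 1, 7]}     # from figure
data = {
    j: np.vstack([rng.normal(means[k], 0.4, size=(50, 2)) for k in ks])
    for j, ks in groups.items()
}
print({j: x.shape for j, x in data.items()})   # each group: (200, 2)
```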
Experiments on the synthetic data

HDP definition:
Here, F(x_ji | φ_ji) is a Gaussian distribution with φ_ji = {μ_ji, σ_ji}; each φ_ji takes its value on one of θ_k = {μ_k, σ_k}, k = 1, ....
μ ~ N(m, σ/β), σ^{-1} ~ Gamma(a, b), i.e., H is a joint Normal-Gamma distribution; m, β, a, b are given hyperparameters.

Goal:
Model each group as a Gaussian mixture;
Model the cluster distribution over groups.
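A small sketch of one draw from this base measure H (hyperparameter values are placeholders; here σ denotes the variance, so σ^{-1} is the precision, and Gamma(a, b) is taken with rate b).

```python
# Draw (mu_k, sigma_k) from the Normal-Gamma base measure H:
# sigma^{-1} ~ Gamma(a, b) (b treated as a rate), mu ~ N(m, sigma/beta).
import numpy as np

def sample_H(m, beta, a, b, rng):
    prec = rng.gamma(shape=a, scale=1.0 / b)   # sigma^{-1} ~ Gamma(a, b)
    sigma = 1.0 / prec                         # sigma: component variance
    mu = rng.normal(m, np.sqrt(sigma / beta))  # mu ~ N(m, sigma/beta)
    return mu, sigma

rng = np.random.default_rng(0)
print(sample_H(m=5.0, beta=1.0, a=2.0, b=2.0, rng=rng))
```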
Experiments on the synthetic data

Results on the synthetic data.

Global distribution:
[Figure: estimated underlying distribution (scatter of x(2) vs. x(1) for Group1, Group2, and Group3, with the estimated clusters over all groups) and the corresponding global mixing proportions over groups (mixing proportion vs. component index 1-10).]
The number of components is open-ended; only part is shown here.
Experiments on the synthetic data

Mixture within each group:
[Figure: per-group mixing proportions over data (mixing proportion vs. component index 1-10) for groups 1, 2, and 3.]
The number of components in each group is also open-ended; only part is shown here.
Conclusions & discussions

This hierarchical Bayesian method can automatically determine the appropriate number of mixture components needed.
A set of DPs are coupled via their base measure to achieve component sharing among groups.
The DPs serve as nonparametric priors; this is not nonparametric density estimation.