Bayesian Generalized Product Partition Model

By David Dunson and Ju-Hyun Park
Presentation by Eric Wang 2/15/08
Outline
• Introduce Product Partition Models (PPM).
• Relate PPM to DP via the Blackwell-MacQueen Polya Urn scheme.
• Introduce predictor dependence into PPM to form Generalized
PPM (GPPM).
• Discussion and Results
• Conclusion
Product Partition Model
• A PPM is formally defined as
$$f(y \mid S^*) = \prod_{h=1}^{k} f_h(y_h), \qquad \pi(S^*) = c_0 \prod_{h=1}^{k} c(S_h^*) \qquad (1)$$
– where $S^* = (S_1^*,\dots,S_k^*)$ is a partition of {1,...,n}.
– Let $y_h = \{y_i : i \in S_h^*\}$ denote the data for subjects in cluster h, h = 1,…,k.
– The prior probability of partition S* is therefore proportional to the product of the cohesions of its subsets, with $c_0$ the normalizing constant.
– The posterior on S* after seeing data y is also a PPM, with updated cohesion $c(S_h^*)\, f_h(y_h)$.
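To make definition (1) concrete, the following minimal sketch enumerates every partition of a small index set and normalizes the product of cohesions. The helper names (`set_partitions`, `ppm_prior`) and the example cohesion are hypothetical illustrations, not taken from the paper, and brute-force enumeration is only feasible for very small n.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for idx in range(len(partition)):
            yield partition[:idx] + [[first] + partition[idx]] + partition[idx + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition


def ppm_prior(n, cohesion):
    """Return ({partition: probability}, c0) for pi(S*) = c0 * prod_h c(S*_h)."""
    weights = {}
    for partition in set_partitions(list(range(1, n + 1))):
        key = tuple(sorted(tuple(sorted(block)) for block in partition))
        w = 1.0
        for block in partition:
            w *= cohesion(block)
        weights[key] = w
    c0 = 1.0 / sum(weights.values())  # normalizing constant
    return {key: c0 * w for key, w in weights.items()}, c0


# Hypothetical cohesion (an assumption for illustration): favours larger blocks.
def example_cohesion(block):
    return float(len(block)) ** 2


prior, c0 = ppm_prior(4, example_cohesion)
```

Under these assumptions, `prior` maps each partition of {1,2,3,4} to its probability, and swapping in a different `cohesion` function changes which partitions are favoured.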
Product Partition Model
• A PPM can also be induced hierarchically:
$$y_i \mid \theta, S \overset{ind}{\sim} f(\theta_{S_i}), \qquad S_i \overset{ind}{\sim} \sum_{h=1}^{k} \pi_h \delta_h, \qquad \theta_h \overset{ind}{\sim} G_0$$
– where $S_i = h$ if $i \in S_h^*$, and $S = (S_1,\dots,S_n)'$.
• Taking $k \to \infty$ induces a nonparametric PPM.
• A prior on the weights $\pi = (\pi_1,\dots,\pi_k)'$ imposes a particular form on the cohesion: a convenient choice corresponds to the Dirichlet Process.
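The hierarchical construction can be sketched as a small forward sampler. This is only an illustration under stated assumptions: a symmetric Dirichlet(alpha/k, …, alpha/k) prior on the weights (a standard finite approximation to the DP), a Gaussian base measure G0 and a unit-variance Gaussian likelihood. None of these specific choices, nor the function name, come from the slides.

```python
import random


def sample_finite_ppm(n, k, alpha, rng):
    """Draw (pi, theta, S, y) from a finite mixture that induces a PPM."""
    # pi ~ Dirichlet(alpha/k, ..., alpha/k), via normalized Gamma draws (assumed prior).
    gammas = [rng.gammavariate(alpha / k, 1.0) for _ in range(k)]
    total = sum(gammas)
    pi = [g / total for g in gammas]
    # theta_h ~ G0, here assumed to be N(0, 3^2).
    theta = [rng.gauss(0.0, 3.0) for _ in range(k)]
    # S_i ~ sum_h pi_h * delta_h: cluster indicator of subject i.
    S = rng.choices(range(k), weights=pi, k=n)
    # y_i | theta, S ~ f(theta_{S_i}), here assumed to be N(theta_{S_i}, 1).
    y = [rng.gauss(theta[s], 1.0) for s in S]
    return pi, theta, S, y


pi, theta, S, y = sample_finite_ppm(n=20, k=5, alpha=1.0, rng=random.Random(1))
```

The indicators `S` define the partition: subjects sharing a value of `S[i]` form one cluster.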
Relating DP and PPM
• In DP, G ~ DP(G
. 0)
– G is seen in stick breaking. If it is marginalized out, it yields the
Blackwell-MacQueen (1973) formulation:
– Where  i is the unique value taken by the ith data.
– The joint distribution of the a particular set
is therefore
due to the independence of the data.
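The urn scheme is easy to simulate. The sketch below samples only the cluster labels (the partition), not the parameter values themselves. The function name and 0-based indexing are choices made here for illustration.

```python
import random


def polya_urn_partition(n, alpha, rng):
    """Sample cluster labels for n subjects from the Blackwell-MacQueen urn."""
    labels = []  # labels[i] is the cluster of subject i (clusters numbered from 0)
    counts = []  # counts[h] is the current size of cluster h
    for i in range(n):
        # With i subjects already seated, a new cluster has weight alpha and
        # existing cluster h has weight counts[h]; the weights sum to alpha + i.
        u = rng.random() * (alpha + i)
        chosen = len(counts)  # default: open a new cluster
        running = 0.0
        for h, size in enumerate(counts):
            running += size
            if u < running:
                chosen = h
                break
        if chosen == len(counts):
            counts.append(1)
        else:
            counts[chosen] += 1
        labels.append(chosen)
    return labels, counts


labels, counts = polya_urn_partition(50, alpha=1.0, rng=random.Random(0))
```

Larger values of `alpha` make new clusters more likely, so the number of clusters grows with `alpha`.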
Relating DP and PPM
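• It can be shown directly that the Blackwell-MacQueen formulation leads to
$$\pi(\phi_1,\dots,\phi_n) = \frac{\Gamma(\alpha)}{\Gamma(\alpha+n)} \prod_{h=1}^{k} \Big\{ \alpha\, G_0(\phi_{h,1}) \prod_{l=2}^{k_h} (l-1)\, \delta_{\phi_{h,1}}(\phi_{h,l}) \Big\} \qquad (2)$$
• where $k_h$ is the number of subjects taking unique value h and $\alpha$ is the DP concentration parameter.
• $\phi_{h,l}$ is the value of the l-th subject in cluster h, with subjects re-sorted by their ids: $\{\phi_{1,1},\dots,\phi_{1,k_1},\ \phi_{2,1},\dots,\phi_{2,k_2},\ \dots,\ \phi_{k,1},\dots,\phi_{k,k_k}\}$
• Also, $c_0 = \Gamma(\alpha)/\Gamma(\alpha+n)$ is a normalizing constant and the cohesion is $c(S_h^*) = \alpha\,(k_h-1)!$. Then:
$$\pi(S^*) = c_0 \prod_{h=1}^{k} c(S_h^*) \qquad (3)$$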
Relating DP and PPM
• From slide 3, writing the prior and likelihood together:
• Notice that from (1), G can be marginalized out to get the same
form
(4)
• Specifically, integrate over all possible values that the cluster parameter $\theta_h$ can take for subset h.
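The marginalization described on this slide can be sketched as follows. The notation is an assumption, since the slide's own equations did not survive: $f(y_i \mid \theta_h)$ is the likelihood of subject i under cluster parameter $\theta_h$, and $G_0$ is the base measure.

```latex
f_h(y_h) = \int \prod_{i \in S_h^*} f(y_i \mid \theta_h)\, dG_0(\theta_h),
\qquad
\pi(S^*)\, f(y \mid S^*) = c_0 \prod_{h=1}^{k} c(S_h^*)\, f_h(y_h)
```

Under this reading, the joint of partition and data again factorizes over clusters, which is the product form in (1).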
Relating DP and PPM
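• Therefore, DP is a special case of PPM with cohesion $c(S_h^*) = \alpha\,(k_h-1)!$ and normalizing constant $c_0 = \Gamma(\alpha)/\Gamma(\alpha+n)$, where $\alpha$ is the DP concentration parameter.
• However, (2) follows the premise of DP that the data are exchangeable and does not incorporate dependence on predictors.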
• Next, PPMs will be generalized such that predictor dependence is
incorporated.
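The claim that the DP is a PPM can be checked numerically. The sketch below compares the sequential urn probability of a label sequence with the product form built from the standard DP cohesion alpha * (k_h - 1)! and normalizing constant Gamma(alpha) / Gamma(alpha + n). The function names are hypothetical.

```python
from math import factorial, gamma


def urn_probability(labels, alpha):
    """Probability of a cluster-label sequence under the Blackwell-MacQueen urn."""
    prob = 1.0
    counts = {}
    for i, h in enumerate(labels):
        size = counts.get(h, 0)
        # New cluster: weight alpha; existing cluster: weight equal to its size.
        weight = alpha if size == 0 else size
        prob *= weight / (alpha + i)
        counts[h] = size + 1
    return prob


def ppm_probability(labels, alpha):
    """Same probability via c0 * prod_h c(S*_h) with c(S*_h) = alpha * (k_h - 1)!."""
    sizes = {}
    for h in labels:
        sizes[h] = sizes.get(h, 0) + 1
    prob = gamma(alpha) / gamma(alpha + len(labels))  # c0
    for k_h in sizes.values():
        prob *= alpha * factorial(k_h - 1)
    return prob


labels_a = [0, 1, 0, 0, 2, 1]
labels_b = [0, 0, 0, 1, 1, 2]  # different partition, same cluster sizes
p_urn = urn_probability(labels_a, alpha=0.7)
p_ppm = ppm_probability(labels_a, alpha=0.7)
```

The two partitions have the same cluster sizes and therefore the same probability, which reflects the exchangeability noted above.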
Generalized PPM
• The goal of the paper is to formulate (1) such that the cohesion depends on the subjects' predictors.
• This can be done following a process very similar to the non-predictor case above.
• Once again, the connection between DP and PPM will be used; the resulting model will henceforth be referred to as the GPPM.
• The formulation is interesting because the predictors will be treated as random variables rather than known fixed values (as in KSBP).
GPPM
• Consider a hierarchical model with the following ingredients:
– A base measure defined on both the parameters of the data and the parameters of the predictor.
– The model will segment subjects {1,…,n} into k clusters. As before, $i \in S_h^*$ denotes that subject i belongs to cluster h.
– Each cluster h has its own unique values of the parameters associated with the data and with the predictor.
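A hierarchical model of the kind described on this slide can be sketched as follows. All notation here is assumed, since the slide's equations were lost: $f_1$ and $f_2$ are the likelihoods of the response and the predictor, $\phi_i$ and $\gamma_i$ are the subject-specific parameters, and the product form of $G_0$ is an assumption.

```latex
y_i \mid \phi_i \sim f_1(y_i \mid \phi_i), \qquad
x_i \mid \gamma_i \sim f_2(x_i \mid \gamma_i), \qquad
(\phi_i, \gamma_i) \mid G \sim G, \qquad
G \sim DP(\alpha G_0), \qquad
G_0 = G_{0\phi} \otimes G_{0\gamma}
```

Because G is almost surely discrete, subjects share values of $(\phi_i, \gamma_i)$, which is what induces the clustering.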
GPPM
• The joint distribution of the subject-specific parameters can be developed in a similar manner to (2), giving (5).
• Conditioning on the predictors then gives the conditional distribution (6), which can be compared directly with (2).
• The cohesion in (6), labelled (7), depends on the predictors in each cluster, so (7) meets the criteria originally set out.
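One way to write a predictor-dependent cohesion consistent with the description on these slides is sketched below. The notation is assumed: $x_h^* = \{x_i : i \in S_h^*\}$, $f_2$ is the predictor likelihood, $\gamma_h$ the predictor parameter of cluster h, and $G_{0\gamma}$ its base measure.

```latex
c(S_h^*, x_h^*) = \alpha\,(k_h - 1)! \int \prod_{i \in S_h^*} f_2(x_i \mid \gamma_h)\, dG_{0\gamma}(\gamma_h)
```

Under this reading, the first factor is the DP cohesion from the non-predictor case and the integral is the marginal likelihood of the predictors in cluster h.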
GPPM
• Some thoughts on GPPM so far:
– As noted earlier, the posterior distribution of a PPM is still in the class of PPMs, but with updated cohesion $c(S_h^*)\, f_h(y_h)$.
– Similarly, the posterior of a GPPM will also take the form of a GPPM.
– (2) and (6) are quite similar. The extra portion of (6) is the marginalized probability of the predictors.
– If the marginal likelihood of the predictor does not differ between clusters, then the GPPM reverts to the Blackwell-MacQueen formulation, as seen clearly in the following theorem.
Generalized Polya Urn Scheme
• The following theorem shows that the GPPM can induce a
Blackwell-MacQueen Polya Urn scheme, generalized for predictor
dependence:
Generalized Polya Urn Scheme
• By the above theorem, subject i will do either 1) or 2):
– 1) Draw a previously unseen unique value, with probability proportional to the concentration parameter $\alpha$ times the marginal likelihood of its predictor under the base measure.
– 2) Take a previously used unique value, equal to the parameters of cluster h, with probability proportional to the number of subjects that have previously chosen that unique value times the marginal likelihood of its predictor given the predictors already in cluster h.
• Further, since the predictors are treated as random variables,
updating the posteriors on each cluster’s predictor parameters
means that GPPM is a flexible, non-parametric way to adapt the
distance measure in predictor space.
• In this paper G is always integrated out; however, Dunson alludes to variational techniques which could still be developed in similar fashion, following the fast variational DP proposed by Kurihara et al. (2006).
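The two-option urn described above can be illustrated with a deliberately simplified sketch. The paper uses a Normal-Wishart prior on the predictor parameters; the code below instead assumes a one-dimensional predictor with a Normal likelihood of known variance and a Normal base measure on the cluster mean, so that both marginal likelihoods are available in closed form. All names and default values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def generalized_urn_weights(x_new, clusters, alpha, m0=0.0, s0_sq=1.0, sigma_sq=0.05):
    """Return normalized weights [new cluster, cluster 1, ..., cluster k].

    clusters: list of lists holding the predictor values already in each cluster.
    Assumed model: x | mu ~ N(mu, sigma_sq), mu ~ N(m0, s0_sq).
    """
    # Option 1: new cluster, alpha times the marginal of x_new under the base measure.
    weights = [alpha * normal_pdf(x_new, m0, s0_sq + sigma_sq)]
    for xs in clusters:
        k_h = len(xs)
        # Conjugate posterior of the cluster mean given the predictors in cluster h.
        post_var = 1.0 / (1.0 / s0_sq + k_h / sigma_sq)
        post_mean = post_var * (m0 / s0_sq + sum(xs) / sigma_sq)
        # Option 2: cluster size times the predictive density of x_new in cluster h.
        weights.append(k_h * normal_pdf(x_new, post_mean, post_var + sigma_sq))
    total = sum(weights)
    return [w / total for w in weights]


clusters = [[0.1, 0.15, 0.2], [0.8, 0.9]]
w = generalized_urn_weights(0.85, clusters, alpha=1.0)
# With an almost flat predictor likelihood the weights approach alpha : k_1 : k_2.
w_flat = generalized_urn_weights(0.85, clusters, alpha=1.0, sigma_sq=1e12)
```

In this example the new predictor value sits close to the second cluster, so that cluster receives the largest weight even though it is smaller. With a nearly flat predictor likelihood the weights fall back to the ordinary Blackwell-MacQueen proportions.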
Generalized Polya Urn Scheme
• Consider, for example, a Normal-Wishart prior on the predictor.
• The prior involves two multiplicative constants and a Wishart distribution specified by its degrees of freedom and mean.
• Notice that this formulation adds another multiplier to the precision of the predictor distribution. This is analogous to the kernel width in KSBP, and encourages tight local clustering in predictor space.
• The marginal distributions on the predictors from Theorem 1 take the forms shown on the next slide.
Generalized Polya Urn Scheme
• The marginal distribution of the predictor in the first weight is a non-central multivariate t-distribution, specified by its degrees of freedom, mean and scale.
• The marginal distribution of the predictor in the second weight has the same functional form but with updated hyperparameters:
$$f(x \mid \mu_{0x}^*, \nu_x^*, \Sigma_{0x}^*) = \frac{\Gamma\big((\nu_x^* + p)/2\big)}{(\pi \nu_x^*)^{p/2}\, \Gamma(\nu_x^*/2)\, |\Sigma_{0x}^*|^{1/2}} \left[ 1 + \frac{1}{\nu_x^*} (x - \mu_{0x}^*)'\, \Sigma_{0x}^{*-1} (x - \mu_{0x}^*) \right]^{-(\nu_x^* + p)/2}$$
where the updated hyperparameters depend on the empirical mean of the predictors in cluster h, computed without predictor i.
Generalized Polya Urn Scheme
• Posterior updating in this model is straightforward using MCMC. The conditional posterior of the parameters combines the base prior updated with the data likelihood and the weights from Theorem 1.
• The indicators are updated separately from the cluster parameters. The membership indicators are sampled from their multinomial posterior.
• Next, update the parameters conditioned on the membership indicators and the number of clusters k.
Results
• Dunson et al. demonstrate results on conditional density regression problems. The model is specified by:
– a p-dimensional predictor,
– the data likelihood,
– the parameters of cluster h.
• Results are demonstrated on 3 datasets:
– Simulated Single Gaussian (p = 2)
– Simulated Mixture of two Gaussians (p = 2)
– Epidemiology data (p = 3)
Results
• Simulated single Gaussian data, 500 data points
– The predictor is generated iid from a uniform distribution over (0,1).
– Data was simulated using
• The algorithm was run for 10,000 iterations with a 1,000-iteration burn-in, giving fast mixing and good estimates.
[Figure: raw data (y versus x), and conditional distributions of y for two different values of x. The dotted line is the truth, the solid line is the estimate, and the dashed lines are 99% credibility intervals.]
Results
• Simulated 2 Gaussian results, 500 data points
– The predictor is generated iid from a uniform distribution over (0,1).
– Data was simulated using
[Figure: two columns of plots, PPM (left) and GPPM (right).]
Here, the left column of plots is for a PPM (non-generalized), while the right column is the GPPM on the same dataset. Notice the much better fit in the bottom plots, and that the GPPM is not dragged toward 0 as the second peak appears when the predictor approaches 0.
Results
• Epidemiologic Application:
• DDE has been shown to increase the rate of pre-term birth. The two predictors correspond to the DDE dose for child i and the mother's age after normalization, respectively.
• Dataset size was 2,313 subjects.
• MCMC GPPM was run for 30,000 iterations with 10,000 iteration
burn-in.
• The results confirmed earlier findings of a slightly decreasing trend in the response as the DDE level rises.
• These findings are similar to previous KSBP work on the same
dataset, but the implementation was simpler.
Results
[Figure: raw data and estimates. Dashed lines indicate 99% credibility intervals.]
Conclusion
• A GPPM was formulated beginning with the Blackwell-MacQueen
Polya Urn scheme.
• The GPPM incorporates predictor dependence by treating the
predictor as a random variable.
– It is similar in spirit to the KSBP, but is able to bypass issues such as kernel
width selection and the inability to implement a continuous distribution in
predictor space.
• Future research directions could explore the variational method that Dunson mentions, developed along the lines of the formulation proposed in this paper.