The use of centred parameterisations and Markov chain

Download Report

Transcript The use of centred parameterisations and Markov chain

Simple methods to improve MCMC
efficiency in random effect models
William Browne*, Mousa Golalizadeh*,
Martin Green and Fiona Steele
Universities of Bristol and Nottingham
*Thanks to ESRC for supporting this work
Summary
•
•
•
•
•
•
•
Introduction.
Application 1 – Clutch size in great tits.
Method 1: Hierarchical centering.
Method 2: Parameter expansion.
Application 2 – Mastitis incidence in dairy cattle.
Method 3: Orthogonal predictors.
Application 3 – Contraceptive discontinuation in
Indonesia.
• Conclusions.
Introduction/Synopsis
• MCMC methods allow easy fitting of complex
random effects models
• The simplest (default) MCMC algorithms can
produce poorly mixing chains.
• By reparameterising the model one can greatly
improve mixing.
• These reparameterisations are easy to
implement in WinBUGS (or MLwiN)
• The choice of reparameterisation depends in
part on the model/dataset.
Application 1: Great tit nesting
behaviour (crossed random effects)
• Original work was collaborative research with
Richard Pettifor (Institute of Zoology, London),
and Robin McCleery and Ben Sheldon
(University of Oxford).
Application 1: Great tit nesting
behaviour (crossed random effects)
• A longitudinal study of great tits
nesting in Wytham Woods,
Oxfordshire.
• 6 responses : 3 continuous & 3 binary.
• Clutch size, lay date and mean
nestling mass.
• Nest success, male and female
survival.
• Data: 4165 nesting attempts over a
period of 34 years.
• There are 4 higher-level classifications
of the data: female parent, male
parent, nestbox and year.
• We only consider Clutch size here
Data background
The data structure can be summarised as follows:
Source
Number
of IDs
Median
#obs
Mean
#obs
Year
34
104
122.5
Nestbox
968
4
4.30
Male parent
2986
1
1.39
Female parent
2944
1
1.41
Note there is very little information on each
individual male and female bird but we can get
some estimates of variability via a random effects
model.
MCMC efficiency for clutch size
response
• The MCMC algorithm used in the univariate analysis of clutch size
was a simple 10-step Gibbs sampling algorithm.
• .
(2)
(3)
(4)
(5)
y   u
u
u
u
e
i
m ale(i )
fem ale(i ) nestbox(i )
year(i ) i
(2)
(3)
u
~ N (0,  2 ), u
~ N (0,  2 ),
m ale(i )
u (2)
fem ale(i )
u (3)
(4)
(5)
u
~ N (0,  2 ), u
~ N (0,  2 ),
nestbox(i )
u (4)
year(i )
u (5)
e ~ N (0,  2 )
i
e
• To compare methods for each parameter we can look at the
effective sample sizes (ESS) which give an estimate of how many
‘independent samples we have’ for each parameter as opposed to
50,000 dependent samples.

• ESS = # of iterations/,
  1  2  (k )
k 1
Clutch Size
Parameter
Estimate (S.E.)
β
8.808 (0.109)
0.365 (0.101)
0.107 (0.026)
0.046 (0.043)
0.974 (0.062)
1.064 (0.055)
 u2(5)
(Year)
 u2( 4) (Nest box)
 u2(3) (Male)
 u2( 2) (Female)
 e2 (Observation)
Percentage
variance
14.3%
4.2%
1.8%
38.1%
41.6%
Here we see that the average clutch size is just
below 9 eggs with large variability between female
birds and some variability between years. Male birds
and nest boxes have less impact.
Effective Sample sizes
Parameter
Fixed Effect
Year
Nestbox
Male
Female
Observation
Time
MLwiN
WinBUGS
671
602
30632
29604
833
788
36
33
3098
3685
110
135
519s
2601s
We will now consider methods that will improve the
ESS values for particular parameters. We will firstly
consider the fixed effect parameter.
Trace and autocorrelation plots for fixed effect
using standard Gibbs sampling algorithm
Hierarchical Centering
This method was devised by Gelfand et al. (1995) for
use in nested models. Basically (where feasible)
parameters are moved up the hierarchy in a model
reformulation. For example:
yij   0  u j  eij , u j ~ N (0, u2 ), eij ~ N (0, e2 )
is equivalent to
yij   j  eij ,  j ~ N (0 , u2 ), eij ~ N (0, e2 )
The motivation here is we remove the strong negative
correlation between the fixed and random effects by
reformulation.
Hierarchical Centering
In our cross-classified model we have 4 possible hierarchies
up which we can move parameters. We have chosen to move
the fixed effect up the year hierarchy as it’s variance had
biggest ESS although this choice is rather arbitrary.
2)
( 3)
( 4)
(5)
y i  u (female
( i )  u male ( i )  u nestbox ( i )   year ( i )  ei ,
2)
2
( 3)
2
u (female
~
N
(
0
,

),
u
~
N
(
0
,

(i )
u ( 2)
male ( i )
u ( 3) ),
( 4)
2
(5)
2
2
u nestbox
( i ) ~ N (0,  u ( 4 ) ),  year ( i ) ~ N (  0 ,  u ( 5 ) ), ei ~ N (0,  e ),
 0  1,  u2( 2) ~  1 ( ,  ), u2(3) ~  1 ( ,  ),
 u2( 4) ~  1 ( ,  ), u2(5) ~  1 ( ,  ), e2 ~  1 ( ,  ).
The ESS for the fixed effect increases 50-fold from 602 to
35,063 while for the year level variance we have a smaller
improvement from 29,604 to 34,626. Note this formulation
also runs faster 1864s vs 2601s (in WinBUGS).
Trace and autocorrelation plots for fixed effect
using hierarchical centering formulation
Parameter Expansion
• We next consider the variances and in particular
the between-male bird variance.
• When the posterior distribution of a variance
parameter has some mass near zero this can hamper
the mixing of the chains for both the variance
parameter and the associated random effects.
• The pictures over the page illustrate such poor
mixing.
• One solution is parameter expansion (Liu et al.
1998).
• In this method we add an extra parameter to the
model to improve mixing.
Trace plots for between males variance and a
sample male effect using standard Gibbs sampling
algorithm
Autocorrelation plot for male variance and a
sample male effect using standard Gibbs sampling
algorithm
Parameter Expansion
In our example we use parameter expansion for all 4
hierarchies. Note the  parameters have an impact on
both the random effects and their variance.
2)
( 3)
( 4)
(5)
yi   0   2 v (female
( i )   3vmale ( i )   4 vnestbox ( i )   5 v year ( i )  ei ,
2)
2
( 3)
2
v (female
( i ) ~ N (0,  v ( 2 ) ), vmale ( i ) ~ N (0,  v ( 3) ),
( 4)
2
(5)
2
2
vnestbox
( i ) ~ N (0,  v ( 4 ) ), v year ( i ) ~ N (0,  v ( 5 ) ), ei ~ N (0,  e ),
 0  1, k  1,  v2( k ) ~  1 ( ,  ), k  2,..5,  e2 ~  1 ( ,  )
The original parameters can be found by:
ui(k )   k vi(k ) and u2(k )   k2 v2(k )
Note the models are not identical as we now have
different prior distributions for the variances.
Parameter Expansion
• For the between males variance we have a 20-fold
increase in ESS from 33 to 600.
• The parameter expanded model has different prior
distributions for the variances although these priors
are still ‘diffuse’.
• It should be noted that the point and interval estimate
of the level 2 variance has changed from
0.034 (0.002,0.126) to 0.064 (0.000,0.172).
• Parameter expansion is computationally slower
3662s vs 2601s for our example.
Trace plots for between males variance and a
sample male effect using parameter expansion.
Combining the two methods
Hierarchical centering and parameter expansion can
easily be combined in the same model. Here we
perform centering on the year classification and
parameter expansion on the other 3 hierarchies.
2)
( 3)
( 4)
( 5)
yi   0   2 v (female
( i )   3vmale ( i )   4 vnestbox ( i )   year ( i )  ei ,
2)
2
( 3)
2
v (female
( i ) ~ N (0,  v ( 2 ) ), vmale ( i ) ~ N (0,  v ( 3) ),
( 4)
2
(5)
2
2
vnestbox
( i ) ~ N (0,  u ( 4 ) ),  year ( i ) ~ N (  0 ,  v ( 5 ) ), ei ~ N (0,  e ),
 0  1, k  1,  v2( k ) ~  1 ( ,  ), k  2,..5,  e2 ~  1 ( ,  )
Effective Sample sizes
As we can see below the effective sample sizes for
all parameters are improved for this formulation
while running time remains approximately the same.
Parameter
Fixed Effect
Year
Nestbox
Male
Female
Observation
Time
WinBUGS
originally
602
WinBUGS
combined
34296
29604
34817
788
5170
33
557
3685
8580
135
1431
2601s
2526s
Applications 2 & 3: Multilevel
discrete-time survival analysis
• Used to model the durations until the occurrence of
events e.g. length of time to death.
• Different from standard regression due to right-censoring
and time varying covariates.
• In many applications events may be repeatable and the
outcome is duration of continuous exposure to the risk of
an event.
• In our applications cows may suffer mastitis more than
once and women can initiate and discontinue use of
contraceptives several times.
• We begin with two levels of data episodes nested within
individuals.
• With a discrete time approach we expand the data to
create a lower level – time interval, which represents a
fixed period of time and is nested within episode.
Data Expansion
• Suppose we have time intervals (e.g. days, weeks)
indexed by t = 1,...,K where K is the maximum duration
of an episode.
• Let tij denote the number of intervals for which individual i
is observed in episode j.
• Before modelling we now need to expand the data for
each episode ij to obtain tij records.
• We have ytij=0 for t=1,…,tij-1 and the response for t= tij is
1 if an episode ends in an event and 0 for censored
episodes.
• Example: Individual observed for 7 months and has
event after 4 months:
Response vector (0,0,0,1,0,0,0),
Time intervals (1,2,3,4,1,2,3)
Modelling
• Having restructured the data we can analyse the event
occurrence indicator ytij using standard methods for
clustered binary response data.
ytij ~ Bernouilli( tij )
logit( tij )  z t α  x tij β  u j
u j ~ N (0,  u2 )
• Here we model the probability of an event occurrence
πtij as a function of duration (zt) and covariates xtij with
uj being an individual specific random effect
representing unobserved time-invariant individual
characteristics.
MCMC algorithm for random effect
logistic regression models
Consider the following simple model:
yij ~ Bernoulli( ij )
logit( ij )   0  u j , u j ~ N (0,  u2 )
p(  0 )  1, p( u2 ) ~  1 ( ,  )
A standard MCMC algorithm (as used in MLwiN (Rasbash et al.
(2000), Browne (2003)) is
Step 1: Update using random walk Metropolis sampling.
0
Step 2: Update u j , j  1,...,60 using random walk Metropolis sampling.
Step 3: Update  u2 from its inverse Gamma full conditional using
Gibbs Sampling
Hierarchical centered formulation
We can reparameterise by replacing the residuals uj
with random effects u*j= β0+uj. Here the u*j are
(hierarchically) centred around β0.
yij ~ Bernoulli( ij )
logit( ij )  u *j , u *j ~ N (  0 ,  u2 )
p(  0 )  1, p( u2 ) ~  1 ( ,  )
This formulation allows Gibbs sampling to be used for
the fixed effect β0 in addition to the random effects
variance. It is interesting to consider how an MCMC
algorithm for this centered formulation can be
expressed in terms of the original parameterisation.
MCMC Algorithm
At iteration t+1
• Step 1: Update β0 from it’s normal conditional
distribution:
p(  0 | y, u* ,  u2 ) ~ N ( ˆ0 , Dˆ )
ˆ0  
u j ˆ  u2
 ,D 
.
60
j 1 60
60
(t )
0
(t )
(t )
(t 1)
Adjust u j  u j  0  0 j  1,...,60
to keep u*j fixed.
The other steps are unchanged by reparameterisation
as the random walk Metropolis sampling for u*j is
equivalent to the same step for uj and the final step
is unchanged.
Why should hierarchical centering be useful
for discrete time survival models ?
• As we have seen discrete time survival models are a
special case of random effects logistic regression
models.
• Expansion results in many higher level predictors that
can be (hierarchically) centered around.
• Hierarchical centering should improve mixing but also
speed up estimation.
• Will be particularly useful when we have many level 1
units per level 2 unit and large level 2 variance, where
the speed up will be greatest – see application 2.
• Not always useful but other potential solutions – see
application 3.
Application 2: Mastitis incidence in
dairy cattle
• Mastitis – inflammation of mammary gland of
dairy cows, usually caused by a bacterial
infection. Infections arising in dry (non-lactating)
period often result in clinical mastitis in early
lactation.
• Green et al. (2007) use multilevel survival
models to investigate how cow, farm and
management factors in the dry period correlate
with mastitis incidence.
• Data – 2 years x 52 dairy farms – 103 farm
years – 8,710 cow lactation periods following dry
periods expands to 256,382 records in total!
Mastitis model
• The Model has many predictors.
ytijk ~ Bernouilli( tijk )
logit( tijk )  z t α  x tijkβ  u jk  v0 k  v1k P arity1ijk
u jk ~ N (0,  u2 ), vk ~ MVN (0,  v )
p ( l )  1, l  0,...,3, p (  l )  1, l  1,...,15
p ( u2 ) ~  1 ( ,  ), p ( v ) ~ IW 1 (2,2  S p )
• The duration terms zt consists of an intercept plus polynomials in
log time to order 3.
• The 4 levels are farm, farm year, cow dry period and weekly obs.
although no random effects occur at the dry period level.
• The model can easily be converted to a hierarchical centered
formulation by centering around the 10 farm-year level fixed
effects
Mastitis predictors
• Predictors at dry period level (6) are :
• Parity of cow (5 categories – 4 dummies)
• Dummies for Somatic cell count high before drying off, and whether
two SCC readings were available.
• At the farm level (9):
• One dummy for cows remaining standing for 30 minutes after dry
cow treatment,
• Two dummy variables to indicate whether cubicle bedding is
disinfected in the early dry period and whether this is not applicable
due to the system used.
• Two dummy variables to indicate if transition cow cubicles are
bedded at least once daily and whether there are in fact transition
cow cubicles.
• A dummy variable as to whether the cubicle bedding is disinfected in
the transition dry period
• Three dummy variables to indicate whether the transition cubicle
feed and loaf area is scraped daily, more often than daily or doesn't
exist.
Mastitis results
• The model was run for 50,000 iterations after an
adapting period and a further burnin of 500
iterations in MLwiN.
• Due to the number of parameters in model we
will in the table that follows simply give effective
sample size (ESS) values although it should also
be noted that parameter values for the noncentred formulation were rather variable due to
the small ESS
• The estimation took ~19 hours for the noncentred formulation and 11½ hours for the
centred formulation.
Effective sample sizes
Param
N Centred Centred
Param
N Centred
Centred
α0
32
2695
β7
291
3187
α1
875
926
β8
703
3554
α2
98
75
β9
43
5359
α3
99
75
β10
44
3448
β1
1095
1058
β11
34
2674
β2
3747
2087
β12
118
2268
β3
4759
3275
β13
57
2524
β4
5165
4271
β14
288
1945
β5
2765
1438
β15
50
5183
β6
1386
1248
σ2v00
630
764
σ2v11
2439
2275
σ2v01
1555
1566
σ2u
796
1052
Method 3a: Orthogonal polynomials
• It is noticeable that the parameters α1 -α3 have small ESS. These
correspond to the predictor polynomials in log duration and the three
correlations between the predictors are -0.61, 0.79 and -0.90
respectively.
• We can without altering the rest of the analysis replace this group of
predictors via a reparameterisation that makes the predictors less
correlated and in fact we choose to make them orthogonal i.e.
256382
* *
z
 ji zki  0j, k , j  k
i 1
• To do this we can keep the first predictor and then add the rest in
one at a time solving the above at each stage.
• This results in predictors z* :
• logt, logt2+0.94logt, logt3+ 1.50logt2-2.05logt
Results of orthogonal polynomials
Fitting this model will give coefficient α* that can be easily converted
back to the coefficients for the original predictors:
α1= α*1+0.94 α *2-2.05 α *3, α2= α*2+1.50 α *3, , α3= α*3
Effective sample size comparison:
Parameter
Standard
Orthogonal
α0
2695
3337
α1
926
1997
α2
75
779
α3
75
1038
Application 3: Contraceptive
discontinuation in Indonesia
• Steele et al. (2004) use multilevel multistate models to study
transitions in and out of contraceptive use in Indonesia. Here we
consider a simplification of their model which considers only the
transition from use to non-use, commonly referred to as
contraceptive discontinuation.
• The data come from the 1997 Indonesia Demographic and Health
Survey. Contraceptive use histories were collected retrospectively
for the six-year period prior to the survey, and include information on
the month and year of starting and ceasing use, the method used,
and the reason for discontinuation.
• The analysis is based on 17,833 episodes of contraceptive use for
12,594 women, where an episode is defined as a continuous period
of using the same method of contraception.
• Restructuring the data to discrete-time format with monthly time
intervals leads to 365,205 records. To reduce the size, monthly
intervals are grouped into six-month intervals and a binomial
response is defined with denominator equal to the number of
months of use within a six-month interval. Aggregation of intervals
leads to a dataset with 68,515 records.
Model
• We here have intervals nested within episodes nested
within women.
ytij ~ Binomial(ntij ,  tij )
logit( tij )  z t α  xtijβ  u j
u j ~ N (0,  u2 ), p( u2 ) ~  1 ( ,  ).
p( l )  1, l  0,...,4, p(  l )  1, l  1,...,10
Here we include discrete categories for the duration terms zt –
a piecewise constant hazard with categories representing 6-11
months, 12-23 months, 24-35 months and >35 months with a
base category of 0-5 months.
Predictors
• At the episode level we have:
Age categorized as <25,25-34 and 34-49.
Contraceptive method categorized as
pill/injectable, Norplant/IUD, other modern and
traditional.
• At the woman level we have:
Education (3 categories).
Type of region of residence (urban/rural).
Socioeconomic status (low, medium or high).
Results
The table below gives the effective sample sizes based on runs of
250,000 iterations.
Param
N Centred Centred
Param
N Centred Centred
α0
1665
28
β4
20517
18820
α1
13405
91
β5
21118
19313
α2
11553
81
β6
2036
50
α3
11900
126
β7
2093
44
α4
10463
161
β8
16965
79
β1
13855
163
β9
6488
33
β2
15933
11840
β10
6876
55
β3
19911
20562
σ2u
14
14
Here the hierarchical centered formulation does really badly. This is
because the cluster variance σ2u is very small: estimates of 0.041 and
0.022 for the two methods
• It is well known that
hierarchical centering works
best if the cluster level
variance is substantial.
• Here we see that both the
variance is small and the
distribution of the residuals
is not very normal.
• This is due to a few women
who discontinue usage very
quickly and often, whilst
many women never
discontinue!
Std residuals
A closer look at the residuals
Normal
scores
Simple logistic regression
• We will consider first removing the random effects from
the model (due to their small variance) which will result
in a simple logistic regression model.
• It will then be impossible to perform hierarchical
centering however we will consider an extension to the
orthogonalisation performed in the previous application.
• Note that Hills and Smith (1992) talk about using
orthogonal parameterisations and Roberts and Gilks give
it one sentence in ‘MCMC in Practice’. Here we consider
it in combination with the simple (single site) random
walk Metropolis sampler where reduction of correlation in
the posterior is perhaps most important.
Method 3b:Orthogonal
parameterisation
• For simplicity assume we have all predictors in one matrix P and
that we can write ztα+xtijβ as ptijθ where θ=(α,β).
• Step 1: Number the predictors in some ordering 1,…,N.
• Step 2: Take each predictor in turn and replace it with a predictor
that is orthogonal to all the already considered predictors.
• For predictor pk. Creat e
pk*  w1,k p1  w2,k p2  ...  wk 1,k pk 1  pk
so that( pi* )T pk*  0i  k
• Note this requires solving k-1 equations in k-1 unknown w
parameters.
• A different orthogonal set of predictors results from each ordering.
Orthogonal parameterisation
• The second step of the algorithm produces both a set of
orthogonal predictors that span the same space as the
original predictors and a group of w coefficients that can
be combined to form a lower diagonal matrix W.
• We can fit this model and recover the coefficients for the
original predictors by pre-multiplication by WT.
• It is worth noting here that we use improper Uniform
priors for the coefficients and if we used proper priors we
would need to also calculate the Jacobian for the
reparameterisation to ensure the same priors are used.
• We ordered the predictors in what follows so that the
level 2 predictors were last before performing
reparameterisation.
Results
• The following is based on 50,000 iterations:
Param
Original
Orthogonal
Param
Original
Orthogonal
α0
403
11578
β4
11306
12010
α1
4249
12049
β5
9877
12023
α2
3617
11878
β6
500
10945
α3
4643
11920
β7
514
11610
α4
5260
11864
β8
5646
10591
β1
5114
12908
β9
1466
11214
β2
7787
10188
β10
1686
10249
β3
10291
8518
Here we see almost universal benefit of the orthogonal parameterisation
with virtually zero time costs and very little programming!
Combining orthogonalisation with
parameter expansion
• Combining orthogonalisation and parameter expansion
we have:
ytij ~ Binomial(ntij ,  tij )
logit( tij )  ptij*  *   v j
v j ~ N (0,  v2 ), p( ) ~ N (0,106 )
p( l* )  1, l  0,...,14, p( v2 ) ~  1 ( ,  )
so that  ( ,  )  W T  * , uk   vk and  u2   2 v2
We ran this model using WinBUGS and only 25,000
iterations following a burnin of 500 iterations which
took 34 hours compared to 23½ for 250k in MLwiN
without parameter expansion. The results are given
overleaf.
Results for full model
• Here we compare simply using the orthogonal approach in MLwiN for
250k with both orthogonal predictors and parameter expansion in
WinBUGS for 25k. Note this takes ~1.5 times as long:
Param
Orthog.
MLwiN
Param
Expan
Param
Orthog.
MLwiN
Param
Expan
α0
22714
14009
β4
23533
23488
α1
23931
22017
β5
24498
23792
α2
24136
15024
β6
23816
22546
α3
23303
4859
β7
23428
24422
α4
22457
1881
β8
22860
22995
β1
23347
22609
β9
23697
23960
β2
22779
20883
β10
23624
23383
β3
22105
14032
σ2u
20
318
Trace plots of variance chain
Here we see the far greater mixing
of the variance chain after
parameter expansion.
It is worth noting that parameter
expansion uses a different prior for
σ2u and results in an estimate of
0.059 (0.048) as opposed to 0.008
(0.006) without and earlier
estimates of 0.041(0.026) and
0.022 (0.018) before
orthogonalisation.
Note however that all estimates
bar parameter expansion are
based on very low ESS!
Before Parameter Expansion
After Parameter Expansion
Conclusions
• Hierarchical centering – as is well known - works well
when the cluster variance is big but is no good for small
variances
• Other research building on hierarchical centering
includes Papaspiliopoulus et al. (2003,2007) showing
how to construct partially non-centred parameterisations.
• Parameter expansion works well to improve mixing when
the cluster variance is small but results in a different prior
for the variance.
• Orthogonalisation of predictors appears to be a good
idea generally but is slightly more involved than the other
reparameterisations i.e. the predictors need
orthogonalising outside WinBUGS and the chains need
transforming back.
• An interesting area of further research is choosing the
order for orthogonalisation i.e. which set of orthogonal
predictors to use.
References
•
•
•
•
•
•
•
Browne, W.J. (2003). MCMC Estimation in MLwiN. London: Institute of
Education, University of London
Browne, W.J. (2004). An illustration of the use of reparameterisation
methods for improving MCMC efficiency in crossed random effect models
Multilevel Modelling Newsletter 16 (1): 13-25
Gamerman D. (1997) Sampling from the posterior distribution in generalized
linear mixed models. Statistics and Computing. 7, 57--68.
Gelfand, A.E., Sahu, S.K. and Carlin, B.P. (1995) Efficient parameterisations
for normal linear mixed models. Biometrika 83, 479--488
Gelman, A., Huang, Z., van Dyk, D., and Boscardin, W.J. (2007). Using
redundant parameterizations to fit hierarchical models. Journal of
Computational and Graphical Statistics (to appear).
Green, M.J., Bradley, A.J., Medley, G.F. and Browne, W.J. (2007) Cow,
Farm and Management Factors during the Dry Period that determine the
rate of Clinical Mastitis after calving. Journal of Dairy Science 90: 3764-3776.
Hills, S.E. and Smith, A.F.M. (1992) Parameterization Issues in Bayesian
Inference. In Bayesian Statistics 4, (J M Bernardo, J O Berger, A P Dawid,
and A F M Smith, eds), Oxford University Press, UK, pp. 227--246.
References cont.
•
•
•
•
•
•
•
Liu, C., Rubin, D.B., and Wu, Y.N. (1998) Parameter expansion to
accelerate EM: The PX-EM algorithm. Biometrika 85 (4): 755-770.
Liu, J.S., Wu, Y.N. (1999) Parameter Expansion for Data Augmentation.
Journal Of The American Statistical Association 94: 1264-1274
Papaspiliopoulos, O, Roberts, G.O. and Skold, M. (2003) Non-centred
Parameterisations for Hierarchical Models and Data Augmentation. In
Bayesian Statistics 7, (J M Bernardo, M J Bayarri, J O Berger, A P Dawid, D
Heckerman, A F M Smith and M West, eds), Oxford University Press, UK,
pp. 307--32
Papaspiliopoulos, O, Roberts, G.O. and Skold, M. (2007) A General
Framework for the Parametrization of Hierarchical Models. Statistical
Science 22, 59--73.
Rasbash, J., Browne, W.J., Healy, M, Cameron, B and Charlton, C. (2000).
The MLwiN software package version 1.10. London: Institute of Education,
University of London.
Steele, F., Goldstein, H. and Browne, W.J. (2004). A general multilevel
multistate competing risks model for event history data, with an application
to a study of contraceptive use dynamics. Statistical Modelling 4: 145--159
Van Dyk, D.A., and Meng, X-L. (2001) The Art of Data Augmentation.
Journal of Computational and Graphical Statistics. 10, 1--50.