Lecture 2 - Eis.bris.ac.uk

Download Report

Transcript Lecture 2 - Eis.bris.ac.uk

Lecture 9
Model Comparison using MCMC
and further models
Lecture Contents
•
•
•
•
•
•
•
Model comparison
DIC diagnostic
Random slopes regression model
Priors for variance matrices
MLwiN RSR demonstration
Other predictor variables
DIC in WinBUGS
Bayesian Model Comparison
• In Bayesian statistics model comparison is a thorny
issue!!
• In MLwiN we used to suggest running IGLS for model
selection then MCMC on your chosen model.
Why is it a thorny issue?
• The posterior f(θ|Y) does not allow criticism of the model
in light of the observed data nor comparison amonst
models.
• It is f(Y) that can be used to assess model performance.
• Regardless of the model, f(Y) is a density over the space
of observables which can be compared with what was
actually observed.
Bayes Factors
• If we observe YOBS and have 2 models M1 and
M2 then the Bayes Factor is
f (YOBS | M 1 )
BF 
f (YOBS | M 2 )
.
This provides the relative weight of evidence for
model M1 compared to model M2.
Rough calibration of the Bates factor has been
proposed:
BF
<1
1-3
3-12
12-150
>150
Evidence
-ve
weak
+ve
Strong
V. Strong
Problems with Bayes Factor
1. When prior is vague -> f(θ) is improper
This implies that even though f(θ |Y) may be
proper, f(Y) is improper so Bayes Factors cannot
be used!
2. Computation of the Bayes factor itself requires
high-dimensional integration.
3. Lindley’s paradox – data points to rejection but
prior is diffuse so denominator of Bayes factor
much smaller than numerator and too much
weight given to parsimonious models.
Other related ideas
• Prior predictive distributions f(Y).
• Cross-validation predictive distributions
F(yr|Y(r)).
• Posterior predictive distributions f(Y|Yobs).
• Model uncertainty – where the model is
itself a parameter to be estimated.
• Bayesian model averaging.
• Reversible jump MCMC.
Model Comparison for random
effect models
• As we will typically use diffuse priors,
Bayes factors are not an option here.
• The methods listed previously are
possibilities but not built into software
packages.
• The Deviance Information Criterion (DIC)
is one possibility but is it a saviour for
Bayesian model choice or a white
elephant?
DIC – Spiegelhalter et al. (2002)
Plus points:
1. Discussion paper proposing it written by
leading figures in Bayesian modelling.
2. Available in both MLwiN and WinBUGS
for standard models
Minus points:
The paper was given a very mixed reception
at the RSS when it was discussed!
DIC
A natural way to compare models is to use a criterion
based on a trade-off between the fit of the data to the
model and the corresponding complexity of the model.
DIC does this in a Bayesian way.
DIC = ‘goodness of fit’ + ‘complexity’.
Fit is measured by deviance
D( )  2 log L(data |  )
Complexity is measured by an estimate of the ‘effective
number of parameters’ defined as
pD  E | y [ D]  D( E | y [ ])
 D  D( )
i.e. Posterior mean deviance minus the deviance
evaluated at the posterior mean of the parameters.
DIC (continued)
• The DIC is then defined analagously to AIC as
DIC  D( )  2 pD
 D  pD
Models with smaller DIC are better supported by
the data.
• DIC can be monitored in WinBUGS from
Inference/DIC menu.
• DIC is available in MLwiN under the
Model/MCMC menu.
Education dataset
• We can fit a simple
(Bayesian) linear
regression in MLwiN
• The DIC output is as
follows:
Param
Dbar
9763.54
D(thetabar)
9760.51
PD
3.02
DIC
9766.56
← Note PD ~ 3 = the actual
number of parameters
Variance components model
Here we consider the
random intercepts
model from earlier
practicals
This is the parallel lines model
Change in DIC
Model
Dbar
Dthetabar
PD
DIC
Regression 9763.54
9760.51
3.02
9766.56
VC
9149.16
59.98 9269.13
9209.15
Here we see the clear improvement in fitting
random effects for school.
Note that the effective number of parameters is
~60 compared with 68 actual parameters in the
dataset due to random rather than fixed school
effects.
Random slopes model (crossing lines)
yij   0 j  1 j xij  eij
 0 j   0  u0 j
1 j  1  u1 j
school 1
u0 j 

  u20


:
)

,
0
(
N
~
 

u
u
2 
 u 01  u1 
u1 j 
eij ~
u1,1
N (0, e2 )
0 + 1x1ij
u0,1
u0,2
-3
0
u1,2
1
school 2
3
Fitting an RSR in a Bayesian
Framework
The basic random slopes regression model is as
follows:
yij   0 j  1 j xij  eij
 0 j   0  u0 j
1 j  1  u1 j
u0 j 
  u20

  ~ N (0, u ) : u  
2 
 u 01  u1 
u1 j 
eij ~ N (0, e2 )
To this model we need to add priors for
 , , and u .
2
e
Wishart priors
For a (kxk) variance matrix parameter in a
Normal likelihood the conjugate prior is the
inverse Wishart distribution with parameters ν
and S

p ()  2
S
 /2

vk / 2

k ( k 1) / 4
 (  k 1) / 2

k
i 1
(
 1i
2

1
)
exp( 12 tr ( S 1 ))
This distribution looks complex but is simply a
multivariate generalisation of the inverse
Gamma distribution.
Wishart prior for Ωu-1
In MLwiN we use an inverse Wishart prior for
the precision matrix:
ˆ )
p(u1 ) ~ Wishart p ( p, p
u
ˆ is a priorestimateof 
where 
u
u
Note this is a (weakly informative) prior as the first
parameter represents the prior sample size and is
set to the smallest feasible value. Browne and
Draper have looked at alternative Wishart priors as
well as a Uniform prior and performed simulations.
Gibbs Sampling algorithm for RSR
model
Repeat the following four steps
• 1. Generate β from its (Multivariate)
Normal conditional distribution.
• 2. Generate each uj from its (Multivariate)
Normal conditional distribution.
• 3. Generate Ωu-1 from its Wishart
conditional distribution.
• 3. Generate 1/σe2 from its Gamma
conditional distribution
Bayesian RSR Model for education
dataset
Note IGLS estimates used in prior. Variance (posterior
mean) estimates bigger than IGLS estimates.
DIC for RSR model
Model
Dbar
Dthetabar
PD
DIC
RSR
9122.99
9031.32
91.67
9214.65
VC
9209.15
9149.16
59.98
9269.13
As with the frequentist approach the random slopes
model is an improvement over the random intercepts
model. The additional 65 random parameters only add
32 effective parameters
Trajectories for the RSR model
MCMC Diagnostics for Ωu00
Predictions for the RSR model with
highlighted data
• Here the top and bottom school are highlighted:
Residuals for the RSR
Individually:
and pairwise:
Uniform Priors
Here the level 2 variance
estimates increase as in
Browne and Draper (2000)
Browne and Draper found that
the Wishart priors were
preferable although the
use of the IGLS estimate is
not strictly Bayesian as we
are using the data twice!
Other predictors in the education
dataset
This dataset has other predictors such as
gender and school gender that can be
considered in the practical.
In the next slide we see the equations
window for a model with these added
which has DIC 9189.26 a reduction of over
25 on the earlier RSR model
RSR + gender effects
WinBUGS RSR & gender
model
{
# Level 1 definition
for(i in 1:N) {
normexam[i] ~ dnorm(mu[i],tau)
mu[i]<- beta[1] * cons[i]
+ beta[2] * standlrt[i]
+ beta[3] * girl[i]
+ beta[4] * boysch[i]
+ beta[5] * girlsch[i]
+ u2[school[i],1] * cons[i]
+ u2[school[i],2] * standlrt[i]
}
# Higher level definitions
for (j in 1:n2) {
u2[j,1:2] ~ dmnorm(zero2[1:2],tau.u2[1:2,1:2])
}
# Priors for fixed effects
for (k in 1:5) { beta[k] ~ dflat() }
# Priors for random terms
tau ~ dgamma(0.001000,0.001000)
sigma2 <- 1/tau
for (i in 1:2) {zero2[i] <- 0}
tau.u2[1:2,1:2] ~ dwish(R2[1:2, 1:2],2)
sigma2.u2[1:2,1:2] <- inverse(tau.u2[,])
}
← Here we see the
WiNBUGS code for our
last model.
Notice how MVN and
Wishart distributions are
specified in WinBUGS
DIC in WinBUGS
In WinBUGS DIC is available from the
Inference menu:
The DIC is set after the burnin and then the
DIC button is pressed after running giving:
Dbar = post.mean of -2logL; Dhat = -2LogL at post.mean of
stochastic nodes
Dbar
Dhat
pD
DIC
Normexam 9098.590 9007.960
90.631 9189.220
total
9098.590 9007.960
90.631 9189.220
Parameter estimates in WinBUGS
• Note that here we see that WinBUGS gives similar
estmates as MLwiN for the model. Note that for the fixed
effects β that WinBUGS indexes from 1 while MLwiN
indexes from 0.
node
beta[1]
beta[2]
beta[3]
beta[4]
beta[5]
sigma2
sigma2.u2[1,1]
sigma2.u2[1,2]
sigma2.u2[2,2]
mean
-0.19
0.5539
0.1687
0.1775
0.175
0.5511
0.08777
0.02141
0.01545
sd
0.05236
0.01971
0.03399
0.1008
0.08212
0.0125
0.01885
0.00719
0.00460
(2.5%, 97.5%)
(-0.2996, -0.08937)
(0.5145, 0.5911)
(0.1019, 0.2349)
(-0.02581, 0.3781)
(0.004677, 0.3318)
(0.5272,0.576)
(0.05745, 0.1305)
(0.009262,0.0372)
(0.008271,0.02603)
Next Practical
The next practical is free ranging:
• You can follow the MLwiN chapter on RSR
models that is given.
• You can try out RSR models in WinBUGS.
• You can try out fitting random effect
models to the orthodont dataset using
MCMC.
• You can try out DIC on other models.