Collaborative Centre for Data Analysis,
Modelling and Computation
QUT, GPO Box 2434
Brisbane 4001, Australia
From inference to modelling to
algorithms and back again
Kerrie Mengersen
QUT Brisbane
MCQMC 2012
Acknowledgements: BRAG
Bayesian methods and models
+ Fast computation
+ Applications in environment,
health, biology, industry
So what’s the problem?
Matchmaking 101
Study 1: Presence/absence models
Sama Low Choy
Mark Stanaway
Plant biosecurity
From inference to model
• Observations and data
– Visual inspection symptoms
– Presence / absence data
– Space and time
• Dynamic invasion process
– Growth, spread
• Inference
– Map probability of extent over time
– Useful scale for managing trade / eradication
– Currently use informal qualitative approach
• Hierarchical Bayesian model to formalise the
information
Hierarchical Bayesian model for plant
pest spread
• Data Model: Pr(data | incursion process and data parameters)
– How data is observed given underlying pest extent
• Process Model: Pr(incursion process | process parameters)
– Potential extent given epidemiology / ecology
• Parameter Model: Pr(data and process parameters)
– Prior distribution to describe uncertainty in detectability, exposure, growth …
• The posterior distribution of the incursion process (and
parameters) is related to the prior distribution and data by:
Pr(process, parameters | data) ∝ Pr(data | process, parameters) Pr(process | parameters) Pr(parameters)
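As a minimal sketch (not the actual Stanaway et al. model; the priors, likelihood and names here are illustrative assumptions), the three-level factorisation can be written as an unnormalised log-posterior for a toy site-level presence/absence setting:

```python
import numpy as np
from scipy import stats

def log_posterior(extent_prob, detect_prob, y, n_inspections):
    """Unnormalised log-posterior for a toy site-level incursion model.
    extent_prob   -- probability a site is colonised (process parameter, illustrative)
    detect_prob   -- per-inspection detection probability (observation parameter)
    y             -- array of positive detections per site
    n_inspections -- array of inspections per site
    """
    # Parameter model: illustrative Beta priors on exposure and detectability
    log_prior = (stats.beta(1, 3).logpdf(extent_prob)
                 + stats.beta(2, 2).logpdf(detect_prob))
    # Process model: latent colonisation z_i ~ Bernoulli(extent_prob)
    # Data model: y_i | z_i ~ Binomial(n_i, z_i * detect_prob); z_i marginalised out
    lik_colonised = extent_prob * stats.binom.pmf(y, n_inspections, detect_prob)
    lik_pest_free = (1.0 - extent_prob) * (y == 0)
    return log_prior + np.sum(np.log(lik_colonised + lik_pest_free))

print(log_posterior(0.2, 0.7, y=np.array([0, 2, 0]), n_inspections=np.array([3, 3, 3])))
```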
Early Warning Surveillance
• Priors based on emergency plant pest characteristics
  – exposure rate for colonisation probability
  – spread rates to link sites together for spatial analysis
• Add surveillance data
• Posterior evaluation
  – modest reduction in area freedom
  – large reduction in estimated extent
  – residual "risk" maps to target surveillance
Observation Parameter Estimates
• Taking into account
invasion process
• Hosts
– Host suitability
• Inspector efficiency
– Identify contributions
Study 2: Mixture models
Clair Alston
CAT scanning sheep
From inference to model
What proportions of the sheep
carcase are muscle, fat and
bone?
• Finite mixture model: y_i ~ Σ_j λ_j N(μ_j, σ_j²)  (sketch below)
• Include spatial information
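A minimal sketch of the component-membership ("responsibility") calculation behind such a mixture; the weights, means and intensities below are invented for illustration, not values from the sheep study:

```python
import numpy as np
from scipy.stats import norm

def responsibilities(y, weights, means, sds):
    """Posterior probability that each observation (e.g. a CT voxel value)
    belongs to each component (e.g. muscle / fat / bone)."""
    dens = np.array([w * norm.pdf(y, m, s) for w, m, s in zip(weights, means, sds)])
    return dens / dens.sum(axis=0)          # columns sum to 1

# Illustrative component settings only (not calibrated to real sheep data)
y = np.array([-40.0, 10.0, 300.0])          # Hounsfield-style intensities
r = responsibilities(y, weights=[0.5, 0.3, 0.2],
                     means=[60.0, -80.0, 400.0], sds=[30.0, 25.0, 80.0])
print(r)
```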
Inside a sheep
Study 3: State space models
Nicole White
Parkinson’s Disease
PD symptom data
• Current methods for PD subtype classification rely on a few criteria and do not permit uncertainty in subgroup membership.
• Alternative: finite mixture model (equivalent to a latent class analysis for multivariate categorical outcomes)
• Symptom data: duration of diagnosis, early onset PD, gender, handedness, side of onset
From inference to model
y_ij: subject i's response to item j
1. Define a finite mixture model based on patient responses to Bernoulli and Multinomial questions.
2. Describe subgroups w.r.t. explanatory variables
3. Obtain each patient's probability of class membership (a minimal sketch of this step follows)
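A minimal sketch of step 3 under a Bernoulli (latent class) mixture; the class weights and item probabilities are invented for illustration, not the fitted PD model:

```python
import numpy as np

def class_membership(responses, class_weights, item_probs):
    """Probability of class membership for one patient under a latent class
    (Bernoulli mixture) model.
    responses     -- 0/1 vector of J symptom indicators for the patient
    class_weights -- prior class proportions, length K
    item_probs    -- K x J matrix of Pr(response_j = 1 | class k)
    """
    responses = np.asarray(responses)
    lik = np.prod(item_probs ** responses * (1 - item_probs) ** (1 - responses), axis=1)
    post = class_weights * lik
    return post / post.sum()

# Illustrative numbers only
print(class_membership([1, 0, 1, 1],
                       class_weights=np.array([0.6, 0.4]),
                       item_probs=np.array([[0.8, 0.2, 0.7, 0.9],
                                            [0.3, 0.6, 0.4, 0.2]])))
```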
PD: Symptom data
PD Signal data: "How will they respond?"
Inferential aims
• Identify spikes and assign them to an unknown number of source neurons
• Compare clusters between segments within a recording, and between recordings at different locations (3 depths) of the brain
Microelectrode recordings
• Each recording was divided into 2.5 sec segments
• Discriminating features found via PCA
From inference to model
DP Model
y_i | θ_i ~ p(y_i | θ_i)
θ_i ~ G
G ~ DP(α, G_0)
For P principal components, y_i = (y_i1, ..., y_iP) ~ MVN(μ, Σ)
G_0 = p(μ | Σ) p(Σ)
α ~ Ga(2, 2)
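A minimal sketch of a truncated stick-breaking draw from G ~ DP(α, G_0); for simplicity the base measure here is a univariate normal rather than the (μ, Σ) base measure used in the spike-sorting model:

```python
import numpy as np

def stick_breaking_dp(alpha, base_draw, n_atoms=50, rng=None):
    """Truncated stick-breaking draw from G ~ DP(alpha, G0).
    base_draw -- function returning one draw from the base measure G0
    Returns atom locations and weights (weights sum to ~1 for large n_atoms)."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=n_atoms)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    atoms = np.array([base_draw(rng) for _ in range(n_atoms)])
    return atoms, weights

# Toy base measure G0 = N(0, 5^2) on a scalar mean; alpha ~ Ga(2, 2) as in the model
alpha = np.random.default_rng(0).gamma(2.0, 1.0 / 2.0)
atoms, w = stick_breaking_dp(alpha, lambda rng: rng.normal(0.0, 5.0), rng=1)
print(atoms[:5], w[:5])
```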
Average waveforms
Study 4: Spatial dynamic factor models
Chris Strickland
Ian Turner
What can we learn about landuse from MODIS data?
Differentiate landuse
[Figure: 1st and 2nd trend components, SDFM common cyclical component]
• 1st factor has influence on the temporal dynamics in the right half of the image (woodlands)
• 3rd factor has influence on the left half of the image (grasslands)
Matchmaking 101
Smart models
Example 1: Generalisation
Mixtures are great, but how do we choose k?
Judith Rousseau
f_0(x) = Σ_{j=1..k0} p_j^0 g_{γ_j^0}(x)
Propose an overfitted model (k > k0)
Non-identifiable!
All values of θ = (p_1^0, ..., p_k0^0, 0, γ_1^0, ..., γ_k0^0)
and all values of θ = (p_1^0, ..., p_j, ..., p_k0^0, p_{k+1}, γ_1^0, ..., γ_k0^0, γ_j^0)
with p_j + p_{k+1} = p_j^0 fit equally well.
So what?
• Multiplicity of possible solutions => the MLE does not have stable asymptotic behaviour.
• Not important when f_θ is the main object of interest, but important if we want to recover θ.
• It thus becomes crucial to know whether the posterior distribution under overfitted mixtures gives interpretable results.
Possible alternatives to
avoid overfitting
Fruhwirth-Schnatter (2006): in an overfitted mixture, either one of the
component weights is zero or two of the component parameters are equal.
• Choose priors that bound the posterior away
from the unidentifiability sets.
• Choose priors that induce shrinkage for
elements of the component parameters.
Problem: may not be able to fit the true model
Our result
Assumptions:
– L1 consistency of the posterior
– Model g is three times differentiable, regular, and
integrable
– Prior on Θ is continuous and positive, and the prior on (p_1, ..., p_k) is Dirichlet-type:
  π(p) ∝ p_1^{α_1 − 1} ... p_k^{α_k − 1}
Our result - 1
• If max_j(α_j) < d/2, where d = dim(γ), then asymptotically the posterior
π(θ | x) concentrates on the subset of parameters for which f_θ = f_0,
so the k − k0 extra components have weight 0.
• The reason for this stable behaviour, as opposed to the unstable behaviour
of the maximum likelihood estimator, is that integrating out the parameter
acts as a penalisation: the posterior essentially puts mass on the sparsest
way to approximate the true density.
Our result - 2
• In contrast, if min_j(α_j) > d/2 and k > k0, then two or more components
will tend to merge, each with non-negligible weight. This leads to less
stable behaviour.
• In the intermediate case, if min_j(α_j) ≤ d/2 ≤ max_j(α_j), the situation
varies depending on the α_j's and on the difference between k and k0.
Implications: Model dimension
• When d/2 > max{α_j, j = 1, ..., k},
  d·k0 + k0 − 1 + Σ_{j ≥ k0+1} α_j appears as an effective dimension of the model
• This is different from the number of parameters, d·k + k − 1, or from other
"effective numbers of parameters"
• Similar results are obtained for other situations (a worked example follows)
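A worked instance of the effective-dimension formula (illustrative numbers, not taken from the talk): univariate normal components with unknown mean and variance, so d = dim(γ) = 2, with k0 = 2, k = 3 and α_j = 0.5 (so max α_j < d/2 = 1):

```latex
\[
  \underbrace{d k_0 + k_0 - 1 + \textstyle\sum_{j \ge k_0+1} \alpha_j}_{\text{effective dimension}}
  = 2\cdot 2 + 2 - 1 + 0.5 = 5.5,
  \qquad
  \underbrace{d k + k - 1}_{\text{number of parameters}} = 2\cdot 3 + 3 - 1 = 8 .
\]
```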
Example 1
y_i ~ N(0,1); fit p N(μ_1, 1) + (1 − p) N(μ_2, 1)
d = 1; α_1 = α_2 = 1 > d/2
Example 2
y_i ~ N(0,1)
fit p N_2(μ_1, Σ_1) + (1 − p) N_2(μ_2, Σ_2), Σ_j diagonal
d = 3; α_1 = α_2 = 1 < d/2
Conclusions
• The result validates the use of Bayesian estimation
in mixture models with too many components.
• It is one of the few examples where the prior can
actually have an impact asymptotically, even to
first order (consistency) and where choosing a less
informative prior leads to better results.
• It also shows that the penalisation effect of integrating out the parameter,
as considered in the Bayesian framework, is useful not only in model choice
or testing contexts but also in estimation contexts.
Example 2: Empirical likelihoods
Christian Robert
• Sometimes the likelihood associated with the data
is not completely known or cannot be computed in
a manageable time (eg population genetic models,
hidden Markov models, dynamic models), so
traditional tools based on stochastic simulation
(eg, regular MCMC) are unavailable or unreliable.
• Eg, biosecurity spread model.
Model alternative: ELvIS
• Define parameters of interest as functionals of the cdf
F (eg moments of F), then use Importance Sampling
via the Empirical Likelihood.
• Select the F that maximises the likelihood of the data
under the moment constraint.
• Given a constraint of the form E_F(h(Y)) = θ, the EL is defined as
  L_el(θ | y) = max_F Π_{i=1:n} {F(y_i) − F(y_{i−1})}
• For example, in the 1-D case when θ = E(Y), the empirical likelihood at θ is
the maximum of Π_{i=1:n} p_i over (p_1, ..., p_n) under the constraint
Σ_{i=1:n} p_i y_i = θ (a sketch for this case follows)
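A minimal sketch of that 1-D case (profiling the Lagrange multiplier numerically); this is a generic empirical-likelihood computation for the mean, not the ELvIS sampler itself:

```python
import numpy as np
from scipy.optimize import brentq

def log_empirical_likelihood_mean(y, theta):
    """log EL ratio for the mean: maximise sum_i log(n p_i) over p_i >= 0
    with sum p_i = 1 and sum p_i y_i = theta (via the Lagrange dual)."""
    d = np.asarray(y) - theta
    if d.min() >= 0 or d.max() <= 0:
        return -np.inf                       # theta outside the convex hull of the data
    # The multiplier lam solves sum_i d_i / (1 + lam d_i) = 0 on the interval
    # where all 1 + lam d_i > 0
    eps = 1e-8
    lo, hi = -1.0 / d.max() + eps, -1.0 / d.min() - eps
    score = lambda lam: np.sum(d / (1.0 + lam * d))
    lam = brentq(score, lo, hi)
    p = 1.0 / (len(d) * (1.0 + lam * d))     # optimal weights
    return np.sum(np.log(len(d) * p))

y = np.random.default_rng(0).normal(3.0, 2.0, size=100)
print(log_empirical_likelihood_mean(y, theta=3.0))   # near 0 at a well-supported theta
```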
Quantile distributions
• A quantile distribution is defined by a closed-form
quantile function F-1(p;q) and generally has no
closed form for the density function.
• Properties: very flexible, very fast to simulate (simple inversion of the
uniform distribution)
• Examples: 3/4/5-parameter Tukey's lambda distribution and generalisations;
Burr family; g-and-k and g-and-h distributions
g-and-k quantile distribution
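A minimal sketch of the g-and-k quantile function (in the usual Rayner and MacGillivray form with c = 0.8) and simulation by inversion; the parameter values echo the two settings used below:

```python
import numpy as np
from scipy.stats import norm

def gk_quantile(u, A, B, g, k, c=0.8):
    """Quantile function of the g-and-k distribution."""
    z = norm.ppf(u)
    return A + B * (1 + c * np.tanh(g * z / 2.0)) * (1 + z ** 2) ** k * z

def r_gk(n, A, B, g, k, rng=None):
    """Simulate by inversion: pass uniforms through the closed-form quantile function."""
    rng = np.random.default_rng(rng)
    return gk_quantile(rng.uniform(size=n), A, B, g, k)

x = r_gk(1000, A=3.0, B=2.0, g=1.0, k=0.5, rng=0)   # Allingham's choice of parameters
y = r_gk(1000, A=0.0, B=1.0, g=0.0, k=0.0, rng=1)   # reduces to a standard N(0,1)
```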
Methods for estimating a
quantile distribution
• MLE using numerical approximation to the
likelihood
• Moment matching
• Generalised bootstrap
• Location and scale-free functionals
• Percentile matching
• Quantile matching
• ABC
• Sequential MC approaches for multivariate
extensions of the g-and-k
ELvIS in practice
• Two values of θ = (A, B, g, k):
  θ = (0, 1, 0, 0): standard normal distribution
  θ = (3, 2, 1, 0.5): Allingham's choice
• Two priors for θ:
  U(0,5)^4
  A ~ U(−5,5), B ~ U(0,5), g ~ U(−5,5), k ~ U(−1,1)
• Two sample sizes:
  n = 100
  n = 1000
ELvIS in practice: θ = (3, 2, 1, 0.5), n = 100
Matchmaking 101
A wealth of algorithms!
From model to algorithm
Chris Strickland
Models:
• Logistic regression
• Non-Gaussian state space models
• Spatial dynamic factor models
Evaluate:
• Computation time
• Maximum bias
• sd
• Inefficiency factor (IF)
• Accuracy rate
Logistic Regression
k = 2, 4, 8, 20 covariates; n = 1000 observations
• Importance sampling (IS):
  – E[h(β)] = ∫ h(β) [p(β|y) / q(β|y)] q(β|y) dβ
    with candidate q(β|y) ∝ exp(−0.5 (β − β*)^T V^{-1} (β − β*))
  – β* is the MLE (mode found by IRWLS);
    V is the inverse of the negative Hessian of log p(y|X,β) at β = β*
    (a sketch of this scheme follows this list)
• Random walk Metropolis-Hastings (RWMH):
  – same proposal distribution
• Adaptive RWMH (Garthwaite, Yan & Sisson):
  – only needs starting values (β*) – the easiest candidate!
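A minimal sketch of the IS scheme above, assuming a flat prior so the posterior is proportional to the likelihood; this is illustrative code, not the benchmark implementation:

```python
import numpy as np

def log_like(beta, X, yvec):
    """Logistic regression log-likelihood."""
    eta = X @ beta
    return np.sum(yvec * eta - np.log1p(np.exp(eta)))

def irls_mle(X, yvec, iters=25):
    """Mode beta* and curvature V (inverse observed information) by IRWLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None])                      # observed information
        beta = beta + np.linalg.solve(H, X.T @ (yvec - p))
    return beta, np.linalg.inv(H)

def importance_sample(X, yvec, n_draws=5000, seed=0):
    """Self-normalised IS estimate of the posterior mean with a Gaussian candidate."""
    beta_star, V = irls_mle(X, yvec)
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(beta_star, V, size=n_draws)
    diff = draws - beta_star
    log_q = -0.5 * np.einsum('ij,jk,ik->i', diff, np.linalg.inv(V), diff)  # up to a constant
    log_w = np.array([log_like(b, X, yvec) for b in draws]) - log_q
    w = np.exp(log_w - log_w.max())
    return (w[:, None] * draws).sum(axis=0) / w.sum()

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
yv = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * X[:, 1]))))
print(importance_sample(X, yv))
```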
Results

Algorithm   k    time   bias   sd     IF    acc. rate
IS          2    5.1    0.07   0.06   -     -
IS          8    5.6    0.18   0.09   -     -
IS          20   6.3    0.30   0.11   -     -
RWMH        2    8.1    0.08   0.07   12    0.56
RWMH        8    8.8    0.15   0.10   30    0.20
RWMH        20   9.2    0.20   0.12   120   0.03
ARWMH       2    10.5   0.08   0.07   10    0.24
ARWMH       8    10.8   0.20   0.07   35    0.23
ARWMH       20   12.1   0.30   0.12   50    0.21
So what?
• If k is small (e.g. 8), even a naïve candidate is ok.
• When k is larger (e.g. 20), we need something
more intelligent, e.g. adaptive MH
• As models become more complicated, we need to
get more sophisticated
Non-Gaussian state space model
Importance sampler
vs
MCMC
vs
Particle filter
vs
Laplace approximation (INLA)
Non-Gaussian state space model
• Importance sampler
√ General algorithms (Durbin&Koopman 2001 - global
approximation) and
Tailored algorithms (Liesenfeld&Richard 2006 – local
approximation)
√ Independence sampler - don’t have to worry about
correlated draws
√ Parallelisable – potentially much faster
× Difficult to come up with a good candidate distribution
× More difficult as the sample and model complexity
increase
× More complex than MCMC to obtain non-standard
expectations
Non-Gaussian state space model
• MCMC
√ Very flexible w.r.t extensions to other models
√ The same algorithms can be used as the sample size and/or
parameter dimension grows (with some provisos)
× Can be slow, and complicated to achieve good acceptance
and mixing
× Single-move samplers perform poorly in terms of mixing (Pitt & Shephard; Kim, Shephard & Chib)
√ Very efficient (mixing) algorithms can be designed for specific problems, e.g. stochastic volatility (Kim, Shephard & Chib)
√ General approaches available – simple reparametrisation can lead to vastly improved simulation efficiency (Strickland et al.)
Non-Gaussian state space model
• Particle filters
√ Easy to implement – at least intuitively
√ Updating estimates as the sample size grows is possibly a lot
simpler and cheaper than a full MCMC or IS sampler update
× Perhaps not as easy as it appears when parameter learning is required
× Particle approximation degenerates – may need MCMC
steps, or alternatives
× To do full MCMC updates, need to store the entire history of
particle approximation
Non-Gaussian state space model
• Integrated Nested Laplace Approximation (INLA)
√ Extremely fast
√ If model complexity stays the same, then it can work for
very large problems
√ R interface to code
× Can only handle the small number of hyperparameters that can be forced into
the GMRF approximation, so the applicable problem set is very restricted
So what?
• Many algorithms: general versus specific, flexible
versus tailored
• Pros and cons should be weighed against
inferential aims, model and computational
resources
• Blocking and reparametrisation are two good
tricks, but we need to be clever about non-centred
reparametrisation
Spatial Dynamic Factor Models
Y_t = B f_t + ε_t,  ε_t ~ N(0, Σ), Σ diagonal
factor loadings b_j ~ N(0, V(s))
• Spatial correlation:
  – Lopes et al. use a GRF: O((p × k*)³)
  – Strickland et al. use a GMRF (images: large discrete spatial domain)
    + Krylov subspace methods to sample from the GMRF posterior; scales linearly, O(p × k*)
      (a toy sparse-solve illustration follows this list)
  – Rue uses a Cholesky decomposition; more complex, O((p × k*)^(3/2))
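A toy illustration (an assumption-laden sketch, not the Strickland et al. implementation) of why Krylov methods pay off: for a 30 × 30 lattice GMRF (900 pixels, as in the case study below) the precision matrix is sparse, and a conjugate-gradient solve needs only cheap sparse matrix-vector products:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

n = 30                                                    # 30 x 30 lattice, 900 pixels
ones = np.ones(n)
D = sp.diags([-ones[:-1], 2 * ones, -ones[:-1]], [-1, 0, 1])   # 1-D random-walk precision
I = sp.identity(n)
# First-order lattice GMRF precision plus a small ridge to keep it well conditioned
Q = sp.kron(I, D) + sp.kron(D, I) + 0.01 * sp.identity(n * n)

b = np.random.default_rng(0).normal(size=n * n)
x, info = cg(Q, b)          # Krylov solve: only sparse matrix-vector products needed
assert info == 0
```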
So what?
Difference between (i) O(p × k*) and (ii) O((p × k*)³):
if the data set becomes 100 times larger, (ii) will take 1 million times
longer, compared to 100 times longer for (i). Instead of waiting 1 hour,
you might have to wait more than 1 year…
Case study
• Spatial domain: 900 pixels
• Temporal: approx 200 periods (every 16 days)
• Total 180,000 observations, ~3600 parameters
10 mins for 15000 MCMC iterations (on a laptop!)
Conclusions
• The model is too complicated and the datasets too
large for IS or PF.
• There are too many parameters for INLA.
• MCMC is thus desirable, but it is extremely
important to choose good algorithms!
As the model becomes more complex, the choice of
smart algorithms becomes more important and can
make a large difference in computation and
estimation
Matchmaking 102
Smart algorithms
Hybrid algorithms
Tierney (1994)
Design an efficient algorithm that
combines features of other
algorithms, in order to overcome
identified weaknesses in the
component algorithms
Comparing algorithms
• Accuracy
– bias (H)
• Efficiency
– rate of convergence, rate of acceptance (A)
– mixing speed (integrated autocorrelation time) τ_p(H)
• Applicability
– simplicity of set-up, flexibility of tailoring
• Implementation
– coding difficulty, memory storage
– computational demand:
  • total number of iterations (T), burn-in (T0)
  • correlation along the chain, measured by the effective sample size
    ESS_H = (T − T0) / τ_p(H)   (computed in the sketch below)
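A minimal sketch of how the integrated autocorrelation time and ESS_H can be estimated from a chain (a simple truncation rule; more robust estimators exist):

```python
import numpy as np

def integrated_autocorr_time(chain, max_lag=None):
    """Estimate tau = 1 + 2 * sum_k rho_k by summing sample autocorrelations
    until they first drop below zero."""
    x = np.asarray(chain) - np.mean(chain)
    n = len(x)
    acf = np.correlate(x, x, mode='full')[n - 1:] / (np.var(chain) * n)
    tau = 1.0
    for k in range(1, max_lag or n // 2):
        if acf[k] <= 0:
            break
        tau += 2.0 * acf[k]
    return tau

def effective_sample_size(chain, burnin=0):
    post = np.asarray(chain)[burnin:]
    return len(post) / integrated_autocorr_time(post)    # ESS_H = (T - T0) / tau_p(H)

# Example: a strongly autocorrelated AR(1) chain
rng = np.random.default_rng(0)
chain = np.zeros(20000)
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()
print(effective_sample_size(chain, burnin=1000))
```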
Hybrid algorithms - 1
Kate Lee, Christian Robert, Ross McVinish
• Metropolis-Hastings algorithms (MHAs)
  – Improve mixing speed via:
    • parallel chains (MHA)
    • repulsive proposal (MHA-RP, M&R), pinball sampler
    • delayed rejection (DRA, Tierney & Mira; DRA-LP; DRA-Pinball)
  – Improve applicability by:
    • reversible jump (Green)
    • Metropolis-adjusted Langevin (MALA; Tierney & Roberts, Besag & Green)
      (one MALA step is sketched after this list)
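A minimal sketch of one MALA step, using a finite-difference gradient and the two-mode Gaussian-mixture target from the simulation study below; the step size eps is illustrative (as the results note, too small a scale can trap the chain in the nearest mode):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def mala_step(x, log_pi, grad_log_pi, eps, rng):
    """One Metropolis-adjusted Langevin step targeting log_pi."""
    drift = lambda z: z + 0.5 * eps ** 2 * grad_log_pi(z)
    prop = drift(x) + eps * rng.normal(size=x.shape)
    # Gaussian proposal densities q(prop | x) and q(x | prop), up to the same constant
    log_q_fwd = -np.sum((prop - drift(x)) ** 2) / (2 * eps ** 2)
    log_q_bwd = -np.sum((x - drift(prop)) ** 2) / (2 * eps ** 2)
    log_alpha = log_pi(prop) - log_pi(x) + log_q_bwd - log_q_fwd
    return prop if np.log(rng.uniform()) < log_alpha else x

# Toy target: the well-separated two-component normal mixture from the study
log_pi = lambda z: np.log(0.5 * mvn.pdf(z, [0, 0]) + 0.5 * mvn.pdf(z, [5, 5]))
grad = lambda z, h=1e-5: np.array([(log_pi(z + h * e) - log_pi(z - h * e)) / (2 * h)
                                   for e in np.eye(2)])
rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(1000):
    x = mala_step(x, log_pi, grad, eps=0.8, rng=rng)
```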
Simulation study
• Model
  – mixture of 2-D normals, well separated:
    p = 0.5 N([0,0]^T, I_2) + 0.5 N([5,5]^T, I_2)
  – 100 replicated simulations
  – each result is obtained after running the algorithm for 1200 seconds using 10 particles
  – proposal variance = 4; value in brackets is MSE (similar results for variance = 2)
• Platform
  – Matlab version 7.0.4
  – run on an SGI Altix XE cluster containing 112 x 64-bit Intel Xeon cores
Results
T = no. of simulations; A = acceptance rate; H = accuracy; σ²_H = var(H);
τ_p(H) = autocorrelation time; ESS_H = effective sample size
Results
• MHA
– shortest CPU time per iteration and largest sample size
– need to tune proposal variance to optimise performance
• MALA
– can get trapped in nearest mode if the scaling parameter in the
variance of the proposal is not sufficiently large
• MHA with RP
– induces a fast mixing chain, but need to choose tuning parameter
– expensive to compute
– in rare cases the algorithm can be unstable and get stuck
• DRA
– less correlated chains, higher acceptance rate
– higher computational demand
– Langevin proposal: improves mixing, but the loss in computational efficiency overwhelms the gain in statistical efficiency
– A normal random-walk proposal is faster to compute and still improves mixing
Hybrid algorithms - 2
• Population Monte Carlo (PMC)
  – Extension of IS, allowing the importance function to adapt to the target in an iterative manner (Cappé et al.)
    (a basic PMC sketch follows this list)
  – PMC with repulsive proposal: create 'holes' around existing particles
• Particle systems and MCMC
  – IS + MCMC (M&R)
  – parallel MCMC + SMC (del Moral)
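A basic PMC sketch in the spirit of Cappé et al.: propose from kernels centred at the current particles, weight by target over (mixture) proposal, resample. The kernel scale is fixed here, whereas full PMC adapts it between iterations:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def pmc(log_target, n_particles=50, n_iter=20, scale=2.0, dim=2, seed=0):
    """Basic Population Monte Carlo iteration: propose, weight, resample."""
    rng = np.random.default_rng(seed)
    parts = rng.normal(scale=5.0, size=(n_particles, dim))     # diffuse initial cloud
    cov = scale ** 2 * np.eye(dim)
    for _ in range(n_iter):
        props = parts + rng.multivariate_normal(np.zeros(dim), cov, size=n_particles)
        # importance weight: target density over the equal-weight kernel mixture
        q = np.array([np.mean([mvn.pdf(p, mu, cov) for mu in parts]) for p in props])
        log_w = np.array([log_target(p) for p in props]) - np.log(q)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)   # multinomial resampling
        parts = props[idx]
    return props, w                                            # final weighted sample

# The two-mode target used in the simulation study
log_target = lambda z: np.log(0.5 * mvn.pdf(z, [0, 0]) + 0.5 * mvn.pdf(z, [5, 5]))
samples, weights = pmc(log_target)
```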
Simulation study repeated
• 100 replicates, run for 500 seconds using 50
particles, with first 100 iterations ignored
√ accuracy of estimation
√ fast exploration (significantly reduced
integrated autocorrelation time)
√ No instability (unlike MCMC algorithms)
√ Less sensitivity to importance function
√ Repulsive effect improved mixing
Summary
Relative performance compared to MHA and PMC

             Statistical efficiency    Computation    Applicability
Algorithm    EPM    CR    RC           CE     SP      FH     CP    Mode
MALA          1      1     1           -2     -1       0     -1     S
MHARP         0      1     0           -2     -1      -1     -2     B
DRA           1      1     0           -1     -1       0      0     B
DRALP         2      1     0           -2     -1       0      0     B
DRAPinball    2      1     0           -2     -1       0      0     B
PS            2      1     0           -3     -2      -1     -2     B
PMCR          2      0                 -1     -1      -1      0     B

EPM = efficiency of proposal move; CR = correlation reduction of chain;
RC = rate of convergence; CE = cost effectiveness; SP = simplicity of programming;
FH = flexibility of hyperparameters; CP = consistency of performance;
Mode = preference between a single-mode and a multimodal problem
Hybrid algorithms - 3
ABCel
Unlike ABC, ABCel does not require:
Conclusions
1. Combining features of individual algorithms
may lead to complicated characteristics in a
hybrid algorithm.
2. Each individual algorithm may have a strong
individual advantage with respect to a particular
performance criterion, but this does not
guarantee that the hybrid method will enjoy a
joint benefit of these strategies.
3. The combination of algorithms may add
complexity in set-up, programming and
computational expense.
Implementing smart algorithms:
PyMCMC
Chris Strickland
• Python package for fast MCMC
• takes advantage of Python libraries Numpy, Scipy
• Classes for Gibbs, M-H, orientational bias MC,
slice samplers, etc.
• linear (with stochastic search), logit, probit, log-linear, linear mixed model,
probit mixed model, nonlinear mixed models, mixture, spatial mixture,
spatial mixture with regressors, time series suite (including DFM and SDFM)
• Straightforward to optimise, extensible to C or
Fortran, parallelisable (GPU)
PyMCMC
[email protected]
Matchmaking algorithm!
[Diagram linking Inference, Model and Algorithm: past, future, cool problems]
Key References
• Lee, K., Mengersen, K., Robert, C.P. (2012) Hybrid models. In Case Studies in Bayesian Modelling, eds Alston, Mengersen, Pettitt. Wiley, to appear.
• Stanaway, M., Reeves, R., Mengersen, K. (2010) Hierarchical Bayesian modelling of early detection surveillance for plant pest invasions. J. Environmental and Ecological Statistics.
• Strickland, C., Simpson, D., Denham, R., Turner, I., Mengersen, K. Fast methods for spatial dynamic factor models. CSDA.
• Strickland, C., Alston, C., Mengersen, K. (2011) PyMCMC. J. Statistical Software. Under review.
• White, N., Johnson, H., Silburn, P., Mengersen, K. Unsupervised sorting and comparison of extracellular spikes with Dirichlet Process Mixture Models. Annals of Applied Statistics. Under review.