Towards Likelihood Free Inference Tony Pettitt QUT, Brisbane [email protected] Joint work with Rob Reeves CRICOS No.

Download Report

Transcript Towards Likelihood Free Inference Tony Pettitt QUT, Brisbane [email protected] Joint work with Rob Reeves CRICOS No.

Towards Likelihood Free
Inference
Tony Pettitt
QUT, Brisbane
[email protected]
Joint work with Rob Reeves
CRICOS No. 000213J
Queensland University of Technology
Outline
1.
2.
3.
4.
5.
6.
7.
Some problems with intractable likelihoods.
Monte Carlo methods and Inference.
Normalizing constant/partition function.
Likelihood free Markov chain Monte Carlo.
Approximating Hierarchical model
Indirect Inference and likelihood free MCMC
Conclusions.
a university for the
real world
R
CRICOS No. 000213J
Stochastic models (Riley et al, 2003)
Macroparasite within a host.
Juvenile worm grows to adulthood in a cat.
Host fights back with immunity.
Number of Juveniles, Adults and amount of Immunity (all integer).
J (t ), A(t ), I (t ) evolve through time according to Markov process
unknown parameters, eg
Juvenile → Adult rate of maturation
Immunity changes with time
Juveniles die due to Immunity
Moment closure approximations for distribution of ( J (t ), A(t ), I (t ))
limited to restricted parameter values.
a university for the
real world
R
CRICOS No. 000213J
Numerical computation of pr ( J , A, I ) limited by small maximum values
of J, A, I.
Can simulate ( J (t ), A(t ), I (t )) process easily.
Data: J at t=0 and A at t (sacrifice of cat), replicated with several cats
Source: Riley et al, 2003.
a university for the
real world
R
CRICOS No. 000213J
Other stochastic process models include
spatial stochastic expansion of species
(Hamilton et al, 2005; Estoup et al, 2004)
birth-death-mutation process for estimating transmission rate from
TB genotyping
(Tanaka et al, 2006)
population genetic models, eg coalescent models
(Marjoram et al 2003)
Likelihood free Bayesian MCMC methods are often employed with
quite precise priors.
a university for the
real world
R
CRICOS No. 000213J
Normalizing constant/partition function problem.
The algebraic form of the distribution for y is known but it is not
normalized, eg Ising model


p( y |  )  exp 0  yi  1  Ind ( yi  y j ) 
i~ j
 i

For yi {1,1} i  1, , n and i ~ j means neighbours (on a lattice,
say). The normalizing constant involves in general a sum over 2n
terms.
Write p ( y |  ) 
a university for the
f ( y;  )
known

z ( )
unknown
real world
R
CRICOS No. 000213J
N-S and E-W neighbourhood
a university for the
real world
R
CRICOS No. 000213J
Outline
1.
2.
3.
4.
5.
6.
7.
Some problems with intractable likelihoods.
Monte Carlo methods and Inference.
Normalizing constant/partition function.
Likelihood free Markov chain Monte Carlo.
Approximating Hierarchical model
Indirect Inference and likelihood free MCMC
Conclusions.
a university for the
real world
R
CRICOS No. 000213J
Monte Carlo methods and Inference.
Intractable likelihood, instead use easily simulated values of y.
Simulated method of moments (McFadden, 1989).
Method of estimation: comparing theoretical moments or
frequencies with observed moments or frequencies.
Can be implemented using a chi-squared goodness-fit-statistic, eg
Riley et al, 2003. Data: number of adult worms in cat at sacrifice.
a university for the
real world
R
CRICOS No. 000213J
Plot of goodness-of-fit statistic versus parameter.
Greedy Monte Carlo. Precision of estimate?
Source: Riley et al 2003.
a university for the
real world
R
CRICOS No. 000213J
Outline
1.
2.
3.
4.
5.
6.
7.
Some problems with intractable likelihoods.
Monte Carlo methods and Inference.
Normalizing constant/partition function.
Likelihood free Markov chain Monte Carlo.
Approximating Hierarchical model
Indirect Inference and likelihood free MCMC
Conclusions.
a university for the
real world
R
CRICOS No. 000213J
3.
Normalizing constant/partition function and MCMC
(half-way to likelihood free inference)
Here we assume (Møller, Pettitt, Reeves and Berthelsen, 2006)
f ( y;  )
z ( )
p( y |  ) 
known
unknown
z( )   f ( y; )dy and difficult to find.
Key idea Importance sample estimate of
Sample y ~ p( y |  ) then
f ( y; )
z( ) .
unbiased estimate of
f ( y; )
a university for the
z ( )
z ( )
given by
z( )
real world
R
CRICOS No. 000213J
Used off-line to estimate z( ) / z( ) then carry out standard MetropolisHastings with z ( ) interpolation over a grid of values.( eg Green
z ( )
and Richardson, 2002, in a Potts model).
Standard Metropolis Hastings: Simulating from target distribution
p( | y)  p( y |  ) p( ).
Acceptance ratio for changing    , proposal q(  |  )
p( y |  ) p( )
q( |  )
A(  |  ) 

p( y |  ) p( )
q(  |  )
f ( y; ) p( )
q( |  )
z ( )



f ( y; ) p( )
q(  |  )
z ( )
  accepted with probability min{1, A(  |  )} .
Key Question: Can z ( ) be calculated on-line or avoided?
z ( ')
a university for the
real world
R
CRICOS No. 000213J
On-line algorithm – single auxiliary variable method.
Introduce auxiliary variable x on same space as y and extend target
distribution for the MCMC
p( x, | y)  p( x | y, ) p( y |  ) p( ).
Key Question: How to choose distribution of x so that z ( )
removed from A(  |  ).
Now acceptance ratio is A( , x |  , x) as a new pair ( , x) proposed.
Proposal q(  |  ) becomes q( , x |  , x) .
Assume the factorisation q( , x |  , x)  q( x |  ) q(  |  , x)
Choose the proposal so that q( x |  ) 
f ( x; )
z ( )
Then algebra → cancellation of z( )s and
A( , x |  , x) does not depend on z ( )
a university for the
real world
R
CRICOS No. 000213J
Note: Need perfect or exact simulation from
Key Question: How to choose
distribution?
f ( y;  )
for the proposal.
z ( )
p( x | y, ) , the auxiliary variable
The best choice
p( x | y, )  q( x,   | x,  )
f ( x; )

but z( ) needed in M-H!!
z ( )
a university for the
real world
R
CRICOS No. 000213J
Choice (i)
a university for the
real world
R
CRICOS No. 000213J
Choice (ii)
a university for the
real world
R
CRICOS No. 000213J
Choice (i)
f ( x;ˆ( y))
ˆ
Fix  , say at  ( y) a good estimate of  . Then p( x | y, ) 
ˆ
z ( ( y))
so does not depend on  only y and z (ˆ) cancels in A( | ) .
Choice (ii)
f ( x; )
,
z( )
p( x | y, )  approximation to
Eg Partially ordered Markov mesh model for Ising data
Comment
Both choices can suffer from getting stuck because p( x | y, )
f ( x; )
can be very different from the ideal
.
z ( )
a university for the
real world
R
CRICOS No. 000213J
Example: Ising Model
60 by 60 array with  0 =.2,1 =.1
Single auxiliary variable method
(Moller et al, 2006)
Auxiliary variable is Choice (ii).
Approximation to Ising model.
Partially ordered Markov mesh
model with same neighbourhood as Ising
DAG with N, W as
parents, S, E as children
Run chain 500,000 iterations and thin 1 in 100
a university for the
real world
R
CRICOS No. 000213J
Source: Møller et al, 2006
Single auxiliary method tends to get stuck
Murray et al (2006) offer suggestions involving
multiple auxiliary variables
a university for the
real world
R
CRICOS No. 000213J
Outline
1.
2.
3.
4.
5.
6.
7.
Some problems with intractable likelihoods.
Monte Carlo methods and Inference.
Normalizing constant/partition function.
Likelihood free Markov chain Monte Carlo.
Approximating Hierarchical model
Indirect Inference and likelihood free MCMC
Conclusions.
a university for the
real world
R
CRICOS No. 000213J
4.
Likelihood free MCMC
Single Auxiliary Variable Method as almost Approximate Bayesian
Computation (ABC)
We wish to eliminate f ( y; ) / z( ) or equivalently
likelihood from the M-H algorithm.
Solution: The distribution of x given y and
the observed data, p( x |  , y)  Ind ( x  y)
then
with
A( , x |  , x) 
x ~
f ( x; )
z ( )
p( y |  ), the
puts all probability on y,
p( ) q( |  )
Ind ( x  y )
p( ) q(  |  )
the likelihood
This might work for discrete data, sample size small, and if the
proposal q(  |  )  q(  |  , y) were a very good approximation to p(  | y).
If sufficient statistics s(y) exist then Ind (s( x)  s( y)) replaces Ind ( x  y).
a university for the
real world
R
CRICOS No. 000213J
a university for the
real world
R
CRICOS No. 000213J
Likelihood free methods, ABC- MCMC
Change of notation, observed data yobs(fixed), y is pseudo data or
auxiliary data generated from the likelihood p( y |  ) .
Instead of y  yobs, now have y close to yobs in the sense of statistics s(
distance d ( y, yobs ) || s( y)  s( yobs ) || .
ABC allows d ( y, yobs )  e rather than equal to 0
Target distribution for variables ( y, )
p( y, | yobs , e )  p( y |  ) p( ) Ind (d ( y, yobs )  e ).
Standard M-H with proposals
  ~ q(  |  , yobs ) (Marjoram et al 2003; ABC MCMC)
y ~ p( y |  )
for acceptance of ( y, )  ( y, ) .
Ideally e should be small but this leads to very small acceptance
probabilities.
a university for the
real world
R
CRICOS No. 000213J
yobs ),
Issues of implementing Metropolis-Hastings ABC
(a)
Tune for e to get reasonable acceptance probabilities;
(b)
All ( y, ) satisfying d ( y, yobs )  e (hard) accepted
with equal probability
rather than smoothly weighted by d ( y, yobs ) (soft).
(c)
statistics
Choose summary statistics carefully if no sufficient
a university for the
real world
R
CRICOS No. 000213J
Tune for e
A solution is to allow e to vary as a parameter (Bortot et
al, 2004). The target distribution is
p( y, , e | yobs )  p( y |  ) p( ) Ind (d ( y, yobs )  e ) p(e ).
Run chain and post filter output for small values of e
a university for the
real world
R
CRICOS No. 000213J
Outline
1.
2.
3.
4.
5.
6.
7.
Some problems with intractable likelihoods.
Monte Carlo methods and Inference.
Normalizing constant/partition function.
Likelihood free Markov chain Monte Carlo.
Approximating Hierarchical model
Indirect Inference and likelihood free MCMC
Conclusions.
a university for the
real world
R
CRICOS No. 000213J
Beaumont, Zhang and Balding (2002) use kernel smoothing in ABC-MC
a university for the
real world
R
CRICOS No. 000213J
Soft Constraint for d ( y, yobs ) (Reeves and Pettitt , 2005)
Replace Ind (d ( y, yobs )  e ) by exp( 
1
d ( y, yobs )), a soft
2
constraint, with  replacing e .
Key Idea. Interpret as an approximate likelihood for (yobs | y )
Simple case
( yobs | y )  N ( y,  )
with φ known
Approximating Hierarchical model with joint probability
p ( ) p ( y |  ) p ( yobs | y )
a university for the
real world
R
CRICOS No. 000213J
Approximating Hierarchical Model
a university for the
real world
R
CRICOS No. 000213J
Simple case. Normal model with mean  , variance  2 ,sample n.
Sufficient statistic yobs.
d ( y, yobs )  ( y  yobs ) 2
and "likelihood"
( yobs | y )  N ( y ,  )
Integrate out pseudo data y from p ( y,  , yobs ) to get
marginal p ( , yobs ) to obtain
p ( , yobs )  N ( ; yobs ,
2
n
  ) p ( )
Key idea Approximation using pseudo data and match to observed data
introduces variance inflation in likelihood approximation. Will affect posterior
mean (if prior not improper vague) and variance.
a university for the
real world
R
CRICOS No. 000213J
General scheme to overcome variance inflation
and bias of posterior
Implement a tempering scheme with 0  1    k
and ˆ j is the posterior mean estimated from
chain  j .
Combine estimates using weighted least squares
and bias estimated using a quadratic in  , say
ˆ j   0  1 j   2 j2  error , j  1, , k
 0 gives combined estimate of posterior mean for   0
Similarly for posterior variance and quantile estimates
using chain  j
(compare Liu, 2001, MCMC for "indirect models")
a university for the
real world
R
CRICOS No. 000213J
a university for the
real world
R
CRICOS No. 000213J
Parallel Tempering to improve mixing
For chains  j -1 and  j , propose swaps of
( y,  ) values to improve mixing.
M-H ratio does not involve intractable likelihood.
Soft constraint improves mixing
a university for the
real world
R
CRICOS No. 000213J
a university for the
real world
R
CRICOS No. 000213J
Example: Ising Model (continued)
60 by 60 array with  0 =.2,1 =.1
Compare
Exact method, auxiliary variable method
ABC with  = 502 ,30 2 ,152
with correct sufficient statistics
y
i
and
 Ind ( y
i j
i
 yj)
Run chains 500,000 and thin 1 in 100
a university for the
real world
R
CRICOS No. 000213J
a university for the
real world
R
CRICOS No. 000213J
a university for the
real world
R
CRICOS No. 000213J
Outline
1.
2.
3.
4.
5.
6.
7.
Some problems with intractable likelihoods.
Monte Carlo methods and Inference.
Normalizing constant/partition function.
Likelihood free Markov chain Monte Carlo.
Approximating Hierarchical model
Indirect Inference and likelihood free MCMC
Conclusions.
a university for the
real world
R
CRICOS No. 000213J
Indirect Inference (Gourieroux et al, 1993)
(also Heggland and Frigessi, 2004)
Observe data yobs .
Suppose True model
pT ( y |  )
but Approximating model
is intractable
pA ( x |  )
is tractable,
x and y with the same support.
Can find ˆ ( x ) easily.
Put y into x model and obtain ˆ ( y |  ).
Repeat many times for simulated y from
to find an accurate value ˆ ( ).
Find  so that
a university for the
pT ( y |  )
ˆ ( yobs ) is close to ˆ ( ) giving ˆ( yobs ).
real world
R
CRICOS No. 000213J
Hierarchical Model using ideas of Indirect
Inference.(Reeves and Pettitt, 2005; P & R, 2006)
Consider the True hierarchical model
p ( ) pT ( yobs |  )
and the Approximating hierarchical model
p ( ) pT ( y |  ) p A (  | y ) p A ( yobs |  )
with
y is pseudo data from pT ( y |  )
pT ( y |  )
being the True intractable likelihood
pA ( | y)
is an Approximating model posterior distribution
p A ( yobs |  ) is an Approximating likelihood evaluated
at the observed data yobs
Key point: Marginilising the Approximating HM over
random y and  should be close to the True HM
a university for the
real world
R
CRICOS No. 000213J
a university for the
real world
R
CRICOS No. 000213J
A Metropolis Hastings algorithm for Indirect Inference
Can be implemented with the proposal using
the idea from Moller et al (2006).
q ( , y,   |  , y,  )  q(  |  ) pT ( y |  ) p A (   | y)
and then the MH acceptance probability is given with
A( , y,   |  , y,  ) 
p A ( yobs |  ) q ( |  ) p( )
p A ( yobs |  ) q (  |  ) p( )
which replaces the likelihood ratio for intractable pT
and replaces it by a ratio involving tractable p A .
a university for the
real world
R
CRICOS No. 000213J
Example: Ising Model (continued)
60 by 60 array with  0 =.2,1 =.1
MH Indirect Inference implemented with
Approximate Likelihood taken as the POMM
with 2 parameters equivalent to Ising model  0 , 1
20,000 iterations with Approximating posterior
found from "side MH chain" with 400 iterations.
No summary statistics required, implied by
Approximating Likelihood
a university for the
real world
R
CRICOS No. 000213J
a university for the
real world
R
CRICOS No. 000213J
a university for the
real world
R
CRICOS No. 000213J
Some points
• How could approximate posterior be made more
precise?
– Use more parameters in approximating likelihood, the
POMM? (Gouriéroux at al (1993), Heggland and Frigassi (2004)
discuss this in the frequentist setting)
– More iterations for side chain “exact” calculation of
approximate posterior?
• How to choose a good approximating likelihood?
• Relationship to summary statistics approach?
a university for the
real world
R
CRICOS No. 000213J
Outline
1.
2.
3.
4.
5.
6.
7.
Some problems with intractable likelihoods.
Monte Carlo methods and Inference.
Normalizing constant/partition function.
Likelihood free Markov chain Monte Carlo.
Approximating Hierarchical model
Indirect Inference and likelihood free MCMC
Conclusions.
a university for the
real world
R
CRICOS No. 000213J
Conclusions
1.
2.
3.
For the normalizing constant problem we presented a single online M-H algorithm.
We linked these ideas to ABC-MCMC and developed a
hierarchical model (HM) to approximate the true posterior –
showed variance inflation.
We showed that the approximating HM could be tempered
swaps made to improve mixing using parallel chains,
variance inflation effect corrected by smoothing posterior summaries
from the tempered chains.
4.
5.
6.
We extended indirect inference to an HM to find a way of
implementing the Metropolis Hastings algorithm which is likelihood
free.
We demonstrated the ideas with the Ising/autologistic model.
Application to specific examples is on-going and requires
refinement of general approaches.
a university for the
real world
R
CRICOS No. 000213J
Acknowledgements
Support of the Australian Research Council
Co-authors Rob Reeves, Jesper Møller, Kasper
Berthelsen
Discussions with Malcolm Faddy, Gareth Ridall,
Chris Glasbey, Grant Hamilton …
a university for the
real world
R
CRICOS No. 000213J