Markov Chain Monte Carlo - Oxford University Statistics

Download Report

Transcript Markov Chain Monte Carlo - Oxford University Statistics

Markov Chain Monte Carlo
Gil McVean
Tuesday 24th February 2009
1
Bayesian inference
• In Bayesian statistics we want to learn about the probability distribution of
the parameters of interest given the data = the posterior
Prior
Likelihood
Posterior
P( ) P( D |  )
P( | D) 
P( D)
Normalising constant
• In simple cases we can derive analytical expressions for the posterior
– For example, if we have a beta prior: Beta(a,b) for a binomial parameter, after
we have observed n successes and m failures, the posterior is also beta:
Beta(a+n, b+m). The key here is conjugacy.
2
Monte Carlo methods
• In many situations, the normalising constant cannot be calculated
analytically
P( D)   P( ) P( D |  )d
• Typically this is true if we are dealing with multiple parameters,
multidimensional parameters, complex model structures or complex
likelihood functions
– i.e. most situations in modern statistics
• We can use Monte Carlo methods to sample from (and therefore estimate
functions of) the posterior
• Perhaps the most widely used method is called Markov Chain Monte
Carlo, brought to prominence by a paper in 1990 by Gelfand and Smith.
3
Markov Chain Monte Carlo
• The basic idea is to construct a Markov Chain whose stationary distribution
is the required posterior
• A Markov Chain is a stochastic process that generates random variables X1,
X2, …, Xt, where the distribution
P( X t | X1 , X 2 ,, X t 1 )  P( X t | X t 1 )
i.e. the distribution of the next random variable depends only on the current
random variable
• Note that the Xi are typically highly correlated, so each sample is not an
independent draw from the posterior
– Thinning of the chain leads to effectively independent samples
4
Metropolis sampling
• Gibbs-sampling is an example of MCMC that uses marginal conditional
distributions. However, we do not always know what these are.
• However, it is still possible to construct a Markov Chain that has the
posterior as its stationary distribution (Metropolis et al. 1953)
• In the current step, the value of the parameters is Xt. Propose a new set of
parameters, Y in a symmetric manner; i.e. q(Y| X) = q(X| Y)
• Calculate the prior and likelihood functions for the old and new parameter
values. Set the parameter values in the next step of the chain, Xt+1 to Y
with probability a, otherwise set to Xt
 P(Y ) L(Y ) 
a ( X , Y )  min1,

P
(
X
)
L
(
X
)


5
Some notation
• I shall refer to the posterior distribution as the target distribution
 ( )  P( ) L( )
• The transition probabilities in the Markov chain I shall refer to as the
proposal distribution; also known as the transition kernel
q( X | Y )
• When talking about multiple parameters I shall refer to the joint posterior
 (1 ,2 ,,k )
and the conditional distributions
 (2 | 1,,k )   (i | i )
6
An example
• There are two types of bus: frequent (average arrival time = 5 mins) and
rare (average arrival time = 10 mins). I see the following times
7
Number
6
5
4
3
2
1
24-26
22-24
20-22
18-20
16-18
14-16
12-14
10-12
8-10
6-8
4-6
2-4
<2
0
Arrival time
• If I model the inter-arrival times as a mixture of two exponential
distributions, I want to perform inference on the proportion of frequent
and rare buses
P( X i  x) 

5
e
x /5
1    x /10
+
e
10
7
• I will put a beta(1,1) = uniform(0,1) prior on the mixture proportion 
• Starting with an initial guess of  = 0.2, I will propose changes uniform on
  e,  + e
• (Note that if I propose a value < 0 or >1 I must reject it)
• 50, 500 and 5000 samples from the chain give me the following results
1
1
0.8
 0.8
0.6
0.6
0.6
0.4
0.4
0.4
0.2
0.2
0.2
0
0
0
100
200
300
400
iteration
60
50
40
30
20
10
0
5000
iteration
0.95

0.75
0.95
0.85
0.75

0.65
0
0.55
0.95
0.85
0.75

0.65
0.55
0.45
0.35
0.25
0.15
0
4000
100
0.45
2
3000
200
0.35
4
2000
300
0.25
6
1000
400
0.15
8
0
iteration
0.05
10
500
0.85
50
0.65
40
0.55
30
0.45
20
0.35
10
0.25
0
0.15
0
0.05

0.8
0.05

1
8
Under the bonnet
• Consider a system in which a single parameter can take k possible values
• My proposal is to select at random from one of the k-1 other possible values
• Now suppose the system has reached its stationary distribution so that the
probability of the chain being in a given state is proportional to its prior times
its likelihood
• Consider two states, i and j, with (Xi) > (Xj). The rates of flow in the two
directions are
Flow i to j
Flow j to i
1  (X j )  (X j )
 ( X i )q( X j | X i )a ij   ( X i ) 


k 1  ( X i )
k 1
 ( X j )q ( X i | X j )a ji   ( X j ) 
1
k 1
9
The small print!
•
If detailed balance holds for every pair of states, the system remains in its
stationary distribution
–
•
We just need to guarantee that the stationary distribution is reached
There are three requirements for Xt to converge to the stationary
distribution
I.
X must be irreducible. That is every state must be (eventually) reachable
from every other state
II. X must be aperiodic. This stops the chain from oscillating between
different states
III. X must be positive recurrent. This states that the expected waiting time to
return to a state is finite and means that a chain will stay in its stationary
distribution once it first reaches it
10
The Hastings ratio
• A simple change to the acceptance formula allows the use of asymmetric
proposal distributions and generalises the Metropolis result
  (Y ) q( X | Y ) 
a  min1,


(
X
)
q
(
Y
|
X
)


• Moves that multiply or divide parameters need to follow change-ofvariables rules
  (Y ) q( X | Y ) dY 
a  min1,


(
X
)
q
(
Y
|
X
)
dX


Y  ea X
a ~ U (1,1)
dY
Y
a
e 
dX
X
6
Xt
Sampling from an Exp(1) distribution
5
4
3
2
1
0
0
100
200
300
400
500
600
700
800
900
1000
11
Gibbs sampling as a special case
• In Gibbs sampling we want to find the posterior for a set of parameters
• Each parameter is updated in turn by sampling from the conditional
distribution given the data and the current value of all other parameters
• Consider the case of a single parameter updated using the Metropolis
algorithm where the proposal density is the conditional distribution
a XY
  (Y ) q( X | Y ) 
  (Y )  ( X ) 
 min1,
 min1,
1


  ( X ) q(Y | X ) 
  ( X )  (Y ) 
• In other words, the Gibbs sampler is an MCMC where every proposal is
accepted
q(Yi | X i )   (Yi | X i )
• With multiple parameters you have to be careful about update order to
ensure reversibility
12
Convergence
• The Markov Chain is guaranteed to have the posterior as its stationary
distribution (if well constructed)
• BUT this does not tell you how long you have to run the chain in order to
reach stationarity
– The initial starting point may have a strong influence
– The proposal distributions may lead to low acceptance rates
– The likelihood surface may have local maxima that trap the chain
• Multiple runs from different initial conditions and graphical checks can be
used to assess convergence
– The efficiency of a chain can be measured in terms of the variance of
estimates obtained by running the chain for a short period of time
13
Watching your chain
2
7
6
• Three chains, each
sampling from an Exp(1)
distribution with a
uniform(-x, x) proposal
distribution
x = 0.1 Acceptance = 97%
5
4
3
2
1
0
0
100
200
300
400
600
700
800
900
1000
20
7
• Measure acceptance
rates
500
x = 1 Acceptance = 62%
6
5
4
3
2
1
0
0
100
200
300
400
500
600
700
800
900
1000
200
7
x = 10 Acceptance = 8%
6
5
4
3
2
1
0
0
100
200
300
400
500
600
700
800
900
1000
14
The burn-in
• Often you start a chain far away from the target distribution
– Truth unknown
– Check for convergence
• The first few samples from the chain are poor representatives from the
stationary distribution
• These are usually thrown away as burn-in
20
25
20
15
10
5
0
0
100
200
300
400
500
600
700
800
900
1000
• There is no theory that says how much to throw away, but it’s better to err
on the side of more than less
15
Other uses of MCMC
• Looking at marginal effects
– Suppose we have a multidimensional parameter, we may just want to know
about the marginal posterior distribution of that parameter
• Prediction
– Given our posterior distribution on parameters, we can predict the
distribution of future data by sampling parameters from the posterior and
simulating data given those parameters
– The posterior predictive distribution is a useful source of goodness-of-fit
testing (if predicted data doesn’t look like the data you’ve collected, the model
is poor)
16
Modifications
• An important development has been to allow trans-dimensional moves
(Green 1995), also known as Reversible Jump MCMC
– Useful when looking for change points (e.g. in rate) along a sequence
– Examples include looking for periods of elevated accident rate or genomic
regions of elevated recombination rate
Data
Posterior
probability of
change in rate
Time
• There are many subtle variants of basic MCMC that allow you to increase
efficiency in complex situations (see Liu 2001 for many)
17
A good read
• Markov Chain Monte Carlo in Practice. 1996. eds. Gilks WR, Richardson S
and Spiegelhalter DJ. Chapman & Hall/CRC.
• Bayesian Data Analysis. 2004. Gelman A, Carlin JB, Stern HS and Rubin
DB. Chapman & Hall/CRC.
• Monte Carlo Strategies in Scientific Computing. 2001. Liu JS. SpringerVerlag.
• Chris Holmes’ short course on Bayesian statistics
– http://www.stats.ox.ac.uk/~cholmes/Courses/BDA/bda_mcmc.pdf
18