Transcript Slide 1

Bayesian inference
Gil McVean, Department of Statistics
Monday 17th November 2008
Questions to ask…
• What is likelihood-based inference?
• What is Bayesian inference and why is it different?
• How do you estimate parameters in a Bayesian framework?
• How do you choose a suitable prior?
• How do you compare models in Bayesian inference?
A recap on likelihood
• For any model, the maximum information about model parameters is obtained by considering the likelihood function
• The likelihood function is proportional to the probability of observing the data given a specified parameter value
• The likelihood principle states that all information about the parameters of interest is contained in the likelihood function
An example
• Suppose we have data generated from a Poisson distribution. We want to estimate the parameter of the distribution
• The probability of observing a particular random variable is

$$P(X; \lambda) = \frac{e^{-\lambda} \lambda^X}{X!}$$

• If we have observed a series of iid Poisson RVs we obtain the joint likelihood by multiplying the individual probabilities together

$$P(X_1, X_2, \ldots, X_n; \lambda) = \frac{e^{-\lambda} \lambda^{X_1}}{X_1!} \times \frac{e^{-\lambda} \lambda^{X_2}}{X_2!} \times \cdots \times \frac{e^{-\lambda} \lambda^{X_n}}{X_n!}$$

$$L(\lambda; \mathbf{X}) \propto \prod_i e^{-\lambda} \lambda^{X_i} = e^{-n\lambda} \lambda^{n\bar{X}}$$
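A minimal numerical sketch of this calculation (not from the slides): the joint log-likelihood can be evaluated on a grid and, for the Poisson model, its maximiser is simply the sample mean. The counts used are those from the relative-likelihood example on the next slide.

import numpy as np
from scipy.stats import poisson

x = np.array([12, 22, 14, 8])   # counts from the relative-likelihood example below

def log_likelihood(lam, x):
    # Joint log-likelihood of iid Poisson counts: sum_i log P(X_i; lambda)
    return poisson.logpmf(x, lam).sum()

grid = np.linspace(1, 30, 2901)
ll = np.array([log_likelihood(lam, x) for lam in grid])
print("grid maximiser:", grid[ll.argmax()])   # ~14.0
print("sample mean   :", x.mean())            # 14.0, the analytical MLE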
Relative likelihood
• We can compare the evidence for different parameter values through their relative likelihood
• For example, suppose we observe counts of 12, 22, 14 and 8 from a Poisson process
• The maximum likelihood estimate is 14. The relative likelihood is given by

$$\frac{L(\lambda; \mathbf{X})}{L(\hat{\lambda}; \mathbf{X})} = e^{-n(\lambda - \hat{\lambda})} \left(\frac{\lambda}{\hat{\lambda}}\right)^{n\bar{X}}$$
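A sketch of this ratio for the observed counts (the X_i! terms cancel, so only the likelihood kernel above matters):

import numpy as np

x = np.array([12, 22, 14, 8])           # observed counts from the slide
n, xbar = len(x), x.mean()
lam_hat = xbar                          # MLE = 14

def relative_likelihood(lam):
    # L(lambda; X) / L(lambda_hat; X) = exp(-n(lambda - lambda_hat)) * (lambda / lambda_hat)^(n * xbar)
    return np.exp(-n * (lam - lam_hat)) * (lam / lam_hat) ** (n * xbar)

for lam in (10, 12, 14, 16, 18):
    print(lam, round(relative_likelihood(lam), 4))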
Maximum likelihood estimation
• The maximum likelihood estimate is the set of parameter values that maximises the probability of observing the data we got
• The MLE is consistent, in that it converges to the truth as the sample size gets infinitely large
• The MLE is asymptotically efficient, in that it achieves the minimum possible variance (the Cramér-Rao Lower Bound) as n→∞
• However, the MLE is often biased for finite sample sizes
– For example, the MLE for the variance parameter in a normal distribution is the sample variance with divisor n, which underestimates the true variance (illustrated in the sketch below)
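A small simulation sketch (not from the slides) of that bias: the divisor-n estimator is too small on average by a factor of (n−1)/n.

import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 4.0
samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))

var_mle = samples.var(axis=1, ddof=0)       # divisor n: the MLE
var_unbiased = samples.var(axis=1, ddof=1)  # divisor n-1: the unbiased estimator

print("mean of MLE      :", var_mle.mean())       # ~ (n-1)/n * 4 = 3.2
print("mean of unbiased :", var_unbiased.mean())  # ~ 4.0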
Confidence intervals and likelihood
• Thanks to the CLT there is another useful result that allows us to define confidence intervals from the log-likelihood surface
• Specifically, the set of parameter values for which the log-likelihood is no more than 1.92 below the maximum log-likelihood defines an approximate 95% confidence interval (1.92 is half of 3.84, the 95% point of the χ² distribution with one degree of freedom; applied in the sketch below)
– In the limit of large sample size the LRT is approximately chi-squared distributed under the null
• This is a very useful result, but shouldn't be assumed to hold
– i.e. check with simulation
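A sketch of reading an approximate 95% interval off the log-likelihood surface, using the Poisson example from earlier (the grid resolution is an arbitrary choice):

import numpy as np
from scipy.stats import poisson

x = np.array([12, 22, 14, 8])
grid = np.linspace(5, 30, 2501)
ll = np.array([poisson.logpmf(x, lam).sum() for lam in grid])

inside = grid[ll >= ll.max() - 1.92]   # within 1.92 (= 3.84 / 2) of the maximum
print("approximate 95% CI: (%.2f, %.2f)" % (inside.min(), inside.max()))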
Likelihood ratio tests
• Suppose we have two models, H0 and H1, in which H0 is a special case of H1
• We can compare the likelihood of the MLEs for the two models
– Note the likelihood under H1 can be no worse than under H0
• Theory shows that if H0 is true, then twice the difference in log-likelihood is asymptotically χ² distributed with degrees of freedom equal to the difference in the number of parameters between H0 and H1
– The likelihood ratio test (see the sketch below)

$$\Lambda = 2 \log\left(\frac{L(\hat{\theta}_1; \mathbf{X})}{L(\hat{\theta}_0; \mathbf{X})}\right)$$

• Theory also tells us that if H1 is true, then the likelihood ratio test is the most powerful test for discriminating between H0 and H1
– Useful, though perhaps not as useful as it sounds
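A sketch of a likelihood ratio test for the Poisson example, with a hypothetical null value λ = 10 chosen purely for illustration:

import numpy as np
from scipy.stats import poisson, chi2

x = np.array([12, 22, 14, 8])
lam0 = 10.0           # hypothetical null value (H0), chosen for illustration
lam_hat = x.mean()    # MLE under H1, where lambda is free

# Twice the difference in log-likelihood, compared with chi-squared on 1 df
lrt = 2 * (poisson.logpmf(x, lam_hat).sum() - poisson.logpmf(x, lam0).sum())
p_value = chi2.sf(lrt, df=1)
print("LRT statistic:", lrt, " p-value:", p_value)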
Criticisms of the frequentist approach
• The choice between models using P-values is focused on rejecting the null rather than demonstrating the appropriateness of the alternative
• Representing uncertainty through the use of confidence intervals is messy and unintuitive
– You cannot say that the probability of the true parameter being within the interval is 0.95
• The frequentist approach requires a predefined experimental design that must be followed through to completion, at which point the data are analysed
– Bayesian inference naturally accommodates interim analyses, changes in stopping rules, and combining data from different sources
• Focusing on point estimation leads to models that are 'over-fitted' to the data
Bayesian estimators
• Bayesian statistics aims to make statements about the probability attached to different parameter values given the data you have collected
• It makes use of Bayes' theorem

$$P(\theta \mid D) = \frac{P(\theta)\, P(D \mid \theta)}{P(D)}$$

where $P(\theta)$ is the prior, $P(D \mid \theta)$ is the likelihood, $P(\theta \mid D)$ is the posterior, and $P(D)$ is the normalising constant.
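A sketch of Bayes' theorem applied numerically: multiply the prior by the likelihood on a grid of parameter values and divide by the normalising constant. The Gamma(2, scale 5) prior and the Poisson counts are illustrative choices, not from the slides.

import numpy as np
from scipy.stats import poisson, gamma

x = np.array([12, 22, 14, 8])                      # illustrative Poisson counts
grid = np.linspace(0.01, 40, 4000)

prior = gamma.pdf(grid, a=2, scale=5)              # illustrative prior P(theta)
likelihood = np.exp([poisson.logpmf(x, lam).sum() for lam in grid])   # P(D | theta)

unnormalised = prior * likelihood
posterior = unnormalised / np.trapz(unnormalised, grid)   # divide by P(D)
print("posterior mean:", np.trapz(grid * posterior, grid))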
Are parameters random variables?
• The single most important conceptual difference between Bayesian statistics and frequentist statistics is the notion that the parameters you are interested in are themselves random variables
• This notion is encapsulated in the use of a subjective prior for your parameters
• Remember that to construct a confidence interval we have to define the set of possible parameter values
• A prior does the same thing, but also gives a weight to different values
Example: coin tossing
• I toss a coin twice and observe two heads
• I want to perform inference about the probability of obtaining a head on a single throw for the coin in question
• The MLE of the probability is 1.0 – yet I have a very strong prior belief that the answer is 0.5
• Bayesian statistics forces the researcher to be explicit about prior beliefs but, in return, can be very specific about what information has been gained by performing the experiment
• It also provides a natural way of combining data from different experiments
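A sketch of this example with a conjugate Beta prior: Beta(50, 50) is an illustrative choice concentrated near 0.5 (the slides do not specify a prior), and it barely moves after observing two heads.

from scipy.stats import beta

a, b = 50, 50           # illustrative prior concentrated near 0.5
heads, tails = 2, 0     # the observed data: two heads in two tosses

# The Beta prior is conjugate to the binomial likelihood
post_a, post_b = a + heads, b + tails
print("posterior mean:", post_a / (post_a + post_b))              # ~0.51, not 1.0
print("95% credible interval:", beta.ppf([0.025, 0.975], post_a, post_b))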
The posterior
• Bayesian inference about parameters is contained in the posterior distribution
• The posterior can be summarised in various ways, e.g. the posterior mean and credible intervals
[Figure: prior and posterior distributions, with the posterior mean and a credible interval indicated on the posterior]
Choosing priors
• A prior reflects your belief before the experiment
• This might be relatively unfocused
– Uniform distributions in the case of single parameters
– Jeffreys prior (and other 'uninformative' priors)
• Or might be highly focused
– In the coin-tossing experiment, most of my prior would be on P = 0.5
– In an association study, my prior on a SNP being causal might be $1/10^7$
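As a sketch of how such choices weight parameter values for a binomial proportion (for which the Jeffreys prior is Beta(1/2, 1/2) and the uniform prior is Beta(1, 1); the Beta(50, 50) line is an illustrative 'focused' prior):

import numpy as np
from scipy.stats import beta

p = np.linspace(0.05, 0.95, 7)
priors = {
    "uniform   Beta(1, 1)    ": beta.pdf(p, 1, 1),
    "Jeffreys  Beta(0.5, 0.5)": beta.pdf(p, 0.5, 0.5),
    "focused   Beta(50, 50)  ": beta.pdf(p, 50, 50),   # illustrative strongly focused prior
}
for name, density in priors.items():
    print(name, np.round(density, 2))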
Using posteriors
• Posterior summary to provide statements about point estimates and certainty
• Posterior prediction to make statements about future events
• Posterior predictive simulation to check the fit of the model to data
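A sketch of posterior predictive simulation for the earlier Poisson counts, under an illustrative conjugate Gamma(2, rate 0.4) prior: draw the parameter from its posterior, draw a replicate dataset, and compare a summary of the replicates with the observed data.

import numpy as np

rng = np.random.default_rng(1)
x = np.array([12, 22, 14, 8])
a0, b0 = 2.0, 0.4                         # illustrative Gamma(shape, rate) prior

# The Gamma prior is conjugate to the Poisson: posterior is Gamma(a0 + sum(x), b0 + n)
a_post, b_post = a0 + x.sum(), b0 + len(x)

lam_draws = rng.gamma(a_post, 1.0 / b_post, size=10_000)          # draws of lambda
reps = rng.poisson(lam_draws[:, None], size=(10_000, len(x)))     # replicate datasets

# A simple discrepancy check: how often is the replicate maximum at least the observed maximum?
print("P(max of replicate >= observed max):", (reps.max(axis=1) >= x.max()).mean())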
Bayes factors
• Bayes factors can be used to compare the evidence for different models
– These do not need to be nested
• Bayes factors generalise the likelihood ratio by integrating the likelihood over the prior (see the sketch below)

$$BF = \frac{\int P(\theta_1 \mid M_1)\, P(\mathbf{X} \mid \theta_1, M_1)\, d\theta_1}{\int P(\theta_2 \mid M_2)\, P(\mathbf{X} \mid \theta_2, M_2)\, d\theta_2}$$
• Importantly, if model 2 is a subset of model 1, it does not follow that the Bayes factor is necessarily greater than 1
– The subspace of model 1 that improves the likelihood may be very small, and the extra parameter carries an extra cost
• It is generally accepted that a BF of 3 is worth mentioning, a BF of 10 is strong evidence, and a BF of 100 is decisive (Jeffreys)
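A sketch of a Bayes factor computed by integrating each model's likelihood over its prior, using the earlier coin example: M1 puts a uniform Beta(1, 1) prior on the probability of heads, M2 fixes it at 0.5 (both models are illustrative choices, not from the slides).

from scipy.integrate import quad
from scipy.stats import beta, binom

heads, n = 2, 2     # two heads in two tosses

# M1: integrate the binomial likelihood over the Beta(1, 1) prior
marginal_m1, _ = quad(lambda p: beta.pdf(p, 1, 1) * binom.pmf(heads, n, p), 0, 1)

# M2: the probability of heads is fixed at 0.5, so no integration is needed
marginal_m2 = binom.pmf(heads, n, 0.5)

print("Bayes factor, M1 vs M2:", marginal_m1 / marginal_m2)   # (1/3) / (1/4) = 4/3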
Example
• Consider the crossing data of Bateson and Punnett in which we want to estimate the recombination fraction

Bateson and Punnett experiment

Phenotype and genotype     Observed    Expected from 9:3:3:1 ratio
Purple, long (P_L_)           284            216
Purple, round (P_ll)           21             72
Red, long (ppL_)               21             72
Red, round (ppll)              55             24
• I will use a beta prior for the recombination fraction with parameters 3 and 7
• Conditional on the total sample size (381), the likelihood function is described by the multinomial
• We get the following posterior distribution (see the sketch below)
– Posterior mean = 0.134
– Posterior mode = 0.13
– 95% ETPI (equal-tailed posterior interval) = 0.10 – 0.16
• Comparing the model to one in which r = 0.5 gives a BF of 3.9
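A sketch of the posterior on a grid. The slide does not show the multinomial parameterisation, so the coupling-phase F2 cell probabilities ((2+ψ)/4, (1−ψ)/4, (1−ψ)/4, ψ/4) with ψ = (1−r)² are an assumption; with the Beta(3, 7) prior the summaries come out close to those quoted above.

import numpy as np
from scipy.stats import beta, multinomial

counts = np.array([284, 21, 21, 55])        # observed P_L_, P_ll, ppL_, ppll
r_grid = np.linspace(0.001, 0.999, 999)

def cell_probs(r):
    # Assumed coupling-phase F2 phenotype probabilities as a function of r
    psi = (1 - r) ** 2
    return np.array([(2 + psi) / 4, (1 - psi) / 4, (1 - psi) / 4, psi / 4])

prior = beta.pdf(r_grid, 3, 7)
lik = np.array([multinomial.pmf(counts, counts.sum(), cell_probs(r)) for r in r_grid])

post = prior * lik
post /= np.trapz(post, r_grid)                               # normalise numerically
dr = r_grid[1] - r_grid[0]
cdf = np.cumsum(post) * dr
print("posterior mean:", np.trapz(r_grid * post, r_grid))    # ~0.13
print("posterior mode:", r_grid[post.argmax()])
print("95% ETPI: (%.3f, %.3f)" % (r_grid[np.searchsorted(cdf, 0.025)],
                                  r_grid[np.searchsorted(cdf, 0.975)]))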
Bayesian inference and the notion of shrinkage
• The notion of shrinkage is that you can obtain better estimates by assuming a certain degree of similarity among the things you want to estimate and a lack of complexity
• Practically, this means three things
– Borrowing information across observations
– Penalising inferences that are very different from anything else
– Penalising more complex models
• The notion of shrinkage is implicit in the use of priors in Bayesian statistics
• There are also forms of frequentist inference where shrinkage is used
– But NOT MLE