Transcript Document

Introduction to Bayesian inference
and computation for social science
data analysis
Nicky Best
Imperial College, London
www.bias-project.org.uk
Outline
• Overview of Bayesian methods
– Illustration of conjugate Bayesian inference
– MCMC methods
• Examples illustrating:
– Analysis using informative priors
– Hierarchical priors, meta-analysis and evidence
synthesis
– Adjusting for data quality
– Model uncertainty
• Discussion
Overview of Bayesian
inference and computation
Overview of Bayesian methods
• Bayesian methods have been widely applied in
many areas
– medicine / epidemiology / genetics
– ecology / environmental sciences
– finance
– archaeology
– political and social sciences, ………
• Motivations for adopting Bayesian approach vary
– natural and coherent way of thinking about science and
learning
– pragmatic choice that is suitable for the problem in hand
Overview of Bayesian methods
• Medical context: FDA draft guidance
www.fda.gov/cdrh/meetings/072706-bayesian.html:
“Bayesian statistics…provides a coherent method for
learning from evidence as it accumulates”
• Evidence can accumulate in various ways:
– Sequentially
– Measurement of many ‘similar’ units (individuals,
centres, sub-groups, areas, periods…..)
– Measurement of different aspects of a problem
• Evidence can take different forms:
– Data
– Expert judgement
Overview of Bayesian methods
• Bayesian approach also provides formal
framework for propagating uncertainty
– Well suited to building complex models by linking
together multiple sub-models
– Can obtain estimates and uncertainty intervals for any
parameter, function of parameters or predictive
quantity of interest
• Bayesian inference doesn’t rely on asymptotics or
analytic approximations
– Arbitrarily wide range of models can be handled using
same inferential framework
– Focus on specifying realistic models, not on choosing
analytically tractable approximation
Bayesian inference
• Distinguish between
x : known quantities (data)
θ : unknown quantities (e.g. regression coefficients, future
outcomes, missing observations)
• Fundamental idea: use probability distributions to represent
uncertainty about unknowns
• Likelihood – model for the data: p( x | θ )
• Prior distribution – representing current uncertainty about
unknowns: p(θ )
• Applying Bayes theorem gives the posterior distribution:

  p(θ | x) = p(θ) p(x | θ) / ∫ p(θ) p(x | θ) dθ ∝ p(θ) p(x | θ)
Conjugate Bayesian inference
• Example: election poll (from Franklin, 2004*)
• Imagine an election campaign where (for
simplicity) we have just a Government/Opposition
vote choice.
• We enter the campaign with a prior distribution for
the proportion supporting Government. This is p(θ )
• As the campaign begins, we get polling data. How
should we change our estimate of Government’s
support?
*Adapted from Charles Franklin’s Essex Summer School course slides:
http://www.polisci.wisc.edu/users/franklin/Content/Essex/Lecs/BayesLec01p6up.pdf
Conjugate Bayesian inference
Data and likelihood
• Each poll consists of n voters, x of whom say they
will vote for Government and n - x will vote for the
opposition.
• If we assume we have no information to distinguish
voters in their probability of supporting government,
then x has a binomial distribution:

  p(x | θ) = (n choose x) θ^x (1 − θ)^(n−x) ∝ θ^x (1 − θ)^(n−x)

• This binomial distribution is the likelihood p(x | θ )
Conjugate Bayesian inference
Prior
• We need to specify a prior that
– expresses our uncertainty about the election (before it
begins)
– conforms to the nature of the θ parameter, i.e. is
continuous but bounded between 0 and 1
• A convenient choice is the Beta distribution:

  p(θ) = Beta(a, b) = [Γ(a+b) / (Γ(a)Γ(b))] θ^(a−1) (1 − θ)^(b−1)
                    ∝ θ^(a−1) (1 − θ)^(b−1)
Conjugate Bayesian inference
• Beta(a,b) distribution can take a variety of shapes
depending on its two parameters a and b
[Figure: Beta density shapes for Beta(0.5,0.5), Beta(1,1), Beta(5,1),
Beta(5,5), Beta(5,20) and Beta(50,200), plotted over θ in (0, 1)]

Mean of Beta(a, b) distribution = a/(a+b)
Variance of Beta(a, b) distribution = ab/[(a+b)²(a+b+1)]
Conjugate Bayesian inference
Posterior
• Combining a beta prior with the binomial likelihood
gives a posterior distribution:

  p(θ | x, n) ∝ p(x | θ, n) p(θ)
             ∝ θ^x (1 − θ)^(n−x) θ^(a−1) (1 − θ)^(b−1)
             = θ^(x+a−1) (1 − θ)^(n−x+b−1)
             ∝ Beta(x + a, n − x + b)
• When prior and posterior come from same family,
the prior is said to be conjugate to the likelihood
– Occurs when prior and likelihood have the same ‘kernel’
Conjugate Bayesian inference
• Suppose I believe that Government only has the
support of half the population, and I think that
estimate has a standard deviation of about 0.07
– This is approximately a Beta(50, 50) distribution
• We observe a poll with 200 respondents, 120 of
whom (60%) say they will vote for Government
• This produces a posterior which is a
Beta(120+50, 80+50) = Beta(170, 130) distribution
Conjugate Bayesian inference
[Figure: prior, likelihood and posterior densities for θ
(prob of voting for government)]

• Prior mean, E(θ ) = 50/100 = 0.5
• Posterior mean, E(θ | x, n) = 170/300 = 0.57
• Posterior SD, √Var(θ | x, n) = 0.029
• Frequentist estimate is based only on the data:

  θ̂ = 120/200 = 0.6;  SE(θ̂) = √(0.6 × 0.4 / 200) = 0.035
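The update and summaries on this slide are easy to check numerically; a minimal Python sketch using only the closed-form Beta formulas (the prior parameters and poll counts are those given on the slides):

```python
import math

# Prior Beta(a, b) and poll data from the slides
a, b = 50, 50          # prior: Beta(50, 50)
x, n = 120, 200        # 120 of 200 respondents support Government

# Conjugate update: posterior is Beta(a + x, b + n - x)
a_post, b_post = a + x, b + n - x                 # Beta(170, 130)

# Closed-form posterior summaries
s = a_post + b_post
post_mean = a_post / s                            # 170/300 ≈ 0.567
post_sd = math.sqrt(a_post * b_post / (s ** 2 * (s + 1)))   # ≈ 0.029

# Frequentist estimate, based only on the data
theta_hat = x / n                                      # 0.6
se_hat = math.sqrt(theta_hat * (1 - theta_hat) / n)    # ≈ 0.035
```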
Conjugate Bayesian inference
A harder problem
• What is the probability that Government wins?
– It is not 0.57 or 0.60. Those are expected vote shares, not the
probability of winning. How do we answer this?
• Frequentists have a hard time with this one. They can
obtain a p-value for testing H0: θ > 0.5, but this isn't the
same as the probability that Government wins
– (it's actually the probability of observing data more extreme than
120 out of 200 if H0 is true)
• Easy from the Bayesian perspective
– calculate Pr(θ > 0.5 | x, n), the
posterior probability that θ > 0.5
– here, Pr(θ > 0.5 | x, n) = 0.98

[Figure: posterior density of θ, with the area above θ = 0.5 shaded]
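The tail-area calculation is a one-liner given the Beta posterior; a sketch assuming SciPy is available (the slide quotes Pr(θ > 0.5 | x, n) ≈ 0.98):

```python
from scipy.stats import beta

# Posterior probability that Government wins: upper tail area of the
# Beta(170, 130) posterior beyond theta = 0.5
p_win = beta.sf(0.5, 170, 130)   # survival function = 1 - CDF
```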
Bayesian computation
• All Bayesian inference is based on the posterior
distribution
• Summarising posterior distributions involves
integration:

  Mean:          E(θ | x) = ∫ θ p(θ | x) dθ
  Prediction:    p(x_new | x) = ∫ p(x_new | θ) p(θ | x) dθ
  Probabilities: Pr(θ ∈ (c₁, c₂) | x) = ∫_{c₁}^{c₂} p(θ | x) dθ
• Except for conjugate models, integrals are usually
analytically intractable
• Use Monte Carlo (simulation) integration (MCMC)
Bayesian computation
• Suppose we didn’t know how to analytically
integrate the Beta(170, 130) posterior…
• ….but we do know how to simulate from a Beta
[Figure: histograms of 100 samples (mean = 0.569) and 10,000 samples
(mean = 0.566) drawn from the Beta(170, 130) posterior, alongside the
true posterior density (mean = 0.567)]
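The 100- vs 10,000-sample comparison can be reproduced by direct simulation from the Beta(170, 130) posterior (the seed below is an arbitrary choice for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
true_mean = 170 / 300            # exact posterior mean ≈ 0.567

mean_100 = rng.beta(170, 130, size=100).mean()       # rough estimate
mean_10000 = rng.beta(170, 130, size=10_000).mean()  # much closer
```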
Bayesian computation
• Monte Carlo integration
– Suppose we have samples θ⁽¹⁾, θ⁽²⁾, …, θ⁽ⁿ⁾ from p(θ | x)
– Then

  E(θ | x) = ∫ θ p(θ | x) dθ ≈ (1/n) Σᵢ₌₁ⁿ θ⁽ⁱ⁾
• Can also use samples to estimate posterior tail
area probabilities, percentiles, variances etc.
• Difficult to generate independent samples when
posterior is complex and high dimensional
• Instead, generate dependent samples from a
Markov chain having p(θ | x ) as its stationary
distribution → Markov chain Monte Carlo (MCMC)
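When direct sampling is unavailable, a Markov chain with the right stationary distribution does the job. A minimal random-walk Metropolis sketch targeting the Beta(170, 130) posterior through its unnormalised log-density (the proposal scale and seed are arbitrary illustration choices):

```python
import numpy as np

def log_post(theta):
    """Unnormalised log-posterior: log[theta^169 * (1 - theta)^129]."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return 169 * np.log(theta) + 129 * np.log(1 - theta)

rng = np.random.default_rng(42)
theta, chain = 0.5, []
for _ in range(20_000):
    prop = theta + rng.normal(0.0, 0.05)   # random-walk proposal
    # Metropolis acceptance step on the log scale
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    chain.append(theta)

mcmc_mean = float(np.mean(chain[2_000:]))  # discard burn-in
```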
Illustrative Examples
Borrowing strength
• Bayesian learning → borrowing “strength”
(precision) from other sources of information
• Informative prior is one such source
– “today’s posterior is tomorrow’s prior”
– relevance of prior information to current study must be
justified
Informative priors
Example 1: Western and Jackman (1994)*
• Example of regression analysis in comparative
research
• What explains cross-national variation in union
density?
– Union density is defined as the percentage of the work
force belonging to a labour union
• Two issues
– Philosophical: data represent all available observations
from a population → conventional (frequentist) analysis
based on long-run behaviour of repeatable data
mechanism not appropriate
– Practical: small, collinear dataset yields imprecise
estimates of regression effects
* Slides adapted from Jeff Grynaviski: http://home.uchicago.edu/~grynav/bayes/abs03.htm
Informative priors
• Competing theories
– Wallerstein: union density depends on the size of the
civilian labour force (LabF)
– Stephens: union density depends on industrial
concentration (IndC)
– Note: These two predictors correlate at -0.92.
• Control variable: presence of a left-wing
government (LeftG)
• Sample: n = 20 countries with a continuous history
of democracy since World War II
• Fit linear regression model to compare theories
union densityᵢ ~ N(μᵢ, σ²)
μᵢ = β₀ + β₁LeftGᵢ + β₂LabFᵢ + β₃IndCᵢ
Informative priors
• Results with non-informative priors on regression
coefficients (numerically equivalent to OLS analysis)
[Figure: point estimates and 95% CIs for the Ind Conc., Labour Force
and Left Govt regression coefficients]
Informative priors
Motivation for Bayesian approach with informative priors
• Because of small sample size and multicollinear
variables, not able to adjudicate between theories
• Data tend to favour Wallerstein (union density
depends on labour force size), but neither coefficient
estimated very precisely
• Other historical data are available that could provide
further relevant information
• Incorporation of prior information provides additional
structure to the data, which helps to uniquely identify
the two coefficients
Informative priors
Prior distributions for regression coefficients
Wallerstein
• Believes in negative labour force effect
• Comparison of Sweden and Norway in 1950:
→ doubling of labour force corresponds to 3.5-4% drop in
union density
→ on log scale, labour force effect size ≈ -3.5/log(2) ≈ -5
• Confidence in direction of effect represented by
prior SD giving 95% interval that excludes 0
β₂ ~ N(−5, 2.5²)
Informative priors
Prior distributions for regression coefficients
Stephens
• Believes in positive industrial concentration effect
• Decline in industrial concentration in UK in 1980s:
→ drop of 0.3 in industrial concentration corresponds to
about 3% drop in union density
→ industrial concentration effect size ≈ 3/0.3 = 10
• Confidence in direction of effect represented by
prior SD giving 95% interval that excludes 0
β₃ ~ N(10, 5²)
Informative priors
Prior distributions for regression coefficients
Wallerstein and Stephens
• Both believe left-wing gov’ts assist union growth
• Assuming 1 year of left-wing gov’t increases union
density by about 1% translates to effect size of 0.3
• Confidence in direction of effect represented by
prior SD giving 95% interval that excludes 0
β₁ ~ N(0.3, 0.15²)
• Vague prior β₀ ~ N(0, 100²) assumed for intercept
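If the error variance is treated as known, Gaussian priors like those above combine with the likelihood in closed form. A sketch on synthetic data, since the 20-country dataset is not reproduced on the slides; the known sigma² and the data-generating coefficients are illustration assumptions, while the prior means and SDs are the ones listed on these slides:

```python
import numpy as np

def posterior_coeffs(X, y, sigma2, m0, V0):
    """Posterior N(mn, Vn) for beta in y ~ N(X @ beta, sigma2 * I),
    with prior beta ~ N(m0, V0) and sigma2 assumed known."""
    V0_inv = np.linalg.inv(V0)
    Vn = np.linalg.inv(V0_inv + X.T @ X / sigma2)
    mn = Vn @ (V0_inv @ m0 + X.T @ y / sigma2)
    return mn, Vn

# Synthetic stand-in for the union density data: columns are
# intercept, LeftG, LabF (log scale), IndC
rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(3)])
y = X @ np.array([50.0, 0.3, -5.0, 10.0]) + rng.normal(0.0, 5.0, size=n)

# Informative priors from the slides
m0 = np.array([0.0, 0.3, -5.0, 10.0])
V0 = np.diag([100.0, 0.15, 2.5, 5.0]) ** 2

mn, Vn = posterior_coeffs(X, y, sigma2=25.0, m0=m0, V0=V0)
post_sd = np.sqrt(np.diag(Vn))   # data can only sharpen the prior
```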
Informative priors
[Figure: posterior estimates of the Ind Conc, Lab Force and Left Govt
coefficients under vague priors, Wallerstein priors and Stephens priors]
Informative priors
• Effects of LabF and IndC estimated more precisely
• Both sets of prior beliefs support inference that
labour-force size decreases union density
• Only Stephens’ prior supports conclusion that
industrial concentration increases union density
• Choice of prior is subjective – if no consensus, can
we be satisfied that data have been interpreted
“fairly”?
• Sensitivity analysis
– Sensitivity to priors (e.g. repeat analysis using priors with
increasing variance)
– Sensitivity to data (e.g. residuals, influence diagnostics)
Hierarchical priors
• Hierarchical priors are another widely used approach for
borrowing strength
• Useful when data available on many “similar” units
(individuals, areas, studies, subgroups,…)
• Data xᵢ and parameters θᵢ for each unit i = 1, …, N
• Three different assumptions:
– Independent parameters: units are unrelated, and each θᵢ is
estimated separately using data xᵢ alone
– Identical parameters: observations treated as coming from same
unit, with common parameter θ
– Exchangeable parameters: units are “similar” (labels convey no
information) → mathematically equivalent to assuming the θᵢ are
drawn from a common probability distribution with unknown
parameters
Meta-analysis
Example 2: Meta-analysis (Spiegelhalter et al 2004)
• 8 small RCTs of IV magnesium sulphate following acute
myocardial infarction
• Data: xig = deaths, nig = patients in trial i, treatment group
g (0=control, 1=magnesium)
• Model (likelihood):
x_ig ~ Binomial(p_ig, n_ig)
logit(p_ig) = φᵢ + θᵢ·g
• θᵢ is the log odds ratio for the treatment effect
• If not willing to believe trials are identical, but no reason
to believe they are systematically different → assume the θᵢ
are exchangeable with hierarchical prior
θᵢ ~ Normal(μ, σ²)
μ, σ² also treated as unknown with (vague) priors
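Under the usual normal approximation to each trial's log-odds-ratio likelihood (yᵢ ≈ N(θᵢ, sᵢ²), a simplification of the exact binomial-logit model above), the hierarchical model can be fitted with a few lines of Gibbs sampling. The yᵢ values below are hypothetical, since the trial-level estimates are not printed on these slides; the sᵢ² are the Var(MLE) values from the effective-sample-size table later on:

```python
import numpy as np

# y_i = observed log odds ratio (HYPOTHETICAL illustration values),
# s2_i = its sampling variance (Var(MLE) column from the ESS slide)
y = np.array([-0.7, -0.9, -0.4, -0.6, -1.1, -0.3, -0.8, 0.05])
s2 = np.array([1.12, 0.17, 0.55, 1.37, 0.23, 0.81, 1.04, 0.02])

rng = np.random.default_rng(7)
N = len(y)
mu, tau2 = 0.0, 1.0
theta_draws, mu_draws = [], []
for it in range(5_000):
    # theta_i | rest: precision-weighted compromise between y_i and mu
    prec = 1.0 / s2 + 1.0 / tau2
    theta = rng.normal((y / s2 + mu / tau2) / prec, np.sqrt(1.0 / prec))
    # mu | rest (flat prior): Normal(mean of thetas, tau2 / N)
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / N))
    # tau2 | rest: inverse-gamma, with a vague IG(0.1, 0.1) prior (arbitrary)
    shape = 0.1 + N / 2.0
    rate = 0.1 + 0.5 * ((theta - mu) ** 2).sum()
    tau2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    if it >= 1_000:                      # discard burn-in
        theta_draws.append(theta)
        mu_draws.append(mu)

theta_post = np.mean(theta_draws, axis=0)   # shrunk trial effects
mu_post = float(np.mean(mu_draws))          # overall treatment effect
```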
Meta-analysis
Estimates and 95% intervals for treatment effect from independent MLE
and hierarchical Bayesian analysis
[Figure: forest plot of mortality ORs (95% intervals) for the trials
Morton, Rasmussen, Smith, Abraham, Feldstedt, Shecter, Ceremuz. and
LIMIT-2, plus the overall effect, comparing fixed-effects (independent
MLE) and hierarchical Bayes estimates; OR axis from 0.10 to 1.50]
Meta-analysis
• Effective sample size:

  ESS = n × V1 / V2

  n  = sample size of trial
  V1 = variance of θᵢ without borrowing (var of MLE)
  V2 = variance of θᵢ with borrowing (posterior variance of θᵢ)

  Trial   n      Var(MLE)   Var(θᵢ|x)   V1/V2   ESS
  1       76     1.12       0.15        7.3     555
  2       270    0.17       0.10        1.7     459
  3       400    0.55       0.15        3.7     1480
  4       94     1.37       0.16        8.5     799
  5       298    0.23       0.10        2.2     656
  6       115    0.81       0.22        3.6     414
  7       48     1.04       0.16        6.4     307
  8       2316   0.02       0.02        1.1     2350
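The ESS column follows directly from the printed n and V1/V2 values; a quick check in Python (note trial 8: its printed ESS of 2350 evidently uses the unrounded variance ratio, so the two-decimal figures do not reproduce it):

```python
# ESS = n * V1/V2, using the n and V1/V2 columns as printed on the slide
n = [76, 270, 400, 94, 298, 115, 48, 2316]
ratio = [7.3, 1.7, 3.7, 8.5, 2.2, 3.6, 6.4, 1.1]

ess = [round(ni * r) for ni, r in zip(n, ratio)]
# First seven trials match the slide: 555, 459, 1480, 799, 656, 414, 307
```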
Meta-analysis
• Example 3: Meta-analysis of the effect of class size on
educational achievement (Goldstein et al, 2000)
• 8 studies:
– 1 RCT
– 3 matched
– 2 experimental
– 2 observational

[Figure: standardised test score (−0.3 to 0.3) plotted against
class size (15 to 35) for the 8 studies]
Meta-analysis
[Figure: estimated effect of class size (change in score per extra
pupil) with intervals for studies 1–8 and overall, showing Goldstein
et al's estimates alongside the Bayesian ones]

• Goldstein et al use maximum likelihood, with
bootstrap CI due to small sample size
• This under-estimates uncertainty relative to the
Bayesian intervals
• Note that the 95% CI for the Bayesian estimate of
the effect of class size includes 0
Accounting for data quality
• Bayesian approach also provides formal
framework for propagating uncertainty about
different quantities in a model
• Natural tool for explicitly modelling different
aspects of data quality
– Measurement error
– Missing data
Accounting for data quality
Example 4: Accounting for population errors in small-area estimation and disease mapping
(Best and Wakefield, 1999)
• Context: Mapping geographical variations in risk of
breast cancer by electoral ward in SE England,
1981-1991
• Typical model:

  yᵢ ~ Poisson(λᵢ Nᵢ)

  yᵢ = number of breast cancer cases in area i
  Nᵢ = Σₜ Nᵢₜ = population-years at risk in area i
  λᵢ is the area-specific rate of breast cancer:
  the parameter of interest
Accounting for data quality
• Ni usually assumed to be known
• Ignores uncertainty in small-area age/sex population counts
in inter-census years
• B&W make use of additional data on Registrar General’s
mid-year district age/sex population totals Ndt
• Model A: Nit = Ndt pit where pit is proportion of annual district
population in particular age group of interest living in ward i
pit estimated by interpolating 1981 and 1991 census counts
• Model B: Allow for sampling variability in Nit
Nit ~ Multinomial(Ndt, [ p1t ,…, pKt ])
• Model C: Allow for uncertainty in proportions pit
pit ~ informative Dirichlet prior distribution
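Model B's extra layer is just a multinomial draw for the ward-level counts; a sketch with a hypothetical district total, ward proportions and disease rate (none of these numbers come from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

N_dt = 10_000                      # district population total (hypothetical)
p_it = np.array([0.5, 0.3, 0.2])   # ward shares of the district (hypothetical)
lam = 0.002                        # area disease rate (hypothetical)

# Model B: sampling variability in ward counts given the district total
N_it = rng.multinomial(N_dt, p_it)

# Disease counts then inherit the population uncertainty
y_i = rng.poisson(lam * N_it)
```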
Accounting for data quality
Random effects Poisson regression: log λᵢ = αᵢ + β Xᵢ
Xᵢ = deprivation score for ward i

[Figure: directed acyclic graph for ward i: priors on μ, σ² and β;
αᵢ drawn given μ and σ²; λᵢ determined by αᵢ, β and Xᵢ;
yᵢ depends on λᵢ and Nᵢ]
Accounting for data quality
[Figure: extended graph adding a population sub-model: for each year t,
ward counts Nᵢₜ depend on the district total N_dt and proportions pᵢₜ
(each with priors), and Nᵢ feeds into the disease model for yᵢ]
Accounting for data quality
[Figure: area-specific RR estimates (ward RRs with Nᵢ assumed known
vs modelled) and the RR of breast cancer for affluent vs deprived
wards, under models A, B and C]
Model uncertainty
• Model uncertainty can be large for observational
data studies
• In regression models:
– What is the ‘best’ set of predictors for response of
interest?
– Which confounders to control for?
– Which interactions to include?
– What functional form to use (linear, nonlinear,….)?
Model uncertainty
• Example 5: Predictors of crime rates in US States (adapted
from Raftery et al, 1997)
• Ehrlich (1973) – developed and tested theory that decision to
commit crime is rational choice based on costs and benefits
• Costs of crime related to probability of imprisonment and
average length of time served in prison
• Benefits of crime related to income inequalities and aggregate
wealth of community
• Net benefits of other (legitimate) activities related to
employment rate and education levels in community
• Ehrlich analysed data from 47 US states in 1960, focusing on
relationship between crime rate and the 2 prison variables
• Up to 13 candidate control variables also considered
Model uncertainty
• y = log crime rate in 1960 in each of 47 US states
• Z1, Z2 = log prob. of prison, log av. time in prison
• X1,…, X13 = candidate control variables
• Fit Normal linear regression model
• Results sensitive to choice of control variables
Table adapted from Table 2 in Raftery et al (1997):

                                          Estimated effect (SE)
  Model            Control variables      Prob Prison    Time Prison
  Full model       All                    −0.30 (0.10)   −0.27 (0.17)
  Stepwise         1,3,4,9,11,13          −0.19 (0.07)   –
  Adjusted R2      1,2,4,7,8,9,11,12,13   −0.30 (0.09)   −0.25 (0.16)
  Ehrlich model 1  9,12,13                −0.45 (0.12)   −0.55 (0.20)
  Ehrlich model 2  1,6,9,10,12,13         −0.43 (0.11)   −0.53 (0.20)
Model uncertainty
• Using the Bayesian approach, we can let the set of control
variables be an unknown parameter of the model, θ
• We don't know (a priori) the number of covariates in the ‘best’ model
→ θ has unknown dimension → assign it a prior distribution
• Can handle such “trans-dimensional” (TD) models using
“reversible jump” MCMC algorithms
• Normal linear regression model:

  yᵢ ~ Normal(μᵢ, σ²),  i = 1, …, 47

• Variable selection model:

  μᵢ = Zᵢγ + Wᵢβ,  where Wᵢ = (X_iθ₁, X_iθ₂, …, X_iθₖ)

  k = number of currently selected predictors
  θ = vector of k selected columns of X
  β = vector of k regression coefficients
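Reversible-jump MCMC is needed when the candidate set is large, but the same posterior over models can be sketched by brute-force enumeration when it is small. In the sketch below, model weights come from the BIC approximation to each model's marginal likelihood (in the spirit of Raftery et al's BMA work); the data are synthetic, with only the first candidate variable truly active:

```python
import itertools
import numpy as np

rng = np.random.default_rng(11)
n, p = 47, 3                                 # 47 "states", 3 candidates
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)       # only X_1 matters here

def bic(cols):
    """BIC of the normal linear model with intercept + given columns."""
    Xm = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    rss = float(((y - Xm @ beta) ** 2).sum())
    return n * np.log(rss / n) + Xm.shape[1] * np.log(n)

# Enumerate all 2^p subsets and convert BICs to model probabilities
models = [c for r in range(p + 1) for c in itertools.combinations(range(p), r)]
bics = np.array([bic(m) for m in models])
w = np.exp(-0.5 * (bics - bics.min()))
w /= w.sum()                                  # posterior model probabilities

# Marginal probability that each candidate variable is in the model
incl = np.array([sum(wi for wi, m in zip(w, models) if j in m)
                 for j in range(p)])
```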
Model uncertainty
[Figure: directed acyclic graph for state i: k, θ, β and γ with priors;
μᵢ determined by Xᵢ, Zᵢ and the selected coefficients;
yᵢ depends on μᵢ and σ²]
Model uncertainty
[Figure: WinBUGS posterior distribution of k, the number of selected
control variables (20,000 samples)]
Model uncertainty
[Figure: for each of the 13 candidate control variables, the posterior
probability that it is in the model, and the posterior mean and 95% CI
for its effect (β) conditional on being in the model; corresponding
plots shown for the Prob prison and Time prison effects on crime rate]
Model uncertainty
• Most likely (40%) set of control variables contains X4 (police
expenditure in 1960) and X13 (income inequality)
• 2nd most likely (28%) set of control variables contains X5
(police expenditure in 1959) and X13 (income inequality)
• Control variables with >10% marginal probability of inclusion
– X3 : average years of schooling (18%)
– X4 : police expenditure in 1960 (56%)
– X5 : police expenditure in 1959 (40%)
– X13 : income inequality (94%)
• Posterior estimates of prison variables, averaged over models
log prob. of prison: -0.29 (-0.55, -0.05)
log av. time in prison: -0.27 (-0.69, 0.14)
Discussion
• Bayesian approach provides coherent framework
for combining many sources of evidence in a
statistical model
• Formal approach to “borrowing strength”
– Improved precision/effective sample size
– Fully accounts for uncertainty
• Relevance of different pieces of evidence is a
judgement – must be justifiable
• Bayesian approach forces us to be explicit about
model assumptions
• Sensitivity analysis to assumptions is crucial
Discussion
• Bayesian calculations are computationally
intensive, but:
– Provides exact inference; no asymptotics
– MCMC offers huge flexibility to model complex problems
• All examples discussed here were fitted using free
WinBUGS software: www.mrc-bsu.cam.ac.uk
• Want to learn more about using Bayesian methods
for social science data analysis?
– Short course: Introduction to Bayesian inference
and WinBUGS, Sept 19-20, Imperial College
See www.bias-project.org.uk for details
Thank you!
References
• Best, N. and Wakefield, J. (1999). Accounting for inaccuracies in
population counts and case registration in cancer mapping studies.
J Roy Statist Soc, Series A, 162: 363-382.
• Goldstein, H., Yang, M., Omar, R., Turner, R. and Thompson, S.
(2000). Meta-analysis using multilevel models with an application to
the study of class size effects. Applied Statistics, 49: 399-412.
• Raftery, A., Madigan, D. and Hoeting, J. (1997). Bayesian model
averaging for linear regression models. J Am Statist Assoc, 92: 179-191.
• Spiegelhalter, D., Abrams, K. and Myles, J. (2004). Bayesian
Approaches to Clinical Trials and Health Care Evaluation, Wiley,
Chichester.
• Western, B. and Jackman, S. (1994). Bayesian inference for
comparative research. The American Political Science Review, 88:
412-423.