Bayesian inference


Bayesian models for fMRI data
Klaas Enno Stephan
Laboratory for Social and Neural Systems Research
Institute for Empirical Research in Economics
University of Zurich
Functional Imaging Laboratory (FIL)
Wellcome Trust Centre for Neuroimaging
University College London
With many thanks for slides & images to:
FIL Methods group,
particularly Guillaume Flandin
The Reverend Thomas Bayes
(1702-1761)
Methods & models for fMRI data analysis in neuroeconomics
April 2010
Why do I need to learn about Bayesian stats?
Because SPM is getting more and more Bayesian:
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
– 1st level: specific spatial priors
– 2nd level: global spatial priors
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction
[Figure: the standard SPM analysis pipeline (image time-series → realignment → smoothing with a kernel → normalisation to a template → general linear model with a design matrix → statistical inference via Gaussian field theory, p < 0.05 → statistical parametric map (SPM) and parameter estimates), annotated with its Bayesian components: Bayesian segmentation and normalisation, spatial priors on activation extent, posterior probability maps (PPMs), and Dynamic Causal Modelling.]
Problems of classical (frequentist) statistics

p-value: probability of observing the data (or more extreme data) in the effect's absence. If small, reject the null hypothesis that there is no effect.

  H0: θ = 0
  p(y | H0): probability of observing the data y, given no effect (θ = 0).

Limitations:
• One can never accept the null hypothesis
• Given enough data, one can always demonstrate a significant effect
• Correction for multiple comparisons is necessary

Solution: infer the posterior probability of the effect

  p(θ | y): probability of the effect, given the observed data
Overview of topics
• Bayes' rule
• Bayesian update rules for Gaussian densities
• Bayesian analyses in SPM
– Segmentation & spatial normalisation
– Posterior probability maps (PPMs)
• 1st level: specific spatial priors
• 2nd level: global spatial priors
– Bayesian Model Selection (BMS)
Bayes' Theorem

  p(θ | y) = p(y | θ) p(θ) / p(y)

  posterior = likelihood × prior / evidence

Reverend Thomas Bayes
1702 - 1761

"Bayes' Theorem describes how an ideally rational person processes information."
(Wikipedia)
Bayes’ Theorem
Given data y and parameters θ, the conditional probabilities are:

  p(y | θ) = p(y, θ) / p(θ)
  p(θ | y) = p(y, θ) / p(y)

Eliminating p(y, θ) gives Bayes' rule:

  p(θ | y) = p(y | θ) p(θ) / p(y)

  posterior = likelihood × prior / evidence
Bayesian statistics

  new data: p(y | θ)        prior knowledge: p(θ)

  p(θ | y) ∝ p(y | θ) ∙ p(θ)
  posterior ∝ likelihood ∙ prior

Bayes' theorem allows one to formally incorporate prior knowledge into computing statistical probabilities.

Priors can be of different sorts: empirical, principled or shrinkage priors.

The "posterior" probability of the parameters given the data is an optimal combination of prior knowledge and new data, weighted by their relative precision.
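To make the rule concrete, here is a minimal numerical sketch (not from the original slides; all numbers are illustrative) that evaluates posterior ∝ likelihood ∙ prior on a grid of candidate effect sizes:

```python
import numpy as np
from scipy.stats import norm

# Grid of candidate parameter values (effect sizes)
theta = np.linspace(-10, 10, 1001)

# Prior knowledge: the effect is probably small (zero-mean Gaussian prior)
prior = norm.pdf(theta, loc=0.0, scale=2.0)

# New data: one noisy observation y with known noise SD (illustrative values)
y, sigma_e = 3.0, 1.5
likelihood = norm.pdf(y, loc=theta, scale=sigma_e)

# Bayes' rule: posterior ∝ likelihood * prior; normalise by the evidence p(y)
unnormalised = likelihood * prior
evidence = np.trapz(unnormalised, theta)        # p(y), numerical integral over theta
posterior = unnormalised / evidence

print("posterior mean:", np.trapz(theta * posterior, theta))
```

The posterior mean lands between the prior mean (0) and the observation (3), closer to whichever is more precise.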
Bayes in motion - an animation
Principles of Bayesian inference

• Formulation of a generative model
    likelihood p(y | θ)
    prior distribution p(θ)
• Observation of data y
• Update of beliefs based upon observations, given a prior state of knowledge
    p(θ | y) ∝ p(y | θ) p(θ)
Posterior mean & variance of univariate Gaussians

Likelihood & prior:
  y = θ + e
  p(y | θ) = N(y; θ, σₑ²)
  p(θ) = N(θ; μₚ, σₚ²)

Posterior: p(θ | y) = N(θ; μ, σ²)

  1/σ² = 1/σₑ² + 1/σₚ²
  μ = σ² ( μₚ/σₚ² + y/σₑ² )

Posterior mean = variance-weighted combination of prior mean and data mean.

[Figure: prior (centred on μₚ), likelihood (centred on y) and posterior densities over θ.]
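A minimal sketch of these update equations (illustrative values, not from the slides):

```python
import numpy as np

def posterior_univariate(y, sigma_e, mu_p, sigma_p):
    """Posterior mean & variance for a univariate Gaussian likelihood and prior."""
    post_var = 1.0 / (1.0 / sigma_e**2 + 1.0 / sigma_p**2)        # 1/sigma^2 = 1/sigma_e^2 + 1/sigma_p^2
    post_mean = post_var * (mu_p / sigma_p**2 + y / sigma_e**2)   # mu = sigma^2 (mu_p/sigma_p^2 + y/sigma_e^2)
    return post_mean, post_var

# Example: precise data (sigma_e = 1) and a vaguer prior (sigma_p = 3)
mu, var = posterior_univariate(y=4.0, sigma_e=1.0, mu_p=0.0, sigma_p=3.0)
print(mu, var)   # posterior mean lies close to the data; posterior variance is smaller than both
```

The same computation can be written with precisions (λ = 1/σ²), which is the form shown on the next slide.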
Same thing – but expressed as precision weighting

Likelihood & prior:
  y = θ + e
  p(y | θ) = N(y; θ, λₑ⁻¹)
  p(θ) = N(θ; μₚ, λₚ⁻¹)

Posterior: p(θ | y) = N(θ; μ, λ⁻¹)

  λ = λₑ + λₚ
  μ = (λₑ/λ) y + (λₚ/λ) μₚ

Relative precision weighting.

[Figure: prior, likelihood and posterior densities over θ.]
Same thing – but explicit hierarchical perspective

Likelihood & prior:
  y = θ⁽¹⁾ + ε⁽¹⁾
  θ⁽¹⁾ = θ⁽²⁾ + ε⁽²⁾
  p(y | θ⁽¹⁾) = N(y; θ⁽¹⁾, 1/λ⁽¹⁾)
  p(θ⁽¹⁾) = N(θ⁽¹⁾; θ⁽²⁾, 1/λ⁽²⁾)

Posterior: p(θ⁽¹⁾ | y) = N(θ⁽¹⁾; μ, 1/λ)

  λ = λ⁽¹⁾ + λ⁽²⁾
  μ = (λ⁽¹⁾/λ) y + (λ⁽²⁾/λ) θ⁽²⁾

Relative precision weighting.

[Figure: prior, likelihood and posterior densities over θ⁽¹⁾.]
Bayesian regression: univariate case

Univariate linear model:  y = xθ + e

Normal densities:
  p(θ) = N(θ; μₚ, σₚ²)
  p(y | θ) = N(y; xθ, σₑ²)

Posterior: p(θ | y) = N(θ; μθ|y, σ²θ|y)

  1/σ²θ|y = x²/σₑ² + 1/σₚ²
  μθ|y = σ²θ|y ( x y/σₑ² + μₚ/σₚ² )

Relative precision weighting.
Bayesian GLM: multivariate case

General Linear Model:  y = Xθ + e

Normal densities:
  p(θ) = N(θ; ηₚ, Cₚ)
  p(y | θ) = N(y; Xθ, Cₑ)

Posterior: p(θ | y) = N(θ; ηθ|y, Cθ|y)

  Cθ|y = ( Xᵀ Cₑ⁻¹ X + Cₚ⁻¹ )⁻¹
  ηθ|y = Cθ|y ( Xᵀ Cₑ⁻¹ y + Cₚ⁻¹ ηₚ )

One step if Cₑ is known. Otherwise iterative estimation with EM.
An intuitive example
[Figure: contours of prior, likelihood and posterior over the parameters (θ1, θ2), axes from -10 to 10.]

Still intuitive
[Figure: contours of prior, likelihood and posterior over (θ1, θ2).]

Less intuitive
[Figure: contours of prior, likelihood and posterior over (θ1, θ2).]
Bayesian (fixed effects) group analysis

Likelihood distributions from different subjects are independent.

Under Gaussian assumptions this is easy to compute: one can use the posterior from one subject as the prior for the next:

  p(θ | y1, ..., yN) ∝ p(θ) ∏ᵢ₌₁..ᴺ p(yᵢ | θ)
                     = p(θ | y1) ∏ᵢ₌₂..ᴺ p(yᵢ | θ)
                     = p(θ | y1, y2) ∏ᵢ₌₃..ᴺ p(yᵢ | θ)
                     = ...
                     = p(θ | y1, ..., yN-1) p(yN | θ)

"Today's posterior is tomorrow's prior"

Group posterior covariance (from the individual posterior covariances):

  Cθ|y1,...,yN⁻¹ = Σᵢ₌₁ᴺ Cθ|yᵢ⁻¹

Group posterior mean (from the individual posterior covariances and means):

  ηθ|y1,...,yN = Cθ|y1,...,yN Σᵢ₌₁ᴺ Cθ|yᵢ⁻¹ ηθ|yᵢ
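To illustrate "today's posterior is tomorrow's prior", here is a univariate sketch with made-up per-subject posterior means and variances, showing that the sequential update reproduces the batch formulas above:

```python
import numpy as np

# Per-subject posterior means and variances for one effect (illustrative numbers)
means = np.array([0.8, 1.2, 0.5, 1.0])
variances = np.array([0.4, 0.3, 0.5, 0.2])

# Batch fixed-effects group posterior: precisions add, means are precision-weighted
group_prec = np.sum(1.0 / variances)                  # C_group^-1 = sum_i C_i^-1
group_mean = np.sum(means / variances) / group_prec   # precision-weighted mean

# Same result obtained sequentially: fold in one subject's posterior at a time
m, v = means[0], variances[0]
for mi, vi in zip(means[1:], variances[1:]):
    prec = 1.0 / v + 1.0 / vi
    m = (m / v + mi / vi) / prec
    v = 1.0 / prec

print(group_mean, 1.0 / group_prec)   # batch result
print(m, v)                           # sequential result (identical)
```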
Bayesian analyses in SPM8
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
– 1st level: specific spatial priors
– 2nd level: global spatial priors
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction
Spatial normalisation: Bayesian regularisation

Deformations consist of a linear combination of smooth basis functions: the lowest frequencies of a 3D discrete cosine transform.

Find maximum a posteriori (MAP) estimates of the deformation parameters: simultaneously minimise
– the squared difference between template and source image
– the squared difference between parameters and their priors

MAP:
  log p(θ | y) = log p(y | θ) + log p(θ) - log p(y)

  log p(y | θ): "difference" between template and source image
  log p(θ): squared distance between parameters and their expected values (regularisation)
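A toy 1D sketch of this MAP objective (entirely illustrative: a 1D "image", a single smooth basis function as the deformation model, and made-up noise and prior variances; SPM's actual implementation is far more elaborate):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy 1D "images": a template and a slightly shifted source signal
x = np.linspace(0, 1, 200)
template = np.exp(-((x - 0.50) ** 2) / 0.01)
source = np.exp(-((x - 0.55) ** 2) / 0.01)

# Deformation: displacement field = theta * (one smooth, low-frequency basis function)
basis = np.cos(np.pi * x / 2)
sigma_e2, sigma_p2, mu_p = 0.01, 0.1, 0.0      # noise variance, prior variance and mean

def neg_log_posterior(theta):
    warped = np.interp(x + theta * basis, x, source)                # warp source towards template
    data_term = np.sum((template - warped) ** 2) / (2 * sigma_e2)   # -log p(y | theta), up to constants
    prior_term = (theta - mu_p) ** 2 / (2 * sigma_p2)               # -log p(theta), up to constants
    return data_term + prior_term

theta_map = minimize_scalar(neg_log_posterior, bounds=(-0.5, 0.5), method="bounded").x
print("MAP deformation parameter:", theta_map)
```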
Bayesian segmentation with empirical priors

• Goal: for each voxel, compute the probability that it belongs to a particular tissue type, given its intensity:

  p(tissue | intensity) ∝ p(intensity | tissue) ∙ p(tissue)

• Likelihood model: intensities are modelled by a mixture of Gaussian distributions representing different tissue classes (e.g. GM, WM, CSF).

• Priors are obtained from tissue probability maps (segmented images of 151 subjects).

Ashburner & Friston 2005, NeuroImage
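A minimal sketch of the voxel-wise posterior computation (made-up intensity means, SDs and prior map values; the full model in Ashburner & Friston 2005 additionally estimates bias fields and deformations):

```python
import numpy as np
from scipy.stats import norm

# Illustrative Gaussian intensity models for three tissue classes
classes = ["GM", "WM", "CSF"]
means = np.array([0.55, 0.75, 0.25])   # hypothetical mean intensities
sds = np.array([0.08, 0.06, 0.10])

# Prior tissue probabilities at this voxel (from the tissue probability maps)
prior = np.array([0.60, 0.30, 0.10])

# Observed intensity at the voxel
intensity = 0.70

# p(tissue | intensity) ∝ p(intensity | tissue) * p(tissue)
likelihood = norm.pdf(intensity, loc=means, scale=sds)
posterior = likelihood * prior
posterior /= posterior.sum()

for c, p in zip(classes, posterior):
    print(f"p({c} | intensity) = {p:.3f}")
```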
Unified segmentation & normalisation
• Circular relationship between segmentation & normalisation:
– Knowing which tissue type a voxel belongs to helps normalisation.
– Knowing where a voxel is (in standard space) helps segmentation.
• Build a joint generative model:
– model how voxel intensities result from mixture of tissue type distributions
– model how tissue types of one brain have to be spatially deformed to match
those of another brain
• Using a priori knowledge about the parameters:
adopt Bayesian approach and maximise the posterior probability
Ashburner & Friston 2005, NeuroImage
Bayesian fMRI analyses

General Linear Model:
  y = Xθ + ε   with   ε ~ N(0, Cε)

What are the priors?
• In "classical" SPM, no priors (= "flat" priors)
• Full Bayes: priors are predefined on a principled or empirical basis
• Empirical Bayes: priors are estimated from the data, assuming a hierarchical generative model → PPMs in SPM
  – Parameters of one level = priors for the distribution of parameters at the lower level
  – Parameters and hyperparameters at each level can be estimated using EM
Hierarchical models and Empirical Bayes

Hierarchical model:
  y = X⁽¹⁾θ⁽¹⁾ + ε⁽¹⁾
  θ⁽¹⁾ = X⁽²⁾θ⁽²⁾ + ε⁽²⁾
  ...
  θ⁽ⁿ⁻¹⁾ = X⁽ⁿ⁾θ⁽ⁿ⁾ + ε⁽ⁿ⁾

Parametric Empirical Bayes (PEB)        EM = PEB = ReML

Equivalent single-level model:
  y = ε⁽¹⁾ + X⁽¹⁾ε⁽²⁾ + ... + X⁽¹⁾···X⁽ⁿ⁻¹⁾ε⁽ⁿ⁾ + X⁽¹⁾···X⁽ⁿ⁾θ⁽ⁿ⁾

Restricted Maximum Likelihood (ReML)
Posterior Probability Maps (PPMs)

Posterior distribution: probability of the effect given the data
  p(θ | y)
  mean: size of the effect
  precision: variability

Posterior probability map: images of the probability (confidence) that an activation exceeds some specified threshold γ, given the data y:

  p(θ > γ | y) = ∫γ..∞ p(θ | y) dθ ≥ α

Two thresholds:
• activation threshold γ: percentage of whole-brain mean signal (physiologically relevant size of effect)
• probability α that voxels must exceed to be displayed (e.g. 95%)
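A minimal sketch of this voxel-wise computation, assuming the posterior at each voxel is Gaussian with known mean and standard deviation (illustrative numbers):

```python
import numpy as np
from scipy.stats import norm

# Posterior mean and standard deviation of the effect at a few voxels (illustrative)
post_mean = np.array([0.5, 2.5, 1.2, 3.1])
post_sd = np.array([0.4, 0.6, 1.5, 0.5])

gamma = 1.0    # activation threshold (physiologically relevant effect size)
alpha = 0.95   # probability that voxels must exceed to be displayed

# p(theta > gamma | y) for a Gaussian posterior
p_exceed = norm.sf(gamma, loc=post_mean, scale=post_sd)

ppm_displayed = p_exceed >= alpha
print(p_exceed.round(3), ppm_displayed)
```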
PPMs vs. SPMs

PPMs:
  p(θ | y) ∝ p(y | θ) p(θ)   (posterior ∝ likelihood × prior)
  Bayesian test: p(θ > γ | y) ≥ α

SPMs:
  t = f(y)
  Classical t-test: p(t > u | θ = 0) = α
2nd level PPMs with global priors

1st level (GLM):              y = X⁽¹⁾θ⁽¹⁾ + ε⁽¹⁾,   p(ε⁽¹⁾) = N(0, Cε⁽¹⁾)
2nd level (shrinkage prior):  θ⁽¹⁾ = θ⁽²⁾ + ε⁽²⁾ = 0 + ε⁽²⁾,   p(ε⁽²⁾) = N(0, Cε⁽²⁾)

Basic idea: use the variance of θ⁽¹⁾ over voxels as the prior variance of θ⁽¹⁾ at any particular voxel.

2nd level:
  θ⁽²⁾ = average effect over voxels
  ε⁽²⁾ = voxel-to-voxel variation

θ⁽¹⁾ reflects regionally specific effects
→ assume that it sums to zero over all voxels
→ shrinkage prior at the second level
→ variance of this prior is implicitly estimated by estimating ε⁽²⁾

[Figure: prior density p(θ) centred on zero.] In the absence of evidence to the contrary, parameters will shrink to zero.
Shrinkage Priors

[Figure: four panels showing the effect of the shrinkage prior on a small & variable effect, a large & variable effect, a small but clear effect, and a large & clear effect.]
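A small sketch of how such a zero-mean (shrinkage) prior acts on voxel-wise estimates: imprecise ("variable") effects are pulled strongly towards zero, precise ("clear") effects much less so (all numbers illustrative):

```python
import numpy as np

# Maximum-likelihood effect estimates at four voxels and their variances (illustrative)
ml_est = np.array([0.3, 2.0, 0.3, 2.0])      # small / large / small / large effect
ml_var = np.array([1.0, 1.0, 0.05, 0.05])    # variable / variable / clear / clear

prior_var = 0.5   # pooled (global) prior variance of the effect; prior mean = 0

# Posterior mean under a zero-mean Gaussian prior: precision-weighted combination with 0
post_mean = (ml_est / ml_var) / (1.0 / ml_var + 1.0 / prior_var)

for ml, pm in zip(ml_est, post_mean):
    print(f"ML estimate {ml:4.2f} -> posterior mean {pm:4.2f}")
```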
2nd level PPMs with global priors

1st level (GLM):              y = Xθ + ε⁽¹⁾,   p(ε) = N(0, Cε)   (voxel-specific)
2nd level (shrinkage prior):  θ = 0 + ε⁽²⁾,    p(θ) = N(0, Cθ)   (global, pooled estimate)

→ We are looking for the same effect over multiple voxels
→ Pooled estimation of Cθ over voxels

Once Cε and Cθ are known, we can apply the usual rule for computing the posterior mean & covariance:

  Cθ|y = ( Xᵀ Cε⁻¹ X + Cθ⁻¹ )⁻¹
  mθ|y = Cθ|y Xᵀ Cε⁻¹ y

Friston & Penny 2003, NeuroImage
PPMs and multiple comparisons

No need to correct for multiple comparisons:

Thresholding a PPM at 95% confidence: in every voxel, the posterior probability of an activation ≥ γ is ≥ 95%.

At most, 5% of the voxels identified could have activations less than γ.

Independent of the search volume, thresholding a PPM thus puts an upper bound on the false discovery rate.
PPMs vs. SPMs

[Figure: SPM results windows for the same PET dataset and contrast ("rest"), comparing a PPM (effect size threshold γ = 2.06, height threshold P = 0.95, extent threshold k = 0 voxels) with a classical SPM{T} (height threshold T = 5.50, extent threshold k = 0 voxels), each shown alongside the design matrix.]

PPMs: show activations greater than a given size.
SPMs: show voxels with non-zero activations.
PPMs: pros and cons

Advantages
• One can infer that a cause did not elicit a response
• Inference is independent of search volume
• SPMs conflate effect size and effect variability (PPMs do not)

Disadvantages
• Estimating priors over voxels is computationally demanding
• Practical benefits are yet to be established
• Thresholds other than zero require justification
1st level PPMs with local spatial priors

• Neighbouring voxels are often not independent
• Spatial dependencies vary across the brain
• But spatial smoothing in SPM is uniform
• Matched filter theorem: SNR is maximal when smoothing the data with a kernel which matches the smoothness of the true signal
• Basic idea: estimate regional spatial dependencies from the data and use this as a prior in a PPM
  → regionally specific smoothing
  → markedly increased sensitivity

[Figure: contrast map and AR(1) map.]

Penny et al. 2005, NeuroImage
The generative spatio-temporal model

  Y = XW + E

W = regression coefficients, A = autoregressive parameters, E = observation noise.

Priors (with hyperparameters q1, q2, r1, r2, u1, u2):

  p(α) = ∏ₖ₌₁..K p(αₖ),    p(αₖ) = Ga(αₖ; q1, q2)
  p(β) = ∏ₚ₌₁..P p(βₚ),    p(βₚ) = Ga(βₚ; r1, r2)
  p(λ) = ∏ₙ₌₁..N p(λₙ),    p(λₙ) = Ga(λₙ; u1, u2)

  p(W) = ∏ₖ₌₁..K p(wₖᵀ),   p(wₖᵀ) = N(wₖᵀ; 0, αₖ⁻¹ (SᵀS)⁻¹)
  p(A) = ∏ₚ₌₁..P p(aₚ),    p(aₚ) = N(aₚ; 0, βₚ⁻¹ (SᵀS)⁻¹)

α = spatial precision of the (regression) parameters
λ = observation noise precision
β = precision of the AR coefficients

Penny et al. 2005, NeuroImage
The spatial prior

Prior for the k-th parameter (map of regression coefficients):

  p(wₖᵀ) = N(wₖᵀ; 0, αₖ⁻¹ (SᵀS)⁻¹)

  0: shrinkage prior (zero mean)
  αₖ: spatial precision, determines the amount of smoothness
  S: spatial kernel matrix

Different choices are possible for the spatial kernel matrix S.
Currently used in SPM: Laplacian prior (same as in LORETA).
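A sketch of what such a spatial precision matrix looks like for a tiny 1D chain of voxels (a Laplacian-style construction for illustration only; SPM's actual kernel matrix and boundary handling differ):

```python
import numpy as np

# Spatial kernel matrix S for a 1D chain of 5 voxels: a Laplacian-like operator
# that penalises differences between neighbouring voxels
n = 5
S = 2.0 * np.eye(n)
for i in range(n - 1):
    S[i, i + 1] = S[i + 1, i] = -1.0

alpha_k = 2.0                               # spatial precision (amount of smoothness)
prior_precision = alpha_k * (S.T @ S)       # precision of w_k in p(w_k) = N(0, (alpha_k S'S)^-1)
prior_cov = np.linalg.inv(prior_precision)  # implied prior covariance over voxels

print(np.round(prior_cov, 3))   # non-zero off-diagonals: neighbouring voxels are correlated a priori
```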
Example: application to event-related fMRI data

[Figure: contrast maps for familiar vs. non-familiar faces, obtained with (i) smoothing, (ii) a global spatial prior, and (iii) a Laplacian prior.]
Bayesian model selection (BMS)

Given competing hypotheses about the structure & functional mechanisms of a system, which model is the best?

Which model represents the best balance between model fit and model complexity?

For which model m does p(y|m) become maximal?

Pitt & Myung (2002), TICS
Bayesian model selection (BMS)

Bayes' rule:
  p(θ | y, m) = p(y | θ, m) p(θ | m) / p(y | m)

Model evidence:
  p(y | m) = ∫ p(y | θ, m) p(θ | m) dθ

• accounts for both accuracy and complexity of the model
• allows for inference about the structure (generalisability) of the model

Various approximations, e.g.:
- negative free energy
- AIC
- BIC

Model comparison via the Bayes factor:
  BF = p(y | m1) / p(y | m2)

Penny et al. (2004) NeuroImage
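A sketch of one of the listed approximations (BIC) used to approximate the log model evidence for two competing linear models of the same data and to form a Bayes factor (simulated data, illustrative models; the negative free energy used for DCM is not shown):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data generated by a 2-regressor model
n = 200
X_full = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X_full[:, :2] @ np.array([1.0, 0.8]) + rng.normal(scale=1.0, size=n)

def log_evidence_bic(y, X):
    """Approximate log p(y|m) by -BIC/2 for a Gaussian linear model."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)
    log_lik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                            # number of regressors + noise variance
    return log_lik - 0.5 * k * np.log(len(y))

log_ev_m1 = log_evidence_bic(y, X_full[:, :2])    # model with the two true regressors
log_ev_m2 = log_evidence_bic(y, X_full)           # over-parameterised model

bayes_factor = np.exp(log_ev_m1 - log_ev_m2)      # BF = p(y|m1) / p(y|m2)
print(log_ev_m1, log_ev_m2, bayes_factor)         # BF > 1 favours the simpler, true model
```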
Example: BMS of dynamic causal models

The models contain the regions V1, V5 and PPC, with a driving stimulus input ("stim") and modulatory effects of attention.

M1 vs. M2: modulation of the backward or the forward connection?
  → M2 better than M1, BF = 2966

M3: additional driving effect of attention on PPC?
  → M3 better than M2, BF = 12

M4: bilinear or nonlinear modulation of the forward connection?
  → M4 better than M3, BF = 23

[Figure: graphs of models M1-M4.]

Stephan et al. (2008) NeuroImage
Thank you