Introduction - Translational Neuromodeling Unit


Bayesian models for fMRI data
Klaas Enno Stephan
Translational Neuromodeling Unit (TNU)
Institute for Biomedical Engineering, University of Zurich & ETH Zurich
Laboratory for Social & Neural Systems Research (SNS), University of
Zurich
Wellcome Trust Centre for Neuroimaging, University College London
With many thanks for slides & images to:
FIL Methods group,
particularly Guillaume Flandin and Jean Daunizeau
SPM Course Zurich
13-15 February 2013
The Reverend Thomas Bayes (1702-1761)

Bayes' Theorem

  p(θ|y) = p(y|θ) p(θ) / p(y)

  Posterior = Likelihood × Prior / Evidence

"Bayes' Theorem describes how an ideally rational person processes information."
(Wikipedia)
Bayes' Theorem

Given data y and parameters θ, the joint probability is:

  p(y, θ) = p(θ|y) p(y) = p(y|θ) p(θ)

Eliminating p(y, θ) gives Bayes' rule:

  p(θ|y) = p(y|θ) p(θ) / p(y)

  (Posterior = Likelihood × Prior / Evidence)
Bayesian inference: an animation
Principles of Bayesian inference

• Formulation of a generative model:
  likelihood:  y = f(θ) + ε,  p(y|θ) = N(f(θ), Σ_ε)
  prior:       p(θ) = N(μ_p, Σ_p)
• Observation of data y
• Update of beliefs based upon observations, given a prior state of knowledge:
  p(θ|y) ∝ p(y|θ) p(θ)
Posterior mean & variance of univariate Gaussians

Likelihood & prior:
  p(y|θ) = N(θ, σ_e²),  i.e. y = θ + ε
  p(θ) = N(μ_p, σ_p²)

Posterior:  p(θ|y) = N(μ, σ²), with
  1/σ² = 1/σ_e² + 1/σ_p²
  μ = σ² (μ_p/σ_p² + y/σ_e²)

Posterior mean = variance-weighted combination of prior mean and data mean.
Same thing – but expressed as precision weighting

Likelihood & prior:
  p(y|θ) = N(θ, 1/λ_e),  y = θ + ε
  p(θ) = N(μ_p, 1/λ_p)

Posterior:  p(θ|y) = N(μ, 1/λ), with
  λ = λ_e + λ_p
  μ = (λ_e/λ) y + (λ_p/λ) μ_p

Relative precision weighting.
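A minimal numerical sketch of the two formulations above (plain NumPy; the observation, prior mean and the two variances are made-up values), showing that the variance-weighted and precision-weighted forms give the same posterior:

```python
import numpy as np

# Made-up example values: one observation, a Gaussian prior and Gaussian noise.
y = 2.0          # observed data
mu_p = 0.0       # prior mean
var_e = 1.0      # observation (likelihood) variance  sigma_e^2
var_p = 4.0      # prior variance                     sigma_p^2

# Variance-weighted form: 1/sigma^2 = 1/sigma_e^2 + 1/sigma_p^2,
# mu = sigma^2 * (mu_p/sigma_p^2 + y/sigma_e^2)
post_var = 1.0 / (1.0 / var_e + 1.0 / var_p)
post_mean = post_var * (mu_p / var_p + y / var_e)

# Precision-weighted form: lambda = lambda_e + lambda_p,
# mu = (lambda_e/lambda) * y + (lambda_p/lambda) * mu_p
lam_e, lam_p = 1.0 / var_e, 1.0 / var_p
lam = lam_e + lam_p
post_mean_prec = (lam_e / lam) * y + (lam_p / lam) * mu_p

print(post_mean, post_var)          # 1.6, 0.8
print(post_mean_prec, 1.0 / lam)    # identical: 1.6, 0.8
```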
Same thing – but explicit hierarchical perspective

Likelihood & prior:
  p(y|θ^(1)) = N(θ^(1), 1/λ^(1))
  p(θ^(1))   = N(θ^(2), 1/λ^(2))

Posterior:  p(θ^(1)|y) = N(μ, 1/λ), with
  λ = λ^(1) + λ^(2)
  μ = (λ^(1)/λ) y + (λ^(2)/λ) θ^(2)

Relative precision weighting.
Why should I know about Bayesian stats?

Because Bayesian principles are fundamental for
• statistical inference in general
• sophisticated analyses of (neuronal) systems
• contemporary theories of brain function
Problems of classical (frequentist) statistics

p-value: probability of observing data at least as extreme as those measured, in the absence of the effect:
  H₀: θ = 0,   p(y | H₀)

Limitations:
• One can never accept the null hypothesis.
• Given enough data, one can always demonstrate a significant effect.
• Correction for multiple comparisons is necessary.

Solution: infer the posterior probability of the effect, p(θ|y).
Generative models: forward and inverse problems

Forward problem:  likelihood and prior,  p(y | θ, m), p(θ | m)
Inverse problem:  posterior distribution,  p(θ | y, m)
Dynamic causal modeling (DCM)

Applies to fMRI, EEG and MEG data.

Forward model: predicting measured activity given a putative neuronal state:
  dx/dt = f(x, u, θ)
  y = g(x, θ) + ε

Model inversion: estimating neuronal mechanisms from brain activity measures.

Friston et al. (2003) NeuroImage
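As an illustration of the forward model only, here is a toy simulation sketch: a hypothetical linear two-region state equation integrated with Euler steps and a trivial observation function. The coupling matrix A, input weights C, boxcar input and noise level are all made up and do not correspond to DCM's actual bilinear neuronal model or hemodynamic forward model:

```python
import numpy as np

# Toy forward model: dx/dt = f(x, u, theta) = A x + C u  (hypothetical linear dynamics),
# observation y = g(x, theta) + eps (here g is simply the identity).
A = np.array([[-1.0, 0.2],
              [0.4, -1.0]])     # made-up coupling between two regions
C = np.array([1.0, 0.0])        # made-up driving input entering region 1
dt, T = 0.01, 10.0
steps = int(T / dt)

x = np.zeros(2)
rng = np.random.default_rng(0)
ys = []
for k in range(steps):
    u = 1.0 if (k * dt) % 2.0 < 1.0 else 0.0    # boxcar stimulus
    x = x + dt * (A @ x + C * u)                # Euler step of dx/dt = f(x, u, theta)
    ys.append(x + 0.05 * rng.standard_normal(2))  # y = g(x) + eps

y = np.array(ys)   # simulated "measured activity" for the two regions
```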
The Bayesian brain hypothesis & free-energy principle

Prediction error = sensations − predictions.
Action changes the sensory input; perception changes the predictions.

Maximizing the evidence (of the brain's generative model)
= minimizing the surprise about the data (sensory inputs).

Friston et al. 2006, J Physiol Paris
Individual hierarchical Bayesian learning

[Figure: a hierarchical model unfolding over time steps k−1 → k: volatility x₃, associations x₂, events in the world x₁, sensory stimuli u.]

Belief updates are precision-weighted prediction errors:
  Δμ_i ∝ (π̂_{i−1} / π_i) · PE_{i−1}

Mathys et al. 2011, Front. Hum. Neurosci.
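A one-step schematic of such a precision-weighted update (illustrative numbers only; this is not the full update scheme of Mathys et al. 2011):

```python
# Schematic single update at level i: the belief moves in the direction of the
# prediction error from the level below, scaled by a ratio of precisions.
pi_hat_below = 2.0   # (made-up) precision of the prediction about the level below
pi_i = 4.0           # (made-up) precision of the belief at level i
mu_i = 0.3           # current belief at level i
pe_below = 0.5       # prediction error from the level below

mu_i_new = mu_i + (pi_hat_below / pi_i) * pe_below   # precision-weighted PE update
```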
Aberrant Bayesian message passing in schizophrenia

[Figure: hierarchical predictive-coding architecture with levels 1-4 above the input u. Backward & lateral connections mediate priors (backward generation effects, lateral interactions mediating priors); forward & lateral connections convey prediction errors (forward recognition effects, de-correlating lateral interactions); precision-weighted prediction errors are passed up the hierarchy.]

Hypothesis: abnormal (precision-weighted) prediction errors result from abnormal modulation of NMDAR-dependent synaptic plasticity at forward connections of cortical hierarchies.

Stephan et al. 2006, Biol. Psychiatry
Why should I know about Bayesian stats?

Because SPM is getting more and more Bayesian:
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
  – 1st level: specific spatial priors
  – 2nd level: global spatial priors
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction
[Overview figure: the standard SPM pipeline – image time-series → realignment → smoothing (kernel) → general linear model (design matrix) → statistical parametric map (SPM) → statistical inference (Gaussian field theory, p < 0.05), with spatial normalisation to a template and parameter estimates – annotated with its Bayesian components: Bayesian segmentation and normalisation, spatial priors on activation extent, posterior probability maps (PPMs), and dynamic causal modelling.]
Spatial normalisation: Bayesian regularisation

Deformations are modelled as a linear combination of smooth basis functions (3D DCT).

Find maximum a posteriori (MAP) estimates of the deformation parameters θ:
  log p(θ|y) = log p(y|θ) + log p(θ) − log p(y)

Here log p(y|θ) measures the "difference" between template and source image, and log p(θ) penalises the squared distance between the parameters and their expected values (regularisation).
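A minimal sketch of this MAP idea in NumPy/SciPy: a quadratic template-source mismatch term plus a quadratic penalty pulling parameters towards their expected values. The 1-D "images", the random stand-in basis and the regularisation weight are all made up; SPM's actual implementation uses 3D DCT bases and a different objective:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
template = np.sin(np.linspace(0, np.pi, 50))          # toy 1-D "template image"
source = template + 0.1 * rng.standard_normal(50)     # toy "source image"

theta0 = np.zeros(5)                                  # deformation parameters (expected value 0)
basis = 0.1 * rng.standard_normal((50, 5))            # stand-in for smooth basis functions
lam = 10.0                                            # regularisation weight (prior precision)

def neg_log_posterior(theta):
    warped = source + basis @ theta                   # crude "deformation" of the source
    data_term = np.sum((template - warped) ** 2)      # -log p(y|theta): template/source mismatch
    prior_term = lam * np.sum(theta ** 2)             # -log p(theta): distance from expected values
    return data_term + prior_term

theta_map = minimize(neg_log_posterior, theta0).x     # MAP estimate of the deformation parameters
```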
Spatial normalisation: overfitting

[Figure: a template image compared with three registrations of a source image – affine registration (χ² = 472.1), non-linear registration using regularisation (χ² = 302.7), and non-linear registration without regularisation (χ² = 287.3). Without regularisation the fit improves slightly, but the deformations overfit the data.]
Bayesian segmentation with empirical priors

• Goal: for each voxel, compute the probability that it belongs to a particular tissue type, given its intensity:
  p(tissue | intensity) ∝ p(intensity | tissue) · p(tissue)
• Likelihood: intensities are modelled by a mixture of Gaussian distributions representing different tissue classes (e.g. GM, WM, CSF).
• Priors: obtained from tissue probability maps (segmented images of 151 subjects).

Ashburner & Friston 2005, NeuroImage
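A minimal sketch of the voxel-wise Bayes rule above, with made-up Gaussian intensity models for GM/WM/CSF and made-up prior probabilities taken from a tissue probability map (SPM's actual implementation additionally estimates the mixture parameters and a bias field iteratively):

```python
import numpy as np
from scipy.stats import norm

intensity = 0.62                                   # observed voxel intensity (made up)
tissues = ["GM", "WM", "CSF"]
means = np.array([0.6, 0.8, 0.2])                  # made-up class means
sds = np.array([0.08, 0.06, 0.05])                 # made-up class standard deviations
prior = np.array([0.5, 0.3, 0.2])                  # p(tissue) at this voxel, from a probability map

likelihood = norm.pdf(intensity, loc=means, scale=sds)   # p(intensity | tissue)
posterior = likelihood * prior
posterior /= posterior.sum()                             # normalise: p(tissue | intensity)

for t, p in zip(tissues, posterior):
    print(f"p({t} | intensity) = {p:.3f}")
```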
Bayesian fMRI analyses

General Linear Model:
  y = Xθ + ε,   with ε ~ N(0, C_ε)

What are the priors?
• In "classical" SPM, no priors (= "flat" priors)
• Full Bayes: priors are predefined
• Empirical Bayes: priors are estimated from the data, assuming a hierarchical generative model – the parameters of one level act as priors on the distribution of the parameters at the level below; parameters and hyperparameters at each level can be estimated using EM.
Posterior Probability Maps (PPMs)

Posterior distribution: probability of the effect given the data, p(θ|y)
  mean: size of the effect
  precision: variability of the effect

Posterior probability map: images of the probability that an activation exceeds some specified threshold γ, given the data y:
  p(θ > γ | y) = ∫_γ^∞ p(θ|y) dθ ≥ α

Two thresholds:
• activation threshold γ: percentage of whole-brain mean signal
• probability α that voxels must exceed to be displayed (e.g. 95%)
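With a Gaussian posterior, the exceedance probability above has a closed form; a minimal sketch for a single voxel (posterior mean, standard deviation and both thresholds are made-up values):

```python
from scipy.stats import norm

post_mean = 0.9     # posterior mean of the effect at this voxel (made up)
post_sd = 0.4       # posterior standard deviation (made up)
gamma = 0.5         # activation threshold (e.g. % of whole-brain mean signal)
alpha = 0.95        # probability threshold for display

# p(theta > gamma | y) for a Gaussian posterior
p_exceed = norm.sf(gamma, loc=post_mean, scale=post_sd)
display_voxel = p_exceed >= alpha
print(p_exceed, display_voxel)   # ~0.841, False for these made-up values
```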
2nd level PPMs with global priors

1st level (GLM):              y = Xθ + ε^(1)
2nd level (shrinkage prior):  θ = 0 + ε^(2)

with p(ε^(1)) = N(0, C_ε) and p(θ) = N(0, C_θ).

θ reflects regionally specific effects, so assume that it is zero on average over voxels; the variance of this prior is implicitly estimated by estimating ε^(2).

Heuristically: use the variance of mean-corrected activity over voxels as the prior variance of θ at any particular voxel.

In the absence of evidence to the contrary, parameters will shrink to zero.
2nd level PPMs with global priors

1st level (GLM):              y = Xθ + ε,      p(ε) = N(0, C_ε)   (voxel-specific)
2nd level (shrinkage prior):  θ = 0 + ε^(2),   p(θ) = N(0, C_θ)   (global: pooled estimate over voxels)

Compute C_ε and C_θ via ReML/EM, and apply the usual rule for computing the posterior mean & covariance for Gaussians:
  C_{θ|y} = (X^T C_ε^{-1} X + C_θ^{-1})^{-1}
  m_{θ|y} = C_{θ|y} X^T C_ε^{-1} y

Friston & Penny 2003, NeuroImage
PPMs vs. SPMs

PPMs:  p(θ|y) ∝ p(y|θ) p(θ)   (posterior ∝ likelihood × prior)
  Bayesian test:  p(θ > γ | y) ≥ α

SPMs:  t = f(y)
  Classical t-test:  p(t > u | θ = 0) < α
PPMs and multiple comparisons

Friston & Penny (2003): no need to correct for multiple comparisons.

Thresholding a PPM at 95% confidence means that in every displayed voxel the posterior probability of an activation ≥ γ is ≥ 95%. At most 5% of the voxels identified could have activations less than γ. Independent of the search volume, thresholding a PPM thus puts an upper bound on the false discovery rate.

NB: this argument is still being debated.
PPMs vs. SPMs

[Figure: SPM results windows for the same dataset, design matrix and contrast ("rest").
Left – PPM at effect size 2.06: height threshold P = 0.95, extent threshold k = 0 voxels. PPMs show activations greater than a given size.
Right – SPM{T} with 39 degrees of freedom: height threshold T = 5.50, extent threshold k = 0 voxels. SPMs show voxels with non-zero activations.]
PPMs: pros and cons

Advantages
• One can infer that a cause did not elicit a response.
• Inference is independent of the search volume.
• PPMs do not conflate effect size and effect variability.

Disadvantages
• Estimating priors over voxels is computationally demanding.
• Practical benefits are yet to be established.
• Thresholds other than zero require justification.
Model comparison and selection

Given competing hypotheses on the structure & functional mechanisms of a system, which model is the best?

Which model represents the best balance between model fit and model complexity?

For which model m does p(y|m) become maximal?

Pitt & Myung (2002) TICS
Bayesian model selection (BMS)

Model evidence:
  p(y|m) = ∫ p(y|θ, m) p(θ|m) dθ

The model evidence accounts for both the accuracy and the complexity of the model, and is a measure of generalizability: an over-flexible model spreads p(y|m) thinly over all possible datasets y (Ghahramani 2004; MacKay 1992, Neural Comput.).

With an approximate posterior q(θ), the log evidence decomposes as
  log p(y|m) = ⟨log p(y|θ, m)⟩_q − KL[q(θ), p(θ|m)] + KL[q(θ), p(θ|y, m)]

Various approximations are used in practice, e.g. the negative free energy, AIC, BIC.

Penny et al. 2004a, NeuroImage
Approximations to the model evidence

Maximizing the log model evidence = maximizing the model evidence (the logarithm is a monotonic function).

Log model evidence = balance between fit and complexity:
  log p(y|m) = accuracy(m) − complexity(m)
             = log p(y|θ, m) − complexity(m)

In SPM2 & SPM5, the interface offers two approximations (p = number of parameters, N = number of data points):
  Akaike Information Criterion:    AIC = log p(y|θ, m) − p
  Bayesian Information Criterion:  BIC = log p(y|θ, m) − (p/2) log N

Penny et al. 2004a, NeuroImage
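A direct transcription of the two criteria as written above (note this is the log-evidence convention used on the slide, where larger values are better, not the common −2 log L convention); the log-likelihood value is made up:

```python
import numpy as np

log_lik = -120.0   # log p(y | theta, m) at the estimated parameters (made up)
p = 6              # number of parameters
N = 200            # number of data points

AIC = log_lik - p                      # AIC = log p(y|theta,m) - p
BIC = log_lik - (p / 2) * np.log(N)    # BIC = log p(y|theta,m) - (p/2) log N
print(AIC, BIC)                        # -126.0, approx -135.9
```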
The (negative) free energy approximation

• Under Gaussian assumptions about the posterior (Laplace approximation):
  log p(y|m) = ⟨log p(y|θ, m)⟩_q − KL[q(θ), p(θ|m)] + KL[q(θ), p(θ|y, m)]

• The negative free energy F is a lower bound on the log evidence:
  F = log p(y|m) − KL[q(θ), p(θ|y, m)]
    = ⟨log p(y|θ, m)⟩_q − KL[q(θ), p(θ|m)]
    = accuracy − complexity
The complexity term in F

• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies:
  KL[q(θ), p(θ|m)] = ½ ln|C_θ| − ½ ln|C_{θ|y}| + ½ (μ_{θ|y} − μ_θ)^T C_θ^{-1} (μ_{θ|y} − μ_θ)

• The complexity term of F is higher
  – the more independent the prior parameters (↑ effective degrees of freedom)
  – the more dependent the posterior parameters
  – the more the posterior mean deviates from the prior mean

• NB: since SPM8, only F is used for model selection!
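A sketch of the complexity term as written above, for made-up Gaussian prior and posterior moments of a 3-parameter model (the full KL between Gaussians would additionally contain a trace term and a constant, which the slide's expression omits):

```python
import numpy as np

# Made-up prior and posterior moments.
mu_prior = np.zeros(3)
C_prior = np.diag([4.0, 4.0, 4.0])
mu_post = np.array([1.0, -0.5, 0.2])
C_post = np.array([[0.5, 0.2, 0.0],
                   [0.2, 0.4, 0.1],
                   [0.0, 0.1, 0.3]])

d = mu_post - mu_prior
# Complexity term as on the slide:
# 1/2 ln|C_theta| - 1/2 ln|C_{theta|y}| + 1/2 d^T C_theta^{-1} d
complexity = (0.5 * np.linalg.slogdet(C_prior)[1]
              - 0.5 * np.linalg.slogdet(C_post)[1]
              + 0.5 * d @ np.linalg.inv(C_prior) @ d)
# (The full Gaussian KL would add 0.5 * (trace(inv(C_prior) @ C_post) - len(d)).)
print(complexity)
```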
Bayes factors

To compare two models, we could just compare their log evidences.
But: the log evidence is just some number – not very intuitive!

A more intuitive interpretation of model comparisons is made possible by Bayes factors (positive values in [0, ∞[):
  B₁₂ = p(y|m₁) / p(y|m₂)

Kass & Raftery classification (Kass & Raftery 1995, J. Am. Stat. Assoc.):

  B₁₂         p(m₁|y)    Evidence
  1 to 3      50-75%     weak
  3 to 20     75-95%     positive
  20 to 150   95-99%     strong
  ≥ 150       ≥ 99%      very strong
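A minimal sketch converting a log-evidence difference into a Bayes factor and a Kass & Raftery label; the ΔF value is taken from the M2-vs-M1 comparison on the next slide, and the helper function name is made up:

```python
import numpy as np

def kass_raftery(bf):
    """Made-up helper returning the Kass & Raftery (1995) evidence category."""
    if bf >= 150:
        return "very strong"
    if bf >= 20:
        return "strong"
    if bf >= 3:
        return "positive"
    if bf > 1:
        return "weak"
    return "no evidence for model 1 over model 2"

delta_F = 7.995                      # difference in (approximate) log evidences, F1 - F2
bf_12 = np.exp(delta_F)              # B12 = p(y|m1)/p(y|m2) = exp(log-evidence difference)
post_m1 = bf_12 / (1.0 + bf_12)      # p(m1|y) under equal model priors
print(bf_12, post_m1, kass_raftery(bf_12))   # ~2966, ~0.9997, "very strong"
```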
BMS in SPM8: an example

[Figure: four DCMs (M1-M4) of the same data, each containing regions V1, V5 and PPC, with stimulus input to V1 and modulatory effects of attention.]

M2 better than M1:  BF ≈ 2966,  ΔF = 7.995
M3 better than M2:  BF ≈ 12,    ΔF = 2.450
M4 better than M3:  BF ≈ 23,    ΔF = 3.144
Thank you