Introduction - Translational Neuromodeling Unit
Bayesian models for fMRI data
Klaas Enno Stephan
Translational Neuromodeling Unit (TNU)
Institute for Biomedical Engineering, University of Zurich & ETH Zurich
Laboratory for Social & Neural Systems Research (SNS), University of
Zurich
Wellcome Trust Centre for Neuroimaging, University College London
With many thanks for slides & images to:
FIL Methods group,
particularly Guillaume Flandin and Jean Daunizeau
SPM Course Zurich
13-15 February 2013
The Reverend Thomas Bayes
(1702-1761)
Bayes’ Theorem

$$ \underbrace{p(\theta \mid y)}_{\text{Posterior}} \;=\; \frac{\overbrace{p(y \mid \theta)}^{\text{Likelihood}} \; \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(y)}_{\text{Evidence}}} $$

“Bayes’ Theorem describes how an ideally rational person processes information.”
(Wikipedia)
Bayes’ Theorem

Given data y and parameters θ, the joint probability is:

$$ p(y, \theta) = p(\theta \mid y)\, p(y) = p(y \mid \theta)\, p(\theta) $$

Eliminating p(y, θ) gives Bayes’ rule:

$$ \underbrace{p(\theta \mid y)}_{\text{Posterior}} \;=\; \frac{\overbrace{p(y \mid \theta)}^{\text{Likelihood}} \; \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(y)}_{\text{Evidence}}} $$
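As a minimal numerical sketch of Bayes’ rule (not part of the original slides; the two hypotheses and all probabilities below are made-up illustrative values):

```python
# Two discrete hypotheses ("effect present" vs "effect absent"), hypothetical numbers.
prior = {"effect": 0.3, "no_effect": 0.7}        # p(theta)
likelihood = {"effect": 0.8, "no_effect": 0.2}   # p(y | theta) for the observed y

# Evidence p(y): marginalize the likelihood over the hypotheses
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Posterior p(theta | y) = p(y | theta) * p(theta) / p(y)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # {'effect': ~0.63, 'no_effect': ~0.37}
```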
Bayesian inference: an animation
Principles of Bayesian inference

Formulation of a generative model:
• likelihood p(y | θ)
• prior distribution p(θ)

$$ y = f(\theta) + \varepsilon, \qquad p(y \mid \theta) = N(f(\theta), \Sigma_\varepsilon), \qquad p(\theta) = N(\mu_p, \Sigma_p) $$

Observation of data y

Update of beliefs based upon observations, given a prior state of knowledge:

$$ p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) $$
Posterior mean & variance of univariate Gaussians

Likelihood & Prior:
$$ p(y \mid \theta) = N(\theta, \sigma_e^2), \qquad p(\theta) = N(\mu_p, \sigma_p^2) $$

Posterior:
$$ p(\theta \mid y) = N(\mu, \sigma^2) $$
$$ \frac{1}{\sigma^2} = \frac{1}{\sigma_e^2} + \frac{1}{\sigma_p^2} $$
$$ \mu = \sigma^2 \left( \frac{\mu_p}{\sigma_p^2} + \frac{y}{\sigma_e^2} \right) $$

Posterior mean = variance-weighted combination of prior mean and data mean
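A minimal sketch of this update in Python (variable names are mine, not from the slides):

```python
import numpy as np

def gaussian_posterior(y, sigma_e, mu_p, sigma_p):
    """Posterior over a univariate Gaussian mean given one observation y,
    likelihood s.d. sigma_e and prior N(mu_p, sigma_p^2)."""
    post_var = 1.0 / (1.0 / sigma_e**2 + 1.0 / sigma_p**2)       # 1/sigma^2 = 1/sigma_e^2 + 1/sigma_p^2
    post_mean = post_var * (mu_p / sigma_p**2 + y / sigma_e**2)   # variance-weighted combination
    return post_mean, post_var

# Example: noisy observation y = 2 with a prior centred at 0
print(gaussian_posterior(y=2.0, sigma_e=1.0, mu_p=0.0, sigma_p=2.0))  # mean 1.6, variance 0.8
```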
Same thing – but expressed as precision weighting

Likelihood & Prior:
$$ p(y \mid \theta) = N(\theta, \lambda_e^{-1}), \qquad p(\theta) = N(\mu_p, \lambda_p^{-1}) $$

Posterior:
$$ p(\theta \mid y) = N(\mu, \lambda^{-1}) $$
$$ \lambda = \lambda_e + \lambda_p $$
$$ \mu = \frac{\lambda_e}{\lambda}\, y + \frac{\lambda_p}{\lambda}\, \mu_p $$

Relative precision weighting
Same thing – but explicit hierarchical perspective

Likelihood & Prior:
$$ p(y \mid \theta^{(1)}) = N(\theta^{(1)}, 1/\lambda^{(1)}), \qquad p(\theta^{(1)}) = N(\theta^{(2)}, 1/\lambda^{(2)}) $$

Posterior:
$$ p(\theta^{(1)} \mid y) = N(\mu, 1/\lambda) $$
$$ \lambda = \lambda^{(1)} + \lambda^{(2)} $$
$$ \mu = \frac{\lambda^{(1)}}{\lambda}\, y + \frac{\lambda^{(2)}}{\lambda}\, \theta^{(2)} $$

Relative precision weighting
Why should I know about Bayesian stats?
Because Bayesian principles are fundamental for
• statistical inference in general
• sophisticated analyses of (neuronal) systems
• contemporary theories of brain function
Problems of classical (frequentist) statistics

p-value: probability of observing the data (or more extreme data) in the absence of the effect:
$$ H_0: \theta = 0, \qquad p(y \mid H_0) $$

Limitations:
• One can never accept the null hypothesis
• Given enough data, one can always demonstrate a significant effect
• Correction for multiple comparisons is necessary

Solution: infer the posterior probability of the effect:
$$ p(\theta \mid y) $$
Generative models: forward and inverse problems

Forward problem: likelihood $ p(y \mid \theta, m) $ and prior $ p(\theta \mid m) $

Inverse problem: posterior distribution $ p(\theta \mid y, m) $
Dynamic causal modeling (DCM)

Applicable to EEG, MEG and fMRI data.

Forward model: predicting measured activity given a putative neuronal state
$$ \frac{dx}{dt} = f(x, u, \theta), \qquad y = g(x, \lambda) $$

Model inversion: estimating neuronal mechanisms from brain activity measures.

Friston et al. (2003) NeuroImage
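A minimal numerical sketch (my own illustration, not SPM's DCM code) of how a bilinear neuronal state equation of the form dx/dt = (A + Σ_j u_j B^(j)) x + C u can be integrated; all connectivity values below are made up:

```python
import numpy as np

# Illustrative bilinear neuronal model with two regions and one input.
A = np.array([[-1.0, 0.0], [0.4, -1.0]])   # fixed (endogenous) connectivity
B = np.array([[[0.0, 0.0], [0.6, 0.0]]])   # modulatory effect of input 1 on the 1->2 connection
C = np.array([[1.0], [0.0]])               # driving input enters region 1

def simulate(u, dt=0.01):
    """Euler integration of the neuronal states given an input time series u (T x 1)."""
    x = np.zeros(A.shape[0])
    states = []
    for u_t in u:
        J = A + np.tensordot(u_t, B, axes=([0], [0]))  # effective connectivity at time t
        x = x + dt * (J @ x + C @ u_t)
        states.append(x.copy())
    return np.array(states)

u = np.zeros((500, 1)); u[100:200] = 1.0   # boxcar driving input
x_t = simulate(u)                          # neuronal states; an observation model g(x) would map these to BOLD
```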
The Bayesian brain hypothesis & free-energy principle

Prediction error = sensations – predictions

It can be minimized in two ways:
• Action: change the sensory input
• Perception: change the predictions

Maximizing the evidence (of the brain's generative model) = minimizing the surprise about the data (sensory inputs).

Friston et al. 2006, J Physiol Paris
Individual hierarchical Bayesian learning

Hierarchical generative model of how sensory inputs arise, with hidden states evolving from trial k to trial k+1:
• x3: volatility
• x2: associations
• x1: events in the world
• u: sensory stimuli

Belief updates at each level are driven by precision-weighted prediction errors from the level below:
$$ \Delta\mu_i \propto \frac{\hat{\pi}_{i-1}}{\pi_i}\, PE_{i-1} $$

Mathys et al. 2011, Front. Hum. Neurosci.
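A generic precision-weighted belief update in the spirit of the slide (a sketch only, not the actual equations of Mathys et al. 2011; all values are hypothetical):

```python
def update_belief(mu_i, pi_i, pi_hat_below, pe_below):
    """Update the posterior mean at level i using the prediction error from the
    level below, weighted by the precision ratio pi_hat_below / pi_i."""
    learning_rate = pi_hat_below / pi_i
    return mu_i + learning_rate * pe_below

# Example: a precise lower level relative to the current belief yields a large update.
print(update_belief(mu_i=0.5, pi_i=2.0, pi_hat_below=4.0, pe_below=0.3))  # 1.1
```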
Aberrant Bayesian message passing in schizophrenia

Hierarchical predictive coding across cortical levels (input u → levels 1–4):
• Backward & lateral connections convey predictions generated by $ g_i(\mu_{i+1}, \theta_i) $ ("backward generation effects"; lateral interactions mediate priors).
• Forward & lateral connections convey prediction errors $ \varepsilon_i = \mu_i - g_i(\mu_{i+1}, \theta_i) $ ("forward recognition effects"; de-correlating lateral interactions).

Proposed pathophysiology: abnormal (precision-weighted) prediction errors, and abnormal modulation of NMDAR-dependent synaptic plasticity at forward connections of cortical hierarchies.

Stephan et al. 2006, Biol. Psychiatry
Why should I know about Bayesian stats?
Because SPM is getting more and more Bayesian:
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
– 1st level: specific spatial priors
– 2nd level: global spatial priors
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction
[Figure: standard SPM pipeline – image time-series → realignment → smoothing (kernel) → spatial normalisation (template) → general linear model (design matrix) → statistical parametric map (SPM) → statistical inference (Gaussian field theory, p < 0.05) → parameter estimates. Bayesian elements highlighted: Bayesian segmentation and normalisation, spatial priors on activation extent, posterior probability maps (PPMs), dynamic causal modelling.]
Spatial normalisation: Bayesian regularisation

Deformations consist of a linear combination of smooth basis functions (3D DCT).

Find maximum a posteriori (MAP) estimates of the deformation parameters θ:

$$ \log p(\theta \mid y) = \underbrace{\log p(y \mid \theta)}_{\text{“difference” between template and source image}} + \underbrace{\log p(\theta)}_{\text{squared distance between parameters and their expected values (regularisation)}} - \log p(y) $$
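A toy sketch of a MAP estimate under a Gaussian likelihood and a zero-mean Gaussian prior, i.e. generic ridge-regularised least squares (not SPM's spatial normalisation code; all data are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                       # "basis functions"
theta_true = np.array([1.0, 0.0, -0.5, 0.0, 0.2])
y = X @ theta_true + 0.1 * rng.normal(size=50)     # simulated data

sigma2_e, sigma2_p = 0.1**2, 1.0**2                # noise and prior variances
lam = sigma2_e / sigma2_p                          # regularisation weight
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(theta_map)                                   # close to theta_true, shrunk towards 0
```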
Spatial normalisation: overfitting

Registration of a source image to a template image:
• Affine registration (χ² = 472.1)
• Non-linear registration using regularisation (χ² = 302.7)
• Non-linear registration without regularisation (χ² = 287.3; the lowest χ², but it overfits)
Bayesian segmentation with empirical priors

• Goal: for each voxel, compute the probability that it belongs to a particular tissue type, given its intensity:
  p(tissue | intensity) ∝ p(intensity | tissue) ∙ p(tissue)
• Likelihood: intensities are modelled by a mixture of Gaussian distributions representing different tissue classes (e.g. GM, WM, CSF).
• Priors: obtained from tissue probability maps (segmented images of 151 subjects).

Ashburner & Friston 2005, NeuroImage
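A toy per-voxel version of this computation (illustrative values only, not taken from SPM's segmentation): the posterior over tissue classes is the per-class Gaussian likelihood times the prior from a tissue probability map, normalised by the evidence.

```python
from scipy.stats import norm

tissues = ["GM", "WM", "CSF"]
means = {"GM": 0.55, "WM": 0.80, "CSF": 0.20}   # hypothetical intensity means
stds  = {"GM": 0.08, "WM": 0.06, "CSF": 0.10}   # hypothetical intensity s.d.
prior = {"GM": 0.45, "WM": 0.35, "CSF": 0.20}   # tissue probability map values at this voxel

intensity = 0.6                                  # observed voxel intensity
unnorm = {t: norm.pdf(intensity, means[t], stds[t]) * prior[t] for t in tissues}
evidence = sum(unnorm.values())
posterior = {t: unnorm[t] / evidence for t in tissues}
print(posterior)   # highest posterior probability for GM in this example
```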
Bayesian fMRI analyses

General Linear Model:
$$ y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, C_\varepsilon) $$

What are the priors?
• In “classical” SPM: no priors (= “flat” priors)
• Full Bayes: priors are predefined
• Empirical Bayes: priors are estimated from the data, assuming a hierarchical generative model
  – parameters of one level = priors for the distribution of parameters at the lower level
  – parameters and hyperparameters at each level can be estimated using EM
Posterior Probability Maps (PPMs)

Posterior distribution: probability of the effect given the data
$$ p(\beta \mid y) $$
mean: size of the effect; precision: variability

Posterior probability map: images of the probability that an activation exceeds some specified threshold γ, given the data y:
$$ p(\beta > \gamma \mid y) $$

Two thresholds:
• activation threshold γ: percentage of whole-brain mean signal
• probability threshold that voxels must exceed to be displayed (e.g. 95%)
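For a Gaussian posterior at a single voxel, this probability is just one minus a Gaussian CDF; a minimal sketch (variable names are mine, not SPM's):

```python
import numpy as np
from scipy.stats import norm

def ppm_probability(post_mean, post_var, gamma):
    """p(beta > gamma | y) for a Gaussian posterior N(post_mean, post_var)."""
    return 1.0 - norm.cdf(gamma, loc=post_mean, scale=np.sqrt(post_var))

# Example: posterior mean 1.2, posterior variance 0.25, effect-size threshold 0.5
print(ppm_probability(1.2, 0.25, 0.5))   # ~0.92; displayed only if it exceeds e.g. 0.95
```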
2nd level PPMs with global priors

1st level (GLM):
$$ y = X\beta^{(1)} + \varepsilon^{(1)} $$

2nd level (shrinkage prior):
$$ \beta^{(1)} = 0 + \varepsilon^{(2)}, \qquad p(\varepsilon^{(1)}) = N(0, C_\varepsilon^{(1)}), \qquad p(\varepsilon^{(2)}) = N(0, C_\varepsilon^{(2)}) $$

Heuristically: use the variance of mean-corrected activity over voxels as the prior variance of β at any particular voxel:
• β^(1) reflects regionally specific effects
• assume that it is zero on average over voxels
• the variance of this prior is implicitly estimated by estimating ε^(2)

In the absence of evidence to the contrary, parameters will shrink to zero.
2nd level PPMs with global priors

1st level (GLM):
$$ y = X\beta + \varepsilon, \qquad p(\varepsilon) = N(0, C_\varepsilon) \quad \text{(voxel-specific)} $$

2nd level (shrinkage prior):
$$ \beta = 0 + \varepsilon^{(2)}, \qquad p(\beta) = N(0, C_\beta) \quad \text{(global; pooled estimate over voxels)} $$

Compute Cε and Cβ via ReML/EM, and apply the usual rule for computing the posterior mean & covariance for Gaussians:
$$ C_{\beta \mid y} = \left( X^T C_\varepsilon^{-1} X + C_\beta^{-1} \right)^{-1} $$
$$ m_{\beta \mid y} = C_{\beta \mid y}\, X^T C_\varepsilon^{-1}\, y $$

Friston & Penny 2003, NeuroImage
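A sketch of this Gaussian posterior update (given covariances; the ReML/EM step for estimating them is not shown, and the data below are made up):

```python
import numpy as np

def posterior_glm(X, y, C_e, C_b):
    """Posterior mean & covariance of GLM parameters under Gaussian noise
    covariance C_e and a zero-mean Gaussian (shrinkage) prior covariance C_b."""
    Ce_inv = np.linalg.inv(C_e)
    C_post = np.linalg.inv(X.T @ Ce_inv @ X + np.linalg.inv(C_b))  # C_{beta|y}
    m_post = C_post @ X.T @ Ce_inv @ y                             # m_{beta|y}
    return m_post, C_post

# Tiny example: 2 regressors, 10 scans
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
y = X @ np.array([0.8, 0.0]) + 0.2 * rng.normal(size=10)
m_post, C_post = posterior_glm(X, y, C_e=0.04 * np.eye(10), C_b=np.eye(2))
print(m_post)   # shrunk estimate of the two regression parameters
```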
PPMs vs. SPMs

PPMs – posterior (likelihood × prior):
$$ p(\beta \mid y) \propto p(y \mid \beta)\, p(\beta) $$
Bayesian test: $ p(\beta > \gamma \mid y) $

SPMs – test statistic computed from the data: $ t = f(y) $
Classical t-test: $ p(t > u \mid \beta = 0) $
PPMs and multiple comparisons

Friston & Penny (2003): no need to correct for multiple comparisons:

Thresholding a PPM at 95% confidence: in every displayed voxel, the posterior probability of an activation exceeding γ is at least 95%.

At most, 5% of the voxels identified could have activations less than γ.

Independent of the search volume, thresholding a PPM thus puts an upper bound on the false discovery rate.

NB: this is being debated.
PPMs vs. SPMs: an example

[Figure: PET data, “rest” contrast. PPM thresholded at effect size γ = 2.06 (height threshold P = 0.95, extent threshold k = 0 voxels), shown alongside the corresponding SPM{T} (height threshold T = 5.50, extent threshold k = 0 voxels), each with its design matrix.]

PPMs: show activations greater than a given size.
SPMs: show voxels with non-zero activations.
PPMs: pros and cons

Advantages:
• One can infer that a cause did not elicit a response
• Inference is independent of the search volume
• PPMs do not conflate effect size and effect variability

Disadvantages:
• Estimating priors over voxels is computationally demanding
• Practical benefits are yet to be established
• Thresholds other than zero require justification
Model comparison and selection

Given competing hypotheses on the structure & functional mechanisms of a system, which model is the best?

Which model represents the best balance between model fit and model complexity?

For which model m does p(y|m) become maximal?

Pitt & Myung (2002) TICS
Bayesian model selection (BMS)

Model evidence:
$$ p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, d\theta $$

• accounts for both accuracy and complexity of the model
• a measure of generalizability

[Figure (after Ghahramani, 2004): model evidence p(y|m) plotted over the space of all possible datasets y.]

The log evidence can be decomposed into accuracy and complexity terms (see the negative free energy approximation below).

Various approximations, e.g.:
– negative free energy, AIC, BIC

MacKay 1992, Neural Comput.
Penny et al. 2004a, NeuroImage
Approximations to the model evidence

Maximizing the log model evidence = maximizing the model evidence (the logarithm is a monotonic function).

Log model evidence = balance between fit and complexity:
$$ \log p(y \mid m) = \text{accuracy}(m) - \text{complexity}(m) = \log p(y \mid \hat\theta, m) - \text{complexity}(m) $$

In SPM2 & SPM5, the interface offers two approximations (with p = number of parameters, N = number of data points):

Akaike Information Criterion:
$$ AIC = \log p(y \mid \hat\theta, m) - p $$

Bayesian Information Criterion:
$$ BIC = \log p(y \mid \hat\theta, m) - \frac{p}{2} \log N $$

Penny et al. 2004a, NeuroImage
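A minimal sketch of these two criteria in the “log evidence” form used on the slide (the log-likelihood values below are hypothetical):

```python
import numpy as np

def aic(log_lik, n_params):
    """AIC as accuracy minus the number of parameters."""
    return log_lik - n_params

def bic(log_lik, n_params, n_data):
    """BIC as accuracy minus (p/2) * log N."""
    return log_lik - 0.5 * n_params * np.log(n_data)

# Hypothetical example: two models fitted to N = 200 scans
print(aic(-310.0, 6), bic(-310.0, 6, 200))     # model 1
print(aic(-305.0, 12), bic(-305.0, 12, 200))   # model 2 fits better but is penalised more
```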
The (negative) free energy approximation

• Under Gaussian assumptions about the posterior (Laplace approximation):
$$ \log p(y \mid m) = \left\langle \log p(y \mid \theta, m) \right\rangle_q - KL\!\left[q(\theta), p(\theta \mid m)\right] + KL\!\left[q(\theta), p(\theta \mid y, m)\right] $$

$$ F = \log p(y \mid m) - KL\!\left[q(\theta), p(\theta \mid y, m)\right] = \underbrace{\left\langle \log p(y \mid \theta, m) \right\rangle_q}_{\text{accuracy}} - \underbrace{KL\!\left[q(\theta), p(\theta \mid m)\right]}_{\text{complexity}} $$
The complexity term in F

• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies:
$$ KL\!\left[q(\theta), p(\theta \mid m)\right] = \frac{1}{2} \ln \left| C_\theta \right| - \frac{1}{2} \ln \left| C_{\theta \mid y} \right| + \frac{1}{2} \left( \mu_{\theta \mid y} - \mu_\theta \right)^T C_\theta^{-1} \left( \mu_{\theta \mid y} - \mu_\theta \right) $$

• The complexity term of F is higher
  – the more independent the prior parameters (→ effective degrees of freedom)
  – the more dependent the posterior parameters
  – the more the posterior mean deviates from the prior mean

• NB: Since SPM8, only F is used for model selection!
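A small helper that evaluates the complexity expression exactly as written on the slide (an illustrative sketch, not SPM code; the prior and posterior moments below are made up):

```python
import numpy as np

def complexity(mu_prior, C_prior, mu_post, C_post):
    """0.5*ln|C_prior| - 0.5*ln|C_post| + 0.5*(mu_post-mu_prior)^T C_prior^{-1} (mu_post-mu_prior)."""
    d = mu_post - mu_prior
    return (0.5 * np.linalg.slogdet(C_prior)[1]
            - 0.5 * np.linalg.slogdet(C_post)[1]
            + 0.5 * d @ np.linalg.solve(C_prior, d))

# Example: a posterior that has moved away from the prior and become more precise
mu_p, C_p = np.zeros(2), np.eye(2)
mu_q, C_q = np.array([0.5, -0.3]), 0.1 * np.eye(2)
print(complexity(mu_p, C_p, mu_q, C_q))
```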
Bayes factors

To compare two models, we could just compare their log evidences.

But: the log evidence is just some number – not very intuitive!

A more intuitive interpretation of model comparisons is made possible by Bayes factors (positive values in [0; ∞[):
$$ B_{12} = \frac{p(y \mid m_1)}{p(y \mid m_2)} $$

Kass & Raftery classification (Kass & Raftery 1995, J. Am. Stat. Assoc.):

B12          p(m1|y)     Evidence
1 to 3       50–75%      weak
3 to 20      75–95%      positive
20 to 150    95–99%      strong
≥ 150        ≥ 99%       very strong
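Since the free energy F approximates the log evidence, a Bayes factor can be obtained from the difference of two (approximated) log evidences; a minimal sketch with hypothetical values:

```python
import numpy as np

def bayes_factor(logev1, logev2):
    """B12 = exp(log p(y|m1) - log p(y|m2)), e.g. from two free energies F1 and F2."""
    return np.exp(logev1 - logev2)

print(bayes_factor(-300.0, -303.0))   # ~20.1 -> "strong" evidence for model 1 (Kass & Raftery)
```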
BMS in SPM8: an example

Four DCMs (M1–M4) of the same data, each comprising regions V1, V5 and PPC, with a stimulus input (“stim”) driving V1 and attention modulating different connections in each model.

Pairwise comparisons:
• M2 better than M1: BF ≈ 2966 (ΔF = 7.995)
• M3 better than M2: BF ≈ 12 (ΔF = 2.450)
• M4 better than M3: BF ≈ 23 (ΔF = 3.144)
Thank you