Transcript 08_Bayes

Bayesian Inference
Lee Harrison
York Neuroimaging Centre
14 / 05 / 2010
[Figure: the standard SPM analysis pipeline – realignment, smoothing, normalisation to a template, general linear model, statistical inference with Gaussian field theory at p < 0.05 – with its Bayesian components highlighted: Bayesian segmentation and normalisation, spatial priors on activation extent, posterior probability maps (PPMs), and Dynamic Causal Modelling.]
Attention to Motion
Paradigm and results

Conditions:
- fixation only
- observe static dots (+ photic → V1)
- observe moving dots (+ motion → V5)
- task on moving dots (+ attention → V5 + parietal cortex)

Results: the Attention – No attention contrast activates SPC, V3A and V5+.
Büchel & Friston 1997, Cereb. Cortex; Büchel et al. 1998, Brain
Attention to Motion
Paradigm: Dynamic Causal Models

Model 1 (forward): attentional modulation of V1→V5
Model 2 (backward): attentional modulation of SPC→V5

[Figure: the two DCMs over V1, V5 and SPC, with photic, motion and attention inputs; conditions as before (fixation only, observe static dots, observe moving dots, task on moving dots).]

Bayesian model selection: Which model is optimal?
Bayesian model selection: Which model is optimal?
Responses to Uncertainty
Long term memory
Short term memory
Responses to Uncertainty
Paradigm
Stimuli sequence of randomly sampled discrete events
Model simple computational model of an observers
response to uncertainty based on the number of
past events (extent of memory)
1 2 3 4
Question which regions are best explained by
short / long term memory model?
…
1
trials
2
40
?
?
Overview
• Introductory remarks
• Some probability densities/distributions
• Probabilistic (generative) models
• Bayesian inference
• A simple example – Bayesian linear regression
• SPM applications
– Segmentation
– Dynamic causal modeling
– Spatial models of fMRI time series
Probability distributions and densities

Discrete outcomes (here k = 2): a density over the outcome probabilities,
p(θ | λ) ∝ ∏_k θ_k^(λ_k − 1)

Continuous variable: the Gaussian density with mean θ₀ and precision λ,
p(θ | λ) = (λ / 2π)^(1/2) exp(−λ (θ − θ₀)² / 2)

[Figure: example densities for several settings of λ, plotted for the k = 2 case.]
Generative models

A hierarchical generative model:
y  = X₁θ₁ + e₁
θ₁ = X₂θ₂ + e₂
θ₂ = e₃ ~ N(0, λ₃⁻¹)

generation: sample the parameters and precisions {θ, λ} from the model and generate data y.
estimation: given observed data y, infer the posterior q(θ) = ? over the unknowns.

[Figure: data y generated by the model, displayed over space and time.]
Bayesian statistics

new data: likelihood p(y | θ)
prior knowledge: p(θ)

p(θ | y) ∝ p(y | θ) p(θ)
posterior ∝ likelihood ∙ prior

Bayes' theorem allows one to formally incorporate prior knowledge into computing statistical probabilities. The "posterior" probability of the parameters given the data is an optimal combination of prior knowledge and new data, weighted by their relative precisions.
Bayes' rule

Given data y and parameters θ, their joint probability can be written in 2 ways:
p(y, θ) = p(y | θ) p(θ)
p(y, θ) = p(θ | y) p(y)

Eliminating p(y, θ) gives Bayes' rule:
p(θ | y) = p(y | θ) p(θ) / p(y)
(Posterior = Likelihood × Prior / Evidence)
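As a minimal numeric illustration of the rule (not from the slides), the Python snippet below applies it to a binary parameter with made-up prior and likelihood values:

```python
# Minimal numeric illustration of Bayes' rule: a binary parameter theta,
# a prior, a likelihood for one observed y, and the posterior obtained by
# normalising likelihood x prior by the evidence.
prior = {"theta=0": 0.7, "theta=1": 0.3}          # p(theta)
likelihood = {"theta=0": 0.2, "theta=1": 0.9}     # p(y | theta) for the observed y

evidence = sum(likelihood[t] * prior[t] for t in prior)              # p(y)
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}  # p(theta | y)

print(posterior)   # e.g. {'theta=0': 0.341..., 'theta=1': 0.658...}
```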
Principles of Bayesian inference

• Formulation of a generative model
  likelihood p(y | θ)
  prior distribution p(θ)
• Observation of data y
• Update of beliefs based upon observations, given a prior state of knowledge
  p(θ | y) ∝ p(y | θ) p(θ)
Univariate Gaussian
Normal densities

prior:      p(θ) = N(θ; μ_p, λ_p⁻¹)
model:      y = θ + e
likelihood: p(y | θ) = N(y; θ, λ_e⁻¹)

posterior:  p(θ | y) = N(θ; μ, λ⁻¹)
λ = λ_e + λ_p
μ = λ⁻¹ (λ_e y + λ_p μ_p)

Posterior mean = precision-weighted combination of prior mean and data mean.
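A minimal Python sketch of this precision-weighted update, with illustrative values for the prior, the data point and the two precisions:

```python
# Univariate Gaussian update: posterior precision is the sum of prior and data
# precisions; the posterior mean is the precision-weighted combination of the
# prior mean and the data. (Values are illustrative assumptions.)
mu_p, lam_p = 0.0, 1.0      # prior mean and precision
y, lam_e    = 2.0, 4.0      # observed data point and noise precision

lam_post = lam_e + lam_p                          # lambda = lambda_e + lambda_p
mu_post  = (lam_e * y + lam_p * mu_p) / lam_post  # mu = lambda^-1 (lambda_e y + lambda_p mu_p)

print(mu_post, 1.0 / lam_post)   # posterior mean 1.6, posterior variance 0.2
```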
Bayesian GLM: univariate case
Normal densities

prior:      p(θ) = N(θ; μ_p, λ_p⁻¹)
model:      y = xθ + e
likelihood: p(y | θ) = N(y; xθ, λ_e⁻¹)

posterior:  p(θ | y) = N(θ; μ, λ⁻¹)
λ = λ_e x² + λ_p
μ = λ⁻¹ (λ_e x y + λ_p μ_p)
Bayesian GLM: multivariate case
Normal densities

model:      y = Xθ + e
prior:      p(θ) = N(θ; μ_p, C_p)
likelihood: p(y | θ) = N(y; Xθ, C_e)

posterior:  p(θ | y) = N(θ; μ, C)
C⁻¹ = Xᵀ C_e⁻¹ X + C_p⁻¹
μ = C (Xᵀ C_e⁻¹ y + C_p⁻¹ μ_p)

One step if C_e and C_p are known; otherwise iterative estimation.

[Figure: prior and posterior densities over the coefficients (β₁, β₂).]
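A small numpy sketch of the one-step posterior, assuming C_e and C_p are known; the design matrix, noise level and prior are toy choices:

```python
import numpy as np

# One-step multivariate Bayesian GLM posterior with known covariances.
rng = np.random.default_rng(0)
N, k = 50, 2
X = np.column_stack([np.ones(N), np.linspace(0, 1, N)])   # design matrix
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + 0.3 * rng.standard_normal(N)

Ce_inv = np.eye(N) / 0.3**2          # noise precision, Ce^-1
Cp_inv = np.eye(k) / 10.0            # prior precision, Cp^-1
mu_p = np.zeros(k)                   # prior mean

C_inv = X.T @ Ce_inv @ X + Cp_inv                 # C^-1 = X' Ce^-1 X + Cp^-1
C = np.linalg.inv(C_inv)
mu = C @ (X.T @ Ce_inv @ y + Cp_inv @ mu_p)       # mu = C (X' Ce^-1 y + Cp^-1 mu_p)

print(mu)                     # posterior mean, close to theta_true
print(np.sqrt(np.diag(C)))    # posterior standard deviations
```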
Approximate inference: optimization

True posterior: p(θ | y, m) = p(y, θ | m) / p(y | m)

Mean-field approximation: q(θ) = ∏ᵢ q(θᵢ)

Iteratively improve the approximate posterior by maximising the free energy:
log p(y | m) = ∫ q(θ) log[ p(y, θ | m) / q(θ) ] dθ  +  ∫ q(θ) log[ q(θ) / p(θ | y, m) ] dθ
             =  free energy F(q)  +  KL[ q(θ) || p(θ | y, m) ]

[Figure: objective function plotted against the value of the parameter.]
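The decomposition can be checked numerically for a toy one-parameter Gaussian model; this is a sketch with assumed values, using grid integration rather than variational updates:

```python
import numpy as np

# Grid-based check for a toy model y = theta + e with a Gaussian prior:
# log p(y|m) = F(q) + KL[q || p(theta|y,m)], so maximising F over q tightens
# a lower bound on the log-evidence. (All values are assumptions.)
theta = np.linspace(-10.0, 10.0, 20001)
dth = theta[1] - theta[0]
y = 1.5
lam_e, lam_p = 4.0, 1.0                        # noise and prior precisions

def normal(x, mean, prec):
    return np.sqrt(prec / (2.0 * np.pi)) * np.exp(-0.5 * prec * (x - mean) ** 2)

prior = normal(theta, 0.0, lam_p)
lik = normal(y, theta, lam_e)
joint = lik * prior                            # p(y, theta | m) on the grid
evidence = joint.sum() * dth                   # p(y | m)
post = joint / evidence                        # true posterior p(theta | y, m)

q = normal(theta, 1.0, 3.0)                    # an approximate posterior (not the true one)

F = (q * np.log(joint / q)).sum() * dth        # free energy
KL = (q * np.log(q / post)).sum() * dth        # KL[q || true posterior]
print(np.log(evidence), F + KL)                # the two values agree
```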
Simple example – linear regression

Ordinary least squares:
y = Xβ
E_D = (y − Xβ)ᵀ (y − Xβ)
∂E_D/∂β = 0  ⇒  β̂_ols = (Xᵀ X)⁻¹ Xᵀ y

[Figure: data y, the model fit, the bases (explanatory variables) X, and the sum of squared errors E_D.]
Simple example – linear regression
Data and model fit

Ordinary least squares, as more bases (explanatory variables) are added to X:
Over-fitting: the model fits the noise.
Inadequate cost function: the sum of squared errors is blind to overly complex models.
Solution: include uncertainty in the model parameters.
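The over-fitting point can be reproduced with a short numpy sketch (toy sinusoidal data and polynomial bases assumed): the sum of squared errors keeps falling as bases are added, so it cannot signal that the model has become too complex.

```python
import numpy as np

# Ordinary least squares with polynomial bases of increasing order:
# the training error E_D shrinks monotonically as bases are added.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

for order in (1, 3, 9, 15):
    X = np.vander(x, order + 1, increasing=True)      # polynomial design matrix
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # (X'X)^-1 X' y
    E_D = np.sum((y - X @ beta_ols) ** 2)             # sum of squared errors
    print(f"order {order:2d}: E_D = {E_D:.4f}")
```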
Bayesian linear regression:
priors and likelihood

Model:       y = Xβ + e

Prior:       p(β | λ₂) = N_k(0, λ₂⁻¹ I_k)
                       ∝ exp(−λ₂ βᵀβ / 2)

Likelihood:  p(y | β, λ₁) = ∏_{i=1..N} p(y_i | β, λ₁)
             p(y_i | β, λ₁) = N(y_i; X_i β, λ₁⁻¹)
                            ∝ exp(−λ₁ (y_i − X_i β)² / 2)

[Figure: the bases X, and sample curves drawn from the prior (before observing any data), with the mean curve.]
Bayesian linear regression:
posterior

Model:       y = Xβ + e
Prior:       p(β | λ₂) = N_k(0, λ₂⁻¹ I_k) ∝ exp(−λ₂ βᵀβ / 2)
Likelihood:  p(y | β, λ₁) = ∏_{i=1..N} p(y_i | β, λ₁)

Bayes' rule: p(β | y, λ) ∝ p(y | β, λ) p(β | λ)

Posterior:   p(β | y, λ) = N(μ, C)
             C = (λ₁ Xᵀ X + λ₂ I_k)⁻¹
             μ = λ₁ C Xᵀ y

[Figure: the data y and the bases X.]
Posterior Probability Maps (PPMs)

Posterior distribution: the probability of the effect given the data,
p(β | y)
mean: size of effect
precision: variability

Posterior probability map: images of the probability (confidence) that an activation exceeds some specified threshold s_th, given the data y:
p(β > s_th | y) ≥ p_th

[Figure: posterior density p(β | y) with the activation threshold s_th marked on the β axis.]

Two thresholds:
• activation threshold s_th: percentage of whole-brain mean signal (a physiologically relevant size of effect)
• probability p_th that voxels must exceed to be displayed (e.g. 95%)
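For a single voxel with a Gaussian posterior over the effect size, the PPM quantity p(β > s_th | y) is just a Gaussian tail probability; a sketch with illustrative numbers, using scipy:

```python
from scipy.stats import norm

# One-voxel PPM rule: probability that the effect exceeds the activation
# threshold, compared against the display threshold. (Numbers are illustrative.)
mu, sd = 1.2, 0.35        # posterior mean and standard deviation of the effect
s_th = 0.5                # activation threshold (e.g. % of whole-brain mean signal)
p_th = 0.95               # probability threshold for display

p_exceed = 1.0 - norm.cdf(s_th, loc=mu, scale=sd)   # p(beta > s_th | y)
print(round(p_exceed, 3), p_exceed >= p_th)         # display this voxel if True
```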
Bayesian linear regression:
model selection

Bayes' rule:
p(β | y, λ, m) = p(y | β, λ, m) p(β | λ, m) / p(y | λ, m)

The normalizing constant is the model evidence:
p(y | λ, m) = ∫ p(y | β, λ, m) p(β | λ, m) dβ

log p(y | λ, m) = accuracy(m) − complexity(m)
accuracy(m): increases as the residual error ‖y − Xμ‖² decreases
complexity(m): increases with the number of parameters k and with the size of the estimated coefficients relative to the prior precision λ₂
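A sketch of evidence-based model comparison for this regression setting: with Gaussian prior and noise the coefficients can be integrated out in closed form, giving y ~ N(0, λ₁⁻¹I + λ₂⁻¹XXᵀ). The two candidate models below differ only in the number of polynomial bases; the setup is a toy assumption, not SPM code.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Model evidence for Bayesian linear regression via the Gaussian marginal
# likelihood; comparing it across models embodies the accuracy/complexity trade-off.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
lam1, lam2 = 1.0 / 0.2**2, 1.0          # noise and prior precisions

def log_evidence(k):
    X = np.vander(x, k, increasing=True)
    # Marginal over the coefficients: y ~ N(0, lam1^-1 I + lam2^-1 X X')
    cov = np.eye(x.size) / lam1 + X @ X.T / lam2
    return multivariate_normal(mean=np.zeros(x.size), cov=cov).logpdf(y)

for k in (4, 15):
    print(f"k = {k:2d} bases: log evidence = {log_evidence(k):.2f}")
# Compare the two values: the evidence balances fit against model complexity.
```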
aMRI segmentation

[Figure: generative model for segmentation – class prior frequencies, class means μ₁, μ₂, μ₃ and class variances σ₁², σ₂², σ₃² generate the i-th voxel label, which in turn generates the i-th voxel value y_i.]

PPM of belonging to…
grey matter / white matter / CSF
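A minimal sketch of the per-voxel computation implied by this model: the posterior probability of each tissue class given a voxel value is the class likelihood times the class prior frequency, normalised. Intensity units and parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Posterior class probabilities for one voxel under a Gaussian mixture of
# tissue classes (toy parameter values).
priors = np.array([0.35, 0.45, 0.20])     # class prior frequencies: GM, WM, CSF
means  = np.array([0.55, 0.75, 0.25])     # class means
sds    = np.array([0.08, 0.06, 0.10])     # class standard deviations

y_i = 0.62                                # i-th voxel value
lik = norm.pdf(y_i, loc=means, scale=sds)             # p(y_i | class)
post = lik * priors / np.sum(lik * priors)            # p(class | y_i)
print(dict(zip(["GM", "WM", "CSF"], post.round(3))))
```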
Dynamic Causal Modelling:
generative model for fMRI and ERPs

Neural state equation: ẋ = F(x, u, θ), driven by inputs u.

Hemodynamic forward model: neural activity → BOLD (fMRI)
Electric/magnetic forward model: neural activity → EEG / MEG / LFP

fMRI – neural model:
• 1 state variable per region
• bilinear state equation
• no propagation delays

ERPs – neural model:
• 8 state variables per region
• nonlinear state equation
• propagation delays
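For the fMRI case, the bilinear neural state equation takes the form dx/dt = (A + Σⱼ uⱼBⱼ)x + Cu. The toy two-region sketch below integrates it with forward Euler; connectivity values and inputs are assumptions, and the haemodynamic forward model (neural activity → BOLD) is omitted.

```python
import numpy as np

# Toy bilinear neural dynamics for a two-region DCM-style model:
# dx/dt = (A + u_mod * B) x + C u, with one driving and one modulatory input.
A = np.array([[-1.0, 0.0],
              [ 0.4, -1.0]])          # fixed (endogenous) connectivity
B = np.array([[0.0, 0.0],
              [0.3, 0.0]])            # modulation of the region 1 -> 2 connection
C = np.array([[1.0, 0.0],
              [0.0, 0.0]])            # driving input enters region 1

dt, T = 0.01, 10.0
t = np.arange(0.0, T, dt)
u = np.zeros((t.size, 2))
u[(t > 1) & (t < 8), 0] = 1.0         # driving input (e.g. visual stimulation)
u[(t > 4) & (t < 8), 1] = 1.0         # modulatory input (e.g. attention)

x = np.zeros(2)
trace = []
for ui in u:                           # forward Euler integration
    dx = (A + ui[1] * B) @ x + C @ ui
    x = x + dt * dx
    trace.append(x.copy())
print(np.round(trace[-1], 3))          # neural states at the end of the run
```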
Bayesian Model Selection for fMRI

Four models m1–m4 of the attention-to-motion network, each containing stim, V1, V5 and PPC, and differing in where attention enters.

[Figure: the four model graphs; a bar chart of each model's marginal likelihood ln p(y | m); and the estimated effective synaptic strengths for the best model (m4), with values 1.25, 0.46, 0.39, 0.26, 0.26, 0.13, 0.10 on the connections among stim, V1, V5 and PPC.]

[Stephan et al., Neuroimage, 2008]
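Given log-evidences for a set of models, fixed-effects BMS with a flat model prior turns them into posterior model probabilities via a softmax; a sketch with illustrative values (not those in the figure):

```python
import numpy as np

# Fixed-effects Bayesian model selection from log-evidences: with a flat prior
# over models, p(m | y) is a softmax of ln p(y | m). (Values are illustrative.)
log_ev = np.array([2.0, 5.5, 7.0, 12.0])          # ln p(y | m) for m1..m4

log_ev = log_ev - log_ev.max()                    # subtract max for numerical stability
post = np.exp(log_ev) / np.exp(log_ev).sum()      # posterior model probabilities
for name, p in zip(["m1", "m2", "m3", "m4"], post):
    print(name, round(p, 4))
```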
fMRI time series analysis with spatial priors

Y = Xβ + ε
p(β) = N(0, α⁻¹ L⁻¹)   – spatial prior on the GLM coefficients, with spatial precision matrix L and a hyperparameter controlling the degree of smoothness

[Figure: graphical model relating the observations Y, the GLM coefficients β, and the AR coefficients A (correlated noise), with hyperparameters for the prior precision of the GLM coefficients, the prior precision of the AR coefficients, and the precision of the data noise; aMRI shown for reference. Compared: smoothing Y and using RFT vs. the ML estimate of β vs. the VB estimate of β.]

Penny et al. 2005
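A toy one-regressor sketch of the effect of such a spatial prior, using a 1-D chain of voxels and a Laplacian-based precision matrix L with fixed (assumed) hyperparameters; Penny et al.'s scheme is a full variational treatment that also estimates the precisions and AR coefficients from the data.

```python
import numpy as np

# One regressor, a 1-D chain of voxels, spatial prior p(beta) = N(0, alpha^-1 L^-1)
# with L a graph Laplacian (plus a small ridge so it is proper).
rng = np.random.default_rng(2)
n_vox, n_scan = 60, 40
x = rng.standard_normal(n_scan)                      # regressor time course
beta_true = np.convolve(rng.standard_normal(n_vox), np.ones(7) / 7, mode="same")
Y = np.outer(x, beta_true) + 2.0 * rng.standard_normal((n_scan, n_vox))

L = 2 * np.eye(n_vox) - np.eye(n_vox, k=1) - np.eye(n_vox, k=-1)   # chain Laplacian
L += 1e-3 * np.eye(n_vox)
alpha, lam_e = 10.0, 1.0 / 2.0**2                    # prior and noise precisions (assumed)

beta_ml = x @ Y / (x @ x)                            # per-voxel ML (least squares) estimate
post_prec = lam_e * (x @ x) * np.eye(n_vox) + alpha * L
beta_post = np.linalg.solve(post_prec, lam_e * Y.T @ x)   # posterior mean under the spatial prior

mse_ml = np.mean((beta_ml - beta_true) ** 2)
mse_post = np.mean((beta_post - beta_true) ** 2)
print(round(mse_ml, 4), round(mse_post, 4))          # compare the two estimates
```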
fMRI time series analysis with spatial priors:
posterior probability maps

Posterior density q(β_n) = N(μ_n, Σ_n)
mean: size of effect (Cbeta_*.img)
std dev: uncertainty (SDbeta_*.img)

Probability of getting an effect, given the data:
p_n = q(β_n > s_th)   – the probability mass above the activation threshold s_th

PPM (spmP_*.img): display only voxels that exceed e.g. 95%, i.e. p_n ≥ p_th.

[Figure: posterior density q(β_n) with the activation threshold s_th and the shaded probability mass p_n.]
fMRI time series analysis with spatial priors:
Bayesian model selection

log p(y | m) ≈ F(q)

Log-evidence maps: compute the log-evidence for each model and subject (subjects 1…N, models 1…K).

BMS maps: r_k is the probability that model k generated the data.
PPM: e.g. q(r_k > 0.5) = 0.941 at a given voxel.
EPM: expected probability map of r_k.

[Figure: log-evidence maps per subject and model, and the resulting BMS maps (PPM and EPM) derived from the posterior density q(r_k).]

Joao et al., 2009
Reminder…

Compare two models: long-term memory vs. short-term memory.

Short-term memory model: information-theoretic (IT) indices H, h, I, i.
Long-term memory model: the IT indices are smoother.
(H = entropy; h = surprise; I = mutual information; i = mutual surprise. The figure marks the trial onsets and the missed trials.)
Group data: Bayesian Model Selection maps

Regions best explained by the short-term memory model: primary visual cortex.
Regions best explained by the long-term memory model: frontal cortex (executive control).
Thank-you