Introduction: statistical and machine learning based approaches to neurobiology. Shin Ishii, Nara Institute of Science and Technology.

1. Mathematical Fundamentals: Maximum Likelihood and Bayesian Inference
Coin tossing
• Tossing a skewed coin: one head comes up in five tosses.
  Observed: Head, Tail, Tail, Tail, Tail.
• How often does the head appear for this coin?
– Probability of one head coming up in five tosses (note: each trial is independent):
  P(data | μ) = μ(1 − μ)⁴   (the likelihood)
– Parameter μ: rate of head appearance in an individual trial.
Likelihood function
• Likelihood: an evaluation of the observed data, viewed as a function of the parameter μ:
  L(μ) = μ(1 − μ)⁴
• Which parameter value is better for explaining the observed data? What is the most likely parameter, and how do we determine it?
• It seems natural to set μ according to the frequency with which the head came up, i.e., μ = 1/5. Really?
Kullback-Leibler (KL) divergence
• A measure of the difference between two probability distributions p(x) and q(x):
  KL(p ‖ q) = Σₓ p(x) log [p(x) / q(x)]
• It lets us measure the difference between distributions as an objective, numerical value.
• Note: the KL divergence is not a metric (in particular, it is asymmetric).
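As a concrete aside, the discrete KL divergence can be computed in a few lines (a minimal sketch; the function name and the example coin distributions are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.2, 0.8]               # "true" coin: heads 20% of the time
q = [0.5, 0.5]               # trial distribution: a fair coin
print(kl_divergence(p, q))   # positive: the distributions differ
print(kl_divergence(q, p))   # a different value: KL is asymmetric, hence not a metric
print(kl_divergence(p, p))   # zero when the distributions coincide
```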
Minimize KL divergence
• Random events are drawn from the true distribution p(x); we observe a data set {x₁, …, x_N}.
• Using the observed data, we want to estimate the true distribution p(x) with a trial distribution q(x; θ) by minimizing the divergence between them.
• The smaller the KL divergence KL(p ‖ q), the better the estimate.
Minimize KL divergence
• KL divergence between the two distributions:
  KL(p ‖ q) = Σₓ p(x) log p(x) − Σₓ p(x) log q(x; θ)
• The first term is constant: independent of the parameter θ.
• To minimize the KL divergence, we only have to maximize the second term, Σₓ p(x) log q(x; θ), with respect to the parameter θ.
Likelihood and KL divergence
• The second term is approximated by the sample mean over the data set {x₁, …, x_N}:
  Σₓ p(x) log q(x; θ) ≈ (1/N) Σᵢ log q(xᵢ; θ)
  and the right-hand side is the log likelihood (divided by N).
• They are therefore the same:
  – minimizing the KL divergence
  – maximizing the likelihood
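This equivalence can be checked numerically for the coin model: the μ that minimizes KL(p ‖ q_μ) and the μ that maximizes the sample-average log likelihood (nearly) coincide. A sketch, with an assumed true head probability of 0.2:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 0.2                            # assumed true head probability
data = rng.random(100_000) < mu_true     # a large sample of coin tosses

mus = np.linspace(0.01, 0.99, 99)
# sample-average log likelihood (1/N) sum_i log q_mu(x_i)
avg_ll = data.mean() * np.log(mus) + (1 - data.mean()) * np.log(1 - mus)
# exact KL(p || q_mu), which differs from -avg_ll only by the constant -H(p)
kl = mu_true * np.log(mu_true / mus) + (1 - mu_true) * np.log((1 - mu_true) / (1 - mus))

# both criteria pick (nearly) the same parameter
print(mus[np.argmax(avg_ll)], mus[np.argmin(kl)])
```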
Maximum Likelihood (ML) estimation
• Maximum likelihood (ML) estimate: μ̂ = argmax_μ L(μ).
• What is the most likely parameter in the coin tossing (Head, Tail, Tail, Tail, Tail)?
  Maximization condition: d/dμ [μ(1 − μ)⁴] = (1 − μ)³(1 − 5μ) = 0  ⟹  μ̂ = 1/5
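The maximization condition can be verified with a minimal numerical scan of the log likelihood for one head and four tails (the grid size is an arbitrary choice):

```python
import numpy as np

heads, tails = 1, 4
mus = np.linspace(1e-3, 1 - 1e-3, 999)                 # grid over (0, 1)
log_lik = heads * np.log(mus) + tails * np.log(1 - mus)
mu_ml = mus[np.argmax(log_lik)]
print(mu_ml)   # close to heads / (heads + tails) = 1/5
```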
Property of the ML estimate (R. Fisher, 1890-1962)
• As the number of observations N increases, the squared error of the estimate decreases in order O(1/N).
• The ML estimate is asymptotically unbiased: if an infinite number of observations could be obtained, the ML estimate would converge to the true parameter.
• Infeasible: what happens when only a limited number of observations have been obtained from the real environment?
Problem with ML estimation
• Is it really a skewed coin?
  Observed: Head, Tail, Tail, Tail, Tail.
• "See... four consecutive tails occurred. It may just happen by chance. The ML estimate overfits the first observations, and it may be detrimental to fix the parameter from them."
• How to avoid this overfitting? Consider an extreme case: if the data consist of one head in a single toss, the ML estimate gives 100%. Not reasonable.
• Five more tosses: Head, Head, Head, Head, Tail. Over all ten tosses the coin no longer looks skewed.
Bayesian approach
• Bayes theorem:
  Posterior ∝ Likelihood × Prior
  (a posteriori information = information obtained from data + a priori information)
• Prior distribution: we have no information about the probably skewed coin, so we assume a prior distribution with the parameter concentrated around a central value.
Bayesian approach
• Bayes theorem: Posterior ∝ Likelihood × Prior.
• Likelihood function from the observed data, one head and four tails: "Hmm... it may be a skewed coin, but better to consider other possibilities as well."
Bayesian approach
• Bayes theorem: Posterior ∝ Likelihood × Prior.
• Posterior distribution: the parameter is distributed mainly over an intermediate range; variance (uncertainty) remains in the estimate.
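The slides do not name a specific prior, but a Beta distribution is the standard conjugate choice for a coin parameter, and with it the posterior update is a one-liner. A sketch, with an assumed symmetric Beta(5, 5) prior around 0.5:

```python
import math

# Conjugate update: Beta(a, b) prior on mu, Bernoulli likelihood (an assumed prior,
# not one specified in the slides).
a, b = 5.0, 5.0            # symmetric prior concentrated near mu = 0.5
heads, tails = 1, 4        # observed: one head, four tails
a_post, b_post = a + heads, b + tails

post_mean = a_post / (a_post + b_post)
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))

print(post_mean)           # 6/15 = 0.4: pulled toward 0.5, unlike the ML estimate 0.2
print(math.sqrt(post_var)) # uncertainty remains, and it shrinks as more data arrive
```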
Property of Bayesian inference
• Bayesian view: probability represents the uncertainty of random events (a subjective value).
• Frequentist (R. Fisher, 1890-1962): "That can't be! The prior distribution introduces a subjective distortion into the estimation. The estimation process must be objective, based only on the obtained data."
• Bayesian (T. Bayes, 1702-1761): "No problem. The uncertainty of random events (subjective probability) depends on the amount of information obtained from the data and on prior knowledge of the events."
Application of Bayesian approaches
• Data obtained from the real world are often:
  – sparse
  – high dimensional
  – affected by unobservable variables
• Bayesian methods are available for such settings, e.g., bioinformatics and user support systems (Bayesian networks).
2. Bayesian Approaches to Reconstruction of Neural Codes
A neural decoding problem
• How does the brain work? Sensory information is represented in sequences of spikes, and when the same stimulus is repeatedly input, spike occurrence varies between trials.
• An indirect approach is to reconstruct the stimuli from the observed spike trains.

Bayesian application to a neural code
• Stimulus (prior knowledge) → spike train (observation).
• Possible algorithms for stimulus reconstruction (estimation):
  – from the spike train only (maximum likelihood estimation)
  – from the spike train and stimulus statistics (Bayes estimation)
• Note: we focus on whether spike trains include stimulus information, not on whether the algorithm is the one actually used in the brain.
‘Observation’ depends on ‘Prior’ (Bialek et al., Science, 1991)
• A stimulus s drives a neural system (a black box), which emits a spike train x; an estimation algorithm converts x back into an estimated stimulus.
• In probabilistic terms: the stimulus is drawn from a distribution P(s); the neural system defines the likelihood P(x | s); and the estimation algorithm computes the estimated-stimulus distribution P(s | x).
Simple example of signal estimation
• Incoming signal s, noise ξ; observation = signal + noise: x = s + ξ.
• The estimate is a function of the particular observed value: s_est = f(x).
• If the probability of observing a particular x with signal s depends only on the noise ξ, and the noise is supposed to be chosen from a Gaussian:
  P(x | s) = P(ξ = x − s) = (1/√(2πσ²)) exp(−(s − x)²/(2σ²))
• If the signal s is also supposed to be chosen from a Gaussian:
  P(s) = (1/√(2πσ_s²)) exp(−s²/(2σ_s²))
• So the posterior is:
  P(s | x) ∝ P(x | s) P(s) ∝ exp(−(s − x)²/(2σ²)) exp(−s²/(2σ_s²))
Simple example of signal estimation
• Maximum likelihood estimation: maximize P(x | s),
  ∂P(x | s)/∂s = 0 ⟹ s_est = x
• Bayes estimation: maximize P(s | x),
  ∂P(s | x)/∂s = 0 ⟹ s_est = K x, with K = SNR/(SNR + 1) and SNR = σ_s²/σ²
• Bayes theorem: P(s | x) ∝ P(x | s) P(s); posterior ∝ likelihood (observation) × prior knowledge.
• When SNR ≫ 1, K → 1 and the two estimates agree; at finite SNR the Bayes estimate shrinks the observation toward the prior mean.
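The shrinkage result s_est = Kx can be verified by simulation; the values of σ_s, σ, and the sample size below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_s, sigma = 1.0, 2.0                  # signal and noise standard deviations
snr = sigma_s**2 / sigma**2
K = snr / (snr + 1.0)                      # Bayes shrinkage factor

s = rng.normal(0.0, sigma_s, 200_000)      # signals drawn from the prior
x = s + rng.normal(0.0, sigma, s.size)     # noisy observations

mse_ml = np.mean((x - s) ** 2)             # ML estimate: s_est = x
mse_bayes = np.mean((K * x - s) ** 2)      # Bayes estimate: s_est = K x
print(mse_ml, mse_bayes)                   # the Bayes error is smaller at finite SNR
```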
Signal estimation of a fly (Bialek et al., Science, 1991)
• Calliphora erythrocephala; movement-sensitive neuron (H1); Gaussian visual stimulus.
• Visually guided flight: the behavioral time scale is about 30 ms, while the H1 firing rate is 100-200 spikes/s, so behavioral decisions are based on just a few spikes.
Signal estimation of a fly (Bialek et al., Science, 1991)
• Stimulus s(t) → encoder → spike train {tᵢ}, the observation, with likelihood P[{tᵢ} | s].
• The estimated stimulus maximizes the posterior P[s | {tᵢ}]. However, P[s | {tᵢ}] cannot be measured directly.
• Bayes theorem: P[s | {tᵢ}] ∝ P[{tᵢ} | s] P[s]; posterior ∝ likelihood (observation) × prior knowledge.
Kernel reconstruction and least squares
• Estimated stimulus: s_est = ∫ ds P[s | {tᵢ}] s.
  This still cannot be calculated, because P[s | {tᵢ}] cannot be defined directly.
• Alternative next step: expand the estimate as a sum of kernels centered on the spike times,
  s_est(t) = Σᵢ F(t − tᵢ),
  choosing the kernel F(τ) that minimizes the squared error
  χ² = ∫ |s(t) − s_est(t)|² dt.
Signal estimation of a fly (Bialek et al., Science, 1991)
[Figure: the stimulus and the estimated stimulus, together with the optimal kernel F(τ).]
Case of mammals: O'Keefe's place cells
• It is known that hippocampal CA1 cells represent the position in a familiar field (rat hippocampal CA1 cells; Lever et al., Nature, 2002).
• Each place cell shows high activity when the rat is located at a specific position.
• Question: can one estimate the rat's position in the field from the firing patterns of rat hippocampal place cells? Incremental Bayes estimation (Brown et al., J. Neurosci., 1998).
Sequential Bayes estimation from spike trains (Brown et al., J. Neurosci., 1998)
• Observation: the spike train of a place cell at times … t_{k−1}, t_k, …; the quantity to estimate is the rat's position s(t_k) at time t_k.
• The position can be estimated by integrating the recent place-cell activity with the position estimate from the history of activities:
  P(s(t_k) | spikes in [t₁, t_k]) ∝ P(spikes at t_k | s(t_k), t_{k−1}) × P(s(t_k) | spikes in [t₁, t_{k−1}])
• This is Bayes theorem again: posterior ∝ likelihood (observation) × prior knowledge.
Incremental Bayes estimation from spike trains (Brown et al., J. Neurosci., 1998)
• Prediction step: the prior over the rat's position at t_{k+1} is obtained by propagating the posterior at t_k through the movement model:
  P(s(t_{k+1}) | spikes in [t₁, t_k]) = ∫ P(s(t_{k+1}) | s(t_k)) P(s(t_k) | spikes in [t₁, t_k]) ds(t_k)
Incremental Bayes estimation from spike trains (Brown et al., J. Neurosci., 1998)
• Update step: at spike time t_k, the prior P(s(t_k) | spikes in [t₁, t_{k−1}]) is combined with the observation probability P(spikes at t_k | s(t_k), t_{k−1}).
• The observation probability is a function of the firing rate of the cells, which depends on the position and on the theta rhythm. The firing rate of a place cell depends on:
  1. a position component (receptive field)
  2. a theta phase component
Inhomogeneous Poisson process for spike trains (Brown et al., J. Neurosci., 1998)
• The firing rate of a place cell depends on (1) the preferred position (receptive field) and (2) the preferred phase of the theta rhythm.
• Position component (asymmetric Gaussian):
  λ_x(t | x(t), θ_x) = exp(−(1/2)(x(t) − μ)ᵀ W⁻¹ (x(t) − μ))
• Theta phase component (cosine):
  λ_θ(t | θ(t), θ_θ) = exp(β cos(θ(t) − θ₀))
• Instantaneous firing rate:
  λ(t | x(t), θ(t), θ_x, θ_θ) = λ_x(t | x(t), θ_x) · λ_θ(t | θ(t), θ_θ)
• The parameters were determined by maximum likelihood.
Position estimation from spike trains (Brown et al., J. Neurosci., 1998)
• Assumption: the path of the rat may be approximated as a zero-mean two-dimensional Gaussian random walk; its parameters σ_{x1} and σ_{x2} were estimated by ML.
• The estimation procedure is then:
  1. Encoding stage: estimate the parameters θ_x, θ_θ, σ_{x1}, and σ_{x2}.
  2. Decoding stage: estimate the rat's position by the incremental Bayes method at each spike event, under the Gaussian random-walk assumption:
  P(s(t_k) | spikes in [t₁, t_k]) ∝ P(spikes at t_k | s(t_k), t_{k−1}) × P(s(t_k) | spikes in [t₁, t_{k−1}])
Bayes estimation from spike trains (Brown et al., J. Neurosci., 1998)
[Figure: the real rat's trajectory alongside the EKF estimate with its variance, updated at each spike.]
• The calculation of the posterior distribution is done at discontinuous time steps: whenever a spike occurs as a new observation.
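The predict-update recursion can be sketched as a one-dimensional grid filter for a single place cell (Brown et al. actually use a 2-D Gaussian/EKF approximation; every parameter below, including the receptive field and random-walk scales, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(-1.0, 1.0, 201)     # 1-D position grid (a sketch; the paper is 2-D)
dx = grid[1] - grid[0]

def place_rate(pos, center=0.3, width=0.1, peak=20.0):
    # position component of the firing rate: Gaussian receptive field, in Hz
    return peak * np.exp(-0.5 * ((pos - center) / width) ** 2)

dt, steps, step_sd = 0.001, 2000, 0.002    # 1-ms bins, 2 s of data, small random walk
trans_sd = 0.01                            # transition kernel a bit broader, for robustness
trans = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / trans_sd) ** 2)
trans /= trans.sum(axis=0, keepdims=True)  # column-stochastic random-walk prior

x = 0.3                                            # true (hidden) rat position
post = np.full(grid.size, 1.0 / (grid.size * dx))  # flat initial prior

for _ in range(steps):
    x = float(np.clip(x + rng.normal(0.0, step_sd), -1.0, 1.0))
    spike = rng.random() < place_rate(x) * dt      # Poisson spiking within this bin
    post = trans @ post                            # predict: propagate the random walk
    lam = place_rate(grid) * dt
    post = post * (lam if spike else 1.0 - lam)    # update: point-process likelihood
    post /= post.sum() * dx                        # renormalize

x_est = float(np.sum(grid * post) * dx)            # posterior mean
print(x, x_est)
```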
Position estimation from spike trains (1) (Brown et al., J. Neurosci., 1998)
• Three ways to decode the position from the observed firing pattern:
  – Bayes estimation: posterior = likelihood × prior
  – Maximum likelihood: estimate from the likelihood alone
  – Maximum correlation: correlate the observed firing pattern with model activity
Position estimation from spike trains (2) (Brown et al., J. Neurosci., 1998)
• The ML and maximum-correlation methods ignore the history of neural activities, but the incremental Bayes method incorporates it as a prior.
3. Information Theoretic Analysis of Spike Trains
Information transmission in neural systems
• Environmental stimulus X → encoding → spike train (neural response) Y → decoding.
• How does a spike train code information about the corresponding stimuli?
• How efficient is the information transmission?
• Which kind of coding is optimal?
Information transmission: generalized view
• Shannon's communication system (Shannon, 1948): information source → transmitter (encoder) → channel, with a noise source → receiver (decoder) → destination.
• Z: source symbol (message); X: encoded symbol (signal; here, the stimuli); Y: transmitted symbol (received signal; here, the response); Z̃: decoded symbol (message).
• In the neural setting, X and Y are the observable quantities.
Neural coding is a stochastic process
• Stimulus X; observed spike trains Y₁ | X, Y₂ | X, Y₃ | X, …
• Neuronal responses to a given stimulus are not deterministic but stochastic, and the stimulus underlying each response is likewise probabilistic.
Shannon’s Information
• The smallest unit of information is the "bit": 1 bit = the amount of information needed to choose between two equally likely outcomes (e.g., tossing a fair coin).
• Properties:
  1. Information for independent events is additive over the constituent events.
  2. If we already know the outcome, there is no information.
• Property 1. Independent events, P(X₁, X₂, …, X_N) = P₁(X₁) P₂(X₂) ⋯ P_N(X_N), imply
  I(P(X₁, X₂, …, X_N)) = I(P₁(X₁)) + I(P₂(X₂)) + ⋯ + I(P_N(X_N)).
• Property 2. A certain event, P(X) = 1, implies I(P(X)) = 0.
• Both properties are satisfied by I(P(X)) = −log₂ P(X).
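The self-information I(P(X)) = −log₂ P(X) and its two defining properties, as a few lines of code (the function name is illustrative):

```python
import math

def info_bits(p):
    """Shannon self-information -log2 p of an outcome with probability p."""
    return -math.log2(p)

# Property 1: additivity over independent events
assert abs(info_bits(0.5 * 0.5) - (info_bits(0.5) + info_bits(0.5))) < 1e-12
# Property 2: a certain event carries no information
print(info_bits(1.0))   # 0.0
print(info_bits(0.5))   # 1.0 bit: one fair coin toss
```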
Eg. Tossing a coin
• Tossing a fair coin: P(X = Head) = P(X = Tail) = 0.5.
  Observing X = Head gives I(X = Head) = −log₂ 0.5 = 1 bit; likewise I(X = Tail) = 1 bit.
• Tossing a horribly skewed coin: P(X = Head) = 0.99, P(X = Tail) = 0.01.
  I(X = Head) = −log₂ 0.99 ≈ 0.0145 bits; I(X = Tail) = −log₂ 0.01 ≈ 6.64 bits.
• Observing an ordinary event carries little information, but observing a rare event is highly informative.
Eg. Tossing 5 coins
• Case 1: five fair coins, P(X = Head) = 0.5. Observing H, T, T, T, T gives
  I(X = H, T, T, T, T) = 1 × 5 = 5 bits.
• Case 2: five skewed coins, P(X = Head) = 0.2. Observing H, T, T, T, T gives
  I(X = H, T, T, T, T) = −log₂ 0.2 − 4 log₂ 0.8 ≈ 3.6 bits.
Entropy
• On average, how much information do we get from an observation drawn from a distribution?
• Entropy is the expectation of the information over all possible observations:
  H_p = E_X[−log₂ P(X)]
• It can be defined in both cases:
  discrete: H_p = −Σ_X P(X) log₂ P(X)
  continuous: H_p = −∫ P(X) log₂ P(X) dX
Some properties of entropy
• It is a scalar property of a probability distribution.
• Entropy is maximal if P(X) is constant (least certainty about the event).
• Entropy is minimal if P(X) is a delta function.
• Discrete entropy is always non-negative.
• The higher the entropy, the more you learn (on average) by observing values of the random variable, and the less you can predict those values.
Eg. Tossing a coin
• P(X = Head) = p, P(X = Tail) = 1 − p.
• Entropy: H_P = −p log₂ p − (1 − p) log₂(1 − p).
• The entropy reaches its maximum when each event occurs with equal probability, i.e., most randomly.
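A quick numerical check that the coin entropy peaks at 1 bit for p = 0.5 (the helper name is illustrative; the convention 0 log 0 = 0 handles the endpoints):

```python
import numpy as np

def coin_entropy(p):
    """H(p) = -p log2 p - (1-p) log2(1-p), with the convention 0 log 0 = 0."""
    p = np.asarray(p, float)
    q = 1.0 - p
    t1 = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)
    t2 = np.where(q > 0, q * np.log2(np.where(q > 0, q, 1.0)), 0.0)
    return -(t1 + t2)

ps = np.linspace(0.0, 1.0, 101)
H = coin_entropy(ps)
print(ps[np.argmax(H)], H.max())   # maximum of 1 bit at p = 0.5
```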
Eg. Entropy of Gaussian distributions
• For x ~ N(μ, σ²):
  H_x = −∫ (1/√(2πσ²)) e^{−x²/(2σ²)} log₂[(1/√(2πσ²)) e^{−x²/(2σ²)}] dx = (1/2) log₂(2πeσ²) bits
• The entropy depends only on the standard deviation, i.e., it reflects the variability of the information source.
What distribution maximizes the entropy of a random variable?
• Discrete, X ∈ {1, …, M}: H_p = −Σ_X P(X) log₂ P(X) is maximized by the uniform distribution, P(X) = 1/M.
• Continuous, with E[X] = μ and V[X] = σ² fixed: H_p = −∫ P(X) log₂ P(X) dX is maximized by the Gaussian,
  P(X) = (1/√(2πσ²)) exp(−(X − μ)²/(2σ²)).
Entropy of Spike Trains (MacKay and McCulloch, 1952)
• A spike train can be transformed into a binary word by discretizing time into small bins, e.g., 0 1 1 0 1 0 0 1 0 1.
  T: duration of the time window; Δt: time resolution.
• Computing the entropy of the possible spike trains tells us how informative such spike trains can be.
Entropy of spike trains (Brillouin, 1962)
• How many different binary words may occur over the N = T/Δt bins?
  N₁ = pN: number of 1's; N₀ = (1 − p)N: number of 0's; p = rΔt: firing probability per bin; r: mean spike rate.
• All possible words: N_total = N! / (N₁! N₀!), and the entropy is H_total = log₂ N_total.
• Assume T is much larger than Δt and apply the Stirling approximation ln x! ≈ x(ln x − 1):
  H_total = (1/ln 2)(ln N! − ln N₁! − ln N₀!)
          ≈ −(N/ln 2)[(N₁/N) ln(N₁/N) + (N₀/N) ln(N₀/N)]
          = −(T/Δt)(1/ln 2)[(rΔt) ln(rΔt) + (1 − rΔt) ln(1 − rΔt)].
• The entropy is linear in the length of the time window T.
Entropy rate: bits of information per second
• If the chance of a spike in a bin is small (low rate, or high sampling rate), p = rΔt ≪ 1, then ln(1 − rΔt) ≈ −rΔt and the entropy rate of the temporal (timing) code can be approximated as:
  H_total / T = −(1/(Δt ln 2))[(rΔt) ln(rΔt) + (1 − rΔt) ln(1 − rΔt)] ≈ r log₂(e/(rΔt))
• Eg. r ≈ 50 spikes/s gives 5.76 bits per spike (288 bit/s).
  [Figure: entropy rate as a function of Δt (ms).]
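The exact entropy rate and the small-p approximation can be compared numerically. The bin width Δt is not stated on the slide; Δt = 1 ms is an assumption, chosen because it reproduces the quoted ~288 bit/s:

```python
import numpy as np

def entropy_rate(r, dt):
    """Exact per-bin entropy of an independent-bins spike train, in bits/s."""
    p = r * dt                               # spike probability per bin
    h_bin = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return h_bin / dt

r, dt = 50.0, 1e-3                           # 50 spikes/s; 1-ms bins (an assumed dt)
exact = entropy_rate(r, dt)
approx = r * np.log2(np.e / (r * dt))        # small-p approximation, bits/s
print(exact, approx, approx / r)             # approx ~288 bit/s, ~5.76 bits per spike
```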
Entropy of spike count distribution
• Entropy of the spike count distribution (rate code):
  H(spike count) = −Σₙ p(n) log₂ p(n),
  where p(n) is the probability of observing n spikes in the time window T.
• What should we choose for p(n)? We know only two constraints on p(n):
  1. probability distributions must be normalized: Σₙ p(n) = 1;
  2. the average spike count in T should be ⟨n⟩ = rT.
• We cannot determine p(n) uniquely, but we can obtain the p(n) that maximizes the entropy.
Entropy of spike count distribution
• The entropy for spike counts is maximized by an exponential distribution:
  p(n) = (1/Z) exp(−λn), with λ = log(1 + 1/(rT)).
• The entropy is then:
  H = log₂(1 + rT) + rT log₂(1 + 1/(rT))
Conditional entropy and mutual information
• Conditional entropy:
  H_{r|s} = E[−log₂ p(r | s)]
  discrete: H_{r|s} = −Σ p(r | s) log₂ p(r | s), averaged over s
  continuous: H_{r|s} = −∫ p(r | s) log₂ p(r | s) dr, averaged over s
• Mutual information:
  I(r, s) = H_r − H_{r|s} = I(s, r) ≥ 0
  [Venn diagram: H_r and H_s overlap in I(s, r), with H_{r|s} and H_{s|r} as the non-overlapping parts.]
• The entropy H_r represents the uncertainty about the response in the absence of any other information.
• The conditional entropy H_{r|s} represents the remaining uncertainty about the response for a fixed stimulus s.
• I(r, s) is the mutual information between s and r, representing the reduction of uncertainty in r achieved by measuring s.
• If r and s are statistically independent, then I(r, s) = 0.
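Mutual information from a small joint distribution, computed directly as I = H_r − H_{r|s} (the 2×2 joint table is an arbitrary example):

```python
import numpy as np

def entropy(p):
    # plug-in entropy in bits; zero-probability entries contribute nothing
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# joint distribution of stimulus s (rows) and response r (columns)
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p_s, p_r = joint.sum(axis=1), joint.sum(axis=0)

H_r = entropy(p_r)
H_r_given_s = sum(p_s[i] * entropy(joint[i] / p_s[i]) for i in range(2))
I = H_r - H_r_given_s
print(H_r, H_r_given_s, I)

# statistical independence implies zero mutual information
indep = np.outer(p_s, p_r)
I0 = H_r - sum(p_s[i] * entropy(indep[i] / p_s[i]) for i in range(2))
print(I0)   # 0.0
```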
Reproducibility and variability in neural spike trains (van Steveninck et al., Science, 1997)
• Calliphora erythrocephala, movement-sensitive neuron (H1).
• Dynamic stimuli (natural condition), a random walk with diffusion constant 14 degrees²/s: ordered firing patterns (high reproducibility).
• Static stimuli (artificial condition): irregular firing patterns (low reproducibility, Poisson-like patterns).
Reproducibility and variability in neural spike trains (van Steveninck et al., Science, 1997)
• Dynamic stimuli (natural condition): low spike count variance as a function of the mean count.
• Static stimuli (artificial condition): high spike count variance (Poisson-like mean-variance relationship).
• Does more precise spike timing convey more information about the input stimuli?
Quantifying information transfer (van Steveninck et al., Science, 1997)
• Procedure (30-ms windows over ~100 displayed responses):
  1. At each time t, divide the spike trains into 10 contiguous 3-ms bins and construct the local word frequency P(W | t).
  2. Stepping in 3-ms bins, ~3 × 10⁶ words are sampled (900 trials, 10 s), yielding a distribution over ~1500 distinct words:
  P(W) = ⟨P(W | t)⟩_t
Quantifying information transfer (van Steveninck et al., Science, 1997)
• Total entropy of the spike trains:
  H_total = −Σ_W P(W) log₂ P(W) bits
• Conditional (noise) entropy of the neuronal response given the stimulus:
  H_noise = ⟨−Σ_W P(W | t) log₂ P(W | t)⟩_t bits
• Transmitted information (mutual information between W and t):
  I = H_total − H_noise bits
• H1's responses to dynamic stimuli: H_total = 5.05 bits, H_noise = 2.62 bits, I = 2.43 bits.
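A toy version of this direct method: build P(W | t) from repeated trials of a synthetic spike raster, then compare the total and noise word entropies. All sizes and rates are made up and far smaller than the paper's 900 trials:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
n_trials, n_times, word_len = 200, 300, 10   # toy sizes (the paper used 900 trials, 10 s)

# toy responses: the spike probability per bin is time-locked to a repeated stimulus
p_t = 0.1 + 0.35 * (rng.random(n_times + word_len) > 0.5)
spikes = rng.random((n_trials, n_times + word_len)) < p_t

def word_entropy(counter, total):
    # plug-in entropy (bits) of an empirical word distribution
    p = np.array(list(counter.values())) / total
    return float(-np.sum(p * np.log2(p)))

all_words, h_local = Counter(), []
for t in range(n_times):
    local = Counter(tuple(row) for row in spikes[:, t:t + word_len])  # P(W | t)
    all_words.update(local)
    h_local.append(word_entropy(local, n_trials))

H_total = word_entropy(all_words, n_trials * n_times)   # entropy of P(W) = <P(W|t)>_t
H_noise = float(np.mean(h_local))                       # average conditional entropy
print(H_total, H_noise, H_total - H_noise)              # I = H_total - H_noise
```

By concavity of the entropy, the mixture entropy H_total always exceeds the average conditional entropy H_noise, so the estimated information I is non-negative by construction.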
Comparison with simulated spike trains
• Simulate spike trains by a modulated Poisson process that
  1. has the correct dynamics of the firing rate of the responses to dynamic stimuli,
  2. but follows the mean-variance relation of static stimuli (mean = variance).
• Simulated responses: H_total = 5.17 bits, H_noise = 4.22 bits, I = 0.95 bits.
• H1's responses to dynamic stimuli: H_total = 5.05 bits, H_noise = 2.62 bits, I = 2.43 bits: more than twice as much information.
• Models that accurately account for the H1 neural response to static stimuli can significantly underestimate the signal transfer under more natural conditions.
Summary
• Statistical inference
– Maximum likelihood inference
– Bayesian inference
– Bayesian approach to neural decoding
problems
• Information theory
– Information amount and entropy
– Information theoretic approach to a neural
encoding system