Transcript Slide 1

Gibbs sampling
The Motif Finding Problem
• Given a set of DNA sequences:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
• Find the motif in each of the individual
sequences
The Motif Finding Problem
• If starting positions s=(s1, s2,… st) are
given, finding consensus is easy because
we can simply construct (and evaluate) the
profile to find the motif.
• But… the starting positions s are usually
not given. How can we find the “best”
profile matrix?
– Gibbs sampling
– Expectation-Maximization algorithm
Notations
•
•
•
•
•
•
Set of symbols: 
Sequences: S = {S1, S2, …, SN}
Starting positions of motifs: A = {a1, a2, …, aN}
Motif model ( ) : qij = P(symbol at the i-th position = j)
Background model: pj = P(symbol = j)
Count of symbols in each column: cij= count of symbol, j,
in the i-th column in the aligned region
Motif Finding Problem
• Problem: find starting positions and model
parameters simultaneously to maximize the
posterior probability:
max , A P( , A | S )
• This is equivalent to maximizing the likelihood
by Bayes’ Theorem, assuming uniform prior
distribution:
max , A P(S | A, )
Equivalent Scoring Function
• Maximize the log-odds ratio:
W
||
i 1
j 1
P(S | A, )   q
cij
ij
W
||
i 1
j 1
P(S | A,0 )   p
W ||
qij
P(S | A, )
F  log
  cij log
P( S | A,0 ) i 1 j 1
pj
cij
j
Sampling and optimization
• To maximize a function, f(x):
– Brute force method: try all possible x
– Sample method: sample x from probability
distribution: p(x) ~ f(x)
– Idea: suppose xmax is argmax of f(x), then it is
also argmax of p(x), thus we have a high
probability of selecting xmax
Gibbs Sampling
• Idea: a joint distribution may be hard to sample
from, but it may be easy to sample from the
conditional distributions where all variables are
fixed except one
• To sample from p(x1, x2, …xn), let each state of
the Markov chain represent (x1, x2, …xn), the
probability of moving to a state (x1, x2, …xn) is:
p(xi |x1, …xi-1,xi+1,…xn). It is also called Markov
Chain Monte Carlo (MCMC) method.
Gibbs Sampling
Gibbs Sampling in Motif Finding
Randomly initialize A0;
Repeat:
(1) randomly choose a sequence z from S;
A* = At \ az;
compute θt = estimator of θ given S and A*;
(2) sample az according to P(az = x), which is
proportional to Qx/Px;
update At+1 = A* union x;
Select At that maximizes F;
Qx: the probability of generating x according to θt;
Px: the probability of generating x according to the background model
Estimator of θ
• Given an alignment A, i.e. the starting
positions of motifs, θ can be estimated by its
MLE with smoothing (e.g. Dirichlet prior with
parameter bj):
qij 
cij  b j
N 1  B
The Motif Finding Problem
• Given a set of DNA sequences:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
• Find the motif in each of the individual
sequences
Gene Regulation
TF
Gene 1
CACGTGT
TF
Gene 2
CACGTGA
TF
Gene 3
CAAGTGA
TF
Gene 4
CAGGTGA
Transcription factor
binding site, or motif
Evolutionary Conservation
CACGTGACC
CACGTGAAC
CACGTGAAC
Overview of TGS
How did the motifs
evolve?
How to find the
ancestral motif
instances?
Colored lines: regulatory regions of genes
Colored boxes: motif instances
How to find the ancestral motif
instances?
Ancestral motif profile:
1
2
3
A .036 .892 .036
C .892 .036 .892
G .036 .036 .036
T .036 .036 .036
CACGTGACC
4
5
6
7
8
9
.036 .036 .036 .892 .036 .036
.036 .036 .036 .036 .75 .75
.892 .036 .892 .036 .036 .036
.036 .892 .036 .036 .178 .178
CACGTGAAC CACGTGAAC
How did the motifs evolve?
Background substitution matrix
A
C
G
T
A
C
G
T
0.8515 0.0278 0.0775 0.0432
0.0464 0.8026 0.0344 0.1167
0.1167 0.0350 0.8023 0.0460
0.0429 0.0785 0.0264 0.8522
Motif substitution matrix
A
C
G
T
A
C
G
T
0.9802 0.0066 0.0066 0.0066
0.0120 0.9640 0.0120 0.0120
0.0120 0.0120 0.9640 0.0120
0.0066 0.0066 0.0066 0.9802
Evolution of motifs
Distant species
250
million
years
Implementation
Overview of Gibbs Sampler
Iteratively sample from conditional distribution
when other parameters are fixed.
In order to draw:
i,
draw
( X 1 , X 2 , , X n )
X i ~ ( X i | X [ i ] )
~
( X )
Implementation
Parameters

Ancestral motif weight matrix at the root

Background distribution (multinomial)
0
pi
Probability that a gene in the i-th species will contain the motif
w
Motif width
M 0i
Background substitution matrix for the i-th branch
M 1i
Motif substitution matrix for the i-th branch
Implementation
Prior distribution


0
pi
Beta(1,1)
w
Poisson distribution
M 0i
M 1i
Implementation
Initialization
Parameters are sampled using prior distributions;
Motif instances in current species are sampled from
sequences directly for each current species;
Motif instances in ancestral species are randomly
assigned with one of its immediate child motif
instances.
Implementation
Motif instance updating
A1( 0)
Updating motif instances in ancestral species
Pr(A1(0) | A[(01)] , A1(1) , A1( 2) , 0 , p1, p2 , w, M11 , M12 )
M11 M12
A1(1)
Updating motif instances in current species
Pr(A1(1) | A1(0) , S , 0 , p1 , w, M11 )
A1( 2)
Implementation
Updating motif instance in ancestral species
A
C
G
T
Ancestral Motif Weight Matrix
1
2
3
4
5
6
7
8
9
.036 .892 .036 .036 .036 .036 .892 .036 .036
.892 .036 .892 .036 .036 .036 .036 .75 .75
.036 .036 .036 .892 .036 .892 .036 .036 .036
.036 .036 .036 .036 .892 .036 .036 .178 .178
M11
CCCGTGACC
2th position
A: 0.932…
C: 0.067
G: 8.4e-6
T: 2.5e-4
M12
CACGTGAAC
M11
C
M12
A
Implementation
Updating motif instances for current species
Updated ancestral motif instance
CACTTGAAC
M11
…CACACCACGTGAGCTT...
M12
…CACATCACGTGAACTT…
Multiple Species?
?
CAGGTGATC
CACGTGAAC
CACGTGAAC
CACGTGAAC CACGTGATC CAGGTGATC