Motif Finding - Bilkent University

Motif Finding
PSSMs
Expectation Maximization
Gibbs Sampling
Complexity of Transcription
Representing Binding Sites
for a TF

A single site
AAGTTAATGA


A set of sites represented as a consensus
VDRTWRWWSHD (IUPAC degenerate DNA)


A matrix describing a set of sites
A   14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C    3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G    4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T    0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
Set of
binding
sites
AAGTTAATGA
CAGTTAATAA
GAGTTAAACA
CAGTTAATTA
GAGTTAATAA
CAGTTATTCA
GAGTTAATAA
CAGTTAATCA
AGATTAAAGA
AAGTTAACGA
AGGTTAACGA
ATGTTGATGA
AAGTTAATGA
AAGTTAACGA
AAATTAATGA
GAGTTAATGA
AAGTTAATCA
AAGTTGATGA
AAATTAATGA
ATGTTAATGA
AAGTAAATGA
AAGTTAATGA
AAGTTAATGA
AAATTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
Nucleic acid codes
code  description
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
R     Purine (A or G)
Y     Pyrimidine (C, T, or U)
M     C or A
K     T, U, or G
W     T, U, or A
S     C or G
B     C, T, U, or G (not A)
D     A, T, U, or G (not C)
H     A, T, U, or C (not G)
V     A, C, or G (not T, not U)
N     Any base (A, C, G, T, or U)
From frequencies to log
scores

f matrix
      1     2     3     4     5
A     5     0     1     0     0
C     0     2     2     4     0
G     0     3     1     0     4
T     0     0     1     1     1

$$ w(b,i) = \log\left( \frac{f(b,i) + s(N)}{p(b)} \right) $$

w matrix
      1     2     3     4     5
A    1.6  -1.7  -0.2  -1.7  -1.7
C   -1.7   0.5   0.5   1.3  -1.7
G   -1.7   1.0  -0.2  -1.7   1.3
T   -1.7  -1.7  -0.2  -0.2  -0.2

TGCTG = 0.9
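The step from the f matrix to the w matrix can be sketched in MATLAB. The slide does not spell out the pseudocount or the normalization, so the sketch below assumes s(N) = sqrt(N)/4 per base, a column total of N + sqrt(N), and a uniform background p(b) = 0.25. These are assumptions chosen because they reproduce the w values shown, not something stated on the slide.

```matlab
% f matrix from the slide, rows A, C, G, T
f = [5 0 1 0 0; 0 2 2 4 0; 0 3 1 0 4; 0 0 1 1 1];
N = sum(f(:, 1));            % number of sites (5)
s = sqrt(N) / 4;             % ASSUMED pseudocount per base
p = (f + s) / (N + 4 * s);   % corrected frequency, columns sum to 1
w = log2(p / 0.25);          % ASSUMED uniform background of 0.25

% Score TGCTG by picking one entry per column
[~, rows] = ismember('TGCTG', 'ACGT');
score = sum(w(sub2ind(size(w), rows, 1:5)));
disp(round(w * 10) / 10)
disp(score)
```

Under these assumptions the rounded matrix agrees with the w matrix on the slide, and the score for TGCTG comes out close to 0.9.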
TFs do not act alone
http://www.bioinformatics.ca/
PSSMs for Liver TFs…
HNF1
C/EBP
HNF3
HNF4
PSSMs for Helix-Turn-Helix Motif
Promoter…
Promoter Weight Matrices
(PWM)
E.Coli PWMs
Motif Logo
Position: 1234567



Motifs can mutate on
less important bases.
The five motifs at top
right have mutations
in position 3 and 5.
Representations
called motif logos
illustrate the
conserved regions of
a motif.
http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA
Example: Calmodulin-Binding Motif (calcium-binding proteins)
Sequence Motifs
• Motifs represent a short common sequence
– Regulatory motifs (TF binding sites)
– Functional site in proteins (DNA binding motif)
http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html
Regulatory Motifs

Transcription Factors bind to
regulatory motifs



Motifs are 6 – 20 nucleotides long
Activators and repressors
Usually located near target gene, mostly
upstream
Challenges



How to recognize a regulatory motif?
Can we identify new occurrences of
known motifs in genome sequences?
Can we discover new motifs within
upstream sequences of genes?
Motif Representation

Exact motif: CGGATATA
Consensus: represent only
deterministic nucleotides.
Example: HAP1 binding
sites in 5 sequences.

CGGATATACCGG
CGGTACTAACGG
CGGTGATAGCGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG

consensus motif: CGGNNNTANCGG
N stands for any nucleotide.
Representing only
consensus loses
information. How can this
be avoided?
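As a minimal sketch of how such a consensus can be derived, the following MATLAB snippet keeps a base only where all five HAP1 sites agree and writes N elsewhere. The variable names are illustrative.

```matlab
% The five HAP1 binding sites from the slide, one per row
sites = ['CGGATATACCGG'; 'CGGTACTAACGG'; 'CGGTGATAGCGG'; ...
         'CGGCGGTAACGG'; 'CGGCCCTAACGG'];

consensus = repmat('N', 1, size(sites, 2));   % start with all N
for i = 1:size(sites, 2)
    col = sites(:, i);
    if all(col == col(1))                     % column fully conserved
        consensus(i) = col(1);
    end
end
disp(consensus)
```

This strict rule discards how often each base occurs in the variable columns, which is exactly the information a matrix representation keeps.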
PSPM – Position Specific
Probability Matrix

Represents a motif of length k (5)
Count the number of occurrences of each
nucleotide in each position

      1     2     3     4     5
A    10    25     5    70    60
C    30    25    80    10    15
T    50    25     5    10     5
G    10    25    10    10    20
PSPM – Position Specific
Probability Matrix

Defines Pi{A,C,G,T} for i={1,..,k}.

Pi (A) – frequency of nucleotide A in position i.

      1     2     3     4     5
A    0.1   0.25  0.05  0.7   0.6
C    0.3   0.25  0.8   0.1   0.15
T    0.5   0.25  0.05  0.1   0.05
G    0.1   0.25  0.1   0.1   0.2
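Going from the count matrix to the frequency matrix is a column-wise normalization. A short MATLAB sketch, using the counts from the slides with rows in the slides' order A, C, T, G:

```matlab
% Counts per position, rows A, C, T, G (order used on the slides)
counts = [10 25  5 70 60;
          30 25 80 10 15;
          50 25  5 10  5;
          10 25 10 10 20];

% Divide every column by its total so each column sums to 1
P = counts ./ repmat(sum(counts, 1), 4, 1);
disp(P)
```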
Identification of Known Motifs
within Genomic Sequences

Motivation:



identification of new genes controlled by the
same TF.
Infer the function of these genes.
enable better understanding of the regulation
mechanism.
PSPM – Position Specific
Probability Matrix

Each k-mer is assigned a probability.

Example: P(TCCAG)=0.5*0.25*0.8*0.7*0.2=0.014

      1     2     3     4     5
A    0.1   0.25  0.05  0.7   0.6
C    0.3   0.25  0.8   0.1   0.15
T    0.5   0.25  0.05  0.1   0.05
G    0.1   0.25  0.1   0.1   0.2
Detecting a Known Motif within a
Sequence using PSPM

The PSPM is moved along the query sequence.
At each position the sub-sequence is scored for a match to the
PSPM.
Example:
sequence = ATGCAAGTCT…

      1     2     3     4     5
A    0.1   0.25  0.05  0.7   0.6
C    0.3   0.25  0.8   0.1   0.15
T    0.5   0.25  0.05  0.1   0.05
G    0.1   0.25  0.1   0.1   0.2

Position 1: ATGCA
0.1*0.25*0.1*0.1*0.6=1.5*10^-4
Position 2: TGCAA
0.5*0.25*0.8*0.7*0.6=0.042
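The sliding-window scan can be sketched in MATLAB as follows. Only the ten bases shown on the slide are used (the sequence is truncated there), and the row order A, C, T, G follows the slides.

```matlab
% PSPM from the slides, rows A, C, T, G
P = [0.1 0.25 0.05 0.7 0.6;
     0.3 0.25 0.8  0.1 0.15;
     0.5 0.25 0.05 0.1 0.05;
     0.1 0.25 0.1  0.1 0.2];
alphabet = 'ACTG';
seq = 'ATGCAAGTCT';          % the visible part of the example sequence

k = size(P, 2);              % motif length
nwin = length(seq) - k + 1;  % number of window positions
prob = zeros(1, nwin);
for s = 1:nwin
    window = seq(s:s+k-1);
    p = 1;
    for i = 1:k
        row = find(alphabet == window(i));
        p = p * P(row, i);   % multiply the per-position probabilities
    end
    prob(s) = p;
end
disp(prob)
```

The first two entries of prob correspond to the products worked out for positions 1 and 2.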
Detecting a Known Motif within a
Sequence using PSSM


Is it a random match, or is it indeed an
occurrence of the motif?
PSPM -> PSSM (Position Specific
Scoring Matrix)

odds score matrix: Oi(n), where n ∈ {A,C,G,T}
for i={1,..,k},
defined as Pi(n)/P(n), where P(n) is the background
frequency.
Oi(n) increases => higher odds that n at
position i is part of a real motif.
PSSM as Odds Score Matrix

Assumption: the background frequency of each nucleotide is
0.25.
Original PSPM (Pi):
      1       2       3       4       5
A    0.1     0.25    0.05    0.7     0.6

Odds Matrix (Oi):
      1       2       3       4       5
A    0.4     1       0.2     2.8     2.4

Going to log scale we get an additive score,
Log odds Matrix (log2Oi):
      1       2       3       4       5
A   -1.322   0      -2.322   1.485   1.263
Calculating using Log Odds Matrix

Log odds ≤ 0 implies random match;
Log odds > 0 implies real match (?).
Example: sequence = ATGCAAGTCT…
Position 1: ATGCA
-1.32+0-1.32-1.32+1.26=-2.7
odds = 2^-2.7 = 0.15
Position 2: TGCAA
1+0+1.68+1.48+1.26=5.42
odds = 2^5.42 = 42.8

      1      2      3      4      5
A   -1.32   0     -2.32   1.48   1.26
C    0.26   0      1.68  -1.32  -0.74
T    1      0     -2.32  -1.32  -2.32
G   -1.32   0     -1.32  -1.32  -0.32
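A minimal MATLAB sketch of the same calculation, assuming the uniform background of 0.25 used on the slides and only the ten visible bases of the example sequence. Because the matrix entries are not rounded to two decimals here, the results differ slightly from the hand calculation (for position 2 the score is about 5.43 and the odds about 43.0).

```matlab
% PSPM from the slides, rows A, C, T, G
P = [0.1 0.25 0.05 0.7 0.6;
     0.3 0.25 0.8  0.1 0.15;
     0.5 0.25 0.05 0.1 0.05;
     0.1 0.25 0.1  0.1 0.2];
L = log2(P / 0.25);                  % log-odds matrix

seq = 'ATGCAAGTCT';
[~, rows] = ismember(seq, 'ACTG');   % row index of every base
k = size(L, 2);
nwin = length(seq) - k + 1;
score = zeros(1, nwin);
for s = 1:nwin
    % one entry per column: row = base, column = position in the window
    idx = sub2ind(size(L), rows(s:s+k-1), 1:k);
    score(s) = sum(L(idx));          % log-odds scores add up
end
odds = 2 .^ score;                   % back to odds
disp([score; odds])
```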
Calculating the probability of a match

ATGCAAG
Position 1 ATGCA = 0.15
Position 2 TGCAA = 42.8
Position 3 GCAAG = 0.18
P (i) = S / (∑ S), where S is the odds score at position i
Example 0.15 /(0.15+42.8+0.18)=0.003
P (1) = 0.003
P (2) = 0.992
P (3) = 0.004
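The normalization is a single division in MATLAB. The sketch below uses the three rounded odds scores quoted above.

```matlab
% Odds scores for window positions 1 to 3 (rounded values from the slide)
odds = [0.15 42.8 0.18];

% Normalize so the three values sum to 1
Pmatch = odds / sum(odds);
fprintf('P(%d) = %.3f\n', [1:3; Pmatch]);
```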
Building a PSSM




Collect all known sequences that bind a
certain TF.
Align all sequences (using multiple
sequence alignment).
Compute the frequency of each
nucleotide in each position (PSPM).
Incorporate background frequency for
each nucleotide (PSSM).
Finding new Motifs



We are given a group of genes, which
presumably contain a common
regulatory motif.
We know nothing of the TF that binds
to the putative motif.
The problem: discover the motif.
Example
Predicting the cAMP Receptor Protein (CRP)
binding site motif
Extract experimentally defined CRP Binding Sites
GGATAACAATTTCACA
AGTGTGTGAGCGGATAACAA
AAGGTGTGAGTTAGCTCACTCCCC
TGTGATCTCTGTTACATAG
ACGTGCGAGGATGAGAACACA
ATGTGTGTGCTCGGTTTAGTTCACC
TGTGACACAGTGCAAACGCG
CCTGACGGAGTTCACA
AATTGTGAGTGTCTATAATCACG
ATCGATTTGGAATATCCATCACA
TGCAAAGGACGTCACGATTTGGG
AGCTGGCGACCTGGGTCATG
TGTGATGTGTATCGAACCGTGT
ATTTATTTGAACCACATCGCA
GGTGAGAGCCATCACAG
GAGTGTGTAAGCTGTGCCACG
TTTATTCCATGTCACGAGTGT
TGTTATACACATCACTAGTG
AAACGTGCTCCCACTCGCA
TGTGATTCGATTCACA
Create a Multiple Sequence Alignment
GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA
Generate a PSSM

Pos      A       C       G       T
 1     -0.43    0.1    -0.46    0.55
 2      1.37    0.12   -1.59  -11.2
 3      1.69   -1.28  -11.2    -1.43
 4     -1.28    0.12  -11.2     1.32
 5      0.91  -11.2    -0.46    0.47
 6      1.53   -1.38   -1.48   -1.43
 7      0.9    -0.48  -11.2     0.12
 8     -1.37   -1.28  -11.2     1.68
 9    -11.2   -11.2     1.73   -0.56
10    -11.2    -0.51  -11.2     1.72
11     -0.48  -11.2     1.72  -11.2
12      1.56   -1.59  -11.2    -0.46
13     -0.51   -0.38   -0.55    0.88
14    -11.2     0.5     0.57    0.13
15      0.17   -0.51    0.12    0.12
16      0.9   -11.2     0.5    -0.48
17      0.17    0.16    0.06   -0.48
18     -0.4    -0.38    0.82   -0.48
19     -1.38   -1.28  -11.2     1.68
20     -1.48    1.7   -11.2    -1.38
21      1.5    -1.38   -1.43   -1.28
Shannon Entropy

Expected variation per column can be
calculated

Low entropy means higher
conservation
Entropy

The entropy (H) for a column is:
$$ H = -\sum_{a \in \text{residues}} f_a \log(p_a) $$
a: a residue,
f_a: frequency of residue a in a column,
p_a: probability of residue a in that column
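As an illustration, the column entropies of the 5-position PSPM used earlier can be computed in MATLAB. The sketch assumes log base 2 (so the result is in bits) and takes the observed frequency as the probability estimate, i.e. f_a = p_a. Both are assumptions, since the formula leaves them open.

```matlab
% PSPM from the earlier slides, rows A, C, T, G
P = [0.1 0.25 0.05 0.7 0.6;
     0.3 0.25 0.8  0.1 0.15;
     0.5 0.25 0.05 0.1 0.05;
     0.1 0.25 0.1  0.1 0.2];

T = P .* log2(P);
T(P == 0) = 0;        % treat 0*log2(0) as 0 (MATLAB would give NaN)
H = -sum(T, 1);       % entropy of each column in bits
disp(H)
```

Column 2, where all four bases are equally likely, reaches the maximum of 2 bits; the more conserved columns score lower.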
Entropy

entropy measures can determine which
evolutionary distance (PAM250,
BLOSUM80, etc) should be used

Entropy yields amount of information
per column (discussed with sequence
logos in a bit)
Log-odds score

Profiles can also indicate log-odds
score:


Log2(observed:expected)
Result is a bit score
Matlab
multialign
1 Enter an array of sequences.
seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};

2 Promote terminations with gaps in the alignment.
multialign(seqs,'terminalGapAdjust',true)
ans =
--CACGTAACATCTC--
ACGACGTAACATCTTCT
-AAACGTAACATCTCGC
Matlab
3 Compare alignment without termination
gap adjustment.
multialign(seqs)
ans =
CA--CGTAACATCT--C
ACGACGTAACATCTTCT
AA-ACGTAACATCTCGC
Matlab
>> a={'ATATAGGAG','AATTATAGA','TTAGAGAAA'}
a =
    'ATATAGGAG'    'AATTATAGA'    'TTAGAGAAA'
Char function
>> cseq=char(a)
cseq =
ATATAGGAG
AATTATAGA
TTAGAGAAA
Double function
>> intseq=double(cseq)
intseq =
    65    84    65    84    65    71    71    65    71
    65    65    84    84    65    84    65    71    65
    84    84    65    71    65    71    65    65    65
double
>> double('A')
ans =
65
>> double('C')
ans =
67
>> double('G')
ans =
71
>> double('T')
ans =
84
Initialize PSPM matrix
>> Pspm=zeros(4,size(intseq,2))   % rows A,C,G,T; one column per position
Pspm =
     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0
Use a for loop to count each
nucleotide at each position
>> for i = 1:size(intseq,2)
Pspm(1,i)=length(find(intseq(:,i)==65));   % A
Pspm(2,i)=length(find(intseq(:,i)==67));   % C
Pspm(3,i)=length(find(intseq(:,i)==71));   % G
Pspm(4,i)=length(find(intseq(:,i)==84));   % T
end
>> Pspm
Pspm =
     2     1     2     0     3     0     2     2     2
     0     0     0     0     0     0     0     0     0
     0     0     0     1     0     2     1     1     1
     1     2     1     2     0     1     0     0     0
Add pseudocounts
>> Pspmp=Pspm+1
Pspmp =
     3     2     3     1     4     1     3     3     3
     1     1     1     1     1     1     1     1     1
     1     1     1     2     1     3     2     2     2
     2     3     2     3     1     2     1     1     1
Normalize to get frequencies
>> Pspmnorm=Pspmp./repmat(sum(Pspmp),4,1)
Pspmnorm =
  Columns 1 through 7
    0.4286    0.2857    0.4286    0.1429    0.5714    0.1429    0.4286
    0.1429    0.1429    0.1429    0.1429    0.1429    0.1429    0.1429
    0.1429    0.1429    0.1429    0.2857    0.1429    0.4286    0.2857
    0.2857    0.4286    0.2857    0.4286    0.1429    0.2857    0.1429
  Columns 8 through 9
    0.4286    0.4286
    0.1429    0.1429
    0.2857    0.2857
    0.1429    0.1429
Calculate odds score
>> Pswm=Pspmnorm/0.25
Pswm =
  Columns 1 through 7
    1.7143    1.1429    1.7143    0.5714    2.2857    0.5714    1.7143
    0.5714    0.5714    0.5714    0.5714    0.5714    0.5714    0.5714
    0.5714    0.5714    0.5714    1.1429    0.5714    1.7143    1.1429
    1.1429    1.7143    1.1429    1.7143    0.5714    1.1429    0.5714
  Columns 8 through 9
    1.7143    1.7143
    0.5714    0.5714
    1.1429    1.1429
    0.5714    0.5714
Log odds ratio
>> logPswm=log2(Pswm)
logPswm =
  Columns 1 through 7
    0.7776    0.1926    0.7776   -0.8074    1.1926   -0.8074    0.7776
   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074
   -0.8074   -0.8074   -0.8074    0.1926   -0.8074    0.7776    0.1926
    0.1926    0.7776    0.1926    0.7776   -0.8074    0.1926   -0.8074
  Columns 8 through 9
    0.7776    0.7776
   -0.8074   -0.8074
    0.1926    0.1926
   -0.8074   -0.8074
Score how well a given sequence
matches the defined
PSWM
>> Unknown='TTAAGAAGG'
Unknown =
TTAAGAAGG
>> intunknown=double(Unknown)
intunknown =
    84    84    65    65    71    65    65    71    71
Get the row index of the PSWM for
each base of the unknown sequence
>> intunknown(intunknown==65)=1;   % A -> row 1
>> intunknown(intunknown==67)=2;   % C -> row 2
>> intunknown(intunknown==71)=3;   % G -> row 3
>> intunknown(intunknown==84)=4;   % T -> row 4
>> intunknown
intunknown =
     4     4     1     1     3     1     1     3     3
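Calculate the log odds-ratio of
the Unknown 'TTAAGAAGG'
>> % pair each row index with its column; logPswm(intunknown) alone
>> % would use linear indexing and read only column 1
>> idx=sub2ind(size(logPswm),intunknown,1:length(intunknown));
>> logunknown=logPswm(idx)
logunknown =
  Columns 1 through 7
    0.1926    0.7776    0.7776   -0.8074   -0.8074   -0.8074    0.7776
  Columns 8 through 9
    0.1926    0.1926
>> Punknown=sum(logunknown)
Punknown =
    0.4887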
Is this a significant score or just
random similarity?
>> cseq
cseq =
ATATAGGAG
AATTATAGA
TTAGAGAAA
>> Unknown
Unknown =
TTAAGAAGG
What would be the maximum
score?
>> logPswm
logPswm =
  Columns 1 through 7
    0.7776    0.1926    0.7776   -0.8074    1.1926   -0.8074    0.7776
   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074
   -0.8074   -0.8074   -0.8074    0.1926   -0.8074    0.7776    0.1926
    0.1926    0.7776    0.1926    0.7776   -0.8074    0.1926   -0.8074
  Columns 8 through 9
    0.7776    0.7776
   -0.8074   -0.8074
    0.1926    0.1926
   -0.8074   -0.8074
>> maxscore=max(logPswm)
maxscore =
  Columns 1 through 7
    0.7776    0.7776    0.7776    0.7776    1.1926    0.7776    0.7776
  Columns 8 through 9
    0.7776    0.7776
>> totalmaxscore=sum(maxscore)
totalmaxscore =
    7.4135
Write a function using the above
statements to scan a sequence

Write a function named ‘logodds’ that
calculates the log-odds matrix of a given
alignment.
Write a function named ‘scanmotif’ that calls
‘logodds’ and searches through a sequence
with a sliding window, calculating the
log-odds score of each subsequence and storing these
scores. The function should allow for
selection of a maximum number of locations
that are likely to contain the motif based on
the scores obtained.
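One possible shape for these two functions is sketched below; it is a starting point under stated assumptions, not a reference solution. It assumes sequences contain only A, C, G and T, a pseudocount of 1 and a uniform background of 0.25 as in the session above, and the 18-base test sequence is made up for illustration. In MATLAB R2016b or later the functions can sit at the end of a script as shown; in older versions or in Octave they would go into separate files logodds.m and scanmotif.m.

```matlab
aln = char({'ATATAGGAG','AATTATAGA','TTAGAGAAA'});
% Hypothetical test sequence: the Unknown followed by the first aligned site
[pos, scores] = scanmotif('TTAAGAAGGATATAGGAG', aln, 2);
disp(pos)            % start positions of the best-scoring windows
disp(scores(pos))    % their log-odds scores

function W = logodds(aln, bg)
% LOGODDS  Log-odds matrix (rows A,C,G,T) of a char matrix of aligned sites.
    letters = 'ACGT';
    counts = zeros(4, size(aln, 2));
    for r = 1:4
        counts(r, :) = sum(aln == letters(r), 1);   % count per column
    end
    counts = counts + 1;                            % pseudocount of 1
    freq = counts ./ repmat(sum(counts, 1), 4, 1);
    W = log2(freq / bg);
end

function [pos, scores] = scanmotif(seq, aln, nbest)
% SCANMOTIF  Score every window of seq and return the nbest start positions.
    W = logodds(aln, 0.25);
    k = size(W, 2);
    nwin = length(seq) - k + 1;
    [~, rows] = ismember(seq, 'ACGT');              % base -> row index
    scores = zeros(1, nwin);
    for s = 1:nwin
        idx = sub2ind(size(W), rows(s:s+k-1), 1:k); % one entry per column
        scores(s) = sum(W(idx));
    end
    [~, order] = sort(scores, 'descend');
    pos = order(1:min(nbest, nwin));
end
```

Each base is looked up in the column of its own position via sub2ind, so the window score is the sum of one matrix entry per column.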
Position Specific Scoring Matrix
(PSSM)

incorporate information theory to
indicate information contained within
each column of a multiple alignment.

information is a logarithmic
transformation of the frequency of each
residue in the motif
PSSMs and Pseudocounts

Problem: PSSMs are only as good as
the initial msa



Some residues may be underrepresented
Other columns may be too conserved
Solution: Introduce Pseudocounts to
get a better indication
Pseudocounts

New estimated probability:
$$ P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c} $$

Pca: Probability of residue a in column c
nca: count of a’s in column c
bca: pseudocount of a’s in column c
Nc: total count in column c
Bc: total pseudocount in column c
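As a worked example, take column 5 of the three-sequence alignment from the Matlab session (ATATAGGAG, AATTATAGA, TTAGAGAAA), where all three sequences have an A, and assume a pseudocount of 1 per nucleotide as used there, so that $b_{ca} = 1$ and $B_c = 4$:

```latex
P_{5,A} = \frac{n_{5,A} + b_{5,A}}{N_5 + B_5} = \frac{3 + 1}{3 + 4} = \frac{4}{7} \approx 0.5714,
\qquad
P_{5,C} = \frac{0 + 1}{3 + 4} = \frac{1}{7} \approx 0.1429
```

The unobserved C no longer gets probability zero, so its log-odds score stays finite.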
PSSMs and pseudocounts

probabilities converted into a log-odds
form (usually log2 so the information
can be reported in bits) and placed in
the PSSM.
Searching PSSMs

value for the first residue in the
sequence occurring in the first column
is calculated by searching the PSSM

the value for the residue occurring in
each column is calculated
Searching PSSMs

values are added (since they are logarithms)
to produce a summed log odds score, S

S can be converted to an odds score using
the formula 2^S

odds scores for each position can be
summed together and normalized to produce
a probability of the motif occurring at each
location.
Information in PSSMs

Information theory: amount of information
contained within each sequence.

No information: amount of uncertainty can
be measured as log2(20) = 4.32 for amino
acids, since there are 20 amino acids. For
nucleic acid sequences, the amount of
uncertainty can be measured as log2(4) = 2.
Information in PSSMs

If a column is completely conserved
then the uncertainty is 0 – there is only
one choice.

If two residues occur with equal
probability, there is uncertainty in deciding
which residue it is (1 bit when using log2).
Measure of Uncertainty

Measured as the entropy
$$ H_c = -\sum_{a \in \text{residues}} f_{ac} \log(p_{ac}) $$
Relative Entropy



Relative entropy takes into account the
overall composition of the organism
being studied
$$ R_c = \sum_{a \in \text{residues}} f_{ac} \log_2(p_{ac} / b_a) $$
b_a is the background frequency of residue
a in the organism
PSSM Uncertainty

Uncertainty for whole model is summed
over all columns:
$$ H = \sum_{c \in \text{all columns}} H_c $$
Sequence Logos

Information in PSSMs can be viewed visually

Sequence logos illustrate information in each
column of a motif

height of logo is calculated as the amount by
which uncertainty has been decreased
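A small MATLAB sketch of this idea for DNA, assuming the maximum uncertainty log2(4) = 2 bits as the starting point and ignoring any small-sample correction. The three toy columns are illustrative, not taken from the slides.

```matlab
% Toy frequency columns (rows A, C, G, T):
% fully conserved, two equally likely bases, all four equally likely
P = [1   0.5 0.25;
     0   0.5 0.25;
     0   0   0.25;
     0   0   0.25];

T = P .* log2(P);
T(P == 0) = 0;          % treat 0*log2(0) as 0
H = -sum(T, 1);         % uncertainty per column in bits
I = log2(4) - H;        % decrease in uncertainty = total logo height
disp([H; I])
```

The fully conserved column has zero uncertainty and the full height of 2 bits, the two-base column has 1 bit of each, and the uniform column carries no information.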
Sequence Logos
Statistical Methods

Commonly used methods for locating
motifs:


Expectation-Maximization (EM)
Gibbs Sampling
Expectation-Maximization

Begin with set of sequences with an
unknown signal in common



Signal may be subtle
Approximate length of signal must be
given
Randomly assign locations of this motif
in each sequence
Expectation-Maximization

Two steps:


Expectation Step
Maximization Step
Expectation-Maximization

Expectation step



Residue Frequencies for each position
calculated
Residues not in a motif are background
Frequencies used to determine
probability of finding site at any position
in a sequence to fit motif model
Maximization Step



Determine location for each sequence
that maximally aligns to the motif
pattern
Once a new motif location is found for each
sequence, the motif pattern is revised in
the expectation step
E-M continues until solution converges
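The loop described on these slides can be sketched in MATLAB as below. This is a simplified, hard-assignment variant that follows the slide wording (frequencies from the current sites, then the best site per sequence). It assumes a fixed uniform background of 0.25 instead of estimating it from non-motif residues, a pseudocount of 1, a motif width of 6, and an iteration cap of 50; it uses the first four example sequences and is not how MEME is implemented.

```matlab
seqs = {'TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT', ...
        'CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG', ...
        'TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG', ...
        'AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC'};
w = 6;                                   % assumed motif width
n = numel(seqs);
idx = cell(1, n);
start = zeros(1, n);
for s = 1:n
    [~, idx{s}] = ismember(seqs{s}, 'ACGT');     % bases -> rows 1..4
    start(s) = randi(length(seqs{s}) - w + 1);   % random initial site
end

for iter = 1:50                          % assumed iteration cap
    % Frequencies for each motif position from the current sites
    counts = ones(4, w);                 % pseudocount of 1
    for s = 1:n
        lin = sub2ind([4 w], idx{s}(start(s):start(s)+w-1), 1:w);
        counts(lin) = counts(lin) + 1;
    end
    W = log2((counts ./ repmat(sum(counts, 1), 4, 1)) / 0.25);

    % Best-scoring site in every sequence under the current model
    newstart = start;
    for s = 1:n
        nwin = length(idx{s}) - w + 1;
        sc = zeros(1, nwin);
        for p = 1:nwin
            sc(p) = sum(W(sub2ind([4 w], idx{s}(p:p+w-1), 1:w)));
        end
        [~, newstart(s)] = max(sc);
    end
    if isequal(newstart, start)
        break                            % sites stopped moving
    end
    start = newstart;
end
disp(start)
```

Because the initial sites are random, different runs can end at different site sets; in practice several restarts are compared.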
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA
CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT
TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC
GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC
CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG
TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA
CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC
ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGT
AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC
CCAGCACACACACTTATCCAGTGGTAAATACACATCAT
TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT
ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA
TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAG
CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA
CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA
GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT
TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT
GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT
CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG
CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCA
GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG
CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC
Residue Counts

Given motif alignment, count for each
location is calculated:
Residue Frequencies

The counts are then converted to
frequencies:
Example Maximization Step

Consider the first sequence:

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT


There are 41 residues; 41-6+1 = 36
sites to consider
MEME Software

One of three motif models:



OOPS: One expected occurrence per
sequence
ZOOPS: Zero or one expected occurrence
per sequence
TCM: Any number of occurrences of the
motif
Gibbs Sampling



Similar to E-M algorithm
Combines E-M and simulated
annealing
Goal: Find most probable pattern by
sampling from motif probabilities to
maximize ratio of model:background
probabilities
Predictive Update Step

random motif start position chosen for
all sequences except one

Initial alignment used to calculate
residue frequencies for motif and
background

similar to the Expectation Step of EM
Sampling Step

ratio of model:background probabilities
normalized and weighted

motif start position chosen based on a
random sampling with the given
weights

Different than E-M algorithm
Gibbs Sampling

process repeated until residue frequencies in
each column do not change

The sampling step is then repeated for a
different initial random alignment

Sampling allows escape from local maxima
Gibbs Sampling

Dirichlet priors (pseudocounts) are
added into the nucleotide counts to
improve performance

shifting routine shifts motif a few bases
to the left or the right

A range of motif sizes is checked
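A bare-bones Gibbs sampler along these lines might look as follows in MATLAB. It is a sketch under several assumptions: motif width 6, a pseudocount of 1 standing in for the Dirichlet prior, a fixed uniform background of 0.25, a fixed number of sweeps instead of a convergence test, and no shifting routine or width search. It uses the first four example sequences.

```matlab
seqs = {'TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT', ...
        'CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG', ...
        'TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG', ...
        'AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC'};
w = 6;                                   % assumed motif width
n = numel(seqs);
idx = cell(1, n);
start = zeros(1, n);
for s = 1:n
    [~, idx{s}] = ismember(seqs{s}, 'ACGT');     % bases -> rows 1..4
    start(s) = randi(length(seqs{s}) - w + 1);   % random initial site
end

for sweep = 1:100                        % assumed fixed number of sweeps
    for s = 1:n
        % Predictive update: model from all sequences except s
        counts = ones(4, w);             % pseudocounts
        for t = [1:s-1, s+1:n]
            lin = sub2ind([4 w], idx{t}(start(t):start(t)+w-1), 1:w);
            counts(lin) = counts(lin) + 1;
        end
        odds = (counts ./ repmat(sum(counts, 1), 4, 1)) / 0.25;

        % Sampling: model:background ratio for every window of sequence s
        nwin = length(idx{s}) - w + 1;
        ratio = zeros(1, nwin);
        for p = 1:nwin
            ratio(p) = prod(odds(sub2ind([4 w], idx{s}(p:p+w-1), 1:w)));
        end
        c = cumsum(ratio);
        % Draw a start position with probability proportional to its ratio
        start(s) = find(rand * c(end) <= c, 1);
    end
end
disp(start)
```

Unlike the E-M sketch, the new site is drawn at random in proportion to its weight rather than taken as the maximum, which is what lets the sampler leave a local maximum.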
Gibbs Sampler Web Interface

http://bayesweb.wadsworth.org/gibbs/gibbs.html
