In the search of motifs (and other hidden structures) Esko Ukkonen Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki CPM.

In the search of motifs (and
other hidden structures)
Esko Ukkonen
Department of Computer Science &
Helsinki Institute of Information Technology HIIT
University of Helsinki
CPM 2005, Jeju, 21 June 2005
Uncover a hidden structure (?)
Motif?
• a pattern that occurs unexpectedly often in
(a set of) strings
• pattern: substring, substring with gaps, string in
generalized alphabet (e.g., IUPAC), HMMs,
binding affinity matrix, cluster of binding affinity
matrices,… (= the hidden structure to be learned
from data)
• (unexpectedly: statistical modelling)
• occurrence: exact, approximate, with high
probability, …
• strings ↔ applications: bioinformatics …
Plan of the talk
1. Gapped motifs in a string
2. Founder sequence reconstruction
problem, with applications to haplotype
analysis and genotype phasing (WABI
2002, ALT 2004, WABI 2005)
3. Uncovering gene enhancer elements
1. Gapped motifs
ATT
HATTIVATTI
I#A
HATTIVATTI
Substring motifs of a string S
• string S = s1 … sn in alphabet A.
• Problem: what are the frequently occurring
(ungapped) substrings of S? Longest
substring that occurs at least q times?
• Thm: Suffix tree T(S) of S gives complete
occurrence counts of all substring motifs of
S in O(n) time (although S may have O(n^2)
substrings!)
T(S) is full text index
[Figure: suffix tree T(S) with the path for P; the leaves below it are 8, 31, …, so P occurs in S at locations 8, 31, …]
Path for P exists in T(S) ↔ P occurs in S
Counting the substring motifs
• internal nodes of T(S) ↔ repeating
substrings of S
• number of leaves of the subtree of a node
for string P = number of occurrences of P
in S
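The leaf-counting idea can be tried out on a small string. The following is a minimal sketch, not a suffix tree: it enumerates all substrings directly, so it is far slower than T(S) and only suitable for short inputs. The function names `substring_counts` and `longest_repeat` are hypothetical.

```python
from collections import defaultdict


def substring_counts(s):
    """Count the occurrences of every substring of s.

    Each suffix contributes one occurrence to each of its prefixes, which
    mirrors counting the leaves below a node of the suffix tree. This naive
    version enumerates O(n^2) substrings explicitly, unlike T(S).
    """
    counts = defaultdict(int)
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            counts[s[start:end]] += 1
    return dict(counts)


def longest_repeat(s, q=2):
    """Longest substring of s that occurs at least q times ('' if none)."""
    counts = substring_counts(s)
    candidates = [p for p, c in counts.items() if c >= q]
    return max(candidates, key=len, default="")


counts = substring_counts("hattivatti")
print(counts["t"], counts["atti"])   # occurrence counts of two substrings
print(longest_repeat("hattivatti"))  # longest substring occurring twice
```

On `hattivatti` the substring `t` occurs 4 times and the longest substring with two occurrences is `atti`.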
T(hattivatti)
[Figure: suffix tree of hattivatti; edge labels not reproduced]
Substring motifs of hattivatti
[Figure: suffix tree of hattivatti with occurrence counts 4, 2, 2, 2, 2 at its internal nodes]
Counts for the O(n) maximal motifs shown
Finding repeats in DNA
• human chromosome 3
• the first 48 999 930 bases
• 31 min cpu time (8 processors, 4 GB)
• Human genome: 3 × 10^9 bases
• T(HumanGenome) feasible
Longest repeat?
Occurrences at: 28395980, 28401554r
Length: 2559
ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaat
gctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaac
atgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcata
gtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatac
gtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatca
ccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgcca
ttctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttga
gaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtag
gttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggctttt
gttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgta
agtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaag
ggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcaga
tagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatag
tttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaatt
ctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgt
gtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattct
ctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagacttt
gctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctcttttcc
taattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgcca
gttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttatt
gagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttat
tgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattg
aggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagtta
gg
Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggat
ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcc
caagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagt
agagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccg
cccgcctcggcctcccaaagtgctgggattacaggcgt
Length: 277
Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130,
47398125
In the reversed complement at: 17858493, 41463059, 42431718,
42580925
Gapped motifs of S
• gapped pattern: P in (A ∪ {#})*
• gap symbol # matches any symbol in A
• aa##bb#b
• L(P) = occurrences of P in S
• P is called a motif of S if |L(P)| > 1 and a motif
with quorum q if |L(P)| ≥ q.
• Problem: find occurrence count |L(P)| for all
gapped motifs P of S
• a^n b a^n has exponentially many motifs (M-F.
Sagot)!
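The definitions of L(P) and of a motif with quorum q translate directly into a few lines of code. This is a minimal sketch by direct scanning; the function names and the 0-based locations are assumptions made for illustration.

```python
def occurrences(pattern, s, gap="#"):
    """Return L(P): the 0-based start positions where pattern matches s.

    The gap symbol matches any symbol; every other symbol must match exactly.
    """
    m = len(pattern)
    return [
        i
        for i in range(len(s) - m + 1)
        if all(p == gap or p == c for p, c in zip(pattern, s[i:i + m]))
    ]


def is_motif(pattern, s, q=2):
    """True if pattern has at least q occurrences in s (q = 2: plain motif)."""
    return len(occurrences(pattern, s)) >= q


print(occurrences("ATT", "HATTIVATTI"))       # ungapped pattern
print(occurrences("I#A", "HATTIVATTI"))       # one gap
print(occurrences("a###a", "aaaaabaaaaa"))    # three gaps
```

With 0-based positions, `ATT` occurs at 1 and 6, `I#A` occurs only at 4 (so it is not a motif of `HATTIVATTI`), and `a###a` occurs at 0, 2, 3, 4, 6.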
Motifs vs self-alignments
• self-alignments of S => maximal motifs
[Figure: copies of S placed against each other so that the occurrences are aligned]
Motifs vs multiple self-alignments
• self-alignments of S => maximal motifs
expand if possible
Motifs vs self-alignments
• S = aaaaabaaaaa P = a###a
• aaaaabaaaaa
aaaaabaaaaa
a###a
aaaaabaaaaa
aaaaabaaaaa
Motifs vs self-alignments
• S = aaaaabaaaaa P = a###a
• aaaaabaaaaa
aaaaabaaaaa
aaa#a#aaa
aaaaabaaaaa
aaaaabaaaaa
• aaa#a#aaa is maximal motif for this self-alignment
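Reading a motif off a self-alignment can be sketched for the simplest case, two copies of S shifted against each other. The shift of 2 used below is an assumption; it happens to reproduce the motif aaa#a#aaa from the slide. The function name `alignment_motif` is hypothetical.

```python
def alignment_motif(s, shift, gap="#"):
    """Motif given by aligning s with a copy of itself moved by `shift`.

    Columns where the two copies agree keep their symbol; all other columns
    become gaps. Leading and trailing gaps are dropped.
    """
    columns = [a if a == b else gap for a, b in zip(s, s[shift:])]
    return "".join(columns).strip(gap)


print(alignment_motif("aaaaabaaaaa", 2))
```

The two columns that contain the b of one copy disagree, which produces the two gaps.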
Maximal motifs
• multiple self-alignments of S ↔ maximal gapped
motifs of S: the unanimous columns give the
non-gap symbols of the motif
• any motif P has a unique maximal motif M(P)
(align the occurrences and maximize);
L(M(P)) = L(P) + d for a fixed shift d
• unfortunately: a^n b a^n has exponentially many
maximal motifs
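"Align the occurrences and maximize" can be sketched as follows: stack all occurrences of P, extend the stack to the left and right as far as S allows, and keep a symbol wherever the column is unanimous. This is a naive illustration with hypothetical function names, not an efficient algorithm.

```python
def occurrences(pattern, s, gap="#"):
    """0-based start positions where the gapped pattern matches s."""
    m = len(pattern)
    return [
        i
        for i in range(len(s) - m + 1)
        if all(p == gap or p == c for p, c in zip(pattern, s[i:i + m]))
    ]


def maximal_motif(pattern, s, gap="#"):
    """Return (M(P), d) such that L(M(P)) = L(P) + d."""
    locs = occurrences(pattern, s, gap)
    if not locs:
        raise ValueError("pattern does not occur in s")
    lo = -min(locs)                 # furthest common extension to the left
    hi = len(s) - 1 - max(locs)     # furthest common extension to the right
    columns = []
    for offset in range(lo, hi + 1):
        symbols = {s[i + offset] for i in locs}
        columns.append(symbols.pop() if len(symbols) == 1 else gap)
    text = "".join(columns)
    stripped = text.lstrip(gap)
    d = lo + (len(text) - len(stripped))  # shift of the motif's first symbol
    return stripped.rstrip(gap), d


print(maximal_motif("tt", "hattivatti"))
print(maximal_motif("a#t", "hattivatti"))
```

Both calls yield the maximal motif `atti`; for `tt` the shift is d = -1, since `atti` starts one position before each occurrence of `tt`.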
Blocks of maximal motifs
• aaa##b##ba
has blocks aaa, b, ba
• Lemma: Maximal substring motifs (1-block
motifs) ↔ (branching) nodes of T(S)
• Thm: Each block of a maximal motif of S is a
maximal substring motif of S, hence there are
O(n) different strings that can be used as a block
of a maximal motif.
• Cor: There are O(n^{2k-1}) different maximal motifs
with k blocks [O(n^{2k}) unrestricted motifs].
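The bound in the corollary can be read as a counting argument. The sketch below assumes that every gap has length at most n; the argument given for the unrestricted case is one possible reading, not taken from the slides.

```latex
% Maximal motifs with k blocks: each block is one of O(n) maximal substring
% motifs, and each of the k-1 gaps has one of O(n) possible lengths.
\[
  \underbrace{O(n)^{k}}_{k \text{ blocks}}
  \cdot
  \underbrace{O(n)^{k-1}}_{k-1 \text{ gap lengths}}
  = O\!\left(n^{2k-1}\right)
\]
% Unrestricted motifs with k blocks: a motif that occurs in S is fixed by the
% start and end positions of its k blocks inside one occurrence, i.e. by 2k
% positions chosen among n.
\[
  O\!\left(n^{2k}\right)
\]
```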
Counting 2-block maximal motifs
• Thm: The occurrence counts for all
maximal motifs with two blocks can be
found in (optimal) time O(n^3).
Algorithm (very simple)
[Figure: blocks X and Y separated by distance d]
2-block motif (X,d,Y)

for each maximal substring motif X                                   O(n)
    for each distance d = 1, 2, …                                    O(n)
        mark the leaves of T(S) that correspond to locations L(X) + d
        for each maximal substring motif Y,
            find the number h(Y) of marked leaves in
            its subtree in T(S)                                      O(n)
        the occurrence count of motif (X,d,Y) is h(Y)
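The marking algorithm can be imitated without a suffix tree. The sketch below finds the maximal substring motifs by brute force and replaces "marked leaves in the subtree of Y" by a set intersection, so it does not achieve the O(n^3) bound. Two assumptions: d is taken as the number of gap symbols between X and Y (so Y is looked up at L(X) + |X| + d), and the end of the string counts as a distinct following symbol. Function names are hypothetical.

```python
from collections import defaultdict


def maximal_substring_motifs(s):
    """Substrings followed by at least two distinct symbols.

    These correspond to the branching nodes of the suffix tree. The end of
    the string counts as a symbol ('$', assumed not to occur in s).
    Returns a dict mapping each such substring to its 0-based locations.
    """
    followers = defaultdict(set)
    locations = defaultdict(list)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            p = s[i:j]
            locations[p].append(i)
            followers[p].add(s[j] if j < n else "$")
    return {p: locs for p, locs in locations.items() if len(followers[p]) >= 2}


def two_block_counts(s, quorum=2):
    """Occurrence counts h(Y) for candidate 2-block motifs (X, d, Y)."""
    blocks = maximal_substring_motifs(s)
    counts = {}
    for x, x_locs in blocks.items():
        for d in range(1, len(s)):
            # "mark the leaves": positions where Y would have to start
            marked = {i + len(x) + d for i in x_locs}
            for y, y_locs in blocks.items():
                h = sum(1 for j in y_locs if j in marked)
                if h >= quorum:
                    counts[(x, d, y)] = h
    return counts


print(sorted(maximal_substring_motifs("hattivatti")))
print(two_block_counts("hattivatti")[("t", 1, "i")])
```

For `hattivatti` the maximal substring motifs are `atti`, `i`, `t`, `ti`, `tti`, and the 2-block candidate t#i is counted twice.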
Counting 2-block maximal motifs (cont)
• Thm: The occurrence counts for all maximal
motifs with two blocks can be found in (optimal)
time O(n^3).
• flexible gaps: x*y, where * = gap of any length
• Thm: The occurrence counts for all maximal
motifs with two blocks and one flexible gap can
be found in (optimal) time O(n^2).
General case
• Q1: Given q and W, does S have a motif with at
least W non-gap symbols and at least q
occurrences?
• In the k-block case, is O(n^{2k-1}) (or even better)
time possible?
• related work: A. Apostolico, M-F. Sagot, L.
Parida, N. Pisanti, …
2. Founder reconstruction and
applications
Haplotype evolution: founders and iterated recombinations
• WABI 2002
[Figure: founder haplotypes and the current (observed) haplotypes derived from them; only recombinations, mutations not shown]
statistical models of recombination:
average fragment length ~ 1/#generations
Uncovering founder sequences
• Problem: Given current sequences C
(haplotypes), construct their ‘founders’
that produce the sequences by iterated
recombinations using minimum possible
total number of cross-overs (i.e., current
sequences have a parse into smallest
possible number of fragments taken
from the founders)
Example
001000010011
111111100110
001011110110
111100100011
Example
Current sequences:
001000010011
111111100110
001011110110
111100100011
Founders:
001011100011
111100010110
6 crossovers
Example
Current sequences:
001000010011
111111100110
001011110110
111100100011
Founders:
000000000000
111111111111
18 crossovers
OBS: two founders (colors) always suffice if no restrictions
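The crossover counts in the examples can be checked with a small dynamic program. This sketch only evaluates a given founder set; it does not solve the reconstruction problem itself. The function names are hypothetical.

```python
def min_crossovers(sequence, founders):
    """Fewest founder switches needed to spell `sequence` left to right."""
    inf = float("inf")
    # cost[k] = fewest crossovers for a parse of the prefix that ends in founder k
    cost = [0 if f[0] == sequence[0] else inf for f in founders]
    for pos in range(1, len(sequence)):
        best = min(cost)
        cost = [
            min(cost[k], best + 1) if f[pos] == sequence[pos] else inf
            for k, f in enumerate(founders)
        ]
    return min(cost)


def total_crossovers(current, founders):
    """Sum of the minimum crossover counts over all current sequences."""
    return sum(min_crossovers(c, founders) for c in current)


current = ["001000010011", "111111100110", "001011110110", "111100100011"]
print(total_crossovers(current, ["001011100011", "111100010110"]))
print(total_crossovers(current, ["000000000000", "111111111111"]))
```

The first founder pair gives 6 crossovers and the trivial all-0/all-1 pair gives 18, matching the slides.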
Founder reconstruction problem
• given a set D of m sequences, construct M
founder sequences that give D in minimum
number of cross-overs
• solution by dynamic programming,
exponential time in m (WABI 2002)
• Q2: NP-hard?
Modeling a set of haplotypes by a HMM
• ’motif’ = Hidden Markov Model
• minimum description length (MDL)
modeling
• ALT 2004
Hidden Markov Model (HMM)
• states i with emission alphabet H_i
• emission probabilities P(H), H ∈ H_i
• state transition probabilities w_ij
[Figure: state i with emission distribution {P(H)} and a transition with probability w_ij to state j]
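How such a model assigns a probability to an observed sequence can be sketched with the standard forward algorithm. The assumptions here go beyond the slide: a start distribution is added, each state emits one symbol per step rather than a haplotype fragment, and the dictionary layout and the two-state toy model are invented for illustration.

```python
def sequence_probability(observed, start, trans, emit):
    """Probability of `observed` under an HMM (forward algorithm).

    start[i]    : probability of starting in state i (assumption, not on the slide)
    trans[i][j] : transition probability w_ij
    emit[i][h]  : emission probability P(h) in state i
    """
    alpha = {i: start[i] * emit[i].get(observed[0], 0.0) for i in start}
    for symbol in observed[1:]:
        alpha = {
            j: emit[j].get(symbol, 0.0)
            * sum(alpha[i] * trans[i].get(j, 0.0) for i in alpha)
            for j in start
        }
    return sum(alpha.values())


# Toy model: state "a" emits 1, state "b" emits 2; "a" moves on to "b" half the time.
start = {"a": 1.0, "b": 0.0}
trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"b": 1.0}}
emit = {"a": {"1": 1.0}, "b": {"2": 1.0}}
print(sequence_probability("12", start, trans, emit))
```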
Conserved fragments and parses
• haplotypes: 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
• parse: 111121212111111122222222
• conserved fragments: 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 2 1 2 2 2 2 1
• fragmentation model (HMM): 21212, 1111, 1 1 2 1 1 1 1, 2212222
Lactose tolerance
• recent finding in Finnish population: an
SNP C/T-13910, 14 kb upstream from the
lactase gene, associates completely with
lactose intolerance
• two datasets over 23 SNPs in the vicinity
of this SNP
• lactose intolerant persons: 21 haplotypes
• lactose tolerant persons: 38 haplotypes
Case/control study by HMM
Lactose intolerant
(~6 fragments per haplotype)
Lactose tolerant
(2 fragments per haplotype => young)
Genotype phasing via founders using a
HMM
• the genotype phasing problem: given a set of
genotypes, find their resolving haplotype pairs
• find at most M founders that produce resolving
haplotype pairs in minimum possible number of
cross-overs => relatively good haplotyping
method
• improved results with a related HMM, trained with
the Expectation Maximization algorithm
• WABI 2005
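What "resolving haplotype pairs" means can be shown by enumerating them for one genotype. The encoding is an assumption: a genotype is a string over 0, 1, 2, where 0 and 1 are homozygous sites and 2 is a heterozygous site. The sketch only enumerates candidates; choosing among them via founders or an HMM is not shown.

```python
from itertools import product


def resolving_pairs(genotype):
    """All unordered haplotype pairs that are consistent with `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = []
    # Fix allele 0 at the first heterozygous site so that (h1, h2) and
    # (h2, h1) are not both listed.
    for bits in product("01", repeat=max(len(het) - 1, 0)):
        choices = ("0",) + bits if het else ()
        h1, h2 = list(genotype), list(genotype)
        for site, allele in zip(het, choices):
            h1[site] = allele
            h2[site] = "1" if allele == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(resolving_pairs("0212"))
```

A genotype with k heterozygous sites has 2^(k-1) resolving pairs for k ≥ 1, which is why a model is needed to pick the plausible one.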
HMM for haplotyping
[Figure: HMM diagram labeled with transition probability distributions and an emission probability distribution]
Example HMM
[Figure not reproduced]
3. Uncovering gene enhancer
elements
Introduction
• Gene expression regulation in multicellular
organisms is controlled in combinatorial fashion
by so-called transcription factors.
• Transcription factors bind to DNA cis-elements
on enhancer modules (promoters), and multiple
factors need to bind to activate the module.
• In mammals, the modules are few and far between.
• The problem: Locate functional regulatory
modules.
Gene regulation
[Figure: DNA with promoter1 gene1, promoter2 gene2, promoter3 gene3, promoter4 gene4; labels: transcription, RNA, translation, Proteins, transcription factors]
Model of cell type specific regulation of target gene expression
[Figure, common targets (e.g. Patched): GLI, GLI, ubiquitously expressed TF; transcription]
[Figure, cell type specific targets (e.g. N-myc): GLI, X, Y (tissue specific TFs); transcription]
Binding affinity matrices
• The cis-elements are
represented by affinity
matrices.
– A column per position
– A row per nucleotide
• Discovered:
– Computationally
– Traditional wet lab
– Microarrays
Example matrix (8 positions, one column each; one row per nucleotide, row labels not preserved):
 9  11  49  51   0   1   1   4
19   3   0   0   0  45  25  16
 5   1   2   0  17   0   4  21
18  36   0   0  34   5  21  10
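How such a matrix is used to score candidate binding sites can be sketched with a standard log-odds scan. The 32 counts on the slide are read here as eight columns of four (each column then sums to 51). The row order A, C, G, T, the pseudocount and the uniform background are assumptions, and the function names are hypothetical.

```python
import math

# Counts from the slide; the row order A, C, G, T is an assumption.
COUNTS = {
    "A": [9, 11, 49, 51, 0, 1, 1, 4],
    "C": [19, 3, 0, 0, 0, 45, 25, 16],
    "G": [5, 1, 2, 0, 17, 0, 4, 21],
    "T": [18, 36, 0, 0, 34, 5, 21, 10],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn a count matrix into log2(frequency / background) scores."""
    width = len(next(iter(counts.values())))
    matrix = {base: [] for base in counts}
    for pos in range(width):
        total = sum(counts[b][pos] for b in counts) + pseudocount * len(counts)
        for base in counts:
            freq = (counts[base][pos] + pseudocount) / total
            matrix[base].append(math.log2(freq / background))
    return matrix


def best_site(sequence, matrix):
    """Return (score, position) of the best-scoring window in `sequence`."""
    width = len(next(iter(matrix.values())))
    scored = [
        (sum(matrix[c][k] for k, c in enumerate(sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


score, position = best_site("GGGGCTAATCCGGGGG", log_odds_matrix(COUNTS))
print(position, round(score, 2))
```

Under the assumed row order the highest-count symbol in each column spells CTAATCCG, so the scan picks the window starting at position 4 (0-based).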
Finding preserved motifs of binding
sites
• looking at one (human) genome gives too many
positives
• comparative approach: take the 200 kb regions
surrounding the same genes (paralogs and
orthologs) of different mammals (human, mouse,
chicken, …), find preserved clusters (motifs) of
binding sites
• Smith-Waterman type algorithm with a novel
scoring function
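For orientation, the classic Smith-Waterman recurrence is sketched below on plain character strings with a simple match/mismatch/gap score. This is the textbook local alignment, not the scoring function of the talk, which works on clusters of binding sites and is not described in the slides. The parameter values are arbitrary.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, with the cell where it ends."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_end = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(
                0,                        # start a new local alignment
                diagonal,                 # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
            if score[i][j] > best:
                best, best_end = score[i][j], (i, j)
    return best, best_end


print(smith_waterman("TTACGTTT", "GGACGTGG"))
```

The shared segment ACGT gives four matches, so the best local score is 8, ending at cell (6, 6).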
Whole genome comparisons
• Whole genomes can be analyzed with our
implementation
• We have compared human genes to orthologs
in mouse, rat, chicken, fugu, tetraodon and
zebrafish
– 100 kbp flanking regions on both sides of the gene.
– Coding regions masked out.
– About 20 000 comparisons for each pair of
species.
– About 2 min each
Enhancer prediction for N-myc
200 kb Human N-Myc genomic region
200 kb Mouse N-Myc genomic region
coding region of N-Myc
Conserved GLI binding sites in
two predicted enhancer elements,
CM5 and CM7
Wet-lab verification
• Selected predicted cis-modules for wet-lab
verification
• Fused 1 kb DNA segment
containing the predicted
enhancer to a marker gene
with a minimal promoter and
generated transgenic
embryos.
To conclude
• combinatorial vs probabilistic motifs
• significance of the findings for the
applications => statistical modeling
• Want to do computational biology? Then
find a good biologist who has good
computational intuition.
Acknowledgements
• Mikko Koivisto
• Heikki Mannila
• Kimmo Palin
• Pasi Rastas
• Morris Michael
• Stefan Kurtz (Hamburg)
• Outi Hallikas (Biom)
• Jussi Taipale (Biom)
• Markus Perola (Biom)
• Hans Söderlund (VTT)
The BioSapiens project is funded by the European
Commission within its FP6 Programme, under the
thematic area "Life sciences, genomics and biotechnology
for health," contract number LHSG-CT-2003-503265.