The Protein Folding Prediction Problem

Download Report

Transcript The Protein Folding Prediction Problem

Finding Regulatory Signals in
Genomic Sequences
Weeder
ProFind
Giancarlo Mauri
Bioinformatics and Natural Computing Group
Dipartimento di Informatica, Sistemistica e Comunicazione
Università degli Studi di Milano-Bicocca
Gene Expression Data

When and how much a gene
is expressed under some
given conditions (tissue,
external stimuli, disease...)

We can group genes
according to their
expression profile

We can suspect a “common
cause” for their expression
7/18/2015
2
Transcription Factors

The expression of a gene starts with transcription
from DNA to RNA

Transcription is modulated by dedicated proteins
called transcription factors (TFs)

TFs bind to DNA in the regions surrounding the
starting site of the gene (mostly upstream), and
direct polymerases to the “right spot” to start
transcription

Different effects: may enhance or block
transcription
7/18/2015
3
Transcription Factors
7/18/2015
4
TF
Transcription Starts
TFBS
7/18/2015
5
AND
TF1
TF2
Transcription Starts
TFBS
7/18/2015
6
NOT
TF1
Transcription Starts
TF2
TFBS
7/18/2015
7
TFs Binding Sites (TFBSs)
Fundamental in regulatory analysis is the
identification of potential TFBSs

Bound by transcription factors

Short degenerate sequences, 5-16 nucleotides long, (gaps
possible but rare)

Each TF does not recognize a single fragment but a set of them
(similar to each other) called signal or motif

Can be illustrated by profiles and/or consensi (computational
models)
7/18/2015
8
TFs Binding Sites (TFBSs)
Fundamental in regulatory analysis is the
identification of potential TFBSs

Bound by transcription factors

Short degenerate sequences, 5-16 nucleotides long, (gaps
possible but rare)

Each TF does not recognize a single fragment but a set of them
(similar to each other) called signal or motif

Can be illustrated by profiles and/or consensi (computational
models)
7/18/2015
9
Finding TFBSs

We have a set of related genes:

similar expression profile

similar biological function

anything else..

We take their upstream regulatory regions

If they are regulated by the same TF(s), then we should find
its (their) binding sites in the sequences

We should find short patterns conserved in the sequences

We could use the detected TFBS to predict the behavior of
a gene
7/18/2015
10
Finding Novel TFBSs

Over-representation:

First, detect groups of similar oligos

Describe each group with a consensus or a profile (or in
some other smart way)

Find the most over-represented groups
If sequences were built at random and/or
we picked sequences at random,
the group should not appear with
the same size/conservation
7/18/2015
11
Finding Novel TFBSs

Most of early research has focused on the first point: how to
detect the best groups (unfortunately, there are thousands of
candidates) given simple score measures

Recent research has followed the second point: which is the
best measure to tell “significant” groups from random
similarities?

Is it expected or not, to find a group that is conserved?

Can we take advantage from the wealth of sequence data
available?
7/18/2015
12
Weeder :
A tool for pattern discovery in
Genomic Sequences
Giancarlo Mauri*
Giulio Pavesi*
Graziano Pesole^
* Università degli Studi di Milano-Bicocca
^ Università degli Studi di Milano
References

G. Pavesi, G.Mauri, G.Pesole. An Algorithm for
Finding Signals of Unknown Length in Unaligned
DNA Sequences. Bioinformatics 17, S207-S214,
2001

G.Pavesi, P.Mereghetti, G.Mauri, G.Pesole.
WeederWeb: Discovery of Transcription Factor
Binding Sites in a Set of Sequences from CoRegulated Genes. Nucleic Acids Research Web
Server Issue 2004, 32: W199-W203

http://159.149.109.16:8080/weederWeb/
7/18/2015
14
Weeder (2001)

Idea: instead of reducing the set of candidate
patterns, reduce the set of possible matches for
each pattern, trying to save a “significant” number
of valid occurrences

Instead of searching exhaustively for patterns
that occur in every sequence, we “short-sightedly”
look for patterns that occur in a subset of them

The algorithm needs as input only a given error
ratio e
7/18/2015
15
Suffix Trees

A suffix tree is a data structure that exposes the internal
structure of a string in a very deep and meaningful way

Suffix tree T for S = s1…sn

rooted directed tree

exactly n leaves numbered 1 to n

internal nodes with at least two sons

edges labeled by non empty substrings of S

labels out of the same node begin with different symbols

the concatenation of the edge labels on the path from the root
to any leaf i exactly spells the suffix of S starting at position i,
i.e., s1…sn
7/18/2015
16
Suffix Trees

The same structure can be built also for a set of k
sequences

To distinguish which sequence a suffix belongs to, it appends a
different marker symbol, not occurring elsewhere, to each
sequence in the set.

It is also possible to annotate each node of the tree with a kbit string, where the i-th bit is set if the word spelled by the
path ending at the node occurs in the i-th sequence.
7/18/2015
17
Suffix Trees
G
#
A
C
C
C
C
A
$
A
G
#
$
G
#
A
$
A
A
G
A
G
#
$
#
Suffix tree for ACCA (end with $) and CCAAG (end with #)
7/18/2015
18
Suffix Trees

A generalized suffix tree can be built in O(N) time and takes
O(N) space, where N is the overall length of the sequences

Annotating it with the bit strings takes additional O(kN) time

Each pattern occurring in the strings is spelled by a path
starting from the root of the tree

The time needed to search for a pattern depends only on the
length of the pattern

The structure allows to implement recursively the exhaustive
enumeration of all the candidate patterns of a given length

The time complexity is thus reduced to be exponential in the
maximum number of mutations allowed (Sagot, 1998)
7/18/2015
19
Searching for an Exact Pattern


Given a set of sequences and the annotated suffix tree,
every pattern appearing in at least one sequence of the
set is spelled by a unique path starting from the root
We match the symbols of pattern p along the unique
path in the tree until

p is exhausted
• In this case, the bit string on the next node on the path
specifies which sequences p appears in

no more matches are possible
7/18/2015
20
Searching for a Pattern with
Mismatches




We can also search for a pattern p with at most e
mismatches in a similar way.
We match p along different paths on the tree at the
same time, keeping track of the number of
mismatches encountered on each path.
Whenever the number of errors on a path is greater
than e, we discard that path.
The sequences p appears in are given by the logicalOR of the bit strings corresponding to the different
paths.
7/18/2015
21
Searching for (M, e) Patterns

The algorithm starts with the empty pattern from the
root of the tree, and recursively expands it

Let us suppose we have found on the tree the
endpoints of paths corresponding to the occurrences
of a pattern p=p1…pm in the sequences, where all the
paths spell words within distance e from p, with m<M

If p occurs in at least q sequences, we try and expand
it by one symbol
7/18/2015
22
Searching for (M, e) Patterns

Expanding a pattern by one symbol

For each character b  {A, C, G, T}, we match b against
the next symbol on each path

If a path ends just before a node V of the tree, we match
b against the first symbol on each edge leaving V

When we encounter a mismatch, we increase the previous
error along the path by one

7/18/2015
If the new error is greater than e, we discard the path
23
Searching for (M, e) Patterns


Once all paths have been checked, the surviving ones
represent the approximate occurrences of p’=p1…pmb
If p’ occurs in at least q sequences, and is shorter
than M, we expand p’ as well. Otherwise, we continue
with p and the next character in 
7/18/2015
24
Searching for (M, e) Patterns

For example





It matches the first symbol on each edge leaving the root
against A.
If A is valid, i.e., A occurs in at least q sequences, it is expanded
to AA.
If also AA is valid, we move to AAA, and so on.
If it is not valid, we proceed to look for occurrences of AC.
In this method, patterns don’t have to occur exactly in
the sequences.
7/18/2015
25
Searching for (M, e) Patterns

The main drawback is that every pattern of length
e satisfies the input constraints, since every other
pattern of length e found in the tree is a valid
occurrence for it

Thus, the method works well only for small values
of e
7/18/2015
26
Searching for (M, e) Patterns
e
At the beginning of the search, all paths of length e are valid
7/18/2015
27
Searching for (M, e) Patterns

To apply the algorithm also to longer patterns with
higher values of e, instead of reducing the set of
patterns that have to be searched, we restrict the
number of paths that have to be followed for each
pattern.

That is, we narrow down the set of valid
occurrences-the WEEDER algorithm.
7/18/2015
28
Searching for Approximate
Occurrences of Patterns
 Problem

Definition:
Given a set of k sequences on the alphabet
={
A, C, G, T } , we want to find all (M, e) patterns

(M, e) patterns: patterns of length M that occur
with at most e mismatches in at least q sequences
of the set
7/18/2015
29
The Outline of WEEDER

WEEDER fixes an initial error ratio e

Given a pattern p, a path is valid if the distance from p to the
path is not greater than e |p|
• |p| is the length of the pattern

When we expand p by one symbol, the error threshold is set
to e (|p|+1)
7/18/2015
30
Block Decomposition of a Pattern
0
4
8
12
16
e = 0.25
1
2
3
4


Each block size is 1/e
Let p = p1…pm. We can see p as composed of
blocks
7/18/2015
em
31
Valid Occurrences

For every pattern p = p1…pm, valid occurrences are
words si+1…si+m occurring in the sequences for which:  j
{1,…, m} d(p1…pj, si+1…si+j)  ej


d(p1…pj, si+1…si+j) is the number of mismatches between p1…pj
and si+1…si+j
si+1…si+m is a valid occurrence for p if it is a valid
occurrence for all its prefixes
{p1, p1p2, …, p1p2…pm-1}
7/18/2015
32
An Example for WEEDER
C
T
T
%
%
#
CA&
G
GCT%
A
CGC#
&
C
A
&
C
GC#
#
T
T
C
$
A
%
$
C
A&
& CGC# $
GCTCA&
ATCACGC#
GC#
TC$
CACGC#
GCT%
q = 2, e =0.25
S1: AATCACGC#
S2: AGCTCA&
S3: ATGCT%
S4: ACTC$
7/18/2015
33
ACTC ACTCA
C
G
T
T
%
%
#
CA&
#
C
GCT%
A $
GC#
T
T
C
&
&
GCTCA&
C
$
A
CGC#
C
&
A
CGC#
TC$
%
A&
$
GC#
CACGC#
GCT%
ACTC: error max =1. S1, S2, S4 contain ACTC.
ACTCA: error max =2. S1, S2, S4 contain ACTCA.
S1: AATCACGC# S2: AGCTCA& S3: ATGCT%
7/18/2015
ATCACGC#
S4: ACTC$
34
ACTCA  ACTCAA
C
G
T
T
%
CA&
#
C
GCT%
A
$
C
T
T
CGC#
ATCACGC#
GC#
TC$
%
C
&
&
GCTCA&
GC#
$
A
CGC#
C
%
#
&
A
A&
$
CACGC#
GCT%
ACTCAA: error max =2. S1, S2 contain ACTCAA.
ACTCAC, ACTCAG, ACTCAT are also patterns.
S1: AATCACGC#
S2: AGCTCA&
S3: ATGCT%
7/18/2015
S4: ACTC$
35
Weeder (2001)

Given a pattern P = p1p2....pm, the algorithm can
find all the valid occurrences of P (with at most
e|P| mutations), such that at most ei mutations
occur in the first i letters of the pattern

But: some occurrences of a pattern can be missed
altogether

Are DNA signals always so polite to show up in
“blocks-decomposed” form?

The answer is no, but we can use Weeder with a
grain of salt
7/18/2015
36
Using Weeder

Example: (15,4) pattern occurring in 20 sequences

Valid (block decomposed) possible occurrences: 829

Total possible occurrences:1365

Probability if “hitting” a possible occurrence in a sequence:
phit=.61

Probability of finding the pattern in every sequence: like
trying to win the national lottery

If we search for patterns occurring in at least 10
sequences, the probability of “seeing” at least 10 times the
pattern is:
Phit(20,10) = .89
7/18/2015
37
Using Weeder



Thus, we can use Weeder as a sieve, to filter the
set of candidate patterns
All patterns that are found to occur in at least q of
the sequences by Weeder can be searched again in
the sequences, but this time with no restriction on
the position of mismatches
We expect the number of patterns (random
patterns other than the real signal) passed to the
second phase to be much smaller than the original
number (and no longer exponential)
7/18/2015
38
Using Weeder

The probability of finding a pattern in a sequence
depends on its length and the error ratio

The probability of finding a pattern in a set of
sequences (and thus the choice of the quorum q for
the first phase) depends on the number of
sequences

The same approach can be applied also when the
signal does not show up in each sequence
7/18/2015
39
Using Weeder


When the signal to be found is expected to be
short, the algorithm can be used in “exact” mode
For longer signal, the lower is the quorum q, the
higher is the probability of finding the signal

But: also the number of patterns satisfying the
input constraints is higher, and the program is
slower

Users can choose a suitable trade-off between
time and accuracy
7/18/2015
40
Theoretical Time Complexity

Naïve approach: O(4men)

Suffix tree approach (Sagot, 1998): O(4emekn)

where n is the input size, m is the pattern length, and e is
the number of mutations allowed

Weeder:
O((1/e)e4ekn)
where e is the number of mutations occurring in the
longest pattern found
7/18/2015
41
Weeder Web

Weeder Web is a web interface to the Weeder algorithm, where
all the parameters concerning the motifs are automatically set for
the discovery of transcription factor binding sites

Although there is no pre-set limit on the length of the input
sequences, feasible results can be obtained by submitting
sequences of "typical" length for regulatory/promoter regions (i.e.
from 500 to 5000 bps)

A priori, there's no limit on the number of sequences you can
input. Also, for the moment we do not consider correlations among
different motifs (i.e. cis-regulatory modules)
7/18/2015
42
Weeder Web

All the statistical measures (background oligo
frequencies, expected occurrences and so on) used to
score/rank motifs and to post-process the output have
been derived from the analysis of promoter/enhancer
and 5'UTR regions only (taken from different
organisms)

If you submit something else (i.e. 3'UTRs, coding
regions, noncoding RNAs, and so on) the statistical
evaluation probably will not be consistent with your
data, and thus produce unreliable results

http://www.pesolelab.it/Tool/ind.php
7/18/2015
43
Post-Processing

Real motifs have different degrees of variation in
different positions

Some admit “any” nucleotide

Some are (almost) perfectly conserved

We should find “redundant” motifs among the
highest-scoring ones

Pieces of a long motif should appear also in shorter
results
7/18/2015
44
Post-Processing

Look for “redundant” (either in length or in
conservation) motifs in the reports of each run

Collect the instances of each one and build a
frequency matrix

Scan the sequences looking for matches

Report the best matches (with no constraint on the
substitutions allowed)
7/18/2015
45
Assessment results
Tompa et.al.,
Nat Biotechnol. 2005 Jan;23(1):137-44.
Combined over all data sets
0.35
nSn
0.3
nPPV
0.25
nPC
0.2
nCC
0.15
sSn
YMF
Weeder
SeSiMCMC
QuickScore
MotifSampler
MOE
MITRA
GLAM
Multiple-family-analysis
7/18/2015
Consensus
AlignACE
ANN-Spec
0
MEME3
sASP
MEME
0.05
Improbizer
sPPV
IITD
0.1
46
G
LA
M
E
on
se
ns
us
C
-S
pe
c
A
M
E3
M
E
M
IT
R
M
E
M
E
YM
F
M
O
M
M
E
ot
ul
i fS
tip
am
le
-fa
pl
m
er
il y
-a
na
ly
si
Q
s
ui
ck
Sc
or
e
Se
S
iM
C
M
C
W
ee
de
r
7/18/2015
IIT
D
Im
pr
ob
iz
er
C
N
Al
ig
nA
AN
Combined sSn
Assessment results
sSn categorized by species
0.6
0.5
0.4
w holeset
fly
0.3
human
mouse
0.2
yeast
0.1
0
47
ProFind :
A GA Approach to the Definition
of Regulatory Signals in
Genomic Sequences
Characterization of CAP and TATA-box through
probability matrices with a genetic algorithm
Giancarlo Mauri, Roberto Mosca and Giulio Pavesi
Bioinformatics and Natural Computing Group
Dipartimento di Informatica, Sistemistica e Comunicazione
Università degli studi di Milano-Bicocca
TATA Box and CAP Binding Sites

A large number of genes present two characteristic signals:

TATA-box
• 25-35 bp upstream of the TSS
• When discovered it was given a TATA consensus
• Bound by the TATA Binding Protein (TBP) part of a large complex of
some 50 different proteins including TFIID and TFIIB

CAP (also called Initiator or Inr)
• Straddles the TSS
• Experimental evidence that it is bound by TFIID too
• Previous characterization by Bucher [1990] with a CA[Py] consensus

Very strong positional preference for the two signals with
respect to the TSS
7/18/2015
49
Describing Binding Sites
Frequency Matrix nij
.........0...........
...CGTGCCATTTGTTGT...
...TCCTACAGTGCAGCA...
...TCACATATTATTGTC...
...GAAAGCAACAACTAA...
...TAAATCGTCAGTGTA...
...CCGACCAGAGTGAAA...
...GGGTTTGGTTTGATA...
...GCGTGCAGTTGTGAA...
...GTCGCCATATACACA...
...GTGGCCGTATGCGCT...
7/18/2015
-2
-1
0
1
2
A
2
0
7
1
3
C
4
8
0
0
2
G
2
0
3
4
0
T
2
2
0
5
5
-2
-1
0
1
2
A
0.2
0
0.7
0.1
0.3
C
0.4
0.8
0
0
0.2
G
0.2
0
0.3
0.4
0
T
0.2
0.2
0
0.5
0.5
50
Frequency Matrix

Given a signal of length m, a matrix P of dimension 4m is built up, taking
Pi,j=P[nucleotide i at position j]

We suppose the existence of two different models
 The signal
 The background
Signal occurence
S  S1 S2 S3 Sm
m
P
j 1
m
Sj,j
b
j 1
7/18/2015
Sj
1
bi  P[nucleotidei in thebg], i A, C, G, T
 PS j , j 

0
log

 bS 
j 1
j
w
Weights stored 
m
in the matrix
i, j
51
The Problem
Dataset
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
7/18/2015

DS  S 1 , S 2 , S 3 ,, S n
TTTTGTTTTTTTATTTCCTGTATTTTT
TCCAGCCCGAACAAAATCGATCAAAAT
ATCCCTCTGGCCATTGGCAATCGATCC
AGAAACAAAACGGCTTGTAAAACAAAC
GTGCAGTGAGTCAGTGTGTTGTGTGCC
GAGCGTAAGCAAGAGAGAGAGGTGAAG
AGGTGAAGCCAGGGGCGGAGGCGCAAG
AGAAAAGAGAGAGTGAAAGCATAGTCC
AGTTTTCATATTGTTACCGTTTGAGTT
TTCACCAGCCACTTTCAGTCGGTTTAT
GCATAACGAATCACTCTGATCGCTGTC
GGTCCAGCGACCACTCGCAGTTCTACA
GATCGGCGTGCCATTTGTTGTTGAATC
CCGCTCTCCTCCAGTGCAGCAGCAGCA
TCCAAGTCACCGATTATTGTCTCAGTG
GAACTGGAAACCAACAACTAACGGAGC
TCAGTCTAAATTTACCCTGTAAAATTC
GTAGTTCCGACCAGAGTGAAACTGAAC
CTTTATGGGTTTAGTTTGATAGGAGTC
TCACTGGCGTTGTTAGAGTTGTGAATG
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...

-2
-1
0
1
2
A
0.2
0
0.7
0.1
0.2
C
0.3
0.7
0
0
0.2
G
0.2
0
0.2
0.4
0
T
0.2
0.2
0
0.4
0.5
52
The Score Function

The score function is made up of two terms:

A positive term

A negative term
...
...
...
...
...
...
...
...
TTTTGTTTTTTTATTTCCTGTATTTTT
TCCAGCCCGAACAAAATCGATCAAAAT
ATCCCTCTGGCCATTGGCAATCGATCC
AGAAACAAAACGGCTTGTAAAACAAAC
GTGCAGTGAGTCAGTGTGTTGTGTGCC
GAGCGTAAGCAAGAGAGAGAGGTGAAG
AGGTGAAGCCAGGGGCGGAGGCGCAAG
AGAAAAGAGAGAGTGAAAGCATAGTCC
Negative
score
Positive
score
...
...
...
...
...
...
...
...
Negative
score
Total score = positive score – negative score
7/18/2015
53
7/18/2015
Position
40
35
30
25
20
15
10
5
0
-5
-10
-15
-20
-25
-30
-35
-40
-45
-50
Mean score
The Score Function
2,5
2
1,5
1
0,5
0
DCPD
54
The Genetic Algorithm
1
1
1
0
1
0
0
0
0
0
1
1
1
1
0
1
0
1
1
0
7/18/2015
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
TTTTGTTTTTTTATTTCCTGTATTTTT
TCCAGCCCGAACAAAATCGATCAAAAT
ATCCCTCTGGCCATTGGCAATCGATCC
AGAAACAAAACGGCTTGTAAAACAAAC
GTGCAGTGAGTCAGTGTGTTGTGTGCC
GAGCGTAAGCAAGAGAGAGAGGTGAAG
AGGTGAAGCCAGGGGCGGAGGCGCAAG
AGAAAAGAGAGAGTGAAAGCATAGTCC
AGTTTTCATATTGTTACCGTTTGAGTT
TTCACCAGCCACTTTCAGTCGGTTTAT
GCATAACGAATCACTCTGATCGCTGTC
GGTCCAGCGACCACTCGCAGTTCTACA
GATCGGCGTGCCATTTGTTGTTGAATC
CCGCTCTCCTCCAGTGCAGCAGCAGCA
TCCAAGTCACCGATTATTGTCTCAGTG
GAACTGGAAACCAACAACTAACGGAGC
TCAGTCTAAATTTACCCTGTAAAATTC
GTAGTTCCGACCAGAGTGAAACTGAAC
CTTTATGGGTTTAGTTTGATAGGAGTC
TCACTGGCGTTGTTAGAGTTGTGAATG
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
-2
-1
0
1
2
A
0.2
0
0.7
0.1
0.2
C
0.3
0.7
0
0
0.2
G
0.2
0
0.2
0.4
0
T
0.2
0.2
0
0.4
0.5
55
The Genetic Algorithm

One-point crossover

Mutation operator flips one bit with a probability pm

Given a genome the frequency matrix is calculated from the
sequences selected by the genome itself

The matrix is scored according to the previously defined
score function

A local optimization procedure is applied to the resulting
best individual
7/18/2015
56
Results
Three different datasets:

Eukaryotic Promoter Database
(EPD, www.epd.isb-sib.ch, rel. 74)
• Vertebrates: 2199 seqs, from -50 to 50
• Homo Sapiens: 1796 seqs (included in the previous one)

Drosophila Core Promoters Database
(DCPD, www-biology.ucsd.edui/labs/kadonaga/DCPD.htm)
• 205 seqs, from -47 to 44
7/18/2015
57
Results - CAP
Start position: -2
Length: 6
Population: 500
Generations: 20000
Crossover Probability: 0.9
Mutation Probability: 0.04
7/18/2015
Bucher, P.: Weight matrix descriptions of four eukaryotic RNA
polymerase II promoter elements derived from 502 unrelated
promoter sequences. J. Mol. Biol. 212 (1990) 563–78
58
Results - CAP

CA, CG, TA and TG dinucleotides starting at position -1 in
for the EPD dataset

CA, TA dinucleotides starting at position -1 for the DCPD
dataset (consistently with [Kadonaga,2000]
Kutach, A., Kadonaga, J.: The downstream promoter element DPE
appears to be as widely used as the TATA box in Drosophila core
promoters. Mol. Cell Biol. 20 (2000) 4754–64
7/18/2015
59
7/18/2015
Position
Position
40
35
30
25
20
15
0,25
10
5
0
-5
-10
-15
-20
0
-25
0,35
0,15
0,2
0,1
0
2,5
2
1,5
1
0,5
0
DCPD
60
40
35
30
25
20
15
10
5
0
-5
-10
-15
-20
-25
-30
-35
-40
-45
-50
0,4
0,05
-30
0,3
-35
-40
EPD Homo Sapiens
Mean score
40
35
30
25
20
15
10
0,05
-45
-50
Position
5
0
-5
-10
-15
-20
-25
-30
-35
-40
-45
-50
Mean score
Mean score
Results - CAP
DCPD
0,35
0,4
0,25
0,3
0,15
0,2
0,1
EPD Vertebrates
Results – TATA Box
Start position: -30
Length: 8
Population: 500
Generations: 20000
Crossover Probability: 0.9
Mutation Probability: 0.04
0,45
0,4
Mean score
0,35
0,3
0,25
0,2
Consistent with the finding that the
TBP recognizes the minor grove of
DNA, where protein-DNA interactions
are typically influenced by A/TKim, J.L., Nikolov, D.B., Burley, S.I.: Co-crystal structure
content, but not by the specific
of TBP Recognizing the minor groove of a TATA element.
Nature 365 (1993) 520–527
nucleotide sequence.
0,15
0,1
0,05
Position
33
29
25
21
17
9
13
5
1
-3
-7
-11
-15
-19
-23
-27
-31
-35
-39
-43
-47
0
EPD Vertebrates
Lo, K., Smale, S.T.: Generality of a functional initiator
consensus sequence. Gene 182 (1996) 13–22
7/18/2015
61

The signals found have proved to be consistent with those described
experimentally, as in the case of fruit fly signals

Differently from previous approaches to the same problem:

it does not make any prior assumption about the signals structure

it takes advantage of the specific localization of the signals considered

We do not need to know in advance which sequences contain the signal,
this is taken care of by the algorithm

Further improvements include:

Development of models taking into account correlation between adjacent
positions

Application of the method to new datasets with a better characterization
of the TSS to investigate positional correlation between the TATA-box and
the TSS and the presence of alternative TSSs
7/18/2015
62
SeQuAl
Large Scale Multiple Alignment of Genomic Sequences
Summary

Multiple Sequence Alignment

The Methods So Far

SeQuAl project

Producing a Threaded Blockset Alignment

Results & possible improvements
7/18/2015
64
Multiple Sequence Alignment (1)
The multiple sequence alignment problem is the
process of taking three or more input sequences and
forcing them to have the same length by inserting a
universal gap symbol, in order to maximize their
similarity as measured by a score function.
7/18/2015
65
Multiple Sequence Alignment (2)
7/18/2015
66
Dynamic Programming Vs Heuristics

Dynamic Programming Techniques



O(NxM) for two sequences of length N and M
Exponential for multiple sequences
O(nxM2) for n sequences of length approximately M using some
heuristics (ClustalW)
Prohibitively time-consuming for sequences
of length exceeding 10 kb
Need for more heuristics
to speed up the alignment
7/18/2015
67
The Methods So Far

Pairwise sequence alignment tools


MUMmer, GLASS, WABA, BLASTZ
Multiple sequence alignment tools

MAVID, MLAGAN, MGA, Mauve

Anchoring techniques

Seed ungapped alignments and extensions

Filling gaps using dynamic programming
7/18/2015
68
SeQuAl project

Goals:

Ability to find conservation even in the presence of
mutations on amino acids

Ability to find conservation in subsets of the original set

Anchoring technique based on text indexing
structures

Hashing functions on nucleotides and amino acids
7/18/2015
69
SeQuAl project

Four steps alignment:




Anchors search
Anchors chaining
Anchors refinement
Gaps filling
New!
New!
S1
S2
S3
S4
7/18/2015
70
Anchors Discovery (1)

Discovery of MUMs and MEMs using Suffix Trees




Searched on




Longer than a minimum threshold
Shorter than a maximum length for speeding up the time complexity
Statistically significant
the nucleotide sequence
the amino acid sequence
on the sequence of classes of aminoacids
Fragments generation
S1
S2
S3
S4
New!
New!
Anchors Discovery (2)
(R,1)
{S,R}
S: ACGTCA$
R: TGTCTA$
T
A
G
{S,R}
C
G
T
{S,R}
C
$
G
T
C
A
(S,6)
(R,6)
$
(S,1)
C
{S,R}
A
T
$
(R,4)
A
$
(S,5)
{S,R}
G
T
T
A
C
A
$
$
T
C
T
A
A
C
$
A $
(R,5) {S,R}
A
(S,4)
T
$
A
$
(S,3)
(R,3)
(R,2)
(S,2)
7/18/2015
$
72
Fragments Chaining

Ordering relation between fragments
S1
S2
S4

S3
A
<
B
<
C
D
<
<
E
Graph induced by the ordering relation
B
D
start
A
C
stop
E
Shortest Path on a DAG
Partial Anchors (1)

Ordering
relation?
S
1
S2
S4
S3
S4
B
A <
A
C
7/18/2015
B
<
C
<
A

Order the partial fragments based
on their “barycenter”

Use a shifting barrier to impose a
total ordering between partial
fragments
74
Partial Anchors (2)
7/18/2015
75
Producing a Threaded Blockset Alignment

Block is a rectangular array of symbols such that removing dashes from any row

Blockset: a set of such blocks


produces a run of one or more consecutive positions in one of the original sequences
A “ref-blockset" consists of a blockset in which every block has a designated row, all of
which come from the same original sequence, called the reference for that refblockset.
If a blockset is threaded by each of the original sequences, we call it a threaded
blockset.
Local Alignment
with ClustalW
S1
S2
S3
S4
7/18/2015
76
Results (1)
7/18/2015
77
Results (2)
SeQuAl Vs MGA
100%
96,06%
93,48%
95%
91,12%
88,90%
90%
85%
94,04%
92,60%
83,04%
80,70%
80%
73,63%
74,20%
75%
70%
65%
60%
55%
50%
Adenovirus 5
E. coli 3
C. pneumoniae 3
S. aureus 3
Dataset
MGA
SeQuAl
S. aureus 4

Approximate anchors in non-coding areas of the
genome


Developing a hashing function for regulatory regions
Important evolutionary events

Modeling rearrangements and inversions

Memory-saving implementation of suffix trees

Custom visualization tool and classification of
aligned regions based on the nature of anchors
7/18/2015
79
Conclusions

Exhaustive methods:

Suitable for short patterns and a once-for-all analysis of
data (e.g. whole genomes)

Sequence driven methods:

Give faster but less accurate answers

Limited data sizes

It is possible to choose a trade-off between time and
accuracy
7/18/2015
80
Block Decomposition
Where (3-4-4-4) comes from:
0.26*1 = 1
0.26*2 = 1
0.26*3 = 1
0.26*4 = 2
0.26*5 = 2
…..
0.26*15 = 4
3
7/18/2015
4
4
4
81
Number of Patterns

829*34
3
4
4
4
01
210
0123
2134
 4  4
3After
  iteration
4 4   48
1  31 3
12 

1+12+48+288+216+192+72 = 829
 4  3 
829*34 for all
–Probability phit = 829/1365 = 0.61
–Probability of finding pattern in all 20 sequences is (0.61)20 ≈ 0
7/18/2015
82
It works…

Use the algorithm as a sift, i.e. we run it to find all patterns
that occur in at least half of the sequences.

Search the patterns reported by the algorithm and use the
suffix tree to check which ones actually appear in all the
sequences.
7/18/2015
83
The next…

The probability that a pattern of length 15 occurs with up to
four mutations of a give position of a random sequence:


i
15   3   1 
p (15, 4)       
i 0  i   4   4 
The probability is seen by WEEDER is
4
(15 i )
p  0.63
*
The probability that a pattern occurs in a position in a hit
form
valid for WEEDER is
*
(15,4)
p
7/18/2015
 p  p(15, 4)
*
hit
84
Finally…

The pattern occurs at least once in a sequence of length n is:

seq
*
( n14)
(15,4)
(15,4)
The probability that the
pattern occurs at least
half of the
 1 (1 p
p
)
sequence of set and is found by WEEDER is:
 20 
(20,10)     ( p
i 
20
seq
seq
i
seq
(20i )
 The expected
to the
(15,4) number of patterns passed
(15,4)by WEEDER
(15,4)
second phase is :
i 10
p
4 p
15
7/18/2015
seq
(15,4)
) (1  p
(20,10)
)
85
How good it works…..

A set of 20 sequences of length 600 raises less
than 10 patterns.

A set of 20 sequences of length 1000 raises to
about 300 patterns.

If we look for patterns that occur in at least 9
sequences:

50 for sequences of length 600

3500 for sequence of length 1000
7/18/2015
86
Performance Evaluation (cont’d)

We could partition the set of sequences in two subsets of ten
sequences each and the probability that the pattern will pop up
among the ones found by the algorithm in either subset is:
10  i
1  (   pmiss (1  pmiss )(10i ) )2  0.98
where pmiss
i =6 1 -i phit
10

7/18/2015
87
Performance Evaluation (cont’d)


A pattern is expanded whenever at least one of the
two counters is greater than q.
q is a minimum number of sequences for each
subset
7/18/2015
88
Performance Evaluation (cont’d)

This approach can be pushed even further.
Partition the set of sequences as long as the parameters are
such that only a few patterns satisfy them.

And it works well on long signals, where the phit
value is lower and random patterns are unlikely to
be picked up by the sifting phase.
7/18/2015
89
Thus…

The final exact search has to be performed on a
limited number of patterns.

When the signal length and the number of
mutations are known in advance, we can determine
the best parameter setting and search strategy
for WEEDER.
7/18/2015
90
If the length of the pattern is not
known in advance…

There’re some additional problems to be faced…
If we choose
:
e  0.25
The probability of hitting a (16, 4) pattern is 0.54.
But the chances of finding a (15, 4) have dropped to 0.45. (since it’s
decomposed as (4-4-4-3))
=>The probabilities of finding a pattern depends on the block decomposition
induced by .
7/18/2015
91
Moreover…
a lower threshold value q would
increase the number of candidates.
 Setting
For example:
Run a algorithm with
hoping to find a (16, 4) pattern.
on 20 sequences of length 400
=>The constraints are satisfied by hundreds of patterns of length 12.
7/18/2015
92
What is the solution?

One possible solution could be to investigate only
the longest patterns found.
=>But a significant signal could be hidden also among
the shorter ones.
7/18/2015
93
The solution we adopt…

Expand all patterns that appear in at least q sequences, but report
only those that occur in q(m), which is set according to the pattern
length.
In the previous example:
A pattern of length 12 with 3 mutations can be found with probability of about 90% by setting q=11.
=>Thus, if a pattern of length 12 appears in at least q(m)=11 sequences, we can pass it to the second phase.

Since q(m) can be set according to the number of sequences given as
input, and the parameters q and .
e
7/18/2015
94
If the nucleotide composition of the
sequences is not uniform…

For example, 1:1:1:2
twice as often as the others)
(T occurs
In order to avoid spending too much time in the final phase, we can
set the threshold q(m) according to the pattern probability.
7/18/2015
95
The points just discussed

At the end, the algorithm might report more than
one pattern satisfying the constraints.
=>We need to introduce significance
measures to sort the output.
7/18/2015
96
Sorting the Output

The algorithm may output more than one pattern
under:

The length of the signal to be found is unknown

The sequences contain more than one signal
7/18/2015
97
Example

A successful (15,4) pattern can be expanded by one
symbol and becomes a (16,5) pattern.

All its occurrences are also valid occurrences of
the longer one.
7/18/2015
98
A Method for Sorting Outputs
k
S P | P | N P  2 d ( P, Pi ) I ( P, i)
i 1

A pattern P is given

is the best instance of P in the sequence i

is the total number of sequences P occurs in
Pi
NP

I ( P, i )

is 1 if P appears in sequence i, otherwise 0
d ( P, Pi )
7/18/2015
means the distance between P and
Pi
99
Relative Entropy

When the nucleotide composition of the sequences
is biased, we use the background probabilities to
define new match premiums and penalties.
m
EP  

j 1 r{ A,C ,G ,T }
7/18/2015
Pr , j log
Pr , j
br
100
Relative Entropy (cont’d)
m
EP  

j 1 r{ A,C ,G ,T }


Pr , j log
Pr , j
br
Pr , j
is the frequency with which residue r occurs in
position j in the occurrences of P
br
7/18/2015
is the frequency of r in the sequences
101
Relative Entropy (cont’d)

It is suitable for sorting patterns under following
situations:

Patterns that appear the same number of times, for
example once in every sequence of the set.

7/18/2015
Sequences containing multiple signals.
102
Another Measure of Significance

We can define statistical measures of significance,
that compare the actual number of occurrences of
a pattern with the expected value.
ZP 
7/18/2015
N P  N ( P ,e )
N ( P ,e ) (1   ( P ,e ) )
103
Measure of Significance (cont’d)
ZP 
N P  N ( P ,e )
N ( P ,e ) (1   ( P ,e ) )
 ( P,e) is the probability that P occurs in a sequence of the set

with at most e mutations.

N is the overall length of the sequences.

This value can be computed explicitly (Tompa,1999)
7/18/2015
104
Software Implementation

Machine: Pentium II class computer with 64MB
RAM

OS: Linux

Program Codes: written in ANSI C, about 2500
lines long

Testing Data: challenge problem as described
before, varying the length from 100 to 1000
nucleotides
7/18/2015
105
Experiment Results -- Time

Construction of the suffix tree: always within one second

To find the (15,4) signal with 89% success probability, and run the
algorithm with = 0.26 and q = 10. for sequence lengths up to 400
nucleotides. it took less than one minute (including the final exact
search).

Length 500: execution time grows significantly

Length 500 and 600: average time 125 and 200 seconds

q = 11: execution time drops to 100 and 120
7/18/2015
106
Experiment Results – Time
(cont’d)

Increasing the number of sequences did not influence the execution time very
much.

Example

For every sequence length, the algorithm took just a few more seconds when run over
30 or 40 sequences with q set to 16 and 21, respectively.

Sequence length set to 800, the program took on the average 320 and 450 seconds to
complete the execution with q set to 11 and 10 respectively. When length is 1000 long,
it took about 15 minutes.

Thus the WEEDER algorithm is suited to work on a large set of relatively short
(up to 600 nucleotide) sequences than a small set of very long sequences.
7/18/2015
107