Implementation of Planted Motif Search Algorithms


Implementation of Planted Motif Search Algorithms PMS1 and PMS2
Clifford Locke
BioGrid REU, Summer 2008
Department of Computer Science and Engineering
University of Connecticut, Storrs, CT
Introduction


General Problem: Multiple Sequence Comparison
Biological Basis

DNA structure/function

Sequence of nucleotides


Genes code for proteins


Modeled as strings
Structure → Function
Image credit: www.britannica.com
Evolution: the result of DNA mutations and selective pressures
Introduction

Goals of Multiple Sequence Comparison




Deduce evolutionary relationships.
Protein and gene function studies.
Find transcription factor and regulatory protein binding sites.
Approaches:


Find common subpatterns and deduce a biological
relationship.
Find common subpatterns between DNA sequences
with a known biological relationship.
Planted Motif Search


Motifs: common functional subsequences in a set of biological sequences
Planted (l,d) motif search problem:


Input: n strings S1, S2, …, Sn, each of length m, and two integers l and d.
Output: all strings x such that |x| = l and every input string contains at least one variant of x at a Hamming distance of at most d.
Primary applications: Finding transcription
factor binding sites; drug target identification
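The definition above can be checked directly for one candidate string. The Python sketch below is only an illustration of that check; the function names (hamming, occurs_within, is_motif) and the toy sequences are assumptions and do not come from the original implementation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(x, s, d):
    """True if some l-mer of s (l = len(x)) is within Hamming distance d of x."""
    l = len(x)
    return any(hamming(x, s[p:p + l]) <= d for p in range(len(s) - l + 1))


def is_motif(x, seqs, d):
    """True if x has a variant within distance d in every input sequence."""
    return all(occurs_within(x, s, d) for s in seqs)


# Toy input (made up for illustration): n = 3 short sequences.
seqs = ["ACGTACGT", "TTACGAAA", "GGGACGTC"]
print(is_motif("ACGT", seqs, 1))  # True: the second sequence contains ACGA
print(is_motif("ACGT", seqs, 0))  # False: the second sequence has no exact ACGT
```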
Algorithm PMS1




1. Generate the set of all l-mers in each input sequence. Let Ci be the set of l-mers of Si.
2. For each l-mer u in Ci (1 ≤ i ≤ n), generate all l-mers v such that v is at a Hamming distance of at most d from u (v is a “neighbor” of u). Let Li be the set of all such l-mers u and v obtained from input sequence Si.
3. Alphabetically sort each set of neighbors Li and eliminate any duplicates.
4. Merge and intersect all sets Li to find the l-mers that appear in every neighborhood. Such l-mers constitute the motifs in the input sequences.
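The steps of PMS1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation that produced the reported runtimes: the alphabet is assumed to be ACGT, the names neighbors and pms1 are hypothetical, and Python sets stand in for the sort-and-merge steps (a set removes duplicates and supports intersection directly).

```python
from itertools import combinations, product


def neighbors(u, d, alphabet="ACGT"):
    """All strings within Hamming distance d of u (including u itself)."""
    out = set()
    r = min(d, len(u))
    # Choose r positions and try every letter there; a letter may equal the
    # original one, so strings at distance less than r are covered as well.
    for positions in combinations(range(len(u)), r):
        for letters in product(alphabet, repeat=r):
            v = list(u)
            for p, ch in zip(positions, letters):
                v[p] = ch
            out.add("".join(v))
    return out


def pms1(seqs, l, d, alphabet="ACGT"):
    """Intersect the d-neighborhoods of the l-mers of every input sequence."""
    motifs = None
    for s in seqs:
        L_i = set()  # neighborhood of sequence S_i, duplicates removed
        for p in range(len(s) - l + 1):
            L_i |= neighbors(s[p:p + l], d, alphabet)
        motifs = L_i if motifs is None else motifs & L_i
    return sorted(motifs or [])


# Toy input (made up for illustration).
seqs = ["ACGTACGT", "TTACGAAA", "GGGACGTC"]
print(pms1(seqs, 4, 1))
```

The neighborhood of one l-mer has size on the order of C(l, d)·3^d, which is why the sets grow quickly as d increases.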
Algorithm PMS2

Algorithm PMS2 exploits these observations:


If M occurs in each input sequence, then at least l – k + 1 length-k substrings of M occur in each input sequence.
In each input sequence Sj there must be at least one position ij such that a k-mer of M occurs at each of the positions ij, ij + 1, …, ij + l – k.
Algorithm PMS2



Use a modified PMS1 to solve the planted (d+c, d)-motif problem. Let R contain the (d+c)-motifs.
Find all of the occurrences of the members of R in an arbitrary input sequence Sj. Let Li contain the (d+c)-motifs of R with variants starting at position i of Sj.
For each position i in Sj:
  A is the l-mer of Sj starting at position i.
  M1 and M2 are members of Li and Li', respectively, where i' = i + l – (d+c).
  If the last 2(d+c) – l characters of M1 are equal to the first 2(d+c) – l characters of M2, form an l-mer B by appending the last l – (d+c) characters of M2 to M1.
  If dH(A,B) ≤ d, add B to a list of candidates C.
Once the list of candidates is complete, check whether each candidate is a motif.
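The candidate-building step of PMS2 can be sketched as follows. This is a hedged illustration only: the slides do not say how PMS1 was modified, so the sketch simply intersects neighborhoods of k-mers with k = d + c; it assumes l/2 ≤ k ≤ l so that M1 and M2 overlap or abut, an ACGT alphabet, and hypothetical names (short_motifs, pms2, and the parameter j selecting the sequence Sj).

```python
from itertools import combinations, product


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def kmers(s, k):
    return [s[p:p + k] for p in range(len(s) - k + 1)]


def neighbors(u, d, alphabet="ACGT"):
    """All strings within Hamming distance d of u."""
    out = set()
    r = min(d, len(u))
    for positions in combinations(range(len(u)), r):
        for letters in product(alphabet, repeat=r):
            v = list(u)
            for p, ch in zip(positions, letters):
                v[p] = ch
            out.add("".join(v))
    return out


def is_motif(x, seqs, d):
    return all(any(hamming(x, w) <= d for w in kmers(s, len(x))) for s in seqs)


def short_motifs(seqs, k, d):
    """PMS1-style intersection: the set R of (k, d)-motifs."""
    common = None
    for s in seqs:
        hood = set()
        for w in kmers(s, k):
            hood |= neighbors(w, d)
        common = hood if common is None else common & hood
    return common or set()


def pms2(seqs, l, d, c, j=0):
    k = d + c
    if not (k <= l <= 2 * k):
        raise ValueError("this sketch assumes l/2 <= d + c <= l")
    R = short_motifs(seqs, k, d)
    sj = seqs[j]
    # L[i]: members of R that have a variant starting at position i of Sj
    L = [neighbors(w, d) & R for w in kmers(sj, k)]
    overlap = 2 * k - l
    candidates = set()
    for i in range(len(sj) - l + 1):
        A = sj[i:i + l]
        for M1 in L[i]:
            for M2 in L[i + l - k]:
                # the suffix of M1 must agree with the prefix of M2
                if M1[k - overlap:] == M2[:overlap]:
                    B = M1 + M2[overlap:]  # append the last l - k chars of M2
                    if hamming(A, B) <= d:
                        candidates.add(B)
    # final check: keep only the candidates that really are (l, d)-motifs
    return sorted(B for B in candidates if is_motif(B, seqs, d))


# Toy input (made up for illustration).
seqs = ["ACGTACGT", "TTACGAAA", "GGGACGTC"]
print(pms2(seqs, 4, 1, 2))
```

The nested loop over M1 and M2 grows with the product of the two list sizes, which matches the remark in the results that a low c yields many (d+c, d)-motifs and therefore many candidate strings.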
Results






n = 20 and m = 600; an arbitrary motif was inserted in each input sequence
Each implementation gave the correct planted motif for each (l,d) case
PMS1 was faster than PMS2 for the challenging instances (9,2) and (11,3)
Otherwise, PMS2 could be faster, depending on the value of c
Low values of c lead to a high number of (d+c,d)-motifs, which leads to a high number of candidate strings
Conclusions


PMS1 is better suited for challenge problems
PMS2 is better suited for larger l
Runtimes, in seconds, of algorithms PMS1 and PMS2

(l,d)     PMS1       PMS2
(9,2)     53.343     d+c = 5: 305.672;  d+c = 6: 342.657;  d+c = 7: 394.234;  d+c = 8: 72.609
(10,2)    73.203     d+c = 7: 547.25;   d+c = 8: 72.344;   d+c = 9: 53.828
(11,2)    89.704     d+c = 7: 705.75;   d+c = 8: 70.25;    d+c = 9: 54.046
(12,2)    118.266    d+c = 8: 76.468;   d+c = 9: 54.312;   d+c = 10: 71.531
(11,3)    1076.23    d+c = 10: 1105.03
(12,3)    1552.47    d+c = 10: 1059.83
Future Work
Minimization of Consensus Sequences

Consensus Sequence



An expression that can be used to describe two or more sequences
Two forms:
  {c1, c2, …, cn}: one of the listed characters c is present at that position
  {i1, i2, …, in}c: character c occurs ik times, for one of the listed counts ik
Examples:
  Merging abcde, abccde, abcdee, and abccdee gives ab{1,2}cd{1,2}e
  Merging agtgc and actgc gives a{c,g}tgc
Problem Statement: Output a minimum number of consensus
sequences for a given set of input sequences.
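The two consensus forms can be illustrated with a small Python sketch that reproduces the two examples. It is a simplified assumption about how the notation is produced, not the merging algorithm itself: merge_replacements only handles equal-length sequences that differ by replacements, and merge_repeats only handles sequences that differ in how often a character is repeated. Both function names are hypothetical.

```python
from itertools import groupby


def merge_replacements(seqs):
    """Form {c1,...,cn}: equal-length sequences differing by replacements."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")
    parts = []
    for column in zip(*seqs):
        chars = sorted(set(column))
        parts.append(chars[0] if len(chars) == 1
                     else "{" + ",".join(chars) + "}")
    return "".join(parts)


def merge_repeats(seqs):
    """Form {i1,...,in}c: sequences differing only in run lengths."""
    # Collapse each sequence into (character, run length) pairs.
    runs = [[(ch, len(list(g))) for ch, g in groupby(s)] for s in seqs]
    if len({tuple(ch for ch, _ in r) for r in runs}) != 1:
        raise ValueError("sequences must share the same run structure")
    parts = []
    for column in zip(*runs):
        ch = column[0][0]
        counts = sorted({n for _, n in column})
        parts.append(ch * counts[0] if len(counts) == 1
                     else "{" + ",".join(map(str, counts)) + "}" + ch)
    return "".join(parts)


print(merge_replacements(["agtgc", "actgc"]))                     # a{c,g}tgc
print(merge_repeats(["abcde", "abccde", "abcdee", "abccdee"]))    # ab{1,2}cd{1,2}e
```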
Minimization of Consensus Sequences

Algorithm




To start, all input sequences are “alive”.
An arbitrary alive sequence S is chosen and compared with every other alive sequence T to check if they can be merged.
  Dynamic programming is used to optimally align S and T.
  The optimal alignments of S and T will have loops corresponding to insertions, deletions, and replacements.
  Merging may occur only if all loops can be resolved.
    All mismatches can be resolved.
    Insertions and deletions can be resolved only if there is a match of the inserted/deleted character to the left or right of the loop.
If S and T can be merged, a consensus sequence is generated and added to the list of “alive” sequences. S and T are killed.
This process continues until no two alive sequences can be merged. At that point, all remaining alive input and consensus sequences are output.
Summary




Planted Motif Search Problem: find every l-mer that differs in at most d positions from at least one l-mer in each input sequence
Algorithms PMS1 and PMS2 are based on a model that generates the neighborhood of every input sequence and intersects these neighborhoods to find the motifs
PMS1 is best suited for challenge problems; use PMS2 for larger l
Future work will include the minimization of consensus sequences (regular expressions)
Acknowledgements

Special thanks to:



Sanguthevar Rajasekaran
National Science Foundation
University of Connecticut Department of Computer
Science and Engineering
Levenshtein Distance


Formal definition: the minimum number of edit operations, consisting of insertion (I), deletion (D), and replacement (R), necessary to convert one string into another.
Algorithm
  Let D(i, j) be the edit distance of S1(1…i) and S2(1…j).
  Add a blank space to the beginning of each string and align the strings along the edges of a matrix.
  By definition, D(i, 0) = i and D(0, j) = j.
  Recurrence relation: D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j))
  t(i, j) = 0 if S1[i] = S2[j], 1 otherwise (substitution)
  By definition of D(i, j), D(n, m), where n = |S1| and m = |S2|, is the edit distance of S1 and S2.
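The recurrence can be written as a short dynamic program. The Python sketch below is a minimal illustration; the function name levenshtein is an assumption, and the strings are indexed from 0, so S1[i] in the slide corresponds to s1[i - 1] in the code.

```python
def levenshtein(s1, s2):
    """Edit distance with unit-cost insertion, deletion, and replacement."""
    n, m = len(s1), len(s2)
    # D[i][j] is the edit distance of s1[:i] and s2[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all i characters of the prefix of s1
    for j in range(m + 1):
        D[0][j] = j  # insert all j characters of the prefix of s2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + t)  # match or replacement
    return D[n][m]


print(levenshtein("vintner", "writers"))  # 5
```

The table has (n + 1)(m + 1) cells and each cell is filled in constant time, so the running time is proportional to n·m.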
Example

S1 = vintner, S2 = writers

            w   r   i   t   e   r   s
        0   1   2   3   4   5   6   7
    0   0   1   2   3   4   5   6   7
v   1   1   1   2   3   4   5   6   7
i   2   2   2   2   2   3   4   5   6
n   3   3   3   3   3   3   4   5   6
t   4   4   4   4   4   3   4   5   6
n   5   5   5   5   5   4   4   5   6
e   6   6   6   6   6   5   4   5   6
r   7   7   7   6   7   6   5   4   5

Adapted from Algorithms on Strings, Trees, and Sequences by Dan Gusfield, 1999.

The value in the bottom-right cell gives the Levenshtein distance (5).
Optimal Alignment from Levenshtein Distance

Working from the bottom right of the matrix, insert pointers.
Set a pointer from cell (i, j) to
  Cell (i-1, j) if D(i, j) = D(i-1, j) + 1
    Corresponds to a deletion of S1(i) from S1
  Cell (i, j-1) if D(i, j) = D(i, j-1) + 1
    Corresponds to an insertion of S2(j) into S1
  Cell (i-1, j-1) if D(i, j) = D(i-1, j-1) + t(i, j)
    Corresponds to a match (t = 0) or a replacement (t = 1)
Follow the pointers from D(n, m) to D(0, 0) to get an optimal alignment.
Some cells may have more than one pointer, in which case more than one optimal alignment exists.




3 optimal alignments in the example:

w r i t _ e r s      w r i _ t _ e r s      w r i _ t _ e r s
v i n t n e r _      v _ i n t n e r _      _ v i n t n e r _
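The pointer-following procedure can be sketched as a recursive traceback that enumerates every optimal alignment. This Python code is an illustration under assumptions: the names edit_matrix and optimal_alignments are hypothetical, gaps are written as "_", and the pointers are not stored but recomputed from the matrix at each cell.

```python
def edit_matrix(s1, s2):
    """Fill the (n+1) x (m+1) edit-distance matrix for s1 and s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def optimal_alignments(s1, s2):
    """Return all optimal alignments as (aligned s1, aligned s2) pairs."""
    D = edit_matrix(s1, s2)
    found = []

    def walk(i, j, a1, a2):
        if i == 0 and j == 0:
            found.append((a1, a2))
            return
        if i > 0 and D[i][j] == D[i - 1][j] + 1:      # pointer up: deletion
            walk(i - 1, j, s1[i - 1] + a1, "_" + a2)
        if j > 0 and D[i][j] == D[i][j - 1] + 1:      # pointer left: insertion
            walk(i, j - 1, "_" + a1, s2[j - 1] + a2)
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:        # diagonal: match/replace
                walk(i - 1, j - 1, s1[i - 1] + a1, s2[j - 1] + a2)

    walk(len(s1), len(s2), "", "")
    return found


for a1, a2 in optimal_alignments("vintner", "writers"):
    print(a2)
    print(a1)
    print()
```

For vintner and writers the traceback branches at two cells and yields the three alignments listed above. The number of optimal alignments can grow exponentially with string length, so enumerating all of them is only practical for short inputs.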
Example
            w   r   i   t   e   r   s
        0   1   2   3   4   5   6   7
    0   0   1   2   3   4   5   6   7
v   1   1   1   2   3   4   5   6   7
i   2   2   2   2   2   3   4   5   6
n   3   3   3   3   3   3   4   5   6
t   4   4   4   4   4   3   4   5   6
n   5   5   5   5   5   4   4   5   6
e   6   6   6   6   6   5   4   5   6
r   7   7   7   6   7   6   5   4   5

w r i _ t _ e r s
v _ i n t n e r _
