Exhaustive search - University of Illinois at Urbana

Download Report

Transcript Exhaustive search - University of Illinois at Urbana

Exhaustive search (cont’d)
CS 466
Saurabh Sinha
Chapter 4.5-4.9
Finding motifs ab initio
• Enumerate all possible strings of some
fixed (small) length
• For each such string (“motif”) count its
occurrences in the promoters
• Report the most frequently occurring motif
• Does the true motif pop out ?
Simple statistics
• Consider 10 promoters, each 100 bp long
• Suppose a secret motif ATGCAACT has been “planted”
in each promoter
• Our enumerative method counts every possible “8-mer”
• Expected number of occurrences of an 8-mer is 10 x
100 x (1/4)8 ≈ 0.015
• Most likely, an arbitrary 8-mer will occur at most once,
may be twice
• 10 occurrences of ATGCAACT will stand out
Variation in binding sites
• Motif occurrences will not always be
exact copies of the consensus string
• The transcription factor can usually
tolerate some variability in its binding
sites
• It’s possible that none of the 10
occurrences of our motif ATGCAACT is
actualy this precise string
A new motif model
• To define a motif, lets say we know where the motif
occurrence starts in the sequence
• The motif start positions in their sequences can be
represented as s = (s1,s2,s3,…,st)
Motifs: Profiles and Consensus
a G g t a c T t
C c A
Alignment
a c g
a c g
C c g
t
t
t
t
a
T
C
a
c
A
c
c
g
g
A
g
t
t
t
G
_________________
Profile
A
C
G
T
3
2
0
0
0
4
1
0
1
0
4
0
0
0
0
5
3
1
0
1
1
4
0
0
1
0
3
1
0
0
1
4
_________________
Consensus
A C G T A C G T
• Line up the patterns by
their start indexes
s = (s1, s2, …, st)
• Construct matrix profile
with frequencies of each
nucleotide in columns
• Consensus nucleotide in
each position has the
highest score in column
Profile matrices
• Suppose there were t sequences to begin with
• Consider a column of a profile matrix
• The column may be (t, 0, 0, 0)
– A perfectly conserved column
• The column may be (t/4, t/4, t/4, t/4)
– A completely uniform column
• “Good” profile matrices should have more
conserved columns
Scoring Motifs
l
• Given s = (s1, … st) and DNA
a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________
l
Score(s,DNA) =
 max
count (k , i)
i 1 k{ A,T ,C ,G}
A
C
G
T
3 0 1 0 3 1 1 0
2 4 0 0 1 4 0 0
0 1 4 0 0 0 3 1
0 0 0 5 1 0 1 4
_________________
Consensus
a c g t a c g t
Score
3+4+4+5+3+4+3+4=30
t
Good profile matrices
• Goal is to find the starting positions
s=(s1,…st) to maximize the score(s,DNA)
of the resulting profile matrix
• This is one formulation of the “motif
finding problem”
Another formulation
• Hamming distance between two strings v and w is
dH(v,w) = number of mismatches between v and w
• Given an array of starting positions s=(s1,…st), we
define dH(v, s) = ∑idH(v,si)
• Define: TotalDist(v, DNA) = mins dH(v,s)
• Computing TotalDist is easy
– find closest string to v in each input sequence
The median string problem
• Find v that minimizes TotalDist(v)
• A double minimization (mins, minv)
• Equivalent to motif finding problem
– Show this
Naïve time complexity
• Motif finding problem: Consider every
(s1,…st): O((n-l+1)t)
• Median string problem: Consider every
l-mer: O(4l). Relatively fast !
• Common form of both expressions: Find
a vector of L variables, each variable
can take k values: O(kL)
An algorithm to enumerate !
• Want to generate all strings in {1,2,3,4}L
11…11
11…12
11…13
11…14
.
.
44…44
NEXTLEAF(a, L, k)
for i := L to 1
if ai < k
ai := ai+1
return a
ai := 1
return a
ALLLEAVES(L, k)
a := (1,..,1)
while true
output a
a := NEXTLEAF(a,L,k)
if a = (1,..,1)
return
Increment the least significant digit; and “carry over” to next
position if necessary
“Seach Tree” for enumeration
--
Order of steps
4
1
2
11
2
11
12
1
3
2-
3-
4-
4
3
13
14
2
3
21
4
22
5
23
6
24
7
31
8
32
9
10
33
34
11
41
12
13
42
43
14
44
15
Visiting all the vertices in tree
•
Not just the leaves, but the internal nodes also
–
–
•
•
Why? Why not only the leaves of the tree?
We’ll see later.
PreOrder traversal of a tree
PreOrder(node):
1. Visit node
2. PreOrder(left child of node)
3. PreOrder(right child of node)
•
How about a non- “recursive” version?
Visit the Next Vertex
1. NextVertex(a,i,L,k) // a : the array of digits
2.
if i < L
// i : prefix length
3.
a i+1  1
// L: max length
4.
return ( a,i+1)
// k : max digit value
5. else
6.
for j  L to 1
7.
if aj < k
8.
aj  aj +1
9.
return( a,j )
10. return(a,0)
In words
• If at an internal node, just go one level
deeper.
• If at a leaf node,
– go to next leaf
– if moved to a non-sibling in this process,
jump up
Bypassing
• What if we wish to skip an entire
subtree (at some internal node) during
the tree traversal ?
BYPASS(a, i, L, k)
for j := i to 1
if aj < k
aj := aj+1
return (a,j)
return (a,0)
Brute Force Solution for the
Motif finding problem
1.
2.
3.
4.
5.
6.
7.
8.
9.
BruteForceMotifSearchAgain(DNA, t, n, l)
s  (1,1,…, 1)
bestScore  Score(s,DNA)
while forever
s  NextLeaf (s, t, n-l+1)
if (Score(s,DNA) > bestScore)
bestScore  Score(s, DNA)
bestMotif  (s1,s2 , . . . , st)
return bestMotif
O(l(n-l+1)t)
Can We Do Better?
• Sets of s=(s1, s2, …,st) may have a weak profile for the first
i positions (s1, s2, …,si)
• Every row of alignment may add at most l to Score
• Optimism: if all subsequent (t-i) positions (si+1, …st) add
(t – i ) * l to Score(s,i,DNA)
• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no
sense to search in vertices of the current subtree
– Use ByPass()
• “Branch and bound” strategy
– This saves us from looking at (n – l + 1)t-i leaves
Pseudocode for Branch and Bound Motif Search
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
BranchAndBoundMotifSearch(DNA,t,n,l)
s  (1,…,1)
bestScore  0
i1
while i > 0
if i < t
optimisticScore  Score(s, i, DNA) +(t – i ) * l
if optimisticScore < bestScore
(s, i)  Bypass(s,i, n-l +1)
else
(s, i)  NextVertex(s, i, n-l +1)
else
if Score(s,DNA) > bestScore
bestScore  Score(s)
bestMotif  (s1, s2, s3, …, st)
(s,i)  NextVertex(s,i,t,n-l + 1)
return bestMotif
The median string problem
• Enumerate 4l strings v
• For each v, compute TotalDist(v, DNA)
– This requires linear scan of DNA, i.e., O(nt)
• Overall: O(nt4l)
• Improvement by branch and bound ?
• During enumeration of l-mers, suppose we are at
some prefix v’, and find that TotalDist(v’,DNA) >
BestDistanceSoFar.
• Why enumerate further ?
Bounded Median String Search
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
BranchAndBoundMedianStringSearch(DNA,t,n,l )
s  (1,…,1)
bestDistance  ∞
i1
while i > 0
if i < l
prefix  string corresponding to the first i nucleotides of s
optimisticDistance  TotalDistance(prefix,DNA)
if optimisticDistance > bestDistance
(s, i )  Bypass(s,i, l, 4)
else
(s, i )  NextVertex(s, i, l, 4)
else
word  nucleotide string corresponding to s
if TotalDistance(s,DNA) < bestDistance
bestDistance  TotalDistance(word, DNA)
bestWord  word
(s,i )  NextVertex(s,i,l, 4)
return bestWord
Greedy Algorithms
Chapter 5.5
A greedy approach to the
motif finding problem
• Given t sequences of length n each, to find
a profile matrix of length l.
• Enumerative approach O(l nt)
– Impractical
• Instead consider a more practical algorithm
called “GREEDYMOTIFSEARCH”
Greedy Motif Search
• Find two closest l-mers in sequences 1 and 2 and form
2 x l alignment matrix with Score(s,2,DNA)
• At each of the following t-2 iterations, finds a “best” l-mer in
sequence i from the perspective of the already constructed (i-1)
x l alignment matrix for the first (i-1) sequences
• In other words, it finds an l-mer in sequence i maximizing
Score(s,i,DNA)
under the assumption that the first (i-1) l-mers have been
already chosen
• Sacrifices optimal solution for speed: in fact the bulk of the time
is actually spent locating the first 2 l-mers
Greedy Motif Search
pseudocode
•
•
•
•
•
•
•
GREEDYMOTIFSEARCH (DNA, t, n, l)
bestMotif := (1,…,1)
s := (1,…,1)
for s1=1 to n-l+1
for s2 = 1 to n-l+1
if (Score(s,2,DNA) > Score(bestMotif,2,DNA)
bestMotif1 := s1
bestMotif2 := s2
s1 := bestMotif1; s2 := bestMotif2
for i = 3 to t
for si = 1 to n-l+1
if (Score(s,i,DNA) > Score(bestMotif,i,DNA)
bestMotifi := si
si := bestMotifi
Return bestMotif
A digression
• Score of a profile matrix looks only at
the “majority” base in each column, not
at the entire distribution
• The issue of non-uniform “background”
frequencies of bases in the genome
• A better “score” of a profile matrix ?
Information Content
• First convert a “profile matrix” to a “position
weight matrix” or PWM
– Convert frequencies to probabilities
• PWM W: Wk = frequency of base  at
position k
• q = frequency of base  by chance
• Information content of W:
 
k  {A,C,G,T }
W k log
W k
q
Information Content
• If Wk is always equal to q, i.e., if W is
similar to random sequence, information
content of W is 0.
• If W is different from q, information
content is high.
Greedy Motif Search
• Can be trivially modified to use “Information Content” as
the score
• Use statistical criteria to evaluate significance of
Information Content
• At each step, instead of choosing the top (1) partial
motif, keep the top k partial motifs
– “Beam search”
• The program “CONSENSUS” from Stormo lab.
• Further Reading: Hertz, Hartzell & Stormo, CABIOS
(1990) http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/hertz90.pdf
More on Greedy algorithms in
next lecture