Pattern matching

String Matching
CS 261
Problem of String Matching
• Look for starting index of a string
embedded in a larger string.
• Example: Find index of “sip” in
“mississippi”
• Example: Find location of a base protein
in a DNA molecule
Naïve approach
Loop over starting locations, loop over string,
set flag or similar technique
for (int i = 0; i + target.length <= text.length; i++) {
    boolean found = true;
    for (int j = 0; j < target.length; j++)
        if (text[i+j] != target[j])
            { found = false; break; }
    if (found) { … we found it at index i }
}
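A runnable version of the naive loop might look like the sketch below. The class name `NaiveSearch` and the method name `indexOf` are illustrative choices, not from the slides, and strings are used in place of the character arrays above.

```java
// Minimal sketch of the naive matcher; names are hypothetical.
final class NaiveSearch {
    /** Returns the first index at which target occurs in text, or -1 if absent. */
    static int indexOf(String text, String target) {
        int n = text.length(), m = target.length();
        // Only start positions that leave room for the whole target are tried.
        for (int i = 0; i + m <= n; i++) {
            boolean found = true;
            for (int j = 0; j < m; j++) {
                if (text.charAt(i + j) != target.charAt(j)) {
                    found = false; // mismatch: give up on this start position
                    break;
                }
            }
            if (found) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("mississippi", "sip")); // prints 6
    }
}
```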
Complexity?
• Loops over both text and target, is
O(nm) where n is text length, and m is
target length.
• Seems obvious - but can you do better?
• One problem: every time we fail in the
inner loop, we throw away all previous
information
KMP searching
• A failure does not mean we need to
start over completely from scratch
• Example: suppose we are looking for
the pattern “agacagata” (DNA protein)
• Notice that the beginning is repeated
later in the string. (Common in DNA)
Prefix Patterns
Position  0  1  2  3  4  5  6  7  8  9 10 11
Pattern   a  g  a  c  a  g  a  t  a
Shifted               a  g  a  c  a  g  a  t
If we fail at location 7, we still know that
values in 4, 5 and 6 match the start of
the pattern
Build a prefix table
• As large as the target, tells how far
to slide
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 a  g  c  t  a  g  c  a  g  c  t  a  g  c  t  g
 0  0  0  0  1  2  3  1  2  3  4  5  6  7 ..
If we fail after matching through position 6, the
entry 3 says the last 3 characters we have seen
already match the first 3 characters of the pattern.
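One common way to compute such a table is sketched below. It assumes the usual definition: entry i is the length of the longest proper prefix of the pattern that is also a suffix of the pattern's first i+1 characters. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Minimal sketch of a KMP-style prefix table builder; names are hypothetical.
final class PrefixTable {
    /**
     * table[i] = length of the longest proper prefix of pattern[0..i]
     * that is also a suffix of pattern[0..i].
     */
    static int[] build(String pattern) {
        int[] table = new int[pattern.length()];
        int k = 0; // length of the prefix currently matched
        for (int i = 1; i < pattern.length(); i++) {
            // On a mismatch, fall back to the next shorter candidate prefix.
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = table[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            table[i] = k;
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(build("agacagata")));
        // prints [0, 0, 1, 0, 1, 2, 3, 0, 1]
    }
}
```

For "agacagata" the entry at position 6 is 3, which is the same fact as the earlier slide: positions 4, 5 and 6 match the start of the pattern.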
Builds a fast on-line matcher
• The KMP algorithm works on-line
• That means, it never backs up in the
text stream
• Imagine the characters were coming
down a network, it doesn’t need to
buffer the characters at all.
• So matching is O(n), where n is the text size (after O(m) work to build the table)
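The on-line property can be illustrated with a matcher that is fed one character at a time and keeps only a small amount of state. This is a sketch under assumptions: the class `KmpStream` and its `feed` method are hypothetical names, and the prefix table is built with the standard KMP construction.

```java
// Minimal sketch of an on-line KMP matcher; names are hypothetical.
final class KmpStream {
    private final String pattern;
    private final int[] table;
    private int matched = 0;  // pattern characters matched so far
    private int position = 0; // text characters consumed so far

    KmpStream(String pattern) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must be non-empty");
        }
        this.pattern = pattern;
        this.table = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = table[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            table[i] = k;
        }
    }

    /** Feeds one text character; returns the start index of a match ending here, or -1. */
    int feed(char c) {
        // Slide the pattern using the table; the text is never re-read.
        while (matched > 0 && c != pattern.charAt(matched)) matched = table[matched - 1];
        if (c == pattern.charAt(matched)) matched++;
        position++;
        if (matched == pattern.length()) {
            matched = table[matched - 1]; // allow overlapping matches
            return position - pattern.length();
        }
        return -1;
    }

    public static void main(String[] args) {
        KmpStream kmp = new KmpStream("sip");
        for (char c : "mississippi".toCharArray()) {
            int start = kmp.feed(c);
            if (start >= 0) System.out.println("match at " + start); // prints "match at 6"
        }
    }
}
```

Each call to `feed` sees only the current character, so nothing from the text stream has to be buffered.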
Could you do better?
• You might think that O(n) is best you
can do.
• You have to look at every character at
least once, right?
• Wrong.
Boyer Moore Matching
• What if you try to match from the right
instead of from the left.
• Try looking for “perfect” in the preamble to the U.S. Constitution
We the people of the united st
perfect
Boyer Moore Matching
• Notice that the t doesn’t match the space in the text
• So we can slide the target ahead
We the people of the united st
perfect
Boyer Moore Matching
• In fact, there is no space in the target
• So we don’t have to look at the earlier
characters
We the people of the united st
perfect
Boyer Moore Matching
• So there are characters in the text we
never even look at
• Running time can be better than O(n)! (roughly n/m comparisons in the best case)
We the people of the united st
perfect
Builds a shift table
• Boyer-Moore starts by building a shift
table from the target
• Table tells how far to shift given the
match you have seen so far
• Moves very quickly through the text
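The right-to-left idea can be sketched with the Horspool simplification of Boyer-Moore, which uses a single shift table keyed on the text character under the target's last position. This is not the full Boyer-Moore algorithm, which also uses a second, suffix-based rule. The class name and the `comparisons` counter are hypothetical and exist only to show how few text characters get examined.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of Boyer-Moore-Horspool (a simplified variant); names are hypothetical.
final class HorspoolSearch {
    static int comparisons = 0; // character comparisons made by the last search

    /** Returns the first index at which target occurs in text, or -1 if absent. */
    static int indexOf(String text, String target) {
        int n = text.length(), m = target.length();
        comparisons = 0;
        if (m == 0) return 0;
        // Shift table: for each character in the target (last position excluded),
        // the distance from its last occurrence to the end of the target.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(target.charAt(i), m - 1 - i);
        }
        int pos = 0;
        while (pos + m <= n) {
            int j = m - 1;
            // Compare from the right end of the target.
            while (j >= 0) {
                comparisons++;
                if (text.charAt(pos + j) != target.charAt(j)) break;
                j--;
            }
            if (j < 0) return pos;
            // Characters absent from the table let us skip the whole target length.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "We the people of the united states, in order to form a more perfect union";
        System.out.println("index: " + indexOf(text, "perfect"));
        System.out.println("comparisons: " + comparisons + " of " + text.length() + " characters");
    }
}
```

On this text the search makes fewer comparisons than there are characters, because most alignments are rejected after a single look at a space or another character that is not in "perfect".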
Which is faster
• BM is faster when the input alphabet is
large and the pattern has few different
characters
• KMP is faster when the alphabet is
small (e.g., DNA)
• Both do preprocessing, costly if the text
is small
Another problem
• Looking at DNA for repeated sequences
• Similar to string search, but you don’t
have a target string, only a length.
• There is an obvious O(n²m) algorithm
Obvious algorithm
for (int i = 0; i <= n - m; i++)
    for (int j = i + 1; j <= n - m; j++) {
        boolean found = true;
        for (int k = 0; k < m; k++)
            if (text[i+k] != text[j+k]) {
                found = false; break; }
        if (found) { … // oh goody
        }
    }
Could you do better?
• What about using sorting?
• Make the input into an n by m matrix
• Sort on rows
• Check adjacent rows to see if they match
• Can be O(m n log n)
The matrix
 0    1    2    3    4    5    …    M
 1    2    3    4    5    6    …    M+1
 2    3    4    5    6    7    …
 3    4    5    6    7    8    …
 …
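The sorting idea can be sketched without materializing the matrix: sort the start indices, comparing the m characters each one points at, then check neighbours. The class and method names below are hypothetical, and reporting only adjacent pairs is an assumption about what "check adjacent rows" means.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of finding repeated length-m sequences by sorting; names are hypothetical.
final class RepeatsBySorting {
    /** Returns pairs {i, j} of start indices whose length-m windows are equal and adjacent after sorting. */
    static List<int[]> findRepeats(String text, int m) {
        List<int[]> result = new ArrayList<>();
        int rows = text.length() - m + 1;
        if (m <= 0 || rows < 2) return result;
        // Row i of the conceptual matrix is text[i .. i+m-1]; we store only i.
        Integer[] starts = new Integer[rows];
        for (int i = 0; i < rows; i++) starts[i] = i;
        // O(n log n) comparisons, each costing up to m character checks.
        Arrays.sort(starts, (a, b) -> compareWindows(text, a, b, m));
        for (int r = 1; r < rows; r++) {
            if (compareWindows(text, starts[r - 1], starts[r], m) == 0) {
                result.add(new int[] { starts[r - 1], starts[r] });
            }
        }
        return result;
    }

    /** Compares the two length-m windows starting at a and b, character by character. */
    static int compareWindows(String text, int a, int b, int m) {
        for (int k = 0; k < m; k++) {
            int diff = text.charAt(a + k) - text.charAt(b + k);
            if (diff != 0) return diff;
        }
        return 0;
    }

    public static void main(String[] args) {
        for (int[] pair : findRepeats("mississippi", 3)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
        // prints "1 4" (iss) then "2 5" (ssi)
    }
}
```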
Reduces computation from
hours to minutes
• But could you do better?
• What if you had a lot of memory?
• Think about hash tables
Using hashing
• Take each section of m characters
• Compute hash value h
• Add index to hash bucket h
• Now only need to look for matches if
two values hash into same location
• Can be O(n) assuming you have
enough memory ….
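A minimal sketch of the hashing idea follows, using Java's `HashMap` so that the library handles both the hashing and the check that two sections in the same bucket really are equal. The class and method names are hypothetical. Hashing each section from scratch costs O(m), so this sketch runs in about O(nm) expected time; getting closer to O(n) would need something like a rolling hash, which is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of finding repeated length-m sequences with a hash table; names are hypothetical.
final class RepeatsByHashing {
    /** Maps each length-m section that occurs more than once to its start indices. */
    static Map<String, List<Integer>> findRepeats(String text, int m) {
        Map<String, List<Integer>> buckets = new HashMap<>();
        if (m <= 0) return buckets;
        for (int i = 0; i + m <= text.length(); i++) {
            // Take the section of m characters and add its index to that section's bucket.
            String section = text.substring(i, i + m);
            buckets.computeIfAbsent(section, key -> new ArrayList<>()).add(i);
        }
        // Keep only sections seen at two or more positions.
        buckets.values().removeIf(indices -> indices.size() < 2);
        return buckets;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> repeats = findRepeats("mississippi", 3);
        System.out.println(repeats.get("iss")); // prints [1, 4]
        System.out.println(repeats.get("ssi")); // prints [2, 5]
    }
}
```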
Bottom line
• Obvious algorithms are not always
fastest
• But good algorithms usually require
knowing something about the input.