Transcript Slides

CS 3343: Analysis of
Algorithms
Lecture 26: String Matching Algorithms
Definitions
• Text: a longer string T
• Pattern: a shorter string P
• Exact matching: find all occurrences of P in T
[Figure: a text T (length m) containing several occurrences of the pattern P = a b a (length n)]
The naïve algorithm
[Figure: the pattern P = a b a (length n) is slid along the text T (length m) one position at a time and compared at every alignment]
Time complexity
• Worst case: O(mn)
• Best case: O(m)
– aaaaaaaaaaaaaa vs. baaaaaaa
• Average case?
– Alphabet size = k
– Assume equal probability
– How many chars do you need to compare before finding a mismatch?
• On average: k / (k-1)
• Therefore average-case complexity: mk / (k-1)
• For large alphabet, ~ m
– Not as bad as you thought, huh?
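A minimal Python sketch of the naïve algorithm, counting character comparisons so the worst and best cases above can be checked. The function name `naive_search` and the 0-based indexing are choices made for this sketch, not part of the slides.

```python
def naive_search(t, p):
    """Try every alignment of p in t; return (0-based hit positions, comparisons made)."""
    m, n = len(t), len(p)
    hits = []
    comparisons = 0
    for s in range(m - n + 1):          # s = left end of the current alignment
        j = 0
        while j < n:
            comparisons += 1
            if t[s + j] != p[j]:        # first mismatch ends this alignment
                break
            j += 1
        else:                           # loop finished without a mismatch
            hits.append(s)
    return hits, comparisons


if __name__ == "__main__":
    print(naive_search("bbaba", "aba"))
    print(naive_search("a" * 25, "aaaab"))      # every alignment costs n comparisons
    print(naive_search("a" * 14, "baaaaaaa"))   # every alignment fails immediately
```

On an all-a text with P = aaaab, every alignment costs n comparisons, while the best-case pair costs one comparison per alignment.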
Real strings are not random
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab
Plus: O(m) average case is still bad for long
strings!
Smarter algorithms:
O(m + n) in worst case
sub-linear in practice
how is this possible?
How to speed up?
• Pre-processing T or P
• Why can pre-processing save us time?
– Uncovers the structure of T or P
– Determines when we can skip ahead without missing
anything
– Determines when we can infer the result of character
comparisons without actually doing them.
ACGTAXACXTAXACGXAX
ACGTACA
Cost for exact string matching
Total cost = cost(preprocessing)   [overhead]
+ cost(comparison)   [minimize]
+ cost(output)   [constant]
Hope: gain > overhead
String matching scenarios
• One T and one P
– Search a word in a document
• One T and many P all at once
– Search a set of words in a document
– Spell checking
• One fixed T, many P
– Search a completed genome for a short sequence
• Two (or many) T’s for common patterns
• Would you preprocess P or T?
• Always pre-process the shorter seq, or the one
that is repeatedly used
Pattern pre-processing algs
– Karp – Rabin algorithm
• Small alphabet and small pattern
– Boyer – Moore algorithm
• The choice of most cases
• Typically sub-linear time
– Knuth-Morris-Pratt algorithm (KMP)
– Aho-Corasick algorithm
• The algorithm for the unix utility fgrep
– Suffix tree
• One of the most useful preprocessing techniques
• Many applications
Algorithm KMP
• Not the fastest
• Best known
• Good for “real-time matching”
– i.e. text comes one char at a time
– No memory of previous chars
• Idea
– Left-to-right comparison
– Shift P more than one char whenever possible
Intuitive example 1
[Figure: T contains abcxabc followed by a char that mismatches P = abcxabcde at P[8]. The naïve approach then tries P at four successive one-char shifts.]
• Observation: by reasoning on the pattern alone, we can
determine that if a mismatch happened when comparing
P[8] with T[i], we can shift P by four chars, and compare
P[4] with T[i], without missing any possible matches.
• Number of comparisons saved: 6
Intuitive example 2
[Figure: T contains abcxab followed by a char that mismatches P = abcxabcde at P[7], so that text char should not be a c. The naïve approach then tries P at six successive one-char shifts.]
• Observation: by reasoning on the pattern alone, we can
determine that if a mismatch happened between P[7]
and T[j], we can shift P by six chars and compare T[j]
with P[1] without missing any possible matches
• Number of comparisons saved: 7
KMP algorithm: pre-processing
• Key: the reasoning is done without even knowing what string T is.
• Only the location of mismatch in P must be known.
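[Figure: P aligned under T with a mismatch between text char x and pattern char y = P[i+1]. t = P[j..i] is a suffix of P[1..i] that also occurs as a prefix t' of P, and t' is followed by z.]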
Pre-processing: for any position i in P, find P[1..i]'s longest proper
suffix, t = P[j..i], such that t matches a prefix of P, t', and the next
char of t is different from the next char of t' (i.e., y ≠ z)
For each i, let sp(i) = length(t)
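The definition of sp(i) can be sketched in Python as follows. This is one common way to compute it, not necessarily the procedure used in the lecture: first the plain longest-border lengths (called `weak_sp` here, a name chosen for this sketch), then the refinement that enforces y ≠ z. Lists are 0-based, so entry i-1 holds sp(i).

```python
def weak_sp(p):
    """weak[i-1] = length of the longest proper suffix of P[1..i] that is a prefix of P."""
    n = len(p)
    weak = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and p[i] != p[k]:
            k = weak[k - 1]
        if p[i] == p[k]:
            k += 1
        weak[i] = k
    return weak


def sp_values(p):
    """sp[i-1] = sp(i): as weak_sp, but the char after t must differ from the char after t'."""
    n = len(p)
    weak = weak_sp(p)
    sp = [0] * n
    for i, k in enumerate(weak):
        # p[k] is the char after the prefix t' (z); p[i + 1] is the char after the suffix t (y)
        if k == 0 or i + 1 == n or p[k] != p[i + 1]:
            sp[i] = k
        else:
            sp[i] = sp[k - 1]           # fall back to the next shorter candidate
    return sp


if __name__ == "__main__":
    print(sp_values("aataac"))
    print(sp_values("abababc"))
```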
KMP algorithm: shift rule
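[Figure: after the mismatch between x and y = P[i+1], P is shifted right so that the prefix t' = P[1..sp(i)] lines up with the text that was matched by t. The text char x is then compared with z.]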
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the
right by i – sp(i) chars and compare x with z.
This shift rule can be implicitly represented by creating a failure link between y
and z. Meaning: when a mismatch occurs between x on T and P[i+1], resume
comparison between x and P[sp(i)+1].
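The shift rule can be sketched as a search loop in Python. The helper `failure_links` recomputes the sp values so the block runs on its own. All names and the 0-based result positions are assumptions of this sketch, and a non-empty pattern is assumed.

```python
def failure_links(p):
    """sp values as defined above (0-based list: entry i-1 holds sp(i))."""
    n = len(p)
    weak = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and p[i] != p[k]:
            k = weak[k - 1]
        if p[i] == p[k]:
            k += 1
        weak[i] = k
    sp = [0] * n
    for i, k in enumerate(weak):
        if k == 0 or i + 1 == n or p[k] != p[i + 1]:
            sp[i] = k
        else:
            sp[i] = sp[k - 1]
    return sp


def kmp_search(t, p):
    """Return the 0-based start positions of p in t."""
    if not p:
        return []
    sp = failure_links(p)
    hits = []
    q = 0                                # number of pattern chars currently matched
    for pos, ch in enumerate(t):
        while q > 0 and p[q] != ch:      # follow failure links: resume at P[sp(q)+1]
            q = sp[q - 1]
        if p[q] == ch:
            q += 1
        if q == len(p):                  # full match ending at pos
            hits.append(pos - len(p) + 1)
            q = sp[q - 1]
    return hits


if __name__ == "__main__":
    print(kmp_search("aacaataaaaataaccttacta", "aataac"))
```

The text pointer `pos` never moves backwards, which is why this scheme suits text that arrives one char at a time.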
Failure Link Example
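P: aataac
pos i:  1 2 3 4 5 6
P[i]:   a a t a a c
sp(i):  0 1 0 0 2 0
sp(2) = 1: the prefix a continues as aa, the suffix a continues as at
sp(5) = 2: the prefix aa continues as aat, the suffix aa continues as aac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)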
Another example
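P: abababc
pos i:  1 2 3 4 5 6 7
P[i]:   a b a b a b c
sp(i):  0 0 0 0 0 4 0
i = 3: ab vs. ab, same next char, so the candidate is rejected
i = 5: abab vs. abab, same next char, so the candidate is rejected
i = 6: ababa vs. ababc, different next char, so sp(6) = 4
If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)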
KMP Example using Failure Link
T: aacaataaaaataaccttacta
P: aataac
[Figure: P is aligned against T step by step. ^ marks a matching comparison, * a mismatch, and . an implicit comparison that is skipped thanks to the failure link.]
Time complexity analysis:
• Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
• More careful analysis: the comparisons can be broken into two phases:
• Comparison phase: the first time a char in T is compared to P. Total is exactly m.
• Shift phase: first comparisons made after a shift. Total is at most m.
• Time complexity: O(2m) = O(m)
KMP algorithm using DFA
(Deterministic Finite Automaton)
P: aataac
• Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3
• DFA: if the next char in T is t after matching 5 chars, go to state 3
[Figure: DFA with states 0 to 6. The forward edges spell a, a, t, a, a, c, and extra edges on a and t lead back to earlier states.]
All other inputs go to state 0.
DFA Example
[Figure: the DFA for P = aataac, states 0 to 6]
T: aacaataataataaccttacta
1201234534534560001001
Each char in T will be examined exactly once.
Therefore, exactly m comparisons are made.
But it takes longer to do pre-processing, and needs
more space to store the FSA.
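A small Python sketch of the DFA approach, using a standard construction that copies the transitions of a "restart" state. The names `build_dfa` and `run_dfa` are assumptions of this sketch. Chars that do not occur in P are not stored and are treated as going to state 0.

```python
def build_dfa(p):
    """dfa[q][c] = next state after reading c in state q (q = number of chars matched)."""
    n = len(p)
    dfa = [{c: 0 for c in set(p)} for _ in range(n + 1)]
    dfa[0][p[0]] = 1
    x = 0                                # state reached by feeding p[1:q] to the DFA
    for q in range(1, n + 1):
        dfa[q] = dict(dfa[x])            # on a mismatch, behave like the restart state
        if q < n:
            dfa[q][p[q]] = q + 1         # on a match, advance
            x = dfa[x][p[q]]
    return dfa


def run_dfa(dfa, t):
    """Return the state after each char of t; chars outside the pattern go to state 0."""
    state, trace = 0, []
    for ch in t:
        state = dfa[state].get(ch, 0)
        trace.append(state)
    return trace


if __name__ == "__main__":
    trace = run_dfa(build_dfa("aataac"), "aacaataataataaccttacta")
    print("".join(map(str, trace)))
```

Each text char causes exactly one table lookup, at the price of one transition table per state.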
Difference between Failure Link
and DFA
• Failure link
– Preprocessing time and space are O(n), regardless of
alphabet size
– Comparison time is at most 2m (at least m)
• DFA
– Preprocessing time and space are O(n ||)
• May be a problem for very large alphabet size
• For example, each “char” is a big integer
• Chinese characters
– Comparison time is always m.
The set matching problem
• Find all occurrences of a set of patterns in T
• First idea: run KMP or BM for each P
– O(km + n)
• k: number of patterns
• m: length of text
• n: total length of patterns
• Better idea: combine all patterns together and
search in one run
A simpler problem: spell-checking
• A dictionary contains five words:
– potato
– poetry
– pottery
– science
– school
• Given a document, check if any word is (not) in
the dictionary
– Words in document are separated by special chars.
– Relatively easy.
Keyword tree for spell checking
This version of the potato gun was inspired by the Weird Science team out of Illinois
[Figure: keyword tree (trie) for the five dictionary words. Words sharing a prefix share a path from the root, and each word ends at a numbered leaf.]
O(n) time to construct. n: total length of patterns.
Search time: O(m). m: length of text
Common prefixes only need to be compared once.
What if there is no space between words?
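For the easy case where words are separated, the keyword tree can be sketched with nested dictionaries in Python. The function names and the use of the empty string as an end-of-word marker are choices made for this sketch.

```python
import re


def build_keyword_tree(words):
    """Nested dicts: one level per character; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes reuse the same path
        node[""] = True
    return root


def in_dictionary(root, word):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return "" in node


def known_words(root, document):
    """Split the document on non-letters and keep the words found in the tree."""
    return [w for w in re.findall(r"[a-z]+", document.lower()) if in_dictionary(root, w)]


if __name__ == "__main__":
    tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
    text = ("This version of the potato gun was inspired by "
            "the Weird Science team out of Illinois")
    print(known_words(tree, text))
```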
Aho-Corasick algorithm
• Basis of the fgrep algorithm
• Generalizing KMP
– Using failure links
• Example: given the following 4 patterns:
– potato
– tattoo
– theater
– other
Keyword tree
[Figure: keyword tree with root 0 holding the four patterns. The leaves numbered 1 to 4 mark where each pattern ends.]
potherotathxythopotattooattoo
Searching by restarting from the root at every position of the text: O(mn)
m: length of text. n: length of longest pattern
Keyword Tree with a failure link
[Figure: the keyword tree with a single failure link added, traced on the text potherotathxythopotattooattoo]
Keyword Tree with all failure links
[Figure: the keyword tree with a failure link from every node to the node for its longest proper suffix that is also a prefix of some pattern]
Example
[Figure: five frames tracing the search of potherotathxythopotattooattoo through the keyword tree, following failure links on mismatches]
Aho-Corasick algorithm
• O(n) preprocessing, and O(m+k)
searching.
– n: total length of patterns.
– m: length of text
– k: number of occurrences.
• Can create a DFA similar to the one in KMP.
– Requires more space,
– Preprocessing time depends on alphabet size
– Search time is constant per text char
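A compact Python sketch of the failure-link version of Aho-Corasick, run on the four example patterns. The list-based node layout and the names `build_automaton` and `ac_search` are assumptions of this sketch rather than the lecture's own data structure.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links. State 0 is the root."""
    goto = [{}]        # goto[s][ch] -> child state
    fail = [0]         # fail[s] -> state for the longest proper suffix that is in the tree
    out = [[]]         # out[s] -> patterns that end at state s (directly or via failure links)
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    queue = deque(goto[0].values())      # depth-1 nodes fail to the root
    while queue:                         # breadth-first, so shallower nodes are done first
        s = queue.popleft()
        for ch, child in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def ac_search(text, patterns):
    """Return (0-based start, pattern) for every occurrence, in order of end position."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits


if __name__ == "__main__":
    print(ac_search("potherotathxythopotattooattoo",
                    ["potato", "tattoo", "theater", "other"]))
```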
Suffix Tree
• All algorithms we talked about so far preprocess
pattern(s)
– Karp-Rabin: small pattern, small alphabet
– Boyer-Moore: fastest in practice. O(m) worst case.
– KMP: O(m)
– Aho-Corasick: O(m)
• In some cases we may prefer to pre-process T
– Fixed T, varying P
• Suffix tree: basically a keyword tree of all
suffixes
Suffix tree
• T: xabxac
• Suffixes:
1. xabxac
2. abxac
3. bxac
4. xac
5. ac
6. c
[Figure: suffix tree of xabxac. Each leaf is numbered 1 to 6 by the starting position of its suffix.]
Naïve construction: O(m^2) using Aho-Corasick.
Smarter: O(m). Very technical. big constant factor
Difference from a keyword tree: create an internal
node only when there is a branch
Suffix tree implementation
• Explicitly labeling seq end
• T: xabxa → T: xabxa$
[Figure: suffix tree of xabxa next to the suffix tree of xabxa$. With the end marker $, every suffix ends at its own leaf, numbered 1 to 5.]
Suffix tree implementation
• Implicitly labeling edges
• T: xabxa$
[Figure: the suffix tree of xabxa$ drawn twice, once with substrings on the edges and once with each edge label stored as a pair of indices into T, such as 1:2, 2:2 and 3:$]
Suffix links
• Similar to failure link in a keyword tree
• Only link internal nodes having branches
[Figure: a suffix tree fragment with suffix links drawn between branching internal nodes]
Suffix tree construction
1234567890
acatgacatt
[Figure: ten frames that insert the suffixes of acatgacatt one at a time, starting with suffix 1. Edges carry implicit labels such as 5:$, and an existing edge is split whenever a new suffix branches off in the middle of it.]
ST Application 1: pattern matching
• Find all occurrences of P = xa in T
– Find node v in the ST that matches P
– Traverse the subtree rooted at v to get the locations
T: xabxac
[Figure: suffix tree of xabxac with leaves 1 to 6]
• O(m) to construct ST (large constant factor)
• O(n) to find v – linear in the length of P instead of T!
• O(k) to get all leaves, k is the number of occurrences.
• Asymptotic time is the same as KMP. ST wins if T is fixed.
KMP wins otherwise.
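The two steps (find v, then collect the leaves below it) can be sketched in Python. For brevity the tree is built with the naïve quadratic insertion, not the linear-time construction, and edges are stored implicitly as index pairs into T. The class and function names are assumptions of this sketch, and T is assumed not to contain the end marker $.

```python
class Node:
    def __init__(self):
        self.children = {}   # first char of edge -> (start, end, child), edge label t[start:end]
        self.leaf = None     # 1-based start of the suffix if this node is a leaf


def build_suffix_tree(text):
    """Naive O(m^2) construction: insert each suffix of text + '$' from the root."""
    t = text + "$"
    root = Node()
    for i in range(len(text)):
        node, pos = root, i
        while True:
            edge = node.children.get(t[pos])
            if edge is None:                       # no edge to follow: hang a new leaf here
                leaf = Node()
                leaf.leaf = i + 1
                node.children[t[pos]] = (pos, len(t), leaf)
                break
            start, end, child = edge
            k = 0
            while start + k < end and t[start + k] == t[pos + k]:
                k += 1
            if start + k == end:                   # whole edge matched: descend
                node, pos = child, pos + k
                continue
            mid = Node()                           # mismatch inside the edge: split it
            mid.children[t[start + k]] = (start + k, end, child)
            node.children[t[pos]] = (start, start + k, mid)
            node, pos = mid, pos + k
    return root, t


def find_occurrences(root, t, p):
    """Walk down along p, then report every leaf below (1-based positions, sorted)."""
    node, i = root, 0
    while i < len(p):
        edge = node.children.get(p[i])
        if edge is None:
            return []
        start, end, child = edge
        for k in range(start, end):
            if i == len(p):
                break
            if t[k] != p[i]:
                return []
            i += 1
        node = child
    found, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.leaf is not None:
            found.append(n.leaf)
        stack.extend(c for _, _, c in n.children.values())
    return sorted(found)


if __name__ == "__main__":
    root, t = build_suffix_tree("xabxac")
    print(find_occurrences(root, t, "xa"))
```

Once the tree exists, each new pattern costs time proportional to its own length plus the number of hits, which is the reason the suffix tree wins when T is fixed.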
ST Application 2: set matching
• Find all occurrences of a set of patterns in T
– Build a ST from T
– Match each P to ST
T: xabxac
P: xab
[Figure: suffix tree of xabxac with leaves 1 to 6]
• O(m) to construct ST (large constant factor)
• O(n) to find v – linear in the total length of the P’s
• O(k) to get all leaves, k is the number of occurrences.
• Asymptotic time is the same as Aho-Corasick. ST wins if T fixed.
AC wins if P’s are fixed. Otherwise depending on relative size.
ST application 3: repeats finding
• Genome contains many repeated DNA
sequences
• Repeat sequence length: Varies from 1
nucleotide to millions
– Genes may have multiple copies (50 to 10,000)
– Highly repetitive DNA in some non-coding regions
• 6 to 10bp x 100,000 to 1,000,000 times
• Problem: find all repeats that are at least k residues long and appear at least p times in the
genome
Repeats finding
• at least k-residues long and appear at
least p times in the seq
– Phase 1: top-down, count label lengths (L)
from root to each node
– Phase 2: bottom-up: count # of leaves
descended from each internal node
– For each node with L >= k and N >= p, print all leaves
– O(m) to traverse the tree
[Figure: suffix tree in which each internal node is annotated with its pair (L, N)]
Maximal repeats finding
Example: acatgacatt (1. cat, 2. aca, 3. acat)
1. Right-maximal repeat
– S[i+1..i+k] = S[j+1..j+k],
– but S[i+k+1] != S[j+k+1]
2. Left-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– But S[i] != S[j]
3. Maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– But S[i] != S[j], and S[i+k+1] != S[j+k+1]
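The three definitions can be checked directly on the example string with a brute-force Python sketch. It does not use a suffix tree. The function names are made up for this sketch, and treating a string boundary as a char that differs from every other char is an assumption of the sketch.

```python
from itertools import combinations


def occurrences(s, sub):
    """0-based start positions of sub in s (overlaps allowed)."""
    return [i for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i)]


def classify(s, sub):
    """Check the left-/right-/maximal definitions over every pair of occurrences."""
    k = len(sub)
    result = {"left": False, "right": False, "maximal": False}
    for i, j in combinations(occurrences(s, sub), 2):       # i < j
        left_differs = i == 0 or s[i - 1] != s[j - 1]
        right_differs = j + k == len(s) or s[i + k] != s[j + k]
        result["left"] = result["left"] or left_differs
        result["right"] = result["right"] or right_differs
        result["maximal"] = result["maximal"] or (left_differs and right_differs)
    return result


if __name__ == "__main__":
    for sub in ("cat", "aca", "acat"):
        print(sub, classify("acatgacatt", sub))
```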
Maximal repeats finding
1234567890
acatgacatt
[Figure: suffix tree of acatgacatt with implicit edge labels such as 5:e and leaves 1 to 10]
• Find repeats with at least 3 bases and 2 occurrences
– right-maximal: cat
– Maximal: acat
– left-maximal: aca
Maximal repeats finding
1234567890
acatgacatt
[Figure: the same suffix tree with every leaf annotated by its left char. Leaf 1 has none (Left char = []).]
• How to find a maximal repeat?
– A right-maximal repeat whose occurrences have different left chars
ST application 4: word enumeration
• Find all k-mers that occur at least p times
– Compute (L, N) for each node
• L: total label length from root to node
• N: # leaves
– Find nodes v with L>=k, and L(parent)<k, and N>=p
– Traverse sub-tree rooted at v to get the locations
[Figure: suffix tree cut at depth k. Nodes above the cut have L<k, and the reported nodes just below it have L>=k, N>=p.]
This can be used in many applications. For example, to find
words that appeared frequently in a genome or a document
Joint Suffix Tree
• Build a ST for two or more strings
• Two strings S1 and S2
• S* = S1 & S2
• Build a suffix tree for S* in time O(|S1| + |S2|)
• The separator will only appear in the edge ending in a leaf
• S1 = abcd
• S2 = abca
• S* = abcd&abca$
[Figure: suffix tree of abcd&abca$. Each leaf is labeled (string, position), from 1,1 to 1,4 and 2,1 to 2,4. The branch that starts with the separator & is marked useless.]
To Simplify
[Figure: the joint suffix tree before (left) and after (right) simplification. On the right, each leaf edge is cut at the separator &, leaving leaves 1,1 to 1,4 and 2,1 to 2,4.]
• We don’t really need to do anything, since all edge
labels were implicit.
• The right hand side is more convenient to look at
Application of JST
• Longest common substring (not subsequence)
– For each internal node v, keep a bit vector B
– B[1] = 1 if a child of v is a suffix of S1, and likewise B[2] = 1 for S2
– Find all internal nodes with B[1] = B[2] = 1
– Report one with the longest label
– Can be extended to k sequences. Just use a longer bit vector.
[Figure: the simplified joint suffix tree of abcd and abca]
Application of JST
• Given K strings, find all k-mers that appear
in at least d strings
[Figure: a node with L >= k whose parent has L < k. Its leaves 1,x 3,x 3,x 4,x give B = (1, 0, 1, 1), and the node is reported if cardinal(B) >= d.]
Many other applications
• Reproduce the behavior of Aho-Corasick
• Recognizing computer virus
– A database of known computer viruses
– Does a file contain a virus?
• DNA fingerprinting
– A database of people’s DNA sequence
– Given a short DNA, which person is it from?
• …
• Catch
– Large constant factor for space requirement
– Large constant factor for construction
– Suffix array: trade off time for space
Summary
• One T, one P
– Boyer-Moore is the choice
– KMP works but not the best
Alphabet independent
• One T, many P
– Aho-Corasick
– Suffix Tree
Alphabet dependent
• One fixed T, many varying P
– Suffix tree
• Two or more T’s
– Suffix tree, joint suffix tree, suffix array
Pattern pre-processing algs
– Karp – Rabin algorithm
• Small alphabet and small pattern
– Boyer – Moore algorithm
• The choice of most cases
• Typically sub-linear time
– Knuth-Morris-Pratt algorithm (KMP)
– Aho-Corasick algorithm
• The algorithm for the unix utility fgrep
– Suffix tree
• One of the most useful preprocessing techniques
• Many applications
Karp – Rabin Algorithm
• Let’s say we are dealing with binary
numbers
Text: 01010001011001010101001
Pattern: 101100
• Convert pattern to integer
101100 = 2^5 + 2^3 + 2^2 = 44
Karp – Rabin algorithm
Text: 10111011001010101001
Pattern: 101100 = 44 decimal
(brackets mark the current 6-char window)
[101110]11001010101001 = 2^5 + 0 + 2^3 + 2^2 + 2^1 = 46
1[011101]1001010101001 = 46 * 2 - 64 + 1 = 29
10[111011]001010101001 = 29 * 2 - 0 + 1 = 59
101[110110]01010101001 = 59 * 2 - 64 + 0 = 54
1011[101100]1010101001 = 54 * 2 - 64 + 0 = 44
Θ(m+n)
Karp – Rabin algorithm
What if the pattern is too long to fit into a single integer?
Pattern: 101100. What if each word in our computer has only 4 bits?
Basic idea: hashing. 44 % 13 = 5
[101110]11001010101001 = 46 (% 13 = 7)
1[011101]1001010101001 = 46 * 2 - 64 + 1 = 29 (% 13 = 3)
10[111011]001010101001 = 29 * 2 - 0 + 1 = 59 (% 13 = 7)
101[110110]01010101001 = 59 * 2 - 64 + 0 = 54 (% 13 = 2)
1011[101100]1010101001 = 54 * 2 - 64 + 0 = 44 (% 13 = 5)
Θ(m+n) expected
running time
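The rolling-hash idea can be sketched in Python. The defaults (base 2, modulus 13, digits read with `int`) mirror the binary example above. The function name, its parameters, and the explicit re-check of each candidate window are choices made for this sketch.

```python
def karp_rabin(text, pattern, base=2, mod=13, value=int):
    """Return 0-based start positions of pattern in text using a rolling hash."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    high = pow(base, n - 1, mod)         # weight of the leading digit of a window
    hp = ht = 0
    for i in range(n):                   # hash of the pattern and of the first window
        hp = (hp * base + value(pattern[i])) % mod
        ht = (ht * base + value(text[i])) % mod
    hits = []
    for s in range(m - n + 1):
        # Equal hashes only suggest a match, so verify the window to rule out collisions.
        if hp == ht and text[s:s + n] == pattern:
            hits.append(s)
        if s + n < m:                    # drop the leading digit, shift, append the next one
            ht = ((ht - value(text[s]) * high) * base + value(text[s + n])) % mod
    return hits


if __name__ == "__main__":
    print(karp_rabin("10111011001010101001", "101100"))
    # For general strings, one possible setting is a larger base and modulus:
    print(karp_rabin("abracadabra", "abra", base=256, mod=1_000_003, value=ord))
```

With modulus 13, the windows 101110 and 111011 both hash to 7, which shows why a hash match alone is not proof of a string match.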
Boyer – Moore algorithm
• Three ideas:
– Right-to-left comparison
– Bad character rule
– Good suffix rule
Boyer – Moore algorithm
• Right to left comparison
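[Figure: P is compared with T from right to left. After a mismatch between text char x and pattern char y, P jumps ahead.]
Skip some chars without missing any occurrence.
But how?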
Bad character rule
   0        1
   123456789012345678
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
What would you do now?
P:     tpabxab
             *
P:            tpabxab
Basic bad character rule
P: tpabxab
char   Right-most position in P
a      6
b      7
p      2
t      1
x      5
Pre-processing: O(n)
Basic bad character rule
       k
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
When the rightmost T(k) in P is to the left of i, shift pattern P to align T(k) with the rightmost T(k) in P
i = 3 and the rightmost t in P is at pos 1, so shift 3 – 1 = 2
Basic bad character rule
             k
T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
P:            tpabxab
When T(k) is not in P, shift the left end of P to align with T(k+1)
i = 7 and q does not occur in P, so shift 7 – 0 = 7
Basic bad character rule
               k
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:          tpabxab
When the rightmost T(k) in P is to the right of i, shift pattern P by one pos
i = 5 and the rightmost a in P is at pos 6: 5 – 6 < 0, so shift 1
Extended bad character rule
               k
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab
Find the T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position
i = 5 and the nearest a to its left is at pos 3: 5 – 3 = 2, so shift 2
char   Positions in P
a      6, 3
b      7, 4
p      2
t      1
x      5
Preprocessing still O(n)
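Both variants of the bad character rule can be sketched in Python. The function names are hypothetical, positions are 1-based as on the slides, and `bad_char_search` is a simplified right-to-left search that uses only the extended rule, so it is not a full Boyer-Moore.

```python
def positions_table(p):
    """Map each char to its 1-based positions in p, rightmost first."""
    table = {}
    for idx, ch in enumerate(p, start=1):
        table.setdefault(ch, []).insert(0, idx)
    return table


def basic_shift(table, i, ch):
    """Mismatch at pattern pos i against text char ch: use only the rightmost occurrence."""
    rightmost = table.get(ch, [0])[0]        # 0 if ch does not occur in P
    return max(1, i - rightmost)


def extended_shift(table, i, ch):
    """Use the closest occurrence of ch to the left of pos i."""
    for pos in table.get(ch, []):
        if pos < i:
            return i - pos
    return i                                 # no such occurrence: move P past the text char


def bad_char_search(t, p):
    """Right-to-left comparison with the extended rule only (0-based hit positions)."""
    table, n, hits, s = positions_table(p), len(p), [], 0
    while s + n <= len(t):
        i = n
        while i >= 1 and p[i - 1] == t[s + i - 1]:
            i -= 1
        if i == 0:
            hits.append(s)
            s += 1
        else:
            s += extended_shift(table, i, t[s + i - 1])
    return hits


if __name__ == "__main__":
    table = positions_table("tpabxab")
    print(basic_shift(table, 3, "t"), basic_shift(table, 7, "q"),
          basic_shift(table, 5, "a"), extended_shift(table, 5, "a"))
```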
Extended bad character rule
• Best possible: m / n comparisons
• Works better for large alphabet size
• In some cases the extended bad character
rule is sufficiently good
• Worst-case: O(mn)
• What else can we do?
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
According to the extended bad character rule, shift 1:
P:    qcabdabdab
(weak) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:      qcabdabdab
(Weak) good suffix rule
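[Figure: a suffix t of P has matched T and the next comparison, text char x against pattern char y, fails. P is shifted so that t', the rightmost other copy of t in P, lines up with the matched text.]
Preprocessing:
For any suffix t of P, find the rightmost copy of t, denoted by t'.
How to find t' efficiently?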
(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab
(Strong) good suffix rule
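[Figure: as for the weak rule, but the char z to the left of the copy t' must differ from the char y to the left of t (z ≠ y)]
In preprocessing:
For any suffix t of P, find the rightmost copy of t, t', such that the
char to the left of t ≠ the char to the left of t'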
• Pre-processing can be done in linear time
• If P in T, searching may take O(mn)
• If P not in T, searching in worst-case is O(m+n)
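The difference between the weak and the strong rule can be illustrated with a brute-force Python sketch. It is deliberately not the linear-time preprocessing mentioned above. The function name and its `None` return value (for the case where no copy t' exists, which the slides do not cover) are assumptions of this sketch, as is accepting a copy that starts at the very beginning of P.

```python
def good_suffix_shift(p, i, strong=True):
    """Shift after P[i+1..n] matched and P[i] mismatched (1-based i, 1 <= i < n).

    Scans candidate copies t' of the matched suffix t from right to left.
    Returns None if P contains no suitable copy.
    """
    n = len(p)
    t = p[i:]                            # the matched suffix P[i+1..n]
    length = len(t)
    for end in range(n - 1, length - 1, -1):     # 1-based end position of the candidate copy
        start = end - length                     # 0-based start of the candidate copy
        if p[start:end] != t:
            continue
        # Strong rule: the char before t' must differ from the mismatched char P[i].
        # A copy at the very start of P has no char before it and is accepted (assumption).
        if not strong or start == 0 or p[start - 1] != p[i - 1]:
            return n - end
    return None


if __name__ == "__main__":
    print(good_suffix_shift("qcabdabdab", 8, strong=False))
    print(good_suffix_shift("qcabdabdab", 8, strong=True))
```

For the example P = qcabdabdab with a mismatch at pos 8, the weak rule finds the copy of ab at pos 6 to 7, while the strong rule skips it because it is also preceded by d and uses the copy at pos 3 to 4 instead.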
Example preprocessing
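P: qcabdabdab
Bad char rule (where to shift depends on T)
char   Positions in P
a      9, 6, 3
b      10, 7, 4
c      2
d      8, 5
q      1
Good suffix rule (does not depend on T)
pos:   1 2 3 4 5 6 7 8 9 10
P:     q c a b d a b d a b
value: 0 0 0 0 0 0 0 2 0 0
At pos 8 the good suffix is ab preceded by d (dab). Its rightmost copy preceded by a different char is in cab, so the entry is 2.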