CS 5263 Bioinformatics
Exact String Matching Algorithms
Overview
• Sequence alignment: two sub-problems:
– How to score an alignment with errors
– How to find an alignment with the best score
• Today: exact string matching
– Does not allow any errors
– Efficiency becomes the sole consideration
• Time and space
Why exact string matching?
• The most fundamental string comparison
problem
• Often the core of more complex string
comparison algorithms
– E.g., BLAST
• Often repeatedly called by other methods
– Usually the most time consuming part
– Small improvement could improve overall
efficiency considerably
Definitions
• Text: a longer string T (length m)
• Pattern: a shorter string P (length n)
• Exact matching: find all occurrences of P in T
[Figure: pattern P (length n) aligned against text T (length m) at a candidate position]
The naïve algorithm
[Figure: the naïve algorithm slides P one position at a time along T, comparing left to right at each offset]
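The naïve scan can be written down directly (a minimal Python sketch; the function name is ours):

```python
def naive_match(T, P):
    """Slide P along T one position at a time; at each offset,
    compare left to right until a mismatch or a full match."""
    occurrences = []
    for s in range(len(T) - len(P) + 1):   # candidate start positions
        i = 0
        while i < len(P) and T[s + i] == P[i]:
            i += 1
        if i == len(P):                    # all n chars matched
            occurrences.append(s)
    return occurrences
```

For example, `naive_match("bbaba", "aba")` reports the single (0-indexed) occurrence at position 2.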
Time complexity
• Worst case: O(mn)
• Best case: O(m)
e.g. T = aaaaaaaaaaaaaa, P = baaaaaaa (the first comparison fails at every alignment)
• Average case?
– Alphabet {A, C, G, T}
– Assume both P and T are random
– Each character equally likely
– On average, how many chars do you need to
compare before giving up?
Average case time complexity
P(mismatch at 1st position): 3/4
P(mismatch at 2nd position): 1/4 · 3/4
P(mismatch at 3rd position): (1/4)² · 3/4
P(mismatch at kth position): (1/4)^(k-1) · 3/4
Expected number of comparisons per position, with match probability p = 1/4:
Σ_k k · (1-p) · p^(k-1) = 1/(1-p) = 4/3
Average complexity: 4m/3
Not as bad as you thought it might be
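The geometric sum above can be checked numerically (a quick sketch; the infinite series is truncated at a large k, which changes nothing at double precision):

```python
# Expected comparisons per alignment for random DNA: each position
# matches with probability p = 1/4, so exactly k comparisons are
# made with probability (1-p) * p**(k-1).
p = 1 / 4
E = sum(k * (1 - p) * p ** (k - 1) for k in range(1, 200))
print(E)  # ≈ 4/3
```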
Biological sequences are not
random
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab
Plus: 4m/3 average case is still bad for long
genomic sequences!
Especially if this has to be done again and again
Smarter algorithms:
O(m + n) in worst case
sub-linear in practice
How to speedup?
• Pre-processing T or P
• Why can pre-processing save us time?
– Uncovers the structure of T or P
– Determines when we can skip ahead without missing
anything
– Determines when we can infer the result of character
comparisons without doing them.
ACGTAXACXTAXACGXAX
ACGTACA
Cost for exact string matching
Total cost = cost(preprocessing) + cost(comparison) + cost(output)
– preprocessing is the overhead
– comparison is what we try to minimize
– output is constant per occurrence
Hope: gain > overhead
String matching scenarios
• One T and one P
– Search a word in a document
• One T and many P all at once
– Search a set of words in a document
– Spell checking (fixed P)
• One fixed T, many P
– Search a completed genome for short sequences
• Two (or many) T’s for common patterns
• Q: Which one to pre-process?
• A: Always pre-process the shorter seq, or the
one that is repeatedly used
Pre-processing algorithms
• Pattern preprocessing
– Knuth-Morris-Pratt algorithm (KMP)
– Aho-Corasick algorithm
• Multiple patterns
– Boyer-Moore algorithm (discussed only if we have time)
• The choice in most cases
• Typically sub-linear time
• Text preprocessing
– Suffix tree / array
• Very powerful method for many purposes
– Burrows-Wheeler transform
• Key in next-generation short sequence read alignment
Algorithm KMP: Intuitive example 1
abcxabc
T
mismatch
P
abcxabcde
Naïve approach:
T
abcxabc
?
abcxabcde
abcxabcde
abcxabcde
abcxabcde
• Observation: by reasoning on the pattern alone, we can
determine that if a mismatch happened when comparing
P[8] with T[i], we can shift P by four chars, and compare
P[4] with T[i], without missing any possible matches.
• Number of comparisons saved: 6
Intuitive example 2

T: …abcxab?…      (the ? is known not to be a c)
P:  abcxabcde     (mismatch when comparing P[7] with T[j])

Naïve approach (shift by 1 each time):
P:   abcxabcde
P:    abcxabcde
P:     abcxabcde
P:      abcxabcde
P:       abcxabcde
P:        abcxabcde
• Observation: by reasoning on the pattern alone, we can
determine that if a mismatch happened between P[7]
and T[j], we can shift P by six chars and compare T[j]
with P[1] without missing any possible matches
• Number of comparisons saved: 7
KMP algorithm: pre-processing
• Key: the reasoning is done without even knowing what string T is.
• Only the location of the mismatch in P must be known.

[Figure: P[1..i] has a suffix t = P[j..i] that matches a prefix t' of P; y is the char after t, z is the char after t', and y ≠ z]

Pre-processing: for any position i in P, find P[1..i]'s longest proper
suffix, t = P[j..i], such that t matches a prefix of P, t', and the next
char of t is different from the next char of t' (i.e., y ≠ z)
For each i, let sp(i) = length(t)
KMP algorithm: shift rule

[Figure: after a mismatch at P[i+1] against T[k], P is shifted so that the prefix t' (length sp(i)) re-aligns with the suffix t just matched in T]

Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the
right by i – sp(i) chars and compare x with z.
This shift rule can be implicitly represented by creating a failure link between y
and z. Meaning: when a mismatch occurs between x on T and P[i+1], resume
comparison between x and P[sp(i)+1].
Why is KMP correct?
• Theorem: for any alignment of P with T, if P[1..i] matches T[k-i..k-1],
but P[i+1] != T[k], then P can be shifted by i – sp(i) to the right
without missing any occurrence of P in T.
• Proof by contradiction (sketch):
– sp(i) = |t|, where t is the longest suffix of P[1..i] matching a prefix of P
– Suppose an occurrence of P is missed, i.e., P occurs starting strictly between the alignments before and after the shift
– That occurrence would overlap T[k-i..k-1], so a suffix of P[1..i] longer than t would match a prefix of P
– Then sp(i) > |t|, a contradiction
– Therefore, the KMP shift rule is correct.

[Figure: a hypothetical missed occurrence of P lying strictly between the pre-shift and post-shift alignments]
Failure Link Example
P: aataac

i:     1 2 3 4 5 6
P[i]:  a a t a a c
sp(i): 0 1 0 0 2 0

If a char in T fails to match at pos 6, re-compare it with the
char at pos 3 (= sp(5) + 1 = 2 + 1)
Another example
P: abababc

i:     1 2 3 4 5 6 7
P[i]:  a b a b a b c
sp(i): 0 0 0 0 0 4 0

If a char in T fails to match at pos 7, re-compare it with the
char at pos 5 (= sp(6) + 1 = 4 + 1)
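The sp(i) values in both tables can be computed straight from the definition (an O(n²) sketch for clarity — the linear-time route via Z-values comes later; the function name is ours):

```python
def strong_sp(P):
    """sp(i): length of the longest proper suffix of P[:i] that matches
    a prefix of P, with differing next characters (the strong rule).
    Follows the definition directly; O(n^2)."""
    n = len(P)
    sp = [0] * (n + 1)                       # sp[i] for i = 1..n
    for i in range(2, n + 1):
        for L in range(i - 1, 0, -1):        # candidate lengths, longest first
            if P[i - L:i] == P[:L]:          # suffix of P[:i] equals a prefix
                # strong condition: char after the prefix copy must differ
                # from the char after the suffix (no condition at i = n)
                if i == n or P[L] != P[i]:
                    sp[i] = L
                    break
    return sp[1:]
```

On the two slide examples this reproduces the tables: `strong_sp("aataac")` gives `[0, 1, 0, 0, 2, 0]` and `strong_sp("abababc")` gives `[0, 0, 0, 0, 0, 4, 0]`.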
KMP Example using Failure Link
T: aacaataaaaataaccttacta
P: aataac

[Figure: successive alignments of P = aataac against T; ^ marks an explicit comparison, . marks an implicit (skipped) comparison, * marks a mismatch]

Time complexity analysis:
• Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
• More careful analysis: the comparisons can be broken into two phases:
• Comparison phase: the first time a char in T is compared to P. Total is exactly m.
• Shift phase: comparisons made after a shift. Total is at most m.
• Time complexity: O(2m) = O(m)
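The search loop itself is short. A sketch with 0-indexed failure values (here the weak variant f(i), the longest proper suffix of P[1..i] that is also a prefix, which gives the same O(2m) bound; the strong sp values would plug in the same way):

```python
def kmp_search(T, P):
    """Return all 0-indexed occurrences of P in T in O(m + n)."""
    n = len(P)
    f = [0] * (n + 1)              # weak failure values, f[i] for P[:i]
    k = 0
    for i in range(1, n):          # build failure links over P itself
        while k and P[i] != P[k]:
            k = f[k]
        if P[i] == P[k]:
            k += 1
        f[i + 1] = k
    occ, q = [], 0                 # q = number of chars of P matched
    for j, c in enumerate(T):
        while q and c != P[q]:     # follow failure links on mismatch
            q = f[q]
        if c == P[q]:
            q += 1
        if q == n:                 # full match ending at j
            occ.append(j - n + 1)
            q = f[q]               # keep going for overlapping matches
    return occ
```

On the slide's example, `kmp_search("aacaataaaaataaccttacta", "aataac")` returns `[9]`.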
KMP algorithm using DFA
(Deterministic Finite Automaton)
P: aataac

Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3.
DFA: states 0–6, one per number of chars matched. If the next char in T is t after matching 5 chars, go to state 3. All other inputs go to state 0.

[Figure: DFA with states 0–6 for P = aataac; match edges advance one state, the remaining edges encode the failure behavior]
DFA Example

[Figure: the DFA for P = aataac]

T:      aacaataataataaccttacta
States: 1201234534534560001001

Each char in T will be examined exactly once.
Therefore, exactly m comparisons are made.
But it takes longer to do pre-processing, and more space is needed to store the DFA.
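The DFA can be built in O(n|Σ|) and reproduces the state sequence above (a sketch with dict-based transition tables; names are ours):

```python
def build_dfa(P, alphabet):
    """KMP-style matching DFA: dfa[q][c] = length of the longest prefix
    of P that is a suffix of the text read so far, after reading c in
    state q."""
    n = len(P)
    dfa = [{c: 0 for c in alphabet} for _ in range(n + 1)]
    dfa[0][P[0]] = 1
    x = 0                           # state reached on P[1:j] (the "restart" state)
    for j in range(1, n + 1):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]   # mismatch (and post-accept) transitions
        if j < n:
            dfa[j][P[j]] = j + 1    # match transition advances one state
            x = dfa[x][P[j]]
    return dfa

def run_dfa(dfa, T):
    q, states = 0, []
    for c in T:                     # exactly one table lookup per char of T
        q = dfa[q][c]
        states.append(q)
    return states
```

Running it on the slide's text reproduces the state string: `run_dfa(build_dfa("aataac", "act"), "aacaataataataaccttacta")` yields `1201234534534560001001`.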
Difference between Failure Link
and DFA
• Failure link
– Preprocessing time and space are O(n), regardless of
alphabet size
– Comparison time is at most 2m (at least m)
• DFA
– Preprocessing time and space are O(n ||)
• May be a problem for very large alphabet size
• For example, each “char” is a big integer
• Chinese characters
– Comparison time is always m.
Boyer – Moore algorithm
• Often the choice of algorithm for many
cases
– One T and one P
– We will talk about it later if we have time
– In practice sub-linear
The set matching problem
• Find all occurrences of a set of patterns in T
• First idea: run KMP or BM for each P
– O(km + n)
• k: number of patterns
• m: length of text
• n: total length of patterns
• Better idea: combine all patterns together and
search in one run
A simpler problem: spell-checking
• A dictionary contains five words:
– potato
– poetry
– pottery
– science
– school
• Given a document, check if any word is (not) in
the dictionary
– Words in document are separated by special chars.
– Relatively easy.
Keyword tree for spell checking

Example document: "This version of the potato gun was inspired by the Weird Science team out of Illinois"

[Figure: keyword tree (trie) for potato (1), poetry (2), pottery (3), science (4), school (5); words sharing a prefix share a path from the root]

• O(n) time to construct. n: total length of patterns.
• Search time: O(m). m: length of text.
• A common prefix only needs to be compared once.
• What if there is no space between words?
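A keyword tree is just a trie of the dictionary words. A minimal sketch (nested-dict representation; names are ours):

```python
def build_keyword_tree(words):
    """Nested-dict trie; '$' marks a word end. Shared prefixes
    (e.g. pot-ato / pot-tery) are stored only once."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
        node["$"] = True
    return root

def in_dictionary(root, word):
    """Walk the tree; O(len(word)) per lookup."""
    node = root
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return "$" in node
```

With the five-word dictionary above, `in_dictionary` accepts "potato" but rejects the prefix "pot" and the extension "potatoes".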
Aho-Corasick algorithm
• Basis of the fgrep algorithm
• Generalizing KMP
– Using failure links
• Example: given the following 4 patterns:
– potato
– tattoo
– theater
– other
Keyword tree

[Figure: keyword tree for the patterns potato (1), tattoo (2), theater (3), other (4), built from the root (node 0)]

T: potherotathxythopotattooattoo

Naïve search with the keyword tree alone: O(mn)
m: length of text. n: length of longest pattern
Keyword Tree with failure links

[Figure: the same keyword tree with failure links added; a failure link points from the node for a string s to the node for the longest proper suffix of s that is also in the tree]

T: potherotathxythopotattooattoo
Example

[Figure: scanning T = potherotathxythopotattooattoo with the keyword tree, following failure links on mismatches; the occurrences of "other" and "tattoo" are found]
Aho-Corasick algorithm
• O(n) preprocessing, and O(m+k) searching.
– n: total length of patterns.
– m: length of text.
– k: number of occurrences.
• Can create a DFA similar to the one in KMP.
– Requires more space
– Preprocessing time depends on alphabet size
– Search time is always m
• Q: Where can this algorithm be used in previous topics?
• A: BLAST
– Given a query sequence, we generate many seed sequences (k-mers)
– Search for exact matches to these seed sequences
– Extend exact matches into longer inexact matches
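The whole construction fits in a short sketch: build the keyword tree, set failure links by BFS (parents before children), then scan T once (data layout and names are ours):

```python
from collections import deque

def aho_corasick(patterns):
    """Keyword tree (goto), failure links (fail), and output sets (out)."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:                      # build the trie: O(n) total
        v = 0
        for c in p:
            if c not in goto[v]:
                goto.append({}); fail.append(0); out.append([])
                goto[v][c] = len(goto) - 1
            v = goto[v][c]
        out[v].append(p)
    q = deque(goto[0].values())             # depth-1 nodes fail to the root
    while q:                                # BFS: shallower nodes first
        v = q.popleft()
        for c, u in goto[v].items():
            q.append(u)
            f = fail[v]
            while f and c not in goto[f]:   # follow failure links up
                f = fail[f]
            fail[u] = goto[f].get(c, 0)
            out[u] += out[fail[u]]          # inherit suffix matches
    return goto, fail, out

def ac_search(T, patterns):
    """All (position, pattern) matches in one left-to-right scan of T."""
    goto, fail, out = aho_corasick(patterns)
    v, hits = 0, []
    for i, c in enumerate(T):
        while v and c not in goto[v]:
            v = fail[v]
        v = goto[v].get(c, 0)
        hits += [(i - len(p) + 1, p) for p in out[v]]
    return hits
```

On the slide's example, `ac_search("potherotathxythopotattooattoo", ["potato", "tattoo", "theater", "other"])` finds "other" at (0-indexed) position 1 and "tattoo" at position 18.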
Suffix Tree
• All algorithms we talked about so far preprocess
pattern(s)
– Boyer-Moore: fastest in practice. O(m) worst case.
– KMP: O(m)
– Aho-Corasick: O(m)
• In some cases we may prefer to pre-process T
– Fixed T, varying P
• Suffix tree: basically a keyword tree of all
suffixes
Suffix tree
• T: xabxac
• Suffixes:
1. xabxac
2. abxac
3. bxac
4. xac
5. ac
6. c

[Figure: suffix tree of xabxac; each leaf is labeled with the starting position of its suffix]

Naïve construction: O(m²) using Aho-Corasick.
Smarter: O(m). Very technical, with a big constant factor.
Difference from a keyword tree: create an
internal node only when there is a branch
Suffix tree implementation
• Explicitly labeling the sequence end
• T: xabxa$

[Figure: without a terminator, the suffixes xa and a end in the middle of edges; appending $ makes every suffix end at its own leaf]

• One-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes
Suffix tree implementation
• Implicitly labeling edges
• T: xabxa$

[Figure: each edge label is stored as a pair of indices into T, e.g. 1:2 for xa and 3:$ for bxa$]

• |Tree(T)| = O(|T| + size(edge labels))
Suffix links
• Similar to failure links in a keyword tree
• Only internal nodes having branches are linked

[Figure: suffix links between internal nodes of the tree; matching P = xabcf can follow suffix links to avoid re-scanning]
ST Application 1: pattern matching
• Find all occurrences of P = xa in T
– Find the node v in the ST whose path matches P
– Traverse the subtree rooted at v to get the locations

T: xabxac

[Figure: the path xa leads to node v; the leaves below v are suffixes 1 and 4, the occurrences of xa]

• O(m) to construct the ST (large constant factor)
• O(n) to find v – linear in the length of P instead of T!
• O(k) to get all leaves, k is the number of occurrences.
• Asymptotic time is the same as KMP. ST wins if T is fixed.
KMP wins otherwise.
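Conceptually, node v represents exactly the suffixes of T that begin with P, and its leaves are their start positions. A deliberately naïve sketch of that operation (a plain list of suffixes stands in for the tree, so this is O(mn) rather than O(n + k); names are ours):

```python
def suffixes(T):
    """All (suffix, start position) pairs of T -- stand-ins for the leaves."""
    return [(T[i:], i) for i in range(len(T))]

def occurrences(T, P):
    # "finding node v" = keeping the suffixes that start with P;
    # "collecting the leaves under v" = reporting their start positions
    return sorted(i for s, i in suffixes(T) if s.startswith(P))
```

For T = xabxac and P = xa this returns positions 0 and 3 (the slide's 1-indexed leaves 1 and 4).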
ST Application 2: set matching
• Find all occurrences of a set of patterns in T
– Build a ST from T
– Match each P to the ST

T: xabxac
P: xab

[Figure: matching xab along the tree; the leaf below gives the occurrence]

• O(m) to construct the ST (large constant factor)
• O(n) to find v – linear in the total length of the P's
• O(k) to get all leaves, k is the number of occurrences.
• Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed.
AC wins if the P's are fixed. Otherwise it depends on relative size.
ST application 3: repeats finding
• Genomes contain many repeated DNA
sequences
• Repeat sequence length: varies from 1
nucleotide to millions
– Genes may have multiple copies (50 to 10,000)
– Highly repetitive DNA in some non-coding regions
• 6 to 10 bp x 100,000 to 1,000,000 times
• Problem: find all repeats that are at least k residues
long and appear at least p times in the genome
Repeats finding
• At least k residues long and appearing at
least p times in the seq
– Phase 1: top-down, count label lengths (L)
from the root to each node
– Phase 2: bottom-up, count the number of leaves (N)
descended from each internal node
• For each node with L >= k and N >= p,
print all leaves
• O(m) to traverse the tree
Maximal repeats finding
Example: S = acatgacatt
1. Right-maximal repeat
– S[i+1..i+k] = S[j+1..j+k],
– but S[i+k+1] != S[j+k+1]
– e.g., cat
2. Left-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– but S[i] != S[j]
– e.g., aca
3. Maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– but S[i] != S[j], and S[i+k+1] != S[j+k+1]
– e.g., acat
Maximal repeats finding

Position: 1234567890
S:        acatgacatt

[Figure: suffix tree of acatgacatt; leaves are labeled with suffix start positions]

• Find repeats with at least 3 bases and 2 occurrences
– Right-maximal: cat
– Maximal: acat
– Left-maximal: aca
Maximal repeats finding

Position: 1234567890
S:        acatgacatt

[Figure: the same suffix tree with each leaf annotated by the character to the left of its suffix, e.g. Left char = [g, c, c, a, a]]

• How to find maximal repeats?
– A right-maximal repeat whose occurrences have different left chars
ST application 4: word enumeration
• Find all k-mers that occur at least p times
– Compute (L, N) for each node
• L: total label length from root to node
• N: # leaves
– Find nodes v with L >= k, L(parent) < k, and N >= p
– Traverse the sub-tree rooted at v to get the locations

[Figure: nodes with L < k lie above the frontier where L first reaches k; below it, report nodes with L >= k and N >= p]

This can be used in many applications. For example, to find
words that appear frequently in a genome or a document.
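For short texts, the same enumeration can be sketched with a hash of k-mers (the suffix tree does this in O(m) even when k is large; names are ours):

```python
from collections import defaultdict

def frequent_kmers(T, k, p):
    """All k-mers of T occurring at least p times, with their
    0-indexed start positions."""
    positions = defaultdict(list)
    for i in range(len(T) - k + 1):       # every k-mer of T
        positions[T[i:i + k]].append(i)
    return {w: pos for w, pos in positions.items() if len(pos) >= p}
```

On the repeats example, `frequent_kmers("acatgacatt", 3, 2)` returns `aca` and `cat`, each with its two positions.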
Joint Suffix Tree (JST)
• Build a ST for two or more strings
• Two strings S1 and S2
• S* = S1 & S2
• Build a suffix tree for S* in time O(|S1| + |S2|)
• The separator & will only appear in edges
ending in a leaf (why?)
Joint suffix tree example
• S1 = abcd
• S2 = abca
• S* = abcd&abca$

[Figure: suffix tree of abcd&abca$; each leaf is labeled (seq ID, suffix ID), e.g. (1,1) for abcd&abca$ and (2,1) for abca$; edges containing the separator, such as &abcd, lead to "useless" leaves like (2,0)]
To Simplify

[Figure: left, the full JST with &-containing edge labels; right, the simplified tree obtained by truncating every edge label at the separator &]

• We don't really need to do anything, since all edge
labels are implicit.
• The right hand side is more convenient to look at
Application 1 of JST
• Longest common substring between
two sequences
• Using Smith-Waterman
– Gap = mismatch = -infinity
– Quadratic time
• Using JST
– Linear time
– For each internal node v, keep a bit
vector B
– B[1] = 1 if a child of v is a suffix of S1 (similarly B[2] for S2)
– Bottom-up: find all internal nodes with
B[1] = B[2] = 1
– Report such a node with the longest
label
– Can be extended to k sequences. Just
use a bit vector of size k.

[Figure: JST of abcd and abca; internal nodes with leaves from both sequences are highlighted]
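For comparison, the quadratic baseline mentioned above can be sketched as plain DP over substring endings (the JST achieves the same result in linear time; names are ours):

```python
def longest_common_substring(s1, s2):
    """O(|s1|*|s2|) DP: cur[j] = length of the common substring
    ending at s1[i-1] and s2[j-1]."""
    best, end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1      # extend the diagonal run
                if cur[j] > best:
                    best, end = cur[j], i     # remember where it ends in s1
        prev = cur
    return s1[end - best:end]
```

On the JST example, `longest_common_substring("abcd", "abca")` returns `"abc"`.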
Application 2 of JST
• Given K strings, find all k-mers that appear
in at least (or at most) d strings
• Exact motif finding problem

[Figure: generalized suffix tree over K = 4 strings; each node carries a bit vector B of the sequences below it, e.g. B = BitOR(1010, 0011) = 1011 with cardinal(B) = 3; report nodes with L >= k, L(parent) < k, and cardinal(B) >= d]
Application 3 of JST
• Substring problem for sequence databases
– Given: a fixed database of sequences (e.g., individual genomes)
– Given: a short pattern (e.g., DNA signature)
– Q: Does this DNA signature belong to any individual in the
database?
• i.e., is the pattern a substring of some sequence in the database?
• Aho-Corasick doesn't work (it preprocesses the patterns, not the database)
– This can also be used to design signatures for individuals
• Build a JST for the database seqs
• Match P to the JST
• Find seq IDs from the descendant leaves

Seqs: abcd, abca
P1: cd
P2: bc

[Figure: matching cd and bc in the JST of abcd and abca; the descendant leaves give the sequence IDs]
Application 4 of JST
• Detect DNA contamination
– For some reason, when we try to clone and sequence a genome, DNA
from other sources may contaminate our sample, and it should
be detected and removed
– Given: a fixed database of sequences (e.g., possible contamination
sources)
– Given: a DNA sequence just obtained
– Q: Does this DNA contain a long enough substring from the seqs in the
database?
• Build a JST for the database seqs
• Scan T using the JST

Contamination sources: abcd, abca
Sequence: dbcgaabctacgtctagt

[Figure: scanning the sequence against the JST of the contamination sources]
Suffix Tree Memory Footprint
• The space requirements of suffix trees
can become prohibitive
– |Tree(T)| is about 20|T| in practice
• Suffix arrays provide one solution.
Suffix Arrays
• Very space efficient (m integers)
• Pattern lookup is nearly O(n) in practice
– O(n + log₂ m) worst case with 2m additional integers
– Independent of alphabet size!
• Easiest to describe (and construct) using suffix trees
– Other (slower) methods exist

T: xabxa$
Suffixes: 1. xabxa  2. abxa  3. bxa  4. xa  5. a

Pos:  5    2      3     4    1
      a$   abxa$  bxa$  xa$  xabxa$

[Figure: a lexical depth-first traversal of the suffix tree of xabxa$ visits the leaves in sorted order]
Suffix array construction
• Build suffix tree for T$
• Perform “lexical” depth-first search of
suffix tree
– output the suffix label of each leaf
encountered
• Therefore suffix array can be constructed
in O(m) time.
Suffix array pattern search
• If P is in T, then all the locations of P are
consecutive suffixes in Pos.
• Do binary search in Pos to find P!
– Compare P with suffix Pos(m/2)
– If lexicographically less, P is in the first half of Pos
– If lexicographically more, P is in the second half of Pos
– Iterate!
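Both steps can be sketched directly (naïve O(m² log m) construction by sorting the suffixes; the slides' O(m) route goes through the suffix tree; names are ours):

```python
def suffix_array(T):
    """Pos: suffix start positions in lexicographic order of the suffixes."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, sa, P):
    """Binary search for the run of suffixes starting with P: O(n log m)."""
    n = len(P)
    lo, hi = 0, len(sa)
    while lo < hi:                           # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if T[sa[mid]:sa[mid] + n] < P:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                           # just past the last prefix match
        mid = (lo + hi) // 2
        if T[sa[mid]:sa[mid] + n] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])              # all occurrences of P
```

For T = xabxa$ the array is `[5, 4, 1, 2, 3, 0]` (the $ leaf plus the slide's Pos = 5 2 3 4 1 in 0-indexed form), and searching xa returns `[0, 3]`.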
Suffix array pattern search
• T: xabxa$
• P: abx

[Figure: binary search over Pos = [5, 2, 3, 4, 1] with L, M, R pointers; P = abx matches the suffix starting at position 2]
Suffix array binary search
• How long to compare P with suffix of T?
– O(n) worst case!
• Binary search on Pos takes O(n log m) time
• Worst case will be rare
– occur if many long prefixes of P appear in T
• In random or large alphabet strings
– expect to do less than log m comparisons
• O(n + log m) running time when combined with
LCP table
– suffix tree = suffix array + LCP table
Summary
• One T, one P
– Boyer-Moore is the choice
– KMP works but is not the best
• One T, many P
– Aho-Corasick
– Suffix tree (array)
• One fixed T, many varying P
– Suffix tree (array)
• Two or more T's
– Suffix tree, joint suffix tree
(KMP / Boyer-Moore / Aho-Corasick preprocessing is alphabet independent;
suffix trees and arrays are alphabet dependent)
Boyer – Moore algorithm
• Three ideas:
– Right-to-left comparison
– Bad character rule
– Good suffix rule
Boyer – Moore algorithm
• Right-to-left comparison

[Figure: P is compared to T right to left; on a mismatch, P can be shifted several chars and comparison resumes, skipping some chars of T without missing any occurrence]
Bad character rule

            11111111
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^

What would you do now?
Bad character rule

            11111111
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
P:     tpabxab
Bad character rule

            111111111
   123456789012345678
T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
P:            tpabxab
Basic bad character rule
P: tpabxab

char  rightmost position in P
a     6
b     7
p     2
t     1
x     5

Pre-processing: O(n)
Basic bad character rule

       k
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^

When the rightmost T(k) in P is left of i, shift P
to align T(k) with the rightmost T(k) in P.
i = 3; rightmost t in P is at position 1.
Shift 3 – 1 = 2
P:     tpabxab
Basic bad character rule

             k
T: xpbctbxabpqqaabpqz
P:     tpabxab
             *

When T(k) is not in P, shift the left end of P
to align with T(k+1).
i = 7; shift 7 – 0 = 7
P:            tpabxab
Basic bad character rule

                k
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^

When the rightmost T(k) in P is right of i, shift P by 1.
i = 5; rightmost a in P is at position 6: 5 – 6 < 0, so shift 1.
P:          tpabxab
Extended bad character rule

                k
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^

Find the T(k) in P that is immediately left of i;
shift P to align T(k) with that position.
i = 5; the nearest a left of position 5 in P is at position 3:
5 – 3 = 2, so shift 2.
P:           tpabxab

char  positions in P
a     6, 3
b     7, 4
p     2
t     1
x     5

Preprocessing is still O(n)
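The basic rule alone already gives a working (if worst-case quadratic) matcher (a sketch; the full Boyer-Moore combines it with the good suffix rule; names are ours):

```python
def bm_bad_char(T, P):
    """Boyer-Moore using only the basic bad character rule."""
    right = {c: i for i, c in enumerate(P)}   # rightmost 0-indexed position
    n, m = len(P), len(T)
    occ, k = [], 0
    while k <= m - n:
        i = n - 1
        while i >= 0 and P[i] == T[k + i]:    # compare right to left
            i -= 1
        if i < 0:
            occ.append(k)                     # full match at offset k
            k += 1
        else:
            # align the rightmost copy of the bad char with T[k+i];
            # shift at least 1 (covers the "rightmost is to the right" case)
            k += max(1, i - right.get(T[k + i], -1))
    return occ
```

On the slide's text, `bm_bad_char("xpbctbxabpqqaabpqz", "tpabxab")` correctly reports no occurrence.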
Extended bad character rule
• Best possible: m / n comparisons
• Works better for a large alphabet
• In some cases the extended bad character
rule is sufficiently good
• Worst case: O(mn)
– Expected time is sublinear

T: prstabstubabvqxrst
P:   qcabdabdab
            *^^

According to the extended bad character rule, shift 1:
P:    qcabdabdab
(Weak) good suffix rule

T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:      qcabdabdab
(Weak) good suffix rule

[Figure: after matching suffix t and mismatching at x against y, P is shifted so that the rightmost copy t' of t in P aligns with the t just matched in T]

Preprocessing:
For any suffix t of P, find the
rightmost copy of t, denoted by t'.
How to find t' efficiently?
(Strong) good suffix rule

T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
(Strong) good suffix rule

T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab
(Strong) good suffix rule

[Figure: after matching suffix t with mismatch y, P is shifted so that the rightmost copy t' of t whose preceding char z differs (z ≠ y) aligns with the t just matched in T]

In preprocessing:
For any suffix t of P, find the
rightmost copy of t, t', such that the
char left of t ≠ the char left of t'
Example preprocessing
P: qcabdabdab

Bad char rule:
char  positions in P
a     9, 6, 3
b     10, 7, 4
c     2
d     8, 5
q     1
(Where to shift depends on T)

Good suffix rule:
pos:   1 2 3 4 5 6 7 8 9 10
P:     q c a b d a b d a b
value: 0 0 0 0 2 0 0 2 0 0
(e.g., for suffix dabdab the differently-preceded copy is cabdab;
for suffix dab it is cab)
(Does not depend on T)

The largest shift given by either the (extended) bad char rule or the
(strong) good suffix rule is used.
Time complexity of BM algorithm
• Pre-processing can be done in linear time
• With the strong good suffix rule, the worst case is
O(m) if P is not in T
– If P is in T, the worst case can be O(mn)
– e.g., T = aaa…a (100 a's), P = aaaaaaaaaa
– unless a modification is used (Galil's rule)
• Proofs are technical. Skip.
How to actually do pre-processing?
• Similar pre-processing for KMP and B-M
– Find matches between a suffix and a prefix

[Figure: KMP aligns a prefix t' with a suffix t ending at i; B-M aligns a suffix t with an internal copy t'. For each i, find a j, similar to DP, starting from i = 2]

– Both can be done in linear time
– P is usually short, so even a more expensive preprocessing may result in an overall gain
Fundamental pre-processing

[Figure: Z_i is the length of the longest substring t starting at i that matches a prefix t' of P; the match spans i..i+Z_i-1, with x ≠ y at the first difference]

• Z_i: length of the longest substring starting at i that
matches a prefix of P
– i.e., t = t', x ≠ y, Z_i = |t|
– With the Z-values computed, we can get the
preprocessing for both KMP and B-M in linear time.

P: aabcaabxaaz
Z: 01003100210

• How to compute Z-values in linear time?
Computing Z in linear time

[Figure: when computing Z_k, all Z-values up to k-1 are known, along with the interval [l, r] of the rightmost match found so far]

We have already computed all Z-values up to k-1 and need to
compute Z_k. We also know the starting and ending points,
l and r, of the rightmost previous match. We know that t = t',
therefore the Z-value at k-l+1 may be helpful to us.
Computing Z in linear time
Case 1: the previous r is smaller than k, i.e., no previous match
extends beyond k. Do explicit comparison starting at k.
Case 2: k ≤ r and Z(k-l+1) < r-k+1. Then Z(k) = Z(k-l+1);
no comparison is needed.
Case 3: k ≤ r and Z(k-l+1) ≥ r-k+1. Then Z(k) ≥ r-k+1;
comparison resumes from position r+1.
• No char inside the interval [l, r] is compared twice, and there is
at most one mismatch per iteration.
• Therefore, O(n).
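The three cases collapse into a few lines (a standard formulation; `[l, r)` tracks the rightmost match window, and Z[0] is left as 0):

```python
def z_values(S):
    """Z[k]: length of the longest substring starting at k that
    matches a prefix of S."""
    n = len(S)
    Z = [0] * n
    l = r = 0                     # rightmost match window: S[l:r] == S[0:r-l]
    for k in range(1, n):
        if k < r:                 # cases 2/3: reuse Z[k-l], capped at r-k
            Z[k] = min(r - k, Z[k - l])
        while k + Z[k] < n and S[Z[k]] == S[k + Z[k]]:
            Z[k] += 1             # explicit comparisons (cases 1 and 3)
        if k + Z[k] > r:
            l, r = k, k + Z[k]    # new rightmost match window
    return Z
```

On the slide's example, `z_values("aabcaabxaaz")` returns `[0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]`, i.e., the string 01003100210 above.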
Z-preprocessing for B-M and KMP

[Figure: the Z-value at j identifies a prefix copy t' = P[j..j+Z_j-1]; read backwards, these give the KMP sp' values and the B-M good suffix values]

For each j: sp'(j + Z_j – 1) = Z_j
(use Z backwards, from large j to small)
• Both KMP and B-M preprocessing can be done in O(n)