Transcript Lecture 2

Boyer-Moore Algorithm
• 3 main ideas
– right to left scan
– bad character rule
– good suffix rule
Right to left scan
0                 1
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
x z b c t z x t b p t x c t b p q
t b c b b
t b c b b
t b c b b
t b c b b
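The right-to-left comparison at a single alignment can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `mismatch_position` and the convention of returning 0 on a full match are assumptions, and positions are 1-based to agree with the slides.

```python
def mismatch_position(P, T, s):
    """Compare P against T[s..s+n-1] (1-based s), scanning right to left.

    Returns the 1-based index i of the first character of P that
    mismatches, or 0 if every character of P matches.
    (Hypothetical helper; assumes s + len(P) - 1 <= len(T).)
    """
    n = len(P)
    i = n
    # P[i] sits under T[s + i - 1]; subtract 1 more for Python's 0-based strings.
    while i >= 1 and P[i - 1] == T[s + i - 2]:
        i -= 1
    return i


T = "xzbctzxtbptxctbpq"
P = "tbcbb"
print(mismatch_position(P, T, 2))   # P[5] is compared with T[6] first
print(mismatch_position(P, T, 11))  # P[5] matches T[15], P[4] is compared next
```

The point of scanning from the right is that a mismatch found early can justify skipping several text positions without ever reading the characters to its left.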
Bad character rule
• Definition
– For each character x in the alphabet, let R(x) denote the
position of the right-most occurrence of character x in
P.
– R(x) is defined to be 0 if x is not in P
• Usage
– Suppose characters in P[i+1..n] match T[k+1..k+n-i]
and P[i] mismatches T[k].
– Shift P to the right by max(1, i - R(T[k])) places
• Hopefully more than 1
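A short sketch of the definition and the shift formula above. The names `rightmost_table` and `bad_character_shift` are illustrative assumptions; a dictionary stands in for the R[] array, and characters absent from P fall back to 0 as in the definition.

```python
def rightmost_table(P):
    """R[x] = 1-based position of the right-most occurrence of x in P."""
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos  # a later (more rightward) occurrence overwrites an earlier one
    return R


def bad_character_shift(R, i, c):
    """Shift when P[i] mismatches text character c: max(1, i - R(c))."""
    return max(1, i - R.get(c, 0))  # R(c) = 0 if c does not occur in P


R = rightmost_table("tbcbb")
print(R)                              # {'t': 1, 'b': 5, 'c': 3}
print(bad_character_shift(R, 5, "z"))  # 5
print(bad_character_shift(R, 5, "t"))  # 4
print(bad_character_shift(R, 4, "t"))  # 3
```

The three shifts printed are the ones that appear in the illustration of the rule for P = tbcbb.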
Illustration of bad character rule
0                 1
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
x z b c t z x t b p t x c t b p q
  t b c b b
i = 5, R(z) = 0, so max(1, 5-0) = 5
            t b c b b
i = 5, R(t) = 1, so max(1, 5-1) = 4
                    t b c b b
i = 4, R(t) = 1, so max(1, 4-1) = 3
                          t b c b b
Extended bad character rule
• Definition
– For each character x in the alphabet, let R(x,i) denote
the position of the right-most occurrence of character x
in P[1..i-1].
– R(x,i) is defined to be 0 if x is not in P[1..i-1].
• Usage
– Suppose characters in P[i+1..n] match T[k+1..k+n-i]
and P[i] mismatches T[k].
– Shift P to the right by max(1, i - R(T[k],i)) places
• Hopefully more than 1
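One possible way to realize R(x,i) without a full table is to keep, for each character, the list of its positions in P in decreasing order and scan that list for the first position below i. The sketch below assumes this list-based layout; the names `position_lists`, `extended_R`, and `extended_shift` are hypothetical.

```python
from collections import defaultdict


def position_lists(P):
    """Map each character to its 1-based positions in P, largest first."""
    occ = defaultdict(list)
    for pos in range(len(P), 0, -1):
        occ[P[pos - 1]].append(pos)
    return occ


def extended_R(occ, x, i):
    """R(x, i): right-most occurrence of x in P[1..i-1], or 0 if none."""
    for pos in occ.get(x, ()):  # .get avoids creating empty entries
        if pos < i:
            return pos
    return 0


def extended_shift(occ, i, c):
    """Shift when P[i] mismatches text character c: max(1, i - R(c, i))."""
    return max(1, i - extended_R(occ, c, i))


occ = position_lists("btctb")
print(occ["b"])                     # [5, 1]
print(extended_R(occ, "b", 4))      # 1
print(extended_shift(occ, 4, "b"))  # 3
```

With P = btctb and a mismatch at i = 4 against b, the plain rule finds R(b) = 5 and can only shift by 1, while the extended rule finds the occurrence at position 1 and shifts by 3.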
Illustration of extended bad
character rule
0                 1
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
x z b c b b x t b p t x c t b p q
  b t c t b
i = 4, R(b) = 5, so max(1, 4-5) = 1
    b t c t b
i = 4, R(b,4) = 1, so max(1, 4-1) = 3
        b t c t b
Implementation Issues
• Bad character rule
– Space required: O(|Σ|) for the number of characters in
the alphabet
– Calculate R[] array in O(n) time (exercise)
• Extended bad character rule
– Space required: full table is O(n|Σ|)
– Smaller implementation: O(n)
– Preprocess time: O(n)
– Search time impact: at worst roughly doubles the search
time, since the extra scan at a mismatch costs at most n-i steps
• See book for details (pg 18)
Observations
• Bad character rules
– work well in practice with large alphabets like
the English alphabet
– work less well with small alphabets like DNA
– Do not guarantee linear worst-case run-time
• Give an example of such a case
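One candidate answer to the exercise, offered as a sketch rather than the lecture's own solution: take T = a^m and P = b a^(n-1). Every alignment matches n-1 characters before failing at P[1], and the bad character rule can only shift by 1. The counter below uses the bad character rule alone; the function name and the choice to shift by one place after a full match are assumptions.

```python
def count_comparisons_bad_char_only(P, T):
    """Count character comparisons made using only the bad character rule."""
    n, m = len(P), len(T)
    R = {ch: pos for pos, ch in enumerate(P, start=1)}  # right-most positions
    comparisons = 0
    s = 1  # 1-based position in T aligned with P[1]
    while s + n - 1 <= m:
        i = n
        while i >= 1:
            comparisons += 1
            if P[i - 1] != T[s + i - 2]:
                break
            i -= 1
        if i == 0:
            s += 1  # assumption: after a full match, move on by one place
        else:
            s += max(1, i - R.get(T[s + i - 2], 0))
    return comparisons


n, m = 4, 20
print(count_comparisons_bad_char_only("b" + "a" * (n - 1), "a" * m))  # 68
print(n * (m - n + 1))                                                # 68
```

The count equals n(m-n+1), which grows as the product of the two lengths rather than linearly.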
Strong good suffix rule part 1
• Situation
– P[i..n] matches text T[j..j+n-i] but T[j-1] does
not match P[i-1]
• Find t’, the rightmost non-suffix substring of P
that matches the suffix P[i..n] and whose preceding
character in P differs from P[i-1]
• Shift P so that t’ matches up with T[j..j+n-i]
Illustration of suffix rule part 1
0                 1
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
p r s t a b s t u b a b v q x r s t
    q c a b d a b d a b
                q c a b d a b d a b
Preprocessing for suffix rule part 1
• Definitions
– For each i, L’(i) is the largest position less than n such
that P[i..n] matches a suffix of P[1..L’(i)] and the
character preceding that suffix is not equal to P[i-1];
L’(i) = 0 if no such position exists.
– For string P, N_j(P) is the length of the longest suffix of
the substring P[1..j] that is also a suffix of P
• Observations
– N_j(P) = Z_{n-j+1}(P^r), where P^r is the reverse of P
– L’(i) is the largest j < n such that N_j(P) = |P[i..n]|,
which equals n-i+1
– If L’(i) > 0, shift P by n-L’(i) places to the right
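The first observation can be checked with a small sketch: compute the Z values of the reversed pattern and read them off in reverse order. The Z routine below is a standard linear-time formulation written for this note, not code from the lecture; `z_array` and `n_values` are assumed names, and the returned N list is padded so that N[j] is N_j(P) for 1 <= j <= n.

```python
def z_array(S):
    """Z[i] = length of the longest substring starting at i (0-based)
    that matches a prefix of S; Z[0] is set to len(S) by convention."""
    n = len(S)
    Z = [0] * n
    if n:
        Z[0] = n
    l = r = 0  # [l, r) is the right-most prefix match found so far
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[i + Z[i]] == S[Z[i]]:
            Z[i] += 1
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z


def n_values(P):
    """N[j] = N_j(P) for 1 <= j <= n, using N_j(P) = Z_{n-j+1}(P^r)."""
    n = len(P)
    Zr = z_array(P[::-1])
    # 1-based index n-j+1 in the reversed string is 0-based index n-j.
    return [0] + [Zr[n - j] for j in range(1, n + 1)]


print(n_values("qcabdabdab"))  # [0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 10]
```

For P = qcabdabdab, N_4 = 2 because P[1..4] = qcab shares the suffix ab with P, and N_7 = 5 because P[1..7] shares the suffix abdab.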
Z-based computation of L’(i)
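for (i = 1; i <= n; i++)
    L’(i) = 0;
for (j = 1; j <= n-1; j++) {
    k = n - N_j(P) + 1;       /* P[k..n] has length N_j(P) */
    if (k <= n) L’(k) = j;    /* skip N_j(P) = 0; a larger j overwrites a smaller one */
}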
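A runnable rendering of the loop above, as a sketch. To keep the block self-contained it computes N_j(P) with a quadratic stand-in rather than the Z algorithm; `n_values_naive` and `strong_L_prime` are assumed names, and the list has one spare slot at index n+1 to absorb the entries where N_j(P) = 0.

```python
def n_values_naive(P):
    """N[j] = length of the longest suffix of P[1..j] that is also a
    suffix of P (quadratic stand-in for the Z-based computation)."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and P[j - 1 - k] == P[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def strong_L_prime(P):
    """Return a list Lp with Lp[i] = L'(i) for 1 <= i <= n (Lp[0] unused)."""
    n = len(P)
    N = n_values_naive(P)
    Lp = [0] * (n + 2)           # index n+1 catches the N[j] == 0 cases
    for j in range(1, n):        # j = 1 .. n-1
        Lp[n - N[j] + 1] = j     # increasing j, so the largest j wins
    return Lp[: n + 1]           # drop the spare slot


Lp = strong_L_prime("qcabdabdab")
print(Lp)            # [0, 0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
print(10 - Lp[9])    # 6
```

For P = qcabdabdab and a matched suffix P[9..10] = ab, L'(9) = 4, so the shift is n - L'(9) = 6 places.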
Strong good suffix rule part 2
• If L’(i) = 0 then …
• Let t’’ = the largest suffix of P[i..n] that is
also a prefix of P, if one exists.
• If t’’ exists, shift P so that prefix of P
matches up with t’’ at end of T[j..j+n-i].
• Otherwise, shift P past T[j+n-i].
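Both parts of the rule can be cross-checked against a brute-force version written directly from the definitions. This is a slow sketch for verification only, not the preprocessing the lecture describes; the name `good_suffix_shift_bruteforce` is an assumption, and i is the start of the matched suffix P[i..n] with 2 <= i <= n, so the mismatch is at P[i-1].

```python
def good_suffix_shift_bruteforce(P, i):
    """Shift given by the strong good suffix rule when P[i..n] matched
    and P[i-1] mismatched (1-based i, 2 <= i <= n)."""
    n = len(P)
    t = P[i - 1:]          # the matched suffix P[i..n]
    length = len(t)
    # Part 1: right-most non-suffix copy of t whose preceding character
    # differs from P[i-1] (a copy at the very start of P also qualifies).
    for e in range(n - 1, length - 1, -1):   # e = 1-based end of the copy
        start = e - length + 1
        if P[start - 1:e] == t and (start == 1 or P[start - 2] != P[i - 2]):
            return n - e
    # Part 2: largest suffix of t that is also a prefix of P.
    for k in range(length, 0, -1):
        if P[:k] == P[n - k:]:
            return n - k
    return n               # no overlap at all: shift P past the matched text


print(good_suffix_shift_bruteforce("qcabdabdab", 9))  # 6, via part 1
print(good_suffix_shift_bruteforce("ababstabab", 3))  # 6, via part 2
```

In the first call the copy of ab at positions 6..7 is rejected because it is preceded by d, the same character that just mismatched, so the copy at positions 3..4 is used. In the second call no copy of the matched suffix exists, and the prefix abab lines up with the end of the matched text.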
Illustration of suffix rule part 2
0                 1
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
p r s t a b s t a b a b v q x r s t
    a b a b s t a b a b
                a b a b s t a b a b
Preprocessing for suffix rule part 2
• Definitions
– For each i, let l’(i) denote the length of the largest suffix of
P[i..n] that is also a prefix of P, if one exists. Otherwise, let
l’(i)=0.
• Observations
– l’(i) = the largest j <= |P[i..n]| such that N_j(P) = j
• Question
– How does l’(i) relate to l’(i+1)?
• The same unless N_{n-i+1}(P) = n-i+1
Z-based computation of l’(i)
l’[n+1] = 0;
for (i = n; i >= 2; i--) {
    if (N[n-i+1] == (n-i+1)) l’[i] = n-i+1;
    else l’[i] = l’[i+1];
}
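The same recurrence as a runnable sketch. As an assumption to keep the block self-contained, N_j(P) is computed by a quadratic stand-in instead of the Z algorithm; `n_values_naive` and `small_l_prime` are illustrative names.

```python
def n_values_naive(P):
    """N[j] = length of the longest suffix of P[1..j] that is also a
    suffix of P (quadratic stand-in for the Z-based computation)."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and P[j - 1 - k] == P[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def small_l_prime(P):
    """Return a list lp with lp[i] = l'(i) for 2 <= i <= n.

    lp[0] and lp[1] are left at 0 because the loop, like the one in
    the slide, stops at i = 2.
    """
    n = len(P)
    N = n_values_naive(P)
    lp = [0] * (n + 2)               # lp[n+1] = 0 is the base case
    for i in range(n, 1, -1):        # i = n down to 2
        j = n - i + 1                # j = |P[i..n]|
        lp[i] = j if N[j] == j else lp[i + 1]
    return lp[: n + 1]


lp = small_l_prime("ababstabab")
print(lp)           # [0, 0, 4, 4, 4, 4, 4, 4, 2, 2, 0]
print(10 - lp[3])   # 6
```

For P = ababstabab, l'(3) = 4 because abab is the longest suffix of P[3..10] that is also a prefix of P, giving a shift of n - l'(3) = 6.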
Addendum to suffix rule
• Shift by 1 if there is an immediate mismatch
• That is, if P(n) mismatches with the
corresponding character in T
Boyer-Moore Overview
• Precompute L’(i), l’(i) for each position in P
• Precompute R(x) or R(x,i) for each
character x in Σ
• Align P to T
• Compare right to left
• On mismatch, shift by the max possible
from (extended) bad character rule and
good suffix rule and return to compare
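The steps above can be put together in one compact sketch. Several choices here are assumptions rather than statements from the lecture: the simple (not extended) bad character rule is used, N_j(P) comes from a quadratic stand-in instead of the Z algorithm, the shift after a full match is n - l'(2) as in the common textbook formulation, and all function names are hypothetical. In the search loop, i denotes the position of the mismatch in P, so the matched suffix starts at i+1.

```python
def n_values_naive(P):
    """N[j] = length of the longest suffix of P[1..j] that is also a suffix of P."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and P[j - 1 - k] == P[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def boyer_moore_search(P, T):
    """Return the 1-based start positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    # Preprocessing: R(x), L'(i), l'(i).
    R = {ch: pos for pos, ch in enumerate(P, start=1)}
    N = n_values_naive(P)
    Lp = [0] * (n + 2)
    for j in range(1, n):
        Lp[n - N[j] + 1] = j          # slot n+1 absorbs N[j] == 0
    lp = [0] * (n + 2)
    for i in range(n, 1, -1):
        j = n - i + 1
        lp[i] = j if N[j] == j else lp[i + 1]

    hits = []
    k = n                             # 1-based position in T under P[n]
    while k <= m:
        i, h = n, k
        while i >= 1 and P[i - 1] == T[h - 1]:   # compare right to left
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - lp[2]            # assumption: standard shift after a match
        else:
            bad = max(1, i - R.get(T[h - 1], 0))
            if i == n:
                good = 1              # immediate mismatch: shift by 1
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]  # good suffix rule, part 1
            else:
                good = n - lp[i + 1]  # good suffix rule, part 2
            k += max(bad, good)
    return hits


print(boyer_moore_search("abab", "abababab"))   # [1, 3, 5]
print(boyer_moore_search("abc", "xxabcxabc"))   # [3, 7]
```

Taking the maximum of the two shifts is safe because each rule on its own never skips an occurrence.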
Observations I
• Original Boyer-Moore algorithm uses “weak good suffix
rule” without using the mismatch information
– This is not sufficient to prove that the search part of Boyer-Moore
runs in linear time in the worst case
• Using only strong good suffix rule, can prove a worst-case
time of O(m), where m = |T|, provided P is not in T
• If P is in T, original Boyer-Moore runs in Θ(nm) time in
the worst case, but this can be corrected with simple
modifications
• Using only the bad character shift rule leads to O(nm) time
in the worst-case, but works in sublinear time on random
strings