Boyer-Moore String Search


Searching a String with the Boyer-Moore Algorithm
Shana Rose Negin
December 14, 2000
Boyer-Moore String Search
•How does it work?
•Examples
•Complexity
•Acknowledgements
How Does it Work?
•Pattern moves left to right.
•Comparisons are done right to left.
•Uses two heuristics:
•Bad Character
•Good Suffix
Each heuristic comes into play when a mismatch occurs. Each one proposes
a number of characters by which the pattern can safely move forward
without skipping over a possible match. The algorithm takes the larger
of the two proposals.
Pattern Moves Left to Right
Start:
Text:    Several hours later, Cindy
Pattern: indy
Middle:
Text:    Several hours later, Cindy
Pattern:             indy
End:
Text:    Several hours later, Cindy
Pattern:                       indy
Comparisons Are Done Right to Left
Text:    Several hours later, Cindy
Pattern: indy
First comparison:  y (last character of the pattern)
Second comparison: d
Third comparison:  n
Fourth comparison: i (first character of the pattern)
Three Parts to the Bad Character Heuristic
1. When the comparison gives a mismatch, the bad-character heuristic
proposes moving the pattern to the right by an amount so that the bad
character from the string will match the rightmost occurrence of the bad
character in the pattern.
2. If the bad character doesn’t occur in the pattern, then the pattern may be
moved completely past the bad character.
3. If the rightmost occurrence of the bad character is to the right of the
current bad character position, then this heuristic makes no proposal.
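The three cases above can be sketched as one small function. This is only an illustration: the name bad_character_shift, the 0-based mismatch index j, and the use of None for "no proposal" are assumptions made for this sketch, not part of the slides.

```python
from typing import Optional


def bad_character_shift(pattern: str, j: int, bad_char: str) -> Optional[int]:
    """Shift proposed when pattern[j] (0-based) mismatches the text character bad_char.

    Returns None when the heuristic makes no proposal (case 3).
    """
    last = pattern.rfind(bad_char)  # rightmost occurrence, or -1 if absent
    shift = j - last                # case 2: last == -1 moves the pattern past bad_char
    return shift if shift > 0 else None


# Case 1: 't' in "cite" (index 2) meets a text 'c'; the c's line up after 2 steps.
print(bad_character_shift("cite", 2, "c"))
# Case 2: the last letter of "poor" (index 3) meets a character not in the pattern.
print(bad_character_shift("poor", 3, "v"))
# Case 3: 'r' in "drab" (index 1) meets a text 'b'; the only 'b' is further right.
print(bad_character_shift("drab", 1, "b"))
```

A proposal of zero cannot occur, because the character at the mismatch position is by definition different from the bad character.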
Bad Character Heuristic
1. When the comparison gives a mismatch, the bad-character heuristic
proposes moving the pattern to the right by an amount so that the bad
character from the string will match the rightmost occurrence of the bad
character in the pattern.
Text:    You’ve got a funny face, man.
Pattern:                    cite
Comparing right to left, e matches e, then t mismatches the text character c.
Shift:                        cite
Shifted two characters to match up the c’s.
Bad Character Heuristic
2. If the bad character doesn’t occur in the pattern, then the
pattern may be moved completely past the bad character.
Text:    You’ve got a funny face, man.
Pattern: poor
Shift:       poor
Shifted four characters because there was no match.
Bad Character Heuristic
3. If the rightmost occurrence of the bad character is to
the right of the current bad character position, then this
heuristic makes no proposal.
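Text:    There are no babies here.
Pattern:             drab
Comparing right to left, b and a match, then r mismatches the text character b.
The only b in the pattern is to the right of the mismatch, so lining it up
with the bad character would move the pattern to the left.
Shift:   (no proposal)
The shift proposed would be negative, so it is ignored.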
Good Suffix Heuristic
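The good suffix is the run of characters at the right end of the pattern
that already matched the text before the mismatch. The good-suffix
heuristic proposes to move the pattern to the right by the least amount
so that another occurrence of those characters in the pattern lines up
with the good suffix found in the text.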
Text:    ...I wish I had an apple instead of...
Pattern: banana
Shift two so that the second occurrence of ‘an’ in
‘banana’ matches the characters ‘an’ in the string.
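As a rough illustration of what the good-suffix proposal means, the sketch below finds it by brute force: it tries shifts 1, 2, 3, ... and returns the first one under which the shifted pattern still agrees with the characters that already matched. The function name good_suffix_shift and the 0-based mismatch index j are assumptions made for this sketch. It is a direct search for clarity, not the linear-time precomputation.

```python
def good_suffix_shift(pattern: str, j: int) -> int:
    """Smallest shift that keeps pattern[j+1:] (the good suffix) matched.

    j is the 0-based pattern index where the mismatch occurred.
    """
    m = len(pattern)
    for shift in range(1, m):
        # Position i of the good suffix is covered by pattern[i - shift] after
        # the shift, or by nothing if the pattern has moved past it.
        if all(i < shift or pattern[i - shift] == pattern[i]
               for i in range(j + 1, m)):
            return shift
    return m  # no overlap possible: move the whole pattern length


# "banana" with the mismatch at index 3: the good suffix is "na".
print(good_suffix_shift("banana", 3))
# Mismatch on the very last character: empty good suffix, so move by one.
print(good_suffix_shift("banana", 5))
# "grad" has no repeated characters, so a matched suffix forces a full shift.
print(good_suffix_shift("grad", 0))
```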
EXAMPLE
Text:    im a grad. dad is glad
Pattern: grad
[Figure: the pattern shown under the text at each successive shift, with the comparisons numbered 1 to 12. Legend: Bad-character, Good-Suffix, Match.]
12 comparisons out of 22 characters.
EXAMPLE
Text:    Where are you moving? What are you doing?
Pattern: grad
[Figure: the pattern shown under the text at 11 successive shifts. Legend: Bad-character, Good-Suffix, Match.]
10 comparisons out of 41 characters.
The last ‘grad’ is longer than the remaining string, so it is discarded before it is counted.
Applets
• http://www.accessone.com/~lorre/pages/bmi.html
• http://www.i.kyushu-u.ac.jp/~takeda/PM_DEMO/e.html
The Algorithm:
Sigma = alphabet in use;
T = Search string (text);
P = Pattern;
N = length[T];
M = length[P];
/* P and T are indexed from 1; s is the current shift of the pattern. */
L = Compute_Last_Occurrence_Function(P, M, Sigma);   // for bad-character heuristic
Y = Compute_Good_Suffix_Function(P, M);              // for good-suffix heuristic
s = 0;
while (s <= N-M) {
    j = M;
    while (j > 0 AND P[j] == T[s+j])
        j--;
    if (j == 0) {
        print("Pattern FOUND!!! Location", s);
        s = s + Y[0];
    } else {
        s = s + max(Y[j], j - L[T[s+j]]);
    }
}
Compute_Last_Occurrence_Function
Compute_Last_Occurrence_Function(P, M, Sigma) {
/* The array L has a field for every letter in the alphabet. When this
function is finished computing, L[a] holds the position of the rightmost ‘a’
in the pattern, counted from the beginning of the pattern starting at 1;
L[b] holds the position of the rightmost ‘b’, and so on. A letter that does
not occur in the pattern keeps the value 0.
EXAMPLE: pattern: jeff
letter: a b c d e f g h i j k
L:      0 0 0 0 2 4 0 0 0 1 0
*/
    for (each character a in Sigma)   // Initialize all fields to 0
        L[a] = 0;
    for (j = 1; j <= M; j++)          // For every letter in the pattern,
        L[P[j]] = j;                  // record its distance from the start of the pattern
    return L;
}
/* COMPLEXITY: O(|Sigma| + M) */
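For readers who want to run the last-occurrence computation, here is a minimal Python sketch. It keeps the 1-based positions of the ‘jeff’ example, with 0 meaning "not in the pattern". The function name compute_last_occurrence and the choice of a dictionary keyed by character are assumptions of this sketch.

```python
import string
from typing import Dict


def compute_last_occurrence(pattern: str, alphabet: str) -> Dict[str, int]:
    """Map each letter to the 1-based position of its rightmost occurrence (0 if absent)."""
    last = {ch: 0 for ch in alphabet}           # initialize all fields to 0
    for position, ch in enumerate(pattern, start=1):
        last[ch] = position                     # later occurrences overwrite earlier ones
    return last


table = compute_last_occurrence("jeff", string.ascii_lowercase)
print(table["j"], table["e"], table["f"], table["a"])  # prints: 1 2 4 0
```

The first loop touches every letter of the alphabet once and the second touches every pattern character once, which is where the O(|Sigma| + M) bound comes from.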
Compute_Good_Suffix_Function
Compute_Good_Suffix_Function(P, M) {
/*
Y[j] is the shift to use when the mismatch occurs at pattern position j,
that is, when the suffix P[j+1..M] has already matched the text. First
compute the prefix function of the pattern and of the reversed pattern.
Every field of Y starts at M - Pi[M], the shift that lines up the longest
prefix of the pattern that is also a suffix; if there is none, this is a
shift of the length of the pattern. The second loop then searches for the
next rightmost occurrence of each suffix inside the pattern and lowers the
recommended shift when it finds one. */
    Pi = Compute_Prefix_Function(P)
    P’ = Reverse(P)
    Pi’ = Compute_Prefix_Function(P’)
    for (i = 0; i <= M; i++)
        Y[i] = M - Pi[M];
    for (j = 1; j <= M; j++) {
        i = M - Pi’[j];
        if (Y[i] > j - Pi’[j])
            Y[i] = j - Pi’[j];
    }
    return Y
}
/* COMPLEXITY: O(M) */
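The following Python sketch follows the same two-pass idea using the textbook prefix function. It is a hedged illustration: the function names, and the convention that index 0 of each list is the "whole pattern matched" entry while indices 1..M are 1-based pattern positions, are assumptions made here to stay close to the pseudocode.

```python
from typing import List


def compute_prefix_function(p: str) -> List[int]:
    """pi[q] = length of the longest proper prefix of p[:q] that is also its suffix."""
    m = len(p)
    pi = [0] * (m + 1)          # pi[0] is unused padding so that q runs from 1 to m
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi


def compute_good_suffix(p: str) -> List[int]:
    """gamma[j] = shift when the mismatch is at 1-based position j (j = 0: full match)."""
    m = len(p)
    pi = compute_prefix_function(p)
    pi_rev = compute_prefix_function(p[::-1])
    gamma = [m - pi[m]] * (m + 1)               # start from the largest safe shift
    for l in range(1, m + 1):
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])  # a nearer reoccurrence lowers the shift
    return gamma


print(compute_good_suffix("banana"))  # prints: [6, 6, 6, 2, 2, 2, 1]
```

For ‘banana’ the entries of 2 correspond to the matched suffixes ‘a’, ‘na’ and ‘ana’, each of which occurs again two characters to the left.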
The Main Loop
while (s <= N-M) {                           // for every shift
    j = M;
    while (j > 0 AND P[j] == T[s+j])         // for the length of the pattern
        j--;
    if (j == 0) {                            // if you reach the beginning of the pattern,
        print("Pattern FOUND!!! Location", s);   // you found the pattern! Tell someone
        s = s + Y[0];                        // and shift by Y[0] (the length of the pattern,
                                             // unless the pattern overlaps itself)
    } else {
        s = s + max(Y[j], j - L[T[s+j]]);    // else, choose the greater of the
    }                                        // two heuristic results
}
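To try the main loop end to end, here is a compact, self-contained Python sketch. It uses 0-based indices rather than the 1-based indices of the pseudocode, and it builds the good-suffix table by brute force to keep the listing short, which is a deliberate swap for the linear-time construction. The name boyer_moore and the internal helper names are assumptions of this sketch.

```python
from typing import Dict, List


def boyer_moore(text: str, pattern: str) -> List[int]:
    """Return every 0-based position at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Bad-character table: 0-based index of the rightmost occurrence of each character.
    last: Dict[str, int] = {ch: i for i, ch in enumerate(pattern)}

    def shift_for(j: int) -> int:
        # Smallest shift that keeps the matched suffix pattern[j+1:] consistent.
        for shift in range(1, m):
            if all(i < shift or pattern[i - shift] == pattern[i]
                   for i in range(j + 1, m)):
                return shift
        return m

    # good[j + 1] is the shift for a mismatch at j; good[0] is used after a full match.
    good = [shift_for(j) for j in range(-1, m)]

    matches = []
    s = 0
    while s <= n - m:                       # for every shift
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1                          # compare right to left
        if j < 0:
            matches.append(s)               # reached the beginning of the pattern
            s += good[0]
        else:
            bad = j - last.get(text[s + j], -1)
            s += max(good[j + 1], bad)      # greater of the two proposals
    return matches


print(boyer_moore("im a grad. dad is glad", "grad"))       # prints: [5]
print(boyer_moore("Several hours later, Cindy", "indy"))   # prints: [22]
```

Because every good-suffix entry is at least 1, the maximum of the two proposals is always positive and the loop always advances, even when the bad-character proposal is negative.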
Complexity
•Compute_Last_Occurrence: O(|Sigma| + M)
•Compute_Good_Suffix: O(M)
•Number of shifts: O(N-M+1)
•Time to check each new shift: O(M)
•Total: (|Sigma| + M) + M + M(N-M+1) = O((N-M+1)M + |Sigma|) = O(NM)
Worst Case
HOWEVER... IN PRACTICE...
the algorithm takes sub-linear time.
Specifically, in the best case, the algorithm’s running time is O(N/M)
(length of text over length of pattern).
The complexity is best when the letters in the pattern don’t match the
letters in the text very often. For ordinary text this is generally the
case, so the average running time ends up being approximately equivalent
to the best case, O(N/M).
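The gap between the best and worst case can be seen by counting comparisons on two extreme inputs. The sketch below is a simplification that uses only the bad-character heuristic and steps by one after a full match; the function name count_comparisons and the test inputs are assumptions chosen for illustration.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by a bad-character-only search (a simplification)."""
    n, m = len(text), len(pattern)
    last = {ch: i for i, ch in enumerate(pattern)}  # rightmost index of each character
    comparisons = 0
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if pattern[j] != text[s + j]:
                break
            j -= 1
        if j < 0:
            s += 1                                   # full match: step by one
        else:
            s += max(1, j - last.get(text[s + j], -1))
    return comparisons


N, M = 1000, 10
# Best case: the last pattern character never matches, so each alignment costs
# one comparison and the pattern jumps M characters: N / M comparisons.
print(count_comparisons("a" * N, "b" * M))   # prints: 100
# Worst case: every alignment matches in full: M * (N - M + 1) comparisons.
print(count_comparisons("a" * N, "a" * M))   # prints: 9910
```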
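Conclusion:
The Boyer-Moore algorithm is a fast, practical string-search algorithm. By the
analysis above its worst-case running time is O(NM), and Cole shows that the
number of comparisons is linear in N when the pattern does not occur in the
text; its best-case running time is sub-linear. Most of the time it tends
toward the best case rather than the worst case.
I recommend the Boyer-Moore algorithm for searching a string.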
Shana Negin
252a-as
December 14, 2000
Algorithms csc252
Acknowledgements
Cormen, Leiserson, and Rivest: Introduction to Algorithms, Section 34.5
Cole, Richard: “Tight Bounds on the Complexity of the Boyer-Moore String-Matching Algorithm.” New York University
http://www.accessone.com/~lorre/pages/bmi.html
http://www.i.kyushu-u.ac.jp/~takeda/PM_DEMO/e.html