Chapter 7: Space and Time Tradeoffs


Preprocessing
Application: String Matching
By Rong Ge
COSC3100
Space-for-time tradeoffs
Two varieties of space-for-time algorithms:
• input enhancement: preprocess the input (or part of it) to store some information to be used later in solving the problem
– counting sorts (last lecture)
– string searching algorithms (today)
• prestructuring: preprocess the input to make accessing its elements easier
– hashing
A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson
Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Review: String searching by brute force
String Matching problem:
Given a pattern P: a string of m characters to search for
and a text T: a (long) string of n>=m characters to search in
Return whether P occurs in T and the position of P in T
Brute force algorithm revisited
Step 1 Align pattern P at the beginning of text T
Step 2 Scan from left to right, comparing each character of P to the corresponding character in T until either all characters of P are found to match (successful search) or a mismatch is detected
Step 3 While a mismatch is detected and T is not yet exhausted, realign P one position to the right and repeat Step 2
Number of character comparisons: O(mn) in the worst case
Are there more efficient algorithms, with O(n) time?
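The three brute-force steps can be sketched in Python as follows. This is a minimal illustration rather than code from the lecture: the function name `brute_force_search` and the returned comparison counter are assumptions added here so the cost can be compared with the later algorithms.

```python
def brute_force_search(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):           # Steps 1 and 3: try each alignment in turn
        j = 0
        while j < m:                     # Step 2: scan the pattern left to right
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                    # mismatch: move to the next alignment
            j += 1
        if j == m:                       # all m characters matched
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    print(brute_force_search("BARD LOVED BANANAS", "BAOBAB"))
```

Each of the n-m+1 alignments can cost up to m comparisons, which is where the O(mn) bound comes from.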
String searching by preprocessing
Several string searching algorithms are based on the input
enhancement idea of preprocessing the pattern P

Horspool’s algorithm
1. preprocess P to build one shift table tb
2. align P against the beginning of T
3. repeat until a matching substring is found or T ends:
   1. compare the corresponding characters of P and T from right to left
   2. if a mismatch occurs, shift P to the right by tb(c), where c is the rightmost character of T in the current alignment
The Boyer-Moore algorithm improves on Horspool’s algorithm by
using two tables: a bad-symbol table and a good-suffix table
Horspool’s Algorithm
Two key points:
• preprocesses P to generate a shift table that determines how
much to shift P when a mismatch occurs
• always makes a shift based on the text character c aligned
with the last character of P, according to the shift table’s entry
for c
How far to shift for each case?
Focus on the rightmost character of T in the alignment (the one aligned with the last character of P):
Case I: character C != B (mismatch), and C does not occur in P
.....C...................... Text (C not in pattern)
BAOBAB                       Pattern
Case II: the character mismatches but occurs in P
Character O != B (mismatch), but O occurs in P once
.....O...................... (O occurs once in pattern)
BAOBAB
Character A != B (mismatch), and A occurs in P more than once
.....A...................... (A occurs twice in pattern)
BAOBAB

Case III: character R = R (match), but doesn’t appear in P in the first m-1 characters
...MER......................
LEADER

Case IV: character B = B (match), and occurs in P in the first m-1 characters
.....B......................
BAOBAB
Your answers?
Build a shift table for a given pattern P

Shift table:
• A 1-D array giving the shift size for each character c when c is the
rightmost character of T in the alignment
• array size: number of characters in the alphabet
– E.g.: 26 if all characters in T are uppercase English letters

The array entry for a character c is computed as follows
• Cases I & III: P’s length m
• Case II & IV: distance from c’s rightmost occurrence in P among P’s first
m-1 characters to P’s right end

Horspool's alg. populates the shift table with two steps:
1. Initialize each array entry as P’s length
2. Scan P from left to right and update the array entries
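The two-step construction can be sketched as below. The function name `shift_table` and the choice of a dictionary keyed by character (instead of a fixed-size array) are assumptions made for this illustration; the alphabet is passed in explicitly.

```python
import string


def shift_table(pattern, alphabet=string.ascii_uppercase):
    """Build Horspool's shift table for `pattern` over `alphabet`."""
    m = len(pattern)
    # Step 1: every character starts with a shift of m (cases I and III).
    table = {c: m for c in alphabet}
    # Step 2: scan the first m-1 characters left to right; later (more
    # rightward) occurrences overwrite earlier ones (cases II and IV).
    for j in range(m - 1):
        table[pattern[j]] = m - 1 - j
    return table


if __name__ == "__main__":
    t = shift_table("BAOBAB")
    print(t["A"], t["B"], t["O"], t["C"])
```

Scanning left to right is what makes the rightmost occurrence among the first m-1 characters win, since its smaller distance is written last.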
Build shift table

Example: Pattern: BAOBAB
• Alphabet: uppercase English letters

Preprocessing
• Initialize all array entries as P’s length, which is 6 for BAOBAB
- Taking care of cases I and III
- Shift table is indexed by text and pattern alphabet
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
• Scan the first m-1 characters of P from left to right and update the table (the last character of P is skipped)
0. BAOBAB, position 0 (B): distance = 5 (B’s entry)
1. BAOBAB, position 1 (A): distance = 4 (A’s entry)
2. BAOBAB, position 2 (O): distance = 3 (O’s entry)
3. BAOBAB, position 3 (B): distance = 2 (B’s entry)
4. BAOBAB, position 4 (A): distance = 1 (A’s entry)
- Taking care of cases II & IV
Resulting shift table
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6
Example of Horspool’s alg. application
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6
Alg: keep aligning P against T and comparing from right to left, shifting if
mismatched.
Example:
BARD LOVED BANANAS (Text)
BAOBAB (shift t(L)=6)
      BAOBAB (shift t(B)=2)
        BAOBAB (shift t(N)=6)
              BAOBAB (T out of bound, unsuccessful search)
Total comparisons: 4
1st alignment: 1
2nd alignment: 2
3rd alignment: 1
4th alignment: 0
How many comparisons with brute force alg?
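The trace above can be reproduced with a short sketch of Horspool's search loop. The name `horspool` and the comparison counter are assumptions for illustration. Instead of a full alphabet-sized array, the table stores only the characters of P and falls back to m for everything else, which gives the same shifts.

```python
def horspool(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(text)
    # Shift table: distance from the rightmost occurrence among the first
    # m-1 characters of the pattern to the pattern's right end.
    table = {}
    for j in range(m - 1):
        table[pattern[j]] = m - 1 - j
    comparisons = 0
    i = m - 1                              # text index aligned with P's last character
    while i < n:
        k = 0                              # number of characters matched so far
        while k < m:                       # compare right to left
            comparisons += 1
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons
        i += table.get(text[i], m)         # shift by the entry for text[i]
    return -1, comparisons


if __name__ == "__main__":
    print(horspool("BARD LOVED BANANAS", "BAOBAB"))
```

Note that the shift always uses the text character aligned with the last character of P, regardless of where the mismatch was found.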
Boyer-Moore algorithm
Based on same two ideas:
• comparing P to T from right to left for each alignment
• precomputing shift sizes in two tables
– bad-symbol table t1 indicates how much to shift based on
text’s character causing a mismatch
t1 is built the same way as the shift table in Horspool’s alg.
– good-suffix table d2 indicates how much to shift based on
matched part (suffix) of the pattern
Scenarios in string matching
SI: the rightmost character of P doesn’t match; the BM algorithm acts as Horspool’s.
Shift P to the right by t1(c)
[Figure: text character c aligned with pattern character x at P’s right end; the rightmost character of P doesn’t match]
SII: k characters are matched (0 < k < m), and then a mismatch occurs.
[Figure: text character c aligned with pattern character x, followed on the right by k > 0 matched characters]
How much do we shift the pattern to the right?
Prefix and suffix of strings

Prefix of a string P:
A prefix of a string P is a substring of P that occurs at the beginning of P.
E.g. P: BANANA
Prefix   Length
B        1
BA       2
BAN      3
BANA     4
BANAN    5
BANANA   6
Suffix of a string P:
A suffix of a string P is a substring that occurs at the end of P.
E.g. P: BANANA
Suffixes: A, NA, ANA, NANA, ANANA, BANANA
Suffix   Length
A        1
NA       2
ANA      3
NANA     4
ANANA    5
BANANA   6
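As a quick check of the two definitions, the prefixes and suffixes of a string can be listed by length with slicing. The helper names `prefixes` and `suffixes` are hypothetical, chosen only for this sketch.

```python
def prefixes(s):
    """All non-empty prefixes of s, ordered by length 1..len(s)."""
    return [s[:k] for k in range(1, len(s) + 1)]


def suffixes(s):
    """All non-empty suffixes of s, ordered by length 1..len(s)."""
    return [s[-k:] for k in range(1, len(s) + 1)]


if __name__ == "__main__":
    print(prefixes("BANANA"))
    print(suffixes("BANANA"))
```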
Good-suffix table d2 for pattern

applied after a suffix of P matches T for an alignment, and then
a mismatch occurs
• k is the length of the matched suffix of P, 0 < k < m

Good-suffix table d2:
• a 1-D array of size m; the entry at index 0 is not used.
• Each entry d2(k) gives the shift size for the good (matched) suffix of length k
Build good-suffix table d2 for the pattern

For each entry indexed by k, where k is in [1, m-1] and corresponds to the suffix of P of size k, set d2(k) by trying three cases in order:
1. the distance between the matched suffix of size k and its rightmost other occurrence in the pattern that is not preceded by the same character as the suffix
E.g.: P: CABABA d2(k=1) = 4 // Suffix of size 1: A; the rightmost A that is not preceded by B is the second character in P.
2. if there is no such occurrence in case 1: the distance between the longest part of the k-character suffix that matches a prefix of P and that prefix
E.g.: P: ANAN d2(k=3) = 2 // Suffix of size 3: NAN; the longest part of NAN that matches a prefix of P is AN
3. if there is no match in case 2: m, the length of P
E.g.: P: ANAN d2(k=1) = 4 // Suffix of size 1: N
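The three cases can be turned into a direct, unoptimized sketch. The function name `good_suffix_table` is an assumption, and this version simply tries the cases in order for each k; it is meant to mirror the rules above and is not the linear-time construction.

```python
def good_suffix_table(pattern):
    """Return list d2 where d2[k] is the good-suffix shift for 1 <= k < m (d2[0] unused)."""
    m = len(pattern)
    d2 = [0] * m
    for k in range(1, m):
        suffix = pattern[m - k:]
        preceding = pattern[m - k - 1]          # character just before the suffix
        shift = None
        # Case 1: rightmost other occurrence of the suffix that is not
        # preceded by the same character as the suffix itself.
        for start in range(m - k - 1, -1, -1):
            if pattern[start:start + k] == suffix and (
                start == 0 or pattern[start - 1] != preceding
            ):
                shift = (m - k) - start
                break
        # Case 2: longest prefix of length l < k that equals the suffix of length l.
        if shift is None:
            for l in range(k - 1, 0, -1):
                if pattern[:l] == pattern[m - l:]:
                    shift = m - l
                    break
        # Case 3: nothing reusable, so shift by the whole pattern length.
        d2[k] = m if shift is None else shift
    return d2


if __name__ == "__main__":
    print(good_suffix_table("CABABA")[1], good_suffix_table("ANAN"))
```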
Exercise

Build the good suffix table for pattern WOWWOW
k  pattern  d2  case
1  WOWWOW   2   Case 1
2  WOWWOW   5   Case 2: longest part of OW with a matched prefix is W
3  WOWWOW   3   Case 1
4  WOWWOW   3   Case 2: longest part of WWOW with a matched prefix is WOW
5  WOWWOW   3   Case 2: longest part of OWWOW with a matched prefix is WOW
BM alg. for scenario II in string matching
After successfully matching 0 < k < m characters, the algorithm
shifts the pattern to the right by
d = max {d1, d2}
where d1 = max{t1(c) - k, 1} is the bad-symbol shift
and d2 = d2(k) is the good-suffix shift
Boyer-Moore Algorithm (cont.)
Step 1 Fill in the bad-symbol shift table
Step 2 Fill in the good-suffix shift table
Step 3 Align the pattern against the beginning of the text
Step 4 Repeat until a matching substring is found or text ends:
Compare the corresponding characters right to left.
If no characters match, retrieve entry t1(c) from the bad-symbol table for the text’s character c causing the mismatch
and shift the pattern to the right by t1(c).
If 0 < k < m characters are matched, retrieve entry t1(c) from
the bad-symbol table for the text’s character c causing the
mismatch and entry d2(k) from the good-suffix table and
shift the pattern to the right by
d = max {d1, d2}
where d1 = max{t1(c) - k, 1}.
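Steps 1 to 4 can be combined into one self-contained sketch. All function names (`bad_symbol_table`, `good_suffix_table`, `boyer_moore`) and the comparison counter are assumptions for illustration, and the good-suffix table is built with a simple case-by-case scan, not an optimized method.

```python
def bad_symbol_table(pattern):
    """t1: shifts for characters among the first m-1 of P (others default to m)."""
    m = len(pattern)
    return {pattern[j]: m - 1 - j for j in range(m - 1)}


def good_suffix_table(pattern):
    """d2[k] for 1 <= k < m, following the three cases in order (d2[0] unused)."""
    m = len(pattern)
    d2 = [0] * m
    for k in range(1, m):
        suffix, preceding = pattern[m - k:], pattern[m - k - 1]
        shift = m                                   # case 3 default
        for l in range(k - 1, 0, -1):               # case 2: prefix == shorter suffix
            if pattern[:l] == pattern[m - l:]:
                shift = m - l
                break
        for start in range(m - k - 1, -1, -1):      # case 1 overrides if it applies
            if pattern[start:start + k] == suffix and (
                start == 0 or pattern[start - 1] != preceding
            ):
                shift = (m - k) - start
                break
        d2[k] = shift
    return d2


def boyer_moore(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(text)
    t1 = bad_symbol_table(pattern)                  # Step 1
    d2 = good_suffix_table(pattern)                 # Step 2
    comparisons = 0
    i = m - 1                                       # Step 3: align P with start of T
    while i < n:                                    # Step 4
        k = 0
        while k < m:                                # compare right to left
            comparisons += 1
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons
        c = text[i - k]                             # text character causing the mismatch
        d1 = max(t1.get(c, m) - k, 1)               # bad-symbol shift
        i += d1 if k == 0 else max(d1, d2[k])       # add good-suffix shift when k > 0
    return -1, comparisons


if __name__ == "__main__":
    print(boyer_moore("BESS_NNEW_ABOUT_BAOBABS", "BAOBAB"))
```

Unlike Horspool's algorithm, the bad-symbol lookup here uses the text character at the mismatch position, which is why k is subtracted from t1(c).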
Example of Boyer-Moore alg. application
Table t1:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6
Table d2:
k  pattern  d2
1  BAOBAB   2
2  BAOBAB   5
3  BAOBAB   5
4  BAOBAB   5
5  BAOBAB   5
B E S S _ N N E W _ A B O U T _ B A O B A B S
B A O B A B
0 match: shift d1 = t1(N) = 6
            B A O B A B
2 matches: k=2, d1 = t1(_)-2 = 4, shift d2(2) = 5
                      B A O B A B
1 match: k=1, shift d1 = t1(_)-1 = 5, d2(1) = 2
                                B A O B A B (success)
#comp: 12 with BM, 13 with Horspool
#alignment: 4 with BM, 5 with Horspool
Summary


Preprocessing - preprocess the input (or its part) to store some
info to be used later in solving the problem
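With preprocessing, Horspool’s and BM string algorithms can shift the
pattern by more than one position after a mismatch, so they usually make far fewer comparisons than brute force
• Comparison proceeds from right to left on P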
Finishing Counting Sort
Alg. CountingSort(A[0,…,n-1], lb, ub)
Input: an array A of integers between lb and ub (lb <= ub)
Output: Array B[0,…,n-1] of A’s elements sorted in
nondescending order
for j ← 0 to ub-lb do //init freq to 0
    C[j] ← 0
for i ← 0 to n-1 do //count freq
    C[A[i]-lb] ← C[A[i]-lb] + 1
for j ← 1 to ub-lb do //accumulate: C[j] = number of elements <= lb+j
    C[j] ← C[j-1] + C[j]
for i ← n-1 downto 0 do //copy from A to B
    j ← A[i] - lb
    B[C[j]-1] ← A[i]
    C[j] ← C[j] - 1
E.g., A = {3, 2, 4, 1, 4, 1}
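The pseudocode translates almost line for line into Python. The function name `counting_sort` and its list-returning interface are assumptions for this sketch; the example call uses the array above with lb = 1 and ub = 4.

```python
def counting_sort(a, lb, ub):
    """Sort a list of integers in the range [lb, ub] in nondescending order."""
    counts = [0] * (ub - lb + 1)          # init frequencies to 0
    for x in a:                           # count frequencies
        counts[x - lb] += 1
    for j in range(1, ub - lb + 1):       # accumulate: counts[j] = # of elements <= lb+j
        counts[j] += counts[j - 1]
    b = [None] * len(a)
    for x in reversed(a):                 # right-to-left copy keeps the sort stable
        j = x - lb
        b[counts[j] - 1] = x              # place x at the last free slot for its value
        counts[j] -= 1
    return b


if __name__ == "__main__":
    print(counting_sort([3, 2, 4, 1, 4, 1], 1, 4))
```

The running time is linear in n plus the size of the value range (ub - lb + 1), paid for with the extra arrays C and B.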