Transcript Document

CSCE350 Algorithms and Data Structures
Lecture 16
Jianjun Hu
Department of Computer Science and
Engineering
University of South Carolina
2009.11.
Chapter 7: Space-Time Tradeoffs
• For many problems some extra space really pays off:
• extra space in tables (breathing room?)
– hashing
– non comparison-based sorting
• input enhancement
– indexing schemes (e.g., B-trees)
– auxiliary tables (shift tables for pattern matching)
• tables of information that do all the work
– dynamic programming
String Matching
• pattern: a string of m characters to search for
• text: a (long) string of n characters to search in
• Brute-force algorithm:
1. Align pattern at beginning of text
2. Moving from left to right, compare each character of pattern to the corresponding character in text until
   • all characters are found to match (successful search); or
   • a mismatch is detected
3. While pattern is not found and the text is not yet exhausted, realign pattern one position to the right and repeat step 2.
• What is the complexity of the brute-force string matching?
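The brute-force procedure above can be sketched as follows (the function name is illustrative, not the textbook's pseudocode):

```python
def brute_force_match(pattern, text):
    """Return the index of the first match of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    for i in range(n - m + 1):       # step 3: try each alignment, left to right
        j = 0
        while j < m and pattern[j] == text[i + j]:  # step 2: compare characters
            j += 1
        if j == m:
            return i                 # successful search: match starts at i
    return -1                        # text exhausted without a match
```

In the worst case (e.g., searching for AA…AB in AA…AA), every alignment makes m comparisons, so the running time is Θ(nm); on typical text it is much faster.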
String Searching - History
• 1970: Cook shows (using finite-state machines) that problem can be
solved in time proportional to n+m
• 1976: Knuth and Pratt find an algorithm based on Cook’s idea; Morris
independently discovers same algorithm in attempt to avoid
“backing up” over text
• At about the same time Boyer and Moore find an algorithm that
examines only a fraction of the text in most cases (by comparing
characters in pattern and text from right to left, instead of left to
right)
• 1980: Another algorithm, proposed by Rabin and Karp, virtually
always runs in time proportional to n+m and has the advantage of
extending easily to two-dimensional pattern matching and being
almost as simple as the brute-force method.
Horspool’s Algorithm
• A simplified version of Boyer-Moore algorithm
that retains key insights:
– compare pattern characters to text from right to left
– given a pattern, create a shift table that determines how
much to shift the pattern when a mismatch occurs (input
enhancement)
Consider the Problem
• Search pattern BARBER in some text
[Figure: the pattern B A R B E R aligned under the text s0 … c … sn-1; comparison proceeds from the pattern’s last character leftward]
• Compare the pattern in the current text position
from the right to the left
• If the whole match is found, done.
• Otherwise, decide the shift distance of the
pattern (move it to the right)
• There are four cases!
Shift Distance -- Case 1:
• There is no ‘c’ in the pattern. Shift by m, the length of the pattern.
[Figure: the mismatched text character (e.g., S) does not occur in BARBER, so the pattern BARBER shifts right past it by m = 6 positions]
Shift Distance -- Case 2:
• There are occurrences of ‘c’ in the pattern, but not as its last character. The shift should align the rightmost occurrence of ‘c’ in the pattern with the ‘c’ in the text.
[Figure: the mismatched text character B occurs in BARBER, so the pattern shifts right by 2 to align its rightmost B among the first m-1 characters with the text’s B]
Shift Distance -- Case 3:
• ‘c’ matches the last character in the pattern, but there is no ‘c’ among the other m-1 characters. Follow Case 1 and shift by m.
[Figure: pattern LEADER; the text’s R matches the pattern’s last R, but R does not occur in LEADE, so the pattern shifts right by m = 6]
Shift Distance -- Case 4:
• ‘c’ matches the last character in the pattern, and there are other ‘c’s among the other m-1 characters. Follow Case 2 and align the rightmost of those occurrences.
[Figure: pattern REORDER; the text’s R matches the pattern’s last R, and R also occurs among the first m-1 characters, so the pattern shifts right by 3]
We can precompute the shift distance for every
possible character ‘c’ (given a pattern)
t(c) = m (the pattern’s length), if c is not among the first m-1 characters of the pattern;
t(c) = the distance from the rightmost c among the first m-1 characters of the pattern to its last character, otherwise.
• Shift Table for the pattern “BARBER”
c    | A | B | C | D | E | F | … | R | … | Z | _
t(c) | 4 | 2 | 6 | 6 | 1 | 6 | 6 | 3 | 6 | 6 | 6
Example
• See Section 7.2 for the pseudocode of the shift-table
construction algorithm and Horspool’s algorithm
• Example: find the pattern BARBER from the following
text
J I M _ S A W _ M E _ I N _ A _ B A R B E R S H O
B A R B E R
B A R B E R
B A R B E R
B A R B E R
B A R B E R
B A R B E R
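The trace above can be reproduced with a sketch of the shift-table construction and Horspool’s search, assuming plain Python strings (the pseudocode in Section 7.2 is the authoritative version):

```python
def shift_table(pattern):
    """Build Horspool's shift table t(c) as a dict; characters absent
    from the first m-1 pattern positions default to m on lookup."""
    m = len(pattern)
    table = {}
    for i in range(m - 1):
        # distance from position i to the last character; rightmost
        # occurrences overwrite earlier ones
        table[pattern[i]] = m - 1 - i
    return table

def horspool(pattern, text):
    """Return the index of the first match, or -1 if there is none."""
    m, n = len(pattern), len(text)
    table = shift_table(pattern)
    i = m - 1                # text index aligned with the pattern's last character
    while i < n:
        k = 0                # number of characters matched, right to left
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1
        i += table.get(text[i], m)   # shift by t(c) for the aligned text character
    return -1
```

For BARBER this reproduces the table above (t(A)=4, t(B)=2, t(E)=1, t(R)=3, all others 6) and finds the pattern at position 16 of the example text.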
Algorithm Efficiency
• The worst-case complexity is Θ(nm)
• On average, it is Θ(n)
• It is usually much faster than the brute-force
algorithm
• A simple exercise: Create the shift table of 26
letters and space for the pattern BAOBAB
Boyer-Moore algorithm
• Based on the same two ideas:
– compare pattern characters to text from right to left
– given a pattern, create a shift table that determines how much
to shift the pattern when a mismatch occurs (input
enhancement)
• Uses additional shift table with same idea applied
to the number of matched characters
• The worst-case efficiency of the Boyer-Moore
algorithm is linear. See Section 7.2 of the
textbook for details
Boyer-Moore algorithm
[Figure: bad-symbol shift — the pattern BARBER is compared right to left against the text s0 … sn-1; after a mismatch on a bad symbol (e.g., S), a precomputed table determines how far to shift]
Space and Time Tradeoffs: Hashing
• A very efficient method for implementing a
dictionary, i.e., a set with the operations:
• insert
• find
• delete
• Applications:
• databases
• symbol tables
Addressing by index number
Addressing by content
Example Application: How to store
student records into a data structure ?
• Store student records keyed by SSN: xxx-xx-6453 (Jeffrey … Los Angeles …), xxx-xx-2038, xxx-xx-0913, xxx-xx-4382, xxx-xx-9084, xxx-xx-2498
[Figure: the six records placed into numbered buckets 1–9 by content: 6453, 2038, 0913, 4382, 9084, 2498]
Hash tables and hash functions
• Hash table: an array with indices that correspond to
buckets
• Hash function: determines the bucket for each record
• Example: student records, key=SSN. Hash function:
h(k) = k mod m
(k is a key and m is the number of buckets)
– if m = 1000, where is record with SSN= 315-17-4251
stored?
• Hash function must:
– be easy to compute
– distribute keys evenly throughout the table
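The modular hash function above can be written directly (m = 1000 taken from the slide’s question):

```python
def h(k, m=1000):
    # h(k) = k mod m: map an integer key to one of m buckets
    return k % m

# The record with SSN 315-17-4251 (key 315174251) is stored in bucket 251.
```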
Collisions
• If h(k1) = h(k2) then there is a collision.
• Good hash functions result in fewer collisions.
• Collisions can never be completely eliminated.
• Two types handle collisions differently:
  – Open hashing: each bucket points to a linked list of all keys hashing to it.
  – Closed hashing:
    • one key per bucket
    • in case of collision, find another bucket for one of the keys (needs a collision resolution strategy)
      – linear probing: use the next bucket
      – double hashing: use a second hash function to compute the increment
Example of Open Hashing
• Store student records into 10 buckets using the hash function
  h(SSN) = SSN mod 10
• Keys: xxx-xx-6453, xxx-xx-2038, xxx-xx-0913, xxx-xx-4382, xxx-xx-9084, xxx-xx-2498

  bucket 2: 4382
  bucket 3: 6453 → 0913
  bucket 4: 9084
  bucket 8: 2038 → 2498
  (buckets 0, 1, 5, 6, 7, 9 are empty)
Open hashing
• If hash function distributes keys uniformly,
average length of linked list will be α =n/m
(load factor)
• Average number of probes = 1+α/2
• Worst-case is still linear!
• Open hashing still works if n>m.
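A minimal sketch of open hashing (separate chaining); the class and method names are illustrative, not from the textbook:

```python
class OpenHashTable:
    """Each bucket holds a list (standing in for a linked list) of all
    keys that hash to it, so the table keeps working even when n > m."""
    def __init__(self, m=10):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def insert(self, key):
        self.buckets[key % self.m].append(key)

    def find(self, key):
        # scans one chain; the average chain length is the load factor n/m
        return key in self.buckets[key % self.m]

    def load_factor(self):
        return sum(len(b) for b in self.buckets) / self.m
```

Inserting the six example keys gives α = 6/10 = 0.6, with 6453 and 0913 sharing bucket 3.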
Example: Closed Hashing (Linear
Probing)
• Insert xxx-xx-6453, xxx-xx-2038, xxx-xx-0913, xxx-xx-4382, xxx-xx-9084, xxx-xx-2498 using h(SSN) = SSN mod 10:

  bucket: 0   1   2     3     4     5     6   7   8     9
  key:            4382  6453  0913  9084          2038  2498

  (0913 collides with 6453 in bucket 3 and probes to 4; 9084 then collides in bucket 4 and probes to 5; 2498 collides with 2038 in bucket 8 and probes to 9)
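The probing sequence above can be sketched as follows (the function name is illustrative; the table is a fixed-size list with None marking empty buckets):

```python
def insert_linear_probing(table, key):
    """Closed hashing: one key per bucket; on a collision, probe the
    next bucket (wrapping around) until an empty one is found."""
    m = len(table)
    i = key % m
    while table[i] is not None:
        i = (i + 1) % m
    table[i] = key
    return i          # the bucket where the key ended up
```

Inserting 6453, 2038, 913, 4382, 9084, 2498 in that order reproduces the layout above: 913 probes from bucket 3 to 4, 9084 from 4 to 5, and 2498 from 8 to 9.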
Hash function for strings
• h(SSN) = SSN mod 10 requires the SSN to be an integer
• What if the key is a string?
• “ABADGUYISMOREWELCOME”………..Bill….
Hash function for strings
• h(SSN) = SSN mod 10 requires the SSN to be an integer
• What if the key is a string? One common string hash (sdbm) accumulates the characters:

  static unsigned long sdbm(unsigned char *str) {
      unsigned long hash = 0;
      int c;
      while ((c = *str++))
          hash = c + (hash << 6) + (hash << 16) - hash;
      return hash;
  }
Closed Hashing (Linear Probing)
• Avoids pointers.
• Does not work if n > m.
• Deletions are not straightforward.
• Number of probes to insert/find/delete a key depends on the load factor α = n/m (hash table density):
  – successful search: (½)(1 + 1/(1−α))
  – unsuccessful search: (½)(1 + 1/(1−α)²)
• As the table gets filled (α approaches 1), the number of probes increases dramatically: at α = 0.9, a successful search averages about 5.5 probes and an unsuccessful search about 50.5.
B-Trees
• Organize data for fast queries
• Index for fast search
• For datasets of structured records, B-tree
indexing is used
Motivation (cont.)
• Assume that we use an AVL tree to store about 20 million
records
• We end up with a very deep binary tree whose search path requires
many disk accesses; log2 20,000,000 is about 24, so a search takes
about 0.2 seconds of disk time
• We know we can’t improve on the log n lower bound on
search for a binary tree
• But, the solution is to use more branches and thus reduce the
height of the tree!
– As branching increases, depth decreases
Constructing a B-tree
[Figure: an example B-tree; the root holds 17, internal nodes hold keys such as 3, 28, and 48, and the leaves hold keys including 1, 2, 6, 7, 8, 12, 14, 16, 25, 26, 29, 45, 52, 53, 55, 68]
Definition of a B-tree
• A B-tree of order m is an m-way tree (i.e., a
tree where each node may have up to m
children) in which:
1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree
2. all leaves are on the same level
3. all non-leaf nodes except the root have at least ⌈m / 2⌉ children
4. the root is either a leaf node, or it has from two to m children
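The definition can be made concrete with a tiny node sketch (illustrative names, not the textbook’s code):

```python
class BTreeNode:
    """A node of an order-m B-tree: up to m-1 sorted keys, and for an
    internal node exactly one more child than it has keys."""
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list means the node is a leaf

    def is_leaf(self):
        return not self.children

# A small order-3 (2-3) tree: one root key separating two leaves,
# with both leaves on the same level as the definition requires.
root = BTreeNode([17], [BTreeNode([3, 8]), BTreeNode([25, 48])])
```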
Comparing Trees
• Binary trees
– Can become unbalanced and lose their good time complexity (big O)
– AVL trees are strict binary trees that overcome the balance problem
– Heaps remain balanced but only prioritise (not order) the keys
• Multi-way trees
– B-Trees can be m-way, they can have any (odd) number of children
– One B-Tree, the 2-3 (or 3-way) B-Tree, approximates a permanently
balanced binary tree, exchanging the AVL tree’s balancing operations
for insertion and (more complex) deletion operations
Announcement
• Midterm Exam 2 Statistics:
  – Points Possible: 100
  – Class Average: 85 (expected)
  – High Score: 100