
Foundations of Software Design
Lecture 26: Text Processing, Tries, and Dynamic
Programming
Marti Hearst & Fredrik Wallenberg
Fall 2002
Problem: String Search
• Determine if, and where, a substring occurs
within a string
Approaches/Algorithms:
• Brute Force
• Rabin-Karp
• Tries
• Dynamic Programming
“Brute Force” Algorithm
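A minimal Python sketch of the brute-force approach (function and variable names are my own):

    def brute_force_search(text, pattern):
        # Return the index of the first occurrence of pattern in text, or -1.
        n, m = len(text), len(pattern)
        for i in range(n - m + 1):      # try every possible alignment
            j = 0
            while j < m and text[i + j] == pattern[j]:
                j += 1                  # compare character by character
            if j == m:                  # all m characters matched
                return i
        return -1

    print(brute_force_search("bearish market", "market"))  # -> 8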
Worst-case Complexity
• For brute force, O(NM) for a text of length N and a pattern of length M: each of the N alignments may require up to M character comparisons.
Best-case Complexity, String Found
• O(M): the pattern matches at the very start of the text, after only M comparisons.
Best-case Complexity, String Not Found
• O(N): every alignment fails on its first character comparison, so roughly N comparisons in total.
Rabin-Karp Algorithm
• Calculate a hash value for
– The pattern being searched for (length M), and
– Each M-character subsequence in the text
• Start with the first M-character sequence
– Hash it
– Compare the hashed search term against it
– If they match, then look at the letters directly
• Why do we need this step?
– Else go to the next M-character sequence
(Note 1: Karp is a Turing-award winning prof. in CS here!)
(Note 2: CS theory is a good field to be in because they name things after you!)
Rabin-Karp: Looking for 31415
31415 mod 13 = 7
Thus compute each 5-char substring mod 13 looking for 7
2359023141526739921
23590 mod 13 = 8
35902 mod 13 = 9
59023 mod 13 = 3
…
31415 mod 13 = 7
Found 7! Now check the digits
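A minimal Python sketch of this search over digit strings, using the slide's mod-13 hash with a rolling update (names are my own):

    def rabin_karp_digits(text, pattern, q=13):
        n, m = len(text), len(pattern)
        if m > n:
            return -1
        h = pow(10, m - 1, q)          # weight of the window's leading digit
        p_hash = int(pattern) % q      # hash of the pattern: 31415 mod 13 = 7
        t_hash = int(text[:m]) % q     # hash of the first window
        for i in range(n - m + 1):
            # on a hash hit, check the actual digits to rule out a spurious match
            if t_hash == p_hash and text[i:i + m] == pattern:
                return i
            if i < n - m:              # roll the hash to the next window in O(1)
                t_hash = (10 * (t_hash - int(text[i]) * h) + int(text[i + m])) % q
        return -1

    print(rabin_karp_digits("2359023141526739921", "31415"))  # -> 6

The rolling update is what keeps the expected running time linear: each window's hash is derived from the previous one in constant time rather than recomputed from scratch.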
Rabin-Karp Algorithm
• Worst case time?
– N is the length of the text
– Expected O(N) if the hash function is chosen well (with many spurious hash matches the worst case degrades to O(NM))
• http://orca.st.usm.edu/~suzi/stringmatch/rk_alg.html
• http://www.mills.edu/ACAD_INFO/MCS/CS/S00MCS125/String.Matching.Algorithms/animations.html
Tries
• A tree-based data structure for storing strings
in order to make pattern matching faster
• Main idea:
– Store all the strings from the document, one letter at
a time, in a tree structure
– Two strings with the same prefix are in the same
subtree
• Useful for IR prefix queries
– Search for the longest prefix of a query string Q that
matches a prefix of some string in the trie
– The name "trie" comes from retrieval (as in Information Retrieval)
Trie Example
The standard trie over the alphabet {a,b} for
the set {aabab, abaab, babbb, bbaaa, bbbab}
A Simple Incremental Algorithm
• To build the trie, simply add one string at a time
• Check to see if the current character matches
the current node.
• If so, move to the next character
• If not, make a new branch labeled with the
mismatched character, and then move to the
next character
• Repeat
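A minimal Python sketch of this incremental build, with each node holding a dict of children (class and method names are my own; word endings are marked with a flag, which plays the role of a terminator character):

    class TrieNode:
        def __init__(self):
            self.children = {}     # character -> child TrieNode
            self.is_word = False   # marks the end of a stored string

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            node = self.root
            for ch in word:
                # follow an existing branch, or create one on mismatch
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

        def contains(self, word):  # O(M) for a word of length M
            node = self.root
            for ch in word:
                if ch not in node.children:
                    return False
                node = node.children[ch]
            return node.is_word

    t = Trie()
    for w in ["buy", "bell", "hear", "see", "bid", "bear", "stop", "bull", "sell", "stock"]:
        t.insert(w)
    print(t.contains("bear"), t.contains("be"))  # -> True False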
Trie-growing Algorithm
[Figure: a trie built incrementally from the words buy, bell, hear, see, bid, bear, stop, bull, sell, stock]
Tries, more formally
• The path from the root of T to any node represents a prefix
that is equal to the concatenation of the characters
encountered while traversing the path.
– An internal node can have from 1 to d children where d is the
size of the alphabet.
• The previous example is a binary tree because the alphabet had
only 2 letters
– A path from the root of T to an internal node at depth i corresponds to an i-character prefix of a string S
– The height of the tree is the length of the longest string
– If there are S unique strings (none a prefix of another, e.g. by appending a terminator), T has S leaf nodes
– Looking up a string of length M is O(M)
Compressed Tries
Compression is done after the trie has been built up; more items can't be added afterwards.
Compressed Tries
• Also known as PATRICIA Trie
– Practical Algorithm To Retrieve Information Coded In Alphanumeric
– D. Morrison, Journal of the ACM 15 (1968).
• Fixes a space inefficiency of tries
• Tries to remove nodes with only one child
(pardon the pun)
• The number of nodes is proportional to the number of strings,
not to their total length
– But this just makes the node labels longer
– So this only helps if an auxiliary data structure is used to actually
store the strings
– The trie only stores triplets of numbers indicating where in the
auxiliary data structure to look
Compressed Trie
[Figure: the compressed trie for the same words (bear, bell, bid, bull, buy, hear, see, sell, stock, stop); single-child chains are merged into edges labeled with substrings such as "ar", "ll", "id", "to", and "ck"]
Suffix Tries
• Regular tries can only be used to find whole words.
• What if we want to search on suffixes?
– build*, mini*
• Solution: use suffix tries where each possible suffix is
stored in the trie
• Example: minimize
Find: imi
[Figure: the suffix trie of "minimize"; edge labels include mi, i, nimize, ze, e, and mize]
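A toy Python sketch of a suffix trie as nested dicts: insert every suffix, and any substring query becomes a walk from the root (a real suffix tree compresses the chains to avoid the quadratic space this sketch uses):

    def build_suffix_trie(s):
        root = {}
        for i in range(len(s)):        # insert every suffix s[i:]
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
        return root

    def contains_substring(root, pattern):
        node = root                    # every substring is a prefix of some suffix
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True

    trie = build_suffix_trie("minimize")
    print(contains_substring(trie, "imi"))  # -> True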
Dynamic Programming
• Used primarily for optimization problems.
– Not just a good solution, but an optimal one.
• Brute force algorithms
– Try every possibility
– Guarantee finding the optimal solution
– But inefficient
• DP requires a certain amount of structure, namely:
– Simple Subproblems (and simple break-down)
– Global optimum is a composition of subproblem optimums
– Subproblem Overlap: optimal solutions to unrelated subproblems can contain subproblems in common
– In other words, we can re-use the results of solving a subproblem
Longest Common Subsequence
• LCS: find the longest string S that is a
subsequence of both X and Y, where
• X is of length n
• Y is of length m
• Example: what is the LCS of
• supergalactic
• galaxy
• (The characters do not have to be contiguous)
Dynamic Programming Applied
to LCS Problem
Let’s compare:
X = [GTG]
Y = [CGATG]
We write L[i,j] for the length of the longest common subsequence of the prefixes X[0…i] and Y[0…j].
Dynamic Programming for LCS
Note that the longest common subsequence of X and Y (of length L[i,j]) must be equal to the longest common subsequence of ...
X[0…i-1] = [GT] (removing the last G)
Y[0…j-1] = [CGAT] (removing the last G)
… plus 1, since the matching Gs at X_i, Y_j increase the length by one.
Dynamic Programming for LCS
If X_i, Y_j had NOT matched, L[i,j] would have to be equal to the larger of L[i-1,j] and L[i,j-1].
If this is true for L[i,j], it must be true for all L.
We know that L[-1,j] = L[i,-1] = 0 (since an empty prefix has an empty common subsequence).
Finally, we know that L[i,j] cannot be larger than min(i,j)+1, the length of the shorter prefix.
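Combining these observations gives the usual LCS recurrence (written here in LaTeX, in the slides' notation where row and column -1 stand for empty prefixes):

    L[i,j] =
    \begin{cases}
    0 & \text{if } i = -1 \text{ or } j = -1 \\
    L[i-1,j-1] + 1 & \text{if } X_i = Y_j \\
    \max\bigl(L[i-1,j],\ L[i,j-1]\bigr) & \text{otherwise}
    \end{cases}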
Dynamic Programming for LCS
The completed table L[i,j] for X = GTG (rows) and Y = CGATG (columns):

           -1   C(0) G(1) A(2) T(3) G(4)
    -1      0    0    0    0    0    0
    G(0)    0    0    1    1    1    1
    T(1)    0    0    1    1    2    2
    G(2)    0    0    1    1    2    3
For each position: if X_i and Y_j don't match, take the max of L[i-1,j] and L[i,j-1]; when they do match, add 1 to L[i-1,j-1].
L[0,0] = 0 (X_0,Y_0 doesn't match… max of L[-1,0] and L[0,-1])
L[0,1] = 1 (X_0,Y_1 does match… L[-1,0] + 1)
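A minimal Python sketch of this table-filling computation (names are my own; indices are shifted by one so that row and column 0 play the role of the slides' -1):

    def lcs_length(X, Y):
        n, m = len(X), len(Y)
        # L[i][j] = LCS length of X[:i] and Y[:j]; row/column 0 are all
        # zeros, standing in for the slides' index -1 (empty prefixes)
        L = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if X[i - 1] == Y[j - 1]:
                    L[i][j] = L[i - 1][j - 1] + 1    # match: extend the diagonal
                else:
                    L[i][j] = max(L[i - 1][j], L[i][j - 1])
        return L[n][m]

    print(lcs_length("GTG", "CGATG"))             # -> 3
    print(lcs_length("supergalactic", "galaxy"))  # -> 4 ("gala")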
Dynamic Programming
• Running Time/Space:
– Strings of length m and n
– O(mn)
– Brute force algorithm: 2^m subsequences of x to check against n elements of y: O(n·2^m)
Dynamic Programming vs.
Greedy Algorithms
• Sometimes they are the same.
• Sometimes not
• What makes an algorithm greedy?
– Globally optimal solution can be obtained by making
locally optimal choices
• Dynamic Programming
– Solves subproblems whose results can be re-used
– Trickier to think of
– More work to program
Greedy Vs. Dynamic Programming:
• The famous knapsack problem:
– A thief breaks into a museum. Fabulous paintings,
sculptures, and jewels are everywhere. The thief
has a good eye for the value of these objects, and
knows that each will fetch hundreds or thousands of
dollars on the clandestine art collector’s market.
But, the thief has only brought a single knapsack to
the scene of the robbery, and can take away only
what he can carry. What items should the thief take
to maximize the haul?
From www.cs.virginia.edu/~luebke/cs332.fall00/lecture23.ppt and edu/~bodik/cs536.html
The Knapsack Problem
• More formally, the 0-1 knapsack problem:
– The thief must choose among n items, where the ith item is worth v_i dollars and weighs w_i pounds
– Carrying at most W pounds, he wants to maximize the value
• Note: assume v_i, w_i, and W are all integers
• “0-1” b/c each item must be taken or left in its entirety
• A variation, the fractional knapsack problem:
– Thief can take fractions of items
– Think of items in 0-1 problem as gold ingots, in
fractional problem as buckets of gold dust
The Knapsack Problem:
Optimal Substructure
• Both variations exhibit optimal substructure
• To show this for the 0-1 problem, consider the
most valuable load weighing at most W pounds
– If we remove item j from the load, what do we know
about the remaining load?
– The remainder must be the most valuable load weighing at most W - w_j that the thief could take from the museum, excluding item j
Solving The Knapsack Problem
• The optimal solution to the fractional knapsack
problem can be found with a greedy algorithm
• The optimal solution to the 0-1 problem
cannot be found with the same greedy
strategy
– Greedy strategy: take in order of dollars/pound
– Example: 3 items weighing 10, 20, and 30 pounds,
knapsack can hold 50 pounds
• Suppose item 2 is worth $100. Assign values to the
other items so that the greedy strategy will fail
The Knapsack Problem:
Greedy Vs. Dynamic
• The fractional problem can be solved greedily
• The 0-1 problem cannot be solved with a
greedy approach
– It can, however, be solved with dynamic
programming
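A minimal Python sketch of the standard dynamic-programming solution to the 0-1 problem, O(nW) time for n items and capacity W (names are my own):

    def knapsack_01(items, W):
        # items: list of (value, weight) with integer weights
        # best[w] = max value achievable with total weight <= w
        best = [0] * (W + 1)
        for v, wt in items:
            # scan capacities downward so each item is used at most once
            for w in range(W, wt - 1, -1):
                best[w] = max(best[w], best[w - wt] + v)
        return best[W]

    print(knapsack_01([(60, 10), (100, 20), (120, 30)], 50))  # -> 220

Each capacity w poses the subproblem "what is the best load of weight at most w?", and those answers are re-used across items -- exactly the subproblem overlap that dynamic programming exploits.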