PowerPoint Presentation - National Cheng Kung University

• Author: Ricardo A. Baeza-Yates, Gaston H. Gonnet
• Publisher: 1992 Communications of the ACM
• Presenter: Yuen-Shuo Li
• Date: 2013/08/14
• String searching is a very important component of many problems, including text editing, bibliographic retrieval, and symbol manipulation.
• Let T[x] be a table built from the pattern such that:
$$T_i[x] = \begin{cases} 0 & \text{if } x = pattern_i \\ 1 & \text{otherwise} \end{cases}$$
e.g. Let {a, b, c, d} be the alphabet, and ababc the pattern.
T[a] = 11010
T[b] = 10101
T[c] = 01111
T[d] = 11111
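The bits of each entry are written from the last pattern character down to the first (c b a b a), so T[a] = 11010 has its 0 bits under the two a's.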
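The table construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_table` is hypothetical, the table is stored as a dictionary of integers with bit 0 standing for the first pattern character, and the alphabet is assumed to contain every pattern character.

```python
def build_table(pattern, alphabet):
    """Return {char: int} where bit i is 0 iff pattern[i] == char.

    Bit 0 corresponds to the first pattern character, so the binary
    string printed below reads the pattern from right to left.
    Assumes every pattern character is in the alphabet.
    """
    m = len(pattern)
    all_ones = (1 << m) - 1
    table = {ch: all_ones for ch in alphabet}
    for i, ch in enumerate(pattern):
        table[ch] &= ~(1 << i)  # clear bit i: position i matches ch
    return table


T = build_table("ababc", "abcd")
for ch in "abcd":
    print(f"T[{ch}] = {T[ch]:05b}")
```

Run on the example pattern, this prints the same four entries as the slide.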
• State = $S_{m-1} S_{m-2} \dots S_2 S_1 S_0$ (if the length of the pattern is m)
[Figure: animation of the state vector while scanning the text cbbabababcaba…, with T[a] = 11010, T[b] = 10101, T[c] = 01111, T[d] = 11111. States recoverable from the frames:]
State 50301 after reading cbbabab
State 14020 after reading a (shift to 0301_, then add T[a])
State 50301 after reading b (shift to 4020_, then add T[b])
State 04121 after reading c (shift to 0301_, then add T[c])
• To update the state after reading a new character of the text, we must:
  • Shift the state vector b bits to the left to reflect that we have advanced one position in the text.
  • Update the individual states according to the new character.
• Each search step changes the state using the assignment:
$$state_j = (state_{j-1} \ll b) + T[text_j]$$
where b bits represent each individual state $S_i$, which holds the number of mismatches.
• With b = 1, each search step changes the state using the assignment:
$$state_j = (state_{j-1} \ll 1) \mid T[text_j]$$
and each individual state is 0 or 1.
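A minimal Python sketch of the b = 1 case follows. The function name `shift_or_search` is an assumption made for illustration, the pattern is assumed to be non-empty, and characters that do not occur in the pattern are treated as all-ones table entries. Python integers are unbounded, so the state is masked to m bits after each step.

```python
def shift_or_search(pattern, text):
    """Yield the start index of each exact occurrence of pattern in text."""
    m = len(pattern)
    mask = (1 << m) - 1
    # T[ch] has bit i = 0 iff pattern[i] == ch; other characters are all ones.
    T = {}
    for i, ch in enumerate(pattern):
        T[ch] = T.get(ch, mask) & ~(1 << i)
    state = mask                      # initial state: all ones
    match_bit = 1 << (m - 1)          # leftmost bit of the state
    for j, ch in enumerate(text):
        # Shift one bit to the left, then OR in the table entry.
        state = ((state << 1) | T.get(ch, mask)) & mask
        if (state & match_bit) == 0:  # 0 in the leftmost bit: match ends at j
            yield j - m + 1


print(list(shift_or_search("ababc", "abdabababc")))  # -> [5]
```

The single hit at index 5 is the occurrence of ababc at the end of the text abdabababc.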
Let {a, b, c, d} be the alphabet, and ababc the pattern.
T[a] = 11010
T[b] = 10101
T[c] = 01111
T[d] = 11111
• The initial state is 11111.
[Figure: animation of the state while scanning the text abdabababc. State after each character:]
text   state
a      11110
b      11101
d      11111
a      11110
b      11101
a      11010
b      10101
a      11010
b      10101
c      01111
The match at the end of the text is indicated by the value 0 in the leftmost bit of the state.
• The complexity of the search time in the worst and average case is $O\left(\left\lceil \frac{mb}{w} \right\rceil n\right)$, where $\left\lceil \frac{mb}{w} \right\rceil$ is the time to compute a constant number of operations on integers of mb bits using a word size of w bits.
m: pattern size
w: word size
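As a quick illustration of the bound, the following worked values use a word size of w = 32, which is an assumption chosen for the example and not a figure from the slides:

```latex
m = 5,\; b = 1,\; w = 32:\quad
  \left\lceil \frac{mb}{w} \right\rceil = \left\lceil \frac{5}{32} \right\rceil = 1
  \;\Rightarrow\; O(n)

m = 100,\; b = 1,\; w = 32:\quad
  \left\lceil \frac{mb}{w} \right\rceil = \left\lceil \frac{100}{32} \right\rceil = 4
  \;\Rightarrow\; \text{four machine words per search step}
```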
• Now we extend our pattern language to allow don't care symbols, complement symbols and any finite class of symbols.
  • x: A character from the alphabet
  • *: A don't care symbol, which matches any symbol
  • [characters]: A class of characters, for which we allow ranges
  • $\overline{C}$: The complement of a character or class of characters C
• To search for these extended patterns, we need only modify the table T:
$$T[x] = \sum_{i=0}^{m-1} \delta(x \notin Class_{i+1})\, 2^{bi}$$
where $\delta(C)$ is 1 if the condition C is true, and 0 otherwise.
e.g. for the pattern $a\,\overline{b}\,[ab]\,b\,\overline{[a..c]}$:
T[a] = 11000
T[b] = 10011
T[c] = 11101
T[d] = 01101
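The table for extended patterns can be sketched as follows. The representation of a pattern as a list of (set of characters, complemented) pairs and the name `build_class_table` are assumptions made for this illustration. The example pattern is read as a, complement of b, [ab], b, complement of [a..c], which is the reading consistent with the four table values shown above.

```python
def build_class_table(classes, alphabet, b=1):
    """Build T[x] = sum over i of delta(x not in Class_{i+1}) * 2^(b*i).

    classes: one (set_of_chars, complemented) pair per pattern position.
    """
    table = {}
    for x in alphabet:
        value = 0
        for i, (chars, complemented) in enumerate(classes):
            matches = (x in chars) != complemented
            if not matches:
                value |= 1 << (b * i)  # position i does not accept x
        table[x] = value
    return table


# a, complement of b, [ab], b, complement of [a..c]
pattern = [
    ({"a"}, False),
    ({"b"}, True),
    ({"a", "b"}, False),
    ({"b"}, False),
    ({"a", "b", "c"}, True),
]
T = build_class_table(pattern, "abcd")
for x in "abcd":
    print(f"T[{x}] = {T[x]:05b}")
```

A don't care symbol would be expressed in this representation as the complement of the empty set, `(set(), True)`, which accepts every character.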
• We allow up to k characters of the pattern to mismatch with the corresponding text.
For example, if k = 2 and the pattern is mismatch:
mismatch (match, 0 mismatches)
dispatch (match, 2 mismatches)
respatch (no match, 3 mismatches)
• We have to count matches or mismatches. In both cases, at most O(log m) bits per individual state are necessary, because m is a bound for both matches and mismatches.
• Then we have a simple algorithm using
$b = \lceil \log_2(m + 1) \rceil$
and op being addition.
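A minimal sketch of this simple algorithm is shown below. The function name `search_with_mismatches` is hypothetical, and two details are assumptions of the sketch and not taken from the slides: the state starts at zero, and hits are only reported once a full window of m characters has been read, so that partial alignments at the start of the text are never counted.

```python
def search_with_mismatches(pattern, text, k):
    """Yield (start, mismatches) for each window with at most k mismatches."""
    m = len(pattern)
    b = m.bit_length()                 # equals ceil(log2(m + 1)) for m >= 1
    ones = sum(1 << (b * i) for i in range(m))  # a 1 in every individual state
    mask = (1 << (b * m)) - 1
    # T[ch] has a 0 digit where pattern[i] == ch and a 1 digit elsewhere.
    T = {}
    for i, ch in enumerate(pattern):
        T[ch] = T.get(ch, ones) - (1 << (b * i))
    state = 0                          # assumption of this sketch
    for j, ch in enumerate(text):
        # Shift b bits to the left, then add the table entry (op = addition).
        state = ((state << b) & mask) + T.get(ch, ones)
        count = state >> (b * (m - 1))  # leftmost individual state
        if j >= m - 1 and count <= k:
            yield j - m + 1, count


print(list(search_with_mismatches("mismatch", "dispatch", 2)))  # -> [(0, 2)]
print(list(search_with_mismatches("mismatch", "respatch", 2)))  # -> []
```

Because each individual state has room for values up to m, no addition can carry into the neighbouring state.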
b = O(log m) ⇒ b = O(log k)
• Clearly only O(log k) bits are necessary to count if we allow at most k mismatches. The problem is that when adding we have a potential carry into the next state.
• We can get around this problem by having an overflow bit in each individual state, provided that bit is set to zero at each step of the search.
In this case we need $b = \lceil \log_2(k + 1) \rceil + 1$ bits.
At each step we record the overflow bits in an overflow state, and we reset the overflow bits of all individual states.
• We want to search for all occurrences of ababc with at most 2 mismatches. Because the value of b is 3 for 2 mismatches, every position in the state is represented by a number in the range 0 to 4.
• Initial state: 00000
• Initial overflow: 44444
We report a match when the sum of the leftmost digits of the state and the overflow is less than 3.
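The overflow scheme can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the name `search_with_overflow` is hypothetical, k is assumed to be a non-negative integer, and the text abdabababc in the usage line is chosen for the example. The initial state and initial overflow follow the values given above.

```python
def search_with_overflow(pattern, text, k):
    """Yield start indices of windows with at most k mismatches."""
    m = len(pattern)
    b = k.bit_length() + 1             # ceil(log2(k + 1)) + 1 bits per state
    ov_bit = 1 << (b - 1)              # overflow bit inside one individual state
    ones = sum(1 << (b * i) for i in range(m))
    ov_mask = ones * ov_bit            # overflow bit of every individual state
    mask = (1 << (b * m)) - 1
    T = {}
    for i, ch in enumerate(pattern):
        T[ch] = T.get(ch, ones) - (1 << (b * i))
    state = 0                          # 00000 in the example
    overflow = ov_mask                 # 44444 in the example (b = 3)
    top = b * (m - 1)
    for j, ch in enumerate(text):
        state = ((state << b) & mask) + T.get(ch, ones)
        # Record the overflow bits, then reset them in the state.
        overflow = ((overflow << b) & mask) | (state & ov_mask)
        state &= ~ov_mask
        # Match when leftmost state digit + leftmost overflow digit < k + 1.
        if (state >> top) + (overflow >> top) <= k:
            yield j - m + 1


print(list(search_with_overflow("ababc", "abdabababc", 2)))  # -> [3, 5]
```

The window starting at index 3 (ababa) has one mismatch and the window at index 5 (ababc) has none. The all-overflow initial value is what suppresses reports before a full window has been read.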
• It is possible to use only $b = \lceil \log_2(k + 1) \rceil$ bits. The idea is to add 1 to an individual state unless its b-bit slice has only ones.
• For example, if b = 5, we perform the operations
[equation image]
where x will have a 0 in the least significant bit of every 5-bit individual state if all bits in that 5-bit slice are ones, or 1 otherwise.
• If we keep one vector state per pattern, we have an immediate $O\left(\left\lceil \frac{m_{max}}{w} \right\rceil l\, n\right)$ time algorithm for a set of l patterns, where $m_{max}$ is the maximum pattern length.
• However, we can coalesce all the vectors, keeping all the information in only one vector state, achieving $O\left(\left\lceil \frac{m_{sum}}{w} \right\rceil n\right)$ search time, where $m_{sum}$ is the sum of the pattern lengths.
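One way to coalesce the vectors for exact matching (b = 1) is sketched below. The slides do not show how the boundaries between patterns are handled, so the masking step here is an assumption of this sketch: after each shift, the first bit of every pattern is forced to 0 so that the last bit of one pattern never leaks into the next. The name `multi_shift_or` is hypothetical and all patterns are assumed to be non-empty.

```python
def multi_shift_or(patterns, text):
    """Yield (pattern_index, start) for exact matches, using one state vector."""
    total = sum(len(p) for p in patterns)
    mask = (1 << total) - 1
    T = {}
    first_bits = 0      # bit set at the first position of every pattern
    last_mask = 0       # bit set at the last position of every pattern
    ends = []           # (pattern index, its last bit, its length)
    offset = 0
    for idx, p in enumerate(patterns):
        first_bits |= 1 << offset
        for i, ch in enumerate(p):
            T[ch] = T.get(ch, mask) & ~(1 << (offset + i))
        offset += len(p)
        last_bit = 1 << (offset - 1)
        last_mask |= last_bit
        ends.append((idx, last_bit, len(p)))
    state = mask
    for j, ch in enumerate(text):
        # Shift, clear the first bit of each pattern, then OR in the table entry.
        state = (((state << 1) & ~first_bits) | T.get(ch, mask)) & mask
        hits = ~state & last_mask       # last bits that are currently 0
        if hits:
            for idx, last_bit, length in ends:
                if hits & last_bit:
                    yield idx, j - length + 1


print(list(multi_shift_or(["ab", "ba"], "abab")))  # -> [(0, 0), (1, 1), (0, 2)]
```

Each text character costs a constant number of operations on one integer of m_sum bits, and the inner loop runs only when at least one pattern has matched.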
• Experimental results for searching 100 times for all possible matches of a pattern in a 50,000-character English text (a legal document).
BMH: Boyer-Moore, as suggested by Horspool
• The execution time while searching for 1,000 words chosen at random from the same English text.