PowerPoint Presentation - National Cheng Kung University
Author: Ricardo A. Baeza-Yates, Gaston H. Gonnet
Publisher: 1992 Communications of the ACM
Presenter: Yuen-Shuo Li
Date: 2013/08/14
String searching is a very important component of many
problems, including text editing, bibliographic retrieval, and
symbol manipulation.
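Let T be a table built from the pattern, with one bit per pattern position i for each character x of the alphabet:
$T_i[x] = \begin{cases} 0 & \text{if } x = pattern_i \\ 1 & \text{otherwise} \end{cases}$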
e.g. Let {a, b, c, d} be the alphabet, and ababc the pattern.
T[a] = 11010
T[b] = 10101
T[c] = 01111
T[d] = 11111
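The bits are written with the last pattern position leftmost, i.e. aligned with the reversed pattern cbaba:
T[a] = 11010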
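The table construction above can be sketched in a few lines of Python. This is a minimal sketch, not the presenter's code: the helper names `build_table` and `bits` are hypothetical, and it assumes every pattern character belongs to the given alphabet. Bit i of an entry (counting from the right) corresponds to pattern position i + 1.

```python
def build_table(pattern, alphabet):
    """Return {char: int} where bit i is 0 iff pattern[i] == char."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    table = {ch: all_ones for ch in alphabet}
    for i, ch in enumerate(pattern):
        table[ch] &= ~(1 << i)  # clear bit i: position i matches ch
    return table


def bits(value, m):
    """Format value as an m-digit binary string (last pattern position leftmost)."""
    return format(value, f"0{m}b")


T = build_table("ababc", "abcd")
for ch in "abcd":
    print(f"T[{ch}] = {bits(T[ch], 5)}")
```

Run on the example, this prints the four table entries listed above (11010, 10101, 01111, 11111).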
State = $S_{m-1} S_{m-2} \dots S_2 S_1 S_0$ (for a pattern of length m)
Each individual state $S_i$ holds the number of mismatches between the first i+1 characters of the pattern and the i+1 text characters ending at the current position.
[Figure: animation of the state for the pattern ababc as it advances over the text, using T[a] = 11010, T[b] = 10101, T[c] = 01111, T[d] = 11111. Frames recoverable from the slides:]
Character read    State (S4 S3 S2 S1 S0)
(earlier text)    5 0 3 0 1
a                 1 4 0 2 0
b                 5 0 3 0 1
c                 0 4 1 2 1    (S4 = 0: the pattern matches)
To update the state after reading a new character of the text, we must:
Shift the state vector b bits to the left, to reflect that we have advanced one position in the text.
Update the individual states according to the new character.
Here b is the number of bits used to represent each individual state $S_i$.
When each $S_i$ holds the number of mismatches, each search step changes the state using the assignment:
$state_j = (state_{j-1} \ll b) + T[text_j]$
For exact matching, each $S_i$ is 0 or 1, so b = 1 and the addition becomes a bitwise OR:
$state_j = (state_{j-1} \ll 1) \mid T[text_j]$
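The mismatch-counting form of the update can be tried out with the following sketch. It rests on several assumptions that are not on the slides: the helper names are hypothetical, b = 3 is chosen so that a digit can hold counts 0..5 for the 5-character pattern, the state is truncated to m·b bits after every step, and the text prefix abdabab is simply one input that produces the state 5 0 3 0 1.

```python
def build_count_table(pattern, alphabet, b):
    """Digit i (b bits wide) of T[x] is 0 if pattern[i] == x, else 1."""
    table = {}
    for x in alphabet:
        value = 0
        for i, p in enumerate(pattern):
            if p != x:
                value += 1 << (b * i)
        table[x] = value
    return table


def step(state, ch, table, m, b):
    """Shift every individual state by b bits, then add T[ch]."""
    mask = (1 << (m * b)) - 1          # keep only m digits of b bits each
    return ((state << b) + table[ch]) & mask


def digits(state, m, b):
    """Individual states as a list [S_{m-1}, ..., S_0]."""
    digit_mask = (1 << b) - 1
    return [(state >> (b * i)) & digit_mask for i in reversed(range(m))]


pattern, m, b = "ababc", 5, 3          # assumption: 3 bits per individual state
T = build_count_table(pattern, "abcd", b)
state = 0
for ch in "abdabab":
    state = step(state, ch, T, m, b)
print(digits(state, m, b))             # [5, 0, 3, 0, 1]
state = step(state, "a", T, m, b)
print(digits(state, m, b))             # [1, 4, 0, 2, 0]
```

Because no digit can exceed m = 5, which fits in 3 bits, the addition never carries from one individual state into the next.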
Let {a, b, c, d} be the alphabet, and ababc the pattern.
T[a] = 11010
T[b] = 10101
T[c] = 01111
T[d] = 11111
The initial state is 11111.
[Figure: animation of the exact-matching search over the text abdabababc. States recoverable from the frames:]
Character read    State
(initial)         11111
a                 11110
b                 11101
d                 11111
a                 11110
b                 11101
a                 11010
b                 10101
a                 11010
b                 10101
c                 01111    (match)
The match at the end of the text is indicated by the value 0 in the leftmost bit of the state.
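The exact-matching search (b = 1, OR as the update operation) can be sketched as follows. The function name `shift_or_search` is hypothetical, the text abdabababc is the one read off the animation frames, and the sketch assumes a non-empty pattern. Characters that do not occur in the pattern are treated as having an all-ones table entry.

```python
def shift_or_search(pattern, text):
    """Yield the end index of every exact occurrence of pattern in text."""
    m = len(pattern)                    # assumed to be at least 1
    all_ones = (1 << m) - 1
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = table.get(ch, all_ones) & ~(1 << i)
    state = all_ones                    # initial state: no prefix matched yet
    match_bit = 1 << (m - 1)            # leftmost bit of the state
    for j, ch in enumerate(text):
        # Shift left by one bit, OR in T[ch], keep only m bits.
        state = ((state << 1) | table.get(ch, all_ones)) & all_ones
        if (state & match_bit) == 0:    # leftmost bit 0: full pattern matched
            yield j


print(list(shift_or_search("ababc", "abdabababc")))  # [9]
```

The single reported index 9 is the last character of the text, which agrees with the final state 01111 in the animation.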
The complexity of the search time in both the worst and average case is $O(\lceil mb/w \rceil \, n)$, where $\lceil mb/w \rceil$ is the time to compute a constant number of operations on integers of mb bits using a word size of w bits.
m: pattern size
b: bits per individual state
n: text size
w: word size
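As a small worked example of the factor $\lceil mb/w \rceil$, the sketch below computes how many machine words a state occupies. The function name and the sample values of m, b and w are assumptions chosen only for illustration.

```python
def words_per_state(m, b, w):
    """Number of w-bit machine words needed for a state of m*b bits."""
    return (m * b + w - 1) // w   # integer form of ceil(m*b / w)


print(words_per_state(5, 1, 32))    # 1
print(words_per_state(20, 3, 32))   # 2
```

When the whole state fits in one word, each text character costs a constant number of word operations, so the search is linear in n.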
Now we extend our pattern language to allow don’t care
symbols, complement symbols and any finite class of symbols.
x: A character from the alphabet
*: A don’t care symbol which matches any symbol
[characters]: A class of characters, for which we allow ranges
$\overline{C}$: The complement of a character or class of characters C.
To search for these extended patterns, we need only to modify
the table T.
Let $\delta(C)$ be 1 if the condition C is true, and 0 otherwise. Then
$T[x] = \sum_{i=0}^{m-1} \delta(x \notin Class_{i+1}) \, 2^{bi}$
Example: for the alphabet {a, b, c, d} and the pattern $a \, \overline{b} \, [ab] \, b \, \overline{[a..c]}$:
T[a] = 11000
T[b] = 10011
T[c] = 11101
T[d] = 01101
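The sum defining T[x] for extended patterns can be sketched as below. This assumes the example pattern reads: a, not-b, [ab], b, not-[a..c], which is the reading consistent with the four T values listed. Each pattern position is represented by the set of characters it accepts; the names `build_class_table` and `complement` are hypothetical.

```python
ALPHABET = "abcd"


def complement(chars):
    """Characters of the alphabet that are NOT in chars."""
    return set(ALPHABET) - set(chars)


def build_class_table(classes, alphabet, b=1):
    """classes[i] is the set of characters accepted at pattern position i + 1.

    T[x] gets the term 2^(b*i) exactly when x is not in classes[i].
    """
    table = {}
    for x in alphabet:
        table[x] = sum(1 << (b * i)
                       for i, cls in enumerate(classes)
                       if x not in cls)
    return table


# Assumed reading of the example pattern: a, not-b, [ab], b, not-[a..c]
classes = [set("a"), complement("b"), set("ab"), set("b"), complement("abc")]
T = build_class_table(classes, ALPHABET)
for x in ALPHABET:
    print(f"T[{x}] = {T[x]:05b}")
```

A don't care symbol would simply be the full set `set(ALPHABET)`, which contributes a 0 bit for every character. Only the table changes; the search loop stays the same.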
We allow up to k characters of the pattern to mismatch with the
corresponding text.
For example, if k = 2 and the pattern is the word mismatch:
mismatch (match)
dispatch (match)
respatch (mismatch)
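The three cases above can be checked directly by counting differing positions. This is only a brute-force check of the definition, not the bit-parallel algorithm; the function name `mismatches` is hypothetical.

```python
def mismatches(pattern, window):
    """Count positions where pattern and an equally long text window differ."""
    if len(pattern) != len(window):
        raise ValueError("pattern and window must have the same length")
    return sum(p != t for p, t in zip(pattern, window))


k = 2
for word in ("mismatch", "dispatch", "respatch"):
    count = mismatches("mismatch", word)
    print(word, count, "match" if count <= k else "mismatch")
```

Here dispatch differs from the pattern in 2 positions and respatch in 3, so only the latter exceeds k = 2.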
We have to count matches or mismatches. In both cases, at most O(log m) bits per individual state are necessary, because m is a bound for both matches and mismatches.
This gives a simple algorithm using $b = \lceil \log_2(m+1) \rceil$, with the update operation (op) being addition.
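A minimal sketch of this simple algorithm follows. The function name is hypothetical, the text abdabababc is reused from the earlier example, and two details are choices of the sketch rather than something stated on the slides: the state starts at 0, and matches are only reported once a full window of m characters has been read.

```python
def search_with_mismatches(pattern, text, k):
    """Yield (end_index, mismatch_count) for windows with at most k mismatches."""
    m = len(pattern)
    b = m.bit_length()                  # equals ceil(log2(m + 1)) for m >= 1
    ones = sum(1 << (b * i) for i in range(m))   # digit 1 in every position
    table = {}
    for i, ch in enumerate(pattern):
        # Digit i becomes 0 where the pattern character equals ch.
        table[ch] = table.get(ch, ones) - (1 << (b * i))
    mask = (1 << (m * b)) - 1
    top = b * (m - 1)                   # bit offset of the leftmost digit
    state = 0
    for j, ch in enumerate(text):
        state = ((state << b) + table.get(ch, ones)) & mask
        count = state >> top            # mismatches of the full pattern
        if j >= m - 1 and count <= k:   # sketch's choice: skip partial windows
            yield j, count


print(list(search_with_mismatches("ababc", "abdabababc", 2)))
# [(7, 1), (9, 0)]
```

Since every count is at most m and b bits can hold values up to m, no addition carries into the neighbouring individual state.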
Reducing b from O(log m) to O(log k)
Clearly only O(log k) bits are necessary to count if we allow at most k mismatches. The problem is that when adding we have a potential carry into the next state.
We can get around this problem by having an overflow bit that is set to zero at each step of the search.
In this case we need $b = \lceil \log_2(k+1) \rceil + 1$ bits.
At each step we record the overflow bits in an overflow state, and we reset the overflow bits of all individual states.
We want to search for all occurrences of ababc with at most 2 mismatches. Because the value of b is 3 for 2 mismatches, every position in the state is represented by a number in the range 0..4.
Initial state: 00000
Initial overflow: 44444
We report a match when the sum of the leftmost digits of the state and the overflow is less than 3.
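The overflow-bit variant can be sketched as follows, as one possible reading of the description above rather than the authors' code. The function name is hypothetical and the text abdabababc is reused from the earlier example. The highest bit of each b-bit digit serves as the overflow bit, so an overflow digit is either 0 or, for b = 3, the value 4.

```python
def search_k_mismatches_overflow(pattern, text, k):
    """Yield end indices of windows with at most k mismatches."""
    m = len(pattern)
    b = k.bit_length() + 1              # ceil(log2(k + 1)) + 1 bits per digit
    low = sum(1 << (b * i) for i in range(m))    # digit 1 in every position
    overflow_mask = low << (b - 1)      # highest bit of every digit
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = table.get(ch, low) - (1 << (b * i))
    mask = (1 << (m * b)) - 1
    top = b * (m - 1)                   # bit offset of the leftmost digit
    state, overflow = 0, overflow_mask  # 00000 and 44444 when m = 5, b = 3
    for j, ch in enumerate(text):
        state = ((state << b) + table.get(ch, low)) & mask
        # Record the overflow bits, then reset them in the state.
        overflow = ((overflow << b) | (state & overflow_mask)) & mask
        state &= ~overflow_mask
        if (state >> top) + (overflow >> top) <= k:
            yield j


print(list(search_k_mismatches_overflow("ababc", "abdabababc", 2)))  # [7, 9]
```

The initial overflow of 44444 keeps the leftmost digit above k until m characters have been read, so no partial window is reported.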
It is possible to use only $b = \lceil \log_2(k+1) \rceil$ bits. The idea is to add 1 to a b-bit slice unless all of its bits are already ones.
For example, if b = 5, we perform the operations
[Equation: the bitwise operations that compute x, shown as an image on the original slide]
where x has a 0 in the least significant bit of each 5-bit individual state whose bits are all ones, and a 1 otherwise.
If we keep one state vector per pattern, we have an immediate $O(\lceil m_{max}/w \rceil \, l \, n)$ time algorithm for a set of l patterns, where $m_{max}$ is the length of the longest pattern.
However, we can coalesce all the vectors, keeping all the information in only one state vector and achieving $O(\lceil m_{sum}/w \rceil \, n)$ search time, where $m_{sum}$ is the sum of the pattern lengths.
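One way to coalesce the vectors for exact matching is sketched below. The slides do not show how the boundaries between patterns are handled, so the `start_bits` mask is an assumption of this sketch: after the shift, it clears the first bit of every pattern so that the last bit of one pattern does not leak into the next. The function name and the sample patterns are hypothetical.

```python
def multi_shift_or(patterns, text):
    """Yield (end_index, pattern_index) for every exact occurrence."""
    offsets, total = [], 0
    for p in patterns:                  # each pattern assumed non-empty
        offsets.append(total)
        total += len(p)
    all_ones = (1 << total) - 1
    start_bits = sum(1 << off for off in offsets)
    end_bits = sum(1 << (off + len(p) - 1) for p, off in zip(patterns, offsets))
    table = {}
    for p, off in zip(patterns, offsets):
        for i, ch in enumerate(p):
            table[ch] = table.get(ch, all_ones) & ~(1 << (off + i))
    state = all_ones
    for j, ch in enumerate(text):
        # Assumption of this sketch: reset each pattern's first bit after the shift.
        shifted = (state << 1) & ~start_bits
        state = (shifted | table.get(ch, all_ones)) & all_ones
        hits = ~state & end_bits        # end bits that are 0 mark matches
        if hits:
            for idx, (p, off) in enumerate(zip(patterns, offsets)):
                if (hits >> (off + len(p) - 1)) & 1:
                    yield j, idx


print(list(multi_shift_or(["ab", "bc", "abc"], "abcab")))
# [(1, 0), (2, 1), (2, 2), (4, 0)]
```

The single state has $m_{sum}$ bits, so each text character still costs a constant number of operations on one vector, in line with the second bound above.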
Experimental results for searching 100 times for all possible matches of a pattern in a 50,000-character English text (a legal document).
BMH: Boyer-Moore, as suggested by Horspool
Execution time when searching for 1,000 words chosen at random from the same English text.