28. KMP Algorithm -

KMP algorithm
KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., Fast pattern matching in strings,
SIAM Journal on Computing 6(1), 1977, pp. 323-350.
Advisor: Prof. R. C. T. Lee
Reporter: C. W. Lu
KMP Table
• The KMP algorithm constructs a table in the
preprocessing phase.
• In the searching phase, the window shift is
determined by looking up the table.
• Example:
P = bcbabcbaebcbabcba
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
KMPtable  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1  0 -1  1
• Once the KMP table is constructed, whenever
a mismatch occurs at location i of the pattern,
the KMP algorithm moves the pattern
i - KMPtable(i) steps, where locations are
numbered from 0.
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
KMPtable  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1  0 -1  1
T: … b c b a c c b a e b c b a b c b a …
P:   b c b a b c b a e b c b a b c b a
               b c b a b c b a e b c b a b c b a
Mismatch occurs at location 4 of P.
Move P (4 - KMPtable[4]) = 4 - (-1) = 5 steps.
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
KMPtable  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1  0 -1  1
T: … b c b a b c b a b b c b a b c b a …
P:   b c b a b c b a e b c b a b c b a
             b c b a b c b a e b c b a b c b a
Mismatch occurs at location 8 of P.
Move P (8 - KMPtable[8]) = 8 - 4 = 4 steps.
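The two shifts above can be reproduced with a short sketch of the searching phase. This is a minimal Python illustration rather than the slides' own code: the name kmp_first_match is hypothetical, the table is copied from the slides instead of being computed, and the function stops at the first occurrence because the table defines no shift for the case where the whole pattern has matched.

```python
def kmp_first_match(text, pattern, table):
    """Return the start of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    start = 0  # position in text where the window (pattern) begins
    i = 0      # location in the pattern currently being compared
    while start + m <= n:
        while i < m and text[start + i] == pattern[i]:
            i += 1
        if i == m:
            return start
        # Mismatch at location i: move the pattern i - table[i] steps.
        start += i - table[i]
        # The first table[i] characters are already known to match.
        i = max(table[i], 0)
    return -1


P = "bcbabcbaebcbabcba"
KMP_TABLE = [-1, 0, -1, 1, -1, 0, -1, 1, 4, -1, 0, -1, 1, -1, 0, -1, 1]

# A text whose first window mismatches at location 8 of P.
T = "bcbabcbab" + P
print(kmp_first_match(T, P, KMP_TABLE))  # prints 9
```

On this text the first window mismatches at location 8, so the pattern moves 8 - 4 = 4 steps and the comparison resumes at location 4 of P instead of restarting at location 0.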
The Definition of the KMP Table
• For location i, if j is the largest value less than i
such that P(0, j-1) is a suffix of P(0, i-1) and
P(i) ≠ P(j), then KMPtable(i) = j.
• Example
i          0  1  2  3  4  5  6  7  8
P          b  c  b  a  b  c  b  b  e
KMPtable  -1  0 -1  1 -1  0 -1  3  1
∵ P(0, 2) is the longest prefix which is equal to a suffix of P(0, 6), and P(7) ≠ P(3).
∴ KMPtable[7] = 3.
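The definition can be checked directly with a brute-force sketch in Python. The function name kmp_table_by_definition is hypothetical, and two readings are assumptions made here: the empty prefix (j = 0) counts as a suffix of anything, and the value is -1 when no j qualifies. It is written for clarity, not speed.

```python
def kmp_table_by_definition(p):
    """Build the KMP table by testing every candidate j, largest first."""
    table = []
    for i in range(len(p)):
        value = -1  # used when no j qualifies, which forces P(i) = P(0)
        for j in range(i - 1, -1, -1):
            # P(0, j-1) is p[:j]; P(0, i-1) is p[:i].
            if p[:i].endswith(p[:j]) and p[i] != p[j]:
                value = j
                break
        table.append(value)
    return table


print(kmp_table_by_definition("bcbabcbbe")[7])  # prints 3
```

For location 7 the candidates j = 6, 5 and 4 fail the suffix test, and j = 3 passes both tests, which matches the reasoning above.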
Condition for KMPtable[i] = -1
Condition A: P(0) = P(i)
Condition B: P(0, j) is a suffix of P(0, i-1)
Condition C: P(j+1) = P(i)
KMPtable(i) = -1 :
A & (¬B ∨ (∀j)(B & C))
≡ (A & ¬B) ∨ (∀j)(A & (B & C))
≡ (A & ¬B) ∨ (∀j)(A & B & C)
Here ¬B means that no j satisfies B, and (∀j) ranges over every j that satisfies B.
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
KMPtable  -1  0 -1  1 -1
• There is no proper suffix of P(0, 3) which is equal to a prefix
of P(0, 3). (¬B)
• P(0) = P(4). (A)
• KMPtable[4] = -1 because it satisfies the condition
(A & ¬B).
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
KMPtable  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1  0 -1
• There are two proper suffixes of P(0, 14) which are equal to
a prefix of P(0, 14):
bcbabc (P(0, 5)), and P(6) = P(15);
bc (P(0, 1)), and P(2) = P(15).
((∀j)(B & C))
• P(0) = P(15). (A)
• KMPtable[15] = -1 because it satisfies the condition
(∀j)(A & B & C).
Condition for KMPtable[i] = 0
• Condition A: P(0) = P(i)
• Condition B: P(0, j) is a suffix of P(0, i-1)
• Condition C: P(j+1) = P(i)
KMPtable(i) = 0 :
¬A & (¬B ∨ (∀j)(B & C))
≡ (¬A & ¬B) ∨ (∀j)(¬A & (B & C))
≡ (¬A & ¬B) ∨ (∀j)(¬A & B & C)
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
KMPtable  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1  0
• There are two proper suffixes of P(0, 13) which are equal to
a prefix of P(0, 13):
bcbab (P(0, 4)), and P(5) = P(14);
b (P(0)), and P(1) = P(14).
((∀j)(B & C))
• P(0) ≠ P(14). (¬A)
• KMPtable[14] = 0 because it satisfies the condition
(∀j)(¬A & B & C).
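The conditions for -1 and 0 can be evaluated mechanically. The sketch below is a hypothetical Python helper, classify_by_conditions, which assumes that B refers to proper suffixes only and that the quantifier ranges over the j that satisfy B. It returns None when the conditions do not apply, that is, when the table entry is positive.

```python
def classify_by_conditions(p, i):
    """Return -1 or 0 when the conditions decide KMPtable[i], else None."""
    a = p[0] == p[i]  # condition A
    # Every j for which P(0, j) is a proper suffix of P(0, i-1) (condition B).
    js = [j for j in range(i - 1) if p[:i].endswith(p[: j + 1])]
    # Condition C must hold for each such j.
    all_c = all(p[j + 1] == p[i] for j in js)
    if not js or all_c:
        return -1 if a else 0
    return None  # some j satisfies B but not C, so KMPtable[i] is positive


P = "bcbabcbaebcbabcba"
for i in (4, 15, 14, 8):
    print(i, classify_by_conditions(P, i))
```

Locations 4 and 15 give -1, location 14 gives 0, and location 8 gives None because P(0, 3) is a suffix of P(0, 7) while P(4) ≠ P(8).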
How to Construct the KMP Table Efficiently?
• Note that the KMP algorithm is actually an
improvement of the MP algorithm. Therefore,
we may now take a look at the table used in
the MP algorithm.
• We call the table used in the MP algorithm the
prefix table.
The Definition of the Prefix Table
• For location i, let j be the largest value with j ≤ i, if it
exists, such that P(0, j-1) is a suffix of P(0, i); then
Prefix(i) = j.
• If no prefix of P(0, i) other than P(0, i) itself is equal to
a suffix of P(0, i), then Prefix(i) = 0.
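As a reference point, the prefix table can be computed straight from this definition. The following Python sketch uses a hypothetical function name, prefix_by_definition, and assumes that only proper prefixes (j ≤ i) are considered; it is deliberately naive.

```python
def prefix_by_definition(p):
    """Prefix(i) = length of the longest proper prefix of P(0, i) that is also its suffix."""
    table = []
    for i in range(len(p)):
        head = p[: i + 1]  # P(0, i)
        value = 0          # default when no prefix equals a suffix
        for j in range(i, 0, -1):  # try the longest candidate first
            if head.endswith(p[:j]):
                value = j
                break
        table.append(value)
    return table


print(prefix_by_definition("bcba"))  # prints [0, 0, 1, 0]
```

Each entry repeats the suffix test from scratch, so this version is useful only for checking a faster construction.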
Example
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2  3  4  0  1  2  3  4  5  6  7  8
• Note that, in the MP algorithm, we move the
pattern i-Prefix(i)+1 steps when location i is the
last matched location, that is, when a mismatch
occurs at location i+1.
How Can We Construct the Prefix Table Efficiently?
• To compute Prefix(i), we look at Prefix(i-1).
• In the following example, since Prefix(11)=4,
we know that there exists a prefix of length 4
which is equal to a suffix with length 4 of
P(0,11). Besides, P(4)=P(12). We may
conclude that Prefix(12)=Prefix(11)+1=4+1=5.
i          0  1  2  3  4  5  6  7  8  9 10 11 12
P          a  g  c  a  a  c  c  g  a  g  c  a  a
Prefix     0  0  0  1  1  0  0  0  1  2  3  4  5
Another Case
• Consider the following example.
i          0  1  2  3  4  5  6  7  8  9 10
P          c  g  c  g  a  g  c  g  c  g  c
Prefix     0  0  1  2  0  0  1  2  3  4  ?
• Prefix(9)=4. But P(4)≠P(10).
• Can we conclude that Prefix(10)=0?
• No, we cannot.
• There exists a shorter prefix with length 2
which is equal to a suffix of P(0, 9), and
P(10)=P(2). We should conclude that
Prefix(10)=2+1=3.
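The fallback from the prefix of length 4 to the prefix of length 2 can be traced with a small Python sketch. The helper name extend_prefix is hypothetical, and the values Prefix(0..9) are copied from the table rather than computed.

```python
def extend_prefix(p, prefix, i):
    """Compute Prefix(i) from Prefix(0..i-1), recording the lengths tried."""
    t = prefix[i - 1]
    tried = []
    while True:
        tried.append(t)
        if p[i] == p[t]:
            return t + 1, tried
        if t == 0:
            return 0, tried
        t = prefix[t - 1]  # fall back to the next shorter prefix


P = "cgcgagcgcgc"
PREFIX = [0, 0, 1, 2, 0, 0, 1, 2, 3, 4]  # Prefix(0..9) from the table

print(extend_prefix(P, PREFIX, 10))  # prints (3, [4, 2])
```

The trace shows that length 4 is tried first and rejected because P(4) ≠ P(10), then length 2 succeeds because P(2) = P(10).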
i          0  1  2  3  4  5  6  7  8  9 10
P          c  g  c  g  a  g  c  g  c  g  c
Prefix     0  0  1  2  0  0  1  2  3  4  ?
• In other words, we may use the pointer idea
expressed below:
[Figure: pattern P with pointers j and i-1 and labels Y and X.]
• It may be necessary to examine P(0, j) to see
whether there exists a prefix of P(0, j) equal to
a suffix of P(0, j).
• Thus the Prefix function can be found
recursively.
Construct the Prefix Function f
f[0] = 0
for ( i = 1 ; i < m ; i++ ) {
    t = f[i-1];
    while ( t >= 0 ) {
        if ( P[i] == P[t] ) {
            f[i] = t + 1;
            break;
        }
        else {
            if ( t != 0 )
                t = f[t-1];    /* recursive */
            else {
                f[i] = 0;
                break;
            }
        }
    }
}
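The same construction can be written compactly in Python. This is a sketch under the assumption that P is a Python string and f a list; the function name prefix_function is chosen here for illustration.

```python
def prefix_function(p):
    """Prefix table f: f[i] is the longest proper prefix of p[:i+1] that is also its suffix."""
    m = len(p)
    f = [0] * m
    for i in range(1, m):
        t = f[i - 1]
        # Fall back through shorter and shorter prefixes until one can be extended.
        while t > 0 and p[i] != p[t]:
            t = f[t - 1]
        f[i] = t + 1 if p[i] == p[t] else 0
    return f


print(prefix_function("cgcgagcgcgc"))
```

The inner loop folds the two break cases of the pseudocode into one condition; the values it produces are the same.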
Example:
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0
t = f[i-1] = f[0] = 0;
∵ P[1] = c ≠ P[t] = P[0] = b, ∴ f[1] = 0.
Example:
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2
t = f[i-1] = f[4] = 1;
∵ P[5] = c = P[t] = P[1] = c, ∴ f[5] = t + 1 = 2.
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2  3  4  0
t = f[i-1] = f[7] = 4;
∵ P[8] = e ≠ P[t] = P[4] = b,
t != 0;
t = f[t-1] = f[3] = 0;
∵ P[8] = e ≠ P[t] = P[0] = b, ∴ f[8] = 0.
Example:
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2  3  4  0  1  2  3  4  5  6  7
t = f[i-1] = f[14] = 6;
∵ P[15] = b = P[t] = P[6] = b, ∴ f[15] = t + 1 = 7.
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2  3  4  0  1  2  3  4  5  6  7  8
• The KMP Table can also be constructed
recursively.
The KMPtable
KMPtable[0] = -1
for ( i = 1 ; i < m ; i++ ) {
    KMPtable[i] = undefined
    t = f[i-1]
    while ( t > 0 ) {
        if ( P[i] ≠ P[t] ) {
            KMPtable[i] = t
            break
        }
        else t = f[t-1]    /* recursive */
    }
    if ( KMPtable[i] is undefined )
        if ( P[i] == P[0] )
            KMPtable[i] = -1
        else
            KMPtable[i] = 0
}
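A runnable Python version of this construction is sketched below. The function names prefix_function and kmp_table are chosen here for illustration, and None stands in for the undefined marker of the pseudocode.

```python
def prefix_function(p):
    """Prefix table f used by the MP algorithm."""
    f = [0] * len(p)
    for i in range(1, len(p)):
        t = f[i - 1]
        while t > 0 and p[i] != p[t]:
            t = f[t - 1]
        f[i] = t + 1 if p[i] == p[t] else 0
    return f


def kmp_table(p):
    """KMP table built from the prefix table, following the pseudocode above."""
    m = len(p)
    f = prefix_function(p)
    table = [-1] * m  # KMPtable[0] = -1; other entries are overwritten below
    for i in range(1, m):
        value = None  # "undefined" until a suitable t is found
        t = f[i - 1]
        while t > 0:
            if p[i] != p[t]:
                value = t
                break
            t = f[t - 1]  # try the next shorter prefix
        if value is None:
            value = -1 if p[i] == p[0] else 0
        table[i] = value
    return table


print(kmp_table("bcbabcbaebcbabcba"))
```

For the example pattern the printed list can be compared entry by entry with the KMPtable row of the slides.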
Example:
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2  3  4  0  1  2  3  4  5  6  7  8
KMPtable  -1  0 -1  1
t = f[i-1] = f[2] = 1;
∵ P[3] = a ≠ P[t] = P[1] = c, ∴ KMPtable[3] = t = 1.
Example:
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2  3  4  0  1  2  3  4  5  6  7  8
KMPtable  -1  0 -1  1 -1  0 -1  1
t = f[i-1] = f[6] = 3;
∵ P[7] = a = P[t] = P[3] = a;
t = f[t-1] = f[2] = 1;
∵ P[7] = a ≠ P[t] = P[1] = c;
∴ KMPtable[7] = t = 1.
Example:
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2  3  4  0  1  2  3  4  5  6  7  8
KMPtable  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1
t = f[i-1] = f[12] = 4;
∵ P[13] = b = P[t] = P[4] = b;
t = f[t-1] = f[3] = 0;
∵ P[13] = b = P[0] = b;
∴ KMPtable[13] = -1.
Example:
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P          b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Prefix     0  0  1  0  1  2  3  4  0  1  2  3  4  5  6  7  8
KMPtable  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1  0 -1  1
t = f[i-1] = f[15] = 7;
∵ P[16] = a = P[t] = P[7] = a;
t = f[t-1] = f[6] = 3;
∵ P[16] = a = P[t] = P[3] = a;
t = f[t-1] = f[2] = 1;
∵ P[16] = a ≠ P[t] = P[1] = c;
∴ KMPtable[16] = t = 1.
Time Complexity
The preprocessing phase takes O(m) time and space, where m is the length of the pattern.
The searching phase takes O(n+m) time, where n is the length of the text.
References
AHO, A.V., 1990, Algorithms for finding patterns in strings. in Handbook of Theoretical Computer Science, Volume
A, Algorithms and complexity, J. van Leeuwen ed., Chapter 5, pp 255-300, Elsevier, Amsterdam.
AOE, J.-I., 1994, Computer algorithms: string pattern matching strategies, IEEE Computer Society Press.
BAASE, S., VAN GELDER, A., 1999, Computer Algorithms: Introduction to Design and Analysis, 3rd Edition,
Chapter 11, pp. ??-??, Addison-Wesley Publishing Company.
BAEZA-YATES R., NAVARRO G., RIBEIRO-NETO B., 1999, Indexing and Searching, in Modern Information
Retrieval, Chapter 8, pp 191-228, Addison-Wesley.
BEAUQUIER, D., BERSTEL, J., CHRÉTIENNE, P., 1992, Éléments d'algorithmique, Chapter 10, pp 337-377,
Masson, Paris.
CORMEN, T.H., LEISERSON, C.E., RIVEST, R.L., 1990. Introduction to Algorithms, Chapter 34, pp 853-885,
MIT Press.
CROCHEMORE, M., 1997. Off-line serial exact string searching, in Pattern Matching Algorithms, ed. A.
Apostolico and Z. Galil, Chapter 1, pp 1-53, Oxford University Press.
CROCHEMORE, M., HANCART, C., 1999, Pattern Matching in Strings, in Algorithms and Theory of Computation
Handbook, M.J. Atallah ed., Chapter 11, pp 11-1--11-28, CRC Press Inc., Boca Raton, FL.
CROCHEMORE, M., LECROQ, T., 1996, Pattern matching and text compression algorithms, in CRC Computer
Science and Engineering Handbook, A. Tucker ed., Chapter 8, pp 162-202, CRC Press Inc., Boca Raton, FL.
CROCHEMORE, M., RYTTER, W., 1994, Text Algorithms, Oxford University Press.
GONNET, G.H., BAEZA-YATES, R.A., 1991. Handbook of Algorithms and Data Structures in Pascal and C, 2nd
Edition, Chapter 7, pp. 251-288, Addison-Wesley Publishing Company.
GOODRICH, M.T., TAMASSIA, R., 1998, Data Structures and Algorithms in JAVA, Chapter 11, pp 441-467, John
Wiley & Sons.
GUSFIELD, D., 1997, Algorithms on strings, trees, and sequences: Computer Science and Computational Biology,
Cambridge University Press.
HANCART, C., 1992, Une analyse en moyenne de l'algorithme de Morris et Pratt et de ses raffinements, in Théorie
des Automates et Applications, Actes des 2e Journées Franco-Belges, D. Krob ed., Rouen, France, 1991, PUR 176,
Rouen, France, 99-110.
HANCART, C., 1993. Analyse exacte et en moyenne d'algorithmes de recherche d'un motif dans un texte, Ph. D.
Thesis, University Paris 7, France.
KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., 1977, Fast pattern matching in strings, SIAM Journal on Computing
6(1):323-350.
SEDGEWICK, R., 1988, Algorithms, Chapter 19, pp. 277-292, Addison-Wesley Publishing Company.
SEDGEWICK, R., 1988, Algorithms in C, Chapter 19, Addison-Wesley Publishing Company.
SEDGEWICK, R., FLAJOLET, P., 1996, An Introduction to the Analysis of Algorithms, Chapter ?, pp. ??-??,
Addison-Wesley Publishing Company.
STEPHEN, G.A., 1994, String Searching Algorithms, World Scientific.
WATSON, B.W., 1995, Taxonomies and Toolkits of Regular Language Algorithms, Ph. D. Thesis, Eindhoven
University of Technology, The Netherlands.
WIRTH, N., 1986, Algorithms & Data Structures, Chapter 1, pp. 17-72, Prentice-Hall.
Thank You!