
Skip Search algorithm

A Very Fast String Matching Algorithm for Small Alphabets and Long Patterns, Charras, C., Lecroq, T. and Pehoushek, J. D., Lecture Notes in Computer Science, Vol. 1448, 1998, pp. 55-64

Advisor: Prof. R. C. T. Lee

Speaker: T. H. Ku

• The Skip Search algorithm solves the string matching problem.

• String matching problem:

Input: a text string T of length n and a pattern string P of length m.

Output: all occurrences of P in T.

• The Skip Search algorithm consists of two phases: preprocessing and searching.

• The Skip Search algorithm uses Rule 4 (Two Window Rule) and Rule 2-2 (1-Suffix Rule) to do the string matching.

Preprocessing

• The preprocessing phase of the Skip Search algorithm computes a bucket for every character of the alphabet: the list of positions at which that character occurs in the pattern.

Example:

Text string T=GCATCGCAGAGAGTATACAGTACG

Pattern string P=GCAGAGAG

i:    0 1 2 3 4 5 6 7
P[i]: G C A G A G A G

The buckets for all characters of the alphabet:

A        C    G          T
(6,4,2)  (1)  (7,5,3,0)  φ
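The bucket table above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `build_buckets` and the use of a dictionary of lists are assumptions made for illustration, not something taken from the slides or the paper. Positions are 0-based and each bucket is listed from rightmost to leftmost, as in the example.

```python
def build_buckets(pattern, alphabet):
    """Map each character to the positions where it occurs in the pattern.

    Positions are 0-based and listed right to left, so that the bucket
    of 'A' for GCAGAGAG is [6, 4, 2]. Characters of the alphabet that do
    not occur in the pattern get an empty bucket.
    """
    buckets = {c: [] for c in alphabet}
    # Scan the pattern from right to left so each bucket is in descending order.
    for i in range(len(pattern) - 1, -1, -1):
        buckets.setdefault(pattern[i], []).append(i)
    return buckets


buckets = build_buckets("GCAGAGAG", "ACGT")
# buckets == {'A': [6, 4, 2], 'C': [1], 'G': [7, 5, 3, 0], 'T': []}
```

The single pass over the pattern plus the initialisation of one empty bucket per alphabet character is what gives the O(m+σ) preprocessing cost.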

Search phase

• The search phase inspects every m-th symbol of the text, that is, the km-th symbol for 1 ≤ k ≤ n/m. For each inspected symbol, the pattern is aligned so that each identical symbol in the pattern falls under it in turn, and each alignment is checked for a match. Note that the bucket records every position of the symbol in the pattern.

Example:

Text string T=aabcdbdabcabc

Pattern string P=abcabc, m=6

The 6th symbol in T is b. We first align the 5th symbol of P with it and execute matching. Then we align the 2nd symbol of P with it and execute matching.

T = a a b c d b d a b c a b c
      a b c a b c                (5th symbol of P under the 6th symbol of T)
            a b c a b c          (2nd symbol of P under the 6th symbol of T)
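The search phase described above can be sketched in Python as follows. The function name `skip_search` and the slicing-based comparison are assumptions chosen for brevity; the slides do not prescribe an implementation. Note that the code uses 0-based positions, so the "6th symbol" of the slide is index 5 here.

```python
def skip_search(text, pattern):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Preprocessing: bucket of c = positions of c in the pattern, right to left.
    buckets = {}
    for i in range(m - 1, -1, -1):
        buckets.setdefault(pattern[i], []).append(i)

    matches = []
    # Searching: inspect every m-th text symbol (indices m-1, 2m-1, ...).
    for j in range(m - 1, n, m):
        # Try each alignment that puts an identical pattern symbol under text[j].
        for i in buckets.get(text[j], ()):
            start = j - i
            # Skip alignments that would run past the end of the text.
            if start + m <= n and text[start:start + m] == pattern:
                matches.append(start)
    return matches


print(skip_search("aabcdbdabcabc", "abcabc"))  # prints [7]
```

Every window of m consecutive text symbols contains exactly one inspected index, which is why checking only every m-th symbol cannot miss an occurrence.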

Full Example

• Text string T=GCATCGCAGAGAGTATACAGTACG

• Pattern string P=GCAGAGAG

i:    0 1 2 3 4 5 6 7
P[i]: G C A G A G A G

The buckets for all characters of the alphabet:

A        C    G          T
(6,4,2)  (1)  (7,5,3,0)  φ

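First we check T[7]=A. The bucket of A is (6,4,2), so we align P[6], P[4] and P[2] with T[7] in turn.

i:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T:  G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
       G  C  A  G  A  G  A  G   mismatch
             G  C  A  G  A  G  A  G   mismatch
                   G  C  A  G  A  G  A  G   exact match

Then we check T[15]=T. Since there is no "T" in the pattern, we check T[23]=G. The bucket of G is (7,5,3,0); only aligning P[7] with T[23] keeps the pattern inside the text, so we shift the pattern to align with T[16…23].

i:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T:  G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
                                                    G  C  A  G  A  G  A  G   mismatch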
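The steps of the full example can be replayed with a small tracing variant of the search. This is a sketch under assumptions: the generator name `skip_search_trace` and the `(inspected index, window start, matched)` tuple format are invented here for illustration. Alignments that would extend past the end of the text are skipped silently.

```python
def skip_search_trace(text, pattern):
    """Yield (j, start, matched) for every alignment the search phase tries.

    j is the inspected text index, start is where the pattern window begins,
    and matched says whether text[start:start+m] equals the pattern.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return

    # Buckets: positions of each character in the pattern, right to left.
    buckets = {}
    for i in range(m - 1, -1, -1):
        buckets.setdefault(pattern[i], []).append(i)

    for j in range(m - 1, n, m):
        for i in buckets.get(text[j], ()):
            start = j - i
            if start + m > n:
                continue  # window would run past the end of the text
            yield j, start, text[start:start + m] == pattern


if __name__ == "__main__":
    T = "GCATCGCAGAGAGTATACAGTACG"
    P = "GCAGAGAG"
    for j, start, matched in skip_search_trace(T, P):
        outcome = "exact match" if matched else "mismatch"
        print(f"T[{j}]={T[j]}: window T[{start}..{start + len(P) - 1}] -> {outcome}")
```

For this text and pattern the trace contains four alignments: windows starting at 1, 3 and 5 for T[7], and the window starting at 16 for T[23]. Nothing is produced for T[15], because the bucket of T is empty.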

Time Complexity

• The space and time complexity of the preprocessing phase is O(m+σ), where σ is the size of the alphabet.

• The Skip Search algorithm has a quadratic worst-case time complexity, O(mn), but the expected number of text character inspections is O(n).
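To make the quadratic worst case concrete, the sketch below counts the character comparisons performed while verifying alignments. The function name `count_comparisons` and the choice to count only verification comparisons (not the read of each inspected symbol) are assumptions for illustration. With a text and a pattern made of a single repeated letter, every bucket has m entries and every verification runs to the end of the pattern.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made while verifying Skip Search alignments."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0

    # Buckets: positions of each character in the pattern, right to left.
    buckets = {}
    for i in range(m - 1, -1, -1):
        buckets.setdefault(pattern[i], []).append(i)

    comparisons = 0
    for j in range(m - 1, n, m):
        for i in buckets.get(text[j], ()):
            start = j - i
            if start + m > n:
                continue  # window would run past the end of the text
            # Compare left to right, stopping at the first mismatch.
            for k in range(m):
                comparisons += 1
                if text[start + k] != pattern[k]:
                    break
    return comparisons


n, m = 16, 4
print(count_comparisons("a" * n, "a" * m))  # prints 52, i.e. (n - m + 1) * m
print(count_comparisons("b" * n, "a" * m))  # prints 0: every bucket lookup is empty
```

In the all-`a` case each of the n-m+1 windows is verified in full, giving (n-m+1)·m comparisons, which grows like mn. In the second call no inspected symbol occurs in the pattern, so only the n/m inspected symbols are read and no verification takes place.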
