Exact String Matching Algorithms:
A Survey
Mehreen Ali, Hina Naz Khan, Shumaila Sayyab, Nadeem Iftikhar
Department of Bio-Science
Mohammad Ali Jinnah University, Islamabad-Pakistan
Introduction
• Exact string matching algorithms find all occurrences of a given string
pattern in a given text of finite length.
• Exact string matching algorithms are used extensively in
1. most operating systems,
2. text editors,
3. internet-related searches,
4. high-performance computing,
5. nucleotide or amino acid sequence searches in genome or protein databases.
Abstract
The exact string matching problem has been an active
area of research throughout the history of computer
science. Exact string matching is fundamental to
database and text processing applications, and several
algorithms have been proposed to solve it. This paper
provides a survey of the available exact string matching
algorithms, along with their classification and
evaluation against a set of benchmarks.
General Behavior
• The general behavior consists of aligning the string pattern against the text
and then comparing the two as the given algorithm prescribes.
• Each such alignment is referred to as a text window, and each process of
comparison is known as an attempt. This behavior is termed the
sliding window mechanism.
• After a match or a mismatch, the pattern is shifted and the next alignment is
checked, until the text ends.
• During the pre-processing phase, a matrix, table, or finite state automaton is
computed from the given string pattern, to be used during the searching
phase.
Benchmarks
• The benchmarks used to evaluate the algorithms are:
1. Time Complexity (tc)
2. Space Complexity (sc)
3. Pre-processing Time (pt)
4. Character Comparisons (cc) (average or worst case)
• Big-O notation is used to express these time and space complexities.
Exact String Matching Algorithms
Brute Force Algorithm
• A basic and very simple algorithm.
• It has no pre-processing phase.
• Character comparisons can be done in any order.
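The sliding window behavior of the Brute Force algorithm can be sketched as follows. This is a minimal illustration, not taken from the survey; the function name and the sample text are assumptions chosen for the example.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):        # one text window per attempt
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1                        # compare the window with the pattern
        if j == m:
            matches.append(start)
        # after a match or a mismatch, the window moves by exactly one position
    return matches


print(brute_force_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # prints [5]
```

Because there is no pre-processing, every attempt starts from scratch, which is where the O(mn) worst case comes from.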
Classification
All other algorithms can be classified into four categories, depending on the order in
which the comparisons are made:
1. From Left To Right
2. From Right To Left
3. In a Specific Order
4. In Any Order
From Left To Right
Deterministic Finite Automaton Algorithm
• Computes the transition table for the pattern in the pre-processing
phase.
• Needs extra space and time to store and search the table.
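As a rough sketch of the idea, the transition table can be built directly from its definition and then used to scan the text one character at a time. The function names are hypothetical, the table construction is deliberately naive (not the efficient construction found in textbooks), and a non-empty pattern is assumed.

```python
def build_transition_table(pattern: str) -> list[dict[str, int]]:
    """delta[state][ch] = length of the longest pattern prefix that is a
    suffix of (pattern[:state] + ch). Naive construction for illustration."""
    m = len(pattern)
    alphabet = set(pattern)               # other characters always lead to state 0
    delta = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            candidate = pattern[:state] + ch
            k = min(m, state + 1)
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def dfa_search(text: str, pattern: str) -> list[int]:
    """Scan text once, following one table transition per character."""
    delta = build_transition_table(pattern)
    m = len(pattern)
    state, matches = 0, []
    for i, ch in enumerate(text):
        state = delta[state].get(ch, 0)   # character not in pattern: restart
        if state == m:                    # accepting state: full pattern seen
            matches.append(i - m + 1)
    return matches
```

The table has one row per state and one entry per pattern character, which is the extra space the slide refers to; in exchange, the search never looks at a text character twice.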
Karp-Rabin Algorithm
• Avoids comparing the pattern at every position in the text, which makes it
very effective for multiple pattern matching.
• A hashing function is used.
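A minimal sketch of the hashing idea, assuming a polynomial rolling hash. The base and modulus are illustrative choices, not values prescribed by the survey; characters are compared only when the window hash equals the pattern hash.

```python
def karp_rabin_search(text: str, pattern: str,
                      base: int = 256, modulus: int = 1_000_000_007) -> list[int]:
    """Find pattern in text using a rolling hash (base and modulus are assumptions)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)      # weight of the window's leading character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        w_hash = (w_hash * base + ord(text[i])) % modulus
    matches = []
    for start in range(n - m + 1):
        # verify character by character only when the hashes agree
        if w_hash == p_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start < n - m:
            # roll the hash: drop text[start], append text[start + m]
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % modulus
    return matches
```

For several patterns of the same length, the pattern hashes could be kept in a set and each window hash looked up once, which is why the approach suits multiple pattern matching.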
Shift Or Algorithm
• The algorithm uses bitwise techniques.
• Works efficiently if the pattern length is within the memory-word size
of the machine.
• The searching phase has lower time complexity than the Brute Force
algorithm.
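The bitwise technique can be sketched as below. Python integers are unbounded, so the word-size restriction does not show up here; in a fixed-width language the pattern length would have to fit in one machine word. The function name is an assumption.

```python
def shift_or_search(text: str, pattern: str) -> list[int]:
    """Bit j of state is 0 when pattern[0..j] matches the text ending here."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    mask = {}
    for j, ch in enumerate(pattern):
        # clear bit j in the mask of the character found at pattern[j]
        mask[ch] = mask.get(ch, all_ones) & ~(1 << j)
    state = all_ones
    matches = []
    for i, ch in enumerate(text):
        # one shift and one OR per text character
        state = ((state << 1) | mask.get(ch, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:  # top bit clear: whole pattern matched
            matches.append(i - m + 1)
    return matches
```

Each text character costs a constant number of word operations, regardless of how much of the pattern has matched so far.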
Morris-Pratt Algorithm
• Follows the Brute Force algorithm.
• Makes longer shifts, which speeds up the search.
• Keeps a record of the text already matched with the pattern.
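One way to picture the "record of text already matched" is a border (failure) table computed from the pattern. The sketch below is an illustration under that reading; the function names are hypothetical.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def morris_pratt_search(text: str, pattern: str) -> list[int]:
    """Scan text left to right without ever moving backwards in it."""
    m = len(pattern)
    if m == 0:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0                    # k = number of pattern characters matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]               # reuse the part that is known to match
        if ch == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            k = fail[k - 1]
    return matches
```

On a mismatch the pattern shifts by more than one position whenever the matched prefix has a short border, and the text pointer never backs up.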
Knuth-Morris-Pratt Algorithm
• Follows the Morris-Pratt algorithm.
• Increases the speed.
• Has less time and space complexity.
Simon Algorithm
• Derived from the Deterministic Finite Automaton algorithm.
• The number of backward edges is reduced, but the searching
phase is similar.
• Time complexity increases irrespective of the input size.
Apostolico-Crochemore Algorithm
• Refinement of the Knuth-Morris-Pratt algorithm.
• Decreases the number of failed attempts and thus saves time.
• Reduces character comparisons and space complexity.
Not So Naïve Algorithm
• Follows the searching behavior of the Apostolico-Crochemore
algorithm.
• Time complexity is comparable to that of the Brute Force algorithm.
From Right To Left
Boyer-Moore Algorithm
• Uses two functions: the good-suffix shift and the bad-character shift.
• The maximum of the two shift values is applied.
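A compact sketch of how the two shift functions could be combined. The good-suffix table uses a standard border-based construction (the strong good-suffix rule); the names and table layout are assumptions for this example, not the survey's notation.

```python
def good_suffix_table(pattern: str) -> list[int]:
    """shift[j + 1] = good-suffix shift after a mismatch at pattern[j];
    shift[0] = shift after a complete match."""
    m = len(pattern)
    shift = [0] * (m + 1)
    border = [0] * (m + 1)
    i, j = m, m + 1
    border[i] = j
    while i > 0:
        while j <= m and pattern[i - 1] != pattern[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = border[j]
        i -= 1
        j -= 1
        border[i] = j
    j = border[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j                  # fall back to the widest border
        if i == j:
            j = border[j]
    return shift


def boyer_moore_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = {ch: i for i, ch in enumerate(pattern)}   # rightmost index per character
    good = good_suffix_table(pattern)
    matches, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare right to left
            j -= 1
        if j < 0:
            matches.append(s)
            s += good[0]
        else:
            bad = j - last.get(text[s + j], -1)      # bad-character shift
            s += max(good[j + 1], bad)               # take the larger shift
    return matches
```

The bad-character value can be zero or negative on its own; taking the maximum with the good-suffix value, which is always at least 1, keeps the window moving forward.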
Turbo-BM Algorithm
• Modified Boyer-Moore algorithm.
• Time complexity is reduced because the algorithm can jump over an
already matched factor and perform a turbo-shift.
Apostolico-Giancarlo Algorithm
• Variant of the Boyer-Moore algorithm.
• Remembers the length of the longest suffix of the pattern matched in each
attempt and stores it in the table skip.
• The table suff, computed while pre-processing the good-suffix shift
function, is reused during the search.
• The number of character comparisons is reduced.
Quick search Algorithm
• Simplified Boyer-Moore algorithm.
• Uses only the bad character shift function.
• Reduced space complexity.
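A minimal sketch of a search driven only by a bad character shift, assuming the usual Quick Search rule of looking at the text character just past the window. The function name is hypothetical.

```python
def quick_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # distance from the rightmost occurrence of each character to the pattern end
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    matches, s = [], 0
    while s <= n - m:
        if text[s:s + m] == pattern:      # comparisons may be made in any order
            matches.append(s)
        if s + m >= n:                    # no character follows the window
            break
        # the character just past the window decides the shift
        s += shift.get(text[s + m], m + 1)
    return matches
```

Only one table indexed by the alphabet is needed, which matches the reduced space complexity noted on the slide.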
SSABS Algorithm
• Uses the Quick Search bad character shift function plus the calculation
of a text window skip value.
• Has reduced time complexity.
Zhu-Takaoka Algorithm
• Variation of the Boyer-Moore algorithm.
• It considers two consecutive characters to calculate the bad
character shift.
• Its search process is fast.
• The shift table grows rapidly with the alphabet size.
• Increased pre-processing space and time complexity.
Berry-Ravindran Algorithm
• Derived from the Quick Search and Zhu-Takaoka algorithms.
• It uses two characters to calculate the shift value with the bad
character shift function.
• Reduces the number of character comparisons.
• Space and time complexities are similar to those of the Zhu-Takaoka
algorithm.
TVSBS Algorithm
• Combination of the Berry-Ravindran and SSABS algorithms.
• It uses the bad character shift function of the Berry-Ravindran algorithm.
• Its searching phase is similar to that of SSABS.
Reverse Factor Algorithm
• Preferred for long patterns and short text.
• Improved length of shifts.
• Has quadratic worst-case time complexity, but is optimal on
average.
In a Specific Order
Colussi Algorithm
• Enhancement of the Knuth-Morris-Pratt algorithm.
• The pattern positions are divided into two disjoint subsets: one is scanned
from left to right and the other from right to left.
• Reduced time complexity and fewer character comparisons.
Two Way Algorithm
• Requires an ordered alphabet.
• Processing is like that of the Colussi algorithm.
String Matching On Ordered Alphabets Algorithm
• Also requires an ordered alphabet.
• There is no pre-processing phase.
• The characters of the string pattern are compared one by one.
In Any Order
Horspool Algorithm
• Simplified Boyer-Moore algorithm.
• The Boyer-Moore bad character shift function is used.
• Saves time during the searching phase by reducing the number of
comparisons.
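A short sketch of the Horspool idea: the shift depends only on the text character aligned with the last pattern position. The function name is an assumption made for this example.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # bad character shift, built from every pattern position except the last
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches, s = [], 0
    while s <= n - m:
        if text[s:s + m] == pattern:      # the window may be compared in any order
            matches.append(s)
        # the text character under the last pattern position decides the shift
        s += shift.get(text[s + m - 1], m)
    return matches
```

A character that does not occur in the first m - 1 pattern positions lets the window jump by the full pattern length.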
Smith Algorithm
• Derived from the Horspool and Quick Search algorithms.
• Uses their bad character shift functions to compute shift values.
• No difference in time and space complexities.
Raita Algorithm
• uses Boyer-Moore bad character shift function
• performs the shifts like the Horspool algorithm.
• same time and space complexities as that of Horspool algorithm
Evaluation
ESMAs                            tc            sc                    pt                cc
Brute Force Algorithm            O(mn)         constant extra space  no preprocessing  2n
Morris-Pratt Algorithm           O(n+m)        O(m)                  O(m)              2n-1
Apostolico-Crochemore Algorithm  O(n)          O(m)                  O(m)              (3/2)n
Boyer-Moore Algorithm            O(mn)         O(m+|Σ|)              O(m+|Σ|)          3n
Quick Search Algorithm           O(mn)         O(|Σ|)                O(m+|Σ|)          quadratic worst case
SSABS Algorithm                  O([n/(m+1)])  -                     -                 O(m(n-m+1)) worst case
Zhu-Takaoka Algorithm            O(mn)         O(m+|Σ|^2)            O(m+|Σ|^2)        quadratic worst case
Berry-Ravindran Algorithm        O(mn)         O(m+|Σ|^2)            O(m+|Σ|^2)        -
TVSBS Algorithm                  O([n/(m+2)])  O(|Σ|+k^|Σ|)          O(|Σ|+k^|Σ|)      O(m(n-m+1)) worst case
Colussi Algorithm                O(n)          O(m)                  O(m)              (3/2)n
Skip Search Algorithm            O(mn)         O(m+|Σ|)              O(m+|Σ|)          O(n), quadratic worst case
...                              ...           ...                   ...               ...
Table 1: Comparison of Exact String Matching Algorithms
Summary and Conclusion
• Among all the selected ESMAs, the latest one, the TVSBS algorithm, is the
best for exact string matching because it:
1. has the lowest space and time complexity, both during the pre-processing
phase and otherwise;
2. provides better results in fewer attempts;
3. needs fewer character comparisons, even when compared
with SSABS.
• As with all other surveys, the list of ESMAs here is comprehensive but not
yet complete. Newly proposed algorithms are expected to be considered
and evaluated in a similar fashion.
Thanks