String Matching - Computer Information Systems
ICS220 – Data Structures and Algorithms Analysis
Lecture 14 Dr. Ken Cosh
Review
• Memory Management – Memory Allocation – Garbage Collection
This Week
• String Matching
  – String matching is a common task for many computer users:
    • Internet searches
    • String manipulation in word processing
    • Advanced DNA sequence matching
  – Therefore effective pattern matching algorithms are essential.
Brute Force
• Our first simple string matching algorithm is brute force.
– We check the first character; if it matches, we check the second character, and so on. If a character does not match, we step forward one character in the text and start again from the beginning of the pattern.
• Any useful information that could be used in subsequent searches is then lost.
Brute Force
bruteForceStringMatching(pattern P, text T)
    i = 0;
    while i ≤ |T| - |P|
        j = 0;
        while j < |P| && T[i] == P[j]
            i++; j++;
        if j == |P|
            return match at i - |P|;
        i = i - j + 1;
    return no match;
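The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `brute_force_match` is hypothetical, and it keeps the start of each attempt in a separate variable instead of advancing and then resetting `i`.

```python
def brute_force_match(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every possible starting position in the text.
    for start in range(n - m + 1):
        j = 0
        # Check j < m first so pattern[j] is never read out of range.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start  # every pattern character matched
    return -1  # no match


print(brute_force_match("babab", "ababcdababababababad"))  # 0-based start index
```

Nothing learned during a failed attempt is reused here: each new starting position begins again at `pattern[0]`.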
Brute Force
• T = ababcdababababababad, P = babab

   ababcdababababababad
1  babab
2   babab
3    babab
4     babab
5      babab
6       babab
7        babab
8         babab

In this case the match is found on the 8th try.
Brute Force Complexity
• The best case for the algorithm is that the string is matched straight away (consider searching this sentence for “The”). Here |P| comparisons are required – O(|P|).
• The worst case is if the string isn’t found, but for each character in T we are still required to make |P| comparisons – here the worst case is O(|T||P|).
• The average case depends on the size and frequencies of the character set.
Brute Force Complexity
• Notice the nested while loops in the Brute Force algorithm:
    while i ≤ |T| - |P|
        while j < |P| && T[i] == P[j]
• Shortly we’ll investigate how we can reduce the number of iterations of each loop.
• For the worst case to occur we could search for a pattern such as aaaaaaaaaaaab within a text such as aaaaaaaaaaaaaaaaaaaaaaaaaaa etc.
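To make the best and worst cases concrete, the sketch below counts character comparisons made by a brute force search. The function name `count_comparisons` and the short test strings are assumptions chosen for illustration; they are shorter versions of the kind of strings described above.

```python
def count_comparisons(pattern: str, text: str) -> int:
    """Count the character comparisons a brute force search performs."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[start + j] != pattern[j]:
                break  # mismatch: abandon this starting position
        else:
            return comparisons  # inner loop finished without a mismatch: match
    return comparisons


# Best case: the pattern is matched straight away, |P| comparisons.
print(count_comparisons("The", "The best case"))  # 3

# Worst case shape: every attempt fails only on the last pattern character.
text = "a" * 10
pattern = "aaab"
print(count_comparisons(pattern, text))  # 28
print((len(text) - len(pattern) + 1) * len(pattern))  # 28
```

In the second call each of the 7 starting positions costs |P| = 4 comparisons, which is where the O(|T||P|) bound comes from.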
Improving Brute Force
• A key problem with brute force is that each time we abort the comparison we have to start from the beginning of the pattern again.
• We could reduce the algorithm’s complexity by skipping comparisons that cannot lead to a match.
• Hancart’s algorithm allows the search to step forward 2 characters if a match won’t be found.
Hancart
• Hancart’s algorithm refines brute force in a couple of ways.
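– First, the first two characters of the pattern are compared.
  • Either they are the same, or they are different.
– Second, comparisons begin with the 2nd character of the current text window, T[i+1], which is compared with P[1].
– If P[0] == P[1]:
  • Mismatch with T[i+1]: step 2 (i = i+2). Since P[1] != T[i+1], also P[0] != T[i+1], so no match can start at i+1.
  • Mismatch after the first comparison: step 1.
– If P[0] != P[1]:
  • Mismatch with T[i+1]: step 1.
  • Mismatch after the first comparison: step 2 (i = i+2). Since P[1] == T[i+1], P[0] != T[i+1], so no match can start at i+1.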
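The case analysis above can be sketched in Python as follows. This is an illustrative reading of the slide, not a reference implementation: the function name `hancart_match` and the variable names are hypothetical, and falling back to `str.find` for patterns shorter than 2 characters is an assumption made only so the sketch handles every input.

```python
def hancart_match(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match, or -1."""
    n, m = len(text), len(pattern)
    if m < 2:
        # Assumption: the two-character trick needs |P| >= 2,
        # so shorter patterns use Python's built-in search.
        return text.find(pattern)

    # Compare the first two pattern characters once, up front.
    if pattern[0] == pattern[1]:
        step_on_mismatch, step_after_first = 2, 1
    else:
        step_on_mismatch, step_after_first = 1, 2

    i = 0
    while i <= n - m:
        # Comparisons begin with the 2nd character of the window.
        if text[i + 1] != pattern[1]:
            i += step_on_mismatch
        else:
            # 2nd character matched: check the 1st and the remainder.
            if text[i] == pattern[0] and text[i + 2:i + m] == pattern[2:]:
                return i
            i += step_after_first
    return -1


print(hancart_match("babab", "ababcdababababababad"))  # 0-based start index
```

The step sizes are decided entirely by the pattern's first two characters, so they are computed once before the search loop starts.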
Hancart
• Hancart’s revision works by allowing us to skip forward 2 characters in situations where there can’t be a match.
• Notice that the situations where 2 steps forward are allowed depend on whether the first 2 characters of the pattern are the same.
• We can refine the search further by extending this observation – that the number of steps forward allowed depends on the contents of the pattern.
• The Knuth Morris Pratt algorithm observes that the pattern contains enough information to determine where the next match could begin.
Hancart
• Hancart’s algorithm reduces the number of iterations through the outer loop by sometimes allowing the increment to be: i = i – j + 2;
Knuth Morris Pratt
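• The Knuth Morris Pratt algorithm begins by finding, for each position in the substring being searched for, the longest suffix of the characters up to that position which is equal to a prefix of the same substring.
– Substring: A,B,C,D,A,B,D
– Longest Suffix: 0,0,0,0,1,2,0
• i.e. when the 2nd A comes it is both a suffix and a prefix for the substring. The following B forms ‘AB’, a 2 character prefix and suffix.
– Now for each iteration of the outer loop i can be increased by j-x, where j is the number of characters already matched and x is the longest suffix for those matched characters.
• i.e. if a mismatch is found when comparing the second A, the 4 characters A,B,C,D have matched (j=4) and their longest suffix is 0, so i can be increased by 4 (j-0).
• If instead the mismatch is found at the final D, the 6 characters A,B,C,D,A,B have matched (j=6) and their longest suffix is 2, so i can again be increased by 4 (j-2).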
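The longest suffix table and the search that uses it can be sketched as follows. The function names `longest_suffix_table` and `kmp_match` are hypothetical, and this is one common way of writing the algorithm rather than the lecture's own code. Instead of moving the start of the attempt explicitly, the sketch never moves backwards in the text and only resets the pattern position `j`, which has the same effect as increasing the start by j-x.

```python
def longest_suffix_table(pattern: str) -> list:
    """table[j] = length of the longest proper suffix of pattern[:j+1]
    that is also a prefix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter prefix
        if pattern[j] == pattern[k]:
            k += 1
        table[j] = k
    return table


def kmp_match(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match, or -1."""
    if not pattern:
        return 0
    table = longest_suffix_table(pattern)
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]  # reuse what is already known to match
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1


print(longest_suffix_table("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```

The printed table agrees with the Longest Suffix row given above for A,B,C,D,A,B,D.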
Test
Try searching for this substring, A,B,C,D,A,B,D within this string ABCDABCABCDABDE
Knuth Morris Pratt complexity
• Knuth Morris Pratt removes some of the complexity of the brute force algorithm by preprocessing the substring being searched for (to create the suffix table).
• Now, as we don’t need to recheck characters in the text, the outer loop is O(|T|).
• Preprocessing can be performed quickly, in O(|P|) time, leaving a total complexity of O(|T|+|P|).
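As a rough check of the O(|T|) claim for the search phase, the sketch below counts the comparisons made between text and pattern characters. The function name `kmp_comparison_count` and the test strings are assumptions for illustration. Each comparison either moves on to the next text character or shortens the matched prefix, so the count cannot exceed 2|T|.

```python
def kmp_comparison_count(pattern: str, text: str) -> int:
    """Count text/pattern character comparisons in a KMP-style search."""
    m = len(pattern)
    if m == 0:
        return 0

    # Preprocessing: build the longest suffix table in O(|P|) time.
    table = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = table[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        table[j] = k

    # Search: each text character is visited once, never rechecked.
    comparisons = 0
    j = 0
    for ch in text:
        while True:
            comparisons += 1
            if ch == pattern[j]:
                j += 1
                break
            if j == 0:
                break  # nothing matched; move to the next text character
            j = table[j - 1]  # fall back inside the pattern only
        if j == m:
            break  # match found
    return comparisons


text = "a" * 10
print(kmp_comparison_count("aaab", text))  # 17
print(2 * len(text))  # 20, the upper bound for this text
```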