String Matching - Computer Information Systems


ICS220 – Data Structures and Algorithms Analysis

Lecture 14 Dr. Ken Cosh

Review

• Memory Management
  – Memory Allocation
  – Garbage Collection

This Week

• String Matching
  – String matching is a common task for many computer users:
    • Internet searches
    • String manipulation in word processing
    • Advanced DNA sequence matching
  – Therefore effective pattern matching algorithms are essential.

Brute Force

• Our first simple string matching algorithm is brute force.

– We check the first character; if it is a match, we check the second character, and so on. If a character is not a match, we step forward one character in the text and start again from the beginning of the pattern.

• Any useful information that could be used in subsequent searches is then lost.

Brute Force

bruteForceStringMatching(pattern P, text T)
    i = 0;
    while i ≤ |T| - |P|
        j = 0;
        while j < |P| && T[i] == P[j]
            i++; j++;
        if j == |P|
            return match at i - |P|;
        i = i - j + 1;
    return no match;
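The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides; the function name brute_force_match and the convention of returning -1 for "no match" are assumptions made for the example. It keeps the start of the current attempt in a separate variable instead of rewinding i.

```python
def brute_force_match(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match, or -1 if there is none.

    Hypothetical helper for illustration; mirrors the brute force pseudocode.
    """
    m, n = len(pattern), len(text)
    # Try every alignment of the pattern against the text, left to right.
    for start in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or the pattern ends.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start  # every pattern character matched
        # Otherwise step forward one character and start again.
    return -1


if __name__ == "__main__":
    print(brute_force_match("babab", "ababcdababababababad"))
```

Nothing learned during a failed attempt is carried over to the next one, which is the weakness the later algorithms address.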

Brute Force

• T = ababcdababababababad, P = babab

     ababcdababababababad
  1  babab
  2   babab
  3    babab
  4     babab
  5      babab
  6       babab
  7        babab
  8         babab

In this case the match is found on the 8th try.

Brute Force Complexity

• The best case for the algorithm is that the string is matched straight away (consider searching this sentence for “The”). Here |P| comparisons are required – O(|P|).

• The worst case is if the string isn’t found and, for each character in T, we are required to make up to |P| comparisons; here the worst case is O(|T||P|).

• The average case depends on the size and frequencies of the character set.

Brute Force Complexity

• Notice the nested while loops in the Brute Force algorithm:
    while i ≤ |T| - |P|
        while j < |P| && T[i] == P[j]
• Shortly we’ll investigate how we can reduce the number of iterations of each loop.

• For the worst case to occur we could search for a string such as aaaaaaaaaaaab within a string such as aaaaaaaaaaaaaaaaaaaaaaaaaaa etc.
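To make the best and worst cases concrete, the sketch below counts character comparisons. It is an illustrative variant of brute force only; the name count_comparisons and the (index, comparisons) return value are assumptions for this example.

```python
from typing import Tuple


def count_comparisons(pattern: str, text: str) -> Tuple[int, int]:
    """Return (match index or -1, number of character comparisons made).

    Hypothetical helper: brute force search instrumented with a counter.
    """
    m, n = len(pattern), len(text)
    comparisons = 0
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1  # one comparison of text[start + j] with pattern[j]
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons


if __name__ == "__main__":
    # Best case: the match is at the very start, so |P| comparisons.
    print(count_comparisons("The", "The best case"))
    # Worst-case shape: every alignment matches until the final character.
    print(count_comparisons("aaab", "a" * 10))
```

For the pattern aaab in a text of ten a characters there are 7 possible alignments and each one needs 4 comparisons before failing, giving 28 comparisons in total, which is the |T||P| style of growth described above.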

Improving Brute Force

• A key problem with brute force is that each time we abort the comparison we have to start from the beginning of the pattern again.

• We could reduce the work the algorithm does by skipping comparisons that cannot lead to a match.

• Hancart’s algorithm allows the search to step forward 2 characters if a match won’t be found.

Hancart

• Hancart’s algorithm refines brute force in a couple of ways.

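– First, the first two characters of the pattern are compared.
  • Either they are the same, or they are different.
– Second, comparisons begin with the 2nd character: P[1] is compared with T[i+1] first.

• If P[0] == P[1]:
  – Mismatch with T[i+1]: Step 2 (i=i+2). If P[1] != T[i+1], then P[0] != T[i+1], so no match can start at i+1.
  – Mismatch after first comparison: Step 1.

• If P[0] != P[1]:
  – Mismatch with T[i+1]: Step 1.
  – Mismatch after first comparison: Step 2 (i=i+2). If P[1] == T[i+1], then P[0] != T[i+1], so no match can start at i+1.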
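The stepping rules above can be sketched in Python as below. This is a minimal illustration of the two-character idea as described on the slide, not a definitive implementation; the name hancart_match, the two shift variable names, and the fallback to str.find for patterns shorter than two characters are assumptions for the example.

```python
def hancart_match(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match, or -1 if there is none.

    Hypothetical sketch: here i is the start of the current attempt.
    """
    m, n = len(pattern), len(text)
    if m < 2:
        # The rule needs both P[0] and P[1]; fall back to the built-in search.
        return text.find(pattern)

    # Preprocessing: compare the first two characters of the pattern.
    if pattern[0] == pattern[1]:
        shift_on_mismatch, shift_after_match = 2, 1
    else:
        shift_on_mismatch, shift_after_match = 1, 2

    i = 0
    while i <= n - m:
        # Comparisons begin with the second character of the pattern.
        if text[i + 1] != pattern[1]:
            i += shift_on_mismatch
        else:
            # P[1] matched; check the rest of the pattern and then P[0].
            if text[i + 2:i + m] == pattern[2:] and text[i] == pattern[0]:
                return i
            i += shift_after_match
    return -1


if __name__ == "__main__":
    print(hancart_match("babab", "ababcdababababababad"))
```

Only the first two pattern characters are inspected in advance, so the preprocessing cost is constant; the worst case is still O(|T||P|).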

Hancart

• Hancart’s revision works by allowing us to skip forward 2 characters in situations where there can’t be a match.

• Notice that the situations where 2 steps forward are allowed depend on whether the first 2 characters of the pattern are the same.

• We can refine the search further by extending this observation – that the number of steps forward allowed depends on the contents of the pattern.

• The Knuth Morris Pratt algorithm observes that the pattern contains enough information to determine where the next match could begin.

Hancart

• Hancart’s algorithm reduces the number of iterations through the outer loop by sometimes allowing the increment to be i = i - j + 2; instead of i = i - j + 1;

Knuth Morris Pratt

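• The Knuth Morris Pratt algorithm begins by finding, for each position in the substring being searched for, the length of the longest suffix ending at that position which is equal to a prefix of the same substring (the suffix must be shorter than the part it belongs to).

– Substring:      A,B,C,D,A,B,D
– Longest Suffix: 0,0,0,0,1,2,0

• i.e. when the 2nd A comes it is both a suffix and a prefix for the substring. The following B forms ‘AB’, a 2 character prefix and suffix.

– Now for each iteration of the outer loop the start of the comparison, i, can be increased by j-x, where j is the number of characters matched before the mismatch and x is the longest suffix value for those j characters.

• i.e. if a mismatch is found immediately after the second A has matched, j=5 and x=1, so i can be increased by 4 (j-x).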
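The longest suffix values can be computed with a short loop. The sketch below is one common way to build the table, under the assumption that entry k describes the first k+1 characters of the substring; the name longest_suffix_table is hypothetical.

```python
from typing import List


def longest_suffix_table(pattern: str) -> List[int]:
    """table[k] is the length of the longest proper suffix of pattern[:k+1]
    that is also a prefix of pattern. Hypothetical helper for illustration.
    """
    table = [0] * len(pattern)
    length = 0  # length of the prefix/suffix currently being extended
    for k in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter candidate.
        while length > 0 and pattern[k] != pattern[length]:
            length = table[length - 1]
        if pattern[k] == pattern[length]:
            length += 1
        table[k] = length
    return table


if __name__ == "__main__":
    print(longest_suffix_table("ABCDABD"))
```

For A,B,C,D,A,B,D this produces 0,0,0,0,1,2,0, the same values as listed above. With j=5 characters matched, x is table[4] = 1, so the step forward is j-x = 4.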

Test

Try searching for this substring, A,B,C,D,A,B,D within this string ABCDABCABCDABDE

Knuth Morris Pratt complexity

• Knuth Morris Pratt removes some of the complexity of the brute force algorithm by preprocessing the substring being searched for (to create the suffix table).

• Now as we don’t need to recheck characters in the text it is O(|T|) for the outer loop.

• Preprocessing can be performed quickly, in O(|P|) time, leaving a total complexity of O(|T|+|P|)
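Putting the two phases together, a search along these lines might look like the sketch below. It is a minimal illustration under stated assumptions, not the lecture's own code: the names kmp_search and build_table are hypothetical, -1 stands for "no match", and the text index i only ever moves forward, which is where the O(|T|) bound for the outer loop comes from.

```python
from typing import List


def build_table(pattern: str) -> List[int]:
    """Longest suffix table: O(|P|) preprocessing (hypothetical helper)."""
    table = [0] * len(pattern)
    length = 0
    for k in range(1, len(pattern)):
        while length > 0 and pattern[k] != pattern[length]:
            length = table[length - 1]
        if pattern[k] == pattern[length]:
            length += 1
        table[k] = length
    return table


def kmp_search(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match, or -1 if there is none."""
    m = len(pattern)
    if m == 0:
        return 0  # assumption: the empty pattern matches at the start
    table = build_table(pattern)
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):  # i never moves backwards
        # On a mismatch, reuse the longest suffix that is also a prefix.
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1
    return -1


if __name__ == "__main__":
    print(kmp_search("ABCDABD", "ABCDABCABCDABDE"))
```

Running it on the string from the Test slide is a way to check a hand trace of the algorithm.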
