String Matching - Computer Information Systems


ICS220 – Data Structures and Algorithms Analysis

Lecture 14 Dr. Ken Cosh

Review

• Memory Management
– Memory Allocation
– Garbage Collection

This Week

• String Matching
– String matching is a common task for many computer users:
• Internet searches
• String manipulation in word processing
• Advanced DNA sequence matching
– Therefore effective pattern matching algorithms are essential.

Brute Force

• Our first simple string matching algorithm is brute force.

– We check the first character. If it is a match, we check the second character, and so on. If it is not a match, we step forward one character in the text and start again from the beginning of the pattern.

• Any useful information that could be used in subsequent searches is then lost.

Brute Force

bruteForceStringMatching(pattern P, text T)
    i = 0;
    while i ≤ |T| - |P|
        j = 0;
        while j < |P| && T[i] == P[j]
            i++; j++;
        if j == |P|
            return match at i - |P|;
        i = i - j + 1;
    return no match;
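The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name brute_force_match and the choice to return -1 for "no match" are assumptions. It keeps i fixed at the start of the current alignment and reads text[i + j], which has the same effect as the slide's i = i - j + 1.

```python
def brute_force_match(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    i = 0
    while i <= n - m:
        j = 0
        # Test the bound on j first so pattern[j] is never read out of range.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i      # the whole pattern matched, starting at position i
        i += 1            # step forward one character and start again
    return -1             # no match


print(brute_force_match("babab", "ababcdababababababad"))  # → 7
```

Positions are counted from 0, so a result of 7 means the match starts at the 8th character of the text.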

Brute Force

• T = ababcdababababababad, P = babab

       ababcdababababababad
    1  babab
    2   babab
    3    babab
    4     babab
    5      babab
    6       babab
    7        babab
    8         babab

• In this case the match is found on the 8th try.

Brute Force Complexity

• The best case for the algorithm is that the string is matched straight away (consider searching this sentence for “The”). Here |P| comparisons are required – O(|P|).

• The worst case is if the string isn’t found, but for each character in T we are required to make |P| comparisons; here the worst case is O(|T||P|).

• The average case depends on the size and frequencies of the character set.

Brute Force Complexity

• Notice the nested while loops in the Brute Force algorithm:
    while i ≤ |T| - |P|
        while j < |P| && T[i] == P[j]
• Shortly we’ll investigate how we can reduce the number of iterations of each loop.

• For the worst case to occur we could search for a string such as aaaaaaaaaaaab within a string such as aaaaaaaaaaaaaaaaaaaaaaaaaaa.
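To make the best and worst cases concrete, the sketch below counts character comparisons made by brute force. The function name count_comparisons and the exact string lengths (12 a's plus a b, searched in 27 a's) are assumptions chosen for illustration.

```python
def count_comparisons(pattern: str, text: str) -> int:
    """Count the character comparisons brute force makes before it stops."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            break         # match found, stop searching
    return comparisons


# Worst case: every alignment fails only on the final 'b'.
print(count_comparisons("a" * 12 + "b", "a" * 27))   # → 195
# Best case: the pattern sits at the very start of the text.
print(count_comparisons("The", "The best case"))     # → 3
```

In the worst case there are 27 - 13 + 1 = 15 alignments and each costs 13 comparisons, giving 15 × 13 = 195, which is the |T||P| growth described above.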

Improving Brute Force

• A key problem with brute force is that each time we abort the comparison we have to start from the beginning of the pattern again.

• We could reduce the algorithm’s complexity by skipping unnecessary searches.

• Hancart’s algorithm allows the search to step forward 2 characters if a match won’t be found.

Hancart

• Hancart’s algorithm refines brute force in a couple of ways.

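– First, the first two characters of the pattern are compared.
• Either they are the same, or they are different.
– Second, comparisons begin with the 2nd character: P[1] is compared with T[i+1] first.

• The step forward then depends on both facts:
– P[0] == P[1], mismatch with T[i+1]: Step 2 (i=i+2). If P[1] != T[i+1], then P[0] != T[i+1].
– P[0] == P[1], mismatch after first comparison: Step 1.
– P[0] != P[1], mismatch with T[i+1]: Step 1.
– P[0] != P[1], mismatch after first comparison: Step 2 (i=i+2). If P[1] == T[i+1], then P[0] != T[i+1].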
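The step rules above can be sketched in Python as follows. This is one possible reading of the slide, not the lecture's own code: the function name hancart_match, the two step variable names, and the fallback to str.find for patterns shorter than 2 characters are all assumptions.

```python
def hancart_match(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m < 2:
        return text.find(pattern)   # the rule needs two pattern characters
    # Preprocessing: compare the first two characters of the pattern.
    if pattern[0] == pattern[1]:
        step_on_first_mismatch, step_after_first_match = 2, 1
    else:
        step_on_first_mismatch, step_after_first_match = 1, 2
    i = 0
    while i <= n - m:
        # Comparisons begin with the 2nd character: P[1] against T[i+1].
        if pattern[1] != text[i + 1]:
            i += step_on_first_mismatch
        else:
            # Check the rest of the pattern, leaving P[0] until last.
            if text[i + 2:i + m] == pattern[2:] and text[i] == pattern[0]:
                return i
            i += step_after_first_match
    return -1


print(hancart_match("babab", "ababcdababababababad"))  # → 7
```

For P = babab the first two characters differ, so a step of 2 is taken whenever P[1] matched T[i+1] but the alignment failed later.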

Hancart

• Hancart’s revision works by allowing us to skip forward 2 characters in situations where there can’t be a match.

• Notice that the situations where 2 steps forward are allowed depend on whether the first 2 characters of the pattern are the same.

• We can refine the search further by extending this observation – that the number of steps forward allowed depends on the contents of the pattern.

• The Knuth Morris Pratt algorithm observes that the pattern contains enough information to determine where the next match could begin.

Hancart

• Hancart’s algorithm reduces the number of iterations through the outer loop – by sometimes allowing the increment to be; i = i – j + 2;

Knuth Morris Pratt

• The Knuth Morris Pratt algorithm begins by finding, for each prefix of the substring, the longest suffix that is also a prefix of it (and is shorter than it).

– Substring: A,B,C,D,A,B,D
– Longest Suffix: 0,0,0,0,1,2,0
• i.e. when the 2nd A comes it is both a suffix and a prefix for the substring. The following B forms ‘AB’, a 2 character prefix and suffix.
– Now for each iteration of the outer loop i can be increased by j-x, where j is the number of pattern characters already matched and x is the longest suffix value for those matched characters.
• i.e. if a mismatch is found when comparing the second A, the 4 characters A,B,C,D have matched (j=4) and their longest suffix value is 0, so i can be increased by 4.
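The longest suffix table can be computed with a short loop over the pattern. The sketch below is a common way of building it and is offered as an illustration; the function name longest_suffix_table is an assumption.

```python
def longest_suffix_table(pattern: str) -> list[int]:
    """table[k] is the length of the longest suffix of pattern[:k+1]
    that is also a prefix of it (excluding the whole string itself)."""
    table = [0] * len(pattern)
    length = 0                      # length of the prefix matched so far
    for k in range(1, len(pattern)):
        # Fall back to shorter prefixes until the next character fits.
        while length > 0 and pattern[k] != pattern[length]:
            length = table[length - 1]
        if pattern[k] == pattern[length]:
            length += 1
        table[k] = length
    return table


print(longest_suffix_table("ABCDABD"))  # → [0, 0, 0, 0, 1, 2, 0]
```

The output agrees with the Longest Suffix row given for A,B,C,D,A,B,D.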

Test

Try searching for this substring, A,B,C,D,A,B,D within this string ABCDABCABCDABDE

Knuth Morris Pratt complexity

• Knuth Morris Pratt removes some of the complexity of the brute force algorithm by preprocessing the substring being searched for (to create the suffix table).

• Now as we don’t need to recheck characters in the text it is O(|T|) for the outer loop.

• Preprocessing can be performed quickly, in O(|P|) time, leaving a total complexity of O(|T|+|P|)
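A minimal sketch of the complete search, with the two phases marked, shows where the O(|P|) and O(|T|) terms come from. The function name kmp_search and the -1 return value for "no match" are assumptions; the text position only ever moves forward, so no text character is rechecked after it has matched.

```python
def kmp_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # Preprocessing, O(|P|): build the longest suffix table.
    table = [0] * m
    length = 0
    for k in range(1, m):
        while length > 0 and pattern[k] != pattern[length]:
            length = table[length - 1]
        if pattern[k] == pattern[length]:
            length += 1
        table[k] = length
    # Searching, O(|T|): pos never moves backwards in the text.
    j = 0                           # number of pattern characters matched
    for pos, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]        # reuse the part of the match that survives
        if ch == pattern[j]:
            j += 1
        if j == m:
            return pos - m + 1      # start position of the match
    return -1


print(kmp_search("babab", "ababcdababababababad"))  # → 7
```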