Transcript Document

Strings and Pattern Matching Algorithms
Pattern P[0..m-1]
Text T[0..n-1]
Brute Force Pattern Matching
Algorithm BruteForceMatch(T,P):
Input: Strings T with n characters and P with m characters
Output: String index of the first substring of T matching P, or an
indication that P is not a substring of T
for i:=0 to n-m do //for each candidate index in T//
{ j:=0
  while (j<m and T[i+j]=P[j]) do j:=j+1
  if j=m then return i
}
return “there is no substring of T matching P.”
Time complexity: O(mn)
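The brute-force pseudocode above translates almost line for line into Python. The following is a minimal sketch, not a reference implementation; the function name brute_force_match and the use of -1 to signal "no match" are assumptions made for this illustration.

```python
def brute_force_match(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):       # each candidate start index in text
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                   # all m characters matched
            return i
    return -1                        # assumption: -1 stands for "no match"


if __name__ == "__main__":
    print(brute_force_match("abacaabaccabacabaabb", "abacab"))
```

Each of the n-m+1 candidate positions can cost up to m comparisons, which is where the O(mn) bound comes from.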
Boyer-Moore Algorithm
Improve the running time of the brute-force algorithm by adding two potentially timesaving heuristics:
Looking-Glass Heuristic: When testing a possible placement of P[0..m-1] against
T[0..n-1], begin the comparisons from the end of P and move backward to the front of
P.
Character-Jump Heuristic: Suppose that T[i] does not match P[j] and T[i]=c. If c is
not contained anywhere in P, then shift P completely past T[i]; otherwise, shift P until
an occurrence of character c in P gets aligned with T[i].
last(c): if c is in P, last(c) is the index of the last (rightmost) occurrence of c in P.
Otherwise, define last(c) = -1.
Compute-Last-Occurrence(P,m,Σ)
for each character c in Σ do last(c) := -1
for j := 0 to m-1 do last(P[j]) := j
Time complexity: O(m+ |Σ|)
Example:
P[0..5] = abacab
c        a   b   c   d
last(c)  4   5   3  -1
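The last-occurrence function can be sketched in Python with a dictionary. This is an illustrative sketch only; the name compute_last_occurrence and the choice to pass the alphabet as a string are assumptions, not part of the original material.

```python
def compute_last_occurrence(pattern, alphabet):
    """Map each character of alphabet to its rightmost index in pattern, or -1."""
    last = {c: -1 for c in alphabet}   # O(|alphabet|) initialisation
    for j, c in enumerate(pattern):    # O(m): later indices overwrite earlier ones
        last[c] = j
    return last


if __name__ == "__main__":
    # Pattern and alphabet taken from the example: P = abacab, alphabet {a, b, c, d}
    print(compute_last_occurrence("abacab", "abcd"))
```

For P = abacab this gives 4 for a, 5 for b, 3 for c, and -1 for d.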
Algorithm BMMatch(T,P)
Input: Strings T with n characters and P with m characters
Output: String index of the first substring of T matching P, or an
indication that P is not a substring of T
Compute-Last-Occurrence(P,m,Σ)
i:= m-1
j:= m-1
repeat
{ if P[j] = T[i] then
    if j=0 then
      return i //a match!//
    else
    { i:= i-1
      j:= j-1
    }
  else
  { i:= i+(m-1)-min(j-1, last(T[i])) //jump step//
    j:= m-1
  }
}
until i>n-1
return “there is no substring of T matching P.”
[Figure: illustration of the jump step, labeled with the distances m-j, m-j-1, and m-last(T[i])-1]
Time complexity (worst case): O(nm+ |Σ|)
Example: T=aaaa…aaaa, P=baa…a
Usually it runs much faster.
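The BMMatch pseudocode can be sketched in Python as follows. This is a hedged illustration rather than a definitive implementation: the name bm_match, the -1 return value for "no match", and the empty-pattern guard are assumptions, and the repeat/until loop is written as a while loop so that T[i] is never read when the pattern is longer than the text. Only the characters that occur in P are stored, so lookups for other characters fall back to -1.

```python
def bm_match(text, pattern):
    """Simplified Boyer-Moore (looking-glass + character-jump heuristics).

    Returns the index of the first occurrence of pattern in text, or -1.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                      # assumption: empty pattern matches at 0
    # last[c] = rightmost index of c in pattern; absent characters count as -1.
    last = {}
    for idx, c in enumerate(pattern):
        last[c] = idx
    i = j = m - 1                     # start comparing at the end of the pattern
    while i <= n - 1:
        if pattern[j] == text[i]:
            if j == 0:
                return i              # a match
            i -= 1
            j -= 1
        else:
            l = last.get(text[i], -1)
            # Jump step: same value as i + (m-1) - min(j-1, last(T[i])).
            i += m - min(j, 1 + l)
            j = m - 1
    return -1


if __name__ == "__main__":
    print(bm_match("abacaabaccabacabaabb", "abacab"))
```

On a text of all a's with a pattern of the form baa…a, every alignment is scanned almost completely before the mismatch at P[0], which is the worst case noted above.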
Knuth-Morris-Pratt Algorithm
T: bacbababaaabcbab…
P: ababaca
P: ababaca (shifted)
In general
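T: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
P: xxxx…………xxxxxxxx (prefix … suffix)
P: xxxx…………xxxxxxxx (prefix … suffix)
After a mismatch, P is shifted so that the longest prefix of P that is also a suffix of the part already matched lines up with that suffix.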
Algorithm KMPPrefixFunction(P)
Input: String P[1..m] with m characters
Output: The prefix function pre for P, which maps j to the length of the longest
prefix of P that is a proper suffix of P[1..j].
k:= 0
pre(1):= 0
for q := 2 to m
  do while k > 0 and P[k+1] ≠ P[q]
       do k := pre(k)
     if P[k+1] = P[q]
       then k := k+1
     pre(q):= k
return pre
k: index of the last character in the matched prefix
Example
i       1 2 3 4 5 6 7 8 9 10
P[i]    a b a b a b a b c a
pre(i)  0 0 1 2 3 4 5 6 0 1
Time complexity: O(m)
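The prefix function can be sketched in Python as below. Note that the pseudocode indexes P from 1 while Python lists start at 0, so in this sketch pre[q] describes the prefix pattern[0..q] and pre(k) becomes pre[k - 1]; the function name kmp_prefix_function is an assumption chosen for illustration.

```python
def kmp_prefix_function(pattern):
    """pre[q] = length of the longest prefix of pattern that is a proper
    suffix of pattern[0..q] (0-indexed version of the pseudocode)."""
    m = len(pattern)
    pre = [0] * m
    k = 0                              # length of the currently matched prefix
    for q in range(1, m):
        # Fall back to shorter prefixes until one can be extended.
        while k > 0 and pattern[k] != pattern[q]:
            k = pre[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pre[q] = k
    return pre


if __name__ == "__main__":
    # Pattern from the example table: a b a b a b a b c a
    print(kmp_prefix_function("ababababca"))
```

Because k increases by at most one per iteration of the for loop and every pass through the while loop decreases it, the total work is O(m).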
Algorithm KMPMatch(T,P)
Input: Strings T[1..n] with n characters and P[1..m] with m characters
Output: The shift of every substring of T matching P (a match with shift s
means T[s+1..s+m] = P); nothing is printed if P is not a substring of T
pre:= KMPPrefixFunction(P)
j:=0
for i:= 1 to n
  do while j>0 and P[j+1] ≠ T[i]
       do j := pre(j)
     if P[j+1] = T[i] then j := j+1
     if j = m
       then print “Pattern occurs with shift” i-m //a match!//
            j := pre(j) //look for the next match//
Time complexity: O(m+n)
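A Python sketch of KMPMatch follows. It is an illustration under stated assumptions: the name kmp_match is hypothetical, indices are 0-based, the prefix function is computed inline so the block is self-contained, and the shifts are collected in a list instead of being printed. With 0-based indexing, the shift i-m of the pseudocode is the start index of the match.

```python
def kmp_match(text, pattern):
    """Return the list of 0-based start indices (shifts) of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []                      # assumption: ignore the empty pattern
    # Prefix function, 0-indexed: pre[q] is the length of the longest prefix
    # of pattern that is a proper suffix of pattern[0..q].
    pre = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pre[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pre[q] = k
    shifts = []
    j = 0                              # number of pattern characters matched
    for i, c in enumerate(text):
        while j > 0 and pattern[j] != c:
            j = pre[j - 1]             # reuse the matched prefix, never back up in text
        if pattern[j] == c:
            j += 1
        if j == m:                     # a match ending at text[i]
            shifts.append(i - m + 1)
            j = pre[j - 1]             # look for the next match
    return shifts


if __name__ == "__main__":
    print(kmp_match("abababa", "aba"))
```

The index i into the text only moves forward, which is why the matching phase is O(n) on top of the O(m) preprocessing.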
Assignment
(1) How many character comparisons will the Boyer-Moore algorithm make in
searching for each of the following patterns in the binary text?
Text: “01110” repeated 20 times
Pattern: (a) 01111, (b) 01110
(2) (i) Compute the prefix function of the KMP pattern-matching algorithm for the pattern
ababbabbabbababbabb when the alphabet is Σ = {a,b}.
(ii) How many character comparisons will the KMP pattern-matching algorithm make in
searching for each of the following patterns in the binary text?
Text: “010011” repeated 20 times
Pattern: (a) 010010, (b) 010110