Matching Algorithms - Mississippi Valley State University

Chapter 7
Matching Algorithms
Chapter Outline
• String Matching
– Straightforward matching
– Finite automata
– Knuth-Morris-Pratt algorithm
– Boyer-Moore algorithm
• Approximate string matching
2
Prerequisites
• Before beginning this chapter, you
should be able to:
– Create finite automata
– Use character strings
– Use one- and two-dimensional arrays
– Describe growth rates and order
3
Goals
• At the end of this chapter you should be
able to:
– Explain the substring matching problem
– Explain the straightforward algorithm and
its analysis
– Explain the use of finite automata for string
matching
4
Goals (continued)
– Construct and use a Knuth-Morris-Pratt
automaton
– Construct and use slide and jump arrays
for the Boyer-Moore algorithm
– Explain the method of approximate string
matching
5
Matching Algorithms
• The general problem is to find a string of
characters in a larger piece of text
• Could also be used to find any pattern of
bits or bytes in a larger binary file
• The algorithms in this chapter find the
first occurrence of the string in the larger text
• We will assume that the string length is S
and the text length is T when we analyze
the algorithms
6
Straightforward Matching Example
7
Straightforward Matching
• We compare the first character of the
string with the first character of the text
• If they match, we move to the next
character until we have matched the
entire string or found a mismatch
• If there is a mismatch, we move the
string by one place and start again
8
The Straightforward Algorithm
subLoc = 1
textLoc = 1
textStart = 1
while textLoc ≤ length(text)
      and subLoc ≤ length(substring) do
   if text[ textLoc ] == substring[ subLoc ] then
      textLoc = textLoc + 1
      subLoc = subLoc + 1
   else
      textStart = textStart + 1
      textLoc = textStart
      subLoc = 1
   end if
end while
if subLoc > length(substring) then
   return textStart
else
   return 0
end if
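The pseudocode above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `straightforward_match` is hypothetical, the loop uses Python's 0-based indexing internally, and it converts back to the slide's convention of returning a 1-based position, or 0 when there is no match.

```python
def straightforward_match(text: str, substring: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    text_start = 0  # 0-based start of the current alignment
    text_loc = 0    # current position in the text
    sub_loc = 0     # current position in the substring
    while text_loc < len(text) and sub_loc < len(substring):
        if text[text_loc] == substring[sub_loc]:
            text_loc += 1
            sub_loc += 1
        else:
            # Mismatch: slide the substring one place and start again.
            text_start += 1
            text_loc = text_start
            sub_loc = 0
    if sub_loc >= len(substring):
        return text_start + 1  # convert to the 1-based convention
    return 0


print(straightforward_match("there they are", "they"))  # 7
```

The first alignment fails on the fourth character ("ther" against "they"), which is the kind of late mismatch that makes this method expensive.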
9
Analysis
• In the worst case, at each position in the
text every character comparison succeeds
except for the last
• This happens if the string is all X
characters except for one Y at the end
and the text is all X characters
• In this case, each of the T-S+1 positions
costs S comparisons, for S*(T-S+1)
comparisons in total
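The worst-case count can be checked with a small experiment. The sketch below is illustrative only: `count_comparisons` is a hypothetical helper, and it assumes the string is tried at exactly the T-S+1 alignments where it fits entirely inside the text (the pseudocode shown earlier keeps going slightly past the last full alignment, so it can do a few extra comparisons).

```python
def count_comparisons(text: str, substring: str) -> int:
    """Count character comparisons made by straightforward matching."""
    comparisons = 0
    s, t = len(substring), len(text)
    for start in range(t - s + 1):       # the T-S+1 possible alignments
        for k in range(s):
            comparisons += 1
            if text[start + k] != substring[k]:
                break                    # mismatch: move to the next alignment
        else:
            return comparisons           # every character matched
    return comparisons


S, T = 4, 20
worst = count_comparisons("X" * T, "X" * (S - 1) + "Y")
print(worst, S * (T - S + 1))  # 68 68
```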
10
Analysis
• Natural language texts do not have this
sort of pattern, so the algorithm will do
better with them
• This is because there is an uneven
distribution of character use in natural
language
• Studies show that this algorithm uses a
little over T comparisons on a natural
language text
11
Finite Automata
• Finite automata are used to decide
whether a word is in a given language
• We could set up a finite automaton to
accept the string we are looking for and
then if we wind up in the accepting
state, we know we found the string and
can stop
12
Finite Automata
• Because we will look at each text
character once, this will do at most T
comparisons
• However, the algorithm to construct a
finite automaton from a string takes a
long time
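One way to picture this is a transition table in which each state is the number of string characters matched so far. The sketch below is an assumption-laden illustration, not the construction the slides have in mind: `build_automaton` and `automaton_match` are hypothetical names, and the table is built by brute force, which hints at why construction is the slow part while the scan itself reads each text character once.

```python
def build_automaton(pattern: str) -> list[dict[str, int]]:
    """delta[state][ch] is the next state; a state counts matched characters."""
    alphabet = set(pattern)
    delta = []
    for state in range(len(pattern) + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            # Find the longest prefix of the pattern that is a suffix of `seen`.
            k = min(len(pattern), len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def automaton_match(text: str, pattern: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    if not pattern:
        return 1
    delta = build_automaton(pattern)
    state = 0
    for pos, ch in enumerate(text, start=1):
        # Characters that never occur in the pattern send us back to state 0.
        state = delta[state].get(ch, 0)
        if state == len(pattern):        # accepting state reached
            return pos - len(pattern) + 1
    return 0


print(automaton_match("abababcd", "ababcd"))  # 3
```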
13
Knuth-Morris-Pratt Algorithm
• For each character comparison, we can
either succeed or fail
• The Knuth-Morris-Pratt (KMP) algorithm
constructs an automaton that labels the
nodes with the string characters and
has a success and fail link for each
node
14
Knuth-Morris-Pratt Algorithm
• The success links are easy to determine
because they just take us to the next
node
• The fail links will take us back in the
automaton and are based on the string
we are trying to match
• We will get a new character of the text
when we succeed in matching, but will
reuse that character if we fail
15
Knuth-Morris-Pratt Example
• The automaton for the string ababcd
would be:
16
Knuth-Morris-Pratt
Matching Algorithm
subLoc = 1
textLoc = 1
while textLoc ≤ length(text)
      and subLoc ≤ length(substring) do
   if subLoc == 0 or
         text[ textLoc ] == substring[ subLoc ] then
      textLoc = textLoc + 1
      subLoc = subLoc + 1
   else
      subLoc = fail[ subLoc ]
   end if
end while
if subLoc > length(substring) then
   return textLoc - length(substring)
else
   return 0
end if
17
Knuth-Morris-Pratt
Failure Link Algorithm
fail[ 1 ] = 0
for i = 2 to length(substring) do
   temp = fail[ i - 1 ]
   while temp > 0
         and substring[ temp ] ≠ substring[ i - 1 ] do
      temp = fail[ temp ]
   end while
   fail[ i ] = temp + 1
end for
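The failure link and matching pseudocode can be sketched together in Python. This is a minimal sketch under stated assumptions: the names `compute_fail` and `kmp_match` are hypothetical, and both strings are padded with one leading character so that the 1-based indexing of the slides carries over unchanged.

```python
def compute_fail(substring: str) -> list[int]:
    """Return fail[1..S]; index 0 is unused to mirror the 1-based pseudocode."""
    s = len(substring)
    sub = " " + substring            # pad so sub[1] is the first character
    fail = [0] * (s + 1)             # fail[1] = 0 already
    for i in range(2, s + 1):
        temp = fail[i - 1]
        while temp > 0 and sub[temp] != sub[i - 1]:
            temp = fail[temp]
        fail[i] = temp + 1
    return fail


def kmp_match(text: str, substring: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    fail = compute_fail(substring)
    sub = " " + substring
    txt = " " + text
    sub_loc = 1
    text_loc = 1
    while text_loc <= len(text) and sub_loc <= len(substring):
        if sub_loc == 0 or txt[text_loc] == sub[sub_loc]:
            text_loc += 1            # success link: take a new text character
            sub_loc += 1
        else:
            sub_loc = fail[sub_loc]  # fail link: reuse the same text character
    if sub_loc > len(substring):
        return text_loc - len(substring)
    return 0


print(compute_fail("ababcd")[1:])       # [0, 1, 1, 2, 3, 1]
print(kmp_match("abababcd", "ababcd"))  # 3
```

In the second call the mismatch at the fifth string character follows the fail link back to node 3, so the text position never moves backward.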
18
Failure Link Analysis
• The ≠ comparison will be false at most
S – 1 times
• The fail links are all smaller than their
index
• temp is decreased each time the ≠ is
true
• The while loop is not done on the first
pass
19
Failure Link Analysis
• The variable temp is incremented by 1
for the next pass because of
– The final statement of the for loop
– The increment of i
– The first statement of the while loop
• There are S – 2 “next” passes, so temp
is incremented S – 2 times
• Because fail[1]=0, temp never
becomes negative
20
Failure Link Analysis
• temp starts at 0 and is incremented no
more than S – 2 times
• Because temp is decreased for each
mismatched comparison, there are at
most S – 2 failed comparisons
• There are S – 1 successful
comparisons, so there are at most 2S –
3 comparisons
21
Match Algorithm Analysis
• The while loop does at most one character
comparison per pass
• Either textLoc and subLoc are
incremented or subLoc is decremented
• Because textLoc starts at 1 and is
never greater than T, it is incremented
no more than T times
22
Match Algorithm Analysis
• Because subLoc starts at 1, is
incremented only along with textLoc (no
more than T times), and never falls below
0, it is decremented no more than T times
• This means that the then clause is done
no more than T times and the else
clause is done no more than T times, so
there are no more than 2T comparisons
23
Knuth-Morris-Pratt Analysis
• The fail link construction takes 2S-3
comparisons and the matching takes 2T
comparisons
• The KMP algorithm is O(S + T), whereas
the straightforward algorithm is O(S * T)
24
Boyer-Moore Algorithm
• If we match from the right of the string,
a mismatch might help us move the
string a bigger distance in the text to
skip over other mismatch locations that
can be predicted
25
Boyer-Moore Algorithm
• We have to also consider what we have
matched, so we do not make too small
of a move
• If we move the string by one position to
line up the two ‘t’ characters we will fail
quickly, but that could be predicted
26
Boyer-Moore Algorithm
• This algorithm calculates a slide and a
jump move
• The slide value tells us how much the
pattern should be moved to line up the
text character that did not match
• The jump value tells us how much to
move the pattern to line up the end
characters that matched with their
occurrence earlier in the string
27
Boyer-Moore Matching Algorithm
textLoc = length(pattern)
patternLoc = length(pattern)
while (textLoc ≤ length(text)) and (patternLoc > 0) do
   if text[ textLoc ] == pattern[ patternLoc ] then
      textLoc = textLoc - 1
      patternLoc = patternLoc - 1
   else
      textLoc = textLoc +
            MAXIMUM(slide[text[textLoc]], jump[patternLoc])
      patternLoc = length(pattern)
   end if
end while
if patternLoc == 0 then
   return textLoc + 1
else
   return 0
end if
28
Deciding on a Slide Value
29
Boyer-Moore Slide Array Algorithm
for every ch in the character set do
   slide[ ch ] = length(pattern)
end for
for i = 1 to length(pattern) do
   slide[ pattern[i] ] = length(pattern) - i
end for
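A minimal Python sketch of the slide array follows. As an assumption made for brevity, it stores only the characters that occur in the pattern in a dictionary and treats every other character as having the default value length(pattern), rather than filling an entry for the whole character set. The names `build_slide` and `slide_for` are hypothetical.

```python
def build_slide(pattern: str) -> dict[str, int]:
    """Map each pattern character to its distance from the end of the pattern."""
    p = len(pattern)
    slide = {}
    for i, ch in enumerate(pattern, start=1):
        slide[ch] = p - i            # a later occurrence overwrites an earlier one
    return slide


def slide_for(slide: dict[str, int], ch: str, pattern_length: int) -> int:
    """Characters that are not in the pattern slide by the full pattern length."""
    return slide.get(ch, pattern_length)


slide = build_slide("they")
print(slide)                     # {'t': 3, 'h': 2, 'e': 1, 'y': 0}
print(slide_for(slide, "r", 4))  # 4
```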
30
Boyer-Moore Jump Array Algorithm
for i = 1 to length(pattern) do
   jump[ i ] = 2 * length(pattern) - i
end for
test = length(pattern)
target = length(pattern) + 1
while test > 0 do
   link[test] = target
   while target ≤ length(pattern)
         and pattern[test] ≠ pattern[target] do
      jump[target] = MINIMUM( jump[target],
                              length(pattern)-test )
      target = link[target]
   end while
   test = test - 1
   target = target - 1
end while
31
Boyer-Moore Jump Array Algorithm
for i = 1 to target do
   jump[ i ] = MINIMUM( jump[ i ],
                        length(pattern) + target - i )
end for
temp = link[ target ]
while target < length(pattern) do
   while target ≤ temp do
      jump[target] = MINIMUM(jump[target],
                             temp-target+length(pattern))
      target = target + 1
   end while
   temp = link[temp]
end while
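The jump array pseudocode, together with the matching loop that uses it, can be sketched in Python as below. This is a sketch under assumptions rather than a definitive implementation: `build_jump` and `boyer_moore_match` are hypothetical names, strings are padded with one leading character to keep the 1-based indexing, and the slide values are held in a dictionary whose missing entries default to the pattern length.

```python
def build_jump(pattern: str) -> list[int]:
    """Return jump[1..P]; index 0 is unused to mirror the 1-based pseudocode."""
    p = len(pattern)
    pat = " " + pattern
    jump = [0] * (p + 2)
    link = [0] * (p + 2)
    for i in range(1, p + 1):
        jump[i] = 2 * p - i
    test, target = p, p + 1
    while test > 0:
        link[test] = target
        while target <= p and pat[test] != pat[target]:
            jump[target] = min(jump[target], p - test)
            target = link[target]
        test -= 1
        target -= 1
    for i in range(1, target + 1):
        jump[i] = min(jump[i], p + target - i)
    temp = link[target]
    while target < p:
        while target <= temp:
            jump[target] = min(jump[target], temp - target + p)
            target += 1
        temp = link[temp]
    return jump


def boyer_moore_match(text: str, pattern: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    p = len(pattern)
    slide = {ch: p - i for i, ch in enumerate(pattern, start=1)}
    jump = build_jump(pattern)
    txt, pat = " " + text, " " + pattern
    text_loc = pattern_loc = p
    while text_loc <= len(text) and pattern_loc > 0:
        if txt[text_loc] == pat[pattern_loc]:
            text_loc -= 1            # keep matching right to left
            pattern_loc -= 1
        else:
            # Take the larger of the two moves, then restart at the pattern's end.
            text_loc += max(slide.get(txt[text_loc], p), jump[pattern_loc])
            pattern_loc = p
    return text_loc + 1 if pattern_loc == 0 else 0


print(build_jump("abab")[1:5])                  # [5, 4, 5, 1]
print(boyer_moore_match("there they are", "they"))  # 7
```

In the second call only two mismatches occur before the match is found, because the slide value for "r" (not in the pattern) moves the pattern by its full length.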
32
Jump Array Calculation Example
33
Boyer-Moore Analysis
• The slide array calculation does
O(A + P) assignments, where A is the
size of the character set and P is the
pattern length, but no comparisons
• The jump array calculation at worst
compares all of the pattern characters
with those appearing later for O(P^2)
comparisons
34
Boyer-Moore Analysis
• Studies have shown that with natural
language text, and a pattern of six or
more characters, there are at most 0.4T
comparisons
• As the length of the pattern increases,
the number of comparisons drops to
about 0.25T
35
Approximate String Matching
• Spelling checkers will make suggestions
of close words that could have been
intended for misspelled words
• This involves finding words that are
close to the misspelled word
• We will talk about approximate string
matching in terms of a string and text as
in the other algorithms
36
Common errors
• The string could have characters that
are missing from the text
• The text could have characters that are
missing from the string
• There could be a character in the string
or the text that needs to be changed
37
Errors Example
• Matching the string “ad” with the text
“read” we could have:
– 2 mismatches in the first position or a
missing “re” from the string
– 2 mismatches in the second position or just
a missing “e” from the string
38
The Algorithm
• This can be complex because it might
be that a better match occurs if we look
at other possibilities
• In the example above, for the second
position there were 2 mismatches of
characters, but we get a better result if
we “add” just one character to the string
39
The Algorithm
• To keep the algorithm a little simpler, we
use a larger structure to keep track of
what we have found so far
• In this case, we will keep a
two-dimensional array with the best matches
found so far
• This array will have a row for each
character of the string and a column for
each character of the text
40
The Array
• For each location of the array diffs[i, j],
we will choose the minimum of:
– If string[i] = text[j], diffs[i – 1, j – 1]
otherwise diffs[i – 1, j – 1] + 1
– diffs[i – 1, j] + 1
– diffs[i, j – 1] + 1
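The rule above can be sketched in Python as follows. The slides do not state how row 0 and column 0 are initialized, so this sketch assumes the usual choice for matching inside a larger text: diffs[0, j] = 0 (a match may begin anywhere in the text) and diffs[i, 0] = i (the first i string characters are unmatched). The function name `approximate_match` is hypothetical.

```python
def approximate_match(text: str, string: str) -> tuple[int, int]:
    """Return (fewest differences, 1-based text position where that match ends)."""
    s, t = len(string), len(text)
    # Row i is a string character, column j is a text character.
    diffs = [[0] * (t + 1) for _ in range(s + 1)]   # assumed: diffs[0][j] = 0
    for i in range(1, s + 1):
        diffs[i][0] = i                             # assumed: diffs[i][0] = i
    for i in range(1, s + 1):
        for j in range(1, t + 1):
            change = 0 if string[i - 1] == text[j - 1] else 1
            diffs[i][j] = min(
                diffs[i - 1][j - 1] + change,  # match, or change one character
                diffs[i - 1][j] + 1,           # string character missing from the text
                diffs[i][j - 1] + 1,           # text character missing from the string
            )
    # The last row holds the cost of matching the whole string ending at column j.
    best_end = min(range(t + 1), key=lambda j: diffs[s][j])
    return diffs[s][best_end], best_end


print(approximate_match("try the trumpet", "trim"))  # (1, 12)
```

Here the best match is "trum", which ends at text position 12 and differs from "trim" by one changed character.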
41
Example
• If we compare the string “trim” with the
text “try the trumpet” we get:
42
Analysis
• We do not really need the entire array,
but just need two columns - the current
one and the previous one on which it is
based
• We compare each string character with
each text character and so do S*T
comparisons
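The two-column idea can be sketched as below. As before, this is only an illustration under assumptions: the boundary values (0 across the top row, i down the first column) are not given in the slides, and `approximate_match_two_columns` is a hypothetical name. Only the previous and current columns are kept, so the storage is proportional to S while the work is still S*T comparisons.

```python
def approximate_match_two_columns(text: str, string: str) -> tuple[int, int]:
    """Return (fewest differences, 1-based text position where that match ends)."""
    s = len(string)
    prev = list(range(s + 1))        # column 0: i differences for i string characters
    best, best_end = prev[s], 0
    for j, ch in enumerate(text, start=1):
        curr = [0] * (s + 1)         # top entry stays 0: a match may start anywhere
        for i in range(1, s + 1):
            change = 0 if string[i - 1] == ch else 1
            curr[i] = min(
                prev[i - 1] + change,  # diagonal neighbor in the previous column
                curr[i - 1] + 1,       # entry above, in the current column
                prev[i] + 1,           # entry to the left, in the previous column
            )
        if curr[s] < best:           # keep the earliest best ending position
            best, best_end = curr[s], j
        prev = curr
    return best, best_end


print(approximate_match_two_columns("try the trumpet", "trim"))  # (1, 12)
```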
43