Faster algorithms for string matching with k mismatches

Journal of Algorithms, Volume 50, Issue 2, February 2004, Pages 257-275
Amihood Amir, Moshe Lewenstein and Ely Porat
Adviser : R. C. T. Lee
Speaker: C. C. Yen
1
String matching with k mismatches
Input: A text T of length n, a pattern P of length m,
and a mismatch threshold k
Output: Each substring S of T of length m where HD(S,P) ≤ k,
where HD denotes the Hamming distance
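The problem statement can be pinned down with a brute-force reference. This is only a sketch of the definition, not any of the paper's algorithms; the function names are illustrative.

```python
def hamming_distance(s, p):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(s, p) if a != b)


def k_mismatch_naive(text, pattern, k):
    """Return every start i such that HD(text[i:i+m], pattern) <= k.

    Brute force, O(n * m) time; useful only as a reference definition.
    """
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming_distance(text[i:i + m], pattern) <= k]
```

Every faster method on the following slides should report the same locations as this loop.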
2
The basic idea of the following algorithms
• The authors consider the number of distinct symbols in
the pattern and design an algorithm that solves the
problem efficiently in each case.
Example:
P = ACAABD
The number of distinct symbols of P is 4.
3
Three cases of the number of distinct symbols in
the pattern
The paper discusses the following three cases; k is the
maximal number of mismatches allowed.
1. There are at least 2k distinct symbols.
2. There are fewer than 2√k distinct symbols.
3. The number of distinct symbols is between 2√k
and 2k.
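A small sketch of how a pattern could be assigned to one of the three cases, reading the thresholds as 2√k and 2k. The function name and the integer-only comparison are assumptions made for illustration.

```python
def classify_pattern(pattern, k):
    """Return 1, 2 or 3: which of the three cases applies to `pattern`."""
    d = len(set(pattern))      # number of distinct symbols in the pattern
    if d >= 2 * k:
        return 1               # at least 2k distinct symbols
    if d * d < 4 * k:          # d < 2*sqrt(k), tested without floats
        return 2
    return 3                   # between 2*sqrt(k) and 2k
```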
4
Case 1: At least 2k distinct symbols
There are two stages in the algorithm.
1. Marking
Identify potential starts of the pattern and do a crude
pruning of the potential candidates.
2. Verification
Verify which of the potential candidates are indeed
occurrences of the pattern.
In this case, the algorithm solves the string matching
with k mismatches problem in linear time.
5
The basic idea is as follows:
1) Let A = {a1, a2, …, a2k} be a set of 2k distinct symbols appearing in P.
2) Let P’ be the shortest prefix of P containing every symbol of A.
3) Let the length of P’ be c.
4) Let S be a substring of T of length c.
5) For each symbol of A, compare its first occurrence in P’ with the
corresponding location of S, and suppose d of these 2k locations match.
(Figure: S aligned above P’, both of length c, with d matches.)
6) Then, among these 2k locations in P’, there are 2k - d mismatches.
7) If d < k, then 2k - d > k, and we may ignore S entirely.
6
But, how can we determine d ?
We may use a position table
7
Marking stage of Case 1
• Let {a1, …, a2k} be 2k different symbols appearing in
the pattern and let ij be the smallest index in the pattern where aj
appears, j = 1, …, 2k.
• Create a position table: S0 … S2k-1 are the distinct symbols chosen from
pattern P, and pos0 … pos2k-1 are the locations of their first appearance
in P.
Example (k = 2)
index:   0123456
P =      ACABDAE
T =      ACBBDACTADIKQDABD…. (= T0 … Tn-1)
         S0    S1    S2    S3
symbols  A     C     B     D
pos      0     1     3     4
         pos0  pos1  pos2  pos3
8
We scan the text T. For each ti, 0 ≤ i < n, if we can find a
j, 0 ≤ j < 2k, such that ti = Sj, we add 1 to location i - posj of an array
X. If i - posj is less than 0, we ignore it.
X is an array of size n and all elements of X are initially 0.
X = 00000000000000000….
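The scan described above can be sketched as follows. The function names are illustrative, and taking the first 2k distinct symbols in order of appearance is an assumption that happens to reproduce the position table of the example.

```python
def first_positions(pattern, k):
    """Map each of the first 2k distinct symbols of `pattern` to its first index."""
    pos = {}
    for idx, sym in enumerate(pattern):
        if sym not in pos:
            pos[sym] = idx
            if len(pos) == 2 * k:
                break
    return pos


def mark(text, pattern, k):
    """Return the counter array X of the marking stage."""
    pos = first_positions(pattern, k)
    x = [0] * len(text)
    for i, sym in enumerate(text):
        # A table symbol at text location i votes for start i - pos[sym].
        if sym in pos and i - pos[sym] >= 0:
            x[i - pos[sym]] += 1
    return x


def candidates(x, k):
    """Locations with at least k marks; all others can be ignored."""
    return [a for a, b in enumerate(x) if b >= k]
```

Each text symbol adds at most one mark, so the whole scan is a single O(n) pass; for the example, `mark("ACBBDACTADIKQDABD", "ACABDAE", 2)` gives 4 at location 0 and 3 at location 5.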
9
After the scanning is completed, we obtain the following array
for the 17 text symbols shown:
index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
X =    4 0 0 0 0 3 0 0 1 1 0  0  2  0  1  0  0
If X(a) = b, then b of the 2k distinct symbols match between
T(a, a+c-1) and P(0, c-1), so there are at least 2k - b mismatches.
If b < k, then 2k - b > k, and we may ignore T(a, a+c-1).
We need to examine only the locations with X(a) ≥ k = 2, that is,
T(0,4), T(5,9) and T(12,16). We ignore all other substrings.
10
Lemma 1
• For Case 1, let n denote the length of the text and k the maximal number of
mismatches allowed. There are at most n/k candidate locations.
Proof:
The total number of additions to the X array is at most n, because the
algorithm tests each T(i), i = 0, 1, …, n-1, once and adds at most 1 for it.
Let a be the number of locations whose counts are at least k.
Then a · k ≤ n, so a ≤ n/k.
11
Through Lemma 1, we know that at most n/k
candidate locations remain.
But not all candidate locations are starting points of
matches with at most k mismatches.
Take T(5) as an example (k = 2, X(5) = 3):
P = ACABDAE
T = ACBBDACTADIKQDABD….
Aligned at T(5), P is compared with ACTADIK. There are four
mismatches, so this candidate location is not a starting
point of a match with at most k mismatches.
12
We must verify which candidate locations are
starting points of matches with at most k
mismatches.
13
Verification stage of Case 1
• The authors use the Kangaroo Method to verify in O(k) time
whether a location has at most k mismatches.
P = ETBDBCCDFDC
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
We shall not elaborate on this method because it was
presented before.
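As a reminder of the jump-and-count idea behind the Kangaroo Method, here is a hedged sketch. The helper `lce` below finds the longest common extension by direct comparison, which is an assumption made to keep the example short; the actual method answers each such query in O(1) after preprocessing, which is what gives the O(k) bound. All names are illustrative.

```python
def lce(text, i, pattern, j):
    """Length of the longest common prefix of text[i:] and pattern[j:].

    Naive stand-in: the real method answers this in O(1) after preprocessing.
    """
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def verify(text, pattern, start, k):
    """True if pattern matches text at `start` with at most k mismatches."""
    m = len(pattern)
    if start + m > len(text):
        return False
    mismatches = 0
    j = 0
    while j < m:
        j += lce(text, start + j, pattern, j)   # jump over a matching run
        if j < m:                               # landed on a mismatch
            mismatches += 1
            if mismatches > k:
                return False
            j += 1                              # step past the mismatch
    return True
```

The loop makes at most k + 1 jumps before it either reaches the end of the pattern or gives up.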
14
Time complexity of Case 1
• The marking stage takes O(n) time, where n is the length of
the text.
• According to Lemma 1, we have at most n/k candidate
locations.
• Using the Kangaroo Method, we take O(k) time to verify each
remaining candidate location.
• Thus, the verification stage takes (n/k) · O(k) = O(n) time.
15
Case 2: Fewer than 2√k distinct symbols
• We can use the Boolean convolution method to solve the
problem for this case.
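16
• The Hamming distance can be found by convolution.
• Let A = abac and B = acdc. For this case, HD(A,B) = 2.
Convolution of A with the reverse of B (cdca), one symbol at a time:
symbol a: 1010 ⊗ 0001 = 0001010
symbol c: 0001 ⊗ 1010 = 0001010
(b and d each occur in only one of the strings and contribute 0000000)
sum:                    0002020
The middle entry is 2, so there are 2 matches and HD(A,B) = 4 - 2 = 2.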
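The per-symbol convolution can be sketched as below. The quadratic `convolve` is a stand-in chosen to keep the example dependency-free; an FFT-based convolution would be substituted to obtain the running time quoted for this case. Function names are illustrative.

```python
def convolve(u, v):
    """Plain O(len(u) * len(v)) convolution of two integer sequences."""
    out = [0] * (len(u) + len(v) - 1)
    for i, a in enumerate(u):
        for j, b in enumerate(v):
            out[i + j] += a * b
    return out


def match_counts(text, pattern):
    """matches[i] = number of positions where pattern agrees with text[i:i+m]."""
    n, m = len(text), len(pattern)
    total = [0] * (n + m - 1)
    for sym in set(pattern):
        t_ind = [1 if c == sym else 0 for c in text]
        p_ind = [1 if c == sym else 0 for c in reversed(pattern)]
        for idx, val in enumerate(convolve(t_ind, p_ind)):
            total[idx] += val
    # Entry m-1+i of the convolution lines the pattern up with text[i:i+m].
    return [total[m - 1 + i] for i in range(n - m + 1)]


def hamming_by_convolution(text, pattern):
    """Hamming distance of the pattern to every length-m window of the text."""
    m = len(pattern)
    return [m - c for c in match_counts(text, pattern)]
```

One convolution is run per distinct pattern symbol, which is why a small alphabet matters in this case.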
17
Using the Fast Fourier Transform (FFT), a Boolean convolution
can be done in O(n log m) time.
We need one convolution per symbol, and our alphabet size is O(√k).
We take O(n√k log m) time to solve the problem for Case 2.
18
Case 3: The number of distinct
symbols is between 2√k and 2k
Definition:
frequent symbol: A symbol that appears in the pattern at
least 2√k times.
Example
k = 2, ⌈2√k⌉ = 3, P = baccdbdd
d appears 3 times, so d is a frequent symbol.
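The definition can be checked mechanically. A minimal sketch, assuming the threshold is 2√k and comparing squares to avoid floating point; the function name is illustrative.

```python
from collections import Counter


def frequent_symbols(pattern, k):
    """Symbols occurring at least 2*sqrt(k) times in `pattern`.

    count >= 2*sqrt(k) is tested as count*count >= 4*k to stay in integers.
    """
    counts = Counter(pattern)
    return sorted(s for s, c in counts.items() if c * c >= 4 * k)
```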
19
Two sub-cases of Case 3
Case 3-1: There are at least √k frequent symbols in the pattern.
Case 3-2: There are fewer than √k frequent symbols in the pattern.
20
Case 3-1: at least √k frequent symbols
There are two stages in the algorithm for this case.
(1) Marking stage
Identify potential starts of the pattern and do a crude
pruning of the potential candidates.
(2) Verification stage
Verify which of the potential candidates are indeed
occurrences of the pattern.
The verification stage is done by the Kangaroo
Method.
21
Marking stage of Case 3-1
We arbitrarily pick √k frequent symbols, keep 2√k occurrences of
each in the pattern, and convert this problem to a mismatch
problem with “don’t care”.
Example
Let T = ABCABDCABBCFADDABC
P = ABCAABBDBAA
and k = 4
There are 4 distinct symbols in P (4 is between 2√k = 4 and 2k = 8),
and ‘A’ and ‘B’ are frequent symbols. There are 2 (= √k) frequent
symbols.
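The conversion to a “don’t care” pattern can be sketched as follows. Several choices here are assumptions, since the slide only says the pick is arbitrary: k is taken to be a perfect square, the frequent symbols are chosen in alphabetical order, and the leftmost 2√k occurrences of each are kept. The names `DONT_CARE` and `mask_for_marking` are illustrative.

```python
from collections import Counter
from math import isqrt

DONT_CARE = "Φ"


def mask_for_marking(pattern, k):
    """Keep 2*sqrt(k) occurrences of sqrt(k) frequent symbols; mask the rest."""
    root = isqrt(k)                       # assumes k is a perfect square
    per_symbol = 2 * root
    counts = Counter(pattern)
    frequent = sorted(s for s, c in counts.items() if c >= per_symbol)[:root]
    kept = {s: 0 for s in frequent}       # occurrences kept so far, per symbol
    masked = []
    for sym in pattern:
        if sym in kept and kept[sym] < per_symbol:
            kept[sym] += 1
            masked.append(sym)
        else:
            masked.append(DONT_CARE)
    return "".join(masked)
```

For P = ABCAABBDBAA and k = 4 this keeps four occurrences each of A and B, so 2k = 8 pattern positions remain.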
22
Mismatch problem with “don’t care”
Input: A text T of length n and a pattern P of length m,
in which g characters are not “don’t care” symbols
and the rest are Φ (“don’t care”).
Output: The number of mismatches between the pattern and
each substring of T of length m. Only mismatches of the
g non-Φ pattern characters are counted.
Example
P = A B Φ A A B B Φ B A Φ
T = A B C A B D C A B B C F A D D A B C
The number of mismatches at the first location is 4.
23
The mismatch problem with “don’t care” can be solved in
O(n√g log m) time (Amir et al., 1997), where n is the length of text T,
m is the length of pattern P and g is the number of characters in the
pattern which are not “don’t care” symbols.
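What the “don’t care” problem computes can be shown with a direct loop. This sketch runs in O(n · g) time and is not the faster method cited above; it only fixes the expected output. The name `mismatches_with_dont_care` is illustrative.

```python
DONT_CARE = "Φ"


def mismatches_with_dont_care(text, pattern):
    """For each alignment, count mismatches at non-don't-care pattern positions.

    Direct O(n * g) loop, where g is the number of non-don't-care characters.
    """
    m = len(pattern)
    live = [(j, c) for j, c in enumerate(pattern) if c != DONT_CARE]
    return [sum(1 for j, c in live if text[i + j] != c)
            for i in range(len(text) - m + 1)]
```

With P = ABΦAABBΦBAΦ and the example text, the first location yields 4 mismatches, as on the slide.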
27
All locations with at most k mismatches of the √k frequent
symbols are our candidate locations, where matches with
at most k mismatches may start.
Example
k = 4
P = A B Φ A A B B Φ B A Φ
T = A B C A B D C A B B C F A D D A B C
location:                  0 1 2 3 4 5 6 7
The number of mismatches:  4 7 7 2 6 8 7 6
Locations 0 and 3 have at most k = 4 mismatches, so they are the
candidate locations.
28
Lemma 2 for Case 3-1
Let {a1, …, a√k} be frequent symbols. Then there exist in the text at
most 2n/√k locations where there is a pattern occurrence with no
more than k errors.
Proof:
Each text symbol T(i), i = 0, 1, …, n-1, matches at most 2√k of the
chosen pattern positions, so the total number of marks is at most 2√k · n.
A pattern occurrence with at most k errors matches at least k of the 2k
chosen positions. Let a be the number of locations which have at least
k marks.
Then a · k ≤ 2√k · n, so
a ≤ 2√k · n / k = 2√k · n / (√k · √k) = 2n/√k.
29
We convert the marking stage to a mismatch problem with “don’t
care”, which takes O(n√k log m) time to solve.
According to Lemma 2 for Case 3-1, there are at most 2n/√k candidate
locations and we take O(k) time to verify one candidate location.
The verification stage for Case 3-1 therefore takes
(2n/√k) · O(k) = O(n√k)
time.
30
Case 3-2: fewer than √k frequent symbols
First, we check the number of mismatches after
converting all frequent symbols to Φ (“don’t care”
symbol).
Example
Let
T = ABCABDCABBCFADDABC
P = ABCAABGDBAA
and k = 5
There are 5 distinct symbols in P (5 is between 2√k ≈ 4.47 and 2k = 10),
and ‘A’ is the only frequent symbol. There is 1 (< √k) frequent
symbol.
31
Two cases are discussed after we convert all frequent
symbols to Φ.
3-2-1: There are fewer than 2k remaining symbols.
3-2-2: There are at least 2k remaining symbols.
32
Case 3-2-1: There are fewer than 2k remaining symbols
There are fewer than 2k remaining symbols and the rest
are “don’t care” symbols. Finding the mismatches of the
remaining symbols can be solved as a mismatch
problem with “don’t care” and takes O(n√k log m) time.
P’ = Φ B C Φ Φ B G D B Φ Φ
T =  A B C A B D C A B B C F A D D A B C
location:                         0 1 2 3 4
mismatches of remaining symbols:  3 5 6 4 4
37
All locations which have at most k mismatches in total,
counting both the frequent symbols and the remaining symbols,
are the matches which we want.
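The way the two counts combine can be illustrated with a direct loop. This sketch only shows that the mismatches of the frequent symbols and of the remaining symbols add up to the full Hamming distance; it does not reproduce the convolution-based running time. All names are illustrative.

```python
from collections import Counter


def split_mismatch_counts(text, pattern, k):
    """Return (frequent_part, remaining_part) mismatch counts per alignment."""
    counts = Counter(pattern)
    # A symbol is frequent if it occurs at least 2*sqrt(k) times.
    frequent = {s for s, c in counts.items() if c * c >= 4 * k}
    m = len(pattern)
    freq_part, rest_part = [], []
    for i in range(len(text) - m + 1):
        f = r = 0
        for j, sym in enumerate(pattern):
            if text[i + j] != sym:
                if sym in frequent:
                    f += 1
                else:
                    r += 1
        freq_part.append(f)
        rest_part.append(r)
    return freq_part, rest_part


def matches_within_k(text, pattern, k):
    """Locations whose combined mismatch count is at most k."""
    freq_part, rest_part = split_mismatch_counts(text, pattern, k)
    return [i for i, (f, r) in enumerate(zip(freq_part, rest_part))
            if f + r <= k]
```

For the example with k = 5, the remaining-symbol counts start 3, 5, 6, 4, 4 as on the slide, and adding the mismatches of ‘A’ pushes every location above k.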
38
• Conclusion:
The problem for Case 3-2-1 can be solved in O(n√k log m) time.
39
Case 3-2-2: There are at least 2k remaining symbols
There are two stages in the algorithm for this case.
(1) Marking stage
Identify potential starts of the pattern and do a crude pruning of the
potential candidates.
(2) Verification stage
Verify which of the potential candidates are indeed occurrences of the
pattern.
The verification stage is done by the Kangaroo
Method.
40
Marking stage of Case 3-2-2
We arbitrarily pick 2k of the remaining symbols and convert
all other symbols to Φ (“don’t care” symbols).
The marking stage of Case 3-2-2 can then be solved as a
mismatch problem with “don’t care” in O(n√k log m)
time.
41
• Conclusion:
The problem for Case 3-2-2 can be solved in O(n√k log m) time.
42
Thank you
43