Approximate pattern matching

String Matching with Errors
The Theory and Computation of Evolutionary Distances: Pattern Recognition,
Sellers, P. H., Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359~373.
Speaker: C. C. Lin
Adviser: R. C. T. Lee
1
In the following, we present a problem related
to the notion of edit distance.
We first introduce edit distance itself.
2
In edit distance, there are three types of differences
between two strings X and Y:
Insertion: a symbol of Y is missing in X at a
corresponding position, with its cost being 1.
X: A - T
Y: A G T
Substitution: symbols at corresponding positions are
distinct, with its cost being 1.
X: A C C
Y: T C C
Deletion: a symbol of X is missing in Y at a
corresponding position, with its cost being 1.
X: G C A
Y: G - A
3
Given two strings X and Y, the edit distance
between X and Y is the minimum number of
insertions, deletions and substitutions needed to
transform X to Y.
4
String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG
Transformation (from string Y to string X)
String X: A T G A A - - T C T T A C C G C C T C G
String Y: A T G A G G C T C T G G C C - C C T - G
EDIT(X, Y)=7 (2 insertions, 2 deletions and 3 substitutions).
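The count of 7 can be checked directly from the gapped alignment. The following is a minimal sketch, assuming the gap symbol is written as "-" and using a hypothetical helper name `count_edits`; following the definitions on the earlier slide, a gap in the X row is counted as an insertion and a gap in the Y row as a deletion.

```python
def count_edits(x_row, y_row, gap="-"):
    """Count (insertions, deletions, substitutions) in a gapped alignment.

    x_row and y_row are the two alignment rows, of equal length.
    A gap in x_row means a symbol of Y is missing in X (insertion);
    a gap in y_row means a symbol of X is missing in Y (deletion).
    """
    if len(x_row) != len(y_row):
        raise ValueError("alignment rows must have the same length")
    insertions = deletions = substitutions = 0
    for a, b in zip(x_row, y_row):
        if a == gap:
            insertions += 1
        elif b == gap:
            deletions += 1
        elif a != b:
            substitutions += 1
    return insertions, deletions, substitutions


x_row = "ATGAA--TCTTACCGCCTCG"
y_row = "ATGAGGCTCTGGCC-CCT-G"
ins, dele, sub = count_edits(x_row, y_row)
print(ins, dele, sub, ins + dele + sub)
```

This only measures the cost of one given alignment; it does not by itself show that no cheaper alignment exists.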
5
Next, we will introduce a dynamic programming
method to compute the edit distance between
two strings X and Y.
6
Dynamic Programming for Edit Distance:
$$
\mathrm{EDIT}[i, j] = \min
\begin{cases}
\mathrm{EDIT}[i-1, j] + 1 & \text{(Delete)} \\
\mathrm{EDIT}[i, j-1] + 1 & \text{(Insert)} \\
\mathrm{EDIT}[i-1, j-1] + \delta(x[i], y[j]) & \text{(Substitute)}
\end{cases}
$$
where $\delta(x[i], y[j]) = 0$ if $x[i] = y[j]$, and
$\delta(x[i], y[j]) = 1$ otherwise.
$\mathrm{EDIT}[i, 0] = i$, $\mathrm{EDIT}[0, j] = j$.
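The recurrence can be sketched in a few lines of Python. This is an illustrative implementation, not code from the slides: the function name `edit_distance` is an assumption, and because Python strings are 0-based, the symbols x[i] and y[j] of the recurrence appear as `x[i - 1]` and `y[j - 1]`.

```python
def edit_distance(x, y):
    """Fill the EDIT table for strings x and y and return EDIT[len(x)][len(y)]."""
    m, n = len(x), len(y)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i          # boundary: EDIT[i, 0] = i
    for j in range(n + 1):
        edit[0][j] = j          # boundary: EDIT[0, j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if x[i - 1] == y[j - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,          # EDIT[i-1, j] + 1
                edit[i][j - 1] + 1,          # EDIT[i, j-1] + 1
                edit[i - 1][j - 1] + delta,  # EDIT[i-1, j-1] + delta
            )
    return edit[m][n]


print(edit_distance("abcabba", "cbabac"))
```

For the strings X=abcabba and Y=cbabac used in the following slides, this returns 4.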
7
       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1
 b  2
 a  3
 b  4
 a  5
 c  6
Given X=abcabba, Y=cbabac
8
       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1
 b  2
 a  3
 b  4
 a  5
 c  6
Given X=abcabba, Y=cbabac
9
       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1  2
 b  2
 a  3
 b  4
 a  5
 c  6
Given X=abcabba, Y=cbabac
10
       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1  2  2
 b  2
 a  3
 b  4
 a  5
 c  6
Given X=abcabba, Y=cbabac
11
       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1  2  2  3
 b  2
 a  3
 b  4
 a  5
 c  6
Given X=abcabba, Y=cbabac
12
       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1  2  2  3  4  5  6
 b  2  2  1  2  3  3  4  5
 a  3  2  2  2  2  3  4  4
 b  4  3  2  3  3  2  3  4
 a  5  4  3  3  3  3  3  3
 c  6  5  4  3  4  4  4  4
Given X=abcabba, Y=cbabac
13
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
Traceback step from the bottom-right cell: Substitute
X: a
Y: c
14
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
Traceback step: Substitute
X: ba
Y: ac
15
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
Traceback step: Match
X: bba
Y: bac
16
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
Traceback step: Match
X: abba
Y: abac
17
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
Traceback step: Insert
X: cabba
Y: -abac
18
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
Traceback step: Match
X: bcabba
Y: b-abac
19
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
Traceback step: Substitute
X: abcabba
Y: cb-abac
20
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
Another path through the table, with its operations:
Substitute, Match, Insert, Match, Match, Insert, Match, Delete
X: abcabba-
Y: cb-ab-ac
21
Given X=abcabba, Y=cbabac
EDIT(X, Y)=4
A further alignment with the same cost:
X: abcabba-
Y: cb-a-bac
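The traceback shown on the preceding slides can be sketched as follows. This is a hedged illustration: the function name `align` and the tie-breaking order (diagonal move first, then a gap in the second row, then a gap in the first row) are assumptions, so the function returns only one of the several optimal alignments.

```python
def align(x, y, gap="-"):
    """Return (distance, x_row, y_row) for one optimal alignment of x and y."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + delta)

    # Walk back from the final cell to the origin, collecting alignment columns.
    x_row, y_row = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            x_row.append(x[i - 1])   # match or substitution
            y_row.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            x_row.append(x[i - 1])   # symbol of x aligned with a gap
            y_row.append(gap)
            i -= 1
        else:
            x_row.append(gap)        # symbol of y aligned with a gap
            y_row.append(y[j - 1])
            j -= 1
    return d[m][n], "".join(reversed(x_row)), "".join(reversed(y_row))


distance, x_row, y_row = align("abcabba", "cbabac")
print(distance)
print(x_row)
print(y_row)
```

With this tie-breaking order, the example X=abcabba, Y=cbabac yields the alignment abcabba over cb-abac with cost 4; a different order would recover one of the other alignments of equal cost.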
22
The algorithm above computes the edit distance in
O(mn) time and O(mn) space, where m and n are the
lengths of the two strings (later, of the pattern and
the text, respectively).
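The O(mn) space bound applies when the whole table is kept, which is needed for tracing back an alignment. As a side remark not made on the slides: if only the distance is wanted, each row depends only on the row before it, so two rows suffice. A minimal sketch, with the hypothetical name `edit_distance_two_rows`:

```python
def edit_distance_two_rows(x, y):
    """Edit distance using two rows of the table instead of the full table."""
    if len(y) > len(x):
        x, y = y, x              # keep the rows as short as possible
    prev = list(range(len(y) + 1))           # row 0: EDIT[0, j] = j
    for i, xc in enumerate(x, 1):
        curr = [i]                           # column 0: EDIT[i, 0] = i
        for j, yc in enumerate(y, 1):
            curr.append(min(prev[j] + 1,                 # from the row above
                            curr[j - 1] + 1,             # from the cell to the left
                            prev[j - 1] + (xc != yc)))   # diagonal, cost 0 or 1
        prev = curr
    return prev[-1]


print(edit_distance_two_rows("abcabba", "cbabac"))
```

The running time is still proportional to mn; only the memory drops to the length of the shorter string.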
23
Next, we introduce the “string matching with errors”
problem.
24
The definition of the problem: Given a pattern P of
length m and a text T of length n, find a substring S
of T such that EDIT(P, S) is minimal.
Given: T=abcabba
P=cbabac
Find: S=cabba
EDIT(P, S)=3
P=cbabac
S=c–abba
Given: T=abcabba
P=cbabac
T’s substring K=bcabb
EDIT(P, K)=4
P=–cbabac
K=bc–ab–b
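The definition can be checked by brute force on this small example: compute EDIT(P, S) for every substring S of T and keep the minimum. The sketch below is only a direct restatement of the definition, with the assumed names `edit_distance` and `best_substrings`; it is far slower than the dynamic programming method and is not the approach of the paper.

```python
def edit_distance(x, y):
    """Plain edit distance, computed row by row."""
    prev = list(range(len(y) + 1))
    for i, xc in enumerate(x, 1):
        curr = [i]
        for j, yc in enumerate(y, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (xc != yc)))
        prev = curr
    return prev[-1]


def best_substrings(pattern, text):
    """Return (minimum distance, sorted list of substrings of text achieving it)."""
    candidates = {text[a:b] for a in range(len(text) + 1)
                  for b in range(a, len(text) + 1)}   # includes the empty string
    scored = {s: edit_distance(pattern, s) for s in candidates}
    best = min(scored.values())
    return best, sorted(s for s, dist in scored.items() if dist == best)


best, winners = best_substrings("cbabac", "abcabba")
print(best, winners)
print(edit_distance("cbabac", "bcabb"))
```

For T=abcabba and P=cbabac the minimum is 3 and S=cabba is among the substrings that reach it, while K=bcabb has distance 4.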
25
Dynamic Programming for the String Matching with
Errors Problem:
$$
\mathrm{SE}[i, j] = \min
\begin{cases}
\mathrm{SE}[i-1, j] + 1 \\
\mathrm{SE}[i, j-1] + 1 \\
\mathrm{SE}[i-1, j-1] + \delta(x[i], y[j])
\end{cases}
$$
where $\delta(x[i], y[j]) = 0$ if $x[i] = y[j]$, and
$\delta(x[i], y[j]) = 1$ otherwise.
$\mathrm{SE}[i, 0] = i$, $\mathrm{SE}[0, j] = 0$ (in particular, $\mathrm{SE}[0, 0] = 0$).
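A minimal Python sketch of this recurrence follows. The function name `se_table` is an assumption, and so is the layout: the rows are indexed by the pattern and the columns by the text, as in the tables on the later slides.

```python
def se_table(pattern, text):
    """Fill the SE table: rows follow the pattern, columns follow the text."""
    m, n = len(pattern), len(text)
    se = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 stays 0: SE[0, j] = 0
    for i in range(1, m + 1):
        se[i][0] = i                             # SE[i, 0] = i
        for j in range(1, n + 1):
            delta = 0 if pattern[i - 1] == text[j - 1] else 1
            se[i][j] = min(
                se[i - 1][j] + 1,
                se[i][j - 1] + 1,
                se[i - 1][j - 1] + delta,
            )
    return se


table = se_table("cbabac", "abcabba")
for row in table:
    print(row)
print(min(table[-1]))
```

The smallest entry of the last row is the minimal EDIT(P, S) over all substrings S of T; for P=cbabac and T=abcabba it is 3.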
26
The dynamic programming approach for the edit
distance problem:
$$
\mathrm{EDIT}[i, j] = \min
\begin{cases}
\mathrm{EDIT}[i-1, j] + 1 \\
\mathrm{EDIT}[i, j-1] + 1 \\
\mathrm{EDIT}[i-1, j-1] + \delta(x[i], y[j])
\end{cases}
$$
where $\delta(x[i], y[j]) = 0$ if $x[i] = y[j]$, and
$\delta(x[i], y[j]) = 1$ otherwise.
$\mathrm{EDIT}[i, 0] = i$, $\mathrm{EDIT}[0, j] = j$.
The only difference between the two recurrences is the
boundary condition on the first row: EDIT[0, j]=j for the
edit distance problem and SE[0, j]=0 for the string
matching with errors problem.
27
In the edit distance problem, we have EDIT[0, j]=j.
In the string matching with errors problem, we set
SE[0, j]=0, because a matching substring may start at
any position of the text at no cost.
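To make the single difference concrete, both tables can be produced by one routine that takes the first-row boundary as a parameter. This is an illustrative sketch; `fill_table` and its `first_row` argument are hypothetical names, with the rows indexed by the pattern and the columns by the text.

```python
def fill_table(pattern, text, first_row):
    """Shared recurrence; first_row(j) supplies the boundary value for row 0."""
    m, n = len(pattern), len(text)
    table = [[first_row(j) for j in range(n + 1)]]
    for i in range(1, m + 1):
        row = [i]                                # first column is i in both problems
        for j in range(1, n + 1):
            delta = 0 if pattern[i - 1] == text[j - 1] else 1
            row.append(min(table[i - 1][j] + 1,
                           row[j - 1] + 1,
                           table[i - 1][j - 1] + delta))
        table.append(row)
    return table


edit = fill_table("cbabac", "abcabba", lambda j: j)   # EDIT[0, j] = j
se = fill_table("cbabac", "abcabba", lambda j: 0)     # SE[0, j] = 0
print(edit[-1][-1])   # edit distance between the whole strings
print(min(se[-1]))    # best distance between the pattern and any substring
```

The answers are also read off differently: the edit distance is the bottom-right entry, while the string matching with errors answer is the minimum over the whole last row.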
28
A path that starts at a value 3 in the bottom row and ends in
the top row, where SE[0, j]=0, shows that there exists a
substring S of T such that EDIT(P, S)=3.
       a  b  c  a  b  b  a
    0  0  0  0  0  0  0  0
 c  1  1  1  0  1  1  1  1
 b  2  2  1  1  1  1  1  2
 a  3  2  2  2  1  2  2  1
 b  4  3  2  3  2  1  2  2
 a  5  4  3  3  3  2  2  2
 c  6  5  4  3  4  3  3  3
T=abcabba
P=cbabac
29
We find the lowest value in the last row and trace
back from that cell.
The output may consist of several substrings.
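This procedure can be sketched as follows. The name `approximate_matches` and the tie-breaking order during the trace back (diagonal, then up, then left) are assumptions, so for each end position the function reports only one of the possible substrings.

```python
def approximate_matches(pattern, text):
    """Return (best distance, one matching substring per best end position)."""
    m, n = len(pattern), len(text)
    se = [[0] * (n + 1) for _ in range(m + 1)]   # SE[0, j] = 0
    for i in range(1, m + 1):
        se[i][0] = i                             # SE[i, 0] = i
        for j in range(1, n + 1):
            delta = 0 if pattern[i - 1] == text[j - 1] else 1
            se[i][j] = min(se[i - 1][j] + 1, se[i][j - 1] + 1,
                           se[i - 1][j - 1] + delta)

    best = min(se[m])                            # lowest value of the last row
    matches = []
    for end in range(n + 1):
        if se[m][end] != best:
            continue
        i, j = m, end
        while i > 0:                             # trace back until the top row
            if j > 0 and se[i][j] == se[i - 1][j - 1] + (pattern[i - 1] != text[j - 1]):
                i, j = i - 1, j - 1              # diagonal move
            elif se[i][j] == se[i - 1][j] + 1:
                i -= 1                           # move up: pattern symbol unmatched
            else:
                j -= 1                           # move left: text symbol unmatched
        matches.append(text[j:end])              # the path reached row 0 at column j
    return best, matches


best, matches = approximate_matches("cbabac", "abcabba")
print(best, matches)
```

For T=abcabba and P=cbabac the best distance is 3, and the reported substrings include S=cabba.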
30
T=abcabba
P=cbabac
T: abc-abba
P:   cbabac
S=cabba
31
Edit distance between P and S:
       c  a  b  b  a
    0  1  2  3  4  5
 c  1  0  1  2  3  4
 b  2  1  1  1  2  3
 a  3  2  1  2  2  2
 b  4  3  2  1  2  3
 a  5  4  3  2  2  2
 c  6  5  4  3  3  3
T=abcabba
P=cbabac
S: c-abba
P: cbabac
EDIT(P, S)=3
32
T=abcabba
P=cbabac
S: cabba-
P: cbabac
EDIT(P, S)=3
33
T=abcabba
P=cbabac
S: c-ab--
P: cbabac
EDIT(P, S)=3
34
T=abcabba
P=cbabac
S: --ab-c
P: cbabac
EDIT(P, S)=3
35
References
For Edit Distance Computation:
[NW70] Needleman, S. B., and Wunsch, C. D., A
general method applicable to the search for
similarities in the amino acid sequence of two
proteins, Journal of Molecular Biology 48 (1970):
443-453.
For String matching with errors:
[S80] Sellers, P. H., The Theory and Computation of
Evolutionary Distances: Pattern Recognition,
Journal of Algorithms, Vol. 20, No. 1, 1980, pp.
359-373.
36
Thank you
37