Horspool Algorithm

Approximate Boyer-Moore
String Matching
Source : SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp.243-260
J. Tarhio and E. Ukkonen
Advisor: Prof. R. C. T. Lee
Speaker: Kuei-hao Chen
1
• The k mismatches problem
• The k differences problem
2
Definition of the k mismatches problem
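• Given a pattern string P of length m and a
text string T of length n, we would like to
find all approximate occurrences of P in T with
at most k mismatches.
If k=1, then an occurrence may contain one position where the text character a is aligned with a different pattern character b.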
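As a baseline for the definition above, the problem can be solved by checking every window directly. This is only a reference sketch, not the method of the talk; the function names `mismatches` and `naive_k_mismatch` are made up for illustration, and positions are reported as 1-based end positions j of the window.

```python
def mismatches(window, pattern):
    """Count positions where the window and the pattern differ."""
    return sum(1 for a, b in zip(window, pattern) if a != b)


def naive_k_mismatch(text, pattern, k):
    """Return all 1-based end positions j where P matches with <= k mismatches."""
    m = len(pattern)
    return [
        j
        for j in range(m, len(text) + 1)
        if mismatches(text[j - m:j], pattern) <= k
    ]


if __name__ == "__main__":
    # Example 1 of the talk: k=1, P=AGCT
    print(naive_k_mismatch("TTAACGTAATGCAGCTA", "AGCT", 1))  # [16]
```

This takes O(mn) comparisons in every case, which is the cost the shifting rules are meant to reduce in practice.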
3
Consider the following situation where a pattern
P is matched against a window W of T and there
are already (k+1) mismatches:
[Figure: P aligned under the window W of T, with k+1 mismatches]
4
Since there are already (k+1) mismatches, we
must move the pattern. The following is
obvious:
P must be moved to such an extent that there are
at most k mismatches between a suffix S of W
and a substring S’ of P.
[Figure: P shifted under T after k+1 mismatches, so that a suffix S of W is aligned with a substring S’ of P]
5
Our trick is as follows: Consider the (k+1)-suffix of W. There are two cases:
6
Case 1: There is one character in this (k+1)-suffix
which exists in P in such a way as shown below.
Move the pattern to match these characters. Note
that in such a situation, there are at most k
mismatches between the (k+1)-suffix and its
corresponding substring in P.
[Figure: a character x in the (k+1)-suffix of the window in T; P is moved so that the same character x in P is aligned with it]
7
Case 2: No such character exists. Move the
pattern in such a way that the k-prefix of P
aligns with the k-suffix of W as shown below.
In this situation, again, there are at most
k mismatches between the k-suffix of W and the k-prefix of P.
[Figure: P moved under T so that the k-prefix of P lies under the last k characters of the (k+1)-suffix]
8
The generalization of the BM algorithm for the
k mismatches problem will be very natural: for
k=0 the generalized algorithm is exact string
matching.
Recall that the k mismatches problem asks for
finding all occurrences of P in T such that in at
most k positions of P, T and P have different
characters.
9
We just scan the pattern from right to the left
until we have found k+1 mismatches
(unsuccessful search) or the pattern ends
(successful search).
10
Preprocessing phase for approximate matching
Dk table
For each position i = m, m-1, …, m-k and each character a ∈ Σ, the value Dk[i, a] is the distance from position i to the rightmost occurrence of a in the pattern to the left of position i. If a does not occur to the left of position i, then Dk[i, a] = m.
Example: Let k=1, m=8, a ∈ Σ
j:  1 2 3 4 5 6 7 8
P:  G C A G A G A G
Σ            A  C  G  *
D1[i=8, a]   1  6  2  8
D1[i=7, a]   2  5  1  8
11
Algorithm for preprocessing phase
P = p1p2…pm, T = t1t2…tn
Preprocessing
For a ∈ Σ Do
For j=m downto m-k Do Begin
dk[j,a] ← m
Find the occurrence of a in P that is closest to pj on its left. If it is
found, we calculate the distance between the
position of the character a and j and insert it
into dk[j,a].
End
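The preprocessing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the name `build_dk` and the nested-dictionary layout are hypothetical, positions are 1-based as in the slides, and the character T stands in for the "*" (any other character) column.

```python
def build_dk(pattern, k, alphabet):
    """Return {i: {a: shift}} for the 1-based positions i = m, m-1, ..., m-k."""
    m = len(pattern)
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")
    table = {}
    for i in range(m, m - k - 1, -1):
        # Default: the character does not occur to the left of position i.
        row = {a: m for a in alphabet}
        # Walk from the far left towards i, so the nearest occurrence
        # (smallest distance s) is the value written last.
        for s in range(i - 1, 0, -1):
            row[pattern[i - s - 1]] = s  # p_(i-s) in 1-based notation
        table[i] = row
    return table


if __name__ == "__main__":
    dk = build_dk("GCAGAGAG", 1, "ACGT")
    print(dk[8])  # {'A': 1, 'C': 6, 'G': 2, 'T': 8}
    print(dk[7])  # {'A': 2, 'C': 5, 'G': 1, 'T': 8}
```

The sketch runs in O((k+1)·m) time, which is simple rather than optimal.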
12
Algorithm for searching phase
P = p1p2…pm, T = t1t2…tn
Searching
j=m;
While j ≤ n+k Do Begin
h=j; i=m; mismatch=0; d=m-k;
While i>0 and mismatch ≤ k Do Begin
If i ≥ m-k Then d=min(d, dk[i, th]);
If th≠pi Then mismatch=mismatch+1;
i=i-1; h=h-1 End of while;
If mismatch ≤ k Then report match at position j;
j=j+d End of while
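The searching phase can be sketched in Python as below. The names `shift_table` and `k_mismatch_search` are hypothetical. Two assumptions are made here: the outer loop stops at j ≤ n so that the window always lies inside the text, and the shift d starts at m-k (the Case 2 shift) and is minimised over the last k+1 window positions.

```python
def shift_table(pattern, k):
    """dk[i][a]: distance from 1-based position i to the nearest a on its left."""
    m = len(pattern)
    table = {}
    for i in range(m, m - k - 1, -1):
        row = {}
        for s in range(i - 1, 0, -1):
            row[pattern[i - s - 1]] = s  # nearest occurrence is written last
        table[i] = row
    return table


def k_mismatch_search(text, pattern, k):
    """Return 1-based end positions of occurrences with at most k mismatches."""
    n, m = len(text), len(pattern)
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")
    dk = shift_table(pattern, k)
    matches = []
    j = m
    while j <= n:
        h, i, mismatch = j, m, 0
        d = m - k  # Case 2: align the k-prefix of P with the k-suffix of W
        while i > 0 and mismatch <= k:
            if i >= m - k:
                # Characters missing from the row get the default shift m.
                d = min(d, dk[i].get(text[h - 1], m))
            if text[h - 1] != pattern[i - 1]:
                mismatch += 1
            i -= 1
            h -= 1
        if mismatch <= k:
            matches.append(j)
        j += d
    return matches


if __name__ == "__main__":
    print(k_mismatch_search("GCATCGCAGAGAGTATGCAGAGCG", "GCAGAGAG", 1))  # [13, 24]
```

The inner loop always inspects at least k+1 characters before it can stop, so every one of the last k+1 window positions contributes to d.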
13
Complete example for approximate string matching
Example 1:
Let k=1, m=4, n=17
T: T T A A C G T A A T G C A G C T A
P: A G C T
Σ
A C G T
D1[i=4, a] 3 1 2 4
D1[i=3, a] 2 3 1 4
Example 1 (1/6)
T: T T A A C G T A A T G C A G C T A
P: A G C T

Example 1 (2/6)
T: T T A A C G T A A T G C A G C T A
P:     A G C T

Example 1 (3/6)
T: T T A A C G T A A T G C A G C T A
P:         A G C T

Example 1 (4/6)
T: T T A A C G T A A T G C A G C T A
P:               A G C T

Example 1 (5/6)
T: T T A A C G T A A T G C A G C T A
P:                   A G C T

Example 1 (6/6)
T: T T A A C G T A A T G C A G C T A
P:                         A G C T
j ← 16 + d, j ← 16 + 3, j ← 19
jump out of while loop
Example 2:
Let k=1, m=8, n=24
T: G C A T C G C A G A G A G T A T G C A G A G C G
P: G C A G A G A G
Σ
A C G *
D1[i=8, a] 1 6 2 8
D1[i=7, a] 2 5 1 8
Example 2 (1/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P: G C A G A G A G

Example 2 (3/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:   G C A G A G A G

Example 2 (4/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:       G C A G A G A G

Example 2 (5/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:           G C A G A G A G
Report match at position j = 13;
j ← 13 + d, j ← 13 + 2, j ← 15

Example 2 (6/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:               G C A G A G A G

Example 2 (7/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                 G C A G A G A G

Example 2 (8/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                     G C A G A G A G

Example 2 (9/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                       G C A G A G A G

Example 2 (11/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                         G C A G A G A G

Example 2 (13/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                             G C A G A G A G

Example 2 (14/14)
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                                 G C A G A G A G
Report match at position j = 24;
j ← 24 + d, j ← 24 + 2, j ← 26
jump out of while loop
Time complexity
• preprocessing phase in O(m+ kc) time and
O(kc) space complexity.
• searching phase in O(mn) time complexity.
33
Definition of the k differences problem
• Given a pattern string P of length m and a
text string T of length n, we would like to
find all approximate occurrences P in T with
edit distance not larger than k.
34
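The basic approach to solving the problem is to
find, for every i, the edit distance between P and the substrings of T ending at ti
[Ukk85b]:
Let Edit be an (m+1) by (n+1) table such that
Edit(i, j) is the minimum edit distance between
p1p2…pj and any substring of T ending at ti.
\[
\mathrm{Edit}(i, j) = \min
\begin{cases}
\mathrm{Edit}(i-1, j) + 1 \\
\mathrm{Edit}(i, j-1) + 1 \\
\mathrm{Edit}(i-1, j-1) + (\text{if } p_j = t_i \text{ then } 0 \text{ else } 1)
\end{cases}
\]
\[
\mathrm{Edit}(i, 0) = 0, \quad 0 \le i \le n; \qquad \mathrm{Edit}(0, j) = j, \quad 0 \le j \le m
\]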
35
Table Edit must be completely evaluated
column-by-column in time O(mn).
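The column-by-column evaluation can be sketched as follows. This is a minimal illustration of the recurrence, assuming the boundary values Edit(i, 0) = 0 and Edit(0, j) = j; the function names are hypothetical, and only the previous column is kept in memory.

```python
def edit_last_row(text, pattern):
    """Return [Edit(0, m), Edit(1, m), ..., Edit(n, m)] for the table Edit."""
    n, m = len(text), len(pattern)
    prev = list(range(m + 1))  # column i = 0: Edit(0, j) = j
    last_row = [prev[m]]
    for i in range(1, n + 1):
        cur = [0] * (m + 1)  # Edit(i, 0) = 0: an occurrence may start anywhere
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == text[i - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        last_row.append(cur[m])
        prev = cur
    return last_row


def k_difference_ends(text, pattern, k):
    """Return the 1-based positions i where some substring ending at t_i
    is within edit distance k of the pattern."""
    row = edit_last_row(text, pattern)
    return [i for i in range(1, len(row)) if row[i] <= k]


if __name__ == "__main__":
    print(edit_last_row("abc", "abc"))  # [3, 2, 1, 0]
```

The two nested loops make the O(mn) cost visible: every one of the (m+1)(n+1) entries is computed exactly once.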
36
If we can find the positions i at which the edit distance
between P and the substrings of T ending at ti must be larger than k, we
may skip these positions.
This paper is based upon Rule 7 proposed by
Professor Lee.
37
Rule 7
• If k characters in String A do not appear in
String B, Distance(A,B) is not smaller than k.
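Rule 7 can be checked on small inputs with the sketch below. It assumes that "Distance" means the ordinary edit distance between the two whole strings, and that every position of A holding a character absent from B is counted; both function names are hypothetical.

```python
def rule7_lower_bound(a, b):
    """Count the characters of A (by position) that do not appear in B."""
    present = set(b)
    return sum(1 for ch in a if ch not in present)


def edit_distance(a, b):
    """Edit distance between the whole strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    a, b = "kitten", "sitting"
    # The bound never exceeds the true distance.
    print(rule7_lower_bound(a, b), edit_distance(a, b))  # 2 3
```

Each character of A that is missing from B has to be deleted or substituted, which is why the count is a lower bound on the distance.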
38
In the scanning phase, we define some terms first.
A diagonal h of Edit, for h=-m,…, n, consists of all Edit(i, j)
such that i-j=h.
For every Edit(i, j), there is a minimizing arc from Edit(i-1, j)
to Edit(i, j) if Edit(i, j)=Edit(i-1, j)+1, from Edit(i, j-1) to
Edit(i, j) if Edit(i, j)=Edit(i, j-1)+1, and from Edit(i-1, j-1) to Edit(i, j)
if Edit(i, j)=Edit(i-1, j-1) where pj=ti or if Edit(i, j)=Edit(i-1,
j-1)+1 where pj≠ti. The costs of the arcs are 1, 1, 0 and 1,
respectively.
[Figure: the minimizing arcs entering Edit(i, j) from Edit(i-1, j), Edit(i, j-1) and Edit(i-1, j-1); the arcs represent insertion, deletion, substitution (pj≠ti) and match (pj=ti)]
39
A minimizing path is any path that consists of
minimizing arcs and leads from an entry Edit(i, 0) on
the first row of Edit to an entry Edit(h, m) on the last
row of Edit.
A minimizing path is successful if it leads to an entry
Edit(h, m)≤k.
40
Lemma 1: The entries on a successful minimizing path M
are contained in ≤ k+1 successive diagonals of
Edit.
Proof: Each move to a new diagonal comes from either an
insertion or a deletion. If the path touched more than (k+1)
diagonals, it would need more than k operations,
either deletions or insertions, and its cost would exceed k. Thus there cannot be
more than (k+1) diagonals.
[Figure: Edit table with the text t1t2...tn-1tn along the top and the pattern p1p2...pm-1pm down the side, showing a successful minimizing path inside a band of successive diagonals]
41
T: ABCABBA
P: CBABAC
S: C-AB-
P: CBABAC
EDIT(P, S)=3
There are (k+1)=3+1=4 successive diagonals because there are three deletions.
[Figure: Edit table for T and P with the successive diagonals marked]
42
T: BCABDAB
P: CBADB
S: C-ABDAB
P: CBA-D-B
EDIT(P, S)=3
k=3
There are 1+2=3 < (k+1)=3+1=4 successive diagonals because there are one deletion and two insertions.
[Figure: Edit table for T and P with the successive diagonals marked]
43
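By Lemma 1, for each diagonal h, any successful
minimizing path starting at the top of this diagonal stays within
a band of width 1+k+k=2k+1 diagonals.
[Figure: Edit table with the text t1t2...tn along the top and the pattern p1p2...pm-1pm down the side; the path M starts on diagonal h and the band of k diagonals on each side (bandwidth ≤ k of Edit, 2k+1 diagonals in total) is marked]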
44
T: ABCABBA
P: CBABAC
Result
k=3
S: C-AB-
P: CBABAC
EDIT(P, S)=3
The successful minimizing path lies only in the bandwidth ≤ 7 of Edit.
[Figure: Edit table for T and P with the band of k=3 diagonals on each side and the successive diagonals marked]
45
The part of the pattern that falls within the bandwidth ≤ k of Edit is given
a name: the k-environment.
For each j=1, …, m, let the k-environment of the
pattern symbol pj be the string Cj=pj-k…pj+k,
where pa=ε for a<1 and a>m.
[Figure: the k-environment pj-k...pj-1 pj pj+1...pj+k within P]
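The definition of Cj translates directly into string slicing. The sketch below is illustrative only: the name `k_environments` and the dictionary keyed by the 1-based index j are assumptions, and positions outside 1..m are simply dropped, which corresponds to pa=ε.

```python
def k_environments(pattern, k):
    """Return {j: C_j} with C_j = p_(j-k) ... p_(j+k), clipped to the pattern."""
    m = len(pattern)
    return {
        j: pattern[max(0, j - 1 - k):min(m, j + k)]
        for j in range(1, m + 1)
    }


if __name__ == "__main__":
    env = k_environments("GCAGAGAGATG", 2)
    print(env[5], env[8], env[11])  # AGAGA GAGAT ATG
```

In a real implementation the environments would more likely be stored as a Boolean table indexed by (j, character), so that each membership test takes constant time.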
46
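[Figure: Edit table with the text t1t2...ti...tn along the top and the pattern p1p2...pj...pm-1pm down the side; the band of 2k+1 diagonals around diagonal h (bandwidth ≤ k of Edit) is marked]
The longest vertical path in any minimizing path has
length not greater than 2k+1. We only have to determine
whether ti appears in the k-environment of pj.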
47
Given T=ATGCGAGAGAT, P=GCAGAGAGATG, and
k=2. We select the three characters t5, t8 and t11.
The 2-environment for t5 is C5=p3p4p5p6p7=AGAGA.
The 2-environment for t8 is C8=p6p7p8p9p10=GAGAT.
The 2-environment for t11 is C11=p9p10p11=ATG.
T: A T C G C A G A G A T   (t5, t8 and t11 marked)
P: G C A G A G A G A T G   (p5, p8 and p11 marked)
48
We now obtain a stronger version of Rule 7.
Lemma 2: Let a successful minimizing path M go
through some entry on a diagonal h of Edit. Then for
at most k indexes j, 1≤j≤m, character th+j does not
occur in the k-environment Cj.
A formal proof can be found in the paper. In the
following, we give some intuition for it.
49
In this case, although there are two mismatches, by
deleting a which mismatches x, we may achieve a
perfect match. Thus the edit distance between T and P
may still be 1.
k=1
T
x
y
P
ax
by
50
In this case, it can be seen that deleting one character in
P will not result in a perfect match. Thus, the edit
distance between T and P must be larger than 1.
k=1
T
x
y
P
ab c
ab c
51
The shift table is based on table Dk. We determine the
first diagonal after h, say h+d, where at least one of
the characters th+m, th+m-1, …, th+m-k matches the
corresponding character of P. Finally, the maximum
of k+1 and d is the length of the shift.
When a possible occurrence of P in T is found, the DP approach is immediately used
to find the alignment.
53
Algorithm
Input: P = p1p2…pm, T = t1t2…tn and k
Output: All approximate occurrences of P in T
Initially, the start position h of the window in T is 0;
While h+m ≤ n+k do begin
i=h+m; j=m; bad=0;
While j>k and bad ≤ k do begin
If ti does not occur in Cj then
bad=bad+1;
j=j-1; i=i-1 end;
If bad ≤ k then
W is the sequence from th-k to th+m.
Use dynamic programming to align W with P.
Output the alignment result.
Calculate the shift d=min(Dk[m, th+m], Dk[m-1, th+m-1], …, Dk[m-k, th+m-k]);
h=h+max(k+1,d) end;
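The filtering test of the inner loop can be sketched as below. The function name `bad_count` is hypothetical. As an assumption, the sketch follows Lemma 2 and the worked example in checking every pattern index j from m down to 1, stopping as soon as more than k bad characters are seen, and it expects the window t(h+1)…t(h+m) to lie inside the text.

```python
def bad_count(text, pattern, k, h):
    """Count window characters t_(h+j) that are absent from the k-environment C_j.

    h is the 0-based start offset (diagonal) of the window. Counting stops
    once the count exceeds k, since the window is then rejected anyway.
    """
    m = len(pattern)
    if h < 0 or h + m > len(text):
        raise ValueError("window must lie inside the text")
    bad = 0
    for j in range(m, 0, -1):
        env = pattern[max(0, j - 1 - k):min(m, j + k)]  # C_j
        if text[h + j - 1] not in env:                  # t_(h+j), 1-based
            bad += 1
            if bad > k:
                break
    return bad


if __name__ == "__main__":
    T = "GCATCGCAGAGAGTATGCAGAGCG"
    P = "GCAGAGAG"
    print(bad_count(T, P, 1, 0))  # 2: rejected, shifting is needed
    print(bad_count(T, P, 1, 5))  # 0: candidate, verify with dynamic programming
```

Only windows with at most k bad characters are passed on to the dynamic programming step, which is where the savings over the plain O(mn) table come from.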
54
Complete example for approximate string matching
For example :
Let k=1, m=8, n=24
T: G C A T C G C A G A G A G T A T G C A G A G C G
P: G C A G A G A G
Σ
A C G *
D1[i=8, a] 1 6 2 8
D1[i=7, a] 2 5 1 8
Example(1/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P: G C A G A G A G
t8=A appears in P(7,8)
t7=C does not appear in P(6,8)
t6=G appears in P(5,7)
t5=C does not appear in P(4,6)
Shifting is needed now.

Example(2/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:   G C A G A G A G
t9=G appears in P(7,8)
t8=A appears in P(6,8)
t7=C does not appear in P(5,7)
t6=G appears in P(4,6)
t5=C does not appear in P(3,5)
Shifting is needed now.

Example(3/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:       G C A G A G A G
t11=G appears in P(7,8)
t10=A appears in P(6,8)
t9=G appears in P(5,7)
t8=A appears in P(4,6)
t7=C does not appear in P(3,5)
t6=G appears in P(2,4)
t5=C appears in P(1,3)
t4=T does not appear in P(1,2)
Shifting is needed now.
Example(4/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:           G C A G A G A G
W= CGCAGAGAGT
P= GCAGAGAG
       C  G  C  A  G  A  G  A  G  T
    0  0  0  0  0  0  0  0  0  0  0
 G  1  1  0  1  1  0  1  0  1  0  1
 C  2  1  1  0  1  1  1  1  1  1  1
 A  3  2  2  1  0  1  1  2  1  2  2
 G  4  3  2  2  1  0  1  1  2  1  2
 A  5  4  3  3  2  1  0  1  1  2  2
 G  6  5  4  4  3  2  1  0  1  1  2
 A  7  6  5  5  4  3  2  1  0  1  2
 G  8  7  6  6  5  4  3  2  1  0  1
Output : GCAGAGAG
         GCAGAGAG

Example(5/15)
W= CGCAGAGAGT
P= GCAGAGAG
Output : GCAGAGA
         GCAGAGAG

Example(6/15)
W= CGCAGAGAGT
P= GCAGAGAG
Output : -CAGAGAG
         GCAGAGAG

Example(7/15)
W= CGCAGAGAGT
P= GCAGAGAG
Output : CGCAGAGAG
         -GCAGAGAG

Example(8/15)
W= CGCAGAGAGT
P= GCAGAGAG
Output : GCAGAGAGT
         GCAGAGAG-
Example(9/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:               G C A G A G A G
t15=A appears in P(7,8)
t14=T does not appear in P(6,8)
t13=G appears in P(5,7)
t12=A appears in P(4,6)
t11=G appears in P(3,5)
t10=A appears in P(2,4)
t9=G appears in P(1,3)
t8=A does not appear in P(1,2)
Shifting is needed now.

Example(10/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                 G C A G A G A G
t16=T does not appear in P(7,8)
t15=A appears in P(6,8)
t14=T does not appear in P(5,7)
Shifting is needed now.

Example(11/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                     G C A G A G A G
t18=C does not appear in P(7,8)
t17=G appears in P(6,8)
t16=T does not appear in P(5,7)
Shifting is needed now.

Example(12/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                       G C A G A G A G
t19=A appears in P(7,8)
t18=C does not appear in P(6,8)
t17=G appears in P(5,7)
t16=T does not appear in P(4,6)
Shifting is needed now.

Example(13/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                         G C A G A G A G
t20=G appears in P(7,8)
t19=A appears in P(6,8)
t18=C does not appear in P(5,7)
t17=G appears in P(4,6)
t16=T does not appear in P(3,5)
Shifting is needed now.

Example(14/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                             G C A G A G A G
t22=G appears in P(7,8)
t21=A appears in P(6,8)
t20=G appears in P(5,7)
t19=A appears in P(4,6)
t18=C does not appear in P(3,5)
t17=G appears in P(2,4)
t16=T does not appear in P(1,3)
Shifting is needed now.

Example(15/15)
k=1
T: G C A T C G C A G A G A G T A T G C A G A G C G
P:                                 G C A G A G A G
W= TGCAGAGCG
P= GCAGAGAG
       T  G  C  A  G  A  G  C  G
    0  0  0  0  0  0  0  0  0  0
 G  1  1  0  1  1  0  1  0  1  0
 C  2  2  1  0  1  1  1  1  0  1
 A  3  3  2  1  0  1  1  2  1  1
 G  4  4  3  2  1  0  1  1  2  1
 A  5  5  4  3  2  1  0  1  2  2
 G  6  6  5  4  3  2  1  0  1  2
 A  7  7  6  5  4  3  2  1  1  2
 G  8  8  7  6  5  4  3  2  2  1
Result : GCAGAGCG
         GCAGAGAG
jump out of while loop
Time complexity
• preprocessing phase and searching phase in
O(mn/k) time and O(|Σ|n) space complexity.
71
References
• [Bae89a] R. Baeza-Yates, Efficient Text Searching. Ph.D. Thesis, Report
CS-89-17, University of Waterloo, Computer Science Department, 1989.
• [Bae89b] R. Baeza-Yates, String searching algorithms revisited. In:
Proceedings of the Workshop on Algorithms and Data Structures
(ed. F. Dehne et al.), Lecture Notes in Computer Science 382, Springer-Verlag, Berlin, 1989, pp.75–96.
• [BoM77] R. Boyer and S. Moore, A fast string searching algorithm.
Communications of the ACM 20, 1977, pp.762–772.
• [ChL90] W. Chang and E. Lawler, Approximate string matching in
sublinear expected time. In: Proceedings of the 31st IEEE Annual
Symposium on Foundations of Computer Science, 1990, pp.116–124.
• [Fel65] W. Feller, An Introduction to Probability Theory and Its
Applications. Vol. I. John Wiley & Sons, 1965.
• [Fel66] W. Feller, An Introduction to Probability Theory and Its
Applications. Vol. II. John Wiley & Sons, 1966.
• [GaG86] Z. Galil and R. Giancarlo, Improved string matching with k
mismatches. SIGACT News ,Vol. 17, 1986, pp.52–54.
• [GaG88] Z. Galil and R. Giancarlo, Data structures and algorithms for
approximate string matching. Journal of Complexity, Vol. 4, 1988, pp.33–
72.
• [GaP89] Z. Galil and K. Park, An improved algorithm for approximate
string matching. Proceedings of the 16th International Colloquium on
Automata, Languages and Programming, Lecture Notes in Computer
Science 372, Springer-Verlag, Berlin, 1989, pp.394–404.
• [GrL89] R. Grossi and F. Luccio, Simple and efficient string matching with
k mismatches. Information Processing Letters, Vol. 33, 1989, pp.113–120.
• [Hor80] N. Horspool, Practical fast searching in strings. Software Practice
& Experience, Vol. 10, 1980, pp.501–506.
• [JTU90] P. Jokinen, J. Tarhio and E. Ukkonen, A comparison of
approximate string matching algorithms. In preparation.
• [Kos88] S. R. Kosaraju, Efficient string matching. Extended abstract. Johns
Hopkins University, 1988.
• [KMP77] D. Knuth, J. Morris and V. Pratt, Fast pattern matching in strings.
SIAM Journal on Computing, Vol. 6, 1977, pp.323–350.
• [LaV88] G. Landau and U. Vishkin, Fast string matching with k differences.
Journal of Computer and System Sciences, Vol. 37 (1988), 63–78.
• [LaV89] G. Landau and U. Vishkin, Fast parallel and serial approximate
string matching. Journal of Algorithms, Vol. 10 (1989), pp.157–169.
• [Sel80] P. Sellers, The theory and computation of evolutionary distances:
Pattern recognition. Journal of Algorithms, Vol. 1, 1980, pp.359–372.
• [Ukk85a] E. Ukkonen, Algorithms for approximate string
matching. Information and Control, Vol. 64, 1985, pp.100–118.
• [Ukk85b] E. Ukkonen, Finding approximate patterns in strings. Journal of
Algorithms, Vol. 6, 1985, pp.132–137.
• [UkW90] E. Ukkonen and D. Wood, Fast approximate string matching with
suffix automata. Report A-1990-4, Department of Computer Science,
University of Helsinki, 1990.
• [WaF75] R. Wagner and M. Fischer, The string-to-string correction
problem. Journal of the ACM, Vol. 21, 1975, pp.168–173.
74
THANK YOU