A Hybrid Indexing Method for
Approximate String Matching
Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239,
Gonzalo Navarro and Ricardo Baeza-Yates
Advisor: Prof. R. C. T. Lee
Speaker: Y. K. Shieh
1
The approximate string matching problem is:
Given a text T of length n, a pattern P of length m
(n > m), and a threshold k on the number of
"errors" allowed in a match, find all occurrences of
P in T with at most k errors.
2
This paper uses an exhaustive searching mechanism.
We open a window T’ in T with size m+k (Rule 2) and
try to determine whether we are sure that every prefix
T’’ of this window T’ has ed(T’’,P) > k.
If the answer is yes, we ignore this window; otherwise,
we use dynamic programming to examine whether
any prefix T’’ of the window T’ has ed(T’’,P) ≦k.
3
We use dynamic programming to compute the edit
distance between two strings.
A matrix $C_{0 \ldots m,\, 0 \ldots n}$ is filled, where $C_{j,i}$ is the
minimum number of operations needed to match $T_{1 \ldots i}$ to
$P_{1 \ldots j}$. It is computed as follows:
$C_{j,0} = j, \quad C_{0,i} = i$
$C_{j,i} = C_{j-1,i-1}$ if $T_i = P_j$,
$C_{j,i} = 1 + \min(C_{j-1,i},\ C_{j,i-1},\ C_{j-1,i-1})$ otherwise.
$C_{j,0}$ and $C_{0,i}$ are the edit distances between a
string of length j or i and the empty string.
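The recurrence above can be sketched in a few lines of Python. This is only an illustration; the function name `edit_distance_matrix` is our own, and the index convention (first index over P, second over T) follows the slide.

```python
def edit_distance_matrix(text, pattern):
    """Return C with C[j][i] = ed(pattern[:j], text[:i])."""
    n, m = len(text), len(pattern)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j          # j deletions turn P[1..j] into the empty string
    for i in range(n + 1):
        C[0][i] = i          # i insertions turn the empty string into T[1..i]
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if text[i - 1] == pattern[j - 1]:
                C[j][i] = C[j - 1][i - 1]
            else:
                C[j][i] = 1 + min(C[j - 1][i], C[j][i - 1], C[j - 1][i - 1])
    return C


C = edit_distance_matrix("surgery", "survey")
print(C[6][7])  # edit distance between "survey" and "surgery"
```

The last row `C[m]` holds the distance between the whole pattern and every prefix of the text, which is the row the method inspects.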
4
example:
T = surgery
P = survey
k=2
(rows: prefixes of T; columns: prefixes of P)
         s  u  r  v  e  y
      0  1  2  3  4  5  6
   s  1  0  1  2  3  4  5
   u  2  1  0  1  2  3  4
   r  3  2  1  0  1  2  3
   g  4  3  2  1  1  2  3
   e  5  4  3  2  2  1  2
   r  6  5  4  3  3  2  2
   y  7  6  5  4  4  3  2
There are only three prefixes of T, namely surge, surger
and surgery, whose edit distances with P=survey are
smaller than or equal to k=2.
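As a cross-check of this example, the sketch below lists the prefixes of T whose edit distance to P is at most k. It keeps only one row of the matrix at a time; the helper name `prefixes_within_k` is an assumption made for illustration.

```python
def prefixes_within_k(text, pattern, k):
    """Return every prefix of text whose edit distance to pattern is <= k."""
    prev = list(range(len(pattern) + 1))   # row for the empty text prefix
    hits = []
    for i, t in enumerate(text, start=1):
        cur = [i]
        for j, p in enumerate(pattern, start=1):
            if t == p:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        if cur[-1] <= k:                   # distance to the whole pattern
            hits.append(text[:i])
        prev = cur
    return hits


print(prefixes_within_k("surgery", "survey", 2))
# ['surge', 'surger', 'surgery']
```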
5
Let us now see how we can be sure that for a
window T’ with size m+k , for every prefix T’’ of T’,
ed(T’’,P) > k.
We present Lemma 1 of this paper as follows.
6
Lemma 1
Let T’ in T and P be two strings such that ed(T’, P)
≦ k. Let P = P1x1P2x2… xj-1Pj, for strings Pi and xi
and for any j ≧ 1. Then, at least one string Pi appears
in T’ with at most ⌊k/j⌋ errors.
Thus, we always divide the pattern into j pieces. We
shall point out how to divide later.
7
To be more precise, we may say that if ed(T’,P) ≦ k,
there exists a Pi in P and a T’’ in T’ such that
ed(Pi,T’’) ≦ ⌊k/j⌋.
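A short pigeonhole argument, sketched here in our own notation rather than quoted from the paper, suggests why the bound is ⌊k/j⌋. The symbol e_i, the number of edit operations that fall inside piece P_i in an optimal alignment of P with T’, is introduced only for this sketch.

```latex
\begin{align*}
\sum_{i=1}^{j} e_i &\le ed(T', P) \le k, \\
e_i \ge \lfloor k/j \rfloor + 1 \ \text{for all } i
  &\;\Longrightarrow\; \sum_{i=1}^{j} e_i \ge j\,(\lfloor k/j \rfloor + 1) > k .
\end{align*}
```

The second line contradicts the first, so some e_i is at most ⌊k/j⌋, and that piece Pi matches the substring of T’ it is aligned with using at most ⌊k/j⌋ errors.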
8
Lemma 1 tells us that if for all Pi in P and every
substring b in T’, ed(Pi,b) > ⌊k/j⌋, then ed(P,T’) > k.
Suppose that there is a window T’ with size m+k and
for all Pi in P and for every substring b in T’, ed(Pi,b)
> ⌊k/j⌋.
Then, we can be sure that for every prefix T’’ of T’, for
all Pi in P and every substring b in T’’, ed(Pi,b)
> ⌊k/j⌋.
[Figure: a window T’ of T, a prefix T’’ of T’, a substring b of T’’, and a piece Pi of P.]
9
Let us define the following condition.
Condition A: For all Pi in P and every substring b in
T’, ed(Pi, b) > ⌊k/j⌋.
Thus, if Condition A is satisfied, then for every prefix T’’
of T’, ed(T’’,P)>k.
In such a case, we ignore T’ and shift P one step to the
right.
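Condition A can be tested directly, without any index, by a dynamic program in which a match may start at any position of the window. The sketch below is such a brute-force check, not the suffix-tree procedure of the paper; the names `best_substring_distance` and `condition_a` are assumptions.

```python
def best_substring_distance(piece, window):
    """Smallest ed(piece, b) over all substrings b of window."""
    prev = list(range(len(piece) + 1))   # distances to the empty substring
    best = prev[-1]
    for t in window:
        cur = [0]                        # a substring may start anywhere
        for j, p in enumerate(piece, start=1):
            if t == p:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        best = min(best, cur[-1])
        prev = cur
    return best


def condition_a(pieces, window, k):
    """True if every piece needs more than floor(k/j) errors in the window."""
    threshold = k // len(pieces)
    return all(best_substring_distance(p, window) > threshold for p in pieces)


print(condition_a(["CA", "AG"], "ACGGA", 1))  # True: the window can be skipped
print(condition_a(["CA", "AG"], "GACAC", 1))  # False: CA occurs exactly
```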
10
Question: how can we be sure that the above condition
is satisfied?
The approach:
For each Pi, we generate all possible modified strings Pi’
whose distances to Pi are smaller than or equal to k.
After generating all possible modified Pi’s, we
may use the suffix tree of T to find all occurrences
of Pi, for all i, in T with at most ⌊k/j⌋ errors.
11
We still have the following questions:
• Question 1. How to divide P into j pieces?
• Question 2. How to generate all modified Pi’s?
• Question 3. How to find the occurrences of Pi’s in T
with edit distance less than or equal to ⌊k/j⌋?
12
Question 1: How to divide P into j pieces?
It can be proved that an optimal method is to
partition P into j pieces with
j = (m + k) / log_σ n, where σ is the
alphabet size. We get j pieces of P, and the
size of every piece is around log_σ n.
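The formula can be tried out numerically. The sketch below makes two assumptions that the slides do not state: the real value of j is rounded to the nearest integer, and the pieces are contiguous and of nearly equal length (the separators xi of Lemma 1 are taken to be empty). The function name `split_pattern` is ours.

```python
import math


def split_pattern(pattern, k, n, sigma):
    """Return (real-valued j, list of pieces) for a text of length n."""
    m = len(pattern)
    j_real = (m + k) / math.log(n, sigma)
    j = min(m, max(1, round(j_real)))      # assumption: round to nearest
    base, extra = divmod(m, j)
    pieces, start = [], 0
    for idx in range(j):
        size = base + (1 if idx < extra else 0)
        pieces.append(pattern[start:start + size])
        start += size
    return j_real, pieces


j_real, pieces = split_pattern("CAAG", k=1, n=17, sigma=3)
print(pieces)  # ['CA', 'AG']
```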
13
Question 2. How to generate all modified Pi’s?
The generation of all modified strings whose distances
to P are at most k can be done trivially. One method can be found in
[HHLS2006], which was reported by C. W. Lu.
Another method can be found in [HM2007], reported
by L. C. Chen.
In this paper, the authors used the second method,
the one in [HM2007].
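For small pieces, the set of modified strings can also be enumerated by brute force. The sketch below is not the method of [HHLS2006] or [HM2007]; it simply applies every single insertion, deletion and substitution k times. The name `neighborhood` and the explicit `alphabet` argument are assumptions.

```python
def neighborhood(pattern, k, alphabet):
    """Return all strings within edit distance k of pattern."""
    seen = {pattern}
    frontier = {pattern}
    for _ in range(k):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])          # insertion
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])          # deletion
                    for c in alphabet:
                        nxt.add(s[:i] + c + s[i + 1:])  # substitution
        frontier = nxt - seen
        seen |= nxt
    return seen


variants = neighborhood("CA", 1, "ACG")
print(sorted(variants, key=lambda s: (len(s), s)))
```

The size of this set grows quickly with k and with the alphabet size, which is one reason to keep the pieces short.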
14
We can use non-deterministic finite automata
(NFA).
An NFA is a five-tuple M=(Q, Σ, δ, q0, F), where
Q is a finite set of states, Σ is a finite input
alphabet, δ is a mapping from Q×(Σ∪{ε}) into
the set of subsets of Q, q0 ∈ Q is an initial state,
and F ⊆ Q is a set of final states.
15
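$L_k(P) = \{\, Y \mid Y \in \Sigma^{*},\ D_L(P, Y) \le k \,\}$
[Figure: NFA for P = abac with k = 2. States are labeled (i, e); the annotations read "One matched, no error" at (1,0), "One matched, one error" at (1,1), "Two matched, two errors" at (2,2) and "Four matched, no error" at (4,0). States (4,1) and (4,2) are annotated "one error" and "two errors". Transitions are labeled a, b, c and ε.]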
P = abac, k = 2.
The finite automaton M accepts Lk(P).
Lk(P)={aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc,
baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca,
aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb,
abcc, baac, babc, bbac, bbbc, bcac}.
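The way such an automaton processes an input can be sketched by tracking the set of active states (i, e). The code below simulates the textbook Levenshtein NFA, which also has a transition for an extra input character; it may therefore accept strings that are not in the list above, and it is not claimed to be the exact automaton drawn on the slide. The name `nfa_accepts` is an assumption.

```python
def nfa_accepts(pattern, k, text):
    """Simulate an NFA with states (i, e): i pattern characters consumed, e errors."""
    m = len(pattern)

    def closure(states):
        # epsilon moves: skip one pattern character at the cost of one error
        out, stack = set(states), list(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in out:
                out.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return out

    active = closure({(0, 0)})
    for c in text:
        nxt = set()
        for i, e in active:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))          # match
            if e < k:
                nxt.add((i, e + 1))          # extra input character
                if i < m:
                    nxt.add((i + 1, e + 1))  # substitution
        active = closure(nxt)
        if not active:
            return False                     # out of active states
    return any(i == m for i, _ in active)


print(nfa_accepts("abac", 2, "aa"))    # True
print(nfa_accepts("abac", 2, "cccc"))  # False
```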
16
[Figure: the same NFA for P = abac with k = 2, with the path that recognizes the input aa marked.]
Recognize aa
17
Full example:
T = GACACGGACCAAAGCAG
P = CAAG
k=1
n = 17
m=4
18
P = CAAG
j = (m + k) / log_σ n
= (4 + 1) / log_3 17
= 1.9388
Therefore, we partition P into two pieces.
P1 = CA
P2 = AG
According to Lemma 1, at least one piece appears in
substrings of T with at most ⌊1/2⌋ = 0 errors.
This means that we want to find exact matches of P1 and
P2.
19
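NFA with k = 1 of P1 = CA:
[Figure: states (0,0), (1,0), (2,0) on the zero-error row and (1,1), (2,1) on the one-error row, with transitions labeled C, A and ε.]
NFA with k = 1 of P2 = AG:
[Figure: states (0,0), (1,0), (2,0) on the zero-error row and (1,1), (2,1) on the one-error row, with transitions labeled A, G and ε.]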
20
T = GACACGGACCAAAGCAG
We construct the suffix tree of T.
[Figure: the suffix tree of T$, with edges labeled by A, C, G and $ and leaves labeled by the suffix start positions 1 to 17.]
21
We only need to consider the tree levels from the root down to depth ⌈log_3 17⌉ = 3.
[Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3, with the suffix start positions 1 to 17 listed under the remaining nodes.]
22
T = GACACGGACCAAAGCAG
k=1
[Figures: the NFA of P1 = CA and the NFA of P2 = AG are run over the branches of the truncated suffix tree. A branch is abandoned when an NFA is "out of active states"; a branch that only reaches a one-error state is marked "(not exact match)". On the branch spelling AG, the NFA of P2 reaches its zero-error final state, marked "(exact match)"; the leaves below it are 13 and 16.]
We record positions 13 and
16 where AG occurs.
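The recorded positions can be cross-checked with a much simpler index. The sketch below swaps the suffix tree for a naively built suffix array (sorted suffix start positions) and uses binary search; this is a stand-in chosen for brevity, not the structure used in the paper, and the names `build_suffix_array` and `occurrences` are assumptions.

```python
from bisect import bisect_left

T = "GACACGGACCAAAGCAG"


def build_suffix_array(text):
    """Start positions (0-based) of all suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, piece):
    """1-based start positions of the exact occurrences of piece."""
    suffixes = [text[i:] for i in sa]
    lo = bisect_left(suffixes, piece)   # first suffix >= piece
    hits = []
    while lo < len(sa) and suffixes[lo].startswith(piece):
        hits.append(sa[lo] + 1)
        lo += 1
    return sorted(hits)


sa = build_suffix_array(T)
print(occurrences(T, sa, "AG"))  # [13, 16]
print(occurrences(T, sa, "CA"))  # [3, 10, 15]
```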
26
T = GACACGGACCAAAGCAG
k=1
[Figures: the traversal continues on the branches starting with C. On the branch spelling CA, the NFA of P1 reaches its zero-error final state, marked "(exact match)", while the NFA of P2 is out of active states.]
We record positions 3, 10 and 15 where CA occurs.
28
T = GACACGGACCAAAGCAG
k=1
[Figures: the remaining branches are traversed in the same way. Each of them ends with the NFAs out of active states or in a one-error state marked "(not exact match)", so no further positions are recorded.]
35
After we find all probable positions in T, we
verify every window that contains one of those positions.
The probable positions of T are: 3, 10, 13, 15, 16
We use dynamic programming to verify whether
any approximate match occurs between
T and P at the above locations.
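The verification phase can be sketched as follows. The code slides a window of size m+k (shorter near the end of T) over the text, skips windows that contain no probable position, and runs the dynamic program on the others. The window rule and the name `verify` reflect our reading of the slides and are assumptions, not code from the paper.

```python
def verify(text, pattern, k, probable):
    """Return (window start, matching prefix) pairs, positions 1-based."""
    n, m = len(text), len(pattern)
    found = []
    for s in range(1, n - (m - k) + 2):       # last window has size m-k
        e = min(s + m + k - 1, n)
        if not any(s <= p <= e for p in probable):
            continue                          # no probable position: skip
        window = text[s - 1:e]
        prev = list(range(m + 1))
        for i, t in enumerate(window, start=1):
            cur = [i]
            for j, p in enumerate(pattern, start=1):
                if t == p:
                    cur.append(prev[j - 1])
                else:
                    cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
            if cur[-1] <= k:
                found.append((s, window[:i]))
            prev = cur
    return found


matches = verify("GACACGGACCAAAGCAG", "CAAG", 1, [3, 10, 13, 15, 16])
for start, prefix in matches:
    print(start, prefix)
```

For this example the sketch reports CACG at position 3, CAA, CAAA and CAAAG at position 10, AAAG at 11, AAG at 12 and CAG at 15.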
36
k=1
The probable positions of T are 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 1 to 5 of T: GACAC
         C  A  A  G
      0  1  2  3  4
   G  1  1  2  3  3
   A  2  2  1  2  3
   C  3  2  2  2  3
   A  4  3  2  2  3
   C  5  4  3  3  3
No approximate matching with k=1 found.
37
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 2 to 6 of T: ACACG
         C  A  A  G
      0  1  2  3  4
   A  1  1  1  2  3
   C  2  1  2  2  3
   A  3  2  1  2  3
   C  4  3  2  2  3
   G  5  4  3  3  2
No approximate matching with k=1 found.
38
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 3 to 7 of T: CACGG
         C  A  A  G
      0  1  2  3  4
   C  1  0  1  2  3
   A  2  1  0  1  2
   C  3  2  1  1  2
   G  4  3  2  2  1
   G  5  4  3  3  2
CACG is found.
39
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 4 to 8 of T: ACGGA
This window does not include any probable position.
Therefore we can ignore this window.
40
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 5 to 9 of T: CGGAC
The window does not include any probable position.
Therefore we can shift the window directly.
41
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 6 to 10 of T: GGACC
         C  A  A  G
      0  1  2  3  4
   G  1  1  2  3  3
   G  2  2  2  3  3
   A  3  3  2  2  3
   C  4  3  3  3  3
   C  5  4  4  4  4
No approximate matching with k=1 found.
42
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 7 to 11 of T: GACCA
         C  A  A  G
      0  1  2  3  4
   G  1  1  2  3  3
   A  2  2  1  2  3
   C  3  2  2  2  3
   C  4  3  3  3  3
   A  5  4  3  3  4
No approximate matching with k=1 found.
43
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 8 to 12 of T: ACCAA
         C  A  A  G
      0  1  2  3  4
   A  1  1  1  2  3
   C  2  1  2  2  3
   C  3  2  2  3  3
   A  4  3  2  2  3
   A  5  4  3  2  3
No approximate matching with k=1 found.
44
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 9 to 13 of T: CCAAA
         C  A  A  G
      0  1  2  3  4
   C  1  0  1  2  3
   C  2  1  1  2  3
   A  3  2  1  1  2
   A  4  3  2  1  2
   A  5  4  3  2  2
No approximate matching with k=1 found.
45
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 10 to 14 of T: CAAAG
         C  A  A  G
      0  1  2  3  4
   C  1  0  1  2  3
   A  2  1  0  1  2
   A  3  2  1  0  1
   A  4  3  2  1  1
   G  5  4  3  2  1
CAA, CAAA and CAAAG are found.
46
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 11 to 15 of T: AAAGC
         C  A  A  G
      0  1  2  3  4
   A  1  1  1  2  3
   A  2  2  1  1  2
   A  3  3  2  1  2
   G  4  4  3  2  1
   C  5  4  4  3  2
AAAG is found.
47
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 12 to 16 of T: AAGCA
         C  A  A  G
      0  1  2  3  4
   A  1  1  1  2  3
   A  2  2  1  1  2
   G  3  3  2  2  1
   C  4  3  3  3  2
   A  5  4  3  3  3
AAG is found.
48
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m+k = 5 at positions 13 to 17 of T: AGCAG
         C  A  A  G
      0  1  2  3  4
   A  1  1  1  2  3
   G  2  2  2  2  2
   C  3  2  3  3  3
   A  4  3  2  3  4
   G  5  4  3  3  3
No approximate matching with k=1 found.
49
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m = 4 at positions 14 to 17 of T: GCAG
         C  A  A  G
      0  1  2  3  4
   G  1  1  2  3  3
   C  2  1  2  3  4
   A  3  2  1  2  3
   G  4  3  2  2  2
No approximate matching with k=1 found.
50
k=1
The probable positions of T are: 3, 10, 13, 15, 16
Window of size m-k = 3 at positions 15 to 17 of T: CAG
         C  A  A  G
      0  1  2  3  4
   C  1  0  1  2  3
   A  2  1  0  1  2
   G  3  2  1  1  1
CAG is found.
51
Time complexity
• The preprocessing time for constructing the
automata and the suffix tree of T is O(|N|·m) and
O(n), respectively, where |N| is the number of states in an
NFA and m is the length of the pattern.
• The search time obtained using the partitioning
scheme is O(n^λ log n), where λ < 1 when the error
level tolerated α < 1 − e/√σ, where e = 2.718… .
52
Thank you
57