Transcript Slide 1

Simple and fast linear space computation of longest common subsequences
Claus Rick, 1999
What is the LCS problem?
A = AABAC
B = ABC
Finding a sequence of greatest possible length that can be obtained from both A and B by deleting zero or more (not necessarily adjacent) symbols.
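The definition can be checked on the slide's example with the classic dynamic-programming recurrence. This is only a minimal sketch for orientation, not the linear-space method of the talk; the function name lcs_length is an assumption of this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (classic O(mn) DP)."""
    prev = [0] * (len(b) + 1)          # row for the previous symbol of a
    for x in a:
        cur = [0]                      # column 0: empty prefix of b
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(lcs_length("AABAC", "ABC"))  # 3, for example "ABC"
```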
Some boring history…
Year  Author              Time                         Constants   Paradigm
1975  Hirschberg          O(mn)                        2           Dyn. Prog.
1985  Apostolico, Guerra  O(m lg m + pm)               [2, log m]  contours
1986  Myers               O(n(n-p))                    2           Shortest path
1987  Kumar, Rangan       O(n(m-p))                    3           contours
1990  Wu et al.           O(n(m-p))                    2           Shortest path
1992  Apostolico et al.   O(n(m-p))                    3           contours
1992  Apostolico et al.   O(pm)                        3           contours
1999  Goeman, Clausen     O(min(pm, m lg m + p(n-p)))  [5,25,lgM]  contours
1999  This article        O(min(pm, p(n-p)))           2           contours
Pre-Info
• Divide and conquer
• Midpoint
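The slide names only the two ingredients. As a rough illustration, here is a minimal sketch of the divide-and-conquer scheme credited to Hirschberg (1975) at the end of the talk: split A in the middle, find the column of B where an LCS crosses that split (the midpoint), and recurse on both halves. The names lcs_row and hirschberg are assumptions of this example, and this is the classic scheme, not the contour-based midpoint search presented later.

```python
def lcs_row(a: str, b: str) -> list[int]:
    """Last row of the LCS length table of a against every prefix of b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg(a: str, b: str) -> str:
    """Return one LCS of a and b by divide and conquer."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    n = len(b)
    left = lcs_row(a[:mid], b)                 # LCS(a[:mid], b[:k]) for all k
    right = lcs_row(a[mid:][::-1], b[::-1])    # LCS(a[mid:], b[n-t:]) at index t
    # Midpoint: the split column k that maximises the sum of both halves.
    k = max(range(n + 1), key=lambda k: left[k] + right[n - k])
    return hirschberg(a[:mid], b[:k]) + hirschberg(a[mid:], b[k:])


print(hirschberg("AABAC", "ABC"))  # ABC
```

Only two table rows are alive at any time; the string slices are kept for readability and could be replaced by index ranges.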
Some basic terms
Ordered Pair (i,j)
AABAC
ABC
(2,3) = (A,C): the 2nd symbol of AABAC paired with the 3rd symbol of ABC
Some basic terms
Match: an ordered pair (i,j) whose two symbols are equal
AABAC
ABC
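A minimal sketch that lists the matches of the example, using the 1-based positions of the slides. The function name matches is an assumption of this example.

```python
def matches(a: str, b: str) -> list[tuple[int, int]]:
    """All 1-based ordered pairs (i, j) with a[i] == b[j]."""
    return [(i, j)
            for i, x in enumerate(a, start=1)
            for j, y in enumerate(b, start=1)
            if x == y]


print(matches("AABAC", "ABC"))
# [(1, 1), (2, 1), (3, 2), (4, 1), (5, 3)]
```

The ordered pair (2,3) = (A,C) is not in the list, because its two symbols differ.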
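Some basic terms
Chain: a sequence of matches that is strictly increasing in both components
AABAC
ABC
Rank k: a match has rank k if the longest chain ending at it has length k
AABAC
ABC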
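A minimal sketch that computes the rank of every match in the example, assuming the rank of a match is the length of the longest chain ending at it. It uses the full O(mn) table for clarity; the function name chain_ranks is an assumption of this example.

```python
def chain_ranks(a: str, b: str) -> dict[tuple[int, int], int]:
    """Map each 1-based match (i, j) to the length of the longest chain ending there."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]   # table[i][j] = LCS(a[:i], b[:j])
    ranks = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                ranks[(i, j)] = table[i][j]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return ranks


print(chain_ranks("AABAC", "ABC"))
# {(1, 1): 1, (2, 1): 1, (3, 2): 2, (4, 1): 1, (5, 3): 3}
```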
Some basic terms
Matching Matrix
[Figure: matching matrix with rows a b a c b c b a and columns c b a b b a c a c]
Some basic terms
Dominant matches: all upper-left matches in each rank
[Figure: dominant matches marked in the matching matrix (rows a b a c b c b a, columns c b a b b a c a c), contours numbered 1 to 5; small example AABAC / ABC]
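A minimal sketch of the dominant matches, reading "upper-left" as: a match of rank k is dominant if no other match of the same rank lies weakly above and to the left of it. The quadratic scans are for clarity only, and the function names are assumptions of this example.

```python
def chain_ranks(a: str, b: str) -> dict[tuple[int, int], int]:
    """Rank of each 1-based match: length of the longest chain ending at it."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    ranks = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                ranks[(i, j)] = table[i][j]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return ranks


def dominant_matches(a: str, b: str) -> dict[int, list[tuple[int, int]]]:
    """For each rank k, the matches of rank k with no other rank-k match up-left."""
    by_rank: dict[int, list[tuple[int, int]]] = {}
    for match, k in chain_ranks(a, b).items():
        by_rank.setdefault(k, []).append(match)
    dominant = {}
    for k, pts in by_rank.items():
        dominant[k] = [(i, j) for (i, j) in pts
                       if not any(x <= i and y <= j and (x, y) != (i, j)
                                  for (x, y) in pts)]
    return dominant


print(dominant_matches("AABAC", "ABC"))
# {1: [(1, 1)], 2: [(3, 2)], 3: [(5, 3)]}
```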
Backward contours (BC)
[Figure: backward contours in the matching matrix (rows a b a c b c b a, columns c b a b b a c a c), numbered 5 down to 1]
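Some last basic terms
FCk: the kth forward contour
BCk: the kth backward contour
Forward contours (FC)
[Figure: forward contours in the matching matrix (rows a b a c b c b a, columns c b a b b a c a c), numbered 1 to 5]
Backward contours (BC)
[Figure: backward contours in the same matrix, numbered 5 down to 1]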
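A minimal sketch of both kinds of contours, assuming FCk collects the matches whose longest chain ending at them has length k and BCk those whose longest chain starting at them has length k. Backward ranks are obtained by running the forward computation on the reversed strings. All function names are assumptions of this example.

```python
def forward_ranks(a: str, b: str) -> dict[tuple[int, int], int]:
    """Length of the longest chain ending at each 1-based match."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    ranks = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                ranks[(i, j)] = table[i][j]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return ranks


def backward_ranks(a: str, b: str) -> dict[tuple[int, int], int]:
    """Length of the longest chain starting at each 1-based match."""
    m, n = len(a), len(b)
    rev = forward_ranks(a[::-1], b[::-1])
    return {(m + 1 - i, n + 1 - j): k for (i, j), k in rev.items()}


def contours(ranks: dict[tuple[int, int], int]) -> dict[int, list[tuple[int, int]]]:
    """Group matches by rank: contour k holds all matches of rank k."""
    out: dict[int, list[tuple[int, int]]] = {}
    for match, k in sorted(ranks.items()):
        out.setdefault(k, []).append(match)
    return dict(sorted(out.items()))


print(contours(forward_ranks("AABAC", "ABC")))
# {1: [(1, 1), (2, 1), (4, 1)], 2: [(3, 2)], 3: [(5, 3)]}
print(contours(backward_ranks("AABAC", "ABC")))
# {1: [(5, 3)], 2: [(3, 2), (4, 1)], 3: [(1, 1), (2, 1)]}
```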
Lemma 1
Let p be the length of an LCS between strings A and B. Then for every match (i,j) the following holds:
• There is an LCS containing (i,j) if and only if, for some k, (i,j) is on the kth forward contour and on the (p-k+1)st backward contour.
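Lemma 1 can be tried out on the example. The sketch below assumes the forward rank of a match is the length of the longest chain ending at it and the backward rank the length of the longest chain starting at it, and keeps exactly the matches whose two ranks are k and p-k+1. The function names are assumptions of this example.

```python
def forward_ranks(a: str, b: str) -> dict[tuple[int, int], int]:
    """Length of the longest chain ending at each 1-based match."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    ranks = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                ranks[(i, j)] = table[i][j]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return ranks


def matches_on_some_lcs(a: str, b: str) -> list[tuple[int, int]]:
    """Matches lying on FC_k and BC_(p-k+1) for some k, as in Lemma 1."""
    m, n = len(a), len(b)
    fwd = forward_ranks(a, b)
    bwd = {(m + 1 - i, n + 1 - j): k
           for (i, j), k in forward_ranks(a[::-1], b[::-1]).items()}
    p = max(fwd.values(), default=0)            # length of an LCS
    return sorted(mt for mt, k in fwd.items() if bwd[mt] == p - k + 1)


print(matches_on_some_lcs("AABAC", "ABC"))
# [(1, 1), (2, 1), (3, 2), (5, 3)]
```

The match (4,1) is left out: it is on FC1 but only on BC2, so the longest chain through it has length 1 + 2 - 1 = 2 < 3.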
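Lemma 1- proof
If (i,j) is on FCk, the longest chain ending at (i,j) has length k; if it is on BC(p-k+1), the longest chain starting at (i,j) has length p-k+1.
Joined at (i,j), the two chains have length k + (p-k+1) - 1 = p.
If the backward rank of (i,j) is < (p-k+1), every chain through (i,j) is shorter than p.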
Start calculating
FC1, BC1, FC2, BC2
Sooner or later…
Really really last terms
Define sets Mi as:
M0 = M
M1 = M0 \ FC1
M2 = M1 \ BC1
M(2i-1) = M(2(i-1)) \ FCi
M(2i) = M(2i-1) \ BCi
[Figure: the matching matrix (rows a b a c b c b a, columns c b a b b a c a c) showing M and the sets M1 to M5 left after removing successive contours]
Let us call the first empty Mi… Mp’
Lemma 2
• The length of an LCS is p’ and each match in M(p’-1) is a possible midpoint
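A minimal sketch of the peeling process behind Lemma 2, assuming FCi and BCi are the contours of the full match set M. It removes FC1, BC1, FC2, BC2, … in turn until the set is empty, then reports the index p’ of the first empty set and the contents of M(p’-1). It stores every match, so it illustrates the lemma only, not the linear-space bookkeeping; the function names are assumptions of this example.

```python
def forward_ranks(a: str, b: str) -> dict[tuple[int, int], int]:
    """Length of the longest chain ending at each 1-based match."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    ranks = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                ranks[(i, j)] = table[i][j]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return ranks


def peel(a: str, b: str) -> tuple[int, list[tuple[int, int]]]:
    """Return (p', M(p'-1)): index of the first empty set and the last non-empty set."""
    m, n = len(a), len(b)
    fwd = forward_ranks(a, b)
    bwd = {(m + 1 - i, n + 1 - j): k
           for (i, j), k in forward_ranks(a[::-1], b[::-1]).items()}
    current = set(fwd)                 # M0 = M
    previous: set[tuple[int, int]] = set()
    step = 0
    while current:
        step += 1
        i = (step + 1) // 2            # contour index used in this step
        ranks = fwd if step % 2 == 1 else bwd   # odd steps remove FCi, even steps BCi
        previous, current = current, {mt for mt in current if ranks[mt] != i}
    return step, sorted(previous)


print(peel("AABAC", "ABC"))  # (3, [(3, 2)])
```

On the example, M1 = {(3,2), (5,3)}, M2 = {(3,2)} and M3 is empty, so p’ = 3 and (3,2) is a midpoint.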
Lemma 2- proof
[Figure: sets M0, M1, M2, …, M(k-1), Mk, labelled K, K-1, K-2, …, 1, 0]
K = p
Little problem…
• We can't keep track of every set: it is very expensive
[Figure: two copies of the matching matrix (rows a b a c b c b a, columns c b a b b a c a c)]
What do we do?
Keep only the dominant matches…
When we see a dominant match below, we are done.
[Figure: two copies of the matching matrix (rows a b a c b c b a, columns c b a b b a c a c)]
Let's define:
• FCf’, BCb’: the contours with the minimal indices f’ and b’ as stated above
Lemma 3
• The length of an LCS is b’ + f’ - 1.
Complexity
Finding the dominant matches of each contour: O(min(m, n-p))
Number of contours: p
Total: O(min(pm, p(n-p)))
The End
Simple and fast linear space computation of longest common subsequences
Written by:
Claus Rick, 1999
Based on algorithm by:
D. Hirschberg, 1975
Cast:
Matrices
Lines
Arrows
Squares
Appendix
What is the LCS
Lemma 1
Divide and Conquer
Define M…
Match
Chain
Dominant Matches
Lemma 2
Keep just Dominant…
FC
Lemma 3
BC
Complexity