Transcript Slide 1
Simple and fast linear space computation of longest common subsequences
Claus Rick, 1999

What is the LCS problem?
A = AABAC, B = ABC
Finding a sequence of greatest possible length that can be obtained from both A and B by deleting zero or more (not necessarily adjacent) symbols.

Some boring history…
Year  Author              Time                             Constants       Paradigm
1975  Hirschberg          O(mn)                            2               Dyn. prog.
1985  Apostolico, Guerra  O(m log m + pm)                  [2, log m]      Contours
1986  Myers               O(n(n-p))                        2               Shortest path
1987  Kumar, Rangan       O(n(m-p))                        3               Contours
1990  Wu et al.           O(n(m-p))                        2               Shortest path
1992  Apostolico et al.   O(n(m-p))                        3               Contours
1992  Apostolico et al.   O(pm)                            3               Contours
1999  Goeman, Clausen     O(min(pm, m log m + p(n-p)))     [5, 25, lg M]   Contours
1999  This article        O(min(pm, p(n-p)))               2               Contours

Pre-Info
Divide and conquer
Midpoint

Some basic terms: Ordered pair (i,j)
A = AABAC, B = ABC
(2,3) = (A,C)

Some basic terms: Match
[Figure: a match between AABAC and ABC]

Some basic terms: Chain, Rank k
[Figure: a chain and the rank k of a match, shown on AABAC and ABC]

Some basic terms: Matching matrix
[Figure: matching matrix for the strings abacbcba and cbabbacac]

Some basic terms: Dominant matches
All upper-left matches in each rank.
[Figure: dominant matches in the matching matrix, ranks labelled 1 to 5]

Some last basic terms: FC_k, BC_k
Forward contours (FC)
[Figure: forward contours in the matching matrix, labelled 1 to 5]
Backward contours (BC)
[Figure: backward contours in the matching matrix, labelled 5 to 1]

Lemma 1
Let p be the length of an LCS between strings A and B. Then for every match (i,j) the following holds:
• There is an LCS containing (i,j) if and only if (i,j) is on the kth forward contour and on the (p-k+1)st backward contour.
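Lemma 1 can be tried out on the small example from the slides. The sketch below is not the linear-space algorithm of the article. It is a plain O(mn) dynamic-programming check that assumes the forward rank of a match is the length of the longest chain ending at it, and the backward rank is the length of the longest chain starting at it. The function names lcs_table, contour_ranks and matches_on_some_lcs are made up for this illustration.

```python
def lcs_table(a, b):
    """L[i][j] = LCS length of a[:i] and b[:j] (standard O(mn) DP)."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


def contour_ranks(a, b):
    """Return ({(i, j): (forward rank, backward rank)}, p), positions 1-indexed."""
    m, n = len(a), len(b)
    fwd = lcs_table(a, b)
    # bwd[x][y] = LCS length of the last x symbols of a and the last y of b
    bwd = lcs_table(a[::-1], b[::-1])
    ranks = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                f = fwd[i - 1][j - 1] + 1   # longest chain ending at (i, j)
                k = bwd[m - i][n - j] + 1   # longest chain starting at (i, j)
                ranks[(i, j)] = (f, k)
    return ranks, fwd[m][n]


def matches_on_some_lcs(a, b):
    """Matches on FC_k and BC_(p-k+1) for some k, i.e. f + b - 1 == p."""
    ranks, p = contour_ranks(a, b)
    return sorted(mt for mt, (f, k) in ranks.items() if f + k - 1 == p)


print(matches_on_some_lcs("AABAC", "ABC"))  # [(1, 1), (2, 1), (3, 2), (5, 3)]
```

The match (4,1) is left out: it lies on FC_1 but only on BC_2, so the longest chain through it has length 1 + 2 - 1 = 2, which is less than p = 3.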
Lemma 1 - proof
[Figure: a match on FC_k and BC_(p-k+1); the forward part has length k and the backward part has length p-k+1, giving p in total. With a shorter part on either side the total is less than p.]

Start calculating
FC_1, BC_1, FC_2, BC_2, …
Sooner or later…

Really really last terms
Define sets M_i as:
M_0 = M
M_1 = M_0 \ FC_1
M_2 = M_1 \ BC_1
M_(2i-1) = M_(2(i-1)) \ FC_i
M_(2i) = M_(2i-1) \ BC_i
[Figure: matching matrices for abacbcba and cbabbacac showing M and the successive sets, labelled 5 4 3 2 1]

Let's call the first empty M_i: M_p'

Lemma 2
The length of an LCS is p', and each match in M_(p'-1) is a possible midpoint.

Lemma 2 - proof
[Figure: chain of sets M_0, M_1, M_2, …, M_(k-1), M_k, concluding k = p]

Little problem…
We can't keep track of each set: that is very expensive.
[Figure: matching matrices for abacbcba and cbabbacac]

What do we do?
Keep only the dominant matches…
When we see a dominant match below: done.
[Figure: matching matrices for abacbcba and cbabbacac]

Let's define FC_f' and BC_b' as the contours with the minimal indices f' and b', as stated above.

Lemma 3
The length of an LCS is b' + f' - 1.

Complexity
Finding the dominant matches of each contour: O(min(m, n-p))
Number of contours: p
Total: O(min(pm, p(n-p)))

The End
Simple and fast linear space computation of longest common subsequences
Written by: Claus Rick, 1999
Based on algorithm by: D. Hirschberg, 1975
Cast: Matrices, Lines, Arrows, Squares

Appendix
What is the LCS, Lemma 1, Divide and conquer, Define M…, Match, Chain, Dominant matches, Lemma 2, Keep just dominant…, FC, Lemma 3, BC, Complexity
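The set construction M_0, M_1, M_2, … and the statement of Lemma 2 can be illustrated with a brute-force sketch. It keeps every set explicitly, which is exactly the expensive bookkeeping the slides warn about, so it is not the algorithm of the article. It assumes FC_i and BC_i are the matches of forward rank i and backward rank i, computed here with a plain O(mn) table. The names lcs_table and peel_contours are made up for this illustration.

```python
def lcs_table(a, b):
    """L[i][j] = LCS length of a[:i] and b[:j] (standard O(mn) DP)."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


def peel_contours(a, b):
    """Return [M_0, M_1, ...] up to and including the first empty set."""
    m, n = len(a), len(b)
    fwd = lcs_table(a, b)
    bwd = lcs_table(a[::-1], b[::-1])  # tables for the reversed strings
    # rank[(i, j)] = (forward rank, backward rank) of match (i, j), 1-indexed
    rank = {
        (i, j): (fwd[i - 1][j - 1] + 1, bwd[m - i][n - j] + 1)
        for i in range(1, m + 1)
        for j in range(1, n + 1)
        if a[i - 1] == b[j - 1]
    }
    remaining = set(rank)
    sets = [set(remaining)]            # sets[0] = M_0 = M
    step = 0
    while remaining:
        step += 1
        idx = (step + 1) // 2          # contour index i for M_(2i-1) and M_(2i)
        side = 0 if step % 2 == 1 else 1   # odd step removes FC_i, even step BC_i
        remaining = {mt for mt in remaining if rank[mt][side] != idx}
        sets.append(set(remaining))
    return sets


sets = peel_contours("AABAC", "ABC")
p_prime = len(sets) - 1                # index of the first empty set
print(p_prime, sorted(sets[p_prime - 1]))  # 3 [(3, 2)]
```

For AABAC and ABC the first empty set is M_3, matching the LCS length 3, and the only match left in M_2 is (3,2), the B in the middle of the LCS "ABC".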