On the Sorting-Complexity of Suffix Tree Construction

MARTIN FARACH-COLTON
PAOLO FERRAGINA
S. MUTHUKRISHNAN
Fact From the Previous Talk
Harel and Tarjan 1984,
Bender and Farach-Colton 2000
A tree T with m nodes can be
preprocessed in O(m) time so that,
for any pair of its nodes u, v, lca(u, v)
can be computed in constant time.
What’s in This Paper
• Bounds depend on the alphabet
– Constant size alphabet – O(n) (Weiner 1973)
– For an unbounded alphabet – Θ(n log n)
– For the integer alphabet {1, …, n} – linear time
• RAM algorithm
• DAM algorithm (I/O optimal)
• Algorithm also works for PRAM, PDAM
Talk Outline
• Suffix trees
– Reminder
– Tools
• RAM algorithm for suffix tree construction
• Conclusion
Suffix Trees
S = 1 2 1 1 1 2 2 1 2 2 2 1 $
n = 13
[Figure: suffix tree of S, with leaves labeled by the suffix start positions 1–13. One suffix is called out: S[8,13] = 12221$.]
Suffix Tree Representation
[Figure: the suffix tree of 1 2 1 1 1 2 2 1 2 2 2 1 $, with leaves labeled by suffix start positions 1–13.]
Properties of Suffix Trees
lcp((v), (w)) = |(lca(v, w)|
1
=11
L=2
v
3
=1
L=1
lca(v, w)
1
2
=12
L=2
4
w
1
13
12
2
2
1
5
8
7
11
9
6
10
Suffix Links
Lemma [Weiner 1973]
Let a ∈ Σ and α ∈ Σ*.
If there is a node v in TS such that σ(v) = aα,
then there is a node w in TS such that σ(w) = α.
Define the suffix link as sl(v) = w.
Suffix Links
[Figure: suffix links in the suffix tree of S. Labeled internal nodes: σ = 1 (L = 1), σ = 12 (L = 2), σ = 2 (L = 1), σ = 122 (L = 3).]
Suffix Links Example
[Figure: suffix links drawn on the suffix tree of S.]
Suffix Arrays
• Let Δ = {Si | Si ∈ Σ*, |Si| = ni}
• T = compacted trie of Δ
• An in-order traversal of the leaves gives the strings in lexicographical order: S_p1, …, S_p|Δ|
• Sort array: AT[i] = pi
• Longest common prefix array: LCPT[i] = lcp(S_pi, S_pi+1)
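The two arrays can be checked on the running example with a deliberately naive sketch. It is for illustration only and is not the linear-time construction of the talk. The function name `suffix_and_lcp_arrays` and the encoding of the terminator $ as the integer 3 (so that it sorts after 1 and 2) are assumptions made here.

```python
def suffix_and_lcp_arrays(s):
    """Naive sort array and lcp array of all suffixes of s (illustration only)."""
    n = len(s)
    # Positions are 1-indexed, as on the slides.
    sa = sorted(range(1, n + 1), key=lambda p: s[p - 1:])

    def lcp(p, q):
        # Length of the longest common prefix of the suffixes at p and q.
        k = 0
        while p + k <= n and q + k <= n and s[p + k - 1] == s[q + k - 1]:
            k += 1
        return k

    return sa, [lcp(a, b) for a, b in zip(sa, sa[1:])]


# Running example S = 121112212221$, with $ encoded as 3 (assumption).
S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]
AT, LCPT = suffix_and_lcp_arrays(S)
print(AT)
print(LCPT)
```

The sort costs far more than O(n) here; the point is only to make the definitions of AT and LCPT concrete.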
Suffix Array Example
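[Figure: suffix tree of S = 121112212221$ with its leaves in left-to-right order; one internal node is labeled σ = 11, L = 2.]
i      1  2  3  4  5  6   7  8  9   10  11  12  13
AT     3  4  1  5  8  12  2  7  11  6   10  9   13
LCPT   2  1  2  3  1  0   2  2  1   3   2   0   -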
RAM Algorithm
Input: string S
Output: Ts
Divide and Conquer:
1. Recursively compute To – compacted trie
of suffixes beginning at odd positions
2. Recursively compute Te – compacted trie
of suffixes beginning at even positions
3. Merge Te and To to get Ts
Divide and Conquer Scheme
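[Figure: recursion tree. Divide: A(n) splits into two A(n/2) subproblems, each of which splits into two A(n/4) subproblems. Conquer and Merge: the results S(n/2) are combined into S(n).]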
RAM Algorithm Scheme
[Diagram: boxes connected by arrows numbered 1–6, with phases labeled Divide, Conquer and Merge]
|S| = n, Σ = [n]
|S’| = n/2, Σ’ = [n/2]
TS’ (n/2)
ATs’ (n/2), LCPTs’ (n/2)
ATo (n/2), LCPTo (n/2)
ATe (n/2), LCPTe (n/2)
AT (n/2), LCPT (n/2)
TS (n)
Switching Representations
Suffix Tree → Suffix Array
[Figure: the suffix tree of S = 121112212221$ next to its arrays AT and LCPT from the Suffix Array Example. An in-order traversal of the leaves yields AT.]
Suffix Array → Suffix Tree
[Figure: the arrays AT and LCPT from the Suffix Array Example next to the suffix tree of S = 121112212221$ that they describe.]
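The slides show this conversion only as a picture. One standard way to do it, which may differ from the construction the authors have in mind, keeps the rightmost root-to-leaf path on a stack and scans AT and LCPT from left to right. The sketch below assumes 1-indexed positions, a unique terminator, and a hypothetical node layout (a dict with the string depth `L`, a `children` list and a `leaf` label).

```python
def tree_from_arrays(sa, lcp):
    """Rebuild the compacted trie shape from a sort array and an lcp array."""
    n = len(sa)

    def node(depth, leaf=None):
        return {"L": depth, "children": [], "leaf": leaf}

    root = node(0)
    stack = [root]                      # rightmost path, root first
    for i, p in enumerate(sa):
        l = lcp[i - 1] if i > 0 else 0  # lcp with the previous suffix
        last = None
        while stack[-1]["L"] > l:       # climb until string depth <= l
            last = stack.pop()
        if stack[-1]["L"] < l:          # no node at depth l yet: create one
            inner = node(l)
            stack[-1]["children"].pop()  # detach `last` from its old parent
            inner["children"].append(last)
            stack[-1]["children"].append(inner)
            stack.append(inner)
        leaf = node(n - p + 1, leaf=p)  # string depth of a leaf = suffix length
        stack[-1]["children"].append(leaf)
        stack.append(leaf)
    return root


def leaves(t):
    """Leaf labels in left-to-right order."""
    if t["leaf"] is not None:
        return [t["leaf"]]
    return [x for c in t["children"] for x in leaves(c)]


AT = [3, 4, 1, 5, 8, 12, 2, 7, 11, 6, 10, 9, 13]
LCPT = [2, 1, 2, 3, 1, 0, 2, 2, 1, 3, 2, 0]
tree = tree_from_arrays(AT, LCPT)
print(leaves(tree))
```

Each node is pushed and popped at most once, so the scan is linear in the array length. Reading the leaves back in order recovers AT, which is the opposite conversion.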
Compressing S
• Input: |S| = n, Σ = [n]
• Map character pairs into single characters:
– For i = 1 to n/2 form the pairs (S[2i-1], S[2i])
– Sort lexicographically by radix sort – O(n)
– Remove duplicates
• S’[i] = rank of the pair (S[2i-1], S[2i])
• Now |S’| = n/2 and Σ’ = [n/2]
Example
S = 121112212221$, Σ = [13]
1. Pairs: 1,2  1,1  1,2  2,1  2,2  2,1
2. Ordered pairs: 1,1  1,2  1,2  2,1  2,1  2,2
3. Duplicates removed: 1,1  1,2  2,1  2,2
4. S’ = 212343$, Σ’ = [4]
Decompressing S
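• Input: ATs’, LCPTs’
• Notice: the odd suffix S[2i-1] ··· S[n]$ is encoded by the suffix S’[i] ··· S’[n/2]$
• ATo[i] = 2 · ATs’[i] – 1
• LCPTo[i] = 2 · LCPTs’[i] + b, where
  b = 1 if S[ATo[i] + 2·LCPTs’[i]] = S[ATo[i+1] + 2·LCPTs’[i]]
  b = 0 otherwise
• Example: 11 12 21 22 ··· and 11 12 21 21 ··· agree on three whole pairs, and their fourth pairs 22 and 21 share a first character, so b = 1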
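The two formulas above translate directly into code. In this sketch the function name `decompress` is an assumption, positions are 1-indexed, `s` includes the terminator (encoded as 3 so that it sorts last), and the suffix consisting of the terminator alone is left out of the arrays.

```python
def decompress(s, sa_c, lcp_c):
    """Turn the arrays of S' into the arrays of the odd suffixes of S."""
    def at(p):
        return s[p - 1]                 # 1-indexed access into S

    sa_odd = [2 * p - 1 for p in sa_c]  # ATo[i] = 2 * ATs'[i] - 1
    lcp_odd = []
    for i, l in enumerate(lcp_c):
        # After 2*l matching characters, one more character may still match.
        a, b = sa_odd[i] + 2 * l, sa_odd[i + 1] + 2 * l
        lcp_odd.append(2 * l + (1 if at(a) == at(b) else 0))
    return sa_odd, lcp_odd


# Running example: S = 121112212221$ and S' = 212343$.
S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]
ATo, LCPTo = decompress(S, [2, 1, 3, 4, 6, 5], [0, 1, 0, 1, 0])
print(ATo, LCPTo)
```

The input arrays are the sort and lcp arrays of S’ = 212343$ worked out by hand for this example.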
Building the Even Tree
• Input: ATo, LCPTo
• Observation: if P is an even suffix of S, then P = aP’, where a is a single character and P’ is an odd suffix of S
• To get ATe, radix sort the even suffixes S[2i,n] on the key pairs (S[2i], S[2i+1,n]); the order of the second components is already given by ATo
• lcp(S[2i,n], S[2j,n]) = lcp(S[2i+1,n], S[2j+1,n]) + 1 if S[2i] = S[2j]
  lcp(S[2i,n], S[2j,n]) = 0 otherwise
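A minimal sketch of this step, under several assumptions: the function name `build_even` is hypothetical, `s` includes the terminator (encoded as 3), the odd arrays include the terminator suffix at position 13, Python's sort replaces the radix sort, and the lcp of two non-adjacent odd suffixes is taken as a plain minimum over a slice of LCPTo rather than a constant-time range-minimum query.

```python
def build_even(s, sa_odd, lcp_odd):
    """Derive the sort and lcp arrays of the even suffixes from the odd ones."""
    def at(p):
        return s[p - 1]                      # 1-indexed access into S

    rank = {p: r for r, p in enumerate(sa_odd)}
    evens = range(2, len(s) + 1, 2)
    # Key: (first character, rank of the odd suffix that follows it).
    sa_even = sorted(evens, key=lambda p: (at(p), rank.get(p + 1, -1)))

    def odd_lcp(p, q):
        rp, rq = rank.get(p, -1), rank.get(q, -1)
        if rp < 0 or rq < 0:                 # one suffix is empty
            return 0
        lo, hi = sorted((rp, rq))
        return min(lcp_odd[lo:hi])           # naive range minimum

    lcp_even = [odd_lcp(p + 1, q + 1) + 1 if at(p) == at(q) else 0
                for p, q in zip(sa_even, sa_even[1:])]
    return sa_even, lcp_even


S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]
ATe, LCPTe = build_even(S, [3, 1, 5, 7, 11, 9, 13], [1, 2, 0, 2, 1, 0])
print(ATe, LCPTe)
```

On the running example the even suffixes come out in the order 4, 8, 12, 2, 6, 10, which is the order in which they appear in AT.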
Merging To and Te
Input : ATo, LCPTo and ATe, LCPTe
Trivial method –
sort suffixes lexicographically (n2)
• What if we have an oracle for
lcp(S[2i, n], S[2j-1, n]) ?
• Merge ATo and ATe directly (like sorted lists)
• Compute LCPT from previous results:
1.lcp of adjacent odd suffixes by LCPTo
2.lcp of adjacent even suffixes by LCPTe
3.lcp of odd suffix and even suffix by oracle
•
•
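The merge with an oracle can be sketched as an ordinary merge of two sorted lists. Here the oracle is a naive character-by-character `naive_lcp`, standing in for the constant-time oracle built later in the talk; the names `merge` and `naive_lcp` are assumptions, positions are 1-indexed, and `s` ends with a unique terminator (encoded as 3).

```python
def naive_lcp(s, p, q):
    """Stand-in oracle: lcp of the suffixes at positions p and q."""
    k = 0
    while p + k <= len(s) and q + k <= len(s) and s[p + k - 1] == s[q + k - 1]:
        k += 1
    return k


def merge(s, sa_o, lcp_o, sa_e, lcp_e, oracle):
    """Merge the odd and even arrays into AT and LCPT."""
    out = []                                 # entries are (side, index)
    i = j = 0
    while i < len(sa_o) and j < len(sa_e):
        l = oracle(s, sa_o[i], sa_e[j])
        # The suffixes first differ at offset l; compare that one character.
        if s[sa_o[i] + l - 1] < s[sa_e[j] + l - 1]:
            out.append(("o", i)); i += 1
        else:
            out.append(("e", j)); j += 1
    out += [("o", k) for k in range(i, len(sa_o))]
    out += [("e", k) for k in range(j, len(sa_e))]

    sa = [sa_o[k] if side == "o" else sa_e[k] for side, k in out]
    lcp = []
    for n, ((s1, k1), (s2, _)) in enumerate(zip(out, out[1:])):
        if s1 == s2:                         # neighbours in the same input array
            lcp.append((lcp_o if s1 == "o" else lcp_e)[k1])
        else:                                # odd/even pair: ask the oracle
            lcp.append(oracle(s, sa[n], sa[n + 1]))
    return sa, lcp


S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]
AT, LCPT = merge(S, [3, 1, 5, 7, 11, 9, 13], [1, 2, 0, 2, 1, 0],
                 [4, 8, 12, 2, 6, 10], [1, 1, 0, 1, 3], naive_lcp)
print(AT)
print(LCPT)
```

With a constant-time oracle every comparison costs O(1), so the whole merge is linear; with the naive stand-in used here it is not.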
Coupled-DFS (the uncompacted case)
[Figure, six animation frames: a coupled depth-first search over two uncompacted tries T1 and T2 builds the merged trie TM step by step, identifying nodes of T1 and T2 that are reached by the same edge labels.]
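For the uncompacted case, the coupled traversal can be sketched on tries stored as nested dictionaries. This representation and the names `make_trie` and `coupled_dfs` are assumptions made for illustration.

```python
def make_trie(words):
    """Uncompacted trie as nested dicts: one dict level per character."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
    return root


def coupled_dfs(t1, t2):
    """Walk both tries together; children with the same label are merged."""
    merged = {}
    for c in sorted(set(t1) | set(t2)):
        # A missing child behaves like an empty subtrie.
        merged[c] = coupled_dfs(t1.get(c, {}), t2.get(c, {}))
    return merged


T1 = make_trie(["11", "12"])
T2 = make_trie(["12", "23"])
TM = coupled_dfs(T1, T2)
print(TM)
```

Every node of either input is visited once, so the running time is linear in the total size of the two tries.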
Coupled-DFS (the compacted case)
[Figure, two frames: coupled DFS on two compacted tries T1 and T2, whose edges carry multi-character labels. An edge labeled 1234 meets an edge labeled 12 and is split into 12 and 34 in TM.]
Over-Merging To and Te
• How do we merge compacted tries?
• An over-merge is like a merge, but:
– Compare only the first characters of edges
– If the two edges have different lengths k < l, break the longer edge into pieces of length k and l − k
– Identify edges by their first letter only
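The three rules can be sketched on a hypothetical representation in which a node is a dictionary from first character to a triple (start, length, child), and an edge label is the substring of `s` of the given length beginning at the 0-indexed position `start`. The function name `over_merge` and this layout are assumptions.

```python
def over_merge(s, a, b):
    """Merge two compacted tries while looking only at first characters."""
    out = {}
    for c in set(a) | set(b):
        if c not in b:
            out[c] = a[c]
        elif c not in a:
            out[c] = b[c]
        else:
            (st1, k, c1), (st2, l, c2) = a[c], b[c]
            if k > l:                        # make the first edge the shorter one
                (st1, k, c1), (st2, l, c2) = (st2, l, c2), (st1, k, c1)
            if k < l:
                # Break the longer edge into lengths k and l - k.
                c2 = {s[st2 + k]: (st2 + k, l - k, c2)}
            # Edges are identified without comparing their remaining characters.
            out[c] = (st1, k, over_merge(s, c1, c2))
    return out


s = "1234$12$"
T1 = {"1": (0, 4, {})}                       # one edge labeled 1234
T2 = {"1": (5, 2, {"$": (7, 1, {})})}        # edge 12 followed by edge $
TM = over_merge(s, T1, T2)
print(TM)
```

Because only first characters and lengths are inspected, two edges can be identified even when their labels differ later on, which is why the result is an over-merge rather than the true merge.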
Over-Merge Example
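[Figure: over-merge of the compacted tries T1 and T2 into TM. Edges labeled 12 and 13 agree on first character and length, so they are identified as a single edge, shown as 1x.]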
Over-Merge of Running Example
S = 121112212221$
[Figure, three frames: the odd tree To (leaves 1, 3, 5, 7, 9, 11, 13), the even tree Te (leaves 2, 4, 6, 8, 10, 12), and their over-merge TM.]
Building the lcp Oracle
• Definitions
– A node in both TM and To is odd
– A node in both TM and Te is even
– A node with both odd and even descendants is odd/even
• For every odd/even node u find leaves l_{2i} and l_{2j-1} such that u = lca(l_{2i}, l_{2j-1})
• Compute d(u) = lca(l_{2i+1}, l_{2j})
• Compute δ(u) = depth(u) in the tree of d-pointers
Over-Merge of Running Example
[Figure: the over-merged tree TM of S = 121112212221$, shown again.]
Main Theorem
The function d defines a tree on the odd/even nodes of TM, and for any leaves l_{2i} and l_{2j-1} we have
δ(lca(l_{2i}, l_{2j-1})) = lcp(S[2i,n], S[2j-1,n])
where δ(u) is the depth of u in the tree of d-pointers.
Helpful Observations
Let u be an odd/even node in TM.
u is either even or odd, and so L(u) is defined.
Let u be an even node:
1. For l_{2i} and l_{2j} below u: lcp(S[2i,n], S[2j,n]) ≥ L(u)
2. For l_{2i’-1} and l_{2j’-1} below u: lcp(S[2i’-1,n], S[2j’-1,n]) ≥ L(u)
3. For l_{2i”} and l_{2j”-1} below u: lcp(S[2i”,n], S[2j”-1,n]) ≤ L(u)
The proof is symmetric if u is an odd node.
Lemma
The lcp value of any odd and even pair of leaves
whose lca is u must be the same
Proof:
Suppose lca(l_{2i’}, l_{2j’-1}) = lca(l_{2i”}, l_{2j”-1}) = u.
Let lcp(S[2i’,n], S[2j’-1,n]) = k ≤ L(u).
lcp(S[2i’,n], S[2i”,n]) ≥ L(u) ≥ k
⇒ lcp(S[2i”,n], S[2j’-1,n]) = k
[Figure: the suffixes S[2i’,n], S[2j’-1,n] and S[2i”,n] aligned, with the positions k and L(u) marked.]
Induction on the lcp
Pick a pair of odd and even suffixes
S[2i’,n] and S[2j’-1,n].
Base: If S[2i’] ≠ S[2j’-1] then lca = root
(recall the merge procedure) ⇒ lcp = 0.
Assumption: Suppose the theorem is true for lcp < k.
Induction Step: lcp(S[2i,n], S[2j-1,n]) = k > 0
u = lca(l_{2i}, l_{2j-1}) ⇒ u ≠ root.
Suppose d(u) = lca(l_{2i’+1}, l_{2j’}); then:
δ(u) = 1 + δ(d(u))                      (1: definition of δ)
     = 1 + lcp(S[2i’+1,n], S[2j’,n])    (2: induction assumption)
     = lcp(S[2i,n], S[2j-1,n])          (3: Lemma)
Done!
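The chain of equalities in the induction step rests on a simple recurrence: when two suffixes start with the same character, their lcp is one more than the lcp of the two suffixes that start one position later, and otherwise it is 0. The sketch below checks this recurrence on suffix pairs of the running example. It works on positions rather than on the nodes of TM, so it illustrates the idea behind following d-pointers and is not the oracle itself. The function name `lcp_via_recurrence` is an assumption, positions are 1-indexed, `p` and `q` must differ, and `s` ends with a unique terminator (encoded as 3).

```python
def lcp_via_recurrence(s, p, q):
    """lcp of the suffixes at p and q, computed one step at a time."""
    if s[p - 1] != s[q - 1]:
        return 0                             # base case: first characters differ
    # One step, mirroring delta(u) = 1 + delta(d(u)); parities swap at each step.
    return 1 + lcp_via_recurrence(s, p + 1, q + 1)


S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]
print(lcp_via_recurrence(S, 2, 7))           # even suffix 2 against odd suffix 7
```

The unique terminator guarantees that two distinct suffixes differ before either one runs off the end of the string, so the recursion always stops.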
The End