Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler
Ben Langmead
Based on work by Juha Kärkkäinen
Motivation
• The Burrows-Wheeler Transform (BWT) of a large text allows:
– Fast exact matching
– Compact representation (compared to suffix tree/array)
– More readily compressible (basis of bzip2)
• The FM Index exploits an indexed and compressed BWT to allow:
– Exact matching in time linear in the size of the pattern
– Memory footprint as much as 50% smaller than original string
• FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk
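To make the exact-matching claim concrete, here is a minimal sketch of FM-Index-style backward search over a BWT. The function names `bwt_from_text` and `count_occurrences` are hypothetical, and `occ` is a naive scan, so this toy version does not achieve the linear-in-pattern bound; a real FM Index precomputes rank information so each step is constant time.

```python
def bwt_from_text(text):
    """Naive BWT: sort all rotations and read off the last column.
    Assumes text ends with a unique '$' that sorts before every other character."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def count_occurrences(bwt, pattern):
    """Count occurrences of pattern in the original text using only the BWT."""
    # C[c] = number of characters in the text that are strictly smaller than c
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Number of ch in bwt[:i]; O(n) here, O(1) with rank checkpoints
        return bwt.count(ch, 0, i)

    lo, hi = 0, len(bwt)            # current range of matrix rows
    for ch in reversed(pattern):    # one step per pattern character
        if ch not in C:
            return 0
        lo = C[ch] + occ(ch, lo)
        hi = C[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt = bwt_from_text("acaacg$")
print(bwt, count_occurrences(bwt, "ac"))
```

The loop runs once per pattern character, which is where the "linear in the size of the pattern" bound comes from once `occ` is made constant time.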
Background
• Recall that the BWT is derived from the Burrows-Wheeler matrix, which is related to the suffix array
Text: acaacg$
BWT: gc$aaac
[Figure: suffix array shown beside the Burrows-Wheeler matrix; the BWT is the matrix's last column]
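The relationship between the three objects can be checked on the slide's example text. This is a minimal sketch using naive sorting (fine for seven characters, not for a genome); the helper names `suffix_array` and `bw_matrix` are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bw_matrix(text):
    """All rotations of text, sorted lexicographically."""
    return sorted(text[i:] + text[:i] for i in range(len(text)))


text = "acaacg$"   # '$' is a unique terminator that sorts first
sa = suffix_array(text)
matrix = bw_matrix(text)

# Row r of the matrix is the rotation that starts at position sa[r]
assert all(matrix[r] == text[sa[r]:] + text[:sa[r]] for r in range(len(text)))

# BWT read off the matrix: its last column
bwt_from_matrix = "".join(row[-1] for row in matrix)

# BWT read off the suffix array: the character just before each suffix
# (text[-1] wraps around to '$' when the suffix starts at position 0)
bwt_from_sa = "".join(text[i - 1] for i in sa)

print(sa, bwt_from_matrix, bwt_from_sa)
```

Both routes give the same string, which is why the suffix array alone is enough to produce the BWT.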
Problem
• The memory footprint of building and storing the suffix array is much larger than that of the BWT itself
– Human genome: SA: ~12 GB, BWT: ~0.8 GB
– An attempt to build the BWT over the whole human genome on a 32 GB server exhausts memory and crashes (I tried)
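The two figures above can be roughly reproduced with back-of-the-envelope arithmetic. The inputs below are assumptions, not numbers taken from the slides: a genome of about 3.1 billion bases, one 4-byte integer per suffix array entry, and 2 bits per base for the BWT.

```python
GENOME_BASES = 3.1e9        # assumption: approximate human genome length
SA_BYTES_PER_ENTRY = 4      # assumption: one 32-bit offset per suffix
BWT_BITS_PER_BASE = 2       # assumption: A, C, G, T packed into 2 bits each

sa_gb = GENOME_BASES * SA_BYTES_PER_ENTRY / 1e9
bwt_gb = GENOME_BASES * BWT_BITS_PER_BASE / 8 / 1e9

print(f"SA:  ~{sa_gb:.1f} GB")
print(f"BWT: ~{bwt_gb:.2f} GB")
print(f"SA is ~{sa_gb / bwt_gb:.0f}x larger")
```

Under these assumptions the suffix array is about 16 times the size of the packed BWT, before counting any working memory the sort itself needs.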
Solution
• Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting”
– Theoretical Computer Science, 387 (3), pp. 249-257, Sept. 2007
• Observation:
– BWT[i] depends only on SA[i], not on any other element of SA
• Corollary:
– No need to keep all of SA in memory at once!
• Solution:
– Build SA and BWT a small “chunk” or “block” at a time
– Greatly reduces the memory overhead
• By something like a factor of B, where B = # of blocks
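The observation can be illustrated on the example text: each chunk of the suffix array yields its BWT segment independently of the others. This is a sketch only; the full suffix array is built up front here purely so it can be sliced for the demo, and `bwt_segment` and `block_size` are hypothetical names.

```python
def bwt_segment(text, sa_chunk):
    """BWT characters for one chunk of suffix array entries.
    BWT[i] is the character preceding suffix SA[i], so no other SA rows are needed."""
    return "".join(text[i - 1] for i in sa_chunk)


text = "acaacg$"
# Built in full only for this demo; a blockwise sort would never hold all of it
sa = sorted(range(len(text)), key=lambda i: text[i:])

block_size = 3
segments = [
    bwt_segment(text, sa[start:start + block_size])
    for start in range(0, len(sa), block_size)
]
bwt = "".join(segments)
print(segments, bwt)
```

Concatenating the segments in block order gives the same BWT as processing the whole suffix array at once.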
Solution
• Typical suffix sort:
Solution
• Blockwise suffix sort:
Solution
• Calculate and sort a random sample of the suffixes
Solution
• Samples are used as “bookends” for “buckets”
[Figure: sorted samples delimit buckets B1, B2, B3, B4; $ marks the end of the text]
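A minimal sketch of the sampling step, assuming a fixed random seed and naive string comparison: sorted sample suffixes act as splitters, and a binary search tells which bucket any suffix falls into. The names `choose_splitters` and `bucket_of` are hypothetical, and copying suffixes into strings is only acceptable at toy scale.

```python
import bisect
import random


def choose_splitters(text, num_buckets, seed=0):
    """Pick num_buckets - 1 random suffixes and return them in sorted order."""
    rng = random.Random(seed)
    sample = rng.sample(range(len(text)), num_buckets - 1)
    return sorted(text[i:] for i in sample)


def bucket_of(splitters, suffix):
    """Bucket index in 0..len(splitters): the number of splitters <= suffix."""
    return bisect.bisect_right(splitters, suffix)


text = "acaacg$"
splitters = choose_splitters(text, num_buckets=4)
buckets = {i: bucket_of(splitters, text[i:]) for i in range(len(text))}
print(splitters, buckets)
```

Because the splitters are sorted, bucket indices never decrease when suffixes are visited in lexicographic order, so sorting the buckets one after another yields the full suffix order.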
Solution
• Make B linear-time passes over the text (B = # of buckets); each pass collects the suffixes that belong to one bucket, then sorts that bucket
[Figure: Pass 1 over the text, with buckets B1, B2, B3, B4]
Solution
• After a bucket has been sorted and turned into a BWT segment, it is discarded
[Figure: Pass B over the text, with buckets B1, B2, B3, B4]
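The passes described above can be sketched end to end as follows. This is a toy illustration under stated assumptions, not Kärkkäinen's algorithm as published: suffixes are compared as copied Python strings, splitters are drawn with a fixed seed, and `bwt_segments` and `blockwise_bwt` are hypothetical names.

```python
import random


def bwt_segments(text, num_buckets, seed=0):
    """Yield the BWT one bucket at a time; only one bucket is held at once."""
    n = len(text)
    rng = random.Random(seed)
    splitters = sorted(
        (text[i:] for i in rng.sample(range(n), num_buckets - 1))
    )
    # None marks an open end: the first bucket has no lower bound,
    # the last bucket has no upper bound
    bounds = [None] + splitters + [None]

    for b in range(num_buckets):
        lo, hi = bounds[b], bounds[b + 1]
        # One linear pass over the text: keep suffixes with lo <= suffix < hi
        bucket = [
            i for i in range(n)
            if (lo is None or text[i:] >= lo) and (hi is None or text[i:] < hi)
        ]
        bucket.sort(key=lambda i: text[i:])          # sort just this bucket
        yield "".join(text[i - 1] for i in bucket)   # its BWT segment
        # bucket is rebound on the next iteration, so the old one is discarded


def blockwise_bwt(text, num_buckets, seed=0):
    return "".join(bwt_segments(text, num_buckets, seed))


print(blockwise_bwt("acaacg$", num_buckets=4))
```

Peak memory for suffix positions is bounded by the largest bucket rather than by the whole suffix array, at the cost of B passes over the text.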
Solution
• Good time bounds in the presence of long repeats require use of a difference cover sample
– Acts like an oracle that determines relative lexicographical order of two suffixes that share a prefix of some length v
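A minimal sketch of how such an oracle can work, assuming v = 7 and the difference cover {1, 2, 4} modulo 7 (chosen here for illustration). The sample suffixes are ranked by naive sorting, which a real implementation would replace with an efficient sort; all function names are hypothetical, and the text is assumed to end with a unique '$'.

```python
V = 7
D = (1, 2, 4)   # every residue mod 7 is a difference of two elements of D


def is_difference_cover(d, v):
    return {(a - b) % v for a in d for b in d} == set(range(v))


def shift_into_cover(i, j, d=D, v=V):
    """Smallest k < v such that positions i + k and j + k are both sampled."""
    for k in range(v):
        if (i + k) % v in d and (j + k) % v in d:
            return k
    raise ValueError("d is not a difference cover modulo v")


def build_sample_ranks(text, d=D, v=V):
    """Rank the suffixes starting at sampled positions (naively sorted here)."""
    sample = [p for p in range(len(text)) if p % v in d]
    sample.sort(key=lambda p: text[p:])
    return {p: rank for rank, p in enumerate(sample)}


def suffix_less(text, ranks, i, j, d=D, v=V):
    """True if suffix i sorts before suffix j, reading at most v characters of each."""
    if i == j:
        return False
    a, b = text[i:i + v], text[j:j + v]
    if a != b:
        return a < b
    # The first v characters agree, so the order is decided by the suffixes
    # at i + k and j + k, both of which are in the pre-ranked sample
    k = shift_into_cover(i, j, d, v)
    return ranks[i + k] < ranks[j + k]


text = "ab" * 12 + "$"          # long repeat: naive comparison would scan far
ranks = build_sample_ranks(text)
print(is_difference_cover(D, V), suffix_less(text, ranks, 0, 2))
```

Each comparison costs at most v character comparisons plus two rank lookups, regardless of how long the shared prefix of the two suffixes is.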
Project Goals
• Basic goal:
– Write a correct, usable library implementing blockwise SA sort and BWT building
– Characterize performance and time/space tradeoffs
• Stretch goals:
– Fine-tune for performance and memory usage
– Implement difference cover sample
• Question: is this necessary for good performance on real-life inputs?
Concluding Remarks
• BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way