Hierachy-conscious Data Structures for String Analysis

Download Report

Transcript Hierachy-conscious Data Structures for String Analysis

Hierarchy-conscious Data
Structures for String Analysis
Carlo Fantozzi
PhD Student (XVI ciclo)
Bioinformatics Course - June 25, 2002
Outline of the Talk
We will describe data structures
to efficiently deal with memory hierarchies
while analyzing strings



What is a memory hierarchy?
Computational models
Data structures
Memory Hierarchy




Imposed by technology
Increasing size
Decreasing speed
Orders of magnitude!
ALU
L1 inst
L1 data
L2 cache


Relevant data must be kept
in the faster levels
Who should do it?
Maybe the programmer
L3 cache
main memory
disk(s), network
Computational Models


No model describes the whole hierarchy
External memory: I/O model




RAM: limited capacity M
Disk: transfers in blocks of size B
Cost: number of I/O transfers only
Internal memory: ideal cache model


Cache  RAM, RAM  disk
Cache updates are automatic and optimal
Basic Data Structures: Tries


Trie: tree built on a set of strings
Every string has a path in the tree
PATRICIA trie: nonbranching nodes collapsed,
only 1st character stored


n strings  Q(n) space
Loss of information
b
a
a
b
a
b
b
(a,2)
(a,1)
b
a
(b,1)
(b,1)
(a,2)
(b,2)
B-trees





Balanced search tree (enforced)
~n sorted keys per node, which
partition the keys in the subtrees
One node fits one disk block
Searching for keys matching prefix x
takes O(logBN + k/B) I/Os
Updates: O(logBN) I/Os
k1 k2 k3 … kn
String B-trees (Ferragina & Grossi)


Useful for “big” or “irregular” strings
The keys are pointers to strings





Unsorted strings are in a separate array
Who knows lexicographic order?
Each node contains a PATRICIA trie
built on the corresponding strings
search(x): O(logBn + (|x|+k)/B) I/Os
update(x): O(logBn + |x|/B) I/Os
Cache-oblivious B-trees (1)


Developed by Bender, Demaine, Farach
The B-tree is stored in memory
according to the van Emde Boas layout
h



1
1 2 3 4 5
2 3 4 5
recursively!
Balancing rule; long-distance nodes
search(x): O(logBN + k/B) I/Os
update(x): O(logBN) I/Os for suitable B
Cache-oblivious B-trees (2)

Developed by Brodal, Fagerberg and Jacob
1.
2.
3.


Build a dynamic, binary, balanced search tree
Embed it into a static, binary tree
Store it using the van Emde Boas layout
Same I/Os with a simpler data structure
Cache obliviousness:
The ideal cache model can be
simulated on a realistic, multilevel model
with constant slowdown (expected)
Suffix Trees

Flat memory model: s.t. can be built in



I/O model: studied by Farach et al.
1.
2.
3.

linear time if |S| is finite
sorting time if |S| is infinite
Build the odd tree (through aux suffix tree)
Build the even tree (use the odd tree)
Merge the two trees (through anchor nodes)
Sorting I/O complexity (optimal)
Suffix Arrays (1)




“Simplified suffix tree”
AS[i] = position of i-th suffix of S
search(x): O((|x|/B)logN + k/B) I/Os
Array construction, Manber & Myers:
1.
2.
3.

Put suffixes into buckets
Stage i: sort according to first 2i symbols
Read buckets in order
Takes 8N space and O(NlogN) I/Os
Suffix Arrays (2)

Array construction, Gonnet et al. :
cubic I/O complexity, but…


worst case analysis is too pessimistic
nearly all I/Os are bulk, i.e. sequential
Gonnet’s algorithm beats
the one of Manber & Myers in practice

Ferragina and Grossi: new algorithm


Better I/O complexity than Gonnet’s
Faster than Gonnet’s in practice