Crimson: A Data Management System to Support Evaluating Phylogenetic Tree Reconstruction Algorithms Yifeng Zheng, Stephen Fisher, Shirley Cohen, Sheng Guo, Junhyong Kim.

Crimson: A Data Management System to Support Evaluating Phylogenetic Tree
Reconstruction Algorithms
Yifeng Zheng, Stephen Fisher, Shirley Cohen, Sheng Guo, Junhyong Kim and Susan B. Davidson
Figure panels: Extended Dewey Labeling; Phylogenetic Trees
Background
• Phylogenetics – the science of identifying and understanding evolutionary
relationships among different species
• Cyberinfrastructure for Phylogenetic Research project (CIPRES)
– Design efficient data storage and query capabilities for managing phylogenetic trees
– Evaluate existing phylogenetic tree reconstruction algorithms
• Build “gold standards” by simulating very large phylogenetic trees, as well as sequences for each
species in the tree, according to models that are carefully curated by experts.
– ...
• The Crimson system focuses on providing data management support for CIPRES
simulations.
Technical Challenges
• Phylogenetic trees may contain millions of species, each associated with sequences
of thousands of characters. Managing and querying this data efficiently is important.
• Data management strategies developed for XML are not suitable for phylogenetic
tree management.
– Unlike the XML documents used in web and commercial applications, which are
relatively shallow, phylogenetic trees can be very deep.
• A survey of 200,000 XML documents by Mignet, Barbosa and Veltri (WWW 2003) reported an
average depth of 4 and a maximum depth of 135.
• Simulated phylogenetic trees have an average depth greater than 1000, and the deepest can
exceed 1 million.
Our Solution
• Data storage and index strategy: an extension of the Dewey labeling scheme
• Query evaluation algorithms that achieve high performance
• A user-friendly data management system: the Crimson system
– Sampling a set of species according to a given time
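As background for the Dewey labeling scheme named in the solution list, the following is a minimal sketch of plain Dewey labels on a small tree. It shows only the standard scheme: the extension used by Crimson is not described in this text and is not reproduced here. The tree encoding and the function names (`dewey_labels`, `is_ancestor`, `lca_label`) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Label = Tuple[int, ...]


def dewey_labels(children: Dict[str, List[str]], root: str) -> Dict[str, Label]:
    """Give each node its parent's label plus its 1-based position among siblings."""
    labels: Dict[str, Label] = {root: ()}
    stack = [root]  # explicit stack: very deep trees would overflow recursion
    while stack:
        node = stack.pop()
        for position, child in enumerate(children.get(node, []), start=1):
            labels[child] = labels[node] + (position,)
            stack.append(child)
    return labels


def is_ancestor(a: Label, b: Label) -> bool:
    """A node is a proper ancestor of another if its label is a proper prefix."""
    return len(a) < len(b) and b[:len(a)] == a


def lca_label(a: Label, b: Label) -> Label:
    """The least common ancestor carries the longest common prefix of both labels."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


# Hypothetical three-species tree: ((A, B), C)
children = {"root": ["x", "C"], "x": ["A", "B"]}
labels = dewey_labels(children, "root")
```

With plain Dewey labels, the label length equals the node depth, so labels in a tree that is a million levels deep become correspondingly long; this is one plausible reason a deep tree calls for an extended scheme.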
System Architecture
Input
Query
• The phylogenetic reconstruction problem is NP-hard, so current
algorithms can only handle a relatively small input set. To benchmark
these reconstruction algorithms, we must therefore be able to
efficiently sample a subset of species according to various criteria,
and project the tree pattern induced by the sample in the simulation
tree.
Figure (system architecture), component labels: Sampling, Sampling Strategy, Species with Sequences, Query History, Projection Tree, Simulation Tree, Tree Projector, GUI Manager, Repository Manager, Species Repository, Tree Repository, Benchmark Manager.
Plot x-axis: Leaf nodes in the file (*100), ticks 1 to 6.
• Determine the relationships among a set of species by appealing to an authoritative
tree.
• Given a tree T and a subset S of its leaves, the tree projection of T over S is a “subtree”
T’ in which each edge is a subpath of a path from the root of T to a node in S and each
internal node has at least two children.
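The projection definition above can be illustrated with a minimal sketch. It is a straightforward bottom-up pass over an in-memory tree, not the Crimson query algorithm, and it ignores edge weights (a fuller version would presumably sum the weights along each contracted path). The dictionary encoding and the name `project` are assumptions made for illustration, and every element of `sample` is assumed to be a leaf of the tree.

```python
from typing import Dict, List, Optional, Set, Tuple


def project(children: Dict[str, List[str]], root: str,
            sample: Set[str]) -> Tuple[Optional[str], Dict[str, List[str]]]:
    """Return (projected root, projected child lists) of the tree over `sample`."""
    # Collect nodes so that every node appears after its parent.
    order: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    rep: Dict[str, Optional[str]] = {}   # node that stands for each subtree, if any
    projected: Dict[str, List[str]] = {}
    for node in reversed(order):         # children are handled before parents
        kids = children.get(node, [])
        if not kids:
            rep[node] = node if node in sample else None
            continue
        reps = [rep[k] for k in kids if rep[k] is not None]
        if not reps:
            rep[node] = None             # no sampled leaf below this node
        elif len(reps) == 1:
            rep[node] = reps[0]          # unary node: contract it away
        else:
            projected[node] = reps       # keeps at least two children
            rep[node] = node
    return rep[root], projected


# Hypothetical tree: (((A, B), C), E)
tree = {"root": ["u", "E"], "u": ["v", "C"], "v": ["A", "B"]}
proj_root, proj_children = project(tree, "root", {"A", "C", "E"})
```

In this example the node `v` has only one sampled leaf below it, so it is contracted and `A` becomes a direct child of `u` in the projection.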
References:
Query
Repository
– Tree projection
Tree Viewer
Sampling
• Guarantee that the sampling results are derived from an evolutionary time period.
• Given a tree T whose edge weights represent time, sampling a set of species
according to a given time t returns a subset of T’s leaf set such that all species
whose evaluation time (the weighted distance from the root to the species) is t have the
same number of descendant species sampled.
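One possible reading of the time-based sampling definition above can be sketched as follows. The sketch assumes positive edge weights and t > 0, treats every edge that spans distance t from the root as one lineage alive at time t, and draws the same number k of descendant leaves from each lineage. The tree encoding, the parameter k, and the function names are hypothetical; the strategy actually used by Crimson may differ.

```python
import random
from typing import Dict, List, Tuple

WeightedTree = Dict[str, List[Tuple[str, float]]]  # node -> [(child, edge weight)]


def lineages_at_time(tree: WeightedTree, root: str, t: float) -> List[str]:
    """Return the lower endpoint of every edge that spans distance t from the root."""
    found: List[str] = []
    stack = [(root, 0.0)]
    while stack:
        node, dist = stack.pop()
        for child, weight in tree.get(node, []):
            child_dist = dist + weight
            if dist < t <= child_dist:
                found.append(child)           # this lineage exists at time t
            elif child_dist < t:
                stack.append((child, child_dist))
    return found


def leaves_below(tree: WeightedTree, node: str) -> List[str]:
    """Collect all leaves in the subtree rooted at `node`."""
    leaves: List[str] = []
    stack = [node]
    while stack:
        current = stack.pop()
        kids = tree.get(current, [])
        if kids:
            stack.extend(child for child, _ in kids)
        else:
            leaves.append(current)
    return leaves


def sample_by_time(tree: WeightedTree, root: str, t: float, k: int,
                   rng: random.Random) -> List[str]:
    """Draw k descendant leaves, uniformly at random, from each lineage at time t."""
    sampled: List[str] = []
    for lineage in lineages_at_time(tree, root, t):
        candidates = leaves_below(tree, lineage)
        if len(candidates) < k:
            raise ValueError(f"lineage {lineage!r} has fewer than {k} leaves")
        sampled.extend(rng.sample(candidates, k))
    return sampled


# Hypothetical tree ((A, B), (C, D)) with unit edge weights
example = {
    "root": [("x", 1.0), ("y", 1.0)],
    "x": [("A", 1.0), ("B", 1.0)],
    "y": [("C", 1.0), ("D", 1.0)],
}
picked = sample_by_time(example, "root", t=1.0, k=1, rng=random.Random(0))
```

Raising an error when a lineage has too few leaves is a design choice of this sketch; it keeps the "same number from every lineage" condition strict rather than silently returning an unbalanced sample.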
Time to generate the tree and store it given a 20
leaf node set
Data Loader
• Cyberinfrastructure for Phylogenetic Research (CIPRES) project
(www.phylo.org)
• Susan B. Davidson, Junhyong Kim, Yifeng Zheng: Efficiently
Supporting Structure Queries on Phylogenetic Trees. SSDBM
2005: 93-102
Time to generate and store a subtree from the selected leaves
of a phylogenetic tree with 2000 leaves
Time(seconds)
Performance Results
Phylogenetic Queries
Time(seconds)
– Queries used with phylogenetic trees are also very different from the path-oriented or
restructuring queries supported by XPath and XQuery.
Plot axis ranges: y from 0 to 3; x from 1 to 20, number of randomly selected leaves (*10).