Transcript Slide 1

BioFilter
An architecture for parallel deployment and dynamic chaining
of standalone BioInformatics tools.
Master’s Thesis
Avinash Kewalramani
Outline
•
•
•
•
•
•
•
Motivation.
Abstract.
Software Patterns and BioFilter.
Biofilter architecture.
Performance results.
Future work.
Conclusion.
Motivation
•
•
•
•
•
•
Most Bioinformatics tools are processor intensive.
Few have capabilties to run on multi-processor
machines
Almost none have capabilty to be run on Beowulf
clusters.
Bioinformatics analysis is mostly based on multiple
tools with their Input/Output chained together.
Human intervention to set up this chaining
BioFilter provides the above two capabilities
–
–
Easy and fast deployment of such tools on the
cluster providing a huge gain in performance
Allows creation of dynamic pipelines of different
analysis tools as per the requirement.
Abstract
•
•
•
•
•
•
•
•
Uses “Pipes and Filter pattern” to setup dynamic pipelines.
It uses “Client-Dispatcher-Server” pattern for parallelization.
BioFilter is not tool-specific but more of a plug and play
environment.
Original tools are not altered. So output is consistent with single
CPU machine runs
Cluster hardware, operating system and queuing system
independent
Operating system on the cluster should support multiprocessing
via forks
BioFilter code has been implemented in Object Oriented Perl
Cluster should have a shared disk system and Perl installed.
Typical Annotation Pipelines
Prodom
Blocks
Prosite
Genome
Glimmer
Blast
BlastParser
Psort
tRNAscan
Coils
Seg
FastaRecordsFile
Blast
BlastParser
Sofware Patterns and BioFilter
•
•
Patterns are based on a three part schema
–
Context
–
Problem
–
Solution
Pattern categories
–
Architectural
–
Design
Pipes and Filters pattern
• Context : Processing of data streams
• Problem
– Specifications and Forces
• Avoiding Monolithic system design
• Classes which perform single transformation or analysis
provide greater flexibilty and reuse.
• System can be built by different people on the team.
• Dynamic combination of such classes to form pipelines.
• Common interface provides transparent use.
Pipes and Filters pattern
• Solution
– Divide the task into several sequential
processing steps with the data flowing through
the system connecting these steps.
– Components
•
•
•
•
Filters – Active/Passive
Pipes – Queues/method calls
Data Source -Provider
Data Sink -Consumer
Pipes and Filter pattern
Static structure of a Pull Filter for Blast Pipeline
Pipes and Filters Pattern
Dynamic structure of a Pull Filter for Blast Pipeline
Client-Dispatcher-Server Pattern
• Context: A software system integrating a
set of distributed servers, with the servers
running locally or distributed over a
network.
• Problem
– Specification and Forces
• Software system that uses servers distributed over a
network must provide a means of communication
between them
• Separation of Functional/Communication code.
• Service providers should be location independent.
Client-Dispatcher-Server Pattern
• Solution
– Components
• Client
– Carries out domain specific tasks
– Requests the dispatcher to set up a communication
channel with the server.
• Server
– Provides set of operations to the clients
– Registers with the dispatcher
• Dispatcher
– Location transparency for the Servers.
Client-Dispatcher-Server pattern
Static structure for Blast Filter
Client-Dispatcher-Server pattern
Dynamic structure for Blast Filter
BioFilter Architecture
Framework
• Splitting of the input data set and the database
– Database splitting model is cumbersome.
– Monolithic shared disk database with input data
splitting is easier.
– Capacity computing model
• Exclusivity of partitions of the input data set
• Capability to process one input partition at a time.
– Synchronization of results is an issue.
• Queuing System
– Interaction with queuing system is part of the
architecture
BioFilter Architecture
Static structure-Blast Pipeline
BioFilter Architecture
Dynamic structure-Blast Pipeline
BioFilter Architecture
• Crash Recovery
– TCP/IP based socket communication is blocking
– Timed sockets overcome blocking
– Timed waiting is predetermined but dynamically
computed dependent on the computational
intensity of the job submitted in the
communication between components.
• Interaction with nodes outside the cluster is
possible by manually starting the server or by
rsh utilities
Filter Variants
• Simple Filters.
• Split Filters.
• Join Filters
Concrete Source
Split
FastaRecordsFile
OrfFilter
Simple
Join
Fasta2TblFilter
BuildImmFilter
Simple
GlimmerFilter
Split
TranslationFilter
Performance Results
• 240 node cluster, Each node is a Pentium III 1200 Mhz
processor
• 20GB local disk and memory ranging from 1-2GB.
• Shared disk via Netapps disk server.
• Speedup Ratio for x nodes= Time to run the job on the
initial number of nodes/Time to run the job on x nodes
Performance Results
Benchmark Test 1
• 1000 protein
sequences from
Bacteria (287 AA
average length) blast
against NR database.
• Servers ranging from
1-200
• Non-linear increase in
performance
– Shared Database
bottleneck
– Single Dispatcher
Performance Results
Benchmark Test 2 –Data Scalability
Performance Results
Benchmark Test 3-It’s just not a Blast accelerator
Performance Results
Benchmark Test 3
Future Work
• Tests required
to analyse and
cure the
performance
bottleneck.
• Forking the
pipeline for
added
parallelism,
requires a
queue type of
datastructure
Conclusion
Advantages
•Eliminate intermediate files, provides filter reuse and rapid
prototyping of pipelines by filter recombination and
exchange
•Provides an environment for easy deployment of tools
which are embarrasingly parallel on the Beowulf clusters.
•Development time and Run time preformance gain.
•Encapsulates the queuing system information in the
implementation.
•Interaction with nodes outside the cluster.
•Provides exchangeabilty of servers, location and
migration transperancy, reconfiguration of servers and
fault tolerance
Conclusion
Disadvantages
• Speed increase not proportional to the
number of nodes.
•Absence of a GUI based interface.
• Architecture not tested with tools where the
input data cannot be split.
•Error handling is very difficult.
•Pipeline disaster recovery is difficult.
•Client-Dispatcher-Server pattern makes the
architecture slower.
Thanks and Possibilities
• Advisors
– Dr Sun Kim (Indiana University-School of
Informatics)
– Thomas Brettin (Los Alamos National Labs)
• Clients or Open Source.
– California Digital.
– Rocketcalc.
– Microway.
– Western Scientific.
References
•
•
•
•
•
•
•
•
•
•
•
•
•
•
S. Salzberg, A. Delcher, S. Kasif, and O. White.,"Microbial gene identification using interpolated Markov models"
Nucleic Acids Research 26:2 (1998), 544-548.
A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg., "Improved microbial gene identification with
GLIMMER" Nucleic Acids Research, 27:23, 4636-4641.
Lowe, T.M. & Eddy, S.R. (1997) ``tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic
sequence'', Nucl. Acids Res., 25, 955-964.
Altschul, S.F.,et al., "Basic local alignment search tool.". J Mol Bio. 215 403-410(1990)
Nakai K, Kanehisa M.,"Expert system for predicting protein localization sites in gram-negative bacteria." Proteins.
1991;11(2):95-110.
Lupas, A., Van Dyke, M., and Stock, J.,"Predicting Coled Coils from Protein Sequences", Science 252:1162-1164.
Wootton, J. C. and S. Federhen (1993)., "Statistics of local complexity in amino acid sequences and sequence
databases.", Computers in Chemistry 17:149-163.
Wootton, J. C. and S. Federhen (1996).," Analysis of compositionally biased regions in sequence databases. Methods
in Enzymology 266: 554-571.
Servant F, Bru C, Carrère S, Courcelle E, Gouzy J, Peyruc D, Kahn D (2002) ProDom: Automated clustering of
homologous domains. Briefings in Bioinformatics. vol 3, no 3:246-251
J.G. Henikoff, E.A. Greene, S. Pietrokovski & S. Henikoff, "Increased coverage of protein families with the blocks
database servers", Nucl. Acids Res. 28:228-230 (2000).
S.Henikoff, J.G.Henikoff & S. Pietrokovski, "Blocks+: A non-redundant database of protein alignment blocks derived
from multiple compilations", Bioinformatics 15(6):471-479 (1999).
Sigrist C.J., Cerutti L., Hulo N., Gattiker A., Falquet L., Pagni M., Bairoch A., Bucher P., "PROSITE: a documented
database using patterns and profiles as motif descriptors." Brief Bioinform. 3:265-274(2002).
Mark Grand. "Patterns in Java .Volume 1" ,Second Edition,Wiley Publication Inc ,2002
Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad,Michael Stal."Pattern Oriented Software
Architecture Volume 1:A system of Patterns" Chicester,England:John Wiley and Sons, 1996.