A Global Grid for Analysis of Arthropod Evolution. Craig A. Stewart, Rainer Keller, Richard Repasky, Matthias Hess, David Hart, Matthias Müller, Ray Sheppard, Uwe Wössner, Martin Aumüller, Huian Li, Donald K. Berry, John Colbourne

A Global Grid for Analysis of Arthropod Evolution
Craig A. Stewart, Rainer Keller, Richard Repasky, Matthias Hess, David Hart, Matthias Müller, Ray Sheppard, Uwe Wössner, Martin Aumüller, Huian Li, Donald K. Berry, John Colbourne
Indiana University – University Information Technology Services
Höchstleistungsrechenzentrum Stuttgart (High Performance Computing Center Stuttgart)
Indiana University – Center for Genomics and Bioinformatics
License Terms
• Please cite this presentation as: Stewart, C.A., R. Keller, R. Repasky, M. Hess, D. Hart, M. Müller, R. Sheppard, U. Wössner, M. Aumüller, H. Li, D.K. Berry and J. Colbourne. A Global Grid for Analysis of Arthropod Evolution. 2004. Presentation. Presented at: Grid2004 – 5th IEEE/ACM International Workshop on Grid Computing (Pittsburgh, PA, 8 Nov 2004). Available from: http://hdl.handle.net/2022/14784
• Portions of this document that originated from sources outside IU are shown here and used by permission or under licenses indicated within this document.
• Items indicated with a © or denoted with a source URL are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.
• Except where otherwise noted, the contents of this presentation are copyright 2004 by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work – and to remix – to adapt the work – under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
Outline
• The biological problem
• The software used
• The global grid
• What we learned
• Acknowledgements
Biological problem
Are hexapods (animals with six legs) a single evolutionary group?
Are ecdysozoans (animals that shed their skins) a single evolutionary group?
Phylogenetic inference
• Goal: reconstruct evolutionary history by comparison of DNA sequences
• An NP-hard problem
• Heuristic approach used in maximum likelihood inference
• Data are available; the analysis had never been attempted due to its computational demands
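The maximum-likelihood idea above can be made concrete with a minimal sketch (plain Python, purely illustrative and unrelated to fastDNAml's actual code): two aligned sequences joined by a single branch under the Jukes-Cantor substitution model, with the branch length estimated by a crude grid search.

```python
# Toy maximum-likelihood estimation on the smallest possible "tree":
# two sequences joined by one branch, under the Jukes-Cantor model.
# Sequences and function names here are illustrative only.
import math

def jc_site_prob(same: bool, t: float) -> float:
    """Jukes-Cantor probability of observing a match/mismatch at one site
    after evolving along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def log_likelihood(seq_a: str, seq_b: str, t: float) -> float:
    """Log-likelihood of the alignment given branch length t
    (sites are assumed independent)."""
    return sum(math.log(jc_site_prob(a == b, t)) for a, b in zip(seq_a, seq_b))

def ml_branch_length(seq_a: str, seq_b: str) -> float:
    """Crude grid search for the branch length maximizing the likelihood."""
    grid = [i / 1000.0 for i in range(1, 2000)]
    return max(grid, key=lambda t: log_likelihood(seq_a, seq_b, t))

a = "ACGTACGTACGTACGTACGT"
b = "ACGTACGAACGTACTTACGT"  # 2 of 20 sites differ
t_hat = ml_branch_length(a, b)
```

For these toy sequences the grid search lands near the analytic Jukes-Cantor distance of about 0.107. Real ML programs optimize thousands of branch lengths jointly over entire trees, which is what makes the problem computationally demanding.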
Why this project on a grid?
• Important & time-sensitive biological question requiring massive computer resources
• A biologically oriented code that scales well
• Grid middleware environment & collaboration tool well suited to the task at hand
• Opportunity to create a grid spanning every continent on Earth (except Antarctica)
Software and data analysis
• Non-grid preparatory work
– Download sequences from NCBI (67 taxa, 12,162 bp; mitochondrial genes for 12 proteins)
– Align sequences with Multi-Clustal
– Determine rate parameters with TreePuzzle
• Grid preparatory work
– Analyze performance of fastDNAml with Vampir
– Meetings via Access Grid & COVISE
• The grid software
– PACX-MPI – Grid/MPI middleware
– COVISE – collaboration and visualization
– fastDNAml – maximum likelihood phylogenetics
fastDNAml
• ML analysis of phylogenetic trees based on DNA sequences
• Foreman/worker MPI program
• Fault tolerance for grid computing built into the program since 1998
• For 67 taxa: 2.12 × 10^109 possible trees
• Goal: 300 bootstraps, 10 jumbles per bootstrap – 3,000 executions (more than 3× typical!)
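The tree count on this slide can be checked directly: for n taxa there are (2n−5)!! distinct unrooted binary trees, and a short calculation (Python, illustrative) reproduces the 2.12 × 10^109 figure for 67 taxa.

```python
def num_unrooted_trees(n: int) -> int:
    """(2n-5)!!: the number of distinct unrooted binary trees
    on n labeled taxa (n >= 3)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

trees_67 = num_unrooted_trees(67)  # about 2.12e109, matching the slide
```

Small cases are easy to verify by hand (4 taxa give 3 trees, 5 taxa give 15), and the result for 67 taxa makes it obvious why exhaustive search is out of the question and a heuristic is required.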
PACX-MPI
• PACX-MPI (PArallel Computer eXtension) enables seamless execution of MPI-conforming parallel applications on a grid.
• The application is recompiled and linked with PACX-MPI.
• Communication between MPI processes within one system is done with the vendor MPI, while communication with other parts of the metacomputer goes over the connecting network.
• Key advantages:
– The optimized vendor MPI library is used.
– Two daemons (MPI processes) take care of communication between systems, which allows bundling of communication.
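The routing rule above can be illustrated with a toy model (plain Python; the class and method names are invented for illustration and are not the PACX-MPI API): messages between ranks on the same host are delivered directly, while cross-host messages are queued in a per-host daemon and forwarded as one bundle.

```python
# Toy model of PACX-MPI-style routing: intra-host traffic goes directly
# ("vendor MPI"); inter-host traffic is bundled by a per-host daemon.
# Host names and ranks below are illustrative only.
from collections import defaultdict

class Host:
    def __init__(self, name, ranks):
        self.name = name
        self.ranks = set(ranks)
        self.inbox = defaultdict(list)       # rank -> delivered (src, payload)
        self.out_daemon = defaultdict(list)  # dest host name -> queued messages

    def send(self, src, dest_rank, payload, hosts):
        if dest_rank in self.ranks:          # intra-host: direct delivery
            self.inbox[dest_rank].append((src, payload))
        else:                                # inter-host: queue in out-daemon
            dest_host = next(h for h in hosts if dest_rank in h.ranks)
            self.out_daemon[dest_host.name].append((src, dest_rank, payload))

    def flush(self, hosts):
        """Daemon forwards each queued bundle in one transfer per host."""
        for host_name, bundle in self.out_daemon.items():
            dest = next(h for h in hosts if h.name == host_name)
            for src, dest_rank, payload in bundle:
                dest.inbox[dest_rank].append((src, payload))
        self.out_daemon.clear()

hosts = [Host("hlrs", range(0, 4)), Host("iub", range(4, 8))]
hlrs, iub = hosts
hlrs.send(0, 1, "local", hosts)      # stays on hlrs, delivered directly
hlrs.send(0, 5, "remote-a", hosts)   # queued in hlrs's out-daemon
hlrs.send(2, 6, "remote-b", hosts)   # bundled with the previous message
hlrs.flush(hosts)                    # one transfer carries both messages
```

Here both cross-host messages leave hlrs in a single flush, which is the bundling benefit the slide describes: fewer, larger transfers over the slow wide-area link.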
COVISE
• COllaborative VIsualization and Simulation Environment
• Focus: collaborative & interactive use of supercomputers
• Interactive startup of calculations on the grid
• Real-time visualization of the results
• Application framework
Work of Matthias Hess, HLRS
GliederfüßlerGrid (German for "Arthropod Grid")
The Metacomputers
Five metacomputers (One through Five) were assembled from the following systems (system – processors – site):
• SGI Origin 2000 – 32 – CEPBA (Spain)
• Linux cluster – 64 – AIST (Japan)
• Linux cluster – 12 – ANU (Australia)
• T3E – 128 – HLRS (Germany)
• IBM SP – 64 – IUB (US)
• DEC Alpha – 4 – USP (Brazil)
• Sunfire 6800 – 16 – NUS (Singapore)
• Hitachi SR8000 – 32 – Germany
• Cray T3E – 128 – MCC (UK)
• Cray T3E – 32 – PSC (US)
• IBM SP (Blue Horizon) – 32 – SDSC (US)
• DEC Alpha (Lemieux) – 64 – PSC (US)
• Linux system – 1 – ISET'Com (Tunisia)
8 types of systems (several on the Top500 list & TeraGrid); 6+ vendors; 641 processors; 9 countries; 6 continents
Results of one run
Conclusions
• Results
– The grid actually worked (HPC Challenge award)
– Real science was done (500 runs, 5,318,281 trees analyzed, 7,800 CPU hours used)
• Lessons learned
– Access Grid was essential
– CVS is good
– Importance of fault tolerance & its interaction with network speeds
– Importance of the grid frameworks
– Firewall issues & the value of PACX-MPI
• Going forward
– The key value of the grid approach was in reducing wall-clock time to amounts tolerable for the application scientists!
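The fault-tolerance lesson can be sketched as follows (plain Python, purely illustrative of the foreman/worker pattern; the worker names and failure model are invented, not fastDNAml's actual code): the foreman tracks outstanding work and re-queues any task held by a worker that dies, so the surviving workers finish the job.

```python
# Illustrative foreman/worker dispatcher with fault tolerance:
# a failing worker loses its task, and the foreman re-queues it.
from collections import deque

def run_foreman(tasks, workers, failing=frozenset()):
    """Dispatch tasks round-robin; re-queue a task when its worker fails."""
    queue = deque(tasks)
    alive = list(workers)
    results = {}
    while queue:
        if not alive:
            raise RuntimeError("all workers failed")
        for worker in list(alive):
            if not queue:
                break
            task = queue.popleft()
            if worker in failing:      # simulated node failure mid-task
                alive.remove(worker)   # foreman notices the dead worker...
                queue.append(task)     # ...and re-queues its lost task
            else:
                results[task] = f"tree-from-{worker}"
    return results

# Six bootstrap replicates across three hypothetical sites, one of which fails:
out = run_foreman(range(6), ["hlrs-0", "iub-0", "psc-0"], failing={"iub-0"})
```

All six tasks complete despite the failure. On a slow wide-area link a dead worker can take a long time to detect, which is the interaction between fault tolerance and network speed noted in the lessons learned.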
Acknowledgments
• This research was supported in part by the Indiana
Genomics Initiative. The Indiana Genomics Initiative
of Indiana University is supported in part by Lilly
Endowment Inc.
• This work was supported in part by Shared
University Research grants from IBM, Inc. to Indiana
University.
• This material is based upon work supported by the
National Science Foundation under Grant No.
0116050 and Grant No. CDA-9601632. Any
opinions, findings and conclusions or
recommendations expressed in this material are
those of the author(s) and do not necessarily reflect
the views of the National Science Foundation (NSF).
• Assistance with this presentation: John Herrin,
Malinda Lingwall, W. Les Teach, Jennifer Fairman
• Thanks to the SciNet team and SC2003 organizers!
Jennifer Steinbachs
Center for Genomics and Bioinformatics, Indiana University
Gary W. Stuart
Center for Genomics and Bioinformatics, Indiana University
Michael Resch
HLRS, University of Stuttgart
Eric Wernert
UITS, Indiana University
Markus Buchhorn
Australian National University
Hiroshi Takemiya
National Institute of Advanced Industrial Science & Technology, Japan
Rim Belhaj
ISET'Com, Tunisia
Wolfgang E. Nagel
ZHR, Technical University of Dresden
Sergiu Sanielevici
Pittsburgh Supercomputing Center
Sergio Takeo Kofuji
LCCA/CCE-USP
David Bannon
Victorian Partnership for Advanced Computing, Australia
Norihiro Nakajima
Japan Atomic Energy Research Institute
Rosa Badia
CEPBA-IBM Research Institute
Mark A. Miller
San Diego Supercomputer Center
Hyungwoo Park
Korea Institute of Science and Technology Information
Rick Stevens
Argonne National Laboratory
Fang-Pang Lin
National Center for High Performance Computing
John Brooke
Manchester Computing
David Moffett
Purdue University
Tan Tin Wee
National University of Singapore
Greg Newby
Arctic Region Supercomputer Center
J.C.T. Poole
CACR, Caltech
Ramched Hamza
Sup'Com, Tunisia
Mary Papakhian, John N. Huffman
UITS, Indiana University
Leigh Grundhoeffer
UITS, Indiana University
Ray Sheppard
UITS, Indiana University
Peter Cherbas
Center for Genomics and Bioinformatics, Indiana U.
Stephen Pickles, Neil Stringfellow
CSAR, University of Manchester
Arthurina Breckenridge
HLRS, University of Stuttgart
Our partners
Questions?
Be sure to check out the current issue of Communications
of the ACM Special Section on Bioinformatics – especially
the article "The Emerging Role of BioGrids"