Computational Biology: Practical
lessons and thoughts for the future
Dr. Craig A. Stewart
[email protected]
Visiting Scientist, Höchstleistungsrechenzentrum, Universität Stuttgart
Director, Research and Academic Computing, University Information Technology Services
Director, Information Technology Core, Indiana Genomics Initiative
25 June 2003
License terms
• Please cite as: Stewart, C.A. Computational Biology: Practical lessons and thoughts for the future. 2003. Presentation. Presented at: Facultaet Informatik (Universitaet Stuttgart, Stuttgart, Germany, 25 Jun 2003). Available from: http://hdl.handle.net/2022/15218
• Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
Outline
• The revolution in biology & IU’s response – the Indiana Genomics Initiative
• Example software applications
– Central Life Sciences Database Service
– fastDNAml
• What are the grand challenge problems in computational biology?
• Some thoughts about dealing with biological and biomedical researchers in general
• A brief description of IU’s high performance computing, storage, and visualization environments
The revolution in biology
• Automated, high-throughput sequencing has revolutionized biology.
• Computing has been a part of this revolution in three ways so far:
– Computing has been essential to the assembly of genomes
– There is now so much biological data available that it is impossible to utilize it effectively without the aid of computers
– Networking and the Web have made biological data generally and publicly available
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Indiana Genomics Initiative (INGEN)
• Created by a $105M grant from the Lilly Endowment, Inc. and launched in December 2000
• Build on traditional strengths and add new areas of research for IU
• Perform the research that will generate new treatments for human disease in the postgenomic era
• Improve human health generally and in the State of Indiana particularly
• Enhance economic growth in Indiana
INGEN Structure
• Programs
– Bioethics
– Genomics
– Bioinformatics
– Medical Informatics
– Education
– Training
• Cores
– Tech Transfer
– Gene Expression
– Cell & Protein Expression
– Human Expression
– Proteomics
– Integrated Imaging
– In vivo Imaging
– Animal
– Information Technology ($6.7M)
Challenges for UITS
and the INGEN IT Core
• Assist traditional biomedical researchers in adopting use
of advanced information technology (massive data
storage, visualization, and high performance computing)
• Assist bioinformatics researchers in use of advanced
computing facilities
• Questions we are asked:
– Why wouldn't it be better just to buy me a newer PC?
• Questions we asked:
– What do you do now with computers that you would like
to do faster?
– What would you do if computer resources were not a
constraint?
So, why is this better than
just buying me a new PC?
• Unique facilities provided by IT Core
– Redundant data storage
– HPC – better uniprocessor performance; trivially
parallel programming, parallel programming
– Visualization in the research laboratories
• Hardcopy document – INGEN's
advanced IT facilities: The least you
need to know
• Outreach efforts
• Demonstration projects
Example projects
• Multiple simultaneous Matlab jobs for brain
imaging.
• Installation of many commercial and open source
bioinformatics applications.
• Site licenses for several commercial packages
• Evaluation of several software products that were
not implemented
• Creation of new software
Software packages from external sources
• Commercial
– GCG/Seqweb
– DiscoveryLink
– PAUP
• Open Source
– BLAST
– FASTA
– CLUSTALW
– AutoDock
• Several programs written by UITS staff
Creation of new software
• Gamma Knife – Penelope. Modified existing version for
more precise targeting with IU's Gamma Knife.
• Karyote (TM) Cell model. Developed a portion of the code used to model cell function.
http://biodynamics.indiana.edu/
• PiVNs. Software to visualize human family trees
• 3-DIVE (3D Interactive Volume Explorer).
http://www.avl.iu.edu/projects/3DIVE/
• Protein Family Annotator – collaborative development with
IBM, Inc.
• Centralized Life Sciences Data service
• fastDNAml – maximum likelihood phylogenies
(http://www.indiana.edu/~rac/hpc/fastDNAml/index.html)
Data Integration
• Goal set by IU School of Medicine: any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges
• IU has more than 1 TB of biomedical data stored in its massive data storage system
• There are many public data sources
• Different labs were independently downloading,
subsetting, and formatting data
• Solution: IBM DiscoveryLink, DB/2 Information
Integrator
A life sciences data example: Centralized Life Science Database
• Based on use of IBM DiscoveryLink(TM) and
DB/2 Information Integrator(TM)
• Public data is still downloaded, parsed, and
put into a database, but now the process is
automated and centralized.
• Lab data and programs like BLAST are
included via DL’s wrappers.
• Implemented in partnership with IBM Life
Sciences via IU-IBM strategic relationship
in the life sciences
• IU contributed writing of data parsers
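To make the idea concrete, here is a minimal sketch of what a query against such a federated service might look like from a client script. The DSN name, the DiscoveryLink nickname GENBANK_PROTEIN, and the lab table LAB.EXPRESSION are hypothetical placeholders, not the actual INGEN schema.

import pyodbc

# Connect to the federated DB2/DiscoveryLink catalog through ODBC.
# DSN, user, and schema names below are illustrative placeholders.
conn = pyodbc.connect("DSN=ingen_dl;UID=researcher;PWD=secret")
cur = conn.cursor()

# One SQL statement joins a wrapped public source (a DiscoveryLink nickname)
# with a local lab table; the federation layer fetches rows from each source.
cur.execute("""
    SELECT g.accession, g.definition, e.fold_change
    FROM   GENBANK_PROTEIN g
    JOIN   LAB.EXPRESSION  e ON e.accession = g.accession
    WHERE  e.fold_change > 2.0
""")
for accession, definition, fold_change in cur.fetchall():
    print(accession, fold_change, definition)
conn.close()

The point of the design is that the researcher writes ordinary SQL; which rows come from the public source and which from the lab database is decided by the federation layer.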
A computational example – evolutionary biology
• Evolutionary trees describe how different organisms relate to each other
• This was originally done by comparison of fossils
• Statistical techniques and genomic data have made possible new approaches
fastDNAml: Building Phylogenetic Trees
• Goal: an objective means by which phylogenetic trees can be estimated
• The number of bifurcating unrooted trees for n taxa is (2n-5)! / (2^(n-3) (n-3)!)
• Solution: heuristic search
• Trees built incrementally. Trees are optimized in steps, and best tree(s) are then kept for next round of additions
• High communication/compute ratio
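As a quick illustration of why a heuristic search is unavoidable, the formula above can be evaluated directly (a small Python check, not part of fastDNAml):

from math import factorial

def unrooted_trees(n):
    """Number of distinct unrooted bifurcating topologies for n >= 3 taxa."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (4, 5, 10, 20, 50):
    print(n, unrooted_trees(n))
# n = 10 already gives 2,027,025 topologies; n = 50 gives roughly 2.8e74,
# far beyond exhaustive evaluation.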
fastDNAml algorithm: incremental tree building
• Compute the optimal tree for three taxa (chosen randomly) – only one topology possible
• Randomly pick another taxon, and consider each of the 2i-5 trees possible when adding the i-th taxon into the tree built so far (initially the three-taxon tree)
• Keep the best (maximum likelihood) tree
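A minimal, runnable sketch of this stepwise-addition loop is below. It is not fastDNAml: trees are just edge sets over node labels, and score_tree() is a random stand-in for the maximum-likelihood computation, so only the control flow (2i-5 candidate placements per added taxon, keep the best) is meaningful.

import random

def score_tree(edges):
    # Stand-in for the maximum-likelihood evaluation of one topology.
    return random.random()

def add_taxon(edges, branch, taxon, new_node):
    # Split `branch` with a new internal node and hang `taxon` from it.
    u, v = branch
    new = set(edges)
    new.remove(branch)
    new.update({(u, new_node), (new_node, v), (new_node, taxon)})
    return frozenset(new)

def stepwise_addition(taxa):
    order = list(taxa)
    random.shuffle(order)                                  # random taxon-entry order
    a, b, c = order[:3]
    tree = frozenset({("x0", a), ("x0", b), ("x0", c)})    # only topology for 3 taxa
    for i, taxon in enumerate(order[3:], start=4):
        # A tree with i-1 taxa has 2i-5 branches, hence 2i-5 candidate placements.
        candidates = [add_taxon(tree, br, taxon, f"x{i}") for br in tree]
        tree = max(candidates, key=score_tree)             # keep the best tree
    return tree

print(stepwise_addition(["human", "mouse", "rat", "chicken", "frog", "zebrafish"]))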
fastDNAml algorithm: branch rearrangement
• Local branch rearrangement: move any subtree crossing n vertices (if n=1 there are 2i-6 possibilities)
• Keep best resulting tree
• Repeat this step until local swapping no longer improves likelihood value
Because of local effects….
• Where you end up sometimes depends on where you start
• This process searches a huge space of possible trees, and
is thus dependent upon the randomly selected initial taxa
• Can get stuck in local optimum, rather than global
• Must do multiple runs with different randomizations of taxon
entry order, and compare the results
• Similar trees and likelihood values provide some
confidence, but still the space of all possible trees has not
been searched extensively
fastDNAml parallel algorithm
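The original slide shows a diagram of the parallel decomposition. As a rough illustration only (this is not the fastDNAml source), the sketch below, assuming mpi4py, shows a master/worker loop of the kind used: candidate topologies are farmed out to workers, which return likelihood scores. evaluate() and the tree strings are placeholders. Run with at least two MPI ranks, e.g. mpiexec -n 4 python sketch.py.

from mpi4py import MPI

def evaluate(tree):
    # Stand-in for the per-topology likelihood computation.
    return sum(ord(c) for c in tree)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:                                       # master ("foreman")
    work = [f"tree_{i}" for i in range(20)]         # placeholder candidate topologies
    results, busy = [], 0
    status = MPI.Status()
    for dest in range(1, size):                     # hand one tree to each worker
        if work:
            comm.send(work.pop(), dest=dest)
            busy += 1
        else:
            comm.send(None, dest=dest)              # nothing to do: stop that worker
    while busy:
        score, tree = comm.recv(source=MPI.ANY_SOURCE, status=status)
        results.append((score, tree))
        src = status.Get_source()
        if work:
            comm.send(work.pop(), dest=src)         # keep the worker busy
        else:
            comm.send(None, dest=src)               # stop sentinel
            busy -= 1
    print("best:", max(results))
else:                                               # worker
    while True:
        tree = comm.recv(source=0)
        if tree is None:
            break
        comm.send((evaluate(tree), tree), dest=0)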
fastDNAml performance on IBM SP (chart): speedup vs. number of processors (0-70), with a perfect-scaling reference line and curves for 50, 101, and 150 taxa. From Stewart et al., SC2001.
Other grand challenge problems
and some thoughts about the
future
Gamma Knife
• Used to treat inoperable tumors
• Treatment methods currently use a standardized head model
• UITS is working with IU School of Medicine to adapt the Penelope code to work with a detailed model of an individual patient’s head
“Simulation-only” studies
Aquaporins – proteins which conduct large volumes of water through cell membranes while filtering out charged particles like hydrogen ions.
Massive simulation showed that water moves through aquaporin channels in single file. Oxygen leads the way in. Halfway through, the water molecule flips over.
That breaks the ‘proton wire’
Work done at PSC
Klaus Schulten et al., U. of Illinois, SCIENCE (April 19, 2002)
35,000 hours on the TCS
integrated Genomic Annotation Pipeline - iGAP
Flow diagram (slide source: San Diego Supercomputer Center), summarized:
• Inputs:
– Structure info: SCOP, PDB
– Sequence info: NR, PFAM (~10^4 entries)
– Deduced protein sequences: ~800 genomes @ 10k-20k per genome, i.e. ~10^7 ORFs
• Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical, minimum size 25 aa, coverage (90%, gaps <30, ends <30)
• Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low complexity regions (SEG)
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB
• Structural assignment of domains by 123D on FOLDLIB (only sequences w/out A-prediction)
• Functional assignment by PFAM, NR, PSIPred assignments (only sequences w/out A-prediction)
• Domain location prediction by sequence
• Store assigned regions in the DB
• Per-step compute costs noted on the diagram: 4, 228, 3, 570, 252, and 3 CPU years
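One structural idea worth pulling out of the diagram is the cascade: each successively more expensive method only sees the sequences that earlier methods failed to assign. A small illustrative sketch (the assign_* functions are stand-ins, not the real iGAP tools):

def assign_by_psiblast(seq):
    # Stand-in: pretend sequences containing 'X' cannot be assigned by profile search.
    return None if "X" in seq else "fold_A"

def assign_by_threading(seq):
    # Stand-in for a slower fallback method (e.g. threading against a fold library).
    return "fold_B"

def annotate(sequences):
    assignments, unassigned = {}, dict(sequences)
    for method in (assign_by_psiblast, assign_by_threading):
        remaining = {}
        for name, seq in unassigned.items():
            result = method(seq)
            if result is None:
                remaining[name] = seq        # pass only unassigned sequences onward
            else:
                assignments[name] = result
        unassigned = remaining
    return assignments, unassigned

print(annotate({"orf1": "MKTAYIAK", "orf2": "MXXXLLPQ"}))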
Drug Design
• Protein folding “the right
way”
– Homology modeling
– Then adjust for
sidechain variations,
etc.
• Drug screening
– Target generation –
so what
– Target verification –
that’s important!
– Toxicity prediction –
VERY important
What is the killer application in
computational biology?
• Systems biology – latest buzzword, but…. (see special
issues in Nature and Science)
• Goal: multiscale modeling from cell chemistry up to
multiple populations
• Current software tools still inadequate
• Multiscale modeling calls for use of established HPC
techniques – e.g. adaptive mesh refinement, coupled
applications
• Current challenge examples: actin fiber creation, heart
attack modeling
• Opportunity for predictive biology?
Current challenge areas
Problem                                      | High Throughput | Grid                | Capability
---------------------------------------------|-----------------|---------------------|-----------
Protein modeling                             |                 |                     | X
Genome annotation, alignment, phylogenetics  | X               | X                   | x*
Drug target screening                        | X               | X (corporate grids) | X
Systems biology                              | X               |                     | X
Medical practice support                     | X               | X                   |

*Only a few large scale problems merit ‘capability’ status
Other example large-scale
computational biology grid projects
• Department of Energy “Genomes to Life”
http://doegenomestolife.org/
• Biomedical Informatics Research Network
(BIRN) http://birn.ncrr.nih.gov/birn/
• Asia Pacific BioGrid (http://www.apbionet.org/)
• Encyclopedia of Life (http://eol.sdsc.edu/)
Thoughts about working with
biologists
Bioinformatics and
Biomedical Research
• Bioinformatics, Genomics,
Proteomics, ____ics will
radically change understanding
of biological function and the
way biomedical research is
done.
• Traditional biomedical
researchers must take
advantage of new possibilities
• Computer-oriented researchers
must take advantage of the
knowledge held by traditional
biomedical researchers
Anopheles gambiae
From www.sciencemag.org/feature/data/mosquito/mtm/index.html
Source Library: Centers for Disease Control
Photo Credit: Jim Gathany
INGEN IT Status Overall
• So far, so good
• 108 users of IU’s supercomputers
• 104 users of massive data storage system
• Six new software packages created or enhanced; more than 20 packages installed for use by INGEN-affiliated researchers
• Three software packages made available as open source software as a direct result of INGEN. Opportunities for tech transfer due to use of the GNU Lesser General Public License.
• The INGEN IT Core is providing services valued by
traditionally trained biomedical researchers as well as
researchers in bioinformatics, genomics, proteomics, etc.
• Work on Penelope code for Gamma Knife likely to be first
major transferable technology development. Stands to
improve efficacy of Gamma Knife treatment at IU.
So how do you find biologists with
whom to collaborate?
• Chicken and egg
problem?
• Or more like fishing?
• Or bank robbery?
Bank robbery
• Willie Sutton, a famous American bank robber, was
asked why he robbed banks, and reportedly said
“because that's where the money is.”*
• Cultivating collaborations with biologists in the short run
will require:
– Active outreach
– Different expectations than we might have when working with
an aerospace design firm
– Patience
• There are lots of opportunities open for HPC centers
willing to take the effort to cultivate relationships with
biologists and biomedical researchers. To do this, we’ll
all have to spend a bit of time “going where the
biologists are.”
*Unfortunately this is an urban legend; Sutton never said this
Some information about the Indiana
University high performance
computing environment
Networking: I-light
• Network jointly owned by
Indiana University and Purdue
University
• 36 fibers between Bloomington
and Indianapolis (IU’s main
campuses)
• 24 fibers between Indianapolis
and West Lafayette (Purdue’s
main campus)
• Co-location with Abilene
GigaPOP
• Expansion to other universities
recently funded
Sun E10000 (Solar)
• Acquired 4/00
• Shared memory architecture
• ~52 GFLOPS
• 64 400 MHz CPUs, 64 GB memory
• > 2 TB external disk
• Supports some bioinformatics software available only (or primarily) under Solaris (e.g. GCG/SeqWeb)
• Used extensively by researchers using large databases (db performance, cheminformatics, knowledge management)
Photo: Tyagan Miller. May be reused by IU for noncommercial
purposes. To license for commercial use, contact the photographer
IBM Research SP
(Aries/Orion Complex)
• 632 CPUs, 1.005 TeraFLOPS. First university-owned supercomputer in the US to exceed 1 TFLOPS aggregate peak theoretical processing capacity.
• Geographically distributed at IUB and IUPUI
• Initially 50th, now 170th in Top 500 supercomputer list
• Distributed memory system with shared memory nodes
Photo: Tyagan Miller. May be reused by IU for noncommercial
purposes. To license for commercial use, contact the photographer
AVIDD
• AVIDD: Analysis and Visualization of Instrument-Driven Data
• Project funded largely by the National Science Foundation (NSF), with additional funds from Indiana University and a Shared University Research grant from IBM, Inc.
AVIDD (Analysis and Visualization of Instrument-Driven Data)
• Hardware components:
– Distributed Linux cluster
• Three locations: IU Northwest, Indiana University Purdue University
Indianapolis, IU Bloomington
• 2.164 TFLOPS, 0.5 TB RAM, 10 TB Disk
• Tuned, configured, and optimized for handling real-time data
streams
– A suite of distributed visualization environments
– Massive data storage
• Usage components:
– Research by application scientists
– Research by computer scientists
– Education
Goals for AVIDD
• Create a massive, distributed facility ideally
suited to managing the complete
data/experimental lifecycle (acquisition to insight
to archiving)
• Focused on modern instruments that produce
data in digital format at high rates. Example
instruments:
– Advanced Photon Source, Advanced Light Source
– Atmospheric science instruments in forest
– Gene sequencers, expression chip readers
Goals for AVIDD, Con’t
• Performance goals:
– Two researchers should be able simultaneously to analyze 1 TB data sets (along with other smaller jobs running)
– The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization
– It should be possible to move 1 TB of data from HPSS disk cache into the cluster in ~2 hours (see the quick estimate after this list)
• Science goals:
– The distribution of 3D visualization environments in scientists’ labs should enhance the ability of scientists to spontaneously interact with their data
– Ability to manage large data sets should no longer be an obstacle to scientific research
– AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research
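A quick estimate of the sustained transfer rate the 1 TB in ~2 hours goal implies:

# 1 TB (decimal) moved in 2 hours.
terabyte = 10**12                       # bytes
seconds = 2 * 3600
rate = terabyte / seconds
print(f"required sustained rate: {rate / 1e6:.0f} MB/s "
      f"(about {rate * 8 / 1e9:.1f} Gbit/s)")
# -> roughly 139 MB/s, i.e. a bit over 1 Gbit/s sustained end to end.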
Real-time pre-emption of jobs
• High overall rate of utilization, while able to respond
‘immediately’ to requests for real-time data analysis.
• System design
– Maui Scheduler: support multiple QoS levels for jobs
– PBSPro: support multiple QoS, and provide signaling for job
termination, job suspension, and job checkpointing
– LAM/MPI and Redhat: kernel-level checkpointing
• Options to be supported:
– cancel and terminate job
– re-queue job
– signal, wait, and requeue job
– checkpoint job (as available)
– signal job (used to send SIGSTOP/SIGRESUME)
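An illustrative sketch of the decision logic, written in Python rather than Maui/PBSPro configuration (the capability names and the order of preference are assumptions, not the production policy):

from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    capabilities: set = field(default_factory=set)

    def act(self, action):
        print(f"{self.name}: {action}")

def preempt(job):
    """Free the job's resources for an incoming real-time request."""
    if "checkpoint" in job.capabilities:
        job.act("checkpoint, then release nodes")          # checkpoint job (as available)
    elif "suspend" in job.capabilities:
        job.act("SIGSTOP now, resume when nodes free up")  # signal job
    elif "requeue" in job.capabilities:
        job.act("signal, wait for cleanup, then requeue")  # signal, wait, and requeue
    else:
        job.act("cancel and terminate")                    # last resort

for j in (Job("blast_batch", {"requeue"}),
          Job("matlab_run", {"checkpoint", "requeue"}),
          Job("legacy_code", set())):
    preempt(j)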
1 TFLOPS Achieved on Linpack!
• AVIDD-I and AVIDD-B together have a peak theoretical capacity of 1.997 TFLOPS.
• We have just achieved 1.02 TFLOPS on the Linpack benchmark for this distributed system.
• 51st place on current Top500 list; highest-ranked distributed Linux cluster
• Details:
– Force10 switches, non-routing 20 GB/sec network connecting AVIDD-I and AVIDD-B (~90 km distance)
– LINPACK implementation from the University of Tennessee called HPL (High Performance LINPACK), ver 1.0 (http://www.netlib.org/benchmark/hpl/). Problem size we used is 220000, and block size is 200.
– LAM/MPI 6.6 beta development version (3/23/2003)
– Tuning: block size (optimized for smaller matrices, and then seemed to continue to work well), increased the default frame size for communications, fiddled with number of systems used, rebooted entire system just before running benchmark (!)
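A back-of-the-envelope check on those HPL numbers, using the standard LU operation count of roughly (2/3)N^3 flops and 8N^2 bytes of matrix storage:

N = 220_000                       # problem size reported above
flops = (2 / 3) * N**3 + 2 * N**2
memory = 8 * N**2                 # double precision matrix, in bytes

print(f"matrix storage: {memory / 1e9:.0f} GB")                       # ~387 GB (< 0.5 TB RAM)
print(f"total work:     {flops / 1e15:.1f} Pflop")                    # ~7.1 Pflop
print(f"runtime at 1.02 TFLOPS: {flops / 1.02e12 / 3600:.1f} hours")  # ~1.9 hours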
Cost of grid computing on performance
• Each of the two clusters alone achieved 682.5
GFLOPS, or 68% of peak theoretical of 998.4
GFLOPS per cluster
• The aggregate distributed cluster achieved 1.02
TFLOPS out of 1.997, or 51% of peak theoretical
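The percentages above, checked:

single = 682.5 / 998.4        # one cluster alone vs. its own peak
combined = 1020 / 1997        # distributed Linpack run vs. combined peak
print(f"single-cluster efficiency:  {single:.1%}")    # about 68.4%
print(f"distributed-run efficiency: {combined:.1%}")  # about 51.1%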
Massive Data Storage System
• Based on HPSS (High Performance Storage System)
• First HPSS installation with
distributed movers; STK 9310 Silos
in Bloomington and Indianapolis
• Automatic replication of data
between Indianapolis and
Bloomington, via I-light, overnight.
Critical for biomedical data, which is
often irreplaceable.
• 180 TB capacity with existing tapes;
total capacity of 480 TB. 100 TB
currently in use; 1 TB for biomedical
data.
• Common File System (CFS) – disk
storage ‘for the masses’
Photo: Tyagan Miller. May be reused by IU for noncommercial
purposes. To license for commercial use, contact the photographer
John-E-Box
Invented by John N. Huffman, John C. Huffman, and Eric
Wernert
Acknowledgments
• This research was supported in part by the Indiana
Genomics Initiative (INGEN). The Indiana Genomics
Initiative (INGEN) of Indiana University is supported in part
by Lilly Endowment Inc.
• This work was supported in part by Shared University
Research grants from IBM, Inc. to Indiana University.
• This material is based upon work supported by the National
Science Foundation under Grant No. 0116050 and Grant
No. CDA-9601632. Any opinions, findings and conclusions
or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the
National Science Foundation (NSF).
Acknowledgements con’t
• UITS Research and Academic Computing Division
managers: Mary Papakhian, David Hart, Stephen Simms,
Richard Repasky, Matt Link, John Samuel, Eric Wernert,
Anurag Shankar
• Indiana Genomics Initiative Staff: Andy Arenson, Chris
Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• Assistance with this presentation: John Herrin, Malinda
Lingwall
• Thanks to Dr. M. Resch, Director, HLRS, for inviting me to
visit HLRS
• Thanks to Dr. H. Bungartz for his hospitality, help, and for
including Einführung in die Bioinformatik as an elective
• Thanks to Dr. S. Zimmer for help throughout the semester
• Further information is available at
– ingen.iu.edu
– http://www.indiana.edu/~uits/rac/
– http://www.ncsc.org/casc/paper.html
– http://www.indiana.edu/~rac/staff_papers.html
• A recommended German bioinformatics site:
– http://www.bioinformatik.de/
• Paper coming soon for SIGUCCS conference
Oct. 2003