Computational Biology: Data, computation, and visualization Dr. Craig A. Stewart [email protected] & Dr. Eric Wernert [email protected] 7 August 2003


Computational Biology: Data,
computation, and visualization
Dr. Craig A. Stewart
[email protected]
&
Dr. Eric Wernert
[email protected]
7 August 2003
License terms
• Please cite as: Stewart, C.A. and E. Wernert. Computational Biology:
Data, computation, and visualization. 2003. Presentation. Presented at:
Visualization Workshop (Arctic Region Supercomputer Center, University
of Alaska Fairbanks, 7 Aug 2003). Available from:
http://hdl.handle.net/2022/15219
• Except where otherwise noted, by inclusion of a source url or some other
note, the contents of this presentation are © by the Trustees of Indiana
University. This content is released under the Creative Commons
Attribution 3.0 Unported license
(http://creativecommons.org/licenses/by/3.0/). This license includes the
following terms: You are free to share – to copy, distribute and transmit the
work and to remix – to adapt the work under the following conditions:
attribution – you must attribute the work in the manner specified by the
author or licensor (but not in any way that suggests that they endorse you
or your use of the work). For any reuse or distribution, you must make
clear to others the license terms of this work.
Outline
• A bit about biomedical data
• Computation and visualization
• The revolution in biology & IU’s response: the
Indiana Genomics Initiative
• Hardware
• Some thoughts about dealing with biological and
biomedical researchers in general
The revolution in biology
• Automated, high-throughput
sequencing has revolutionized
biology.
• Computing has been a part of
this revolution in three ways so
far:
– Computing has been essential to
the assembly of genomes
– There is now so much biological
data available that it is impossible
to utilize it effectively without aid
of computers
– Networking and the Web have
made biological data generally
and publicly available
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
FASTA format
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
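A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. As a minimal sketch in Python (the function name parse_fasta is illustrative and not taken from any particular library), records like the one above could be read as follows:

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# First two sequence lines of the record shown on the slide.
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

if __name__ == "__main__":
    for header, sequence in parse_fasta(example):
        print(header, len(sequence))
```

Real data files are usually far larger than this, so a production parser would more likely read the file line by line than hold the whole text in memory.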
Some of the issues about this exponential
growth in data stores
• WO/RN (write once, read never)
• Comparability/replicability problems with certain
types of data
• HIPAA – how do you de-identify patient data?
Indiana Genomics Initiative (INGEN)
• Created by a $105M grant from the Lilly
Endowment, Inc. and launched December, 2000
• Build on traditional strengths and add new areas
of research for IU
• Perform the research that will generate new
treatments for human disease in the postgenomic era
• Improve human health generally and in the State
of Indiana particularly
• Enhance economic growth in Indiana
Challenges for UITS
and the INGEN IT Core
• Assist traditional biomedical researchers in adopting use
of advanced information technology (massive data
storage, visualization, and high performance computing)
• Assist bioinformatics researchers in use of advanced
computing facilities
• Questions we are asked:
– Why wouldn't it be better just to buy me a newer PC?
• Questions we asked:
– What do you do now with computers that you would like
to do faster?
– What would you do if computer resources were not a
constraint?
So, why is this better than
just buying me a new PC?
• Unique facilities provided by IT Core
– Redundant data storage
– HPC – better uniprocessor performance; trivially
parallel programming, parallel programming
– Visualization in the research laboratories
• Hardcopy document – INGEN's
advanced IT facilities: The least you
need to know
• Outreach efforts
• Demonstration projects
Example projects
• Data integration
• fastDNAml – maximum likelihood phylogenies
(http://www.indiana.edu/~rac/hpc/fastDNAml/index.html)
• PViN - Software to visualize human family trees
• 3-DIVE (3D Interactive Volume Explorer).
http://www.avl.iu.edu/projects/3DIVE/
• Protein Family Annotator – collaborative
development with IBM, Inc.
Data Integration
• Goal set by IU School of Medicine: Any
researcher within the IU School of Medicine
should be able to transparently query all
relevant public external data sources and all
sources internal to the IU School of Medicine to
which the researcher has read privileges
• IU has more than 1 TB of biomedical data
stored in massive data storage system
• There are many public data sources
• Different labs were independently downloading,
subsetting, and formatting data
• Solution: IBM DiscoveryLink, DB2 Information
Integrator
A life sciences data example
Centralized Life Science Database
• Based on use of IBM DiscoveryLink(TM) and
DB2 Information Integrator(TM)
• Public data is still downloaded, parsed, and
put into a database, but now the process is
automated and centralized.
• Lab data and programs like BLAST are
included via DiscoveryLink’s wrappers.
• Implemented in partnership with IBM Life
Sciences via IU-IBM strategic relationship
in the life sciences
• IU contributed writing of data parsers
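The idea of a single query spanning several sources can be illustrated without DiscoveryLink itself. The sketch below is an analogy only: it uses Python's built-in sqlite3 module with two in-memory databases to show one SQL statement joining a "public" source with a "lab" source. All table names, column names, and rows are hypothetical placeholders, and nothing here uses or imitates IBM's actual interfaces.

```python
import sqlite3


def build_sources():
    """Create two separate in-memory databases reachable from one connection."""
    conn = sqlite3.connect(":memory:")                  # plays the role of the public source
    conn.execute("ATTACH DATABASE ':memory:' AS lab")   # plays the role of a lab's own data
    conn.execute(
        "CREATE TABLE main.public_genes (gene_id TEXT PRIMARY KEY, description TEXT)"
    )
    conn.execute("CREATE TABLE lab.measurements (gene_id TEXT, expression REAL)")
    # Placeholder rows, invented purely for illustration.
    conn.executemany(
        "INSERT INTO main.public_genes VALUES (?, ?)",
        [("geneA", "hypothetical kinase"), ("geneB", "hypothetical transporter")],
    )
    conn.executemany(
        "INSERT INTO lab.measurements VALUES (?, ?)",
        [("geneA", 2.5), ("geneA", 3.5), ("geneB", 1.0)],
    )
    return conn


def mean_expression(conn):
    """One query that joins the public annotations with the lab measurements."""
    query = """
        SELECT g.gene_id, g.description, AVG(m.expression)
        FROM main.public_genes AS g
        JOIN lab.measurements AS m ON m.gene_id = g.gene_id
        GROUP BY g.gene_id, g.description
        ORDER BY g.gene_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in mean_expression(build_sources()):
        print(row)
```

The point of the analogy is that the researcher writes one query and does not need to know where each table physically lives; in a real federation, wrappers would perform the translation that ATTACH performs here.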
Dot Plots
• Simple way to get a feel for how sequences
compare to each other.
• Used both with DNA and Protein sequences
• http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html/
• "A dot-matrix program with dynamic threshold
control suited for genomic DNA and protein
sequence analysis" Erik L.L. Sonnhammer and
Richard Durbin Gene 167(2):GC1-10 (1995)
http://www.dkfz-heidelberg.de/tbi/bioinfo/Pairwise/DotPlots/index.html
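A dot plot marks every position pair (i, j) at which a window of one sequence matches a window of the other, so shared regions show up as diagonal runs. The following is a minimal sketch in plain Python using exact-match windows. It is not the Dotter algorithm, which uses scoring matrices and dynamic threshold control; the function names and example sequences are illustrative.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return a matrix of booleans: True where the window starting at
    position i in seq_a exactly equals the window starting at j in seq_b."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix):
    """Draw the matrix as text, one row per position in the first sequence."""
    return "\n".join(
        "".join("*" if hit else "." for hit in row) for row in matrix
    )


if __name__ == "__main__":
    # Two short made-up sequences sharing the substring "GATTACA".
    print(render(dot_plot("CCGATTACATT", "AAGATTACAGG", window=3)))
```

Increasing the window suppresses isolated chance matches, which matters most for DNA, where a four-letter alphabet produces many single-character hits.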
Protein Family Annotator
• New project
• Designed to allow federation and searching of
protein family data
• ‘Visualizing’ the effect of variation in proteins is a
real challenge for biologists
Phylogenetic Inference
• Determine likely evolutionary relationships
among different taxa
• NP-hard
• Very large search space
• Heuristic search required
• Problems:
– searches that are clearly going nowhere
– Comparison of different trees
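To give a sense of how large the search space is: the number of distinct unrooted, fully bifurcating tree topologies for n labeled taxa is (2n-5)!! = 1 · 3 · 5 · ... · (2n-5), a standard combinatorial result. A short Python sketch (the function name is illustrative) shows how quickly this grows:

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies for n_taxa labeled taxa,
    computed as the double factorial (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the factor 1 is implicit).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, count_unrooted_trees(n))
```

Ten taxa already give about two million topologies, which is why exhaustive evaluation is not an option and heuristic search is used instead.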
PViN
Gamma Knife
• Used to treat inoperable tumors
• Treatment methods currently use a standardized head model
• UITS is working with IU School of Medicine to adapt Penelope
code to work with detailed model of an individual patient’s head
Tomography
• Key issue is processing of images
• Weeks to days
• Days to minutes
• Visualization techniques applied do not need to
be fancy to be useful
• Starting with some simple visualizations and
then moving to some very sophisticated
visualizations would be tremendous
Some information about the Indiana
University high performance
computing environment
Networking: I-light
• Network jointly owned by
Indiana University and Purdue
University
• 36 fibers between Bloomington
and Indianapolis (IU’s main
campuses)
• 24 fibers between Indianapolis
and West Lafayette (Purdue’s
main campus)
• Co-location with Abilene
GigaPOP
• Expansion to other universities
recently funded
Massive Data Storage System
• Based on HPSS (High Performance
Storage System)
• First HPSS installation with
distributed movers; STK 9310 Silos
in Bloomington and Indianapolis
• Automatic replication of data
between Indianapolis and
Bloomington, via I-light, overnight.
Critical for biomedical data, which is
often irreplaceable.
• 180 TB capacity with existing tapes;
total capacity of 480 TB. 100 TB
currently in use; 1 TB for biomedical
data.
• Common File System (CFS) – disk
storage ‘for the masses’
Photo: Tyagan Miller. May be reused by IU for noncommercial
purposes. To license for commercial use, contact the photographer
AVIDD (Analysis and Visualization of
Instrument-Driven Data)
• Hardware components:
– Distributed Linux cluster
• Three locations: IU Northwest, Indiana University Purdue University
Indianapolis, IU Bloomington
• 2.164 TFLOPS, 0.5 TB RAM, 10 TB Disk
• Tuned, configured, and optimized for handling real-time data
streams
– A suite of distributed visualization environments
– Massive data storage
• Usage components:
– Research by application scientists
– Research by computer scientists
– Education
Goals for AVIDD
• Create a massive, distributed facility ideally
suited to managing the complete
data/experimental lifecycle (acquisition to insight
to archiving)
• Focused on modern instruments that produce
data in digital format at high rates. Example
instruments:
– Advanced Photon Source, Advanced Light Source
– Atmospheric science instruments in forest
– Gene sequencers, expression chip readers
Goals for AVIDD, cont’d
• Performance goals:
– Two researchers should be able simultaneously to analyze 1 TB data
sets (along with other smaller jobs running)
– The system should be able to give (nearly) immediate attention to
real-time computing tasks, while still running at high rates of overall
utilization
– It should be possible to move 1 TB of data from HPSS disk cache into
the cluster in ~2 hours
• Science goals:
– The distribution of 3D visualization environments in scientists’ labs
should enhance the ability of scientists to spontaneously interact with
their data.
– Ability to manage large data sets should no longer be an obstacle to
scientific research
– AVIDD should be an effective research platform for cluster engineering
R&D as well as computer science research
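The performance goal of moving 1 TB from HPSS disk cache into the cluster in about 2 hours implies a sustained transfer rate that is easy to work out. The sketch below assumes decimal units (1 TB = 10^12 bytes) and exactly 7,200 seconds; both are assumptions made for illustration, since the slide gives only approximate figures.

```python
def required_rate(bytes_total, seconds):
    """Return the sustained rate needed as (megabytes/s, gigabits/s),
    using decimal units (1 MB = 1e6 bytes, 1 Gbit = 1e9 bits)."""
    bytes_per_second = bytes_total / seconds
    megabytes_per_second = bytes_per_second / 1e6
    gigabits_per_second = bytes_per_second * 8 / 1e9
    return megabytes_per_second, gigabits_per_second


if __name__ == "__main__":
    mb_s, gbit_s = required_rate(1e12, 2 * 3600)
    print(f"{mb_s:.1f} MB/s, {gbit_s:.2f} Gbit/s")
```

Under these assumptions the goal works out to roughly 139 MB/s, or a little over 1.1 Gbit/s, sustained for the full two hours.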
John-E-Box
Invented by John N. Huffman, John C. Huffman, and Eric
Wernert
Thoughts about visualization and
collaboration in bioinformatics
• Do you want a one-off, or sustained improvement in productivity of
scientists?
• Collaboration tools can be highly sophisticated, or pretty darn ugly
• Sometimes they must be sophisticated
• The key for collaborative technology is that the collaboration has to
solve a problem (other than ‘what are we going to do in the booth
this year’) and has to feel natural to the application scientist
• Many problems are as much about the theory and practice of
interacting with the information
• Placing facilities in the lab is tremendously beneficial
• We should encourage researchers not to be too cost-sensitive
• Grand challenge problems are great, but there have to be facilities
that facilitate a learning curve and increases in sophistication over
time for the application scientist. This creates a feeder system for
the high end systems!
HPC Challenge
• “Arthropods evolving all over the world”
• (sort of) computational steering
• Big problem: how do you summarize the views
of LOTS of different trees?
What are some really important
challenges in visualization today?
• Expression chip data
• Trees
• Multi-scale problems
Thoughts about working with
biologists
Bioinformatics and
Biomedical Research
• Bioinformatics, Genomics,
Proteomics, ____ics will
radically change understanding
of biological function and the
way biomedical research is
done.
• Traditional biomedical
researchers must take
advantage of new possibilities
• Computer-oriented researchers
must take advantage of the
knowledge held by traditional
biomedical researchers
• So why do you want to interrupt
the work of my paper mill?
Anopheles gambiae
From www.sciencemag.org/feature/data/mosquito/mtm/index.html
Source Library: Centers for Disease Control
Photo Credit: Jim Gathany
So how do you find biologists with
whom to collaborate?
• Chicken and egg
problem?
• Or more like fishing?
• Or bank robbery?
Bank robbery
• Willie Sutton, a famous American bank robber, was
asked why he robbed banks, and reportedly said
“because that's where the money is.”*
• Cultivating collaborations with biologists in the short run
will require:
– Active outreach
– Different expectations than we might have when working with
an aerospace design firm
– Patience
• There are lots of opportunities open for HPC centers
willing to take the effort to cultivate relationships with
biologists and biomedical researchers. To do this, we’ll
all have to spend a bit of time “going where the
biologists are.”
*Unfortunately this is an urban legend; Sutton never said this
Acknowledgments
• This research was supported in part by the Indiana
Genomics Initiative (INGEN). The Indiana Genomics
Initiative (INGEN) of Indiana University is supported in part
by Lilly Endowment Inc.
• This work was supported in part by Shared University
Research grants from IBM, Inc. to Indiana University.
• This material is based upon work supported by the National
Science Foundation under Grant No. 0116050 and Grant
No. CDA-9601632. Any opinions, findings and conclusions
or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the
National Science Foundation (NSF).
Acknowledgments, cont’d
• UITS Research and Academic Computing Division
managers: Mary Papakhian, David Hart, Stephen Simms,
Richard Repasky, Matt Link, John Samuel, Eric Wernert,
Anurag Shankar
• Indiana Genomics Initiative Staff: Andy Arenson, Chris
Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• Assistance with this presentation: John Herrin, Malinda
Lingwall
• Thanks to Dr. M. Resch, Director, HLRS, for inviting me to
visit HLRS
• Thanks to Dr. H. Bungartz for his hospitality, help, and for
including Einführung in die Bioinformatik as an elective
• Thanks to Dr. S. Zimmer for help throughout the semester
• Further information is available at
– ingen.iu.edu
– http://www.indiana.edu/~uits/rac/
– http://www.ncsc.org/casc/paper.html
– http://www.indiana.edu/~rac/staff_papers.html