The Future of Bioinformatics Philip E. Bourne The University of California San Diego [email protected] http://www.sdsc.edu/pb April 12, 2004 Michael Conrad Memorial Lecture.

The Future of
Bioinformatics
Philip E. Bourne
The University of California San Diego
[email protected]
http://www.sdsc.edu/pb
April 12, 2004
Michael Conrad Memorial Lecture
Many of Michael’s contributions are
now being more fully realized in the
fields of bioinformatics and systems
biology. We will explore current and
future trends in these fields to
further appreciate Michael’s vision
We have Come a Long Way…
It will be through the increasing merger
of computer science, computational
science, information science and the
life sciences that Michael’s foresight
will be fully appreciated.
Large amounts of complex data put
these disciplines on the same page
and the book of bioinformatics can be
written. It is therefore appropriate that
today we spend time looking at the
immediate future of bioinformatics.
Today’s Outline
• We will address the following questions
from two perspectives – data complexity
and biological complexity:
  • How did bioinformatics get here?
  • What are the challenges today? Apology –
  many illustrations are drawn from our own
  work in structural bioinformatics
  • What will the short and long term future hold?
Disclaimer - Plotting Change
[Figure: a curve of ANYTHING plotted against TIME, with a marker labelled "You Are Here"]
“The thing about change is that
things will be different afterwards.”
-- Alan McMahon
Rules of Prediction
• Looking back, everything appears to have
developed faster than it really did
• Looking forward, everything will develop
faster than you predict
• Hence, we are all very poor at predicting
beyond the next 5 years – examples:
  • The Next Fifty Years: Science in the First Half of the Twenty-first
  Century by John Brockman (Editor)
  • CACM Volume 40, Issue 2 (February 1997)
"This is like deja vu all over again."
Can I even do 5 years?
Bourne
Bioinformatics Editorial 1999 15(9):715
“Over the next 5 years there will be an estimated 10
major structural genomics efforts each yielding 200
structures per year. While these efforts will deplete
regular structure determination efforts, improvements
in technology and a general expansion of the field
will continue to yield 50 structures per week worldwide
outside of the structural genomics initiatives.”
Net result 35,000 structures by 2005
There were 11,000 structures at the time of this prediction
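The arithmetic behind this prediction can be checked with a short sketch. It assumes 52 weeks per year and that every rate quoted in the editorial holds steady for the full five years; neither assumption is stated on the slide, and the variable names are illustrative only.

```python
# Back-of-envelope check of the 1999 prediction quoted above.
# Assumptions (not stated on the slide): 52 weeks per year, and constant
# rates over the whole five-year period.
YEARS = 5
WEEKS_PER_YEAR = 52

structures_in_1999 = 11_000      # structures at the time of the prediction
sg_efforts = 10                  # major structural genomics efforts
sg_rate_per_year = 200           # structures per effort per year
other_rate_per_week = 50         # structures per week outside the initiatives

from_structural_genomics = sg_efforts * sg_rate_per_year * YEARS
from_other_efforts = other_rate_per_week * WEEKS_PER_YEAR * YEARS
predicted_total = structures_in_1999 + from_structural_genomics + from_other_efforts

print(from_structural_genomics, from_other_efforts, predicted_total)
```

Under these assumptions the total comes to 34,000, which is consistent with the rounded figure of 35,000 given on the slide.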
"You can observe a lot just by
watching."
PDB Growth Curve
Approx. 25,000 structures today
In 2003 approx. 5,000 structures were deposited
History
Predictions Can Be Good
So Let Us Review the History of Bioinformatics
Thus Far – General Observations
• A scientific endeavor driven by a paradigm shift in
which biology became a data driven science – today
macromolecular structure data will be used to illustrate
this paradigm shift
• A relatively new term for a scientific endeavor that has
been around much longer
• Medical informatics preceded it, and defined some of the
foundations
• A scientific endeavor that has gained from fundamental
developments in computer and information science, e.g.,
algorithms, ontologies, Bayesian networks, simulation,
neural networks, text mining, and which in turn defines
new problem domains for computer science
• Systems biology may overtake it
"Do you mean now?" -- When asked for the time.
A More Specific Chronology – Pre 1970
Bioinformatics (2003) 19 2176-2190
1945 Biochemical Pathways – Horowitz
1953 Structure of DNA – W&C
1969 Genetic Variation
1962 Molecular Homology – Florkin
1965 Evolutionary Patterns – Pauling
1966 Molecular Modeling – Levinthal
1967 Phylogenetic Trees – Fitch
1969 Properties – Ptitsyn
1970 Dynamic Programming – N&W
1970 Adaptability – Conrad
1953 Game Theory – von Neumann and Morgenstern
1959 Grammars – Chomsky
1962 Information Theory – Shannon & Weaver
1966 Cellular automata – von Neumann
A More Specific Chronology – 1970’s
Problem Definition
Improved Sequence Alignments – Sankoff; Smith Waterman Algorithm
Structure Prediction – Levitt; Chou and Fasman; Scheraga
Exon/Introns – Gilbert
Public Resources – Dayhoff, PDB
Structural Patterns and Properties – Richards
Information processing in molecular systems – Conrad
A More Specific Chronology – 1980’s
Computational Biology Emerges
Domains recognized – Rashin
Neural nets – Hopfield
Tree of Life Emerges
Molecular computing – Conrad
FASTA – Lipman & Pearson
Nanotechnology – Drexler
Profiles – Gribskov
Reductionism begins – Thornton; Sander
Clustering – Shepard
Relational Databases
Networks – EMBLnet, BIONET
A More Specific Chronology – 1990’s
Bioinformatics and Biotechnology Emerge
• Human Genome Project
• Internet/Web
Conrad, M., Adaptability theory as a guide for interfacing
computers and human society, Systems Research 10, 3-23 (1993).
2004 – Overview of the Current Challenges
[Figure labels: Genomes; Gene Products; Structure & Function; Pathways & Physiology]
~ Scientific Challenges - Deciphering the genome, mapping the genotype-phenotype relationships, dissecting organismic function, engineering organisms
with altered functionality, figuring out complex traits and polymorphism,
understanding physiology.
~ Algorithmic Challenges - comparisons of whole and partial genomes, metrics
for similarity and homology, metabolic reconstruction, dissecting pathways, and
whole cell modeling.
~ Computational Challenges - creating the informatics infrastructure,
information integration, annotation, curation and dissemination of databases,
development of parallel computational methods.
Sociological Challenge
[Figure: Bioinformatics Journal – submissions per year, 1997 to 2003 (axis 0 to 1400)]
[Figure: Bioinformatics Journal – impact factor per year, 1997 to 2003 (axis 0 to 5). Data from Bioinformatics]
Growth outweighs readership
particularly among biologists
Bioinformatics - A Vice Chancellor’s View
[Figure: (C) Copyright Phil Bourne 1998.
Flow labels: Biological Experiment, Data, Information, Knowledge, Discovery; Collect, Characterize, Compare, Model, Infer.
Axis labels: Complexity; Technology; Year (90, 95, 00, 05); Computing Power; # People/Web Site; Sequencing Technology; Data.
Complexity levels: Sequence, Structure, Assembly, Sub-cellular, Cellular, Organ, Higher-life.
Labelled items: ESTs; Gene Chips; Genetic Circuits; Yeast Genome; E. coli Genome; C. elegans Genome; Human Genome Project; 1 Small Genome/Mo.; Human Genome; Virus Structure; Ribosome; Model Metabolic Pathway of E. coli; Neuronal Modeling; Cardiac Modeling; Brain Mapping.]
A Data Centric View of the Future
• Data complexity
• High throughput data collection
• Database vs literature
• Bioinformatics as data driver
• Data representation
• Data integration
"If you come to a fork in the road, take it."
Numbers and Complexity
Complexity is increasing
(a) myoglobin (b) hemoglobin (c) lysozyme (d) transfer RNA (e) antibodies
(f) viruses (g) actin (h) the nucleosome (i) myosin (j) ribosome
Courtesy of David Goodsell, TSRI
High Throughput - The Structural Genomics
Pipeline (X-ray Crystallography)
Basic Steps: Target Selection, Crystallomics, Data Collection, Structure Solution, Structure Refinement, Functional Annotation, Publish
• Target Selection: Bioinformatics (distant homologs, domain recognition)
• Crystallomics (Isolation, Expression, Purification, Crystallization): Automation; Bioinformatics (empirical rules)
• Data Collection: Automation; better sources
• Structure Solution: software integration; decision support; MAD phasing
• Structure Refinement: automated fitting
• Functional Annotation: Bioinformatics (alignments, protein-protein interactions, protein-ligand interactions, motif recognition)
• Publish: No?
Bioinformatics Throughout the Process
An Aside on the Future of Publishing
Full Description Captured as the Paper/Database is
Written/Deposited
Does away with ...
“… the p53 core domain structure consists of a β sandwich that serves as
a scaffold for two large loops and a loop-sheet-helix motif ...”
---- Science Vol.265, p346
Corresponding structure from the PDB: 1TSR
β sandwich? Where? Large loop? Which one?? Loop-sheet-helix??? Oops!
BioEditor - A DTD Driven
Domain Specific Editor
http://bioeditor.sdsc.edu
Bioinformatics 2003 19(7) 897-898
The Data - Bioinformatics Cycle
Result – Computation and Experiment
become More Synergistic
Turn Knowledge into New Data Requirements
Data
Bioinformatics
Turn Data into Knowledge
Deuterium Exchange Mass Spec to Predict Structure
Woods, Baker et al.
Target Protein
Structure Templates
CASP
DXMS
Threading
k (Stability)
Best Structure(s)
Amino Acid
Profile Match Method
COREX
Biological Representation
• The Gene Ontology changes everything
  • Molecular function
  • Biochemical process
  • Cellular location
  • DAG – machine usable
• The number of papers referencing the
gene ontology has increased dramatically
in the last year
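Why a DAG (directed acyclic graph) is "machine usable" can be illustrated with a toy sketch. The term names, parent links and gene annotations below are invented for illustration and are not taken from the Gene Ontology itself. The point is only that a term may have more than one parent, and that a program can follow the links to answer queries at any level of generality.

```python
# Toy directed acyclic graph (DAG) of ontology terms: child -> list of parents.
# Term names and links are illustrative assumptions, not real Gene Ontology content.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "ATP binding": ["binding"],
    "kinase activity": ["catalytic activity"],
    "protein kinase activity": ["kinase activity", "ATP binding"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotated_with(annotations, query, parents=PARENTS):
    """Return the genes whose annotation is `query` or any descendant of it."""
    return sorted(
        gene for gene, term in annotations.items()
        if term == query or query in ancestors(term, parents)
    )


# Hypothetical annotations of two gene products.
ANNOTATIONS = {"geneA": "protein kinase activity", "geneB": "ATP binding"}
print(annotated_with(ANNOTATIONS, "binding"))
```

Because "protein kinase activity" has two parents in this toy graph, a query for "binding" finds both genes, while a query for "catalytic activity" finds only geneA. A strict tree could not express that.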
Biological Data Representation – Future
• Tools to construct ontologies from free
text?
• Ontologies for details of function, protein-protein interaction, protocols, complete
pathway information
Data Integration
Web Services – the
holy grail of
interoperability?
Web Services
• It’s not CORBA – biologists can do it
• You no longer have to remember where
you left it – i.e. registries
• Platform independent
• Driver to force data providers to define and
publish a detailed API
• Compelling - introduces the prospect of
global workflow
Perl Web Services Client Example
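• A small Perl program to access all Pubmed
abstracts containing the word ‘ferritin’
use strict;
use warnings;
use SOAP::Lite;

# Call the remote pubmedAbstractQuery method with the search term
# given on the command line; the result is an array reference.
my $ids_ref = SOAP::Lite
    -> uri('http://server.location.edu/pdbWebServices')
    -> proxy('http://server.location.edu/pdbWebServices')
    -> pubmedAbstractQuery($ARGV[0])
    -> result;

# Dereference the array and print the identifiers on one line
my @ids = @{$ids_ref};
print "@ids\n";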
Mycomputer(1)% web_service.pl ferritin
1AEW 1AQO 1BCF 1BFR 1BG7 1DPS 1EUM 1FHA 1JGC 1JI5 1JIG 1MFR
1QGH 1RCC 1RCD 1RCE 1RCG 1RCI 1RYT 2FHA
The Future – A Biological Complexity Perspective
[Figure: biological scales and SCIENTIFIC RESEARCH & DISCOVERY.
UNITS: Organisms; Organs; Cells; Macromolecules; Biopolymers; Atoms & Molecules.
REPRESENTATIVE DISCIPLINE: Anatomy; Physiology; Cell Biology; Proteomics; Genomics; Medicinal Chemistry.
EXAMPLE: MRI; Heart; Neuron; Structure; Sequence; Protease Inhibitor.
REPRESENTATIVE TECHNOLOGY: Migratory Sensors; Ventricular Modeling; Electron Microscopy; X-ray Crystallography; Protein Docking.
Other labels: Simulation; Data; Infrastructure Technologies; Training.]
Exploring Biological Complexity
Requires:
• We do NOT neglect the details
• Synergy between theory and experiment
which highlights the need for better
algorithms and quality control
But….
• We have existing and emerging
technologies to measure complex systems
• Provides the opportunity to address some
of biology’s fundamental questions
Structure is a Useful Tool to Study
Biological Complexity as Nature
has Provided a Helping Hand…
• An average protein is 350 amino acids in length;
with 20 amino acids there are 20^350 possible
proteins – way more than all the atoms in the
universe
• In actuality there may be only 2-5 x 10^6 proteins
• There are likely between 1,000 and 5,000 unique folds
• Fold is far more conserved than sequence and
permits us to look back farther in evolutionary
time than sequence
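The size of this sequence space is easy to check with exact integer arithmetic. In the sketch below, the figure of roughly 10^80 atoms in the observable universe is a commonly quoted order-of-magnitude estimate; it is an assumption here, not a number from the slide.

```python
# Size of the sequence space for a 350-residue protein built from 20 amino acids.
AMINO_ACIDS = 20
LENGTH = 350

possible_sequences = AMINO_ACIDS ** LENGTH      # exact Python integer
digits = len(str(possible_sequences))           # number of decimal digits

# Assumption: about 10**80 atoms in the observable universe (order of magnitude only).
ATOMS_IN_UNIVERSE = 10 ** 80

print(f"20^350 has {digits} decimal digits")
print(possible_sequences > ATOMS_IN_UNIVERSE)
```

The number has 456 decimal digits, so it is of the order of 10^455, against an estimated 2-5 x 10^6 proteins actually in use.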
But.. much detail remains
and our current
methodologies fall short..
Consider structure comparison
and alignment of the diverse
protein kinases
An Example of a Structural Superfamily:
The Protein Kinase-Like Superfamily
SCOP grouping for kinases
1) Class: Alpha+Beta
2) Fold: Protein Kinase Catalytic Core
3) Superfamily: Protein Kinase
Catalytic Core
4) Families:
a) Ser/Thr Kinases
b) Tyr Kinases
c) Atypical Kinases
d) Antibiotic Kinases
e) Lipid Kinases
Superfamily: not all members are eukaryotic or
protein kinases; some homologues
discovered in bacteria
phosphorylate antibiotics, others
phosphorylate lipids
[Figure: Typical Kinase Core (c-Src, PDB ID: 2SRC)]
Evolution of the Kinase
Superfamily: Comparison of
Three Superfamily Members
•A: Casein kinase 1 (PDB ID:
1CSN)
•B: Aminoglycoside kinase
(PDB ID: 1J7L)
•C: Phosphatidylinositol 3-kinase (PDB ID: 1E8X).
•D: The previous three
structures with only their shared
region superposed (1CSN: light
blue, 1J7L: red, 1E8X: yellow).
•The three kinases share a
minimal core required for ATP
binding and phosphotransfer.
An accurate alignment would
allow us to look back farther in
evolutionary time than sequence
alone. Alignment algorithms
need to reproduce what humans
can do and go beyond it.
An Example of Manual vs. Automated with Combinatorial Extension (CE)
•The manual alignment can be used to better
understand the limitations of our automated
method
•Alignment of helix C of two tyrosine kinases
•Insulin Receptor Kinase (pdb id 1IR3)
•c-Src (pdb id 2SRC)
•Can be aligned with 40% identity, 3.0Å
RMSD
•In Src, C-helix is displaced and rotated
outward
•Rotation pushes N-terminal end of helix
out very far from N-terminal end of IRK
•CE gaps a part of this (yellow), splitting
helix, aligning part of IRK helix C with
loop leading to helix C in Src
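The two figures quoted for an alignment, percent identity and RMSD, can be computed as in the following minimal sketch. It assumes the two structures have already been superposed and that aligned positions are supplied as equal-length lists. The function names and toy data are illustrative only and are not part of CE, which also has to find the alignment and the superposition in the first place.

```python
import math


def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions holding the same residue (gaps '-' never match)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * matches / len(seq_a)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, already superposed (x, y, z) points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Toy data, not taken from 1IR3 or 2SRC.
aligned_a = "ACDEF"
aligned_b = "ACDQF"
ca_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
ca_b = [(0.0, 3.0, 0.0), (3.8, 3.0, 0.0)]

print(percent_identity(aligned_a, aligned_b))
print(rmsd(ca_a, ca_b))
```

Identity is computed here over the full alignment length. Other conventions, such as dividing by the length of the shorter sequence, give different percentages, so reported values are only comparable when the convention is stated.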
Orange: IRK, Blue: c-Src
Yellow: CE gap region
Improving CEfam:
Multiple Alignments
with CE
•Example with strands 1 and 2 of
kinase superfamily
•A: original
•B: optimal parameters
•C: manual
•Parameters also improved
results with other protein
superfamilies in visual analysis
•Just as sequence alignments are
benchmarked against structure
alignments, structure alignments
should be benchmarked to
manual results
•Improvement in optimization is
now being folded into the next
generation of CE
Quality Control
Consider an example
The definition of domains from
3-D structure
The 3D Domain Assignment Problem
A domain is a fundamental structural, functional and evolutionary unit of a
protein. Domains:
• Are compact
• Are stable
• Have a hydrophobic core
• Fold independently
• Perform a specific function
• Can be re-shuffled and put together in different
combinations
Evolution works at the level of the domain
Exact assignment of domains remains a difficult
and unresolved problem.
There is no complete agreement among experts on domain assignment
given a protein structure.
Expert methods agree on 80% of all existing manual assignments; the
remaining 20% represent “difficult” cases
[Figure panels: Expert assignment #1, Expert assignment #2, Expert assignment #3]
Manual vs. automatic consensuses: do they overlap?
Chains with manual consensus: 375 (80% of entire dataset)
Chains with automatic consensus: 374 (80% of entire dataset)
Chains with consensus (automatic or manual) : 424 (90.6% of entire dataset)
• Manual and automatic consensus agree: 328 chains (77.3% of chains with consensus)
• Manual consensus only: 47 chains (11.1% of chains with consensus)
• Automatic consensus only: 46 chains (10.9% of chains with consensus)
• Automatic consensus and manual consensus disagree: 3 chains (0.7% of chains with consensus)
Veretnik et al. 2004 JMB in press
1cjaa (actin-fragmin kinase, slime mold): an unusual kinase
[complex interface]
SCOP, PDP, DomainParser: 1 domain
CATH: 1 domain + unassigned
DALI: 4 domains
[Figure label: typical kinase]
Exemplar Bioinformatics Problems
The Next 5 Years…
1. Full genome comparisons
2. Rapid assessment of polymorphic
variations
3. Complete construction of orthologous
and paralogous groups
4. Structure resolution of large
assemblies/complexes
5. Dynamical simulation of realistic systems
6. Rapid structural/topological clustering of
proteins
7. Protein folding
Exemplar Bioinformatics Problems
The Next 5 Years
8. Computer simulation of membrane insertion
9. Simulation of cellular pathways/ sensitivity
analysis of pathways stoichiometry and
kinetics
10. Comparison of complex networks and
pathways
11. Deciphering the metabolome
12. Integration and interpretation of data at different
biological scales – genomic to population
13. Identification of biomarkers for use in diagnostic
medicine
These problems will be dealt
with by a new generation of
scientists comfortable at both
the bench and the computer.
Until then bioinformaticians
need to work hard to
overcome the “high noon”
problem
High Noon – A Working Definition
12:00
The cost:benefit ratio of entry to bioinformatics
tools and resources is
too high for the majority of biologists
Thus, those who could gain and
contribute most from the services provided
are not users
One Approach - MBT
• Java toolkit for developing custom molecular
visualization applications
• High-quality interactive rendering of:
  • sequence
  • structure
  • function
http://mbt.sdsc.edu
MBT Architecture
Future - The Structure Should
be the User Interface
Ligand - What other
entries contain this?
Chain - What other
entries have chains with
>90% sequence identity?
Residue - What is the
environment of this residue?
Beyond 5 Years…
• Translational medicine
• Personalized medicine
• Merger of medical-, chem- and bio- informatics
• Societies that reflect this
• Training in cooperative in silico and experimental
research
• Centers that reflect that training, i.e. different from
NCBI or EBI
"Think! How the hell are you gonna think and hit at the same time?"
Beyond 5 Years
• Simulations used in the clinic setting
• Smart {genome} cards
• A ubiquitous life sciences Web that
permits views from populations to atoms
"I knew I was going to take the wrong train, so I left early."
Acknowledgements
• To all those who have chosen
bioinformatics as a career and make the
field so rich
• Particularly those who do so for lesser
rewards – the data providers and
annotators
• My group for the fun we had discussing
this topic
• http://rinkworks.com/said/yogiberra.shtml
"I didn't really say everything I said."