Introduction to computational biology Craig A. Stewart [email protected] Indiana University 1 July 2005 Höchstleistungsrechenzentrum Stuttgart License terms • • Please cite as: Stewart, C.A.

Download Report

Transcript Introduction to computational biology Craig A. Stewart [email protected] Indiana University 1 July 2005 Höchstleistungsrechenzentrum Stuttgart License terms • • Please cite as: Stewart, C.A.

1
Introduction to computational
biology
Craig A. Stewart
[email protected]
Indiana University
1 July 2005
Höchstleistungsrechenzentrum Stuttgart
License terms
•
•
Please cite as: Stewart, C.A. 2005. Introduction to computational biology. Tutorial
presented at High Performance Computing Center, Stuttgart. 1 July 2005,
Stuttgart, Germany. http://hdl.handle.net/2022/13995
Some figures are shown here taken from web, under an interpretation of fair use
that seemed reasonable at the time and within reasonable readings of copyright
interpretations. Such diagrams are indicated here with a source url. In several
cases these web sites are no longer available, so the diagrams are included here
for historical value. Except where otherwise noted, by inclusion of a source url or
some other note, the contents of this presentation are © by the Trustees of Indiana
University. This content is released under the Creative Commons Attribution 3.0
Unported license (http://creativecommons.org/licenses/by/3.0/). This license
includes the following terms: You are free to share – to copy, distribute and
transmit the work and to remix – to adapt the work under the following
conditions: attribution – you must attribute the work in the manner specified by
the author or licensor (but not in any way that suggests that they endorse you or
your use of the work). For any reuse or distribution, you must make clear to others
the license terms of this work.
2
3
Planned schedule for the day
•
•
•
•
•
•
•
•
•
•
•
•
9:00-9:15
9:15-10:00
10:00-10:15
10:15-10:30
10:30-10:45
10:45-11:30
11:30-12:00
12:00-13:00
13:00-13:30
13:30-14:00
14:00-14:30
14:30:14:45
Introduction and objectives
An introduction to the biological basis [11]
Biological data sources [30]
Similarity searching and Alignment [39]
Coffee Break
Similarity searching and alignment, con;t
Multiple Alignment [58]
Lunch
Phylogenetics [75]
RNA and Protein Structure [91]
Systems Biology [119]
Miscellaneous semi-random topics [138]
4
Plan & Objectives
• Materials focus on open source software (generally not the
presenters own work)
• Objectives. At the end of the class, participants should:
– understand enough biology to understand key computational
biology problems, and be familiar with some strategies for
collaborating with biologists and biomedical scientists
– be conversant with key open source applications in
computational biology and bioinformatics, and current problems
in these areas
– Be ready to download some code and start making it better!
5
Motivation
•
•
•
•
•
The “-omics” trend
Finding press pieces about huge computing problems is easy
How many bio codes really scale to hundreds of processors?
What are the coming high performance needs of biologists?
Importance of computational biology and bioinformatics to the
HPC community
• The challenges and promise are real
• Successes and failures so far
– Successes: Protein structure, Genome assembly, Surgical
assistance, Phylogenetics
– Mismatched priorities: Ab initio protein folding
– Not yet successful: Drug discovery
6
What has changed recently?
• Bioinformatics not new
– Protein structure
– Phylogenetics
• What is new is highthroughput sequencing:
– Lots more data
– The possibility of going
from a knowledge of the
DNA sequence to an
understanding of
diseases and health http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
7
Genome Projects Timeline
•
•
•
•
•
•
•
•
•
•
•
1978
1986
1994
1995
1996
1997
1998
1998
1999
2000
2003
First virus (SV40) sequenced (5224 base pairs)
DOE announces Human Genome Initiative
First complete map of all human chromosomes
First living organism sequenced (H. influenzae) 2 Mb
Yeast (S. cerevisiae) - 12 Mb
Intestinal bacterium (E. coli) - 5 Mb
Nematode worm (C. elegans) - 100 Mb
Celera announcement; Public effort regroups
Human Chromosome 22 – 34 Mb
Joint announcement by NHGRI – Celera
“As good as it gets” human genome
This slide based on slide by Manfred D. Zorn
8
Definitions
• Computational Biology: any use of advanced information
technology in the study of biological problems.
• “Bioinformatics applies the principles of information sciences
and technologies to make the vast, diverse and complex life
sciences data mnore understandable and useful” (NIH BISTIC
Committee grants1.nih.gov/grants/bistic/CompuBioDef.pdf)
• Genomics – study of genomes and gene function
• Proteomics – study of proteins and protein function
• ___omics –
9
Challenges
•
•
•
•
Different types of biological data at different scales
Data of varying quality
Much of the underlying biology is not well understood
Prior to the availability of high-throughput sequencing,
scientists could only study small pieces of the genetic
information of any organism.
• Now the entire genome of several organisms has been
completed, but knowing the genome is different than knowing
how it works!
10
Comparison of Complexity
• Physics & Chemistry
• Biology
– 2 elementary particles
– 3B base pairs in humans
– 4 forces
– Min. 30,000 genes in
humans
– 112 elements
– ~1.5M species
– When random events
occur it is often possible
– Individual random
to study average behavior
events important; no law
of large numbers
– Typically ahistoric
(astrophysics an
– Intensely historic,
exception)
heavily contingent
11
Complexity, Con't
•
Chip design
– All components known
– Device physics for
individual components
known
– Itanium has 3 x 10^8
connections and 2 x 10^8
devices
– Unified basic currency
(electrons)
– Computer program
required to understand
(e.g. SPICE)
•
Cells
– Components not known
– Function of individual
components not known
– # components ~10^13
– No unified basic
currency
– Ecell, Karyote, etc.
attempting to model
cells
– Computer programs do
not yet permit full
understanding
12
A rapid introduction to key elements
of biology
Why is it important to know some
biology?
Anopheles gambiae
From www.sciencemag.org/feature/data/
mosquito/mtm/index.html
Source Library:Centers for Disease Control
Photo Credit:Jim Gathany
• Would you study numerical
methods without knowing
some mathematics?
• Much current biological
knowledge is very specific
to particular organisms,
genes, or diseases
• If you just wade into the
available data online you
can do some very silly
things.
13
14
Central dogma of biology
• The central dogma of
biology is that genes act to
create phenotypes through a
flow of information form
DNA to RNA to proteins, to
interactions among proteins
(regulatory circuits and
metabolic pathways), and
ultimately to phenotypes.
Collections of individual
phenotypes constitute a
population (first put forward
by Crick in 1958) http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
15
Cell Structure
http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
Eukaryotes
• Chromosomes linear
• Introns, exons,
postprocessing
• Nucleus & nuclear wall
• Mictochondria and (in
plants) Chloroplasts
Prokaryotes
• Chromosome circular
• Location is everything
• No nucleus
• No plastids
16
Four (or Five) Bases
• DNA consists of four nucleotides:
Cytosine, Thymine, Adenine, and
Guanine.
• In the double helix, A&T are
always bound, and C&G are
always bound to each other
• RNA consists of four nucleotides
as well: Cytosine, Uracil,
Adenine, and Guanine
• RNA may loop back on itself but
it does not form a double helix
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/structur.gif
17
http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/98-648jpg.shtml
18
www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/molecularmachine.jpg
19
http://www.abdn.ac.uk/sfirc/images/ligand_receptor.gif
Translating DNA to RNA and
Transcribing RNA to Proteins
DNA
AAAAAGGAGCAAATT
1
RNA
One possible amino
acid string
2
4
3
6
5
UUUUUCCUCGUUUAA
Phe
Asn
Asp
Ala
20
21
Genetic Code
Ala Alanine
Arg Arginine
Asn Asparagine
Asp Aspartic acid
Cys Cysteine
Glu Glutamic acid
Gln Glutamine
Gly Glycine
His Histidine
Ile Isoleucine
http://www.ncbi.nlm.nih.gov/Class/MLACourse/
Original8Hour/Genetics/geneticcode.html
Leu Leucine
Lys Lysine
Met Methionine
Phe Phenylalanine
Pro Proline
Ser Serine
Thr Threonine
Trp Tryptophan
Tyr Tyrosine
Val Valine
22
Sickle Cell
Normal RBC
• GAG codes for Glutamine
• disc-Shaped, soft
• easily flow through small
blood vessels
• lives for 120 days
Sickle RBC
• GTG codes for Valine
• sickle-Shaped, hard
• often get stuck in small
blood vessels
• lives for 20 days or less
Malaria vs. Anaemia!
http://www.nlm.nih.gov/medlineplus/
ency/imagepages/1223.htm
23
What is a Gene?
• An inheritable trait associated with a region of DNA that codes
for a polypeptide chain or specifies an RNA molecule which in
turn has an influence on some characteristic phenotype of the
organism.
– Early views: genes lined up on the chromosome like beads
on a string; one gene => one protein
– Examples of genes: color blindness, sickle-cell anaemia
– Mendelian genes, Sex-linked genes, Quantitative traits
• Annotation: Extraction, definition, and interpretation of
features on the genome sequence
• Annotations vs. genes:
– Many annotations describe features that constitute a gene.
– Others may not always directly correspond in this way
– An annotation is what we think… nature may disagree!
• Inheritance problem with annotations
24
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/98-647.jpg
25
Human Chromosomes
• Prokaryotes and Eukaryotes that
reproduce asexually just get a
copy of all of their genes from
their ‘parent’
• Eukaryotes (that includes us) that
reproduce sexually receive one
chromosome from each of our
parents. This means we have two
copies of the same information –
one copy from each parent
• Genes can be dominant (brown
eyes) or recessive (blue eyes)
http://www.ornl.gov/TechResources/
Human_Genome/graphics/slides/
elsikaryotype.html
26
Mendelian Genetics
• Round peas are “dominant” to
that a pea with the genotype RR
or Rr is going to look round
• Only peas with the rr genotype
will look wrinkled
• For very simple Mendelian genes,
you can use something called a
Punnet square
• Many genes are inherited in ways
OTHER than simple Mendelian
genetics, which is why
sequencing the geneome has been
very important!
27
Gene Components
• Prokaryotes
– Location is everything
– Essentially all of the DNA is transcribed (few mitochondrial diseases)
• Eukaryotes
– Non-contiguous pieces of DNA may comprise one gene
– Start sequence (complicated and long)
– Stop Codons – end transcription
– Exons – portions of sequence that are transcribed and used
– Introns – portions of sequence that are not used
28
Alternate splicing
http://www.blc.arizona.edu/marty/411/Modules/altsplice.html
29
Population genetics & evolution
• Mutations create the raw material
for evolution
• Natural selection and chance
affect the frequency with which
particular genes or DNA
sequences are present in
populations
• Given enough time and enough
change, evolution, speciation, and
so forth happen
• Genes can be ‘fixed’ or
‘maintained in an equilibrium’ in
a population by chance or through
natural selection
http://faculty.wm.edu/bsgran/
30
Key points (so far)
• Biological processes are complicated; the historicity and complexity
of biological processes and our lack of understanding of many
matters makes biology an interesting topic!
• The fundamental dogma of molecular biology is that genes act to
create phenotypes through a flow of information form DNA to RNA
to proteins, to interactions among proteins (regulatory circuits and
metabolic pathways), and ultimately to phenotypes. Collections of
individual phenotypes constitute a population.
• DNA consists of four base pairs (ATCG). A is always paired with T;
C always paired with G.
• DNA is translated into RNA. RNA consists of four base pairs as
well (AUCG).
• The linear structure of DNA is transcribed into RNA and then into
proteins. Proteins have their 3D configuration as the basis for their
structure.
31
Bioinformatics data sources
32
MedLine & PubMed
• Medline:
– U.S. National Library of Medicine – NLM Medline
http://www.nlm.nih.gov/
– ~12 million references on life sciences/biomedicine (since 1996)
• Pubmed - Standard search tool for Medline http://www.ncbi.nlm.nih.gov/entrez/
• Structured Language - Medical Subject Heading
http://www.nlm.nih.gov/mesh/MBrowser.html
33
Genomic, Proteomic, etc. data sources
• A tremendous amount of data is available through public data
sources via the Web, ftp, or by other means.
• To analyze biological data, we first have to get it….
• Several ways to organize presentation of material – by site, by
type, etc. We will organize by data type.
• Types of Databases:
–
–
–
–
–
–
Chromosomal (http://www.ncbi.nlm.nih.gov/mapview)
DNA/Genes
Protein
Biochemistry and metabolic pathways
Microarray
Web collections
34
DNA databases
• GenBank. Operated by NCBI (National Center for
Biotechnology Information). http://www.ncbi.nlm.nih.gov
• European Molecular Biology Laboratory – Nucleotide
Sequence Database. http://www.ebi.ac.uk/genomes/
• DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp
• All share data daily. Update conflicts avoided by policy.
• Differences are in “value added” and interfaces
35
Data Structures
• Current
– Primary DNA repository data based on ASN.1. Makes
possible linkages among many types of biomedical info.
– The software libraries now often handle XML as well
– Software toolkits and docs available at
http://www.ncbi.nlm.nih.gov/IEB/
• Genbank Flat File format
– http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
• FASTA
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
http://www.ncbi.nlm.nih.gov
36
37
Protein Structure
• NCBI
• Swiss-Prot/TrEMBL at http://www.expasy.org/
– Note: 125,744 chemically determined vs 861,482 inferred
from automated translation of DNA sequences!!!!!
• Protein Data Base – PDB http://www.rcsb.org/pdb/ - one of
the first online bioinformatics databases!!!
38
Biochemistry and pathways
• Biochemistry
– ENZYME (part of the ExPASy system)
– BIND (part of the NCBI system)
• Pathways
– PathDB http://www.ncgr.org/software/version_2_0.html
– Kegg WIT http://wit.mcs.anl.gov/WIT2/
39
Web Resources - General
• NCBI
http://www.ncbi.nlm.nih.gov/
• EBI Biocatalog
http://www.ebi.ac.uk/biocat/
• IUBio Archive
http://iubio.bio.indiana.edu
http://www.ncbi.nlm.nih.gov/
40
Similarity matching & Sequence
Alignment
Why pattern matching (and what are
the problems)
and…
US!
Bonobo
http://www.sandiegozoo.org/special/zoo-featured/pygmy_chimps.html
41
42
Problems!
• For proteins, 95% similarity is ~ identical, 80% similarity is a
lot. Even less similarity than that needed for DNA
• Database techniques inadequate – they are too precise!
• Datasets very large to search
• Homology
• Common ancestry
• Sequence (and usually structure) conservation
• Homology is inferred rather than measured
• Identity
• Objective and well defined
• Can be quantified easily, but not very useful!
• Similarity
• Most common method used, but not as easily defined
43
How do sequences differ?
• Differences in individual bases
CGTACCGTTAATAT
CGTACCGATAATAT
• Bases may be added to a sequence CGTACCCCGTAATAT
CGTACC . .GTAATAT
• Bases may be deleted from a
sequence
CGTACCGTTAATAT
CGTACCG . . .ATAT
44
Alignment
• An alignment is an arrangement of two sequences opposite
one another
• It shows where they are different and where they are similar
• We want to find the optimal alignment - the most similarity
and the least differences
• Alignments have two aspects:
– Quantity: To what degree are the sequences similar (percentage,
other scoring method)
– Quality: Regions of similarity in a given sequence
45
Scoring Alignments
GCTAAATTC
++ x x
GC AAGTT
• Matches are good: they get a positive value
• Mismatches are bad: they get a negative value
• Gaps are bad: they get a negative value
– Gap opening penalty
– Gap extension penalty
– Score = Matches –Mismatches
-∑{gap opening penalty +(length)*gap length penalty}
CGTACCGTTAATAT
CGTACCGTTAATAT
CGTACCG . . .ATAT CGT. C . GTT .ATAT
46
Dotter
• Simple way to get a feel for how
sequences compare to each other.
• Used both with DNA and Protein
sequences
• http://www.cgr.ki.se/cgr/groups/son
nhammer/Dotter.html/
• "A dot-matrix program with
dynamic threshold control suited
for genomic DNA and protein
sequence analysis" Erik L.L.
Sonnhammer and Richard Durbin
Gene 167(2):GC1-10 (1995)
• Modular nature of proteins
47
Local Alignments with BLAST
•
•
•
•
Basic Linear Alignment Search Tool
First a quick demo (hopefully)
http://www.ncbi.nlm.nih.gov/BLAST
So, what did we do?
– BLAST – Basic Linear Alignment Search Tool
– In particular, BLASTn (for nucleotides)
– Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman,
D.J. 1990. Basic Local alignment search tool. Journal of
Molecular Biology 215:403-410
48
(Original) BLAST Algorithm
• Original algorithm does not permit gaps
• The original BLAST algorithm is a local (heuristic) alignment
tool
• Given a search sequence, e.g. ACGTAGGCATGAA
• BLAST first makes a list of all “words” of a given length that
would possibly have a score of at least T against the search
string.
• In the case of this example there would be (at least) the
following:
– ACGTAGGCATG
– CGTAGGCATGA
– GTAGGCATGAA
49
(Original) BLAST Algorithm, 2
• BLAST takes the list of all words with a score of at least T against the
string one is trying to match…. and then searches a database for any
matches to these words. So if one were using the example and the NR
database, BLAST would search NR for all occurrences of the words:
– ACGTAGGCATG
– CGTAGGCATGA
– GTAGGCATGAA
• Suppose BLAST finds in the NR database an exact match to
– ACGTAGGCATG
• BLAST then attempts to extend the match in both directions
– ACGTAGGCATGA
– ACGTAGGCATGA
• So now we have an exact match of 12 letters
50
(Original) BLAST algorithm,3
• So BLAST keeps going, and in this case would stop at an
exact match of 13 letters (if one existed), since 13 letters was
the entire initial search string:
– ACGTAGGCATGAA
– ACGTAGGCATGAA
• BLAST has a stopping algorithm for dropping particular
search directions, or stopping altogether
51
BLAST algorithm in more detail
• The BLAST algorithm searches for MSPs – Maximal Scoring Pairs – such
that the score of sequences cannot be improved either by lengthening it or
shortening it. “Pairs” here refers to a string – or a substring – of the initial
string used as the search string – and one or more strings or substrings
found in a database.
• The search starts with the creation of all possible subwords of a given
length (default typically 11 for DNA sequences, 3 amino acids for protein
sequences) that would score at least T when matched against the original
search string. (T is short for Threshold)
• BLAST searches for any occurrence of each of these words that have a
score of at least T. This is a “hit” – or a “High Scoring Pair (HSP)”
• The search then continues by trying to extend these HSPs.
• Suppose “S” is the best score found for a word of length k. BLAST stops
trying to extend words when the score drops a certain amount below the
best value S in the previous round.
• BLAST continues on and on until it is no longer possible to improve the
score of HSPs by making them longer.
• Then it generates a list of the best HSPs. Default is a cutoff E-value of 10
• BLAST (original) has an infinite gap penalty
52
PSI BLAST
• Position Specific Iterative BLAST
• http://nar.oupjournals.org/cgi/content/full/25/17/3389
• Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z,
Miller W, Lipman DJ. 1997. Gapped BLAST and PSIBLAST: a new generation of protein database search
programs. Nucleic Acids Res 1997 Sep 1;25(17):3389-402
• Required two non-overlapping similarities with search term to
occur within a certain distance (A) on the genome
• Permits gaps in the alignments
• Can be iterated to allow for user-specified scoring matrices By
default, uses the BLOSUM-62 Matrix
53
PSI BLAST
• In the original
BLAST, the step
of extending the
length of the
‘hits’ took ~90%
of execution time.
• The initial
threshold value T
must be lower
than with the
original BLAST,
but far fewer hits
are pursued,
meaning that the
extension time is
lower
Two hits, T=11 A=40 vs One hit, T=13
http://nar.oupjournals.org/content/
vol25/issue17/images/gka56202.gif
54
Example Problem (hoffentlich)
55
mpiBLAST http://mpiblast.lanl.gov/
56
mpiBLAST Algorithm
• Darling, A.E., L. Carey, W.-C. Feng. 2003. The design,
implementation, and evaluation of mpiBLAST. Presented at
ClusterWorld2003.
http://www.cs.wisc.edu/%7Edarling/mpiblast-cwce2003.pdf
• Algorithm
– Database is segmented. Portions of database are placed on data
storage devices on multiple nodes in a HPC system.
mpiformatdb is a wrapper for the BLAST formatdb program.
Number of subdivisions specified by user
– Foreman/worker algorithm. Portions of the database are assigned
to workers, using a greedy algorithm
57
mpiBLAST performance
• Scaling can be super linear when pieces are small enough that
they fit into memory
• Scalability limitations due to communication, implicit barrier
before assembly of results
• If pieces of data distributed out to workers are larger than
available RAM, then scaling is still good but not super linear
• Blast is the most heavily used bioinformatics tool in existence.
Parallelization of BLAST has huge payoff for practicing
biologists
BLAST superlinear scaling as memory
phenomenon
• Standard BLAST running on a system with 128 MB of
memory.
Slide courtesy of Wu-chun Feng
[email protected] Los Alamos National Laboratory
58
59
Multiple Alignment
60
Multiple Alignment
• Sequences lined up so that homologous residues are next to one another
Color reflects residue type (e.g., green = hydrophobic)
This slide based on a slide created by Dr. Richard Repasky, Indiana University
61
Uses
• Alignments reveal the degree to which sequences have been
conserved
• Most functional sequences are conserved
• Multiple alignment is used to locate them
–
–
–
–
Functional groups of enzymes
Predict protein structure
Gene promoters
Unknown functional units of non-coding regions of DNA
• Alignments necessary to estimate evolutionary trees
This slide based on a slide created by Dr. Richard Repasky, Indiana University
62
Progressive alignments
• Pairwise dynamic programming alignment algorithms can be extended to
multiple sequences but scale poorly to large numbers of sequences (or to
long sequences)
• Heuristic algorithms are employed. Commonly used heuristic methods are
progressive - build multiple alignment by aggregating from paired
alignments
• Order in which sequences are added determined by a guide-tree that
reflects similarity/distance
• Guide tree constructed from sequences
• Closely related sequences aggregated/added first
• Errors in early additions tend to propagate
• Algorithms differ in strategy for minimizing error propagation
• Algorithms also differ in guide tree construction & scoring
This slide based on a slide created by Dr. Richard Repasky, Indiana University
63
Progressive alignment steps
•
•
•
•
1 - align B & C
2 - align D & E
3 - align (D & E) & A
4 -align (D & E & A) & (B
& C)
This slide based on a slide created by Dr. Richard Repasky, Indiana University
64
Three algorithms
• CLUSTAL W
– Oldest of three & most widely used
– Initial alignment error not addressed
– Good performance by adding realistic details
• T-COFFEE
– Initial alignment error addressed by using consistency methods
– Uses CLUSTAL W, improves performance
• ProbCons
– New this year
– Initial alignment error also addressed by consistency methods
– Uses hidden Markov models
This slide based on a slide created by Dr. Richard Repasky, Indiana University
65
CLUSTAL W
• http://www.ebi.ac.uk/clustalw/
• Thompson et al. 1994. Nucleic Acids Res. 22: 4673-4680
• Uses dynamic programming with distance matrices and gap
penalties for alignments
• Selective use of scoring matrices
– Strict matrices for closely related sequences
– Permissive matrices for distantly related sequences
– Relatedness determined by branch lengths in guide tree
• Uses residue-specific gap penalties from reference
alignments
This slide based on a slide created by Dr. Richard Repasky, Indiana University
66
CLUSTAL W
• Gap penalties reduced in short stretches of hydrophilic
residues (usually associated with bends and are gap-prone)
• Gap penalties increased in areas within 8 residues of existing
gaps because such gaps are rare in reference alignments
• Sequences weighted by relatedness
– Attempt to correct for unbalanced sampling across guide tree
– Closely related sequences discounted in importance
• Progression
– Leaves of tree joined by dynamic programming
– Leaves joined with internal nodes by sequence-profile alignment
– Internal nodes joined by profile-profile alignment
This slide based on a slide created by Dr. Richard Repasky, Indiana University
67
Example output
FOS_RAT
FOS_MOUSE
FOS_CHICK
FOSB_MOUSE
FOSB_HUMAN
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNS
-MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
-MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
*:..* .:*:: .***** **:.:* * *..***.* :.. :*:
FOS_RAT
FOS_MOUSE
FOS_CHICK
FOSB_MOUSE
FOSB_HUMAN
IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP
IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP
VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVP
VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMP
VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMP
:******:** **********:**:* **... ::. .**.:* :
68
ClustalW-MPI
• Li, K.-B.2003. ClustalW-MPI: ClustalW analysis using
distributed and parallel computing. Bioinformatics 19: 15851586
• Initial pairwise alignment process is parallelized and scales
very well
• Multiple alignment process is parallelized and scales modestly
• Scaling tests published thus far up to 16 processors, reduces
time from hours to minutes
69
Consistency Methods
• Make estimates based on more information - “averaging”
• Lazy teacher analogy
• In progressive multiple alignment, use as much information as
possible when adding sequences to the alignment
• T-COFFEE: each position in one alignment is weighted by
consistency in all alignments of all pairs of sequences that
include the sequences being aligned
• ProbCons: posterior probabilities in pairwise HMM
alignments weighted by posterior probabilities of same
positions in other alignments
This slide based on a slide created by Dr. Richard Repasky, Indiana University
70
T-COFFEE
• Http://igs-server.cnrsmrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.ht
ml
• Notredame, et al. (2000, J. Mol. Biol. 302:205-217)
• Gives better alignments than CLUSTAL W on benchmark
datasets
• Avoids problem of early bad gaps using consistency methods
• Progressive alignment based on weights pooled from all
pairwise alignments rather than currently accumulated
sequences
This slide based on a slide created by Dr. Richard Repasky, Indiana University
71
T-COFFEE steps
• Calculation of weights
– All pairwise alignments using CLUSTAL W, local alignments
using FASTA Lalign
– For each aligned base pair in each pair of sequences calculate
weight
– Aggregate weights for aligned base pairs using triplets of
sequences
• Progression
– Align all sequences pairwise using weights
– Build guide tree from pairwise alignments
– Progressively build multiple alignment using tree and weights
This slide based on a slide created by Dr. Richard Repasky, Indiana University
72
ProbCons
•
•
•
•
•
•
•
•
Http://probcons.stanford.edu
ISMB 2004, Bioinformatics 20:Supplement 1
Constistency methods & HMM to align
Gives better alignments than CLUSTAL W & T-COFFEE on alignment
benchmarks
Use HMM to align all pairs of sequences
Keep posterior probability matrices & update each value by averaging over
all alignments in which the sequence position occurs. Do twice.
Create a guide tree from expected accuracies (sums of posterior
probabilities of highest summing path in matrix)
Progressive alignment objective function is sum of posterior probabilities
for all aligned residues
This slide based on a slide created by Dr. Richard Repasky, Indiana University
73
Multiple alignment viewers
• CLUSTAL X - X windows
– ftp://ftp-igbmc.ustrasbg.fr/pub/ClustalX/
• Jalview - Java
– http://www.ebi.ac.uk/~michel
e/jalview/
• Variable color schemes
• Editing
• Front end to aligners
This slide based on a slide created by Dr. Richard Repasky, Indiana University
74
Abstracting Multiple Alignments
•
•
•
•
•
Hidden Markov models can be used to describe alignments
Called profile HMMS
Think of them as definitions of proteins or averages
Useful for aligning newly discovered sequences
Search sequence databases for sequences that match the
alignment profile (Consider the alternative!)
• Build databases of profiles and search for profiles that match
query sequences
This slide based on a slide created by Dr. Richard Repasky, Indiana University
HMMER
•
•
•
•
•
•
•
•
•
http://hmmer.wustl.edu/
Profile HMMs for protein sequence analysis
Builds profiles from existing alignments
Searches sequence databases for molecules that match profiles
Can be used to construct db’s of profiles and to search for
profiles that match sequences
Generates random sequences from profiles
Also available as a parallel code using PVM
Has been ported to vector supercomputers
Scales reasonably well as regards number of processors. Does
not scale as well as regards size of the biological problem
75
76
Phylogenetic Inference
77
Building Phylogenetic Trees
• Goal: an objective means
by which phylogenetic trees
can be estimated in
tolerable amounts of wallclock time, producing
phylogenetic trees with
measures of their
uncertainty
• All evolutionary changes
are described as bifurcating
trees
-genes or gene products
-organisms
78
Why is tree-building a HPC problem?
• The number of bifurcating
unrooted trees for n taxa is
(2n-5)!/ (n-3)! 2n-3
• for 50 taxa the number of
possible trees is ~1074; most
scientists are interested in
much larger problems
• NP-hard problem
• The number of rooted trees
is (2n-5)!
79
Stochastic change of DNA
• Markov process, independent for each site: 4 x 4 matrix for
DNA, 20 x 20 for amino acids
•
A
C
G
T
• A
p(A->A)
p(A->C)
p(A->G) …
• C
p(C->A)
p(C->C)
p(C->G) …
• G
.
• T
.
• Transitions more probable than transversions.
• Must account for heterogeneity in substitution rates among
sites (DNArates – Olsen)
80
fastDNAml
•
•
•
•
Developed by Gary Olsen
Derived from Felsensteins’s PHYLIP programs
One of the more commonly used ML methods
The first phylogenetic software implemented in a parallel
program (at Argonne National Laboratory, using P4 libraries)
• Olsen, G.J.,et al.1994. fastDNAml: a tool for construction of
phylogenetic trees of DNA sequences using maximum
likelihood. Computer Applications in Biosciences 10: 41-48
• MPI version produced by Indiana University in collaboration
with Gary Olsen available from
http://www.indiana.edu/~rac/hpc/fastDNAml/
81
fastDNAml algorithm – adding taxa
• Optimize tree for 3
(randomly chosen) taxa only one topology
possible
• Randomly pick another
taxon –
(2i-5) trees possible
• Keep the best
(maximum likelihood
tree)
Basic fastDNAml algorithm - Branch
rearrangement
• Move any subtree crossing n
vertices (if n=1 there are 2i6 possibilities)
• Keep best resulting tree
• Repeat this step until local
swapping no longer
improves likelihood value
82
83
fastDNAml algorithm con’t: Iterate
•
•
•
•
•
•
•
Get sequence data for next taxon
Add new taxa (2i-5)
Keep best
Local rearrangements (2i-6)
Keep best
Keep going….
When all taxa have been added, perform a full tree check
84
Overview of parallel program flow
•
Program modules
– Master (generates trees,
receives back from Foreman
best tree at each step)
– Foreman (dispatches trees to
workers, determines best tree,
tracks activity of workers)
– Worker
– Monitor (instrumentation)
– Parallel versions include fault
tolerance features (useful in
large clusters and grid
computing)
85
Performance of fastDNAml
70
60
SpeedUp
50
40
30
20
10
0
0
10
20
30
40
50
Number of Processors
Perfect Scaling
50 Taxa
101 Taxa
150 Taxa
60
70
86
SC2003 HPC Challenge (“It seemed like a
good idea at the time”)
Are Hexapods a single evolutionary group? Are ecdysozoans a
single evolutionary group?
87
A partial bestiary
All organism illustrations copyright
Jennifer Fairman, 2003.
www.fairmanstudios.com
Used by agreement
88
Software and data analysis
•
Non-grid preparatory work
– Download sequences from NCBI (67 Taxa, 12,162 bp, mitochondrial genes
for 12 proteins)
– Align sequences with Multi-Clustal
– Determine rate parameters with TreePuzzle
•
•
Grid preparatory work
– Analyze performance of fastDNAml with Vampir
– Meetings via Access Grid & CoVise
The grid software
– PACXMPI – Grid/MPI middleware
– Covise – Collaboration and visualization
– Application Framework – Matthias Hess
– fastDNAml – Maximum Likelihood phylogenetics
Application framework, COVISE,
FastDNAml
•
•
•
•
•
ML analysis of
phylogenetic trees based
on DNA sequences
Foreman/worker MPI
program
Heuristic search for best
trees
For 67 taxa: 2.12 ~10109
trees
Goal: 300 bootstraps, 10
jumbles per – 3000
executions (more than 3x
typical!)
89
90
Why this project on a grid?
•
•
•
•
•
Important & time-sensitive biological question requiring massive computer
resources
A biologically-oriented code that scales well
Grid middleware environment & collaboration tool well suited to the task at hand
Opportunity to create a grid spanning every continent on earth (except Antarctica)
It seemed like a good idea at the time
91
The results
• Hundreds of trees
were analyzed during
the course of the week
• The biological results
are still being analyzed
• Our HPC challenge
project was awarded
the prize for “Most
geographically
distributed
application”
• We have not yet
become rich or famous
92
RNA & Protein Structure
93
RNA & Protein Structure
•
•
•
•
•
•
Want to know functions
Function dictated by structure
Need structure to understand function
Empirical determination of structure difficult/expensive
Shortcut: predict structure from sequence
Algorithms & software for predicting structure
94
RNA
•
•
•
•
•
Nucleotide sequence
Composition differs from DNA
Thymine replaced by uracil
Alphabet: C, G, A, U
Types of RNA
– Messenger RNA - template from DNA - decoded to produce protein
– Transfer RNA - interface attached to amino acids - identifies amino
acid to protein producing machinery
– Ribosomal RNA - protein producing machinery
– Regulatory RNA - small polynucleotides that bind to other molecules
and alter behavior
– Catalytic RNA - most catalyze reactions of DNA
95
RNA secondary structure
•
•
•
•
•
Single stranded
RNA folds on itself
Complementary bases join
A – U, G - C
Forms loops & hairpins
• In nature, structure nearly
minimizes energy
• Energy - more or less
bending/stress on bond angles
• Zuker algorithm minimizes
calculated energy
tRNA
96
RNA Structure – Vienna RNA
• http://www.tbi.univie.ac.at/~ivo/RNA/
• Package consists of several parts (from the web site):
– RNAfold - predict minimum energy secondary structures and pair
probabilities
– RNAeval - evaluate energy of RNA secondary structures
– RNAheat - calculate the specific heat (melting curve) of an RNA
sequence
– RNAinverse - inverse fold (design) sequences with predefined structure
– RNAdistance - compare secondary structures
– RNApdist - compare base pair probabilities
– RNAsubopt - complete suboptimal folding
http://www.tbi.univie.ac.at/~ivo/RNA/
97
Protein
•
•
•
•
•
•
•
•
Enzymes - catalysts
Regulatory - bind with molecules to alter behavior
Transport - move here to there as oxygen in hemoglobin
Storage - e.g., caches of nitrogen or metal ions
Mobility - contractile & motile proteins (muscle, flagella)
Structural proteins - fill space, provide support
Scaffold - supports for construction of macro molecules
Defense/Attack - immune system proteins, venom
98
Structure and function of proteins
•
•
•
•
•
Enzymes receive most attention
Enzymes catalyze reactions = lower energy required
Place reactants in favorable positions for reaction
Location is everything
Example enolase
99
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html
100
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html
101
Primary structure - amino acids
• Share nitrogen group
(amino)
• Share acid group (carboxyl)
• Differ in side chains
• 20 common amino acids
• Polymerize amino-tocarboxyl
• Side chains determine
secondary & tertiary
structure
102
Secondary structure & Bond rotation
• Reflects angles in amino acid chain
• Shape of the peptide chain over short sequences
• Determined by amino acid composition
Angles of rotation & secondary
structure
http://www.cryst.bbk.ac.uk/PPS95/
course/3_geometry/rama.html
103
104
Visualization: ways to view molecules
• Wireframe
– Often used by
crystallographers while
interpreting data
– Frames fit nicely in electron
density mesh
• Space filled (Van der Waals radii)
– Often used in docking
applications
– Van derWaals radii useful for
thinking about hydrogen
bonds
105
Visualization: ways to view molecules
• Richardson-type (also ribbon or
noodle)
– Depict secondary and tertiary
structure
– Omit details of atoms and
polymerization
• Ball and stick
– Usually more useful for
ligands than for whole
proteins
– Atoms & covalent bonds
106
Molecular viewing software
• Most programs do several types of visualizations
• VRML – Cosmo Player
http://www.karmanaut.com/cosmo/player/
• RASMOL - http://www.openrasmol.org/
• RasTop - http://www.geneinfinity.org/rastop/
• CHIME - http://www.mdl.com/chime/index.html
• Swiss Pdb Viewer - http://www.expasy.ch/spdbv/
• MICE - http://mice.sdsc.edu/
• Many tend to be touchy about browsers and plugins
107
Empirical structure
• X-ray crystallography
• Yields 3-D plot of electron
density in space
• Create model of molecule that
matches electron density
• ~127,863 entries in SwissProt
• ~857,950 entries in TrEMBL
http://crystal.uah.edu/~carter/protein/xray.htm
108
Secondary structure prediction
• Induction: assign structure based on sequence similarity to
proteins of known structure
• Consensus methods do best
• Search database for similar sequences
• Align sequences
• Apply several algorithms (e.g., neural network, nearestneighbor) to predict structure type
• Take consensus of predictions
• JPRED: www.compbio.dundee.ac.uk/~www-jpred/
109
Tertiary structure prediction
• Long segments fold
• Folds are held in place by
molecular forces (e.g.,
electrostatic, hydrogen bonds,
some covalent bonds)
• Proteins fold to minimize energy
• Folding algorithms seek
conformation with minimum
energy
• Two main methods
– Ab initio
– Fold recognition
110
Criteria
• Goal: prediction of molecule position within 1 angstrom
• Remember, location is everything in enzymes
• Measuring quality of fit
– Root mean square of atom distances from correct position
RMSD = √ (∑di2)/N
– Q3 = (true positives + true negatives)/total residues
• Better than 70% right is really good!
111
Fold Recognition (Threading)
•
•
•
•
Impose known folds on molecule; evaluate fit
Dissimilar sequences may fold similarly
Number of possible folds is finite
Many methods of fitting (e.g., dynamic programming, Gibbs
sampling, hidden Markov models)
• Calculate energy or distances
• Web services - many methods
http://cubic.bioc.columbia.edu/predictprotein
• General recommendation: use many methods and build a
consensus
112
Ab Initio methods
•
•
•
•
•
•
L. From the beginning (O. E. D.)
A real ab initio chemist would complain about use of the term
A scoring function is used to judge conformations
Search function used to explore conformational space
Criterion: usually minimize free energy
Scoring function types
– Molecular mechanics calculations
– Use empirically derived scoring functions based on probability
distributions of data in Protein Data Bank
• Search function may be coarse-grained or fine-grained, usually
matches granularity of scoring function
113
Amber
• Assisted Model Building with Energy Refinement
• http://amber.scripps.edu
• Not open source – modest license cost
• D.A. Pearlman, D.A. Case, J.W. Caldwell, W.R. Ross, T.E.
Cheatham, III, S. DeBolt, D. Ferguson, G. Seibel and P.
Kollman. AMBER, a computer program for applying
molecular mechanics, normal mode analysis, molecular
dynamics and free energy calculations to elucidate the
structures and energies of molecules. Comp. Phys. Commun.
91, 1-41 (1995)
• Simulated annealing approach with energy refinements
114
GAMESS
• General Atomic and Molecular Electronic Structure System
• M.W.Schmidt, M.W., K.K.Baldridge, J.A.Boatz, S.T.Elbert,
M.S.Gordon, J.H.Jensen, S.Koseki, N.Matsunaga,
K.A.Nguyen, S.Su, T.L.Windus, M.Dupuis, J.A.Montgomery.
1993. General Atomic and Molecular Electronic Structure
System J. Comput. Chem.14: 1347-63.
• NPACI/SDSC Web portal for GAMESS:
https://gridport.npaci.edu/gamess/
• No-cost academic license
• It’s parallel
115
CHARMM
•
•
•
•
CHARMM (Chemistry at HARvard Molecular Mechanics)
For prediction of macromolecular structure
Not open source - modest license cost
CHARMM: A Program for Macromolecular Energy, Minimization, and
Dynamics Calculations, J. Comp. Chem. 4, 187-217 (1983), by B. R.
Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and
M. Karplus.
• CHARMM: The Energy Function and Its Parameterization with an
Overview of the Program, in The Encyclopedia of Computational
Chemistry, 1, 271-277, P. v. R. Schleyer et al., editors (John Wiley & Sons:
Chichester, 1998), by A. D. MacKerell, Jr., B. Brooks, C. L. Brooks, III, L.
Nilsson, B. Roux, Y. Won, and M. Karplus.
116
Rosetta
• Work with fragments 3-9 amino acids in length
• Restrict conformations of individual fragments to the
distribution of conformations exhibited in real proteins
• Fragment conformations modeled stochastically using
distributions of observed conformations
• Seek array of conformations that minimize energy
• Local minima likely
• Run many searches
• Cluster results & pick large clusters as likely conformations
• Typically licensed at no cost for academic purposes
117
Molecular Docking
• Will two molecules bind?
• Usually interested in docking of small molecules (e.g., drug
candidates) to proteins
• Small molecule called ligand (from Latin ligare - to bind)
• Specific question: will ligand bind to a receptor in a protein?
• Receptor usually the largest pocket in the surface of a protein
• Steps
– Characterize receptor site
– Orient ligand(s) & evaluate
118
Molecular Docking
• Process: usually create negative image of receptor site and ask
whether ligands take that conformation
• Rigid models use a grid search
• Models with flexible surfaces
– Usually assume receptor fixed and ligand flexible.
– Explore conformation of ligand (simulated annealing)
• Models with flexible surfaces do better than those with fixed
surfaces
• Autodock is a commonly used package
– http://www.scripps.edu/pub/olson-web/doc/autodock/
– Flexible model
– Can do simulated annealing and genetic algorithms
119
Systems Biology
120
Systems Biology
• Special issue of Science:
295, Mar. 2002
• Special issue of Nature:
420, Nov. 2002
• “Systems biology is a
new field in biology that
aims at a systems-level
understanding of
biological systems.”
• Nobody’s quite sure what
it is, but it sure is hot!
http://www.ornl.gov/TechResources/Human_Genome/
graphics/slides/images/01-0052_web.gif
Historical approach to biological
experiments
• From Lazebnik, Y. 2002. Cancer cell 2:179: Traditional biological
experimentation much like the process of trying to fix a broken radio (or if
you are or were a 12-year old boy…)
• Some typical steps:
–
–
–
–
Cataloguing components and their attributes
Perturbing the system
Knock-out experiments
Drawing diagrams
• Eventually may find a component that, when replaced, repairs the radio
• In a very complex system, knowing what all of the parts are, and knowing
the function of individual pathways, may still not tell you how the systems
work. It may simply be impossible to deduce this from 1-st order
interactions
• Interactions, multiple changes
– Power supply and other components (well-known PC repair example!)
– Change everything all at once so that we’ll never know what worked!
121
122
Systems Biology
• Systems biology emphasizes close integration of experiment, theory and
computational modeling
• Goal: understanding the structure and dynamics of biological systems,
placing the parts in the context of the dynamic whole
– Studies the complex interactions of many levels of biological
information
– Quantitative, predictive models are central
– Computational modeling in particular is a key tool
• Why model
– You are forced to really state what you are hypothesizing
– Allows you to understand an *approximation* of reality in great detail
• Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds).
• Foundations of systems biology. MIT Press, 2001. Kitano (ed)
123
A small sampling
•
•
•
•
•
•
•
•
•
•
•
•
BALSA
BASIS
BIOCHAM
BioCharon
biocyc2SBML
BioGrid
BioNetGen
BioPathways Explorer
Bio Sketch Pad
BioSPICE Dashboard
BioSpreadsheet
BioUML
•
•
•
•
•
•
•
•
•
•
•
•
•
Cellware
Cytoscape
DBsolve
Dizzy
E-CELL
FluxAnalyzer
Gepasi
INSILICO discovery
Jarnac
JDesigner
JSIM
JWS
Karyote
124
Example - MCell
• MCell is: A General Monte Carlo Simulator of Cellular
Microphysiology. http://www.mcell.cnl.salk.edu/
• MCell focuses on simulations using a Brownian dynamics random
walk algorithm.
• MCell's use to date has been focused on the microphysiology of
synaptic transmission.
• Images and MCell-related material courtesy of Joel R. Stiles,
Pittsburgh SupercomputingCenter and Carnegie Mellon University,
and Thomas M. Bartol, Computational Neurobiology Laboratory,
The Salk Institute.http://www.mcell.cnl.salk.edu/
125
MCell Scalability
Images and MCell-related material courtesy of Joel R. Stiles,
Pittsburgh Supercomputing Center and Carnegie Mellon University,
and Thomas M. Bartol, Computational Neurobiology Laboratory,
The Salk Institute. http://www.mcell.cnl.salk.edu/
126
CompuCell
• CompuCell currently uses a combination of "extended Potts model" for cell
sorting and clustering, and "Schnakenberg Reaction Diffusion" equations to
establish the underlying chemical field to which cells respond and form
typical patterns found in such biological systems as a growing chicken
limb.
• http://www.nd.edu/~icsb/
Image courtesy of James Glazier
http://www.biocomplexity.indiana.edu/software.php
127
SBML
• There is currently a proliferation of software, but no single package
answers all needs
• Systems Biology Markup Language
– Purpose: develop software and standards to
– Enable sharing of simulation & analysis software
– Enable sharing of models
•
•
•
•
Goal: make it easier to share than to reimplement
An XML-based markup language
Active and functional leadership and reasonable funding stream
SBML is focused on biochemical networks, but of all of the biologyoriented markup languages, it seems to be the one with the most traction
• Permits storage, transmission, and reuse of models
• Consists of “levels”
128
What does an SBML model look like?
<?xml version="1.0" encoding="UTF-8" ?>
- <sbml xmlns="http://www.sbml.org/sbml/level1" version="2" level="1"
xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner">
- <model name="ban00010">
- <annotation>
<celldesigner:modelVersion>2.2</celldesigner:modelVersion>
<celldesigner:modelDisplay sizeX="876" sizeY="1177" />
- <celldesigner:listOfCompartmentAliases>
- <celldesigner:compartmentAlias id="ca1" compartment="uVol">
<celldesigner:class>SQUARE</celldesigner:class>
<celldesigner:bounds x="10.0" y="10.0" w="856" h="1157" />
</celldesigner:compartmentAlias>
</celldesigner:listOfCompartmentAliases>
- <celldesigner:listOfSpeciesAliases>
-
129
SBML Levels
• Level 1 – Biochemical networks. Frozen.
• Level 2 – enhancements and extensions to level 1. Frozen June
2003.
– Uses MathML for equation specifications
– Uses same metadatascheme as CellML (exp named function
defs), catalysts, time delays
– Fixes minor issues in Level 1 specification
– Any Level 1 model can be run within software that supports
Level 2
• Level 3 – current development effor
130
Components of a Level 1 or 2 model
• Compartment: a well-stirred container
• Species: chemical compounds
• Reaction: transformation, transport, or binding process
involving a species. May have a rate parameter
• Parameter: a quantity that has a symbolic name (global and
local)
• Unit definition
• Rule: added to set constraints, initial conditions, bounds, etc
on the reactions
• Everything in SBML is one of the above!
131
So you actually want to run one…
• MANY programs will handle a model written in SBML
• libSBML provides a C/C++ API if you want to write your own
• Math SBML – an open source toolbox for running SBML
models within Mathematica
• SBML Toolbox – the equivalent for MatLab
• While an open source toolkit for a proprietary software
package seems odd at first blush…
• There is a KEGG to SBML converter!
132
JWS Online
From http://jjj.biochem.sun.ac.za/
133
134
CellML
• Originally designed to describe and exchange models of
cellular and subcellular processes.
• http://www.cellml.org/public/about/what_is_cellml.html
• XML-based specification of interchange of cell model
information
• Includes:
• Information about model structure
• Math, based on MathML
• Metadata about the model
• Project of Bioengineering Institute of University of Auckland
with support from Physiome Sciences Inc.
135
BioSpice
•
•
•
•
•
Lead by Adam Arkin – a DARPA-backed effort
Described in some detail in two recent issues of “-Omics”
www.biospice.org
More licensing term details than many open source efforts
The BioSpice Dashboard may be one of the better
“integrative” tools under development at present
• Uses SBML for model specification
136
Systems biology URLs
•
•
•
•
•
•
•
•
•
•
SBW & SBML
www.sbw-sbml.org
NetBuilder
strc.herts.ac.uk/bio/Maria/NetBuilder
CellML
www.cellml.org
Jarnac + JDesigner www.cds.caltech.edu/~hsauro
Gepasi
www.gepasi.org
Virtual Cell
www.nrcam.uchc.edu/ (NIH-supported)
E-CELL
www.e-cell.org (based in Japan
JigCell
gnida.cs.vt.edu/~cellcyclepse/
DARPA BioSPICEwww.biospice.org
Karyote
http://biodynamics.indiana.edu/overview/
137
Some Good Books
• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in
genetics. Springer-Verlag, NY. ISBM 0-387-91562-1
• Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological
sequence analysis. Cambridge University Press.
• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics
computer skills. O’Reilly.
• Tisdall, J. 2001. Beginning perl for bioinformatics. O’Reilly.
• Gusfield, D. 1997. Algorithms on strings, trees, and
sequences. Cambridge University Press.
• Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid
computing: making the grid infrastructure a reality. Wiley,
Sussex
138
And a few semi-random other
matters
139
GeneIndex
•
•
•
•
•
Location of initiators,
promoters, etc. a key
question in genomics
First step in this is creating
a dictionary of words of
various lengths (many
possible next steps)
To be useful, analysis must
be performed on entire
genomes at once
GeneIndex finds
frequencies and positions of
all words of a given length
in a DNA sequence.
Visualization with Tcl/Tk.
Genome is broken up into n
sections, where n = number
of processors
After each segment is
analyzed, linked lists are
joined
140
141
GeneIndex Scalability: Speedup
Drosophila
70
Speedup
60
50
40
30
20
10
0
0
20
40
Num ber of CPU
60
80
142
BioPerl
• Duct tape for DNA/protein sequence bioinformatics
• Perl API for
– Reading many data formats/data sources
– Writing many data formats/data sources
– Manipulating data objects in well known ways
• Object oriented Perl module(s)
• As with all frameworks, it can be extremely painful for quick
and dirty applications
• http://bio.perl.org/
• Siblings: BIOPYTHON (www.biopython.org), BIOJAVA
(www.biojava.org)
143
Apple bioclusters
• Apple Xserve clusters: head node plus compute nodes
• Batch & queuing with Platform LSF, Sun Grid Engine, Open
PBS, or PBS Pro
• Parallel API’s: MPICH, MPI Pro, LAM/MPI
• Globus tookit available
• iNquiry Bioinformatics tools available
– Many open source bioinformatics packages pre-compiled
– All available through Pise web interface
• “Easy” system/cluster administration tools
• http://www.apple.com/
144
BIRN
• Biomedical Informatics Research Network
• http://www.nbirn.net/
• NIH-sponsored attempt to create health-oriented
cyberinfrastructure
• Function BIRN – brain function and disorders, e.g.
schizophrenia
• Morphometry BIRN – brain structural disorders, e.g.
Alzheimers
• Mouse BIRN – studying mouse brain and mouse models of
human brain disorders
• Grid technology, using federated data system approach, based
on Globus, SRB, etc.
145
Grid.org
•
•
•
•
Volunteer cycles to handle various problems
Uses United Devices software
Human proteome project – uses Rosetta
Cancer research – does screening against potential projects
Computational biology, biomedical
research, and HPC
• Two challenges:
– Scalability of applications
– Wall-clock time sensitivity
• Bioinformatics, Genomics, Proteomics, ____ics will radically
change understanding of biological function and the way
biomedical research is done.
• Traditional biomedical researchers must take advantage of new
possibilities
• Computer-oriented researchers must take advantage of the
knowledge held by traditional biomedical researchers
146
147
Future directions & needs
• Drug Discovery
– Target generation – so what
– Target verification – that’s important!
– Toxicity prediction – VERY important!! (Cholesterol example)
– Counterintuitive problem: the more personalized a therapy is, the smaller
its target audience!
• Many biologists are unfamiliar with the real possibilities
• Useful applications may require straightforward application of well known
principles, but writing a parallel application that can be used to treat people
is a very difficult challenge
• Attacks on all fronts simultaneously are needed
• Interactive Tera-scale applications might for many biologists be more
valuable right now than Peta-scale applications (even if we had them!)
• Portals and the TeraGrid –> solutions to problems that biologists care about
• Lots of open source codes are out there waiting for you
148
Acknowledgments
•
•
•
Some of the research described herein was supported by the following:\
– The Indiana Genomics Initiative of Indiana University, supported in part by Lilly
Endowment Inc.
– The Indiana METACyt Initiative of Indiana University, supported in part by Lilly
Endowment Inc.
– Shared University Research grants from IBM, Inc. to Indiana University.
– National Science Foundation under Grant No. 0116050 and Grant No. CDA9601632. Any opinions, findings and conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily reflect the views of the
National Science Foundation.
Some of the ideas presented here were developed while the senior author was a visiting
scientist at Höchstleistungsrechenzentrum Universität Stuttgart. Thanks to HLRS and
everyone I have worked with there, especially Michael Resch, Matthias Müller, Peggy
Lindner, Matthias Hess, Rainer Keller, and Edgar Gabriel.
John Herrin, Malinda Lingwall, & W. Leslie Teach assisted with graphics