Computational biology, bioinformatics, and high performance computing Craig A. Stewart stewart@iu.edu Indiana University SC2003 Tutorial 16 November 2003 S14

Computational biology, bioinformatics, and high performance computing Craig A. Stewart [email protected] Indiana University SC2003 Tutorial 16 November 2003 S14

Transcript Computational biology, bioinformatics, and high performance computing Craig A. Stewart [email protected] Indiana University SC2003 Tutorial 16 November 2003 S14

1
Computational biology,
bioinformatics, and high
performance computing
Craig A. Stewart
[email protected]
Indiana University
SC2003 Tutorial 16 November 2003
S14
License terms
•
•
Please cite as: Stewart, C.A. 2003. Computational Biology. Tutorial presented
at SC2003, 15-21 Nov, Phoenix, AZ. http://hdl.handle.net/2022/14000
Some figures are shown here taken from web, under an interpretation of fair
use that seemed reasonable at the time and within reasonable readings of
copyright interpretations. Such diagrams are indicated here with a source url.
In several cases these web sites are no longer available, so the diagrams are
included here for historical value. Except where otherwise noted, by inclusion
of a source url or some other note, the contents of this presentation are © by
the Trustees of Indiana University. This content is released under the Creative
Commons Attribution 3.0 Unported license
(http://creativecommons.org/licenses/by/3.0/). This license includes the
following terms: You are free to share – to copy, distribute and transmit the
work and to remix – to adapt the work under the following conditions:
attribution – you must attribute the work in the manner specified by the author
or licensor (but not in any way that suggests that they endorse you or your use
of the work). For any reuse or distribution, you must make clear to others the
license terms of this work.
2
3
Table of Contents
•
•
•
•
•
•
•
•
•
Class Plan and Objectives
A rapid introduction to key elements of biology
Bioinformatics data sources
Similarity matching
Phylogenetics
RNA and Protein Structure
Systems Biology
Grand challenge problems
Acknowledgements & references
3
11
32
48
95
108
126
140
163
Note: Slides with the Indiana University wordmark in the bottom left corner
were generated at Indiana University, with images sometimes from other
sources. In such cases the url for the source of the image is indicated on the
slide. Slides with a plain white background have been graciously provided
by someone outside IU, and sources are attributed on such slides.
4
Class Plan & Objectives
• Class Plan & Strategy
– Materials focus on open source software (generally not the
presenters own work)
– One critical application will be covered in great depth, and
several others will be reviewed
• Objectives. At the end of the class, participants should:
– understand enough biology to understand key computational
biology problems
– be conversant with current key applications, and current
problems facing bioinformatics and computational biology
– Be familiar with some strategies for collaborating with
biologists and biomedical scientists
5
Motivation
•
•
•
•
•
The “-omics” trend
Finding press pieces about huge computing problems is easy
How many bio codes really scale to hundreds of processors?
What are the coming high performance needs of biologists?
Importance of computational biology and bioinformatics to the
HPC community
• The challenges and promise are real
• Successes and failures so far
– Successes: Protein structure, Genome assembly, Surgical
assistance, Phylogenetics
– Mismatched priorities: Ab initio protein folding
– Not yet successful: Drug discovery
6
What has changed recently?
• Bioinformatics not new
– Protein structure
– Phylogenetics
• What is new is highthroughput sequencing:
– Lots more data
– The possibility of going
from a knowledge of the
DNA sequence to an
understanding of
diseases and health
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
7
Genome Projects Timeline
•
•
•
•
•
•
•
•
•
•
•
1978
1986
1994
1995
1996
1997
1998
1998
1999
2000
2003
First virus (SV40) sequenced (5224 base pairs)
DOE announces Human Genome Initiative
First complete map of all human chromosomes
First living organism sequenced (H. influenzae) 2 Mb
Yeast (S. cerevisiae) - 12 Mb
Intestinal bacterium (E. coli) - 5 Mb
Nematode worm (C. elegans) - 100 Mb
Celera announcement; Public effort regroups
Human Chromosome 22 – 34 Mb
Joint announcement by NHGRI – Celera
“As good as it gets” human genome
This slide based on slide by Manfred D. Zorn
8
Definitions
• Computational Biology: any use of advanced information
technology in the study of biological problems.
• “Bioinformatics applies the principles of information sciences
and techologies to make the vast, diverse and complex life
sciences data mnore understandable and useful” (NIH BISTIC
Committee grants1.nih.gov/grants/bistic/CompuBioDef.pdf)
• Genomics – study of genomes and gene function
• Proteomics – study of proteins and protein function
• ___omics –
9
Challenges
•
•
•
•
Different types of biological data at different scales
Data of varying quality
Much of the underlying biology is not well understood
Prior to the availability of high-throughput sequencing,
scientists could only study small pieces of the genetic
information of any organism.
• Now the entire genome of several organisms has been
completed, but knowing the genome is different than knowing
how it works!
10
Comparison of Complexity
• Physics & Chemistry
• Biology
– 2 elementary particles
– 3B base pairs in humans
– 4 forces
– Min. 30,000 genes in
humans
– 112 elements
– ~1.5M species
– When random events
occur it is often possible
– Individual random
to study average behavior
events important; no law
of large numbers
– Typically ahistoric
(astrophysics an
– Intensely historic,
exception)
heavily contingent
11
Complexity, Con't
•
Chip design
– All components known
– Device physics for
individual components
known
– Itanium has 3 x 10^8
connections and 2 x 10^8
devices
– Unified basic currency
(electrons)
– Computer program
required to understand
•
Cells
– Components not known
– Function of individual
components not known
– # components ~10^13
– No unified basic
currency
– Ecell, Karyote, etc.
attempting to model
cells
12
A rapid introduction to key elements
of biology
Why is it important to know some
biology?
Anopheles gambiae
From www.sciencemag.org/feature/data/
mosquito/mtm/index.html
Source Library:Centers for Disease Control
Photo Credit:Jim Gathany
13
• Would you study numerical
methods without knowing
some mathematics?
• Much current biological
knowledge is very specific
to particular organisms,
genes, or diseases
• If you just wade into the
available data online you
can do some very silly
things.
14
Central dogma of biology
• The central dogma of
biology is that genes act to
create phenotypes through a
flow of information form
DNA to RNA to proteins, to
interactions among proteins
(regulatory circuits and
metabolic pathways), and
ultimately to phenotypes.
Collections of individual
phenotypes constitute a
population (first put forward
by Crick in 1958)
http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
15
Cell Structure
http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
Eukaryotes
• Chromosomes linear
• Introns, exons,
postprocessing
• Nucleus & nuclear wall
• Mictochondria and (in
plants) Chloroplasts
Prokaryotes
• Chromosome circular
• Location is everything
• No nucleus
• No plastids
16
Four (or Five) Bases
• DNA consists of four nucleotides:
Cytosine, Thymine, Adenine, and
Guanine.
• In the double helix, A&T are
always bound, and C&G are
always bound to each other
• RNA consists of four nucleotides
as well: Cytosine, Uracil,
Adenine, and Guanine
• RNA may loop back on itself but
it does not form a double helix
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/structur.gif
17
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/98-647.jpg
18
Genetic Code
Ala Alanine
Arg Arginine
Asn Asparagine
Asp Aspartic acid
Cys Cysteine
Glu Glutamic acid
Gln Glutamine
Gly Glycine
His Histidine
Ile Isoleucine
http://www.ncbi.nlm.nih.gov/Class/MLACourse/
Original8Hour/Genetics/geneticcode.html
Leu Leucine
Lys Lysine
Met Methionine
Phe Phenylalanine
Pro Proline
Ser Serine
Thr Threonine
Trp Tryptophan
Tyr Tyrosine
Val Valine
Translating DNA to RNA and
Transcribing RNA to Proteins
DNA
AAAAAGGAGCAAATT
1
RNA
One possible amino
acid string
2
4
3
6
5
UUUUUCCUCGUUUAA
Phe
Asn
Asp
Ala
19
20
Human Chromosomes
http://www.ncbi.nlm.nih.gov/Class/MLACourse
/Original8Hour/Genetics/cytogenetic.html
http://www.ornl.gov/TechResources/
Human_Genome/graphics/slides/
elsikaryotype.html
21
Sickle Cell
Normal RBC
• GAG codes for Glutamine
• disc-Shaped, soft
• easily flow through small
blood vessels
• lives for 120 days
Sickle RBC
• GTG codes for Valine
• sickle-Shaped, hard
• often get stuck in small
blood vessels
• lives for 20 days or less
Malaria vs. Anemia!
http://www.nlm.nih.gov/medlineplus/
ency/imagepages/1223.htm
22
What is a Gene?
• An inheritable trait associated with a region of DNA that codes
for a polypeptide chain or specifies an RNA molecule which in
turn has an influence on some characteristic phenotype of the
organism.
– Early views: genes lined up on the chromosome like beads
on a string; one gene => one protein
– Examples of genes: color blindness, sickle-cell anemia
– Mendelian genes, Sex-linked genes, Quantitative traits
• Annotation: Extraction, definition, and interpretation of
features on the genome sequence
• Annotations vs. genes:
– Many annotations describe features that constitute a gene.
– Others may not always directly correspond in this way
– An annotation is what we think… nature may disagree!
• Inheritance problem with annotations
23
Gene Components
• Procaryotes
– Location is everything
– Essentially all of the DNA is transcribed (few mitochondrial diseases)
• Eucaryotes
– Non-contiguous pieces of DNA may comprise one gene
– Start sequence (complicated and long)
– Stop Codons – end transcription
– Exons – portions of sequence that are transcribed and used
– Introns – portions of sequence that are not used
• Genes and Chromosomes
– In eukaryotes, an organism has two of each chromosome (in pairs).
– Among sexually reproducing organisms, one chromosome comes from
each parent
– In “simple Mendelian genes” there are two alleles for each gene – one
on each chromosome (e.g. wrinkly)
24
Alternate splicing
http://www.blc.arizona.edu/marty/411/Modules/altsplice.html
A (very) little about evolutionary
genetics
Hardy-Weinberg Law
Parents
Ww
Ww
Offspring
WW
Ww
Ww
ww
Based on this, can you explain why the gene for Sickle Cell Anemia
persists in populations of people in Africa?
25
26
Population genetics & evolution
• Mutations create the raw material
for evolution
• Natural selection and chance
affect the frequency with which
particular genes or DNA
sequences are present in
populations
• Given enough time and enough
change, evolution, speciation, and
so forth happen
• Genes can be ‘fixed’ or
‘maintained in an equilibrium’ in
a population by chance or through
natural selection
http://faculty.wm.edu/bsgran/
27
How do sequences differ?
• Differences in individual bases
CGTACCGTTAATAT
CGTACCGATAATAT
• Bases may be added to a sequence CGTACCCCGTAATAT
CGTACC . .GTAATAT
• Bases may be deleted from a
sequence
CGTACCGTTAATAT
CGTACCG . . .ATAT
28
Random genetic change
• “things happen”
• Molecular clock
– theory – ~ 2% change per million years (2 x 10-9
substitutions per base location per year)
– Practice – a rule of thumb is different than something like
Newton’s 2nd law of motion
• Random change may often be responsible for speciation – e.g.
two populations of birds, separated by a geographic barrier,
may at random eventually develop into two different species
29
Key points (so far)
• Biological processes are complicated; the historicity and complexity
of biological processes and our lack of understanding of many
matters makes biologty an interesting topic!
• The fundamental dogma of molecular biology is that genes act to
create phenotypes through a flow of information form DNA to RNA
to proteins, to interactions among proteins (regulatory circuits and
metabolic pathways), and ultimately to phenotypes. Collections of
individual phenotypes constitute a population.
• DNA consists of four base pairs (ATCG). A is always paired with T;
C always paired with G.
• DNA is translated into RNA. RNA consists of four base pairs as
well (AUCG).
• The linear structure of DNA is transcribed into RNA and then into
proteins. Proteins have their 3D configuration as the basis for their
structure.
30
DNA sequencing
Send in the clones!
• DNA chopped into
blocks
• Blocks inserted into
bacterial cells using
viruses
• The bacterial clones
make lots of copies of
DNA so that you have
something to work with
• The sequence of each
chunk of genetic
material is determined
using gel electrophoresis
31
Sanger
•Cut DNA at
various places (at
T, G, C, A)
•Add a radioactive
molecule at the
end of the DNA
chain
•Find out how
long the chain is
by gel
electrophoresis
•Read off the
sequence
www.ornl.gov/TechResources/
Human_Genome/publicat/primer/
Dye-terminator
Sequencing
www.ornl.gov/TechResources/
Human_Genome/graphics/slides/
images/standardRGB200.jpg
32
Sequence assembly
•
•
•
•
•
Phred – base calling
Phrap – shotgun sequence assembly
Consed – finishing
http://www.phrap.org/
High quality software
33
Bioinformatics data sources
34
Bioinformatics Data Sources
•
•
•
•
There are many
Characteristics vary
There are many ways to organize view of the biological data
A pragmatic approach:
– Biomedical literature sources
– Structured vocabularies
– DNA, RNA, Protein etc. data sources
35
Biomedical literature
• Abstracts of biomedical lit.
largely available online
• Text processing itself is an
interesting problem
• U.S. National Library of
Medicine – NLM Medline
http://www.nlm.nih.gov/
• ~12 million references on
life sciences/biomedicine.
• Covers 1966 to present.
• Citations from over 4,600
journals; most published in
English
36
PubMed
• Standard search tool for
Medline
• http://www.ncbi.nlm.nih.gov/e
ntrez/
• Useful limit terms:
– Gender
– Age Groups
– Human or Animal
– Publication Date
• You can save queries
37
Structured Languages
• NLP or write with agreed-upon terms?
• Three important structured languages:
– MeSH
– GO (Gene Ontology)
– LOINC
38
MeSH
• Medical Subject Heading
• http://www.nlm.nih.gov/mes
h/MBrowser.html
• ~17,000 Thesaurus Terms
• Typically 10-15 used per
article in MedLine; 3-4 as
major points (indicated with
* in PubMed)
• When done right…. the
terms used are the most
specific possible
• There are both advantages
and disadvantages!
39
GO (Gene Ontology)
• http://www.geneontology.org/
• “The goal of the Gene OntologyTM Consortium is to produce
a dynamic controlled vocabulary that can be applied to all
organisms even as knowledge of gene and protein roles in cells
is accumulating and changing.”
• Based on xml file format
• Several browsers (AmiGO, QuickGO, MGO)
• Directed Acyclic Graph (child may have multiple parents)
– ISA (is a)
%
– Part of
<
• Three ontologies
– Molecular function
– Biological processes
– Cellular components
40
Genomic, Proteomic, etc. data sources
• A tremendous amount of data is available through public data
sources via the Web, ftp, or by other means.
• To analyze biological data, we first have to get it….
• Several ways to organize presentation of material – by site, by
type, etc. We will organize by data type.
• Types of Databases:
–
–
–
–
–
–
Chromosomal (http://www.ncbi.nlm.nih.gov/mapview)
DNA/Genes
Protein
Biochemistry and metabolic pathways
Microarray
Web collections
41
Types of genomic data
• Genomic DNA: DNA sequences, typically complete with
coding and noncoding sequences
• GSS: Genome survey sequence. Single pass sequence read
directly from robot.
• mRNA: an RNA sequence from an mRNA molecule. May or
may not cover all of a particular gene
• cDNA: complement DNA – a DNA sequence generated by
conversion of an mRNA sequence
• EST: Expressed Sequence Tag – short cDNA sequences from
studies of cells under particular circumstances. Typically
incomplete.
• SNP – Single Nucleotide Polymorphism
42
DNA databases
• GenBank. Operated by NCBI (National Center for
Biotechnology Information). http://www.ncbi.nlm.nih.gov
• European Molecular Biology Laboratory – Nucleotide
Sequence Database. http://www.ebi.ac.uk/genomes/
• DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp
• All share data daily. Update conflicts avoided by policy.
• Differences are in “value added” and interfaces
http://www.ncbi.nlm.nih.gov
43
44
Data Structures
• Current
– Primary DNA repository data based on ASN.1. Makes
possible linkages among many types of biomedical info.
– The software libraries now often handle XML as well
– Software toolkits and docs available at
http://www.ncbi.nlm.nih.gov/IEB/
• Genbank Flat File format
– http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
• FASTA
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
45
Primary vs. Secondary Data sources
• Primary data sources:
– Genetic sequences in NCBI, EMBL, DDJP
– Protein sequences in PDB
• Secondary data sources:
– Inferred protein sequences (what do we know already about
issues here?)
– Curated data sources
46
Protein Structure
• NCBI (of course…)
• Swiss-Prot/TrEMBL at http://www.expasy.org/
– Note: 125,744 chemically determined vs 861,482 inferred
from automated translation of DNA sequences!!!!!
• Protein Data Base – PDB http://www.rcsb.org/pdb/ - one of
the first online bioinformatics databases!!!
47
Biochemistry and pathways
• Biochemistry
– ENZYME (part of the ExPASy system)
– BIND (part of the NCBI system)
• Pathways
– PathDB http://www.ncgr.org/software/version_2_0.html
– Kegg WIT http://wit.mcs.anl.gov/WIT2/
48
Web Resources - General
• NCBI
http://www.ncbi.nlm.nih.gov/
• EBI Biocatalog
http://www.ebi.ac.uk/biocat/
• IUBio Archive
http://iubio.bio.indiana.edu
http://www.ncbi.nlm.nih.gov/
49
Similarity matching
Why pattern matching (and what are
the problems)
and…
US!
Bonobo
http://www.sandiegozoo.org/special/zoo-featured/pygmy_chimps.html
50
51
Problems!
• For proteins, 95% similarity is ~ identical, 80% similarity is a
lot. Even less similarity than that needed for DNA
• Database techniques inadequate – they are too precise!
• Datasets very large to search
• Homology
• Common ancestry
• Sequence (and usually structure) conservation
• Homology is inferred rather than measured
• Identity
• Objective and well defined
• Can be quantified easily, but not very useful!
• Similarity
• Most common method used, but not as easily defined
52
Alignment
• An alignment is an arrangement of two sequences opposite
one another
• It shows where they are different and where they are similar
• We want to find the optimal alignment - the most similarity
and the least differences
• Alignments have two aspects:
– Quantity: To what degree are the sequences similar (percentage,
other scoring method)
– Quality: Regions of similarity in a given sequence
53
Alignment
• Methods:
– dynamic programming
– Hidden Markov Models
– Pattern matching
• Key problem: keeping the calculation time manageable
• Some alignment packages:
– BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)
– FASTA (http://gcg.nhri.org.tw/fasta.html)
54
Scoring Alignments
GCTAAATTC
++ x x
GC AAGTT
• Matches are good: they get a positive value
• Mismatches are bad: they get a negative value
• Gaps are bad: they get a negative value
– Gap opening penalty
– Gap extension penalty
– Score = Matches –Mismatches
-∑{gap opening penalty +(length)*gap length penalty}
CGTACCGTTAATAT
CGTACCGTTAATAT
CGTACCG . . .ATAT CGT. C . GTT .ATAT
55
Now what?
• Taking a sequence and simply comparing it against all existing
sequences in a database in all possible ways approaches O(N!)
if you do it badly enough. Plus it would be silly.
• So: many algorithms possible
• Algorithms are in some ways the same, and in some ways
different, between DNA and proteins.
• We’ll start with DNA, and not do things in historical order
56
Dotter
• Simple way to get a feel for how
sequences compare to each other.
• Used both with DNA and Protein
sequences
• http://www.cgr.ki.se/cgr/groups/son
nhammer/Dotter.html/
• "A dot-matrix program with
dynamic threshold control suited
for genomic DNA and protein
sequence analysis" Erik L.L.
Sonnhammer and Richard Durbin
Gene 167(2):GC1-10 (1995)
• And now (hopefully) a live demo
• Modular nature of proteins
57
Local Alignments with BLAST
•
•
•
•
•
Basic Linear Alignment Search Tool
We’ll spend a LOT of time with BLAST
First a quick demo (hopefully)
http://www.ncbi.nlm.nih.gov/BLAST
So, what did we do?
– BLAST – Basic Linear Alignment Search Tool
– In particular, BLASTn (for nucleotides)
– Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman,
D.J. 1990. Basic Local alignment search tool. Journal of
Molecular Biology 215:403-410
58
(Original) BLAST Algorithm
• Original algorithm does not permit gaps
• The original BLAST algorithm is a local (heuristic) alignment
tool
• Given a search sequence, e.g. ACGTAGGCATGAA
• BLAST first makes a list of all “words” of a given length that
would possibly have a score of at least T against the search
string.
• In the case of this example there would be (at least) the
following:
– ACGTAGGCATG
– CGTAGGCATGA
– GTAGGCATGAA
59
(Original) BLAST Algorithm, 2
• BLAST takes the list of all words with a score of at least T against the
string one is trying to match…. and then searches a database for any
matches to these words. So if one were using the example and the NR
database, BLAST would search NR for all occurrences of the words:
– ACGTAGGCATG
– CGTAGGCATGA
– GTAGGCATGAA
• Suppose BLAST finds in the NR database an exact match to
– ACGTAGGCATG
• BLAST then attempts to extend the match in both directions
– ACGTAGGCATGA
– ACGTAGGCATGA
• So now we have an exact match of 12 letters
60
(Original) BLAST algorithm,3
• So BLAST keeps going, and in this case would stop at an
exact match of 13 letters (if one existed), since 13 letters was
the entire initial search string:
– ACGTAGGCATGAA
– ACGTAGGCATGAA
• BLAST has a stopping algorithm for dropping particular
search directions, or stopping altogether
61
Scoring of DNA
A
A
C
G
T
R
Y
M
W
S
K
D
H
V
B
N
4
-3
-3
-3
1
-1
1
1
-2
-2
1
1
1
-2
1
4
-3
-3
-1
1
1
-2
1
-2
-2
1
1
1
1
C
4
-3
1
-1
-2
-2
1
1
1
-2
1
1
1
G
T
4
-1 1
1 -3
-2 0
1 0
-2 0
1 0
1 1
1 0
-2 1
1 0
1 0
R
Y
1
0
0
0
0
0
1
0
1
0
M
1
0
0
0
0
1
1
0
0
W
1
0
0
1
1
0
0
0
S
1
0
0
0
1
1
0
K
1
1
0
0
1
0
D
1
0
0
0
0
H
1
0
0
0
V
1
0
0
B
1
0
N
1
62
BLAST algorithm in more detail
•
•
•
•
•
•
•
•
The BLAST algorithm searches for MSPs – Maximal Scoring Pairs – such that the score of
sequences cannot be improved either by lengthening it or shortening it. “Pairs” here refers to
a string – or a substring – of the initial string used as the search string – and one or more
strings or substrings found in a database.
The search starts with the creation of all possible subwords of a given length (default typically
11 for DNA sequences, 3 amino acids for protein sequences) that would score at least T when
matched against the original search string. (T is short for Threshold)
BLAST then goes through the database being searched against looking for any occurrence of
each of these words that have a score of at least T. This is a “hit” – or a “High Scoring Pair
(HSP)”
The search then continues by trying to extend these HSPs.
Suppose “S” is the best score found for a word of length k. BLAST stops trying to extend
words when the score drops a certain amount below the best value S in the previous round.
BLAST continues on and on until it is no longer possible to improve the score of HSPs by
making them longer.
Then it generates a list of the best HSPs. Default is a cutoff E-value of 10
BLAST (original) has an infinite gap penalty
63
BLAST Statistics
• BLAST reports E values rather than P values, but it turns out that when E <
0.01, E~P
• What do we do about the fact that we have done many tests?
• If the sequence is length n, and the total length of the database being
searched is N, then a reasonable approach is to multiply E by N/n
• Edge effects – statistics tend to be conservative for short sequences
• Problems:
– Highly repetitive segments
– Low complexity regions
– Bias in composition
• Solution: low complexity regions can be excluded
64
BLAST Options
•
•
•
•
•
•
Set subsequence (of the submitted sequence)
Choose Database (NB: nr ≠ non redundant!)
Limit by entrez query or select an organism
Choose Filter
Expect Value
Word size (default = 11 for nucleotides)
65
Protein Sequence Alignment
• What most people do most of the time
• DNA sequences are useful for relationships that are close, but
DNA sequences are not nearly as well conserved as Amino
Acid sequences
• Now we need to talk about the characteristics of Amino Acids
and ways to compare what is similar and what is not!
• Amino acids can have similar chemical properties, and similar
functions as part of a protein, without being identical!
66
Point Accepted Mutations (PAM)
• For scoring amino acid sequence
alignments
• Dayhoff, M.O., Schwartz, R.M., Orcutt,
B.C. 1978. "A model of evolutionary
change in proteins." In Atlas of Protein
Sequence and Structure 5(3) M.O.
Dayhoff (ed.), 345 - 352, National
Biomedical Research Foundation,
Washington.
• PAM N corresponds to N mutations in
DNA sequence per 100 amino acids. N
can be greater than 100.
• PAM 250 is most commonly used; PAM
100 is also used. PAM 250 => chains with
~20% identity
• PAM matrix calculator at
www.cmbi.kun.nl/bioinf/tools/pam.shtml
http://www.psc.edu/biomed/training/
tutorials/sequence/db/index.html
67
BLOSUM Matrices
• Henikoff and Henikoff (1992) Proc Natl Acad Sci
89(22):10915-9
• Based on analysis of the BLOCKS database
(http://www.blocks.fhcrc.org/)
• BLOSUM = BLOcks SUM database
• Based on analysis of conserved and variable regions of
proteins Naming convention is different than for PAM
matrices.
• BLOSUMxy is based on likelihood ratios for two chains of
amino acids that are xy% identical
• BLOSUM62 is the ‘typical default’
• PAM250 is roughly equivalent to BLOSUM45
68
PSI BLAST
• Position Specific Iterative BLAST
• http://nar.oupjournals.org/cgi/content/full/25/17/3389
• Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z,
Miller W, Lipman DJ. 1997. Gapped BLAST and PSIBLAST: a new generation of protein database search
programs. Nucleic Acids Res 1997 Sep 1;25(17):3389-402
• Required two non-overlapping similarities with search term to
occur within a certain distance (A) on the genome
• Permits gaps in the alignments
• Can be iterated to allow for user-specified scoring matrices By
default, uses the BLOSUM-62 Matrix
69
PSI BLAST
• In the original
BLAST, the step
of extending the
length of the
‘hits’ took ~90%
of execution time.
• The initial
threshold value T
must be lower
than with the
original BLAST,
but far fewer hits
are pursued,
meaning that the
extension time is
lower
Two hits, T=11 A=40 vs One hit, T=13
http://nar.oupjournals.org/content/
vol25/issue17/images/gka56202.gif
70
http://nar.oupjournals.org/content/vol25/
issue17/images/gka56201.gif
71
Gaps in PSI-Blast
• PSI BLAST seeks alignments with single gaps
• Gaps are sought only when a two-hit score exceeds the value
Sg
• Gaps: handled by using a different gap cost function:
-(a+bk+cj)
a is the cost for opening a gap
b is the per unit cost for the length of the gap
k is the length of the gap
c is the cost per of unaligned sequences in the gap
j is the number of sequences left unaligned
72
Discontinuous MEGA Blast
• Useful especially for identifying diverged DNA sequences
• Uses templates; within the template only those items with “1”s
are compared.
• E.g. 1101101101101101
How many BLASTs?
http://www.ncbi.nlm.nih.gov/BLAST/producttable.html
73
mpiBLAST http://mpiblast.lanl.gov/
74
mpiBLAST Algorithm
• Darling, A.E., L. Carey, W.-C. Feng. 2003. The design,
implementation, and evaluation of mpiBLAST. Presented at
ClusterWorld2003.
http://www.cs.wisc.edu/%7Edarling/mpiblast-cwce2003.pdf
• Algorithm
– Database is segmented. Portions of database are placed on data
storage devices on multiple nodes in a HPC system.
mpiformatdb is a wrapper for the BLAST formatdb program.
Number of subdivisions specified by user
– Foreman/worker algorithm. Portions of the database are assigned
to workers, using a greedy algorithm
75
mpiBLAST performance
• Scaling can be superlinear when pieces are small enough that
they fit into memory
• Scalability limitations due to communication, implicit barrier
before assembly of results
• If pieces of data distributed out to workers are larger than
available RAM, then scaling is still good but not superlinear
• Blast is the most heavily used bioinformatics tool in existence.
Parallelization of BLAST has huge payoff for practicing
biologists
76
Motivation: BLAST with Low Memory
• Standard BLAST running on a system with 128 MB of
memory.
Slide courtesy of Wu-chun Feng
[email protected] Los Alamos National Laboratory
77
mpiBLAST: Low-Memory Performance
• Environment
– 1, 2, or 4 nodes.
– Each node w/ dual
550-MHz CPUs and
128-MB memory.
– Same query and
database used.
• Conclusions
– blastn is I/O bound.
Superlinear speed-up
possible.
– tblastx is CPU bound.
Slide courtesy of Wu-chun Feng
[email protected] Los Alamos National Laboratory
mpiBLAST
on Green Destiny
BLAST Run Time for 300-kB Query against nt
Nodes
Runtime (s)
Speedup over 1 node
1 80774.93
1.00
4
8751.97
9.23
8
4547.83
17.76
16
2436.60
33.15
32
1349.92
59.84
64
850.75
94.95
128
473.79
170.49
The Bottom Line: mpiBLAST reduces search time
from 1346 minutes (or 22.4 hours) to under 8
minutes!
Slide courtesy of Wu-chun Feng
[email protected] Los Alamos National Laboratory
78
Global Alignments: Needleman-Wunsch
Algorithm
• Start at the beginning, end t the end
• Needleman, S.B., and C.D. Wunsch. 1970. A general method
applicable to the search for similarities in the amino acid
sequences of two proteins. J. Mol. Bio. 48: 443-453.
• “The amino acid sequences of a number of proteins have been
compared to determine whether the relationships existing
between them could have occurred by chance. Generally, these
sequences are from proteins having closely related functions
and are so similar that simple visual comparisons can reveal
sequence coincidence….”
79
80
Needleman-Wunsch
•
•
•
•
Amino acid sequences are lined up as column and row headers for a matrix
Ai is the ith amino acid in protein A
Bj is the jth amino acid in protein B
Start with a matrix where the matches between the Ai s and Bj s are 1 of
there is a match, 0 otherwise
• The optimal alignment can be represented as a path through the matrix
• If MATmn is part of a pathway including MATij, the only permissible
relationships are m> i and n>j, or m<I and n<j
• The optimal pathway is found by filling out the matrix from the bottom
right corner towards the upper left, where in each cell you insert the
maximum score arising from an alignment that includes this cell in the
matrix
81
Needleman-Wunsch and Smith-Watermann
• Shortcomings of Needleman-Wunsch?
• Can you think of biological situations in which you might
want to use Needleman-Wunsch?
• Smith-Waterman: similar to Needleman-Wunsch, except
– Requires a penalty for gaps
– Will do partial alignments (e.g. has stopping point)
• Computational requirements
– Original Needleman-Wunsch and Smith Waterman both require
O(N*M) time and O(N*M) memory
– There are improvements of Smith-Waterman that require
O(N*M) time and O(N) space
82
ALIGN
•
•
•
•
•
Simple protein alignment tool
Included in FASTA distributions 2.x, but not 3.x
Still, it’s a nice learning tool
Can be downloaded for Linux or for Windows
Can also be run from web at
http://fasta.bioch.virginia.edu/fasta/align.htm
• Can also be run from web at http://us.expasy.org/tools
83
Protein Alignment with the FASTA family
• FASTA is one of the earliest protein alignment tools, and still
actively maintained
• Pronounced FAST and then a long A
• A local alignment, heuristic tool
• Can be downloaded from
http://www.people.virginia.edu/~wrp/pearson.html
• FASTA family maintained by Prof. William R. Pearson
• Can also be run from Web
84
FASTA Algorithm
• Ktup = word length (2 default; 1 sometimes used)
• FASTA searches for words of length ktup matching between
sequences
• FASTA searches for ungapped regions of a particular length
that have the highest number of identical ktups
• FASTA scores the 10 ungapped alignments that have the
highest number of identical ktups, scoring with a scoring
matrix (default is BLOSUM50)
• FASTA then tests for the ability to merge the ungapped
alignments into a single alignment without dropping the
overall score too much
• FASTA uses the Smith-Waterman algorithm within the local
alignment regions!
85
Multiple Alignment - Clustal-W
• Why do we need to align many different sequences at once?
– Look for highly conserved regions
– Gene searching (of mice and men)
• http://www.ebi.ac.uk/clustalw/
• Thompson et al. 1994. Nucleic Acids Res. 22: 4673-4680
• Heuristic & Progressive
– Begin with 2 sequences
– Add others one-by-one
• Uses profile alignment
– Align sequence with group of aligned sequences
– Align groups of aligned sequences
– Misalignments in conserved regions penalized heavily
86
Example output
FOS_RAT
FOS_MOUSE
FOS_CHICK
FOSB_MOUSE
FOSB_HUMAN
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNS
-MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
-MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
*:..* .:*:: .***** **:.:* * *..***.* :.. :*:
FOS_RAT
FOS_MOUSE
FOS_CHICK
FOSB_MOUSE
FOSB_HUMAN
IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP
IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP
VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVP
VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMP
VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMP
:******:** **********:**:* **... ::. .**.:* :
87
Clustal-W Algorithm
• Construct matrix of distances
– Alignment scores from all pairwise combinations
– Alignments by dynamic programming method
– Alignment scores transformed to evolutionary distances
– Cluster distances into hierarchical tree (neighbor joining)
• Progressively align sequences using tree as a guide
– Begin with closest pair
– Work up tree in order of decreasing similarity
– Use pairwise alignment for pairs
– Use sequence-profile alignment to add sequences to
clusters
– Use profile-profile alignment to join clusters
88
CLUSTAL-W key features
• Sequences weighted to reduce representation bias associated with large
subfamilies (usual sum-of-pairs score problem)
• Substitution matrix used for scoring depends on distance between
sequences.
– BLOSUM80 for near sequences
– BLOSUM50 for distant sequences
• Gap penalties at hydrophobic residues heavier than those at hydrophilic
residues
• Gap penalties also contingent upon exact residue identity at gap site
• Gaps corralled by increasing penalties at sites where gaps are rare when
gaps are common nearby
• When building alignment, low-scoring additions rescheduled to be added
later
89
ClustalW-MPI
• Li, K.-B.2003. ClustalW-MPI: ClustalW analysis using
distributed and parallel computing. Bioinformatics 19: 15851586
• Initial pairwise alignment process is parallelized and scales
very well
• Multiple alignment process is parallelized and scales modestly
• Scaling tests published thus far up to 16 processors, reduces
time from hours to minutes
HMMR
• http://hmmer.wustl.edu/
• Profile HMMs for protein sequence analysis
• A profile is a statistical model of patterns that are likely for
multiple alignments, including variability at various positions
and probabilities of various residues
• Useful when similarities are too faint to be picked up by
BLAST
• Several profiles based on existing alignments exist
• Available as a parallel code using PVM
• Scales reasonably well as regards number of processors. Does
not scale as well as regards size of the biological problem
90
91
GeneIndex
• Location of initiators, promoters, etc. a key question in
genomics
• First step in this is creating a dictionary of words of various
lengths (many possible next steps)
• To be useful, analysis must be performed on entire genomes at
once
• GeneIndex finds frequencies and positions of all words of a
given length in a DNA sequence. Visualization with Tcl/Tk.
92
GeneIndex Parallelization
• Genome is broken up
into n sections, where
n = number of
processors
• After each segment is
analyzed, linked lists
are joined
93
94
GeneIndex Scalability: Processing Time
Drosophila
3000
Time (seconds)
2500
2000
1500
1000
500
0
0
20
40
Num ber of CPU
60
95
GeneIndex Scalability: Speedup
Drosophila
70
Speedup
60
50
40
30
20
10
0
0
20
40
Num ber of CPU
60
80
96
Phylogenetics
97
Building Phylogenetic Trees
• Goal: an objective means
by which phylogenetic trees
can be estimated in
tolerable amounts of wallclock time, producing
phylogenetic trees with
measures of their
uncertainty
• All evolutionary changes
are described as bifurcating
trees
-genes or gene products
-organisms
98
Phylogenetic trees from DNA sequences
• Changes DNA modeled as Markov processes
• Sequences available:
• DNA (sequences are series of the base molecules; aligned
sequences will also contain +s for gaps)
• Amino acid sequences (series of letters indicating the 20
amino acids). Computational challenges more severe than with
DNA sequences.
• RNA
• The availability of data at present exceeds the ability of
researchers to analyze it!
99
Why is tree-building a HPC problem?
• The number of bifurcating
unrooted trees for n taxa is
(2n-5)!/ (n-3)! 2n-3
• for 50 taxa the number of
possible trees is ~1074; most
scientists are interested in
much larger problems
• NP-hard problem
• The number of rooted trees
is (2n-5)!
100
Phylogenetic software
• Phylip. (J. Felsenstein). Collection of software packages that
cover most types of analysis. One of the most popular software
collections. Free.
• PAUP. (D. Swofford). Parsimony, distance, and ML methods.
Also one of the most popular software collections. Not free,
but not expensive.
• fastDNAml. (G. Olsen). Maximum likelihood method for
DNA; becoming one of the more popular ML packages. MPI
version available soon; well suited to tree searching in large
data sets. Free.
• GRAPPA (Bader et al.): Breakpoint analysis program - scales
well
101
Stochastic change of DNA
• Markov process, independent for each site: 4 x 4 matrix for
DNA, 20 x 20 for amino acids
•
A
C
G
T
• A
p(A->A)
p(A->C)
p(A->G) …
• C
p(C->A)
p(C->C)
p(C->G) …
• G
.
• T
.
• Transitions more probable than transversions.
• Must account for heterogeneity in substitution rates among
sites (DNArates – Olsen)
102
fastDNAml
•
•
•
•
Developed by Gary Olsen
Derived from Felsensteins’s PHYLIP programs
One of the more commonly used ML methods
The first phylogenetic software implemented in a parallel
program (at Argonne National Laboratory, using P4 libraries)
• Olsen, G.J.,et al.1994. fastDNAml: a tool for construction of
phylogenetic trees of DNA sequences using maximum
likelihood. Computer Applications in Biosciences 10: 41-48
• MPI version produced by Indiana University in collaboration
with Gary Olsen available from
http://www.indiana.edu/~rac/hpc/fastDNAml/
103
fastDNAml algorithm – adding taxa
• Optimize tree for 3
(randomly chosen) taxa only one topology
possible
• Randomly pick another
taxon –
(2i-5) trees possible
• Keep the best
(maximum likelihood
tree)
Basic fastDNAml algorithm - Branch
rearrangement
• Move any subtree crossing n
vertices (if n=1 there are 2i6 possibilities)
• Keep best resulting tree
• Repeat this step until local
swapping no longer
improves likelihood value
104
105
fastDNAml algorithm con’t: Iterate
•
•
•
•
•
•
•
Get sequence data for next taxon
Add new taxa (2i-5)
Keep best
Local rearrangements (2i-6)
Keep best
Keep going….
When all taxa have been added, perform a full tree check
106
Overview of parallel program flow
•
Program modules
– Master (generates trees,
receives back from Foreman
best tree at each step)
– Foreman (dispatches trees to
workers, determines best tree,
tracks activity of workers)
– Worker
– Monitor (instrumentation)
– Parallel versions include fault
tolerance features (useful in
large clusters and grid
computing)
107
Performance of fastDNAml
70
60
SpeedUp
50
40
30
20
10
0
0
10
20
30
40
50
Number of Processors
Perfect Scaling
50 Taxa
101 Taxa
150 Taxa
60
70
108
Why bother with parallel code?
• Why not just achieve
speedup of n on n
processors by running n
independent jobs?
• Practical benefits of seeing
results quickly
• Parallel program permits
assault on much more
complicated problems (e.g.
protein sequences)
109
RNA & Protein Structure
110
RNA Structure – Vienna RNA
• http://www.tbi.univie.ac.at/~ivo/RNA/
• Package consists of several parts (from the web site):
– RNAfold - predict minimum energy secondary structures and pair
probabilities
– RNAeval - evaluate energy of RNA secondary structures
– RNAheat - calculate the specific heat (melting curve) of an RNA
sequence
– RNAinverse - inverse fold (design) sequences with predefined structure
– RNAdistance - compare secondary structures
– RNApdist - compare base pair probabilities
– RNAsubopt - complete suboptimal folding
http://www.tbi.univie.ac.at/~ivo/RNA/
111
Types of Proteins
• Enzymes- biological catalysts Most of the chemical reactions
which occur in biological systems are catalyzed by enzymes.
• Storage. Various ions, small molecules and other metabolites
are stored by complexing with proteins; for example
haemoglobin carries oxygen.
• Transport. Proteins are involved in the transportation of
particles ranging from electrons to macromolecules.
• Messengers. Proteins are involved in the transmission of
nervous impulses. Hormones play a coordinating role.
• Antibodies. Proteins which bind to specific foreign particles
such as bacteria and viruses.
• Regulation. Enzymes synthesize proteins by translating
sequences of DNA.
• Structural proteins. Mechanical proteins (e.g. collagen)
Proteins – a sparse vocabulary build up
from amino acids
• Average time to fold based on random motion
• Actual folding – small fractions of a second
• Only a small subset of possible amino acid sequences actually
code for a real protein
• Minimization of free energy – the key in real life and in
analysis!
112
113
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html
114
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html
115
Molecular viewing software options
• VRML – Cosmo Player
http://www.karmanaut.com/cosmo/player/
• RASMOL - http://www.openrasmol.org/
• CHIME - http://www.mdl.com/chime/index.html
• Swiss Pdb Viewer - http://www.expasy.ch/spdbv/
• MICE - http://mice.sdsc.edu/
• Many tend to be touchy about browsers and plugins
116
Different ways to view molecules
•
•
•
•
•
Wireframe
Stick
Ball and stick
Space filled (Van der Waals radii)
Some examples:
– http://class.fst.ohio-state.edu/FST822/Images/helix.pdb
– http://www.rcsb.org/pdb/
– http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1
GFL;page=;pid=264201048789105&opt=vrml_default
117
Protein structure determination
• Xray crystallography
• X-ray reflections form a
pattern
• Model the known sequence of
atoms fitting into a 3D
structure so that the reflection
pattern matches the observed
pattern
• Spectroscopic analysis of
molecule structure precise but
still slow!
• ~127,863 entries in SwissProt
• ~857,950 entries in TrEMBL
http://crystal.uah.edu/~carter/protein/xray.htm
118
Protein structure prediction methods
• Knowledge-based methods
– Based on information extracted from existing structures to
estimate structure
• Physico-chemical methods
– “Ab initio” protein structure prediction
• Feature detection methods:
– Look for post-translational modification signals
• Cleavage sites
• Glycosylation sites
• Phosphorylation
• Site for prediction server: http://www.cbs.dtu.dk/services/
119
Protein Structure Prediction
• Key requirement: prediction of molecule position within 1
angstrom
• Measuring quality of fit
– Root mean square of atom distances
RMSD = √ (∑di2)/N
– Q3 = (true positives + true negatives)/total residues
• Better than 70% right is really good!
120
Secondary Structure Prediction
• Secondary – or local –
structure prediction is the
first step in classifying
amino acid sequences
– Alpha helix
– Beta sheet
– coil
http://www.cryst.bbk.ac.uk/PPS95/
course/3_geometry/rama.html
http://www.cryst.bbk.ac.uk/PPS95/
course/3_geometry/helix1.html
Different approaches to tertiary structure
prediction
• Do a sequence alignment to find a protein that is like the
unknown sequence in whole or in part
• Threading
– Thread a molecule on to a guide
– Add sidechains
– Optimize sidechains
• Piecewise reconstrcution
– Estimate the structure of smaller pieces
– Then estimate how they fit together
121
122
SDSC Biology Workbench
• Probably one of the best
overall sites in the US
• http://workbench.sdsc.edu
• Requires registration but
this is relatively painless
• You do need to read the
instructions first…
123
Ab initio methods - Amber
• http://amber.scripps.edu/#ff
• sander: Simulated annealing with NMR-derived energy restraints.
• gibbs: Free energy perturbation (FEP) and thermodynamic integration (TI) ,
and also allows potential of mean force (PMF) calculations.
• roar: Allows mixed quantum-mechanical/molecular-mechanical (QM/MM)
calculations, "true" Ewald simulations, and alternate molecular dynamics
integrators.
• nmode: Normal mode analysis program using first and second derivative
information, used to find search for local minima, perform vibrational
analysis, and search for transition states.
• (from http://amber.scripps.edu/#code)
124
Ab initio methods - GAMESS
• M.W.Schmidt, M.W., K.K.Baldridge, J.A.Boatz, S.T.Elbert,
M.S.Gordon, J.H.Jensen, S.Koseki, N.Matsunaga,
K.A.Nguyen, S.Su, T.L.Windus, M.Dupuis, J.A.Montgomery.
1993. General Atomic and Molecular Electronic Structure
System J. Comput. Chem.14: 1347-63.
• NPACI/SDSC Web portal for GAMESS:
https://gridport.npaci.edu/gamess/
125
Hybrid approaches: Rosetta
• Library of identification of short sequence motifs that correlate strongly
with protein local structural properties.
• Basic idea:
– sequence-dependent local interactions bias segments of the chain
– nonlocal interactions select the lowest free-energy tertiary structures
from the many conformations compatible
– Use protein database and take the distribution of local structures
adopted by short sequence segments (fewer than 10 residues in length)
in known three-dimensional structures
– Put these structures together using non-local interactions
• hydrophobic burial, electrostatics, main-chain hydrogen bonding
and excluded volume.
• Free energy is then minimized to create candidate structures
126
Molecular Docking
•
•
•
•
Key in drug searching
Autodock is a commonly used package
http://www.scripps.edu/pub/olson-web/doc/autodock/
“AutoDock is a suite of automated docking tools. It is
designed to predict how small molecules, such as substrates or
drug candidates, bind to a receptor of known 3D structure.”
(from the web site)
• Nice visualization of an AutoDock docking simulation:
http://wwwcmc.pharm.uu.nl/moret/dockings/home.html
127
Systems Biology
128
Systems Biology
• Special issue of
Science: 295, Mar.
2002
• Special issue of
Nature: 420, Nov.
2002
• Nobody’s quite sure
what it is, but it sure
is hot!
http://www.ornl.gov/TechResources/Human_Genome/
graphics/slides/images/01-0052_web.gif
Historical approach to biological
experiments
• From Lazebnik, Y. 2002. Cancer cell 2:179
• Traditional biological experimentation much like the process
of trying to fix a broken radio
• (Or, for those of us who have experienced either being or
living with a 12-year old boy, the process of breaking a
functioning radio)
• Some typical steps:
– Cataloguing components and their attributes
– Perturbing the system
– Knock-out experiments
– Drawing diagrams
• Eventually may find a component that, when replaced, repairs
the radio
129
130
Issues
• In a very complex system, knowing what all of the parts are,
and knowing the function of individual pathways, may still not
tell you how the systems work. It may simply be impossible to
deduce this from 1-st order interactions
• Interactions, multiple changes
– Power supply and other components (well-known PC repair
example!)
– Change everything all at once so that we’ll never know what
worked!
131
Systems Biology
• Systems biology emphasizes close integration of experiment, theory and
computational modeling
• Goal: understanding the structure and dynamics of biological systems,
placing the parts in the context of the dynamic whole
– Studies the complex interactions of many levels of biological
information
– Quantitative, predictive models are central
– Computational modeling in particular is a key tool
• Why model
– You are forced to really state what you are hypothesizing
– Allows you to understand an *approximation* of reality in great detail
• Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds).
• Foundations of systems biology. MIT Press, 2001. Kitano (ed)
132
Example - MCell
• MCell is: A General Monte Carlo Simulator of Cellular
Microphysiology. http://www.mcell.cnl.salk.edu/
• MCell focuses on simulations using a Brownian dynamics random
walk algorithm.
• MCell's use to date has been focused on the microphysiology of
synaptic transmission.
• Images and MCell-related material courtesy of Joel R. Stiles,
Pittsburgh SupercomputingCenter and Carnegie Mellon University,
and Thomas M. Bartol, Computational Neurobiology Laboratory,
The Salk Institute.http://www.mcell.cnl.salk.edu/
133
MCell Scalability
Images and MCell-related material courtesy of Joel R. Stiles,
Pittsburgh Supercomputing Center and Carnegie Mellon University,
and Thomas M. Bartol, Computational Neurobiology Laboratory,
The Salk Institute. http://www.mcell.cnl.salk.edu/
134
M-Cell
• Uses MDL (Model Description
Language (MDL), designed with
biologically-oriented users in
mind.
• Embarrassingly parallel Monte
Carlo application
• Supports checkpointing!
Images and MCell-related material
courtesy of Joel R. Stiles,
Pittsburgh Supercomputing Center and
Carnegie Mellon University,
and Thomas M. Bartol, Computational
Neurobiology Laboratory,
The Salk Institute.
http://www.mcell.cnl.salk.edu/
135
CompuCell
• CompuCell currently uses a combination of "extended Potts model" for cell
sorting and clustering, and "Schnakenberg Reaction Diffusion" equations to
establish the underlying chemical field to which cells respond and form
typical patterns found in such biological systems as a growing chicken
limb.
• http://www.nd.edu/~icsb/
Image courtesy of James Glazier
http://www.biocomplexity.indiana.edu/software.php
136
Karyote
• Information theory approach - construction of probability for
parameters so that uncertainty in their estimation is assessed.
• The incompleteness of model is addressed via a probability
functional approach for computing the time-dependence of the
concentration of key enzymes
• Small features such as ribosomes or viruses behave in ways
that rely on their atomic scale structure but which take part in
the overall (macroscopic) balance of metabolic reaction and
transport. “Zones” may be treated in more detail via the
solution of mesoscopic models using finite element methods.
• Can be run over web at
http://biodynamics.indiana.edu/overview/
137
Issue: Getting Tools to Interoperate
• There is currently a proliferation of software, but no single
package answers all needs
• No single tool is likely to do so in the near future
• But: problems with using multiple packages
• One effort to address this problem:
– Systems Biology Workbench Project
• Purpose: develop software and standards to
– Enable sharing of simulation & analysis software
– Enable sharing of models
• Goal: make it easier to share than to reimplement
138
The Systems Biology Workbench Project
• http://www.sbw-sbml.org/
• Simple framework for
application interaction.
• Cross-platform compatible &
language-neutral
Script
Interpreter
Visual
Editor
Database
Interface
Stochastic
Simulator
• Modules are separately
compiled executables. A
module defines services which
have methods
• SBW native-language libraries
provide APIs.
• SBW Broker acts as coordinator
SBW
ODE-based
Simulator
139
CellML
• http://www.cellml.org/public/about/what_is_cellml.html
• XML-based specification of interchange of cell model
information
• Includes:
• Information about model structure
• Math, based on MathML
• Metadata about the model
• Project of Bioengineering Institute of University of Auckland
with support from Physiome Sciences Inc.
140
Systems biology URLs
•
•
•
•
•
•
•
•
•
•
SBW & SBML
www.sbw-sbml.org
NetBuilder
strc.herts.ac.uk/bio/Maria/NetBuilder
CellML
www.cellml.org
Jarnac + JDesigner www.cds.caltech.edu/~hsauro
Gepasi
www.gepasi.org
Virtual Cell
www.nrcam.uchc.edu/
E-CELL
www.e-cell.org
JigCell
gnida.cs.vt.edu/~cellcyclepse/
DARPA BioSPICEwww.biospice.org
Karyote
http://biodynamics.indiana.edu/overview/
141
Grand challenge problems and some
thoughts about the future
142
Modeling Heart Function
• Based on Noble, D. 2002. Modeling the heart – from genes to
cells to the whole organ. Science 295: 1678-1682
• Two mutations known for sodium channels
– DeltaKPQ – deletion of 3 amino acids (lysine-prolineglutamine) – causes persistent sodium flow through cell
wall
– Missense mutations in sodium channels which cause
ventricular fibrulations that can be fatal
• Models of heart function can produce counterintuitive
predictions
• Grand challenge problem: the full scale reconstruction of a
heart attack
143
Real-time fMRI
3.0T MRI Scanner
CRAY T3E
SGI Onyx
In 1996, this required a supercomputer
Today, it’s routine
Slide courtesy of Ralph Roskies,
Pittsburgh Supercomputing Center,
[email protected]
144
Gamma Knife
• Used to treat inoperable
tumors
• Treatment methods
currently use a standardized
head model
• UITS is working with IU
School of Medicine to adapt
Penelope code to work with
detailed model of an
individual patient’s head
145
PENELOPE Basics
• “PENELOPE performs Monte Carlo simulation of coupled
electron-photon transport in arbitrary materials and complex
quadric geometries”
(http://www.nea.fr/abs/html/nea-1525.html)
• Improvement of targeting based on CT scans of patient’s head
– 200 512 x 512 voxel slices
• Simulation takes ~7 hours using a serial version of
PENELOPE running on a 1 GHz PIII Windows system
• Goal: 5 minutes to one hour
Parallelization of
PENELOPE
• Each processor:
– Views entire target
– Generates its own random
numbers
– Generates a set number of
independent trajectories
– Accumulates data
• Process 0:
– Collects the raw data
– Computes desired results
• Uses F90 for parallel random
number generator from MILC
consortium
• Uses MPI elsewhere
146
147
PENELOPE Scalability: processing time
100000
Total Wallclock Time (sec.)
10000
1000
100
10
1
0
50
100
150
200
Number of Processors
On IBM SP/Power3
250
300
148
PENELOPE Scalability: Speedup
300
250
Speedup
200
150
100
50
0
0
50
100
150
200
# of Processors
250
300
Some very boring Vampir traces of
PENELOPE
149
150
“Simulation-only” studies
• Aquaporins -proteins which conduct large volumes of water
through cell walls while filtering out charged particles like
hydrogen ions.
• Massive simulation (35,000 hours TCS) showed that water
moves through aquaporin channels in single file. Oxygen leads
the way in. Half way through, the water molecule flips over.
• That breaks the ‘proton wire’
• Work done at Pittsburgh Supercomputing Center
• Klaus Schulten et al, U. of Illinois, SCIENCE (April 19, 2002)
Other example large-scale
computational biology grid projects
• Department of Energy “Genomes to Life”
http://doegenomestolife.org/
• Encyclopedia of Life (http://eol.sdsc.edu/)
• Biomedical Informatics Research Network (BIRN)
http://birn.ncrr.nih.gov/birn/
• Asia Pacific BioGrid (http://www.apbionet.org/)
• eDiamond – breast cancer/mammography grid
(http://www.mirada-solutions.com/PH1.asp?PAGE_ID=739)
151
152
Visualization: OpenDX
• http://www.opendx.org/
• OpenDX is the open source
software version of IBM's
Visualization Data Explorer
Product
• Good sources of information
in books, tutorials, etc.
• Interesting example of open
source
• Animations as well
http://www.opendx.org/highlights.php
153
Visualization: SciRUN
• Some of the most dramatic biological visualizations ever done
• Has been used for surgical support
• Scientific Computing and Imaging Institute – Christopher R.
Johnson
• http://www.sci.utah.edu/
154
Genomes to Life
• http://www.doegenomestolife.org/
• Goals:
– Identify and Characterize the Molecular Machines of Life — the
Multiprotein Complexes That Execute Cellular Functions and
Govern Cell Form
– Characterize Gene Regulatory Networks
– Characterize the Functional Repertoire of Complex Microbial
Communities in Their Natural Environments at the Molecular
Level
– Develop the Computational Methods and Capabilities to
Advance Understanding of Complex Biological Systems and
Predict Their Behavior
– (Goals taken directly from Genomes to Life web site)
155
EOL Basic Topology
Genomic Data
Putative Functional and 3D Assignment
Integration with Other Resources
Public and Private Databases
To Serve Thousands Worldwide
http://eol.sdsc.edu/methodology.html
Current Genomic Pipeline
sequence info
structure info
NR, PFAM
SCOP, PDB
Building FOLDLIB:
PDB chains
SCOP domains
PDP domains
CE matches PDB vs. SCOP
90% sequence non-identical
minimum size 25 aa
coverage (90%, gaps <30, ends<30)
156
Arabidopsis Protein sequences
Prediction of :
signal peptides (SignalP, PSORT)
transmembrane (TMHMM, PSORT)
coiled coils (COILS)
low complexity regions (SEG)
Create PSI-BLAST profiles for Protein sequences
Structural assignment of domains by
PSI-BLAST on FOLDLIB
Only sequences w/out A-prediction
Structural assignment of domains by
123D on FOLDLIB
Only sequences w/out A-prediction
Functional assignment by PFAM, NR,
PSIPred assignments
FOLDLIB
Domain location prediction by sequence
http://eol.sdsc.edu/methodology.html
Store assigned regions in the DB
Scale of Multi-genome Analysis
sequence info
structure info
NR, PFAM
SCOP, PDB
Building FOLDLIB:
PDB chains
SCOP domains
PDP domains
CE matches PDB vs. SCOP
90% sequence non-identical
minimum size 25 aa
coverage (90%, gaps <30, ends<30)
104
entries
157
~800 genomes
@ 10k-20k per
=~107 ORF’s
Genomes Protein sequences
Prediction of :
signal peptides (SignalP, PSORT)
transmembrane (TMHMM, PSORT)
coiled coils (COILS)
low complexity regions (SEG)
Create PSI-BLAST profiles for Protein sequences
Structural assignment of domains by
PSI-BLAST on FOLDLIB
4 CPU
years
228 CPU
years
3 CPU
years
Only sequences w/out A-prediction
Structural assignment of domains by
123D on FOLDLIB
9 CPU
years
Only sequences w/out A-prediction
Functional assignment by PFAM, NR,
PSIPred assignments
FOLDLIB
Domain location prediction by sequence
http://eol.sdsc.edu/methodology.html
252 CPU
years
3 CPU
years
Store assigned regions in the DB
158
BIRN
• Biomedical Informatics Research Network
• http://www.nbirn.net/
• NIH-sponsored attempt to create health-oriented
cyberinfrastructure
• Function BIRN – brain function and disorders, e.g.
schizophrenia
• Morphometry BIRN – brain structural disorders, e.g.
Alzheimers
• Mouse BIRN – studying mouse brain and mouse models of
human brain disorders
• Grid technology, using federated data system approach, based
on Globus, SRB, etc.
159
Drug Design
•
•
•
•
•
Target generation – so what
Target verification – that’s important!
Toxicity prediction – VERY important!!
(Cholesterol example)
Counterintuitive problem: the more personalized a therapy is,
the smaller its target audience!
What is the killer application in
computational biology?
• Systems biology – latest buzzword, but….
• Goal: multiscale modeling from cell chemistry up to multiple
populations
• Current software tools still inadequate
• Multiscale modeling calls for use of established HPC
techniques – e.g. adaptive mesh refinement, coupled
applications
• Current challenge examples: actin fiber creation, heart attack
modeling
• Opportunity for predictive biology?
160
Computational biology, biomedical
research, and HPC
• Two challenges:
– Scalability of applications
– Wall-clock time sensitivity
• Bioinformatics, Genomics, Proteomics, ____ics will radically
change understanding of biological function and the way
biomedical research is done.
• Traditional biomedical researchers must take advantage of new
possibilities
• Computer-oriented researchers must take advantage of the
knowledge held by traditional biomedical researchers
161
162
Peta-Scale applications?
• Is this what most biologist really need?
• Many biologists are unfamiliar with the real possibilities
• Useful – even lifesaving – applications may require
straightforward application of well known principles.
• The low hanging fruit taste just fine. e.g. “Parallel” Matlab,
GeneIndex, batch scripts
(www.indiana.edu/~rac/bioinformatics/iubatchscripts.html)
• Writing a parallel application that can be used to treat people is
a very difficult challenge
• Attacks on all fronts simultaneously are needed
• Interactive Tera-scale applications might for many biologists be
more valuable right now than Peta-scale applications (even if
we had them!)
• All of these open source codes are out there waiting for you to
parallelize and/or tune them!
So how do you find biologists with
whom to collaborate?
•
•
•
•
Chicken and egg problem?
Or more like fishing?
Or bank robbery?
Willie Sutton, a famous American bank robber, was asked why he
robbed banks, and reportedly said “because that's where the money
is.” (This is, sadly, an urban legend: Sutton never said this)
• Cultivating collaborations with biologists in the short run will
require:
– Active outreach
– Different expectations than we might usually have
– Patience
• There are lots of opportunities open for HPC centers willing to take
the effort to cultivate relationships. To do this, we’ll all have to
spend a bit of time “going where the biologists are.”
163
164
Acknowledgments
•
•
•
•
•
•
•
•
Some of the research described herein was supported in part by the Indiana Genomics
Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly
Endowment Inc.
Some of the research described herein was supported in part by Shared University Research
grants from IBM, Inc. to Indiana University.
Some of the material described herein is based upon work supported by the National Science
Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings
and conclusions or recommendations expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National Science Foundation (NSF).
Some of the ideas presented here were developed while the senior author was a visiting
scientist at Höchstleistungsrechenzentrum Universität Stuttgart. The support and collaboration
of HLRS and Michael Resch, Matthias Müller, Peggy Lindner, Matthias Hess, and Rainer
Keller are gratefully acknowledged.
Thanks to UITS Research and Academic Computing Division managers: Mary Papakhian,
Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan
Lakshmipathy, David Hancock
UITS Senior Management: Associate Vice President and Dean (Retired) Christopher Peebles,
RAC(Data) Director Gerry Bernbom, Associate Vice President and Dean Bradley Wheeler
Assistance with this presentation: John Herrin, Malinda Lingwall, W. Les Teach
165
Some Good Books
• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in
genetics. Springer-Verlag, NY. ISBM 0-387-91562-1
• Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological
sequence analysis. Cambridge University Press.
• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics
computer skills. O’Reilly.
• Tisdall, J. 2001. Beginning perl for bioinformatics. O’Reilly.
• Gusfield, D. 1997. Algorithms on strings, trees, and
sequences. Cambridge University Press.
• Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid
computing: making the grid infrastructure a reality. Wiley,
Sussex

Computational biology, bioinformatics, and high performance computing Craig A. Stewart [email protected] Indiana University SC2003 Tutorial 16 November 2003 S14

Transcript Computational biology, bioinformatics, and high performance computing Craig A. Stewart [email protected] Indiana University SC2003 Tutorial 16 November 2003 S14

Directory