Introduction to Bioinformatics
ChBi406/506
Ozlem Keskin
Many of today’s slides are from gersteinlab.org/courses/452
and from Bioinformatics and Functional Genomics
by Jonathan Pevsner (ISBN 0-471-21004-8).
EMails
Ozlem Keskin
[email protected]
Engin Cukuroglu
[email protected]
Who is taking this course?
• People with very diverse backgrounds in
biology, chemical engineering (MS/BS)
• People with diverse backgrounds in computer
science -please visit Attila Hoca’s office!
• Most people have a favorite gene, protein, or disease
What are the goals of the course?
• To provide an introduction to bioinformatics with
a focus on the National Center for Biotechnology
Information (NCBI) and EBI
• To focus on the analysis of DNA, RNA and proteins
• To introduce you to the analysis of genomes
• To combine theory and practice to help you
solve research problems
Themes throughout the course
Textbooks
Web sites
Literature references
Gene/protein families
Computer labs
Textbook
The course textbook is J. Pevsner, Bioinformatics and
Functional Genomics (Wiley, 2009).
Several other bioinformatics texts are available:
Baxevanis and Ouellette
Mount
Durbin et al.
Lesk
In our library you will find (e-book):
Bioinformatics [electronic resource]: sequence and genome analysis / David W. Mount.
Imprint: Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press, c2001.
Bioinformatics: a practical guide to the analysis of genes and proteins / edited by
Andreas D. Baxevanis, B.F. Francis Ouellette. Imprint: Hoboken, N.J.: John Wiley, 2005.
(SOON)
Themes throughout the course:
Literature references
You are encouraged to read original source
articles. Although articles are not required,
they will enhance your understanding of the
material.
You can obtain articles through PubMed
and Web of Science.
Web sites
The course website is reached via:
http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm
(or Google “pevsnerlab” -> courses)
This site contains the powerpoints for each lecture.
The textbook website is:
http://www.bioinfbook.org
This has 1000 URLs, organized by chapter
This site also contains the same powerpoints.
You will also find the lecture slides at F-folder.
Grading
Midterm: 30%
Final: 35%
HWs: 20%
Project: 15%
(might change, the course will evolve)
Themes throughout the course:
gene/protein families
We will use beta globin and retinol-binding protein 4
(RBP4) as model genes/proteins throughout the course.
Globins including hemoglobin and myoglobin carry
oxygen. RBP4 is a member of the lipocalin family. It is a
small, abundant carrier protein. We will study globins and
lipocalins in a variety of contexts including
• --sequence alignment
• --gene expression
• --protein structure
• --phylogeny
• --homologs in various species
The HIV-1 pol gene encodes three proteins
• Aspartyl protease (PR)
• Reverse transcriptase (RT)
• Integrase (IN)
Outline for today (chapters 1 and 2)
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Four ways to find information on proteins and DNA
Access to biomedical literature
Bioinformatics = Biological Data + Computer Calculations
What is bioinformatics?
• Interface of biology and computers
• Analysis of proteins, genes and genomes
using computer algorithms and
computer databases
• Genomics is the analysis of genomes.
The tools of bioinformatics are used to make
sense of the billions of base pairs of DNA
that are sequenced by genomics projects.
Protein coordinates, DNA array data, annotated gene sequences
Biological information is nowadays generated in parallel.
We can easily run 10,000 simultaneous experiments on a single DNA microarray.
To cope with this much data we really need computers.
So bioinformatics is the field that combines biology and computers.
Where does Bioinformatics come from?
Data from the Human Genome Project (HGP) has fueled the
development of new bioinformatics methods
What is Bioinformatics?
• (Molecular) Bio - informatics
• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of
molecules (in the sense of physical-chemistry) and then
applying “informatics” techniques (derived from
disciplines such as applied math, CS, and statistics) to
understand and organize the information associated with
these molecules, on a large-scale.
Top ten challenges for bioinformatics
[1] Precise models of where and when transcription
will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways;
ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein
recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies:
systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney,
Chris Burge, Jim Fickett
Simulating the cell
On bioinformatics
“Science is about building causal relations between natural
phenomena (for instance, between a mutation in a gene and
a disease). The development of instruments to increase our
capacity to observe natural phenomena has, therefore,
played a crucial role in the development of science - the
microscope being the paradigmatic example in biology. With
the human genome, the natural world takes an
unprecedented turn: it is better described as a sequence of
symbols. Besides high-throughput machines such as
sequencers and DNA chip readers, the computer and the
associated software becomes the instrument to observe it,
and the discipline of bioinformatics flourishes.
However, as the separation between us (the observers) and
the phenomena observed increases (from organism to cell
to genome, for instance), instruments may capture
phenomena only indirectly, through the footprints they leave.
Instruments therefore need to be calibrated: the distance
between the reality and the observation (through the
instrument) needs to be accounted for. This issue of
Genome Biology is about calibrating instruments to observe
gene sequences; more specifically, computer programs to
identify human genes in the sequence of the human
genome.”
Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1,
introducing EGASP, the Encyclopedia of DNA Elements (ENCODE)
Genome Annotation Assessment Project
[Diagram: bioinformatics alongside medical informatics and public health informatics; tool-users and tool-makers (algorithms, databases, infrastructure)]
Three perspectives on bioinformatics
The cell
The organism
The tree of life
(Page 4)
[Figure: the tree of life, after Pace NR (1997) Science 276:734. Page 6]
[Figure: time of development; body region, physiology, pharmacology, pathology. Page 5]
[Figure: DNA -> RNA -> protein -> phenotype. Page 5]
Growth of GenBank (Fig. 2.1, Page 17)
[Figure: sequences (millions) and base pairs of DNA (billions) by year, 1982 to 2002. Updated 8-12-04: >40b base pairs]
Growth of GenBank
[Figure: base pairs of DNA (billions, axis 0 to 70) and sequences (millions), December 1982 to June 2006]
Growth of the International Nucleotide Sequence Database Collaboration
[Figure: base pairs of DNA (billions) contributed by GenBank, EMBL, DDBJ]
http://www.ncbi.nlm.nih.gov/Genbank/
Central dogma of molecular biology: DNA -> RNA -> protein
Central dogma of bioinformatics and genomics: genome -> transcriptome -> proteome
What is the Information?
Molecular Biology as an Information Science
• Central Dogma of Molecular Biology
DNA -> RNA -> Protein -> Phenotype -> DNA
• Molecules
– Sequence, Structure, Function
• Processes
– Mechanism, Specificity, Regulation
• Central Paradigm for Bioinformatics
Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
• Large Amounts of Information
– Standardized
– Statistical
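The DNA -> RNA -> Protein steps of the central dogma can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and a coding-strand sequence read from its first base; `transcribe` and `translate` are hypothetical helper names, not part of any course software.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        aa = CODON_TABLE[codon]  # raises KeyError on non-ACGU characters
        if aa == "*":            # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate(transcribe("atggcaattaaaatt"))` returns `"MAIKI"`. Real gene finding also has to choose the strand and reading frame, which this sketch does not attempt.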
• Most cellular functions are performed or facilitated by proteins.
– Primary biocatalyst
– Cofactor transport/storage
– Mechanical motion/support
– Immune protection
– Control of growth/differentiation
• DNA and RNA
– Genetic material
– Information transfer (mRNA)
– Protein synthesis (tRNA/mRNA)
– Some catalytic activity
(idea from D Brutlag, Stanford, graphics from S Strobel)
• Proteins fold into 3D structures with specific functions, which are reflected in a
phenotype. These functions are selected in a Darwinian sense by the environment of the
phenotype, which drives the evolution of the DNA sequence.
• Many bioinformatics techniques address this flow of molecular biology
information inside the organism, hoping to understand the organization and
control of genes, and even to predict protein structure from sequence.
• A second flow of information that bioinformatics seeks to address is
the large amount of data generated by new high-throughput methods.
Bioinformatics owes its livelihood to the availability of large data sets that
are too complex to allow manual analysis.
[Figure (Fig. 2.2, Page 20): DNA -> RNA -> protein -> phenotype, with the matching databases. DNA: genomic DNA databases; RNA: cDNA, ESTs, UniGene; protein: protein sequence databases]
There are three major public DNA databases (Page 16)
• EMBL, housed at EBI (European Bioinformatics Institute)
• GenBank, housed at NCBI (National Center for Biotechnology Information)
• DDBJ, housed in Japan
The underlying raw DNA sequences are identical
>100,000 species are represented in GenBank
all species    128,941
viruses        6,137
bacteria       31,262
archaea        2,100
eukaryota      87,147
(Table 2-1, Page 17)
The most sequenced organisms in GenBank
Homo sapiens (6.9 million entries)
Mus musculus (5.0 million)
Zea mays (896,000)
Rattus norvegicus (819,000)
Gallus gallus (567,000)
Arabidopsis thaliana (519,000)
Danio rerio (492,000)
Drosophila melanogaster (350,000)
Oryza sativa (221,000)
National Center for Biotechnology
Information (NCBI)
www.ncbi.nlm.nih.gov
Taxonomy nodes at NCBI
8/06
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
The most sequenced organisms in GenBank
Updated 8-12-04, GenBank release 142.0 (Table 2-2, Page 18)
Homo sapiens              10.7 billion bases
Mus musculus              6.5b
Rattus norvegicus         5.6b
Danio rerio               1.7b
Zea mays                  1.4b
Oryza sativa              0.8b
Drosophila melanogaster   0.7b
Gallus gallus             0.5b
Arabidopsis thaliana      0.5b
The most sequenced organisms in GenBank
Updated 8-29-05, GenBank release 149.0 (Table 2-2, Page 18)
Homo sapiens              11.2 billion bases
Mus musculus              7.5b
Rattus norvegicus         5.7b
Danio rerio               2.1b
Bos taurus                1.9b
Zea mays                  1.4b
Oryza sativa (japonica)   1.2b
Xenopus tropicalis        0.9b
Canis familiaris          0.8b
Drosophila melanogaster   0.7b
The most sequenced organisms in GenBank
Updated 7-19-06, GenBank release 154.0 (Table 2-2, Page 18)
Homo sapiens                    12.3 billion bases
Mus musculus                    8.0b
Rattus norvegicus               5.7b
Bos taurus                      3.5b
Danio rerio                     2.5b
Zea mays                        1.8b
Oryza sativa (japonica)         1.5b
Strongylocentrotus purpuratus   1.2b
Sus scrofa                      1.0b
Xenopus tropicalis              1.0b
Molecular Biology Information: DNA
• Raw DNA Sequence
– Coding or Not?
– Parse into genes?
– 4 bases: AGCT
– ~1 K in a gene, ~2 M in genome
– ~3 Gb Human
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca
gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac
atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg
aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca
gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc
ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact
ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca
ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca
tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct
gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt
gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc
aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact
gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca
gacgctggtatcgcattaactgattctttcgttaaattggtatc . . .
. . .
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg
cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt
gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc
aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc
ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa
aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
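Simple composition statistics are often the first look at a raw sequence like the one above. A minimal sketch, using only the first line of that sequence; `base_counts` and `gc_content` are illustrative names, not functions from a particular library.

```python
from collections import Counter


def base_counts(dna: str) -> Counter:
    """Count A, C, G and T in a DNA string, ignoring case and other characters."""
    return Counter(b for b in dna.upper() if b in "ACGT")


def gc_content(dna: str) -> float:
    """Fraction of counted bases that are G or C (0.0 for an empty sequence)."""
    counts = base_counts(dna)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


# First 60 bases of the sequence shown above.
fragment = "atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca"
print(base_counts(fragment))
print(f"GC fraction: {gc_content(fragment):.2f}")
```

Deciding "coding or not" needs much more than composition, but counts like these are the raw material for the statistical methods discussed later.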
Molecular Biology Information: Protein Sequence
• 20 letter alphabet
– ACDEFGHIKLMNPQRSTVWY
but not BJOUXZ
• Strings of ~300 aa in an average protein (in bacteria),
~200 aa in a domain
• >1M known protein sequences (uniprot)
d1dhfa_  LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__  LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_  ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__  TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

d1dhfa_  LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__  LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_  ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
d3dfr__  TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_  VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__  VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_  ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
d3dfr__  ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV

d1dhfa_  -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__  -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_  -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA
d3dfr__  -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV
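One simple number that can be read off an alignment like the one above is the percent identity between two rows. The sketch below assumes the rows are already aligned (equal length, "-" for gaps) and counts identical residues over the columns where neither row has a gap. Other denominators are also in use, so treat that convention as an assumption; `percent_identity` is an illustrative name.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of gap-free aligned columns in which both rows share a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either row
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# First block of the alignment above: d1dhfa_ versus d8dfr__
d1dhfa = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI"
d8dfr = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI"
print(f"{percent_identity(d1dhfa, d8dfr):.1f}% identical")
```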
Molecular Biology Information:
Macromolecular Structure
• DNA/RNA/Protein
– Almost all protein
(RNA Adapted From D Soll Web Page,
Right Hand Top Protein from M Levitt web page)
Molecular Biology Information: Whole Genomes
• The Revolution Driving Everything
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E.
F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J.
M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D.,
Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F.,
Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R.,
Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D.,
Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L.,
McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C.
(1995). "Whole-genome random sequencing and assembly of Haemophilus
influenzae Rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
• Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes for yeast
1998, worm: ~100Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 100 K genes...
“Genome sequences now accumulate so quickly that, in less than a week, a single
laboratory can produce more bits of data than Shakespeare managed in a
lifetime, although the latter make better reading.”
-- G A Petsko, Nature 401: 115-116 (1999)
Genomes highlight the Finiteness of the “Parts” in Biology
1995: Bacteria, 1.6 Mb, ~1600 genes [Science 269: 496]
1997: Eukaryote, 13 Mb, ~6K genes [Nature 387: 1]
1998: Animal, ~100 Mb, ~20K genes [Science 282: 1945]
2000?: Human, ~3 Gb, ~100K genes [???]
[Figure labels: “real thing, Apr ‘00”; “‘98 spoof”]
Other Types of Data
• Gene Expression
– Early experiments yeast
– Now tiling array technology
• 50 M data points to tile the human genome at ~50 bp res.
– Can only sequence genome once but can do an infinite
variety of array experiments
• Phenotype Experiments
• Protein Interactions
– For yeast: 6000 x 6000 / 2 ~ 18M possible interactions
– maybe 30K real
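The estimate of possible yeast interactions is the number of unordered protein pairs. A quick check of that arithmetic, and of the tiling-array figure, using the round numbers from the slide (6000 proteins, a 3 Gb genome, 50 bp resolution):

```python
from math import comb

n_proteins = 6000                       # approximate yeast gene count from the slide
possible_pairs = comb(n_proteins, 2)    # unordered pairs, self-pairs excluded
print(possible_pairs)                   # 17997000, i.e. ~18M

real_interactions = 30_000              # the slide's "maybe 30K real"
print(f"{real_interactions / possible_pairs:.2%} of possible pairs")

genome_bp = 3_000_000_000               # ~3 Gb human genome
resolution_bp = 50
print(genome_bp // resolution_bp)       # 60000000 points at exactly 50 bp spacing
```

The 60 M result is the same order of magnitude as the slide's 50 M; both inputs are rounded.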
Weber
Cartoon
Bioinformatics is born!
(courtesy of Finn Drablos)
Major Application I: Designing Drugs
• Understanding How Structures Bind Other
Molecules (Function)
• Designing Inhibitors
• Docking, Structure Modeling
(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational
Chemistry Page at Cornell Theory Center).
Major Application II: Finding Homologues
• Find Similar Ones in Different Organisms
• Human vs. Mouse vs. Yeast
– Easier to do Expts. on latter!
Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins
(Section from NCBI Disease Genes Database Reproduced Below.)

Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter
Major Application III: Overall Genome Characterization
• Overall Occurrence of a Certain Feature in the Genome
– e.g. how many kinases in Yeast
• Compare Organisms and Tissues
– Expression levels in Cancerous vs Normal Tissues
• Databases, Statistics
(Clock figures, yeast v. Synechocystis,
adapted from GeneQuiz Web Page, Sander Group, EBI)
What do you get from large-scale data mining? Global statistics on the population of proteins
EX-1: Occurrence of functions per fold & interactions per fold over all genomes
EX-2: Occurrence of 1-4 salt bridges in genomes of thermophiles v mesophiles
[Figure: LOD value (axis 0.00 to 0.70) for EK(3) and EK(4) against physiological temperature in C. Genome codes on the x-axis: MP, MG, EC, SC, HP, SS, HI, MT, MJ, AF, AA, OT; mesophiles at 10 to 45, thermophiles at 65, 85, 83, 95, 98]
http://www.nature.com/nature/journal/vaop/ncurrent/pdf/
End of First Lecture
2009
Remaining Slides
Lecture 2
Organizing Molecular Biology Information: Redundancy and Multiplicity
• Different Sequences Have the Same Structure
• Organism has many similar genes
• Single Gene May Have Multiple Functions
• Genes are grouped into Pathways
• Genomic Sequence Redundancy due to the Genetic Code
• How do we find the similarities?....
Integrative Genomics: genes ↔ structures ↔ functions ↔ pathways ↔ expression levels ↔ regulatory systems ↔ ….
'Omics: studying populations of molecules in a database framework

Ome (molecular group) | Google Hits | PubMed Hits | PubMed First year
Genome | 58200000 | 537993 | 1932
Proteome | 1850000 | 6005 | 1995
Transcriptome | 707000 | 1665 | 1997
Phenome | 418000 | 53 | 1989
Interactome | 87500 | 87 | 1999
Metabolome | 80700 | 182 | 1998
Physiome | 56300 | 41 | 1997
Orfeome | 29800 | 25 | 2002
Secretome | 23900 | 48 | 2000
Morphome | 11400 | 2 | 2000
Glycome | 995 | 34 | 2000
Regulome | 618 | 6 | 2004
Functome | 390 | 1 | 2001
Cellome | 246 | 17 | 2002
Transportome | 155 | 1 | 2004
Ribonome | 131 | 1 | 2002
Operome | 57 | 0 |
A Parts List Approach to Bike Maintenance
How many roles can these play? How flexible and adaptable are they mechanically?
What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)?
What are the common parts - types of parts (nuts & washers)?
Where are the parts located?
Molecular Parts = Conserved Domains, Folds, &c
Vast Growth in (Structural) Data... but number of Fundamentally New (Fold) Parts Not Increasing that Fast
[Figure: Total in Databank, New Submissions, New Folds]
World of Structures is even more Finite, providing a valuable simplification
[Figure: ~100000 genes (human) and ~1000 genes (T. pallidum) set against ~1000 folds, numbered 1, 2, 3, …]
Same logic for pathways, functions, sequence families, blocks, motifs....
Global Surveys of a Finite Set of Parts from Many Perspectives
Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from
ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom,
Pfam, Blocks, Domo, WIT, CATH, Scop....
What is Bioinformatics?
• (Molecular) Bio - informatics
• One idea for a definition?
Bioinformatics is conceptualizing biology in terms
of molecules (in the sense of physical-chemistry)
and then applying “informatics” techniques
(derived from disciplines such as applied math, CS,
and statistics) to understand and organize the
information associated with these molecules, on a
large-scale.
• Structural Bioinformatics is a practical discipline
with many applications that deals with biological
three dimensional structural data.
General Types of “Informatics” techniques in Bioinformatics
• Databases
– Building, Querying
– Object DB
• Text String Comparison
– Text Search
– 1D Alignment
– Significance Statistics
– Google, grep
• Finding Patterns
– AI / Machine Learning
– Clustering
– Datamining
• Geometry
– Robotics
– Graphics (Surfaces, Volumes)
– Comparison and 3D Matching (Vision, recognition)
• Physical Simulation
– Newtonian Mechanics
– Electrostatics
– Numerical Algorithms
– Simulation
Bioinformatics Topics - Genome Sequence
• Finding Genes in Genomic DNA
– introns
– exons
– promoters
• Characterizing Repeats in Genomic DNA
– Statistics
– Patterns
• Duplications in the Genome
– Large scale genomic alignment
• Whole-Genome Comparisons
• Finding Structural RNAs
• Sequence Alignment
– non-exact string matching, gaps
– How to align two strings optimally via Dynamic Programming
– Local vs Global Alignment
– Suboptimal Alignment
– Hashing to increase speed (BLAST, FASTA)
– Amino acid substitution scoring matrices
• Multiple Alignment and Consensus Patterns
– How to align more than one sequence and then fuse the result in a consensus representation
– Transitive Comparisons
– HMMs, Profiles
– Motifs
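As a concrete picture of "align two strings optimally via Dynamic Programming", here is a minimal sketch of global (Needleman-Wunsch) alignment. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration; protein alignment would normally use a substitution matrix and often affine gap penalties. `global_align` is a hypothetical name.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `global_align("ACGT", "AGT")` returns `(1, "ACGT", "A-GT")`. The table has (n+1) x (m+1) cells, which is why heuristics such as the hashing used by BLAST and FASTA matter for database-scale searches.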
Bioinformatics Topics - Protein Sequence
• Scoring schemes and Matching statistics
– How to tell if a given alignment or match is statistically significant
– A P-value (or an e-value)?
– Score Distributions (extreme val. dist.)
– Low Complexity Sequences
• Evolutionary Issues
– Rates of mutation and change
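The P-value and e-value question can be made concrete with the Karlin-Altschul formula for local alignment scores, E = K·m·n·e^(−λS), with P = 1 − e^(−E), which follows from the extreme value distribution of scores. In the sketch below the default `lam` and `k` are placeholder values chosen for illustration; real tools estimate them for each scoring system, and the function names are hypothetical.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 lam: float = 0.267, k: float = 0.041) -> float:
    """Expected number of chance hits with score >= `score` (Karlin-Altschul).

    lam and k depend on the scoring matrix and gap penalties; the defaults
    here are illustrative placeholders, not values to rely on.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected count."""
    return -math.expm1(-e_value)  # 1 - exp(-E), numerically stable for tiny E


# Example: a 300-residue query against a database of 10 million residues.
for s in (40, 60, 80):
    e = expect_value(s, 300, 10_000_000)
    print(f"score {s}: E = {e:.3g}, P = {p_value(e):.3g}")
```

For small E the two numbers are nearly equal, which is why search tools often report only the e-value.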
Bioinformatics Topics - Sequence / Structure
• Secondary Structure “Prediction”
– via Propensities
– Neural Networks, Genetic Alg.
– Simple Statistics
– TM-helix finding
– Assessing Secondary Structure Prediction
• Structure Prediction: Protein v RNA
• Tertiary Structure Prediction
– Fold Recognition
– Threading
– Ab initio
• Function Prediction
– Active site identification
• Relation of Sequence Similarity to Structural Similarity
Problems in Protein Bioinformatics
• Prediction of structure from sequence
– Fold recognition
– Fragment construction
• Proteome annotation
• Protein-protein docking
Protein folding code
[Diagram: Protein sequence -> Protein folding code -> Protein structure]
Prediction of correct fold
[Diagram: Query sequence -> Fold recognition -> Matched fold]
Match sequence against library of known folds
Eisenberg et al.
Jones, Taylor, Thornton
Computational Requirements
• 1 sequence search takes 12 mins (3 GHz)
• Benchmarking on 100 proteins with 100 runs for a
simplex search of parameter space = 80 days
• 30 approaches explored = 7 years (on 1 cpu)
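The figures above follow from straightforward multiplication. A short check, assuming one search per protein per run on a single CPU running around the clock:

```python
minutes_per_search = 12
proteins = 100
runs = 100
approaches = 30

# One benchmark = every protein searched once in every run.
benchmark_days = minutes_per_search * proteins * runs / (60 * 24)
total_years = approaches * benchmark_days / 365

print(f"one benchmark: {benchmark_days:.1f} days")   # 83.3 days
print(f"all approaches: {total_years:.1f} years")    # 6.8 years
```

These match the slide's rounded "80 days" and "7 years", and show why such parameter searches are run in parallel.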