Comparative genome analysis


Proteome analysis
in silico
Peer Bork
EMBL & MDC
Heidelberg & Berlin
http://www.bork.embl-heidelberg.de/
‘omes: use and misuse
Original intention exemplified by the genome:
‘ome – entirety of biomolecular objects (ALL genes etc)
‘omics – research on an entirety of biomolecular objects
Proteomics – research on the entirety of proteins (so
far in an organism); term coined in the early 1990s
Common practice:
‘omics - used to describe large-scale approaches
(whereby large is sometimes 1)
Proteomics - used for research on many proteins
(whereby many might mean 3)
Originally two main directions:
Protein profiling and interaction proteomics
Protein profiling: establishment of protein inventories
under controlled conditions (organelles, tissues,
organisms).
Interaction proteomics: identification of temporally
and spatially defined functional modules formed by
proteins
Bioinformatics analysis is essential in both areas
Proteome analysis in silico
Part I
Protein detection and annotation by homology and
orthology (function in 1D)
Part II
Protein interactions and protein networks (function in 2D)
Temporal and spatial considerations (function in 3D+4D)
Bork et al.
JMolBiol 1998
Genome
annotation
Alternative
Splicing
Domain analysis
Protein networks
Literature mining
coupled to
genomic data
70% prediction accuracy is great!
| Prediction of | acc*cov | %acc | %cov of reference set | reference |
|---|---|---|---|---|
| Human promoters | .35 | 50% | 70% of annotated test set | Prestidge, 1995; Bucher, pers. comm. |
| Human regulatory RNA elements | .34 | 85% | 40% of new DNA | Dandekar & Sharma, 1998 |
| Human genes (only presence) | .49 | 70% | 70% of chromosome 22 | Dunham et al., 1999 and refs therein |
| Human SNPs by EST comparison | .21 | 70% | 30% of all proteins with SNP | Sunyaev et al., 2000; Buetow et al., 1999 |
| Human alternative splicing | .45 | 90% | 50% of all splice sites | Hanke et al., 1999 |
| Transmembranes (only presence) | .85 | 85% | 99% of annotated test set | Tusnady & Simon, 1998 and refs therein |
| Signal peptides (only presence) | .90 | 90% | 100% of annotated test set | Nielsen et al., 1999 |
| GPI anchors (incl. cleavage site) | .72 | 72% | 100% of annotated test set | Eisenhaber et al., 1999 |
| Coiled coil (only presence) | .81 | 90% | 90% of annotated coiled coil | Lupas, 1996 |
| Secondary structure (3 states) | .77 | 77% | 100% of 3D test set | Jones, 1999 and refs therein |
| Buried or exposed residues | .74 | 74% | 100% of 3D test set | Rost, 1996 |
| Residue hydration | .72 | 72% | 100% of 3D test set | Ehrlich et al., 1998 |
| Protein folds (in Mycoplasma) | .49 | 98% | 50% of Mycoplasma ORFs | Teichmann et al., 1999 and refs therein |
| Homology (several methods) | .49 | 98% | 50% of 3D test set | Muller et al., 1999 and refs therein |
| Functional features by homology | .63 | 90% | 70% unicellular genomes | Bork and Koonin, 98; Brenner, 99 |
| Function association by context | .25 | 50% | 10% 'high confidence' in yeast | Marcotte et al., 1999b |
| Cellular localization (2 states) | .77 | 77% | 100% of annotated test set | Andrade et al., 1998 |
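The acc*cov column reads as the product of prediction accuracy and coverage of the reference set. A minimal sketch of that arithmetic follows, using three rows from the slide; the function name `combined_score` is a hypothetical helper, not something taken from the talk.

```python
def combined_score(accuracy: float, coverage: float) -> float:
    """Return accuracy x coverage, both given as fractions in [0, 1]."""
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("accuracy and coverage must be fractions between 0 and 1")
    return accuracy * coverage


# (accuracy, coverage) pairs copied from the slide's table.
rows = {
    "Human promoters": (0.50, 0.70),
    "Signal peptides (only presence)": (0.90, 1.00),
    "Secondary structure (3 states)": (0.77, 1.00),
}

# Round to two decimals, the precision used on the slide.
scores = {name: round(combined_score(acc, cov), 2) for name, (acc, cov) in rows.items()}
```

A method with 90% accuracy on half of the reference set therefore scores lower than one with 77% accuracy on all of it, which is the point of reporting the product and not accuracy alone.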
Concepts in function prediction
Homology-based (intrinsic molecular features)
- Sequence and domain DBs (BLAST, Pfam, SMART)
- Function transfer by orthology
Gene context (functional associations)
- Gene neighbourhood, fusion, co-occurrence
- Shared regulatory elements
Other (residue level, functional class)
- Correlated mutations
- Interaction threading
- Feature analysis
I. Homology-based protein annotation
Homology detection and domain annotation
Metazoan genome annotation: the dark side…
Metazoan proteome analysis: human vs chicken
Evolution of protein function
Status of homology-based function prediction
Many homologues, an increasing number of predictable folds, but tough times for automatic function prediction
Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence
Henikoff et al. 1997, Science 278, 609
History of signaling domain discovery
[Figure: chart of cytoplasmic and nuclear signaling domains by period of discovery (<1985, 85/86, 87/88, 89/90, 91/92, 93/94, 95/96, 97/98, 99/00, 01/02, 03/now); vertical axis 0 to 40.]
Systematic discovery by
1) searching 'in between' regions
2) starting with repeats
Doerks et al. 2002, Genome Res.
Ponting et al. 2001, Genome Res.
Domain discovery in disease genes
| gene/protein | disease | domains | reference |
|---|---|---|---|
| dystrophin | Muscular dystrophy | WW | Bork & Sudol: TIBS 19 (94) 531 |
| X11 | Friedreich's ataxia (c) | PI/PTB+PDZ | Bork & Margolis: Cell 80 (95) 693 |
| PKD1 | Polycystic kidney | many (PKD1) | Int. PKD1 consortium: Cell 81 (95) 298 |
| HD | Huntington's | HEAT repeats | Andrade & Bork: Nat. Genet. 11 (95) 115 |
| BRCA2 | Breast cancer | BRC repeats | Bork et al.: Nat. Genet. 13 (96) 22 |
| BRCA1 | Breast cancer | BRCT | Koonin et al.: Nat. Genet. 13 (96) 266 |
| dsh | DiGeorge syndrome | DEP | Ponting & Bork: TIBS 21 (96) 245 |
| X25 (FRDA) | Friedreich's ataxia | CyaY | Gibson et al.: TINS 19 (96) 465 |
| beige/CH | Chediak-Higashi | BEACH | Nagle et al.: Nat. Genet. 14 (96) 307 |
| RB | Retinoblastoma | BRCT | Bork et al.: FASEB J. 11 (97) 68 |
| 9 incl. HML1 | Colon cancer | HSP90 | Mushegian et al.: PNAS 94 (97) 5831 |
| TSG101 | Breast cancer | UBC | Ponting, Cai & Bork: JMM 75 (97) 467 |
| WRN/BLM | Werner + Bloom syn. | HRDC | Morozov et al.: TIBS 22 (97) 417 |
| 2 incl. pyrin | Mediterranean fever | SPRY | Schultz et al.: PNAS 95 (98) 5857 |
| p73 | various tumors? | SAM | Bork & Koonin: Nat. Genet. 18 (98) 313 |
| mahogany | Obesity | PSI | Nagle et al.: Nature 398 (99) 148 |
| Parkin | AP-J Parkinsonism | IBR | Morett & Bork: TIBS 24 (99) 229 |
SMART
Blast-like input
- Access to different databases
- Domain annotation & architecture
- Alerting
Collaboration with Chris Ponting
www.smart.embl-heidelberg.de
SMART
Digested output
- Signal sequence, coiled coil and TM
- Pfam integrated
- Comparison of domain context
A putative transport-associated microtubule-binding domain
Unifying disorders associated to hereditary spastic paraplegia?
[Figure: domain architectures of proteins containing the MIT domain; legend entries: Mutation, Plant-related, MIT]
• Spartin
• Spastin
• SKD1 protein
• VPS4p ATPase (Vacuolar protein sorting factor 4A and 4B)
• Tobacco mosaic virus helicase domain-binding protein
• Sorting nexin 15
• RSK-like protein
• Similar to ribosomal protein S6 kinase
• Calpain7
• CG8866
Patel, H. et al. Nat Genet 31(02)347, Ciccarelli, F. D., et al. Genomics 81(03)437
I. Homology-based genome annotation
Homology detection and domain annotation
Metazoan genome annotation: the dark side…
Metazoan proteome analysis: human vs chicken
Evolution of protein function
Number of human genes in time
[Figure: estimated number of human genes in thousands (axis 0 to 120) at Feb00, Aug00, Oct00, Dec00, Feb01, Apr01 and Jan05. Series labels: "HGS, Incyte and co"; "Textbooks, public opinion"; Celera; HGP; others. Annotation: "Basis for Feb 01 publications". Data labels shown: 52, 39, 38, 32, 27, 24, 22, 21; secondary tick labels 2T to 10T.]
Improvement of gene cluster predictions
Mouse chr4:94-94.6 Mb p450 (CYP2J) region: 8 genes / 11 pseudogenic fragments
Gene labels in the figure: cyp2j13, cyp2j6, cyp2j9, cyp2j5
Tracks compared (comparison performed in 2004):
- Known genes
- ESTs
- Manual (8 genes)
- Twinscan (1 gene)
- GeneID (3 genes)
- ENSEMBL (9 genes)
- fgenesh++ (13 genes)
Gene copies reported for query Mm.cyp2j.pep (len = 501):
| gene | region | cov | id% |
|---|---|---|---|
| GENE_14 | ~(576195..588175) | 0.547 | 87.8 |
| GENE_13 | ~(552820..562733) | 0.184 | 62.7 |
| GENE_12 | ~(515451..538069) | 0.986 | 61.0 |
| GENE_11 | ~(464757..500175) | 0.996 | 67.2 |
| GENE_10 | ~(462789..462893) | 0.070 | 57.0 |
| GENE_9 | ~(391323..454110) | 0.992 | 59.9 |
| GENE_8 | ~(302976..378293) | 0.451 | 53.5 |
| GENE_7 | ~(241542..295291) | 0.978 | 63.4 |
| GENE_6 | ~(181333..220995) | 0.441 | 50.2 |
| GENE_5 | ~(131666..172308) | 0.980 | 63.2 |
| GENE_4 | ~(126547..126708) | 0.108 | 68.0 |
| GENE_3 | ~(87921..106913) | 0.166 | 59.4 |
| GENE_2 | ~(35993..77274) | 0.972 | 66.5 |
| GENE_1 | ~(4764..13967) | 0.441 | 60.4 |
BLAST2GENE finds independent gene copies
BLAST of cyp2j13 protein vs. mouse chr4:94-94.6 Mb: ~150 alignments
BLAST2GENE
[Figure: BLAST alignments along the region, as processed by BLAST2GENE]
Hundreds of often considerable differences to current gene prediction pipelines!
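To illustrate how many local alignments of one query protein can be separated into independent gene copies, here is a minimal sketch. It is not the BLAST2GENE algorithm itself: it assumes forward-strand hits only, and the `HSP` record, the `max_gap` threshold and the toy coordinates are all hypothetical. A new copy is started whenever the next alignment along the genome does not continue forward along the query or lies too far downstream.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    """One local alignment (hypothetical record, forward strand only)."""
    q_start: int  # start on the query protein (residues, 1-based)
    q_end: int    # end on the query protein
    s_start: int  # start on the genomic sequence (bases)
    s_end: int    # end on the genomic sequence


def group_into_copies(hsps: List[HSP], max_gap: int = 50_000) -> List[List[HSP]]:
    """Chain HSPs that are colinear on query and genome into gene copies."""
    copies: List[List[HSP]] = []
    for hsp in sorted(hsps, key=lambda h: h.s_start):
        last = copies[-1][-1] if copies else None
        colinear = (
            last is not None
            and hsp.q_start > last.q_start           # moves forward on the query
            and hsp.s_start - last.s_end <= max_gap  # plausible intron size
        )
        if colinear:
            copies[-1].append(hsp)
        else:
            copies.append([hsp])  # query restarts or gap too large: new copy
    return copies


def query_coverage(copy: List[HSP], query_len: int) -> float:
    """Fraction of query residues covered by at least one HSP of the copy."""
    covered = set()
    for hsp in copy:
        covered.update(range(hsp.q_start, hsp.q_end + 1))
    return len(covered) / query_len


# Toy alignments for a 501-residue query (coordinates are invented).
toy_hsps = [
    HSP(1, 100, 1_000, 1_300),
    HSP(101, 250, 2_000, 2_450),
    HSP(1, 120, 60_000, 60_360),
    HSP(121, 300, 61_000, 61_540),
]
copies = group_into_copies(toy_hsps)
coverages = [round(query_coverage(c, 501), 3) for c in copies]
```

The per-copy coverage computed here plays the same role as the cov value listed for each gene copy on the slide: low values point to fragments, values near 1 to complete copies.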
Annotation of pseudogenes changes gene numbers
1. Similarity search in intergenic regions
- Masking of known repeats and already predicted genes
- BLASTX vs nr protein db, E-value < 0.001: 1.5-2 million fragments with significant sequence similarity
- Exclusion of transposon- and virus-derived sequence
- Merging of fragments of the same element: regions containing independent elements
- Closest known protein (first BLAST hit) used for GENEWISE
- Ka/Ks functionality check
Ca. 20,000 detectable pseudogenes in each of human, mouse, rat
Torrents, Suyama, Bork, Genome Res. 13 (2003) 2550
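The filtering and merging steps of this search can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the input tuples, the `max_gap` distance and the toy hits are hypothetical, and only the E-value cutoff of 0.001 comes from the slide.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Hit = Tuple[str, int, int, float]  # (best-hit protein id, start, end, E-value)


def merge_fragments(hits: Iterable[Hit],
                    max_evalue: float = 1e-3,
                    max_gap: int = 1_000) -> List[Tuple[str, int, int]]:
    """Keep significant hits and merge nearby fragments of the same protein."""
    by_protein = defaultdict(list)
    for protein, start, end, evalue in hits:
        if evalue < max_evalue:  # significance cutoff from the slide
            by_protein[protein].append((start, end))

    merged = []
    for protein, spans in by_protein.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end <= max_gap:  # close enough: same element
                cur_end = max(cur_end, end)
            else:                           # far apart: independent element
                merged.append((protein, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((protein, cur_start, cur_end))
    return sorted(merged, key=lambda m: m[1])


# Toy intergenic hits (identifiers and coordinates are invented).
toy_hits = [
    ("P1", 100, 400, 1e-10),
    ("P1", 450, 900, 1e-5),
    ("P1", 50_000, 50_300, 1e-8),
    ("P2", 2_000, 2_300, 0.5),  # not significant, dropped
]
elements = merge_fragments(toy_hits)
```

Each merged region would then be passed, together with its closest known protein, to a spliced aligner such as GeneWise; that step is outside this sketch.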
Annotation of pseudogenes changes gene numbers
2. Consistency check of gene predictions
[Figure: predicted gene at Mm chr1:7608644-7681026 (80 kb, exons e1 to e6) shown with two processed pseudogenes; Genewise predictions using sptrembl|Q9HBM5 and SwissProt|RS2_RAT; a stop codon or frameshift is marked.]
Still >3000 pseudogenes among the predicted human
genes in mid 2004 (build 34)
Arrays, chips et al.: 20% off?
genes
What do we count?
20-40k genes
>100k transcripts
>1000k proteins?
Protein diversity
Rate of detectable alternative splicing depends
on EST coverage and library range
[Figure: detected alternative splicing versus number of ESTs (0 to 3,500,000) for human and mouse; axes: %AS (0 to 50) and AS per mRNA (x) (2.0 to 2.8).]
Brett et al. Nature Genet. 30 (2002) 29
Boue et al. Bioessays 03
Homology-based predictions of exons and
alternative transcripts (www.smart.embl-heidelberg.de)
SMART domain DB
links to genomes
Top 10 domains* in human: 30% diff.!
| Species | human | fly | worm |
|---|---|---|---|
| Total no genes | 26500 (26500) | 13300 | 18200 |
| Immunoglobulin | 765 (381) | 140 | 64 |
| C2H2 zinc finger | 706 (607) | 357 | 151 |
| Protein kinase | 575 (501) | 319 | 437 |
| Rhod.-like GPCR | 569 (616) | 97 | 358 |
| P-loop NTPase | 433 | 198 | 183 |
| Rev. transcriptase | 350 | 10 | 50 |
| RRM (RNA-binding) | 300 (224) | 157 | 96 |
| WD40 (G-protein) | 277 (136) | 162 | 102 |
| Ankyrin repeat | 276 (145) | 105 | 107 |
| Homeobox | 267 (160) | 148 | 109 |
*Only no. of genes given, no. of domains higher; note that only around 90% is sequenced
Nature 409 (01) 860; Science 291 (01) 1304
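The "30% diff." in the slide title can be checked against the human column with a short calculation. The sketch below assumes that the number in parentheses is the count from the second of the two cited annotations and that the difference is measured relative to the first number; the slide does not state either, so the result is only indicative.

```python
# Human gene counts per domain family: (first value, value in parentheses).
# Families without a second value on the slide are left out.
human_counts = {
    "Immunoglobulin": (765, 381),
    "C2H2 zinc finger": (706, 607),
    "Protein kinase": (575, 501),
    "Rhod.-like GPCR": (569, 616),
    "RRM (RNA-binding)": (300, 224),
    "WD40 (G-protein)": (277, 136),
    "Ankyrin repeat": (276, 145),
    "Homeobox": (267, 160),
}


def relative_difference(first: int, second: int) -> float:
    """Absolute difference as a fraction of the first count."""
    return abs(first - second) / first


diffs = {name: relative_difference(a, b) for name, (a, b) in human_counts.items()}
mean_diff = sum(diffs.values()) / len(diffs)
```

Under these assumptions the mean relative difference over the eight families comes out at roughly 31%, with the largest gaps for immunoglobulin and WD40 domains.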
Metazoan genome annotation: an ongoing process
and far from complete
- >2000 pseudogenes in mammalian gene sets:
  only now are they about to be included in
  prediction pipelines
- Ca. 150 retro-related genes in mammalian gene
  sets (>1000 in 2004), but true human genes
  sometimes suppressed
- Annotation of gene clusters needs considerable
  improvements
- Alternative splicing still a major unknown
- Considerable human factor in annotation
I. Homology-based genome annotation
Homology detection and domain annotation
Metazoan genome annotation: the dark side…
Metazoan proteome analysis: human vs chicken
Evolution of protein function
[Figure: phylogenetic tree of human, chimp, mouse, rat, chicken, fugu, D. mela., mosquito and C. eleg.; divergence-time labels: 5, 40, 75, 250MY, 310MY, 450MY, 600-1200MY?]
Genome publications:
Human: Nature Feb 2001
Mosquito: Science Oct 2002
Mouse: Nature Dec 2002
Rat: Nature Apr 2004
Chicken: Nature Dec 2004
Chicken genome analysis
Hillier et al
Nature 04
Zdobnov et al
Science 02
15%
45%
Chicken genome analysis: orthology and cellular processes
75.4% identity (median)
between
chicken and human
1:1 orthologs
Immune response
evolves fastest
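A common heuristic for obtaining 1:1 ortholog pairs is the reciprocal best hit, and the median identity over those pairs gives a figure of the kind quoted above. The sketch below illustrates that idea only; the slide does not say how the chicken-human orthologs were assigned, and all gene names and identities here are invented.

```python
from statistics import median
from typing import Dict, List, Tuple

Hits = Dict[str, List[Tuple[str, float]]]  # gene -> [(partner, % identity), ...]


def best_hit(hits: List[Tuple[str, float]]) -> str:
    """Partner with the highest score."""
    return max(hits, key=lambda h: h[1])[0]


def reciprocal_best_hits(a_vs_b: Hits, b_vs_a: Hits) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = {a: best_hit(h) for a, h in a_vs_b.items() if h}
    best_ba = {b: best_hit(h) for b, h in b_vs_a.items() if h}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy search results (hypothetical genes c1..c3 and h1..h2).
chicken_vs_human: Hits = {
    "c1": [("h1", 80.0), ("h2", 55.0)],
    "c2": [("h2", 70.0)],
    "c3": [("h2", 60.0)],  # h2 prefers c2, so c3 gets no 1:1 partner
}
human_vs_chicken: Hits = {
    "h1": [("c1", 80.0)],
    "h2": [("c2", 70.0), ("c3", 60.0)],
}

pairs = reciprocal_best_hits(chicken_vs_human, human_vs_chicken)
identity = {(a, b): s for a, hits in chicken_vs_human.items() for b, s in hits}
median_identity = median(identity[p] for p in pairs)
```

Reciprocal best hits miss orthologs after lineage-specific duplications, which is why expanded families such as the olfactory receptors need a tree-based treatment.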
Chicken genome analysis:
Innovation and Expansion of domain families
Orthology analysis
reveals more
subtle functional
changes
Evolution by duplication: Burst of an olfactory receptor family
[Figure: olfactory receptor family in chicken and human]
…thought to recognize MHC diversity
…221 copies in chicken
…given ca. 300 ORs in chicken and 450 in human
Chicken genome analysis: Evolution of function
by domain accretion
Scavenger receptor cysteine-rich domain acquired
by a fibrinogen-domain containing protein
(identified and displayed by SMART)
I. Homology-based genome annotation
Homology detection and domain annotation
Metazoan genome annotation: the dark side…
Metazoan proteome analysis: human vs chicken
Evolution of protein function
Phylogenetic distribution of orthologs: losses
| pathway | protein | D | A | P | Y | W | H | M |
|---|---|---|---|---|---|---|---|---|
| Sterol metabolism | Squalene monooxygenase (EC 1.14.99.7) | - | - | x | x | - | | |
| Sterol metabolism | 7-dehydrocholesterol reductase (EC 1.3.1.21) | - | - | x | x | x | x | x |
| Sterol metabolism | Farnesyl-diphosphate farnesyltransferase (EC 2.5.1.21) | - | - | x | x | - | x | x |
| Sterol metabolism | Lanosterol synthase (EC 5.4.99.7) | - | - | x | x | - | x | x |
| Sterol metabolism | 3-oxo-5-alpha-steroid 4-dehydrogenase 1 (EC 1.3.99.5) | - | - | x | - | x | x | x |
| Sterol metabolism | C-5 sterol desaturase (EC 1.3.3.2), ergosterol biosynthesis | - | - | x | x | - | | |
| Sterol metabolism | Cytochrome P450 P51, sterol 14-alpha demethylase | - | - | x | x | - | x | x |
| Sterol metabolism | diminuto/24-dehydrocholesterol reductase ('seladin1') | - | - | x | - | | | |
| Biosynthesis of NAD | Kynureninase (EC 3.7.1.3) | - | - | - | x | x | x | x |
| Biosynthesis of NAD | 3-hydroxyanthranilate 3,4-dioxygenase (EC 1.13.11.6), synthesis of excitotoxin quinolinic acid | - | - | - | x | x | x | x |
| Biosynthesis of NAD | Quinolinate phosphoribosyltransferase (EC 2.4.2.19) | - | - | x | x | - | x | x |
| DNA-methylation and repair | DNA (cytosine-5)-methyltransferase 1 | - | - | x | - | - | x | x |
| DNA-methylation and repair | uracil-DNA glycosylases | - | - | x | - | x | x | x |
| DNA-methylation and repair | DNA-(apurinic or apyrimidinic site) lyase (EC 4.2.99.18) | - | - | | | | | |
Gene loss in
diptera
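Reading gene loss off such a presence/absence table can be sketched as below. The column order D, A, P, Y, W, H, M and the reading of D and A as the two dipteran genomes are assumptions about the slide's header; the three real profiles are rows whose seven marks survive intact in the transcript, and the last profile is a made-up counterexample.

```python
from typing import Iterable, Sequence

# Assumed column order of the slide's table.
SPECIES = ("D", "A", "P", "Y", "W", "H", "M")


def lost_in_clade(profile: str, clade: Iterable[str],
                  species: Sequence[str] = SPECIES) -> bool:
    """True if the gene is absent from every clade member but present elsewhere."""
    marks = profile.split()
    if len(marks) != len(species):
        raise ValueError("profile needs one mark per species")
    present = {s for s, mark in zip(species, marks) if mark == "x"}
    clade = set(clade)
    outside = set(species) - clade
    return present.isdisjoint(clade) and bool(present & outside)


profiles = {
    "Kynureninase (EC 3.7.1.3)": "- - - x x x x",
    "Lanosterol synthase (EC 5.4.99.7)": "- - x x - x x",
    "7-dehydrocholesterol reductase (EC 1.3.1.21)": "- - x x x x x",
    "hypothetical gene kept in D": "x - x x x x x",  # invented counterexample
}

diptera = {"D", "A"}  # assumption: D and A are the two dipteran columns
diptera_losses = [name for name, p in profiles.items() if lost_in_clade(p, diptera)]
```

Absence from a gene set can also reflect incomplete sequence or a missed prediction, so a call of this kind is only as reliable as the underlying annotation.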
Functional changes at evolutionary time scales
Orthologs mapped onto
metazoan phylogeny
Summary (homology-based function prediction)
Emphasis in homology-based genome annotation shifts from
sensitivity (e.g. domain identification) to selectivity issues (orthology
assignment for 1:1 function transfer)
Metazoan genome annotation is far from complete, and caution
is needed when using incomplete and partially erroneous parts lists
(e.g. when predicting networks)
Yet, with the growing number of metazoan genomes, our
understanding of functional diversification at the protein level will
increase dramatically, although the proteome remains far from
being deciphered