Exploiting Gene Clusters to Curate Annotations Ross Overbeek, Fellowship for Interpretation of Genomes (FIG) October, 2003


Outline of the Talk
• The Emerging Opportunity
• The Use of Clusters to Find “Missing Genes”
• Experiences with a Single Pathway
• “The Project”
• Tools Needed to Support the Project
Three “Laws”
• The amount of available DNA sequence data will double every 18 months
• The number of available genomes will double every 18 months
• The cost of sequencing will drop by a factor of 2 every 18 months
Basic Facts
• We have about 230-250 publicly available, more-or-less complete genomes
• We will have about 1000 complete genomes within 3 years
• This will lead to better annotations, not worse
• The majority of annotations will need to be automated, and the process must accurately follow the steps that a human expert would take
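The projection of about 1000 genomes within 3 years follows from the 18-month doubling time. A minimal sketch of the arithmetic is below; the function name is illustrative and the model assumes perfectly smooth doubling.

```python
def projected_count(current: float, years: float, doubling_months: float = 18.0) -> float:
    """Project a quantity that doubles every `doubling_months` months."""
    doublings = years * 12.0 / doubling_months
    return current * 2 ** doublings

# 3 years is two doubling periods, so the count grows four-fold.
for start in (230, 250):
    print(start, "genomes now ->", round(projected_count(start, years=3)), "in 3 years")
```

Under this assumption, 230-250 genomes today become roughly 920-1000 genomes in three years.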
The Use of Clusters to Find Missing Genes
Central Machinery of Life:
Horizons of gene discovery
• 3,000 - 4,000 functional roles (300 – 3,000 per organism)
• Largely conserved across the three kingdoms (sequences; functions; pathways)
• “Missing genes” are still there
Distinct Functional Roles (enzymes with complete E.C. numbers)

                   Fall 2001        Fall 2002
Genes unknown      169    22%       112    14%
Genes known        613    78%       670    86%
Total              782   100%       782   100%

Taxon-Specific Functional Roles (among known genes only)

                              Fall 2001        Fall 2002
Prokaryotic                   104    17%        94    14%
Eukaryotic                     59    10%        85    13%
Universal (present in both)   450    73%       491    73%
Total                         613   100%       670   100%
Missing genes in metabolic pathways: making a case
[Figure: pathway A --> F (compounds A + B, C + D, E, F) with enzymes E1, E2, and E3, compared across genomes 1-5 in its functional context/neighborhood.]
*Enzyme E1, protein family A: gene A in all five genomes (A1, A2, A3, A4, A5)
*Enzyme E2, protein family B = ?: gene B? unknown in all five genomes (?, ?, ?, ?, ?)
*Enzyme E3, protein family C: gene C in all five genomes (C1, C2, C3, C4, C5)
Globally Missing Gene (never identified in any species)
Missing genes in metabolic pathways: making a case
[Figure: pathway A --> F (compounds A + B, C + D, E, F) with enzymes E1, E2, and E3, compared across genomes 1-5 in its functional context/neighborhood.]
*Enzyme E1, protein family A: gene A in all five genomes (A1, A2, A3, A4, A5)
*Enzyme E2, protein family B = ?: unknown in genomes 1-3 (?, ?, ?), known in genomes 4 and 5 (B4, B5)
*Enzyme E3, protein family C: gene C in all five genomes (C1, C2, C3, C4, C5)
Locally Missing Gene (non-orthologous gene displacement)
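The distinction between the two cases can be stated as a small rule. The sketch below is illustrative only: the function name and the input layout (one entry per genome that contains the pathway, holding the gene assigned to the role or None) are assumptions, not part of any FIG tool.

```python
from typing import Dict, Optional

def classify_role(genes_by_genome: Dict[str, Optional[str]]) -> str:
    """Classify a functional role from the genes assigned to it.

    Keys are genomes known to contain the pathway; a value of None means
    no gene has been identified for the role in that genome.
    """
    known = [gene for gene in genes_by_genome.values() if gene is not None]
    if not known:
        return "globally missing"   # never identified in any genome
    if len(known) < len(genes_by_genome):
        return "locally missing"    # identified in some genomes only
    return "known"

# Enzyme E2 as drawn on the two slides above.
first_slide = {f"genome {i}": None for i in range(1, 6)}
second_slide = {**first_slide, "genome 4": "B4", "genome 5": "B5"}
print(classify_role(first_slide))
print(classify_role(second_slide))
```

A locally missing role is the easier case to attack, because the known genes (B4, B5) show that a different, non-orthologous gene must be doing the job in genomes 1-3.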
Techniques of genome context analysis (I): checking neighbors
GENE CLUSTERING ON THE CHROMOSOME (OPERONS)
[Figure: chromosomal neighborhoods of gene A in three genomes.]
GENOME 1: gene G1, gene T1, gene A1, gene X1, gene C1, gene R1
GENOME 2: gene Y2, gene C2, gene A2, gene M2, gene X2, gene N2
GENOME 3: gene Q3, gene X3, gene A3, gene S3, gene U3, gene Y3
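Checking neighbors amounts to counting how often another family turns up near the anchor gene across genomes. The sketch below is a toy version: the gene orders are an assumed reading of the figure, and the function name and the `window` parameter are hypothetical.

```python
from collections import Counter

# Assumed gene orders read off the figure; letters are protein families.
GENOMES = {
    "genome 1": ["G", "T", "A", "X", "C", "R"],
    "genome 2": ["Y", "C", "A", "M", "X", "N"],
    "genome 3": ["Q", "X", "A", "S", "U", "Y"],
}

def neighbor_counts(genomes, anchor, window=2):
    """For each family, count the genomes where it lies within `window` genes of the anchor."""
    counts = Counter()
    for order in genomes.values():
        if anchor not in order:
            continue
        i = order.index(anchor)
        nearby = set(order[max(0, i - window): i + window + 1]) - {anchor}
        counts.update(nearby)
    return counts

counts = neighbor_counts(GENOMES, "A")
for family in sorted(counts, key=lambda f: (-counts[f], f)):
    print(family, counts[family])
```

With these toy orders, family X is near A in all three genomes and family C in two, so they rank as the strongest candidates for functional coupling with A.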
Techniques of genome context analysis (II): checking connections
PROTEIN FUSION EVENTS
[Figure: genes A and C as separate genes or as parts of fusion proteins.]
GENOME 1: gene A1, gene C1
GENOME 3: gene A3, gene C3 / Z3
GENOME 4: gene A4 / X4, gene C4
GENOME 5: gene C5 / A5
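A fusion event is informative when two families that are fused in one genome occur as separate proteins elsewhere. The sketch below encodes each protein as a tuple of families; the data are an assumed reading of the figure and the function name is hypothetical.

```python
from itertools import combinations

# Each protein is a tuple of the families it contains (assumed from the figure).
PROTEINS = {
    "genome 1": [("A",), ("C",)],
    "genome 3": [("A",), ("C", "Z")],
    "genome 4": [("A", "X"), ("C",)],
    "genome 5": [("C", "A")],
}

def fusion_links(proteins_by_genome):
    """Map each fused family pair to the genomes where both occur on separate proteins."""
    fused_pairs = set()
    for proteins in proteins_by_genome.values():
        for families in proteins:
            fused_pairs.update(combinations(sorted(set(families)), 2))
    links = {}
    for a, b in fused_pairs:
        separate_in = sorted(
            genome
            for genome, proteins in proteins_by_genome.items()
            if any(a in p and b not in p for p in proteins)
            and any(b in p and a not in p for p in proteins)
        )
        if separate_in:
            links[(a, b)] = separate_in
    return links

links = fusion_links(PROTEINS)
print(links)
```

Only the A/C pair qualifies here: it is fused in genome 5 and separate in genomes 1, 3, and 4. The C/Z and A/X fusions give no link because Z and X never appear on their own in this toy data.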
Techniques of genome context analysis (III): co-regulation
SHARED REGULATORY SITES (REGULONS)
[Figure: genes that share regulatory sites in three genomes.]
GENOME 1: gene T1, gene A1, gene X1, gene C1, gene R1
GENOME 2: gene C2, gene A2, gene W2
GENOME 5: gene R5, gene C5 / A5, gene X5
Techniques of genome context analysis (IV): co-evolution
OCCURRENCE PROFILES

Family       A    C    X    G    H    I    W    Y    Z
IN-GROUP
GENOME 1     A1   C1   X1   G1   H1   I1   W1   Y1   Z1
GENOME 2     A2   C2   X2   G2   H2   I2   W2   Y2   -
GENOME 3     A3   C3   X3   G3   H3   I3   W3   Y3   Z3
GENOME 4     A4   C4   X4   -    H4   I4   W4   -    -
GENOME 5     A5   C5   X5   G5   H5   I5   W5   -    -
OUT-GROUP
GENOME 6     -    -    -    -    H6   I6   W6   Y6   Z6
GENOME 7     -    -    -    G7   H7   I7   W7   -    Z7
GENOME 8     -    -    -    -    H8   I8   W8   Y8   -
GENOME 9     -    -    -    -    H9   I9   W9   Y9   Z9
GENOME 10    -    -    -    -    -    I10  W10  Y10  Z10

Scores shown on the slide for the nine families: 10, 10, 10, 8, 6, 5, 4, 4, 3
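One simple way to score an occurrence profile is to count the genomes in which a family's presence agrees with group membership (present in the in-group, absent from the out-group). The sketch below uses that agreement count as an assumption; it is not necessarily the scoring used on the slide, and the names are illustrative.

```python
IN_GROUP = {1, 2, 3, 4, 5}
ALL_GENOMES = range(1, 11)   # genomes 6-10 form the out-group

# Genomes in which each family occurs, taken from the occurrence profile.
PRESENT = {
    "A": {1, 2, 3, 4, 5},
    "C": {1, 2, 3, 4, 5},
    "X": {1, 2, 3, 4, 5},
    "G": {1, 2, 3, 5, 7},
    "H": {1, 2, 3, 4, 5, 6, 7, 8, 9},
    "I": set(range(1, 11)),
    "W": set(range(1, 11)),
    "Y": {1, 2, 3, 6, 8, 9, 10},
    "Z": {1, 3, 6, 7, 9, 10},
}

def profile_score(present):
    """Number of genomes where presence of the family matches in-group membership."""
    return sum((g in present) == (g in IN_GROUP) for g in ALL_GENOMES)

ranked = sorted(PRESENT, key=lambda fam: -profile_score(PRESENT[fam]))
for fam in ranked:
    print(fam, profile_score(PRESENT[fam]))
```

Families that track the in-group perfectly (A, C, X) get the maximum score of 10, while ubiquitous or patchy families score lower and drop out as suspects.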
Missing gene case: primary suspects
[Table: presence of candidate protein families in genomes 1-10. Pathway A --> F is present in the in-group and absent from the out-group.]

                      IN-GROUP                    OUT-GROUP
Genome                1    2    3    4    5       6    7    8    9    10
Pathway A --> F       yes  yes  yes  yes  yes     no   no   no   no   no
protein family A      A1   A2   A3   A4   A5      -    -    -    -    -
? protein family ?    ?    ?    ?    ?    ?
protein family C      C1   C2   C3   C4   C5      -    -    -    -    -
protein family X      X1   X2   X3   X4   X5      -    -    -    -    -
protein family Y      Y1   Y2   Y3   -    -       Y6   -    Y8   Y9   Y10
protein family Z      Z1   -    Z3   -    -       Z6   Z7   -    Z9   Z10
protein family W      W1   W2   W3   W4   W5      W6   W7   W8   W9   W10
protein family G      G1   G2   G3   -    G5      -    G7   -    -    -

Functional context/neighborhood (Pathway A --> F): *Enzyme E1, *Enzyme E2, *Enzyme E3
Genome context/neighborhood: Clustering on chromosome; Fusion events; Shared regulatory sites; Occurrence profiles
Other annotations shown for the candidate families: Enzyme E4; Enzyme E5; Hypothetical protein (twice); Membrane protein
Example I: Chorismate Pathway
[Figure: the seven-step pathway from D-erythrose 4-P and phosphoenolpyruvate to chorismate, labeled with E. coli gene names. Step 5, shikimate kinase (EC 2.7.1.71), is marked "Missing gene in all archaea".]
D-Erythrose 4-P + Phosphoenolpyruvate
  1 (aroH, aroF, aroG) --> 7P-2-Dehydro-3-deoxy-D-arabino-heptulosonate
  2 (aroB) --> 3-Dehydro-Quinate
  3 (aroD) --> 3-Dehydro-Shikimate
  4 (aroE, ydiB) --> Shikimate
  5 (aroK, aroL) --> Shikimate-5-P
  6 (aroA) --> O5-(1-Carboxyvinyl)-3-P-Shikimate
  7 (aroC) --> Chorismate
Chorismate feeds Trp, Phe, and Tyr syntheses, chorismate catabolism, and isochorismate anabolism.
Chromosomal Clustering: Prediction
[Figure: chromosomal clusters of chorismate pathway genes. Two genes are marked "??" and two are labeled "Fusion Protein".]
Functional coupling in chorismate pathway
(evidence types: Clustering, Fusion, Occurrence)

Step  E.C. number   Functional role                               E. coli genes
1     EC 4.1.2.15   2-Dehydro-3-deoxyphosphoheptonate aldolase    aroH, aroG, aroF
2     EC 4.6.1.3    3-Dehydroquinate synthase                     aroB
3     EC 4.2.1.10   3-Dehydroquinate dehydratase                  aroD
4     EC 1.1.1.25   Shikimate 5-dehydrogenase                     aroE, ydiB
5     EC 2.7.1.71   Bacterial shikimate kinase                    aroK, aroL
6     EC 2.5.1.19   Phosphoshikimate 1-carboxyvinyltransferase    aroA
7     EC 4.6.1.4    Chorismate synthase                           aroC
5'    ?             Archaeal shikimate kinase                     hypothetical

Gene IDs by organism (columns are steps 1, 2, 3, 4, 5, 6, 7, 5'; "-" means no gene listed)

Organism                              1            2            3            4            5            6            7            5'
Helicobacter pylori                   RHP01088     RHP01230     RHP00443     RHP00644     RHP01111     RHP01338     RHP00097     -
Thermotoga maritima                   RTM00236     RTM00229     RTM00228     RTM00231     RTM00229     RTM00232     RTM00230     -
Clostridium acetobutylicum            RCA01750     RCA01752     RCA01757     RCA01755     RCA01756     RCA01753     RCA01754     -
Streptococcus pneumoniae              RPN00965-66  RPN00386     RPN00384     RPN00385     RPN00391     RPN00390     RPN00387     -
Methanococcus jannaschii              -            -            RMJ05308     RMJ00483     -            RMJ00806     RMJ07769     RMJ7785
Archaeoglobus fulgidus                -            -            RAG18799     RAG27692     -            RAG27692     RAG50410     RAG45918
Methanobacterium thermoautotrophicum  -            -            RTH01640     RTH01082     -            RTH02023     RTH00020     RTH01890
Aeropyrum pernix                      RAP00399     RAP00398     RAP00397     RAP00396     -            RAP00394     RAP00393     RAP00395
Pyrococcus furiosus                   RPF01413     RPF01411-12  RPF01410     RPF01409     -            RPF01402     RPF01401     RPF01407-08
Pyrococcus abyssi                     RPO01190     RPO01191     RPO01192     RPO01193     -            RPO01200     RPO01201     RPO01194

Rows with more than one gene per step (IDs in slide order):
Escherichia coli: step 1 (aroH, aroG, aroF): REC01661, REC05569, REC00721; steps 2-7: REC05984, REC01650, REC05912, REC01649, REC05985, REC0372, REC00874, REC05421; step 5': -
Bacillus subtilis: steps 1-7: RBS02969, RBS02266, RBS2304, RBS2442, RBS02559, RBS00316, RBS02256, RBS02267; step 5': -
Saccharomyces cerevisiae: RSC01644, RSC08655, RSC05895, and RSC06906 (labeled "Pentafunctional Enzyme"); step 5': -
Example II: “Missing Drug Target” in S. pneumoniae
[Figure: fatty acid biosynthesis genes accA, accB, accC, accD, fabD, fabH, fabF, fabG, fabZ, fabI, and acpP.]
The gene fabI, encoding enoyl-ACP reductase (EC 1.3.1.9), is missing in a number of streptococci.
Clustering of FAB Genes: Prediction
[Figure: fab gene clusters in Escherichia coli, Genome X, Genome Y, Clostridium acetobutylicum, and Streptococcus pyogenes. Genes shown include MAF, g30k, L32P, PLSX, fabH, acpP, fabD, fabG, fabF, accB, fabZ, accC, accD, accA, and fabI, together with neighbors labeled hyp, TR?, FRNS, EC 4…, 2.7.4.9, 2.7.7.7, 3.5.1.?, 6.3.4.15, 2.1.1.79, and 5.99.1.2. In the clusters that lack fabI, a conserved gene marked "?" sits between acpP and fabD.]
A conserved hypothetical FMN-binding protein “?” is the best candidate for the missing gene fabI in Gram-positive cocci.
Independent Experimental Verification
Nature 406, 145-146 (13 July 2000) © Macmillan Publishers Ltd.
Microbiology: A triclosan-resistant bacterial enzyme
RICHARD J. HEATH AND CHARLES O. ROCK
Triclosan is an antimicrobial agent that is widely used in a variety of consumer products and acts by inhibiting one of the highly conserved enzymes (enoyl-ACP reductase, or FabI) of bacterial fatty-acid biosynthesis. But several key pathogenic bacteria do not possess FabI, and here we describe a unique triclosan-resistant flavoprotein, FabK, that can also catalyse this reaction in Streptococcus pneumoniae. Our finding has implications for the development of FabI-specific inhibitors as antibacterial agents.
Missing genes, examples in cofactor pathways: prediction and experimental verification

Functional Role | E.C.# | Pathways | Missing/Found in | Key evidence | Experimental Verification
KYNURENINE FORMAMIDASE* | 3.5.1.9 | NAD/NADP | B (gram+, gram-) | Operon | Kurnasov et al., in press
RIBOSYLNICOTINAMIDE KINASE* | 2.7.1.22 | NAD/NADP | B (gram-) | Operon/Fusion | Kurnasov et al. 2002; Singh et al. 2002 (3D)
NaMN ADENYLYLTRANSFERASE* | 2.7.7.18 | NAD/NADP | B (gram+, gram-) | Operon | Zhang et al. 2002 (3D)
DUAL SPECIFICITY NMN/NaMN ADENYLYLTRANSFERASE | 2.7.7.1 / 2.7.7.18 | NAD/NADP | E (human, fungi) | Projection from bacteria | Zhou et al. 2002 (3D); Zhang et al. 2003 (3D)
NAD SYNTHASE, GLN-DEAMIDASE SUBUNIT* | 3.5.1.- | NAD/NADP | B A (deep branched) | Operon/Fusion | Shatalin et al., in progress (3D)
BIFUNCTIONAL PANTETHEINE-PHOSPHATE ADENYLYLTRANSFERASE/dpCoA KINASE | 2.7.7.3 / 2.7.1.24 | COENZYME A | E (human) | Fusion | Daugherty et al. 2002
PANTETHEINE-PHOSPHATE ADENYLYLTRANSFERASE | 2.7.7.3 | COENZYME A | E A (fungi, plants) | Projection from human | Daugherty et al. 2002
MONOFUNCTIONAL PHOSPHOPANTOTHENOYLCYSTEINE LIGASE | 6.3.2.5 | COENZYME A | E (human) | Projection from bacteria | Daugherty et al. 2002
PANTOTHENATE KINASE | 2.7.1.33 | COENZYME A | A | Operon |
MONOFUNCTIONAL PYRIMIDINE DEAMINASE | 3.5.4.26 | FMN/FAD | A | Operon |
EXTENDED FAD SYNTHASE | 2.7.7.2 | FMN/FAD | E (human) | Projection from yeast |
RIBOFLAVIN TRANSPORTER | | FMN/FAD | B (gram+) | Regulon |

Also listed under Experimental Verification: Mseeh et al., unpublished
The Leucine Degradation Cluster: Origin of a New Perspective on Uses of Clusters
Context-based enrichment of initial functional assignments: example from Brucella melitensis genome analysis

[Figure: leucine degradation pathway]
Leu --(deamination, oxidation)--> Isovaleryl-CoA
Isovaleryl-CoA --Isovaleryl-CoA dehydrogenase (EC 1.3.99.10)--> Methylcrotonoyl-CoA
Methylcrotonoyl-CoA --Methylcrotonoyl-CoA carboxylase (EC 6.4.1.4; biotin-containing subunit and carboxylase subunit)--> Methylglutaconyl-CoA
Methylglutaconyl-CoA --Methylglutaconyl-CoA hydratase (EC 4.2.1.18)--> HMG-CoA
HMG-CoA --(EC 4.1.3.4)--> Acetoacetate + Acetyl-CoA
Acetoacetate is the substrate of EC 6.2.1.16

E.C. No     Functional role                     Gene ID    No. in cluster   TIGR
1.3.99.10   ISOVALERYL-COA DEHYDROGENASE        BR0020     1                specific
6.4.1.4     METHYLCROTONYL-COA CARBOXYLASE
            - Biotin-containing subunit         BR0018     3                non-specific*
            - Carboxylase subunit               BR0019     4                non-specific*
4.2.1.18    METHYLGLUTACONYL-COA HYDRATASE      BR0016     2                non-specific*
-----------------------------------------------------------------------------------------
4.1.3.4     HYDROXYMETHYLGLUTARYL-COA LYASE     BR0017*    6                frameshift
6.2.1.16    ACETOACETATE-COA LIGASE             BR0021     5                specific

* Biotin carboxylase; Carboxyl transferase family subunit; Enoyl-CoA hydratase/isomerase family
Leucine degradation in Bacilli
No gene assigned in any organism in KEGG, NCBI, TIGR
Gene assigned in B. melitensis 2003 (IG)
Gene assignment propagated over 26 organisms using gene clustering
Organism
Gene anchor
Clustered genes
158
New assignments
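Propagating an assignment by clustering can be pictured as follows: find the anchor gene in each genome, then transfer curated roles to nearby genes whose families match the reference cluster. The sketch below is a toy illustration; the family labels, gene IDs, function name, and `window` parameter are all hypothetical, and real curation would also check similarity scores and known variants.

```python
def propagate(reference_roles, genomes, anchor_family, window=5):
    """Transfer curated roles to genes clustered with the anchor.

    reference_roles: family -> functional role from the curated cluster.
    genomes: genome -> list of (gene_id, family) in chromosomal order.
    """
    assignments = {}
    for genome, genes in genomes.items():
        for i, (_, family) in enumerate(genes):
            if family != anchor_family:
                continue
            # Only genes within `window` positions of the anchor are considered.
            for gene_id, fam in genes[max(0, i - window): i + window + 1]:
                if fam in reference_roles:
                    assignments[(genome, gene_id)] = reference_roles[fam]
    return assignments

ROLES = {
    "famIVD": "Isovaleryl-CoA dehydrogenase",
    "famHMGL": "Hydroxymethylglutaryl-CoA lyase",
}
TOY_GENOMES = {
    "genome_x": [("x1", "famQ"), ("x2", "famHMGL"), ("x3", "famIVD")],
    "genome_y": [("y1", "famHMGL"), ("y2", "famQ"), ("y3", "famQ"),
                 ("y4", "famQ"), ("y5", "famIVD")],
}
result = propagate(ROLES, TOY_GENOMES, anchor_family="famIVD", window=2)
print(result)
```

In the toy data the lyase homolog y1 is not assigned because it lies outside the window around the anchor, which is the conservative behavior one would want from a cluster-based rule.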
Gene cluster in B. subtilis
NCBI annotations: similar to butyryl-CoA dehydrogenase; similar to long-chain acyl-CoA synthetase; similar to biotin carboxylase; gene not called; similar to hydroxymethylglutaryl-CoA lyase; similar to 3-hydroxybutyryl-CoA dehydratase; similar to propionyl-CoA carboxylase
PIR annotations: butyryl-CoA dehydrogenase homolog yngJ; probable acid-CoA ligase (EC 6.2.1.-) yngI; biotin carboxylase homolog yngH; gene not called; hydroxymethylglutaryl-CoA lyase homolog yngG; probable enoyl-CoA hydratase (EC 4.2.1.17) yngF; propionyl-CoA carboxylase homolog yngE
Leucine degradation in Bacilli
?
E.C. No      Functional role                      No. in cluster
1.3.99.10    ISOVALERYL-COA DEHYDROGENASE         2
6.4.1.4      METHYLCROTONYL-COA CARBOXYLASE
             - BIOTIN CONTAINING SUBUNIT          3
             - CARBOXYLASE SUBUNIT                1
             BIOTIN CARBOXYL CARRIER              7
4.2.1.18     METHYLGLUTACONYL-COA HYDRATASE       4
-----------------------------------------------------------------
4.1.3.4      HYDROXYMETHYLGLUTARYL-COA LYASE      5
6.2.1.16     ACETOACETATE-COA LIGASE              6
6.2.1.16*    ACETOACETATE-COA LIGASE*             14
[Table header: leucine degradation cluster across genomes; cells below are gene numbers per organism]
EC: 1.3.99.10, 6.3.4.14, 6.4.1.4, 4.2.1.18, 4.1.3.4, 6.2.1.16, 6.2.1.3
# in cluster: 3, 7, 4, 4, 1, 2, 6, 5, 14
Functional role: Isovaleryl-CoA dehydrogenase; Biotin carboxyl carrier; Biotin carboxylase; Methylcrotonyl-CoA carboxylase biotin-containing subunit; Methylglutaconyl-CoA hydratase; Hydroxymethylglutaryl-CoA lyase; Acetoacetate-CoA ligase; Long-chain-fatty-acid-CoA ligase; carboxylase subunit
Bru. meli
1909
1911
1910
Bru. abor.
1089
1087
1088
1085
1086
1090
1471
Bac. ant h.
2343
2345
2344
2348
2347
2346
2349
4397
Bac. cere.
2373
2375
2374
2378
2377
2376
2379
4413
Bac. halo.
Bac. subt.
1171
1826
1174
not c alled
1173
1824
1177
1821
1176
1822
1175
1823
1 1 7 83 1 7 9 3 1 7 8 1 1 7 2
1825
2856
Oce. ihey.
Cau. cres.
1695
2243
1697
1696
1699
2241
1698
2240
2234
1913 1914
1912
3230 4343
476
1907 1908
504
3724
1620
2122
979
Listeria
1  Cell division protein mraZ
3  S-adenosyl-methyltransferase mraW (EC 2.1.1.-)
4  Cell division protein ftsI
2  UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
Shew.
2  UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)
Xylella
5  Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)
Clostridia
Ralstonia
Brevibacter
Enterococcus
Brucella
Geobacter
1  Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)
2  UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
6  Cell division protein ftsW
5  UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)
2  UDP-N-acetylmuramate--alanine ligase (EC 6.3.2.8)
9  Cell division protein ftsZ
11 UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)
2  D-alanine--D-alanine ligase (EC 6.3.2.4)
Bacteroides thetaiotaomicron
Bacillus cereus
Geobacter metallireducens
Buchnera
5  Cell division protein ftsW
1  UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)
2  UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
8  UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)
9  Cell division protein ftsQ
2  UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)
3  Cell division protein ftsA
6  Cell division protein ftsZ
Oceanobacillus iheyensis
Enterococcus faecium DO
Escherichia coli K12
Wigglesworthia brevipalpis
2  Cell division protein ftsA
1  Cell division protein ftsZ
8  Hypothetical protein
10 Hypothetical protein
12 RNA binding protein
7  UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase (EC 3.5.1.-)
13 Protein translocase subunit secA
The Project: Annotate 1000 Genomes in Three Years
• By making the task concrete, we force engineering decisions
• It will be easier to annotate 1000 genomes well than to annotate 50 well (comparative analysis is the key)
• Analysis by subsystem (rather than by genome) is clearly the key
• The use of clusters is the key to precise annotation of subsystems
Annotation by Subsystem
• Requires knowledge of known variants
• Evolution of clusters plays a major role
• There are three components of the task:
– Building tools to support analysis
– Actually doing the analysis on 30-50 subsystems
– Coordinating with groups doing a limited set of wet lab confirmations
FIG: Building the Initial Annotation Tools
• Releasing the browser/curation tool with approximately 220-230 genomes within a few months
• Peer-to-peer updates/synchronization
• Open source and free (initially for Macs and Linux systems)