Exploiting Gene Clusters to Curate Annotations Ross Overbeek, Fellowship for Interpretation of Genomes (FIG) October, 2003
Outline of the Talk
• The Emerging Opportunity
• The Use of Clusters to Find “Missing Genes”
• Experiences with a Single Pathway
• “The Project”
• Tools Needed to Support the Project

Three “Laws”
• The amount of available DNA sequence data will double every 18 months
• The number of available genomes will double every 18 months
• The cost of sequence will drop by a factor of 2 every 18 months

Basic Facts
• We have about 230-250 publicly available, more-or-less complete genomes
• We will have about 1000 complete genomes within 3 years
• This will lead to better annotations, not worse
• The majority of annotations will need to be automated, and the process must accurately follow the steps that a human expert would take

The Use of Clusters to Find Missing Genes

Central Machinery of Life: Horizons of gene discovery
• 3,000 - 4,000 functional roles (300 – 3,000 per organism)
• Largely conserved across the three kingdoms (sequences; functions; pathways)
• “Missing genes” are still there

Distinct Functional Roles (enzymes with complete E.C. numbers)
                              Fall 2001      Fall 2002
Genes unknown                 169 (22%)      112 (14%)
Genes known                   613 (78%)      670 (86%)
Total                         782 (100%)     782 (100%)

Taxon-Specific Functional Roles (among known genes only)
                              Fall 2001      Fall 2002
Prokaryotic                   104 (17%)       94 (14%)
Eukaryotic                     59 (10%)       85 (13%)
Universal (present in both)   450 (73%)      491 (73%)
Total                         613 (100%)     670 (100%)

Missing genes in metabolic pathways: making a case
[Figure: Functional context/neighborhood for pathway A --> F across genomes 1-5. Enzyme E1 = protein family A (genes A1-A5), Enzyme E2 = protein family B = ? (gene "?" in all five genomes), Enzyme E3 = protein family C (genes C1-C5). Globally Missing Gene: never identified in any species.]
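The two "making a case" figures distinguish a globally missing gene (never identified in any species) from a locally missing gene (non-orthologous gene displacement). A minimal sketch of that distinction follows. The function name `classify_role` and the dictionary layout are hypothetical, and the sketch assumes every listed genome actually contains the pathway.

```python
from typing import Dict, List, Optional


def classify_role(genes_by_genome: List[Optional[str]]) -> str:
    """Classify one functional role from its gene calls in genomes 1..N.

    None stands for "?" on the slide: no gene identified in that genome.
    Assumption: every genome in the list contains the pathway.
    """
    found = [g for g in genes_by_genome if g is not None]
    if not found:
        return "globally missing"   # no gene known in any genome
    if len(found) < len(genes_by_genome):
        return "locally missing"    # known in some genomes, absent in others
    return "known"


# Gene calls for genomes 1-5 as drawn in the two figures.
first_figure: Dict[str, List[Optional[str]]] = {
    "E1": ["A1", "A2", "A3", "A4", "A5"],
    "E2": [None, None, None, None, None],
    "E3": ["C1", "C2", "C3", "C4", "C5"],
}
second_figure: Dict[str, List[Optional[str]]] = {
    "E1": ["A1", "A2", "A3", "A4", "A5"],
    "E2": [None, None, None, "B4", "B5"],
    "E3": ["C1", "C2", "C3", "C4", "C5"],
}

for label, pathway in (("first figure", first_figure), ("second figure", second_figure)):
    for role, genes in pathway.items():
        print(label, role, classify_role(genes))
```

In practice the hard part is deciding that a genome really contains the pathway; the sketch takes that as given.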
Missing genes in metabolic pathways: making a case
[Figure: Functional context/neighborhood for pathway A --> F across genomes 1-5. Enzyme E1 = protein family A (genes A1-A5), Enzyme E2 = protein family B = ? (gene "?" in genomes 1-3, genes B4 and B5 in genomes 4 and 5), Enzyme E3 = protein family C (genes C1-C5). Locally Missing Gene: non-orthologous gene displacement.]

Techniques of genome context analysis (I): checking neighbors
GENE CLUSTERING ON THE CHROMOSOME (OPERONS)
[Figure: genes drawn around gene A in three genomes. Genome 1: G1, T1, A1, X1, C1, R1. Genome 2: Y2, C2, A2, M2, X2, N2. Genome 3: Q3, X3, A3, S3, U3, Y3.]

Techniques of genome context analysis (II): checking connections
PROTEIN FUSION EVENTS
[Figure: genes A and C in genomes 1, 3, 4 and 5 (A1, C1; A3, C3; A4, C4), with a fusion C5 / A5 in genome 5 and further fusion partners X4 and Z3.]

Techniques of genome context analysis (III): co-regulation
SHARED REGULATORY SITES (REGULONS)
[Figure: genes sharing regulatory sites. Genome 1: T1, A1, X1, C1, R1. Genome 2: C2, A2, W2. Genome 5: R5, C5 / A5, X5.]

Techniques of genome context analysis (IV): co-evolution
OCCURRENCE PROFILES
             A     C     X     G     H     I     W     Y     Z
IN-GROUP
GENOME 1     A1    C1    X1    G1    H1    I1    W1    Y1    Z1
GENOME 2     A2    C2    X2    G2    H2    I2    W2    Y2    -
GENOME 3     A3    C3    X3    G3    H3    I3    W3    Y3    Z3
GENOME 4     A4    C4    X4    -     H4    I4    W4    -     -
GENOME 5     A5    C5    X5    G5    H5    I5    W5    -     -
OUT-GROUP
GENOME 6     -     -     -     -     H6    I6    W6    Y6    Z6
GENOME 7     -     -     -     G7    H7    I7    W7    -     Z7
GENOME 8     -     -     -     -     H8    I8    W8    Y8    -
GENOME 9     -     -     -     -     H9    I9    W9    Y9    Z9
GENOME 10    -     -     -     -     -     I10   W10   Y10   Z10
Score:       10    10    10    8     6     5     4     4     3

Missing gene case: primary suspects
[Figure: genomes 1-5 form the IN-GROUP (yes) and genomes 6-10 the OUT-GROUP (no). Protein family A (genes A1-A5) occurs in the in-group only; the protein family of the missing enzyme is marked "?".]
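The occurrence-profile technique (IV) can be sketched as a simple count. The scoring rule below is an assumption, not necessarily the one used on the slide: it gives one point for each in-group genome that has the family and one point for each out-group genome that lacks it. The presence sets are read from the occurrence-profile figure, and the names `PRESENCE` and `profile_score` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

IN_GROUP: Set[int] = {1, 2, 3, 4, 5}     # genomes that have the pathway
OUT_GROUP: Set[int] = {6, 7, 8, 9, 10}   # genomes that lack it

# Genomes in which each protein family occurs, read from the figure.
PRESENCE: Dict[str, Set[int]] = {
    "A": {1, 2, 3, 4, 5},
    "C": {1, 2, 3, 4, 5},
    "X": {1, 2, 3, 4, 5},
    "G": {1, 2, 3, 5, 7},
    "H": {1, 2, 3, 4, 5, 6, 7, 8, 9},
    "I": set(range(1, 11)),
    "W": set(range(1, 11)),
    "Y": {1, 2, 3, 6, 8, 9, 10},
    "Z": {1, 3, 6, 7, 9, 10},
}


def profile_score(present: Set[int]) -> int:
    """Assumed rule: in-group presences plus out-group absences."""
    return len(present & IN_GROUP) + len(OUT_GROUP - present)


def rank_families(presence: Dict[str, Set[int]]) -> List[Tuple[str, int]]:
    """Return (family, score) pairs, best candidates first."""
    scored = [(fam, profile_score(genomes)) for fam, genomes in presence.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for family, score in rank_families(PRESENCE):
    print(family, score)
```

Families that follow the pathway's own distribution (present in every in-group genome, absent from every out-group genome) reach the maximum of 10 under this rule, which makes them the first candidates to inspect for a missing role.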
Primary suspects: occurrence of candidate protein families in genomes 1-10
(in-group genomes 1-5 | out-group genomes 6-10)
protein family ?                          ?    ?    ?    ?    ?
protein family C                          C1   C2   C3   C4   C5  |  -    -    -    -    -
protein family X (Enzyme E4)              X1   X2   X3   X4   X5  |  -    -    -    -    -
protein family Y (Hypothetical protein)   Y1   Y2   Y3   -    -   |  Y6   -    Y8   Y9   Y10
protein family Z (Hypothetical protein)   Z1   -    Z3   -    -   |  Z6   Z7   -    Z9   Z10
protein family W (Enzyme E5)              W1   W2   W3   W4   W5  |  W6   W7   W8   W9   W10
protein family G (Membrane protein)       G1   G2   G3   -    G5  |  -    G7   -    -    -
Functional context/neighborhood: Pathway A --> F
Genome context/neighborhood: clustering on chromosome; fusion events; shared regulatory sites; occurrence profiles

Example I: Chorismate Pathway
Missing gene in all archaea
1. D-Erythrose 4-P + Phosphoenol pyruvate --> 7P-2-Dehydro-3-deoxy-D-arabino-heptulosonate (aroH, aroF, aroG)
2. --> 3-Dehydro-Quinate (aroB)
3. --> 3-Dehydro-Shikimate (aroD)
4. --> Shikimate (aroE, ydiB)
5. --> Shikimate-5-P (aroK, aroL): Shikimate Kinase (EC 2.7.1.71)
6. --> O5-(1-Carboxyvinyl)-3-P-Shikimate (aroA)
7. --> Chorismate (aroC)
Chorismate feeds Trp, Phe and Tyr syntheses, chorismate catabolism and isochorismate anabolism.

Chromosomal Clustering: Prediction
[Figure: chromosomal gene clusters with unassigned genes marked "??" and two fusion proteins.]

Functional coupling in chorismate pathway
Evidence types: Clustering, Fusion, Occurrence
Step 1   EC 4.1.2.15
Step 2   EC 4.6.1.3
Step 3   EC 4.2.1.10
Step 4   EC 1.1.1.25
Step 5   EC 2.7.1.71
Step 6   EC 2.5.1.19
Step 7   EC 4.6.1.4
Step 5'  ?
Chorismate pathway genes by organism
Columns (pathway steps):
1   2-Dehydro-3-Deoxyphosphoheptonate Aldolase (aroH, aroG, aroF)
2   3-Dehydroquinate Synthase (aroB)
3   3-Dehydroquinate Dehydratase (aroD)
4   Shikimate 5-Dehydrogenase (aroE, ydiB)
5   Bacterial Shikimate Kinase (aroK, aroL)
6   Phosphoshikimate 1-Carboxyvinyl Transferase (aroA)
7   Chorismate Synthase (aroC)
5'  Archaeal Shikimate Kinase (hypothetical)

Organism | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 5'
Escherichia coli | REC01661, REC05569, REC00721 | REC05984 | REC01650 | REC05912, REC01649 | REC05985, REC0372 | REC00874 | REC05421 | -
Helicobacter pylori | RHP01088 | RHP01230 | RHP00443 | RHP00644 | RHP01111 | RHP01338 | RHP00097 | -
Thermotoga maritima | RTM00236 | RTM00229 | RTM00228 | RTM00231 | RTM00229 | RTM00232 | RTM00230 | -
Clostridium acetobutylicum | RCA01750 | RCA01752 | RCA01757 | RCA01755 | RCA01756 | RCA01753 | RCA01754 | -
Streptococcus pneumoniae | RPN00965-66 | RPN00386 | RPN00384 | RPN00385 | RPN00391 | RPN00390 | RPN00387 | -
Saccharomyces cerevisiae | RSC01644, RSC08655 | RSC06906 | RSC06906 | RSC06906 | RSC06906 | RSC06906 | RSC05895 | -
Methanococcus jannaschii | - | - | RMJ05308 | RMJ00483 | - | RMJ00806 | RMJ07769 | RMJ7785
Archaeoglobus fulgidus | - | - | RAG18799 | RAG27692 | - | RAG27692 | RAG50410 | RAG45918
Methanobacterium thermoautotrophicum | - | - | RTH01640 | RTH01082 | - | RTH02023 | RTH00020 | RTH01890
Aeropyrum pernix | RAP00399 | RAP00398 | RAP00397 | RAP00396 | - | RAP00394 | RAP00393 | RAP00395
Pyrococcus furiosus | RPF01413 | RPF01411-12 | RPF01410 | RPF01409 | - | RPF01402 | RPF01401 | RPF01407-08
Pyrococcus abyssi | RPO01190 | RPO01191 | RPO01192 | RPO01193 | - | RPO01200 | RPO01201 | RPO01194
Bacillus subtilis (columns 1-7, eight IDs in slide order): RBS02969, RBS02266, RBS2304, RBS2442, RBS02559, RBS00316, RBS02256, RBS02267; 5': -
RSC06906 in Saccharomyces cerevisiae is the Pentafunctional Enzyme.

Example II: “Missing Drug Target” in S. pneumoniae
Gene fabI of Enoyl-ACP reductase (EC 1.3.1.9) is missing in a number of Streptococci.
[Figure: fatty acid biosynthesis genes accA, accB, accC, accD, fabD, fabF, fabG, fabH, fabI, fabZ, acpP.]

Clustering of FAB Genes: Prediction
[Figure: FAB gene neighborhoods in Escherichia coli, Genome X and Genome Y. Genes shown include MAF, g30k, L32P, PLSX, TR?, fabH, acpP, fabD, fabG, fabF, accB, fabZ, accC, accD, accA, fabI and hypothetical genes (hyp), with an unassigned gene "?" next to acpP. EC labels on the figure: 2.7.4.9, 2.7.7.7, 3.5.1.?, 6.3.4.15, 2.1.1.79.]
[Figure: the same FAB gene neighborhood (TR?, fabH, acpP, ?, fabD, fabG, fabF, accB, fabZ, accC, accD, accA) in Clostridium acetobutylicum and Streptococcus pyogenes. EC label on the figure: 5.99.1.2.]
A conserved hypothetical FMN-binding protein “?” is the best candidate for the missing gene fabI in Gram-positive cocci.

Independent Experimental Verification
Nature 406, 145-146 (13 July 2000). © Macmillan Publishers Ltd.
Microbiology: A triclosan-resistant bacterial enzyme
Richard J. Heath and Charles O. Rock
Triclosan is an antimicrobial agent that is widely used in a variety of consumer products and acts by inhibiting one of the highly conserved enzymes (enoyl-ACP reductase, or FabI) of bacterial fatty-acid biosynthesis. But several key pathogenic bacteria do not possess FabI, and here we describe a unique triclosan-resistant flavoprotein, FabK, that can also catalyse this reaction in Streptococcus pneumoniae. Our finding has implications for the development of FabI-specific inhibitors as antibacterial agents.

Missing genes: examples in cofactor pathways (prediction and experimental verification)
Fields: Functional Role (E.C. #); Pathways; Missing/Found in; Key evidence; Experimental Verification
• KYNURENINE FORMAMIDASE* (3.5.1.9); NAD/NADP; B (gram+, gram-)
• RIBOSYLNICOTINAMIDE KINASE* (2.7.1.22); NAD/NADP; B (gram-)
• NaMN ADENYLYLTRANSFERASE* (2.7.7.18); NAD/NADP; B (gram+, gram-)
• DUAL SPECIFICITY NMN/NaMN ADENYLYLTRANSFERASE (2.7.7.1 / 2.7.7.18); NAD/NADP; E (human, fungi); Projection from bacteria; Zhou et al. 2002 (3D), Zhang et al. 2003 (3D)
• NAD SYNTHASE, GLN-DEAMIDASE SUBUNIT* (3.5.1.-); NAD/NADP; B, A (deep branched); Operon/Fusion; Shatalin et al. in progress (3D)
• BIFUNCTIONAL PANTETHEINE-PHOSPHATE ADENYLYLTRANSFERASE/dpCoA KINASE (2.7.7.3 / 2.7.1.24); COENZYME A; E; Fusion; Daugherty et al. 2002
• PANTETHEINE-PHOSPHATE ADENYLYLTRANSFERASE (2.7.7.3); COENZYME A; E, A (fungi, plants); Projection from human; Daugherty et al. 2002
• MONOFUNCTIONAL PHOSPHOPANTOTHENOYLCYSTEINE LIGASE (6.3.2.5); COENZYME A; E (human); Projection from bacteria; Daugherty et al. 2002
• PANTOTHENATE KINASE (2.7.1.33); COENZYME A; A; Operon
• MONOFUNCTIONAL PYRIMIDINE DEAMINASE (3.5.4.26); FMN/FAD; A; Operon
• EXTENDED FAD SYNTHASE (2.7.7.2); FMN/FAD; E (human); Projection from yeast
• RIBOFLAVIN TRANSPORTER; FMN/FAD; B (gram+); Regulon
Key evidence and verification entries whose row is not recoverable from the transcript: (human); Operon; Operon/Fusion; Operon; Kurnasov et al. in press; Kurnasov et al. 2002; Singh et al. 2002 (3D); Zhang et al. 2002 (3D); Mseeh et al. unpublished.

The Leucine Degradation Cluster: Origin of a New Perspective on Uses of Clusters

Context-based enrichment of initial functional assignments: example from Brucella melitensis genome analysis
Leu --> (deamination, oxidation) --> Isovaleryl-CoA
Isovaleryl-CoA --> Methylcrotonoyl-CoA: Isovaleryl-CoA dehydrogenase (EC 1.3.99.10)
Methylcrotonoyl-CoA --> Methylglutaconyl-CoA: Methylcrotonoyl-CoA carboxylase (EC 6.4.1.4), biotin-containing subunit and carboxylase subunit
Methylglutaconyl-CoA --> HMG-CoA: Methylglutaconyl-CoA hydratase (EC 4.2.1.18)
HMG-CoA --> Acetyl-CoA + Acetoacetate (EC 4.1.3.4)
Acetoacetate --> (EC 6.2.1.16)
E.C. No | Functional role | Gene ID | No. in cluster | TIGR
1.3.99.10 | ISOVALERYL-COA DEHYDROGENASE | BR0020 | 1 | specific
6.4.1.4 | METHYLCROTONYL-COA CARBOXYLASE, biotin-containing subunit | BR0018 | 3 | non-specific*
6.4.1.4 | METHYLCROTONYL-COA CARBOXYLASE, carboxylase subunit | BR0019 | 4 | non-specific*
4.2.1.18 | METHYLGLUTACONYL-COA HYDRATASE | BR0016 | 2 | non-specific*
--------------------------------------------------------------
4.1.3.4 | HYDROXYMETHYLGLUTARYL-COA LYASE | BR0017* | 6 | frameshift
6.2.1.16 | ACETOACETATE-COA LIGASE | BR0021 | 5 | specific
* Non-specific TIGR assignments: Biotin carboxylase; Carboxyl transferase family subunit; Enoyl-CoA hydratase/isomerase family

Leucine degradation in Bacilli
• No gene assigned in any organism in KEGG, NCBI, TIGR
• Gene assigned in B. melitensis 2003 (IG)
• Gene assignment propagated over 26 organisms using gene clustering
[Table header: Organism, Gene anchor, Clustered genes, New assignments; the value 158 appears on the slide.]

Gene cluster in B. subtilis
Gene | NCBI | PIR
yngJ | similar to butyryl-CoA dehydrogenase | butyryl-CoA dehydrogenase homolog
yngI | similar to long-chain acyl-CoA synthetase | probable acid-CoA ligase (EC 6.2.1.-)
yngH | similar to biotin carboxylase | biotin carboxylase homolog
(none) | gene not called | gene not called
yngG | similar to hydroxymethylglutaryl-CoA lyase | hydroxymethylglutaryl-CoA lyase homolog
yngF | similar to 3-hydroxybutyryl-CoA dehydratase | probable enoyl-CoA hydratase (EC 4.2.1.17)
yngE | similar to propionyl-CoA carboxylase | propionyl-CoA carboxylase homolog
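Propagating an assignment "using gene clustering" can be sketched as follows: start from an anchor gene whose role is specific, then propose pathway roles for nearby genes whose only annotation is a non-specific family. This is a minimal sketch, not the method used for the talk. The window size, the gene positions, the family labels and the function name `propagate_by_cluster` are all hypothetical; the role names come from the leucine degradation table.

```python
from typing import Dict, List, Tuple

Gene = Tuple[str, int, str]  # (gene id, position index on the chromosome, protein family)


def propagate_by_cluster(
    genome: List[Gene],
    anchor_family: str,
    family_to_role: Dict[str, str],
    window: int = 5,
) -> Dict[str, str]:
    """Propose roles for genes that lie within `window` positions of an anchor gene."""
    anchor_positions = [pos for _, pos, fam in genome if fam == anchor_family]
    proposals: Dict[str, str] = {}
    for gene_id, pos, fam in genome:
        near_anchor = any(abs(pos - a) <= window for a in anchor_positions)
        if near_anchor and fam in family_to_role:
            proposals[gene_id] = family_to_role[fam]
    return proposals


# Hypothetical mapping from non-specific families to leucine degradation roles.
FAMILY_TO_ROLE = {
    "acyl-CoA dehydrogenase": "ISOVALERYL-COA DEHYDROGENASE (EC 1.3.99.10)",
    "biotin carboxylase": "METHYLCROTONYL-COA CARBOXYLASE, biotin-containing subunit (EC 6.4.1.4)",
    "enoyl-CoA hydratase/isomerase": "METHYLGLUTACONYL-COA HYDRATASE (EC 4.2.1.18)",
}

# Toy genome: three clustered genes plus one distant paralog (positions are made up).
toy_genome: List[Gene] = [
    ("yngJ", 100, "acyl-CoA dehydrogenase"),
    ("yngH", 102, "biotin carboxylase"),
    ("yngF", 104, "enoyl-CoA hydratase/isomerase"),
    ("distant_paralog", 900, "enoyl-CoA hydratase/isomerase"),
]

proposals = propagate_by_cluster(toy_genome, "acyl-CoA dehydrogenase", FAMILY_TO_ROLE)
for gene_id, role in proposals.items():
    print(gene_id, "->", role)
```

The distant paralog keeps its non-specific annotation: family membership alone is not treated as evidence, only family membership together with proximity to the anchor.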
Leucine degradation in Bacilli
E.C. No | Functional role | No. in cluster
1.3.99.10 | ISOVALERYL-COA DEHYDROGENASE | 2
6.4.1.4 | METHYLCROTONYL-COA CARBOXYLASE, BIOTIN CONTAINING SUBUNIT | 3
6.4.1.4 | METHYLCROTONYL-COA CARBOXYLASE, CARBOXYLASE SUBUNIT | 1
 | BIOTIN CARBOXYL CARRIER | 7
4.2.1.18 | METHYLGLUTACONYL-COA HYDRATASE | 4
--------------------------------------------------------------
4.1.3.4 | HYDROXYMETHYLGLUTARYL-COA LYASE | 5
6.2.1.16 | ACETOACETATE-COA LIGASE | 6
6.2.1.16* | ACETOACETATE-COA LIGASE* | 14

Leucine degradation cluster across organisms (gene numbers by functional role)
Column roles on the slide (EC): Isovaleryl-CoA dehydrogenase (1.3.99.10); Biotin carboxyl carrier (6.3.4.14); Biotin carboxylase and Methylcrotonyl-CoA carboxylase biotin-containing subunit (6.4.1.4); carboxylase subunit; Methylglutaconyl-CoA hydratase (4.2.1.18); Hydroxymethylglutaryl-CoA lyase (4.1.3.4); Acetoacetate-CoA ligase (6.2.1.16); Long-chain-fatty-acid-CoA ligase (6.2.1.3)
No. in cluster values on the slide: 3, 7, 4, 4, 1, 2, 6, 5, 14
Bru. meli.: 1909 1911 1910
Bru. abor.: 1089 1087 1088 1085 1086 1090 1471
Bac. anth.: 2343 2345 2344 2348 2347 2346 2349 4397
Bac. cere.: 2373 2375 2374 2378 2377 2376 2379 4413
Bac. halo. and Bac. subt. (rows interleaved in the transcript): 1171 1826 1174 not called 1173 1824 1177 1821 1176 1822 1175 1823 1 1 7 83 1 7 9 3 1 7 8 1 1 7 2 1825 2856
Oce. ihey. and Cau. cres. (rows interleaved in the transcript): 1695 2243 1697 1696 1699 2241 1698 2240 2234
Values not attributable to a row: 1913 1914 1912 3230 4343 476 1907 1908 504 3724 1620 2122 979

[Figure: chromosomal clusters of cell division and peptidoglycan synthesis genes, shown for Listeria, Shew., Xylella, Clostridia, Ralstonia, Brevibacter, Enterococcus, Brucella, Geobacter, Bacteroides thetaiotaomicron, Bacillus cereus, Geobacter metallireducens, Buchnera, Oceanobacillus iheyensis, Enterococcus faecium DO, Escherichia coli K12 and Wigglesworthia brevipalpis. Functional roles in the figure legends:
• Cell division protein mraZ
• S-adenosyl-methyltransferase mraW (EC 2.1.1.-)
• Cell division protein ftsI
• UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
• UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)
• Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)
• Cell division protein ftsW
• UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)
• UDP-N-acetylmuramate--alanine ligase (EC 6.3.2.8)
• Cell division protein ftsZ
• UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)
• D-alanine--D-alanine ligase (EC 6.3.2.4)
• Cell division protein ftsQ
• Cell division protein ftsA
• Hypothetical protein
• RNA binding protein
• UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase (EC 3.5.1.-)
• Protein translocase subunit secA]

The Project: Annotate 1000 Genomes in Three Years
• By making the task concrete, we force engineering decisions
• It will be easier to annotate 1000 genomes well than to annotate 50 well (comparative analysis is the key)
• Analysis by subsystem (rather than by genome) is clearly the key
• The use of clusters is the key to precise annotation of subsystems

Annotation by Subsystem
• Requires knowledge of known variants
• Evolution of clusters plays a major role
• There are three components of the task:
  – Building tools to support analysis
  – Actually doing the analysis on 30-50 subsystems
  – Coordinating with groups doing a limited set of wet lab confirmations

FIG: Building the Initial Annotation Tools
• Releasing the browser/curation tool with approximately 220-230 genomes within a few months
• Peer-to-peer updates/synchronization
• Open source and free (initially for Macs and Linux systems)
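The target of about 1000 genomes within three years follows from the second "law" (the number of available genomes doubles every 18 months) applied to today's count. A minimal sketch of that arithmetic follows, assuming clean exponential growth; the function name `projected_genomes` is hypothetical.

```python
def projected_genomes(current: float, months: float, doubling_months: float = 18.0) -> float:
    """Project a genome count forward, assuming it doubles every `doubling_months`."""
    return current * 2 ** (months / doubling_months)


# Three years is two doubling periods, so the count grows by a factor of four.
for start in (220, 230, 250):
    print(start, "genomes now ->", round(projected_genomes(start, 36)), "in three years")
```

With the 220-250 genomes quoted in the talk, two doublings give roughly 900 to 1000 genomes, which matches the figure used to define the project.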