in silico Mikhail Gelfand AlBio06, Moscow, July 2006 Research and Training Center “Bioinformatics”,
Molecular biology in silico
Mikhail Gelfand
Research and Training Center “Bioinformatics”, Institute for Information Transmission Problems, RAS
AlBio06, Moscow, July 2006

Propaganda
[Figure: growth over the years 1982-2000, vertical axis from 100 to 10,000,000 in powers of ten. Red: papers (experiments); blue: sequence fragments.]
[Figure: bar chart of complete genomes per year, 1995-2002.]
GOLD database (March 2006): 361 complete genomes
Incomplete (in progress):
• 952 bacteria
• 58 archaea
• 607 eukaryotes (incl. ESTs)
• 46 metagenomes

More propaganda
Most genes will never be studied in experiment
Even in E. coli: only 20-30 new genes per year (hundreds are still uncharacterized)
Bioinformatics = molecular biology in silico
• ~2% of all recent papers in biological journals
• Essential component of biological research
• Make predictions about function and regulation of genes (many quite reliable!)
• Metabolic reconstruction and prediction of phenotype given genome
• Identify really interesting cases, fill gaps in knowledge
 – “Universally missing genes”: not a single known gene even for ~10% of reactions of central metabolism. No genes for >40% of reactions overall
 – “Conserved hypothetical genes” (5-15% of any bacterial genome): essential, but unknown function

[Figures: Haemophilus influenzae, 1995; Vibrio cholerae, 2000]

How?

Similarity to known proteins
• Useful for many purposes (allows one to annotate 50-75% of genes in a bacterial genome)
• Necessary first step
• May be automated
 – … to some extent …
 – in particular, care is needed to avoid too specific predictions
 – Problem: propagation of annotation errors
• Boring (nothing new)

Noradrenaline transporter in an archaeon?
SOURCE      Methanococcus jannaschii.
  ORGANISM  Methanococcus jannaschii
            Archaea; Euryarchaeota; Methanococcales; Methanococcaceae; Methanococcus.
FEATURES             Location/Qualifiers
     source          1..492
                     /organism="Methanococcus jannaschii"
                     /db_xref="taxon:2190"
     Protein         1..492
                     /product="sodium-dependent noradrenaline transporter"
     CDS             1..492
                     /gene="MJ1319"
                     /note="similar to EGAD:HI0736 percent identity: 38.5; identified by sequence similarity; putative"
                     /coded_by="U67572:71..1549"
                     /transl_table=11
Now corrected: Hypothetical sodium-dependent transporter MJ1319.

Similarity to hypothetical proteins: somebody else’s errors…

The correct annotation

Genes with curious functional assignments
• C75604: Probable head morphogenesis protein, Deinococcus radiodurans
• O05360: Automembrane protein H, Yersinia enterocolitica
• Q8TID9: Benzodiazepine (valium) receptor TspO, Methanosarcina acetivorans
• NP_069403: DR-beta chain MHC class II, Archaeoglobus fulgidus

Errors in experimental papers
SwissProt:
DEFINITION  Hypothetical 43.6 kDa protein.
ACCESSION   P48012
...
KEYWORDS    Hypothetical protein.
SOURCE      Debaryomyces occidentalis
  ORGANISM  Debaryomyces occidentalis
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Debaryomyces.
[CAUTION] Was originally (Ref.1) thought to be 3-isopropylmalate dehydrogenase (LEU2).
PIR:
DEFINITION  3-isopropylmalate dehydrogenase (EC 1.1.1.85) - yeast (Schwanniomyces occidentalis).
ACCESSION   S55845
KEYWORDS    oxidoreductase.

SwissProt entry DSDX_ECOLI
-!- CAUTION: An ORF called dsdC was originally (Ref.3) assigned to the wrong DNA strand and thought to be a D-serine deaminase activator, it was then resequenced by Ref.2 and still thought to be "dsdC", but this time to function as a D-serine permease. It is Ref.1 that showed that dsdC is another gene and that this sequence should be called dsdX. It should also be noted that the C-terminal part of dsdX (from 338 onward) was also sequenced (Ref.6 and Ref.7) and was thought to be a separate ORF (don't worry, we also had difficulties understanding what happened!).
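The warning about over-specific predictions can be made concrete with a small sketch. The Python function below transfers the specific product name of the best hit only when the hit is experimentally characterized and the match is strong; otherwise it falls back to a more general name marked as hypothetical. The 50% identity cut-off, the function name, and the parameter names are assumptions chosen for illustration, not values from the talk. The example call uses the MJ1319 record (38.5% identity to HI0736).

```python
def transfer_annotation(specific: str, general: str, percent_identity: float,
                        hit_is_experimental: bool,
                        min_identity: float = 50.0) -> str:
    """Pick a cautious product name for a gene annotated by similarity.

    min_identity is an illustrative threshold (an assumption), not a
    recommended or published value.
    """
    if hit_is_experimental and percent_identity >= min_identity:
        # Strong match to a characterized protein: keep the specific name,
        # but still flag it as a prediction.
        return f"putative {specific}"
    # Weak match, or the hit itself was only predicted: generalize.
    return f"hypothetical {general}"


mj1319 = transfer_annotation(
    specific="sodium-dependent noradrenaline transporter",
    general="sodium-dependent transporter",
    percent_identity=38.5,
    # Assumption for this sketch: the hit was itself annotated by similarity.
    hit_is_experimental=False,
)
print(mj1319)
```

With these inputs the sketch returns the general name, in line with the corrected database entry. The point of the design is that specificity of the transferred name is tied to the strength and provenance of the evidence.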
Positional clustering
• Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem
 – mainly in prokaryotes, very weak in eukaryotes
 – caused by operon structure, but not only
   • horizontal transfer of loci containing several functionally linked operons
   • compartmentalisation of products in the cytoplasm
 – very weak evidence
   • stronger if observed in many unrelated genomes
• May be measured
 – e.g. the STRING database/server (P. Bork, EMBL)
 – and other sources

STRING: trpB – positional clusters

Functionally dependent genes tend to cluster on chromosomes in many different organisms
[Figure. Vertical axis: number of gene pairs with association score exceeding a threshold. Control: same graph, random re-labeling of vertices.]
More genomes (stronger links) => highly significant clustering
Especially in linear pathways (right)

Fusions
• If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related
• Very useful for the analysis of eukaryotes
• Sometimes useful for the analysis of prokaryotes

STRING: trpB – fusions

Phyletic patterns
• Functionally linked genes tend to occur together
• Enzymes with the same function (isozymes) have complementary phyletic profiles

STRING: trpB – cooccurrence (phyletic profiles)

Phyletic profiles in the Phe/Tyr pathway
[Figure labels: shikimate kinase; archaeal shikimate kinase]

Chorismate biosynthesis pathway (E. coli)

Arithmetics of phyletic patterns
Shikimate dehydrogenase (EC 1.1.1.25):
  AroE            COG0169   aompkzyqvdrlbcefghsnuj-i-
5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19):
  AroA            COG0128   aompkzyqvdrlbcefghsnuj-i-
Chorismate synthase (EC 4.2.3.5):
  AroC            COG0082   aompkzyqvdrlbcefghsnuj-i-
Shikimate kinase (EC 2.7.1.71):
  Typical (AroK)  COG0703   ------yqvdrlbcefghsnuj-i-
  Archaeal-type   COG1685 + aompkz-------------------
  Two forms combined        aompkzyqvdrlbcefghsnuj-i-
3-dehydroquinate dehydratase (EC 4.2.1.10):
  Class I (AroD)  COG0710   aompkzyq---lb-e----n---i-
  Class II (AroQ) COG0757 + ------y-vdr-bcefghs-uj---
  Two forms combined        aompkzyqvdrlbcefghsnuj-i-

Distribution of association scores (monotonic for subunits, bimodal for isozymes)

E.g. transporters
• Transporters of end products of metabolic pathways may substitute for the entire pathway
• Transporters of compounds for catabolic pathways co-occur with pathways
• Transporters for intermediates substitute for upstream parts of pathways

Example: bioY

Other approaches to phyletic patterns
• Gene signatures of lifestyles
 – e.g. thermophily: reverse gyrase is the only gene specific to all hyperthermophiles (bacterial and archaeal)
 – see COGs
• Regulators and signals

Example: bioR
[Figure. Gene: black arrow; candidate site: red dot]

Comparative analysis of regulation
• Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions
• Consistency filtering: regulons (sets of coregulated genes) are conserved =>
 – true sites occur upstream of orthologous genes
 – false sites are scattered at random

Enzymes
• Identification of a gap in a pathway (universal, taxon-specific, or in individual genomes)
• Search for candidates assigned to the pathway by co-localization and co-regulation (in many genomes)
• Prediction of general biochemical function from (distant) similarity and functional patterns
• Tentative filling of the gap
• Verification by analysis of phylogenetic patterns:
 – Absence in genomes without this pathway
 – Complementary distribution with known enzymes for the same function

Transporters
• Identification of candidates assigned to the pathway by co-localization and co-regulation (in many genomes)
• Prediction of general function by analysis of transmembrane segments and similarity
• Prediction of specificity by analysis of phylogenetic patterns:
 – End product if present in genomes lacking this pathway (substituting for the biosynthetic pathway for an essential compound)
 – Input metabolite if absent in genomes without the pathway (catabolic, also precursors in biosynthetic pathways)
 – Entry point in the middle if substituting for an upper or side part of the pathway in some genomes

5’ UTR regions of riboflavin genes from bacteria
[Alignment: RFN elements, one row per genome, rows labelled with genome abbreviations (BS, BQ, BE, … EC, HI, VC, …). Column blocks mark the conserved helices 1, 2, 2’, 3, Add., 3’, Variable, 4, 4’, 5, 5’, 1’.]

Conserved secondary structure of the RFN-element
[Figure: consensus secondary structure with stems 1-5, the additional stem-loop, and the variable stem-loop.]
Capitals: invariant (absolutely conserved) positions. Lower case letters: strongly conserved positions. Dashes and stars: obligatory and facultative base pairs. Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U. N: any nucleotide.
X: any nucleotide or deletion.

RFN: the mechanism of regulation
• Transcription attenuation
• Translation attenuation

Early observation: an uncharacterized gene (ypaA) with an upstream RFN element

Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis)
[Figure: tree; branch labels: no riboflavin biosynthesis; duplications; no riboflavin biosynthesis]

YpaA: riboflavin (vitamin B2) transporter in Gram-positive bacteria
• 5 predicted transmembrane segments => a transporter
• Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor
• S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin
Prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)
Validation:
• YpaA transports flavins (riboflavin, FMN, FAD) (by genetic analysis, Kreneva et al., 2000)
• ypaA is regulated by riboflavin (by microarray expression study, Lee et al., 2001)
• … via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)

A new family of nickel/cobalt transporters
• No experimental data
• No structural data
• Specificity predicted by comparative genomics
• … and then validated in experiment
• Mutational analysis under way

Conserved signal upstream of nrd genes

Identification of the candidate regulator by the analysis of phyletic patterns
• COG1327: the only COG with exactly the same phylogenetic pattern as the signal
 – “large scale” on the level of major taxa
 – “small scale” within major taxa:
   • absent in small parasites among alpha- and gamma-proteobacteria
   • absent in Desulfovibrio spp. among delta-proteobacteria
   • absent in Nostoc sp. among cyanobacteria
   • absent in Oenococcus and Leuconostoc among Firmicutes
   • present only in Treponema denticola among four spirochetes
COG1327 “Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains”: regulator of the riboflavin pathway?
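Two operations on phyletic patterns recur in the talk: combining the patterns of alternative forms of an enzyme, and looking for a gene family whose pattern is exactly the same as a query pattern (the test that singled out COG1327). A minimal Python sketch of both is given here. The function names are hypothetical, and the toy family set is limited to the aro gene patterns quoted earlier in the talk; a real search would run over all COGs and would probably need to tolerate a few mismatches.

```python
from typing import Dict, List

ABSENT = "-"  # a dash means "no member of the family in this genome"


def combine(*patterns: str) -> str:
    """Position-wise union of phyletic patterns of equal length."""
    if len({len(p) for p in patterns}) != 1:
        raise ValueError("patterns must have equal length")
    combined = []
    for column in zip(*patterns):
        present = [c for c in column if c != ABSENT]
        combined.append(present[0] if present else ABSENT)
    return "".join(combined)


def same_pattern(query: str, families: Dict[str, str]) -> List[str]:
    """Names of the families whose pattern is identical to the query."""
    return sorted(name for name, pat in families.items() if pat == query)


# One letter per genome; patterns as quoted for the chorismate pathway.
FAMILIES = {
    "COG0169 AroE": "aompkzyqvdrlbcefghsnuj-i-",
    "COG0128 AroA": "aompkzyqvdrlbcefghsnuj-i-",
    "COG0703 AroK": "-" * 6 + "yqvdrlbcefghsnuj-i-",
    "COG1685 archaeal-type kinase": "aompkz" + "-" * 19,
    "COG0710 AroD": "aompkzyq---lb-e----n---i-",
    "COG0757 AroQ": "-" * 6 + "y-vdr-bcefghs-uj---",
}

# "Two forms combined": isozymes with complementary patterns.
kinase = combine(FAMILIES["COG0703 AroK"],
                 FAMILIES["COG1685 archaeal-type kinase"])
dehydratase = combine(FAMILIES["COG0710 AroD"], FAMILIES["COG0757 AroQ"])

# Which single families share the pattern of the combined kinase?
matches = same_pattern(kinase, FAMILIES)

print(kinase)
print(dehydratase)
print(matches)
```

Neither kinase form alone matches the pattern of the other pathway enzymes, but the combined pattern does. This is the signature of isozymes with complementary distributions.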
Additional evidence
• sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA
• candidate signals upstream of other replication-related genes
 – dNTP salvage
 – topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II
• experimental confirmation in Streptomyces (Borovok et al., 2004)

Multiple sites (nrd genes): FNR, DnaA, NrdR

Mode of regulation
• Repressor (overlaps with promoters)
• Co-operative binding:
 – most sites occur in tandem (> 90% of cases)
 – the distance between the copies (centers of palindromes) equals an integer number of DNA turns:
   • mainly (94%) 30-33 bp, in 84% 31-32 bp (3 turns)
   • 21 bp (2 turns) in Vibrio spp.
   • 41-42 bp (4 turns) in some Firmicutes

Combined regulatory network for iron homeostasis genes in α-proteobacteria
[Figure: network linking the regulators RirA, Irr, Fur, and IscR, under [-Fe] and [+Fe] conditions, to iron uptake systems (siderophore uptake, Fe2+/Fe3+ uptake), iron storage ferritins, FeS synthesis, heme synthesis, iron-requiring enzymes, and transcription factors; signals shown include FeS, heme, and the FeS status of the cell.]
The connecting lines denote regulatory interactions, with the thickness reflecting the frequency of the interaction in the analyzed genomes. The suggested negative or positive mode of operation is shown by a dead end or an arrowhead at the end of the line.

Fe and Mn regulons
Distribution of Irr, Fur/Mur, MntR, RirA, and IscR regulons in α-proteobacteria
[Table. Columns: Organism, Abb., Irr, MUR / FUR, MntR, RirA, IscR; entries are +, -, or #?. Rows are α-proteobacterial genomes arranged in groups A, B, C, and D and by taxon: Rhizobiales (Rhizobiaceae, Bartonella, Bradyrhizobiaceae), Rhodobacterales (Rhodobacteraceae), Hyphomonadaceae, Sphingomonadales, Caulobacterales, Parvularculales, Rhodospirillales, SAR11 cluster, Rickettsiales.]
‘#?’ in the RirA column denotes the absence of the rirA gene in an unfinished genomic sequence and the presence of candidate RirA-binding sites upstream of the iron uptake genes.
Phylogenetic tree of the Fur family of transcription factors in α-proteobacteria - I
[Figure: tree. Labelled clades:
 – Fur in γ- and β-proteobacteria (Escherichia coli sp|P0A9A9, Pseudomonas aeruginosa sp|Q03456, Neisseria meningitidis sp|P0A0S7)
 – Fur in ε-proteobacteria (Helicobacter pylori sp|O25671)
 – Fur in Firmicutes (Bacillus subtilis sp|P54574)
 – Mur in α-proteobacteria: regulator of manganese uptake genes (sit, mntH)
 – Fur in α-proteobacteria: regulator of iron uptake and metabolism genes
 – Irr in α-proteobacteria]

[Figure: sequence logos.
 – Identified Mur-binding sites: the A, B, and C groups of α-proteobacteria
 – Sequence logos for the identified Fur-binding sites in the D group of α-proteobacteria
 – Sequence logos for the known Fur-binding sites in Escherichia coli and Bacillus subtilis
 Species labels on the slide: Erythrobacter litoralis, Caulobacter crescentus, Zymomonas mobilis, Novosphingobium aromaticivorans, Oceanicaulis alexandrii, Sphingopyxis alaskensis, Gluconobacter oxydans, Rhodospirillum rubrum, Parvularcula bermudensis, Magnetospirillum magneticum.]

Phylogenetic tree of the Fur family of transcription factors in α-proteobacteria - II
[Figure: tree. Labelled clades:
 – Fur in γ- and β-proteobacteria (Escherichia coli sp|P0A9A9, Pseudomonas aeruginosa sp|Q03456, Neisseria meningitidis sp|P0A0S7)
 – Fur in ε-proteobacteria (Helicobacter pylori sp|O25671)
 – Fur in Firmicutes (Bacillus subtilis sp|P54574)
 – Mur / Fur in α-proteobacteria
 – Irr in α-proteobacteria: regulator of iron homeostasis]

Sequence logos for the identified Irr binding sites in α-proteobacteria.
[Figure panels: the A group (8 species) - Irr; the B group (4 species) - Irr; the C group (12 species) - Irr]

Phylogenetic tree of the Rrf2 family of transcription factors in α-proteobacteria
[Figure: tree. Labelled clades:
 – Nitrite/NO-sensing regulator NsrR (Nitrosomonas europaea, Escherichia coli)
 – Iron repressor RirA (Rhizobium leguminosarum)
 – Cysteine metabolism repressor CymR (Bacillus subtilis)
 – Cytochrome complex regulator Rrf2 (Desulfovibrio vulgaris)
 – Iron-sulfur cluster synthesis repressor IscR (Escherichia coli); IscR-II
 Legend: positional clustering of rrf2-like genes with: iron uptake and storage genes; Fe-S cluster synthesis operons; genes involved in nitrosative stress protection; sulfate uptake/assimilation genes; thioredoxin reductase; carboxymuconolactone decarboxylase-family genes; hmc cytochrome operon.
 Markers: proteins with the conserved C-X(6-9)-C(4-6)-C motif within the effector-responsive domain; proteins without a cysteine triad motif.]

Sequence logos for the identified RirA-binding sites in α-proteobacteria
[Figure panels: the A group - RirA (8 species); the C group - RirA (12 species)]

Distribution of the conserved members of the Fe- and Mn-responsive regulons and the predicted RirA, Fur/Mur, Irr, and DtxR binding sites in α-proteobacteria
[Figure. Genes grouped by function: iron uptake; iron storage; FeS synthesis; iron usage; heme biosynthesis; regulatory genes; manganese uptake]

An attempt to reconstruct the history

Acknowledgements
• Dmitry Rodionov (comparative genomics)
• Andrei Mironov (software)
• Alexei Vitreschak (riboswitches)
• Slides:
 – Michael Galperin (NCBI, Bethesda)
 – Andrei Osterman (Burnham Institute, San Diego)
• Collaboration:
 – Thomas Eitinger (Humboldt University, Berlin): Co/Ni transporters
 – Andy Johnston (University of East Anglia): Fe in alphas
• Funding:
 – Howard Hughes Medical Institute
 – Russian Fund of Basic Research
 – RAS, program “Molecular and Cellular Biology”
 – INTAS