Transcript Document
AUTOMATED ASSEMBLY OF GENE FAMILIES AND DETECTION OF HORIZONTALLY TRANSFERRED GENES Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology Superfamily of ATP synthases for 317 taxa of bacteria and archaea August 18, 2006, Stanford University, CA Outline: Tree of Life 16s RNA tree Rooting the Tree of Life How Tree-like is an Organismal Evolution? What is Horizontal Gene Transfer? What is Organismal Lineage in light of HGT? Automated Methods of Assembling Orthologous Gene Families Reciprocal Blast Hit Method – problems with paralogs Branchclust: Phylogenetic Algorithm for Assembling Gene Families Methods of HGT Detection Overview of methods for HGT detection: AU Test, Symmetrical Difference of Robinson and Foulds, Bipartition Analysis HGT In-silico experiments The Tree of Life according to SSU ribosomal RNA (+) • strictly bifurcating • no reticulation • only extant lineages • based on a single molecular phylogeny • branch length is not proportional to time Cenancestor To Root as placed by ancient duplicated genes (ATPases, Signal recognition particles, EF) Haloferax ARCHAEA Methanospirillum Methanobacterium Thermoproteus Thermofilum BACTERIA Epulopiscium Bacillus chloroplast Synechococcus Treponema Thermus Deinococcus Thermotoga Aquifex EM 17 Methanococcus pSL 50 pSL 4 pSL 22 pSL 12 mitochondria Agrobacterium Chlorobium Cytophaga Methanosarcina Sulfolobus Marine group 1 Riftia E.coli Chromatium Thermococcus Methanopyrus pJP 27 pJP 78 0.1 changes per nt EUCARYA Tritrichomonas Hexamita Zea Homo Coprinus Paramecium Giardia Porphyra Vairimorpha Dictyostelium Physarum Encephalitozoon Trypanosoma Naegleria Entamoeba Euglena CPS V/A-ATPase Prolyl RS Lysyl RS Mitochondria Plastids Fig. modified from Norman Pace What is HGT? Genes can be passed vertically – from ancestor to a child Genes also can be passed horizontally – exchange of genes between different species HGT stands for Horizontal Gene Transfer How Tree-like is Organismal Evolution? Horizontal Gene Transfer Mosaic Genomes Science,280 p.672ff (1998) How many common genes? Escherichia coli, strain CFT073, uropathogenic Escherichia coli, strain EDL933, enterohemorrhagic Escherichia coli K12, strain MG1655, laboratory strain, “… only 39.2% of their combined (nonredundant) set of proteins actually are common to all three strains.” Welch RA, et al. Proc Natl Acad Sci U S A. 2002; 99:17020-4 Organismal Lineage What is an “organismal lineage” in light of horizontal gene transfer? Over very short time intervals an organismal lineage can be defined as the majority consensus of genes. Rope as a metaphor to describe an organismal lineage (Gary Olsen) Individual fibers = genes that travel for some time in a lineage. While no individual fiber (gene) present at the beginning might be present at the end, the rope (or the organismal lineage) nevertheless has continuity. However, the genome as a whole will acquire the character of the incoming genes (the rope turns solidly red over time). From: Bill Martin (1999) BioEssays 21, 99-104 Selection of Orthologous Gene Families All automated methods for assembling sets of orthologous genes are based on sequence similarities. BLAST hits Triangular circular BLAST significant hits (COG, or Cluster of Orthologous Groups) Sequence identity of 30% and greater (SCOP database) Similarity complemented by HMM-profile analysis Pfam database Reciprocal BLAST hit method Reciprocal BLAST Hit Method 2’ 1 2 1 2 3 4 3 4 1 gene family 0 gene family often fails in the presence of paralogs Families of ATP-synthases Case of 2 bacteria and 2 archaea species ATP-A (catalytic subunit) Escherichia coli Escherichia coli ATP-A ATP-B ATP-B (non-catalytic subunit) ATP-A ATP-B Methanosarcina mazei ATP-F Methanosarcina mazei ATP-A ATP-A ATP-A ATP-A ATP-B ATP-B ATP-B ATP-B Sulfolobus solfataricus Sulfolobus solfataricus ATP-A ATP-B Bacillus subtilis ATP-A ATP-B ATP-F Bacillus subtilis Neither ATP-A nor ATB-B is selected by RBH method Families of ATP-synthases Phylogenetic Tree Family of ATP-A Sulfolobus solfataricus ATP-A Methanosarcina mazei ATP-A Bacillus subtilis ATP-A ATP-A Escherichia coli Bacillus subtilis ATP-F ATP-B Escherichia coli Escherichia coli ATP-F ATP-B ATP-B Family of ATP-F Sulfolobus solfataricus Bacillus subtilis ATP-B Methanosarcina mazei Family of ATP-B BranchClust Algorithm genome 1 genome i genome 2 BLAST hits genome 3 genome N dataset of N genomes www.bioinformatics.org/branchclust superfamily tree BranchClust Algorithm www.bioinformatics.org/branchclust BranchClust Algorithm Root positions Superfamily of penicillin-binding protein 13 gamma proteo bacteria www.bioinformatics.org/branchclust Superfamily of DNA-binding protein 13 gamma proteo bacteria BranchClust Algorithm Comparison of the best BLAST hit method and BranchClust algorithm Number of taxa A: Archaea B: Bacteria Number of selected families: Reciprocal best BLAST hit BranchClust 2A 2B 80 414 (all complete) 13B 236 409 (263 complete, 409 with n8 ) 16B 14A 12 126 (60 complete, 126 with n24). www.bioinformatics.org/branchclust BranchClust Algorithm ATP-synthases: Examples of Clustering 13 gamma proteobacteria 30 taxa: 16 bacteria and 14 archaea www.bioinformatics.org/branchclust 317 bacteria and archaea BranchClust Algorithm Typical Superfamily for 30 taxa (16 bacteria and 14 archaea) 59:30 33:19 53:26 55:21 37:19 36:21 www.bioinformatics.org/branchclust BranchClust Algorithm Gene Annotation Superfamily of penicillin-binding protein for 13 gamma proteobacteria ------------ CLUSTER 1 ---------------------- FAMILY ----------->gi|27904705| peptidoglycan synthetase FtsI [Buchnera aphidicola str. Bp (Baizongia pistaciae)] >gi|26246017| Peptidoglycan synthetase ftsI precursor [Escherichia coli CFT073] >gi|16273058| penicillin-binding protein 3 [Haemophilus influenzae Rd KW20] >gi|15602001| FtsI [Pasteurella multocida subsp. multocida str. Pm70] >gi|15599614| penicillin-binding protein 3 [Pseudomonas aeruginosa PAO1] >gi|16763512| division specific transpeptidase [Salmonella typhimurium LT2] >gi|15642404| penicillin-binding protein 3 [Vibrio cholerae O1 biovar eltor str. N16961] >gi|32490961| hypothetical protein WGLp212 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis] >gi|21230194| penicillin-binding protein 3 [Xanthomonas campestris pv. campestris str. ATCC 33913] >gi|21241544| penicillin-binding protein 3 [Xanthomonas axonopodis pv. citri str. 306] >gi|15837394| penicillin binding protein 3 [Xylella fastidiosa 9a5c] >gi|16120877| penicillin-binding protein 3 [Yersinia pestis CO92] >gi|22127506| peptidoglycan synthetase [Yersinia pestis KIM] COMPLETE: 13 >>>>> IN-PARALOGS ---------->gi|16765177| putative penicillin-binding protein 3 [Salmonella typhimurium LT2] >gi|15597468| penicillin-binding protein 3A [Pseudomonas aeruginosa PAO1] ------------ CLUSTER 2 ---------------------- FAMILY ----------->gi|26246616| Penicillin-binding protein 2 [Escherichia coli CFT073] >gi|16272007| penicillin-binding protein 2 [Haemophilus influenzae Rd KW20] >gi|15603789| Pbp2 [Pasteurella multocida subsp. multocida str. Pm70] >gi|15599198| penicillin-binding protein 2 [Pseudomonas aeruginosa PAO1] >gi|16764017| cell elongation-specific transpeptidase [Salmonella typhimurium LT2] >gi|15640966| penicillin-binding protein 2 [Vibrio cholerae O1 biovar eltor str. N16961] >gi|32490921| hypothetical protein WGLp172 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis] >gi|21232896| penicillin-binding protein 2 [Xanthomonas campestris pv. campestris str. ATCC 33913] >gi|21241430| penicillin-binding protein 2 [Xanthomonas axonopodis pv. citri str. 306] >gi|15837913| penicillin binding protein 2 [Xylella fastidiosa 9a5c] >gi|16122817| penicillin-binding protein 2 [Yersinia pestis CO92] >gi|22125081| peptidoglycan synthetase, penicillin-binding protein 2 [Yersinia pestis KIM] INCOMPLETE: 12 >>>>> IN-PARALOGS ---------->gi|16765252| putative penicillin-binding protein [Salmonella typhimurium LT2] www.bioinformatics.org/branchclust BranchClust Algorithm Implementation and Usage The BranchClust algorithm is implemented in Perl with the use of the BioPerl module for parsing trees and is freely available at http://bioinformatics.org/branchclust Required: 1.Bioperl module for parsing trees Bio::TreeIO 2. Taxa recognition file gi_numbers.out must be present in the current directory. How to create this file, read the Taxa recognition file section on the web-site. Usage: At the command line type: # perl branch_clust.pl <tree-file> <MANY> Output: families.list, clusters.out, clusters.log How to do batch processing: Example of a wrapper you can find on the web-site. www.bioinformatics.org/branchclust BranchClust Algorithm Data Flow Download n complete genomes (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria) In fasta format (*.faa) Put all n genomes in one database Take one starting genome and do BLAST of this genome against the database, consisting of n genomes Align with ClustalW Reconstruct superfamily tree ClustalW –quick distance method Phyml – Maximum Likelihood Parse with BranchClust Gene families Parse BLAST-output with the requirement that all n-taxa should be present Superfamilies www.bioinformatics.org/branchclust Why do we need gene families? How many genes are common between different species? Do all the common genes share the common history? How do we reconstruct the tree of life? How can we detect genes that were horizontally transferred? Methods of HGT Detection • Parametric methods • GC-content analysis • analysis of single nucleotide composition (SNC) and dinucleotide composition (DNC) • codon usage bias • other measures based on sequence composition • Phylogenetic methods • AU-test • SPR metric (NP-hard problem) • Symmeytric difference of Robison and Foulds • Biparition spectrum analysis • Quartet spectrum analysis AU test Hidetoshi Shimodaira (2002) P1 P2 AU test, or approximately unbiased test of phylogenetic tree selection was proposed for assessing the confidence of tree selection. The AU test method produces for each tree a number ranging from zero to one – (P1 and P2). This number Is the probability value that the tree is the true tree. The greater the P-value, the greater the probability that the tree is the true tree. If P1 = genome tree , and P2 = gene tree, then the AU test provides the probability that P1 is the true tree for the gene family. One would expect that a tree with different topology would have a small P-value. unclear Accepted requirement for HGT detection: P-Value < 1E-2 – 1E-4 Some Metrics to Compare Tree Topologies • SPR – distance • Symmetric Difference of Robinson and Foulds (bipartition distance) • Quartet distance SPR metric – Subtree Pruning and Regrafting is very hard computationally There are no tools available to calculate the difference In tree topology by number of SPR-operations required to transform one tree to another. There is Robert Beiko’s program.. Also the SPR distance alone would not consider support levels. That is why bipartitions come into the scene Spectral Analyses of Phylogenetic Data Phylogenetic information present in genomes Break information into small quanta of information (bipartitions or embedded quartets) Analyze spectra to detect transferred genes and plurality consensus. BIPARTITION OF A PHYLOGENETIC TREE Bipartition (or split) – a division of a phylogenetic tree into two parts that are connected by a single branch. It divides a dataset into two groups, but it does not consider the relationships within each of the two groups. Number of non-trivial bipartitions for N genomes is equal to 2(N-1)-N-1. A **… ***.. B C D E A E C D B *...* *.*.* Bipartitions can be divided in conflicting and non-conflicting non-conflicting (can coexist in one tree) conflicting (can not coexist in one tree) **… ***.. **… *…* The Tree Drawing Process Try to infer phylogeny unclear Choose/ make consensus “Likely” Trees “Best” Tree Obtaining Bootstrap Support for Branches Now bipartitions have weights: .**….. 65 ***….. 75 …**… 75 …..*** 100 ……** 100 Resampling e.g.. bootstrapping 75 65 75 100 100 A B C D E F G H *.*….. Resampling simulates examining extra sequence from the original data 75 Data Flow select gene families For every gene family, align sequences (ClustalW) For every gene family, reconstruct Maximum Likelihood (ML) Tree and generate 100 bootstrap samples (phyml) For every gene family, extract bipartition information from each bootstrapped tree, and compose a bipartition matrix Bipartitions matrix is generated **….. *.*…. *…*…. *….*…. etc. fam1 71 34 0 99 fam2 56 2 0 76 fam3 1 99 1 99 … famN 34 99 0 99 Do “Lento” Plot analysis Results: majority consensus bipartitions detected conflicts = HGT events 13 gamma proteobacteria 1. Buchnera aphidicola str. Bp (Baizongia pistaciae) 2. Escherichia coli CFT073 3. Haemophilus influenzae Rd KW20 4. Pasteurella multocida subsp. multocida str. Pm70 5. Pseudomonas aeruginosa PAO1 6. Salmonella typhimurium LT2 7. Vibrio cholerae O1 biovar eltor str. N16961 8. Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis 9. Xanthomonas campestris pv. campestris str. ATCC 33913 10. Xanthomonas axonopodis pv. citri str. 306 11. Xylella fastidiosa 9a5c 12. Yersinia pestis KIM 13. Yersinia pestis CO92 1.37E+10 possible unrooted tree topologies “Lento”-plot of 34 supported bipartitions (out of 4082 possible) 13 gammaproteobacterial genomes (258 putative orthologs): •E.coli •Buchnera •Haemophilus •Pasteurella •Salmonella •Yersinia pestis (2 strains) •Vibrio •Xanthomonas (2 sp.) •Pseudomonas •Wigglesworthia There are 13,749,310,575 possible unrooted tree topologies for 13 genomes Consensus Tree and Horizontally Transferred Gene Consensus clusters of eight significantly supported bipartitions only 258 genes analyzed Phylogeny of putatively transferred gene (virulence factor homologs (mviN)) What are the Actual HGT Rates? Are the detected transfers mainly false positives, or are they the tip of an iceberg of many transfer events most of which go undetected by current methods? Here we explore how well these methods perform using in silico transfers between the leaves of a gamma proteobacterial phylogeny. HGT in silico: Testing Methods of Detection •AU test •Symmetric Difference of Robinson – Foulds •Biparition Analysis HGT in silico: AU test 13 gamma proteobacteria 236 families AU Test: Number of Detected Conflicts A Number of Conflicts 250 200 150 100 50 0 -4 -3,5 -3 -2,5 -2 -1,5 -1 -0,5 Logarithm of Significance Level Only two families out of 236 showed a conflict at the significance level of 5 *10-4, 5 conflicts were found at the significance level of 0,01 and 26 conflicts at the significance level of 0,05. 0 HGT in silico: AU test 13 gamma proteobacteria 236 families Pseudomonas aeroginosa Vibrio cholera 120 100 80 60 40 20 Au-value = 10-4 Log(Au-value)=-4 Only 10% of au-values is less than 10-4 0 -4 0. 0 -3 0 7. 5 -3 0 5. 0 -3 0 2. 5 -3 0 0. 0 -2 0 7. 5 -2 0 5. 0 -2 0 2. 5 -2 0 0. 0 -1 0 7. 5 -1 0 5. 0 -1 0 2. 5 -1 0 0. 00 -7 .5 0 -5 .0 0 -2 .5 0 0. 00 -4 0. 0 -3 0 7. 5 -3 0 5. 0 -3 0 2. 5 -3 0 0. 0 -2 0 7. 5 -2 0 5. 0 -2 0 2. 5 -2 0 0. 0 -1 0 7. 5 -1 0 5. 0 -1 0 2. 5 -1 0 0. 00 -7 .5 0 -5 .0 0 -2 .5 0 0. 00 50 45 40 35 30 25 20 15 10 5 0 Escherichia coli Xylella fastidosa 86% of au-values is less than 10-4 HGT in silico: AU test Power of Detection Significance level < 1e-4 Significance level < 1e-2 HGT in silico: Robinson Foulds Metric Power of Detection Significance level < 1e-2 Distribution of number of different bipartitions in the original dataset 100 90 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 Nu of different bipartitions 10 11 12 13 HGT in silico: Bipartition Analysis Power of Detection Bootstrap support >70% Bootstrap support >90% Acknowledgements Prof. Peter Gogarten, University of Connecticut Gogarten Lab: Pascal Lapierre Gregory Fournier Alireza Ghodsi Senejani Holly E. Gardner Tim Harlow Kristen Swithers Kaiyuan Shi NSF Microbial Genetics NASA Exobiology & AISR Programs