Transcript Document

AUTOMATED ASSEMBLY OF GENE FAMILIES
AND DETECTION OF HORIZONTALLY
TRANSFERRED GENES
Maria Poptsova
University of Connecticut
Dept. of Molecular and Cell Biology
Superfamily of ATP synthases for 317 taxa
of bacteria and archaea
August 18, 2006, Stanford University, CA
Outline:
Tree of Life
16s RNA tree
Rooting the Tree of Life
How Tree-like is an Organismal Evolution?
What is Horizontal Gene Transfer?
What is Organismal Lineage in light of HGT?
Automated Methods of Assembling Orthologous Gene Families
Reciprocal Blast Hit Method – problems with paralogs
Branchclust: Phylogenetic Algorithm for Assembling Gene Families
Methods of HGT Detection
Overview of methods for HGT detection:
AU Test, Symmetrical Difference of Robinson and Foulds, Bipartition Analysis
HGT In-silico experiments
The Tree of Life according to SSU ribosomal RNA (+)
• strictly bifurcating
• no reticulation
• only extant lineages
• based on a single
molecular phylogeny
• branch length is not
proportional to time
Cenancestor
To Root
as placed by ancient duplicated
genes (ATPases, Signal
recognition particles, EF)
Haloferax
ARCHAEA
Methanospirillum
Methanobacterium
Thermoproteus
Thermofilum
BACTERIA
Epulopiscium
Bacillus
chloroplast
Synechococcus
Treponema
Thermus
Deinococcus
Thermotoga
Aquifex
EM 17
Methanococcus
pSL 50
pSL 4
pSL 22
pSL 12
mitochondria
Agrobacterium
Chlorobium
Cytophaga
Methanosarcina
Sulfolobus
Marine
group 1
Riftia E.coli
Chromatium
Thermococcus
Methanopyrus
pJP 27
pJP 78
0.1 changes per nt
EUCARYA
Tritrichomonas
Hexamita
Zea
Homo
Coprinus
Paramecium
Giardia
Porphyra
Vairimorpha
Dictyostelium
Physarum
Encephalitozoon
Trypanosoma
Naegleria
Entamoeba
Euglena
CPS
V/A-ATPase
Prolyl RS
Lysyl RS
Mitochondria
Plastids
Fig. modified from
Norman Pace
What is HGT?
Genes can be passed vertically – from
ancestor to a child
Genes also can be passed horizontally –
exchange of genes between different
species
HGT stands for Horizontal Gene Transfer
How Tree-like is Organismal
Evolution?
 Horizontal Gene Transfer
 Mosaic Genomes
Science,280 p.672ff (1998)
How many common genes?
Escherichia coli, strain CFT073, uropathogenic
Escherichia coli, strain EDL933, enterohemorrhagic
Escherichia coli K12, strain MG1655, laboratory strain,
“… only 39.2% of their
combined (nonredundant)
set of proteins actually are
common to all three
strains.”
Welch RA, et al.
Proc Natl Acad Sci U S A. 2002; 99:17020-4
Organismal Lineage
What is an “organismal lineage” in light of
horizontal gene transfer?
Over very short time intervals an organismal
lineage can be defined as the majority
consensus of genes.
Rope as a metaphor to describe an organismal lineage
(Gary Olsen)
Individual fibers = genes that travel for some time in a lineage.
While no individual fiber (gene) present at the beginning might be present at
the end, the rope (or the organismal lineage) nevertheless has continuity.
However, the genome as a whole will acquire the character
of the incoming genes (the rope turns solidly red over time).
From:
Bill Martin (1999)
BioEssays 21, 99-104
Selection of Orthologous Gene
Families
All automated methods for assembling sets of orthologous
genes are based on sequence similarities.
BLAST hits
Triangular circular BLAST significant hits
(COG, or Cluster of Orthologous Groups)
Sequence identity of 30% and greater
(SCOP database)
Similarity complemented by HMM-profile analysis
Pfam database
Reciprocal BLAST hit method
Reciprocal BLAST Hit Method
2’
1
2
1
2
3
4
3
4
1 gene family
0 gene family
often fails in the presence of paralogs
Families of ATP-synthases
Case of 2 bacteria and 2 archaea species
ATP-A (catalytic subunit)
Escherichia coli
Escherichia coli
ATP-A
ATP-B
ATP-B (non-catalytic subunit)
ATP-A
ATP-B
Methanosarcina
mazei
ATP-F
Methanosarcina
mazei
ATP-A
ATP-A
ATP-A
ATP-A
ATP-B
ATP-B
ATP-B
ATP-B
Sulfolobus
solfataricus
Sulfolobus
solfataricus
ATP-A
ATP-B
Bacillus subtilis
ATP-A
ATP-B
ATP-F
Bacillus subtilis
Neither ATP-A nor ATB-B is selected by RBH method
Families of ATP-synthases
Phylogenetic Tree
Family of ATP-A
Sulfolobus solfataricus
ATP-A
Methanosarcina
mazei
ATP-A
Bacillus subtilis
ATP-A
ATP-A
Escherichia coli
Bacillus subtilis ATP-F
ATP-B
Escherichia coli
Escherichia coli
ATP-F
ATP-B
ATP-B
Family of ATP-F
Sulfolobus
solfataricus
Bacillus subtilis
ATP-B
Methanosarcina
mazei
Family of ATP-B
BranchClust Algorithm
genome 1
genome i
genome 2
BLAST
hits
genome 3
genome N
dataset of N genomes
www.bioinformatics.org/branchclust
superfamily
tree
BranchClust Algorithm
www.bioinformatics.org/branchclust
BranchClust Algorithm
Root positions
Superfamily of penicillin-binding protein
13 gamma proteo bacteria
www.bioinformatics.org/branchclust
Superfamily of DNA-binding protein
13 gamma proteo bacteria
BranchClust Algorithm
Comparison of the best BLAST hit method and BranchClust algorithm
Number of taxa A: Archaea
B: Bacteria
Number of selected families:
Reciprocal best
BLAST hit
BranchClust
2A 2B
80
414 (all complete)
13B
236
409 (263 complete, 409 with n8 )
16B 14A
12
126 (60 complete, 126 with n24).
www.bioinformatics.org/branchclust
BranchClust Algorithm
ATP-synthases: Examples of Clustering
13 gamma proteobacteria
30 taxa: 16 bacteria and 14 archaea
www.bioinformatics.org/branchclust
317 bacteria and archaea
BranchClust Algorithm
Typical Superfamily for 30 taxa (16 bacteria and 14 archaea)
59:30
33:19
53:26
55:21
37:19
36:21
www.bioinformatics.org/branchclust
BranchClust Algorithm
Gene Annotation
Superfamily of penicillin-binding protein for 13 gamma proteobacteria
------------ CLUSTER 1 ---------------------- FAMILY ----------->gi|27904705| peptidoglycan synthetase FtsI [Buchnera aphidicola str. Bp (Baizongia pistaciae)]
>gi|26246017| Peptidoglycan synthetase ftsI precursor [Escherichia coli CFT073]
>gi|16273058| penicillin-binding protein 3 [Haemophilus influenzae Rd KW20]
>gi|15602001| FtsI [Pasteurella multocida subsp. multocida str. Pm70]
>gi|15599614| penicillin-binding protein 3 [Pseudomonas aeruginosa PAO1]
>gi|16763512| division specific transpeptidase [Salmonella typhimurium LT2]
>gi|15642404| penicillin-binding protein 3 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi|32490961| hypothetical protein WGLp212 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi|21230194| penicillin-binding protein 3 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi|21241544| penicillin-binding protein 3 [Xanthomonas axonopodis pv. citri str. 306]
>gi|15837394| penicillin binding protein 3 [Xylella fastidiosa 9a5c]
>gi|16120877| penicillin-binding protein 3 [Yersinia pestis CO92]
>gi|22127506| peptidoglycan synthetase [Yersinia pestis KIM]
COMPLETE: 13
>>>>> IN-PARALOGS ---------->gi|16765177| putative penicillin-binding protein 3 [Salmonella typhimurium LT2]
>gi|15597468| penicillin-binding protein 3A [Pseudomonas aeruginosa PAO1]
------------ CLUSTER 2 ---------------------- FAMILY ----------->gi|26246616| Penicillin-binding protein 2 [Escherichia coli CFT073]
>gi|16272007| penicillin-binding protein 2 [Haemophilus influenzae Rd KW20]
>gi|15603789| Pbp2 [Pasteurella multocida subsp. multocida str. Pm70]
>gi|15599198| penicillin-binding protein 2 [Pseudomonas aeruginosa PAO1]
>gi|16764017| cell elongation-specific transpeptidase [Salmonella typhimurium LT2]
>gi|15640966| penicillin-binding protein 2 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi|32490921| hypothetical protein WGLp172 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi|21232896| penicillin-binding protein 2 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi|21241430| penicillin-binding protein 2 [Xanthomonas axonopodis pv. citri str. 306]
>gi|15837913| penicillin binding protein 2 [Xylella fastidiosa 9a5c]
>gi|16122817| penicillin-binding protein 2 [Yersinia pestis CO92]
>gi|22125081| peptidoglycan synthetase, penicillin-binding protein 2 [Yersinia pestis KIM]
INCOMPLETE: 12
>>>>> IN-PARALOGS ---------->gi|16765252| putative penicillin-binding protein [Salmonella typhimurium LT2]
www.bioinformatics.org/branchclust
BranchClust Algorithm
Implementation and Usage
The BranchClust algorithm is implemented in Perl with the use of the BioPerl module for
parsing trees and is freely available at http://bioinformatics.org/branchclust
Required:
1.Bioperl module for parsing trees Bio::TreeIO
2. Taxa recognition file gi_numbers.out must be present in the current directory.
How to create this file, read the Taxa recognition file section on the web-site.
Usage:
At the command line type:
# perl branch_clust.pl <tree-file> <MANY>
Output: families.list, clusters.out, clusters.log
How to do batch processing:
Example of a wrapper you can find on the web-site.
www.bioinformatics.org/branchclust
BranchClust Algorithm
Data Flow
Download n complete genomes
(ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria)
In fasta format (*.faa)
Put all n genomes in one database
Take one starting genome and do
BLAST of this genome against the
database, consisting of n genomes
Align with ClustalW
Reconstruct superfamily tree
ClustalW –quick distance method
Phyml – Maximum Likelihood
Parse with BranchClust
Gene families
Parse BLAST-output with the
requirement that all n-taxa should be
present
Superfamilies
www.bioinformatics.org/branchclust
Why do we need gene families?
How many genes are common between
different species?
Do all the common genes share the
common history?
How do we reconstruct the tree of life?
How can we detect genes that were
horizontally transferred?
Methods of HGT Detection
• Parametric methods
• GC-content analysis
• analysis of single nucleotide composition (SNC) and
dinucleotide composition (DNC)
• codon usage bias
• other measures based on sequence composition
• Phylogenetic methods
• AU-test
• SPR metric (NP-hard problem)
• Symmeytric difference of Robison and Foulds
• Biparition spectrum analysis
• Quartet spectrum analysis
AU test
Hidetoshi Shimodaira (2002)
P1
P2
AU test, or approximately unbiased test of phylogenetic tree
selection was proposed for assessing the confidence of tree selection.
The AU test method produces for each tree a number ranging from zero to one –
(P1 and P2). This number Is the probability value that the tree is the true tree.
The greater the P-value, the greater the probability that the tree is the true tree.
If P1 = genome tree , and P2 = gene tree, then the AU test provides the
probability that P1 is the true tree for the gene family.
One would expect that a tree with different topology would have a small P-value.
unclear
Accepted requirement for HGT detection:
P-Value < 1E-2 – 1E-4
Some Metrics to Compare
Tree Topologies
• SPR – distance
• Symmetric Difference of Robinson and
Foulds (bipartition distance)
• Quartet distance
SPR metric –
Subtree Pruning and Regrafting
is very hard computationally
There are no tools available to calculate the difference
In tree topology by number of SPR-operations
required to transform
one tree to another.
There is Robert Beiko’s program..
Also the SPR distance alone would not consider support levels.
That is why bipartitions come into the scene
Spectral Analyses of
Phylogenetic Data
Phylogenetic
information
present in
genomes
Break information
into small quanta
of information
(bipartitions or
embedded quartets)
Analyze spectra
to detect
transferred genes
and plurality
consensus.
BIPARTITION OF A PHYLOGENETIC TREE
Bipartition (or split) – a division of a phylogenetic tree into two parts that are
connected by a single branch. It divides a dataset into two groups, but it does not
consider the relationships within each of the two groups.
Number of non-trivial bipartitions for N genomes is equal to 2(N-1)-N-1.
A
**…
***..
B
C D E
A
E
C D B
*...*
*.*.*
Bipartitions can be divided in conflicting and non-conflicting
non-conflicting (can coexist in one tree) conflicting (can not coexist in one tree)
**…
***..
**…
*…*
The Tree Drawing Process
Try to infer
phylogeny
unclear
Choose/
make
consensus
“Likely” Trees “Best” Tree
Obtaining Bootstrap Support for Branches
Now bipartitions
have weights:
.**…..
65
***…..
75
…**…
75
…..***
100
……**
100
Resampling e.g..
bootstrapping
75
65
75
100
100
A
B
C
D
E
F
G
H
*.*…..
Resampling simulates examining extra
sequence from the original data
75
Data Flow
select gene families
For every gene family,
align sequences (ClustalW)
For every gene family,
reconstruct Maximum
Likelihood (ML) Tree
and generate 100 bootstrap
samples (phyml)
For every gene family,
extract bipartition information
from each bootstrapped tree,
and compose a bipartition
matrix
Bipartitions matrix is
generated
**….. *.*…. *…*…. *….*…. etc.
fam1 71
34
0
99
fam2 56
2
0
76
fam3 1
99
1
99
…
famN 34
99
0
99
Do “Lento” Plot analysis
Results:
majority
consensus
bipartitions
detected
conflicts =
HGT events
13 gamma proteobacteria
1. Buchnera aphidicola str. Bp (Baizongia pistaciae)
2. Escherichia coli CFT073
3. Haemophilus influenzae Rd KW20
4. Pasteurella multocida subsp. multocida str. Pm70
5. Pseudomonas aeruginosa PAO1
6. Salmonella typhimurium LT2
7. Vibrio cholerae O1 biovar eltor str. N16961
8. Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis
9. Xanthomonas campestris pv. campestris str. ATCC 33913
10. Xanthomonas axonopodis pv. citri str. 306
11. Xylella fastidiosa 9a5c
12. Yersinia pestis KIM
13. Yersinia pestis CO92
1.37E+10 possible unrooted tree topologies
“Lento”-plot of 34 supported bipartitions (out of 4082 possible)
13
gammaproteobacterial
genomes
(258 putative
orthologs):
•E.coli
•Buchnera
•Haemophilus
•Pasteurella
•Salmonella
•Yersinia pestis
(2 strains)
•Vibrio
•Xanthomonas
(2 sp.)
•Pseudomonas
•Wigglesworthia
There are
13,749,310,575
possible
unrooted tree
topologies for
13 genomes
Consensus Tree and Horizontally Transferred Gene
Consensus clusters of eight significantly
supported bipartitions
only 258 genes analyzed
Phylogeny of putatively transferred gene
(virulence factor homologs (mviN))
What are the Actual HGT Rates?
Are the detected transfers mainly false positives, or are they
the tip of an iceberg of many transfer events most of which
go undetected by current methods?
Here we explore how well these methods perform using
in silico transfers between the leaves of a gamma
proteobacterial phylogeny.
HGT in silico:
Testing Methods of Detection
•AU test
•Symmetric Difference of Robinson – Foulds
•Biparition Analysis
HGT in silico: AU test
13 gamma proteobacteria
236 families
AU Test: Number of Detected Conflicts
A
Number of Conflicts
250
200
150
100
50
0
-4
-3,5
-3
-2,5
-2
-1,5
-1
-0,5
Logarithm of Significance Level
Only two families out of 236 showed a conflict at the significance
level of 5 *10-4, 5 conflicts were found at the significance level of
0,01 and 26 conflicts at the significance level of 0,05.
0
HGT in silico: AU test
13 gamma proteobacteria
236 families
Pseudomonas aeroginosa  Vibrio cholera
120
100
80
60
40
20
Au-value = 10-4
Log(Au-value)=-4
Only 10% of au-values is less than 10-4
0
-4
0.
0
-3 0
7.
5
-3 0
5.
0
-3 0
2.
5
-3 0
0.
0
-2 0
7.
5
-2 0
5.
0
-2 0
2.
5
-2 0
0.
0
-1 0
7.
5
-1 0
5.
0
-1 0
2.
5
-1 0
0.
00
-7
.5
0
-5
.0
0
-2
.5
0
0.
00
-4
0.
0
-3 0
7.
5
-3 0
5.
0
-3 0
2.
5
-3 0
0.
0
-2 0
7.
5
-2 0
5.
0
-2 0
2.
5
-2 0
0.
0
-1 0
7.
5
-1 0
5.
0
-1 0
2.
5
-1 0
0.
00
-7
.5
0
-5
.0
0
-2
.5
0
0.
00
50
45
40
35
30
25
20
15
10
5
0
Escherichia coli  Xylella fastidosa
86% of au-values is less than 10-4
HGT in silico: AU test
Power of Detection
Significance level < 1e-4
Significance level < 1e-2
HGT in silico:
Robinson Foulds Metric
Power of Detection
Significance level < 1e-2
Distribution of number of different
bipartitions in the original dataset
100
90
80
70
60
50
40
30
20
10
0
1
2
3
4
5
6
7
8
9
Nu of different bipartitions
10
11
12
13
HGT in silico: Bipartition Analysis
Power of Detection
Bootstrap support >70%
Bootstrap support >90%
Acknowledgements
Prof. Peter Gogarten, University of Connecticut
Gogarten Lab:
Pascal Lapierre
Gregory Fournier
Alireza Ghodsi Senejani
Holly E. Gardner
Tim Harlow
Kristen Swithers
Kaiyuan Shi
NSF Microbial Genetics
NASA Exobiology & AISR Programs