The multispecies coalescent:
implications for inferring species
trees
James Degnan
21 February 2008
Outline
1. Background
--gene trees vs. species trees
--coalescence and incomplete lineage sorting
2. Inferring species trees
--Concatenation
--Consensus Trees
3. Conclusions
Population Genetics and Phylogenetics
Population genetics: traditionally used to
analyze single populations.
Phylogenetics: What is the best way to
infer relationships between
populations/species?
Graphic by Mark A. Klinger, Carnegie Museum of Natural History, Pittsburgh
Desirable properties of species tree estimators
1. Statistical consistency (sample size = # of genes)
2. Efficiency
3. Robustness to violations in assumptions
Bridging the popgen/phylo divide
“Incorporation of explicit models of lineage sorting will be needed
for continued development of phylogenetic inference near the
species level.” –Maddison and Knowles (2006).
“Closer integration of population-genetic factors in phylogenetics,
including further insights into gene-tree/species tree, and
horizontal gene transfer.” --from Mike Steel’s website, My pick
for five directions in phylogenetics that will grow in the next five
years (2006).
The coalescent process
[Figure: coalescent genealogies running from past to present, shown for one population and for multiple populations/species.]
Gene tree in a species tree
[Figure: model species tree on taxa A, B, C, D with a gene tree drawn inside it.]
The gene tree is a random variable. The gene tree distribution is
parameterized by the species tree topology and internal branch lengths.
How can we compute probabilities of
gene trees given species trees?
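-Under a coalescent model, probabilities for gene trees with
three species were derived by Nei (1987): the gene tree matches the
species tree with probability $1 - (2/3)e^{-T}$, where $T$ is the
internal branch length in coalescent units.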
-Probabilities for the gene tree to match the species tree
topology for 4 and 5 species given by Pamilo and Nei (1988).
-All 30 species tree/gene tree combinations for 4 species given by
Rosenberg (2002).
-General case solved by Degnan and Salter (2005) and implemented in the
program COAL. Also allows $n_i \ge 0$ individuals sampled in species $i$.
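The three-species case is small enough to work through by hand. The sketch below assumes a species tree ((A,B),C) with internal branch length T in coalescent units, and uses the standard result that each of the two discordant topologies has probability (1/3)e^{-T}. The function name is illustrative and is not taken from COAL.

```python
import math

def three_taxon_gene_tree_probs(T):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    T is the internal branch length in coalescent units (an assumption
    of this sketch; the slides only give the matching-tree formula).
    """
    discordant = math.exp(-T) / 3.0      # each of ((A,C),B) and ((B,C),A)
    matching = 1.0 - 2.0 * discordant    # 1 - (2/3) e^{-T}
    return {
        "((A,B),C)": matching,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

# Short branches give nearly equal probabilities; long branches give
# near-certain agreement with the species tree.
for T in (0.1, 1.0, 4.2):
    print(T, three_taxon_gene_tree_probs(T))
```

At T = 0 all three topologies have probability 1/3, which is why short internal branches produce so much gene tree discordance.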
Definition: a coalescent history is a list of the populations in which
each coalescent event occurs.
[Figure: gene tree drawn inside the species tree on taxa A, B, C, D.]
This coalescent history: (1,3,3)
Other coalescent histories: (2,3,3), (3,3,3)
Gene tree probabilities
$$\Pr[G = g \mid S] = \sum_{\text{histories}} \Pr[G = g, \text{history} \mid S] = \sum_{\text{histories}} \prod_{b} w_b \, p_{u(b),v(b)}(T_b)$$
-Sum over histories: combinatorial enumeration, complexity only known in special cases.
-Product over $b$: the internal branches of $S$.
-$w_b$: probability that the coalescences are consistent with $g$.
-$p_{u(b),v(b)}(T_b)$: probability that $u$ lineages coalesce into $v$.
-$T_b$: branch length.
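The sum over coalescent histories can be checked on the smallest case. The sketch below assumes the standard closed form for the probability that u lineages coalesce into v within time T (Tavaré, 1984), and then sums the two histories for a gene tree matching the species tree ((A,B),C). The function names are illustrative, and treating the root population as a fixed weight of 1/3 is a simplification made for this example.

```python
import math

def p_uv(u, v, T):
    """Probability that u lineages coalesce into v lineages within time T
    (coalescent units), using the standard closed-form expression."""
    if not 1 <= v <= u:
        return 0.0
    total = 0.0
    for k in range(v, u + 1):
        term = math.exp(-k * (k - 1) * T / 2.0)
        term *= (2 * k - 1) * (-1) ** (k - v)
        term /= math.factorial(v) * math.factorial(k - v) * (v + k - 1)
        for y in range(k):
            term *= (v + y) * (u - y) / (u + y)
        total += term
    return total

def prob_matching_three_taxa(T):
    """Sum over the two coalescent histories of the matching gene tree."""
    # History 1: A and B coalesce on the internal branch (2 -> 1 lineages);
    # the only possible coalescence is consistent with g, so the weight is 1.
    h1 = 1.0 * p_uv(2, 1, T)
    # History 2: no coalescence on the internal branch (2 -> 2 lineages);
    # above the root, 1 of the 3 possible first pairs gives the matching tree.
    h2 = p_uv(2, 2, T) * (1.0 / 3.0)
    return h1 + h2

print(prob_matching_three_taxa(1.0))
```

The two histories add up to (1 - e^{-T}) + (1/3)e^{-T}, which is the three-species formula 1 - (2/3)e^{-T}.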
[Figure: observed gene tree data from Ebersberger et al. 2007, Mol. Biol. Evol. 24:2266-2276, compared with the theoretical distribution based on parameters from Rannala and Yang, 2003, Genetics 164:1645-1656; t/N = 4.2 and 1.2.]
[Figure: species tree with branch lengths labeled x and y.]
Definition: a gene tree which is more probable than the gene tree
matching the species tree is called an anomalous gene tree (Degnan and
Rosenberg, 2006).
Theorem 1. For the asymmetric species tree topology with four
species and for any species tree topology with more than four
species, there exist branch lengths such that at least one gene tree
is anomalous (Degnan and Rosenberg, 2006).
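An anomalous gene tree can be observed by simulation without the full probability formula. The sketch below is a Monte Carlo illustration, not the analytic method of Degnan and Rosenberg: it simulates one lineage per species on the asymmetric species tree (((A,B),C),D) and compares the frequency of the matching topology with that of the symmetric topology ((A,B),(C,D)). The parameter names `t_ab` and `t_abc` (the branches above (A,B) and above ((A,B),C)) are hypothetical, and the value 0.05 for both is an assumed example of short internal branches.

```python
import math
import random
from collections import Counter

def coalesce_in_branch(lineages, length, rng):
    """Run the coalescent on the given lineages for `length` coalescent units."""
    lineages = list(lineages)
    remaining = length
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2)   # rate of the next coalescence
        if wait > remaining:
            break
        remaining -= wait
        i, j = rng.sample(range(k), 2)            # pick a random pair to merge
        merged = frozenset([lineages[i], lineages[j]])
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return lineages

def simulate_gene_tree(t_ab, t_abc, rng):
    """Simulate one gene tree topology on the species tree (((A,B),C),D)."""
    ab = coalesce_in_branch(["A", "B"], t_ab, rng)
    abc = coalesce_in_branch(ab + ["C"], t_abc, rng)
    root = coalesce_in_branch(abc + ["D"], math.inf, rng)
    return root[0]   # nested frozensets encode the rooted topology

def topology_frequencies(t_ab, t_abc, n_loci, seed=0):
    rng = random.Random(seed)
    counts = Counter(simulate_gene_tree(t_ab, t_abc, rng) for _ in range(n_loci))
    return {tree: c / n_loci for tree, c in counts.items()}

def fs(*items):
    return frozenset(items)

MATCHING = fs(fs(fs("A", "B"), "C"), "D")
SYMMETRIC = fs(fs("A", "B"), fs("C", "D"))

freqs = topology_frequencies(0.05, 0.05, 20000)
print("matching :", freqs.get(MATCHING, 0.0))
print("symmetric:", freqs.get(SYMMETRIC, 0.0))
```

With both internal branches this short, the symmetric topology is sampled more often than the topology matching the species tree, which is the behavior the definition of an anomalous gene tree describes.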
Is species tree inference consistent in this setting?
1. Concatenation?
2. Consensus?
Species Tree inference—concatenation
Species trees are often estimated by concatenating
several gene sequences and analyzing them as a single alignment (data
from Chen and Li, 2001).
         Gene 1            Gene 2              Gene 3
Human    CTTGAATAATTTTTAC  TAGAGTTTCCTTGTGGTG  CGGTTT
Chimp    CTTCAATAATTTTTAC  TAGAGTTTCCTTGTGGTA  TGGTTT
Gorilla  TTTGAATAATTTTTAC  TAGAGTTTCCTTGTGGTA  TGGTTT
Orang    CTTGAATAATTTTTAT  CAGAGTTTCCTTGTGGTC  CRGTTT
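Concatenation itself is a simple operation: the per-gene alignments are joined end to end for each taxon. The sketch below uses the three gene fragments shown above; the function name and the dictionary layout are illustrative choices, not part of any particular phylogenetics package.

```python
# Per-gene alignments keyed by taxon (sequences copied from the slide).
genes = [
    {"Human": "CTTGAATAATTTTTAC", "Chimp": "CTTCAATAATTTTTAC",
     "Gorilla": "TTTGAATAATTTTTAC", "Orang": "CTTGAATAATTTTTAT"},
    {"Human": "TAGAGTTTCCTTGTGGTG", "Chimp": "TAGAGTTTCCTTGTGGTA",
     "Gorilla": "TAGAGTTTCCTTGTGGTA", "Orang": "CAGAGTTTCCTTGTGGTC"},
    {"Human": "CGGTTT", "Chimp": "TGGTTT",
     "Gorilla": "TGGTTT", "Orang": "CRGTTT"},
]

def concatenate(alignments):
    """Join per-gene alignments into one alignment per taxon."""
    taxa = list(alignments[0])
    for aln in alignments:
        if set(aln) != set(taxa):
            raise ValueError("every gene must be sampled for the same taxa")
    return {taxon: "".join(aln[taxon] for aln in alignments) for taxon in taxa}

concatenated = concatenate(genes)
for taxon, seq in concatenated.items():
    print(f"{taxon:8s}{seq}")
```

After this step a single tree is estimated from the joined alignment, so any differences among the underlying gene trees are no longer visible to the analysis.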
Concatenation and gene tree
discordance
How does concatenation perform when sequences are generated
from different topologies?
[Figure: a species tree with y = 1.0, x = 0.05 generates simulated gene trees; a sequence alignment is simulated on each gene tree (example blocks: CGGTTT/TGGTTA/TGGTTA/TAGTTA, CGATTA/TGATTA/TAATTT/TGAATT, TGCTAT/TGCTAT/TGCTAT/CCCTAT), and the blocks are joined into one concatenated sequence.]
[Figure: trees inferred from concatenated sequences (Kubatko and Degnan, 2007), y = 1.0, x = 0.05; axis: number of genes.]
Is species tree inference consistent in this setting?
1. Concatenation? No.
2. Consensus?
Consensus (majority-rule)
Types of consensus trees
Majority rule: the consensus tree has all clades that were observed in more than 50% of trees.
Greedy: sort clades by their proportions. Accept the most frequently
observed clades one at a time, provided each is compatible with the clades
already accepted. Do this until you have a fully resolved tree.
R*: for each set of 3 taxa, find the most commonly occurring rooted triple, e.g., (AB)C,
(AC)B, or (BC)A. Build the tree from these most commonly occurring triples.
Example: (AB)D and (CD)B are two rooted triples.
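The majority-rule and greedy definitions translate directly into code. The sketch below represents each gene tree as the set of its nontrivial clades, which is an assumed encoding chosen for brevity; ties between equally frequent clades are broken alphabetically, which is also an arbitrary choice of this example.

```python
from collections import Counter

def clades(*names):
    """Helper: build a tree as a set of clades, e.g. clades("AB", "CD")."""
    return {frozenset(name) for name in names}

def clade_proportions(gene_trees):
    counts = Counter(clade for tree in gene_trees for clade in tree)
    return {clade: c / len(gene_trees) for clade, c in counts.items()}

def compatible(a, b):
    """Two clades can share a tree if they are nested or disjoint."""
    return a.isdisjoint(b) or a <= b or b <= a

def majority_rule(gene_trees):
    """Keep every clade observed in more than 50% of the trees."""
    return {c for c, p in clade_proportions(gene_trees).items() if p > 0.5}

def greedy(gene_trees):
    """Accept clades from most to least frequent when compatible."""
    props = clade_proportions(gene_trees)
    accepted = []
    for clade in sorted(props, key=lambda c: (-props[c], sorted(c))):
        if all(compatible(clade, other) for other in accepted):
            accepted.append(clade)
    return set(accepted)

# Six hypothetical gene trees on taxa A, B, C, D.
trees = [
    clades("AB", "CD"), clades("AB", "CD"),   # ((A,B),(C,D)) twice
    clades("AB", "ABC"),                      # (((A,B),C),D)
    clades("AC", "ABC"),                      # (((A,C),B),D)
    clades("CD", "ACD"),                      # (((C,D),A),B)
    clades("BD", "BCD"),                      # (((B,D),C),A)
]
print("majority rule:", majority_rule(trees))
print("greedy       :", greedy(trees))
```

In this made-up sample no clade exceeds 50%, so the majority-rule tree is completely unresolved, while the greedy tree is fully resolved as ((A,B),(C,D)).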
Asymptotic consensus trees
Consensus trees are usually statistics, i.e., functions of the data, like the sample mean x-bar.
Definition: an asymptotic consensus tree is the tree that is obtained
by computing the consensus tree using topology probabilities from the
multispecies coalescent model.
Motivation: if there are a large number of independent loci,
observed gene tree, clade, and rooted triple proportions should
approximate their theoretical probabilities.
[Figures: greedy and R* consensus trees computed from simulated gene trees; labeled regions: "Majority-rule: unresolved zone" and "Too-greedy zone".]
Is species tree inference consistent in this setting?
1. Concatenation? No.
2. Consensus? Yes for R*; no for greedy and majority-rule.
Are consensus trees inconsistent estimators of
species trees?
Theorem 2. (i) Majority-rule asymptotic consensus trees (MACTs) do not have any
clades not on the species tree. (ii) Majority-rule unresolved zones exist for any
species tree topology with n ≥ 3 species.
Theorem 3. Greedy asymptotic consensus trees (GACTs) can be
misleading estimators of species trees for the 4-species asymmetric
tree and for any species tree with n > 4 species.
Theorem 4. R* asymptotic consensus trees (RACTs) always match the
species tree.
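R* works on rooted triples rather than whole clades, and a finite-sample version can be sketched in a few lines. The code below assumes gene trees encoded as sets of nontrivial clades and builds the tree from clusters in which every pair inside is the uniquely most frequent triple resolution against every taxon outside. It enumerates all candidate clusters, which is only practical for a handful of taxa; all names are illustrative.

```python
from collections import Counter
from itertools import combinations

def triple_in_tree(tree, a, b, c):
    """Return the sister pair of {a, b, c} displayed by the tree, or None."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        if any(x in cl and y in cl and z not in cl for cl in tree):
            return frozenset((x, y))
    return None

def favored_triples(gene_trees, taxa):
    """Map each set of 3 taxa to its uniquely most common sister pair."""
    favored = {}
    for trio in combinations(sorted(taxa), 3):
        counts = Counter(triple_in_tree(t, *trio) for t in gene_trees)
        counts.pop(None, None)
        ranked = counts.most_common()
        if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            favored[frozenset(trio)] = ranked[0][0]
    return favored

def r_star(gene_trees, taxa):
    """Return the clades of the R* consensus tree (brute force over subsets)."""
    taxa = sorted(taxa)
    favored = favored_triples(gene_trees, taxa)
    result = set()
    for size in range(2, len(taxa)):
        for cand in combinations(taxa, size):
            outside = set(taxa) - set(cand)
            if all(favored.get(frozenset((a, b, c))) == frozenset((a, b))
                   for a, b in combinations(cand, 2) for c in outside):
                result.add(frozenset(cand))
    return result

# Six hypothetical gene trees on taxa A, B, C, D, as sets of clades.
trees = [
    {frozenset("AB"), frozenset("CD")}, {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("ABC")}, {frozenset("AC"), frozenset("ABC")},
    {frozenset("CD"), frozenset("ACD")}, {frozenset("BD"), frozenset("BCD")},
]
print(r_star(trees, "ABCD"))
```

The link to consistency is the three-species result: for any three taxa the rooted triple matching the species tree has probability above 1/3, so with enough loci each favored triple agrees with the species tree.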
What about finite samples?
If you sample 10 loci, you could have:
All 10 match the species tree
9 match the species tree, 1 disagrees
8 match the species tree, 2 disagree, etc.
You can consider gene trees as categories and use multinomial probabilities
for the probability of your sample
$$\Pr[c(n_1,\ldots,n_k) = T] = \sum_{\text{samples}} \frac{n!}{n_1!\cdots n_k!}\, p_1^{n_1}\cdots p_k^{n_k}\, I\big(c(n_1,\ldots,n_k) = T\big)$$
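The multinomial sum can be evaluated exactly when the number of gene tree categories is small. The sketch below takes p_i as the probability of gene tree topology i and n_i as the number of sampled loci showing it, and enumerates every possible sample. The estimator used here is a plain plurality vote over three topologies, chosen as a hypothetical stand-in for the consensus function c; ties are counted as failures.

```python
import math

def multinomial_pmf(counts, probs):
    """Probability of observing the given category counts."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

def compositions(n, k):
    """All tuples of k nonnegative integers that sum to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def prob_estimate_equals(target, n_loci, probs, estimator):
    """Sum multinomial probabilities over samples where estimator == target."""
    return sum(multinomial_pmf(c, probs)
               for c in compositions(n_loci, len(probs))
               if estimator(c) == target)

def plurality(counts):
    """Index of the uniquely most frequent category, or None on a tie."""
    best = max(counts)
    winners = [i for i, c in enumerate(counts) if c == best]
    return winners[0] if len(winners) == 1 else None

# Three-species example with internal branch length T = 1.0 (assumed value).
T = 1.0
probs = (1 - (2 / 3) * math.exp(-T), math.exp(-T) / 3, math.exp(-T) / 3)
for n_loci in (5, 10, 25):
    print(n_loci, prob_estimate_equals(0, n_loci, probs, plurality))
```

Any other function of the counts, such as a consensus method applied to the tabulated gene trees, can be passed in place of `plurality`.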
[Figure: R* consensus, y = 0.4, x = 0.6.]
Conclusion
Coalescent gene tree probabilities can be used to prove or
disprove the statistical consistency of species tree estimators.
[Figure: R* consensus, y = x = 0.1; axes: probability and number of genes.]