A Coalescent-based Method for Population Tree Inference with Haplotypes

Yufeng Wu
Dept. of Computer Science & Engineering
University of Connecticut, USA
Cold Spring Harbor Asia Meeting
Suzhou, China, 2014
Population Tree: Population split history (including order and time); not known
Locus (gene): genomic region
Coalescent genealogical tree: underlying genetic model
H: haplotypes at SNPs
a (A): AAGCCAATTCCGAACAAGA
b (B): ACGCCAATTCCGGACAAGA
c (C): ACGCCTATTCCGGACAAGA
d (D): AAGCCAATTCCGAACCAGA
Haplotypes at the four SNP sites (1234): a: AAAA, b: CAGA, c: CTGA, d: AAAC
[Figure: population tree with populations A, B, C, D along a time axis; gene lineages a, b, c, d traced within it, with coalescence and mutation events marked]
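The four-letter SNP haplotypes (AAAA, CAGA, CTGA, AAAC) are the variable columns of the four full-length sequences. A minimal sketch of that reduction follows; the function names `segregating_sites` and `snp_haplotypes` are illustrative and not from the talk.

```python
# Full-length haplotypes from the slide, one per sampled lineage.
haplotypes = {
    "a": "AAGCCAATTCCGAACAAGA",
    "b": "ACGCCAATTCCGGACAAGA",
    "c": "ACGCCTATTCCGGACAAGA",
    "d": "AAGCCAATTCCGAACCAGA",
}

def segregating_sites(seqs):
    """Return 0-based positions where the aligned sequences are not all identical."""
    columns = zip(*seqs.values())
    return [i for i, col in enumerate(columns) if len(set(col)) > 1]

def snp_haplotypes(seqs):
    """Keep only the segregating columns of each sequence."""
    sites = segregating_sites(seqs)
    return {name: "".join(seq[i] for i in sites) for name, seq in seqs.items()}

print(segregating_sites(haplotypes))
print(snp_haplotypes(haplotypes))
```

The sketch assumes the sequences are already aligned and of equal length.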
SNP vs. Haplotype
Population tree inference: given haplotypes H from multiple loci, infer the population tree T.
MLE of T: find T maximizing P(H|T).
P(H|T): probability of H given T under coalescent models.
Challenge: P(H|T) is difficult to compute even for a single population.
Common simplification: treat haplotypes as unlinked variants (SNPs), so that P(H|T) ≈ P(S1|T)P(S2|T)P(S3|T)…, where Si is the ith SNP of H. See, e.g., SNAPP (Bryant et al., MBE, 2012) and TreeMix (Pickrell and Pritchard, PLoS Genet, 2012).
Single SNPs: potential loss of information in haplotypes.
This talk: likelihood-based population tree inference from haplotypes.
Assumptions: (1) no intra-locus recombination and (2) infinite sites model of mutations.
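Fact 1: haplotypes H imply a unique (non-bifurcating) genealogical tree called the perfect phylogeny TH.
Haplotypes at SNP sites 1234: a: AAAA, b: CAGA, c: CTGA, d: AAAC
Fact 2: under the infinite sites model, P(H|T) = P(TH|T).
Unfortunately, computing P(TH|T) is still non-trivial.
[Figure: perfect phylogeny TH rooted at AAAA, with mutations 1, 2, 3 and 4 marked on its branches and leaves a, c, b, d]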
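A minimal sketch of how the perfect phylogeny can be read off the SNP haplotypes, assuming the ancestral sequence is the root AAAA shown on the slide. Each site defines the set of lineages carrying the derived allele; under the infinite sites model any two such sets must be nested or disjoint, and the nested sets give the tree. The function names are illustrative, and this is a plain quadratic construction rather than anything taken from the talk.

```python
from itertools import combinations

snps = {"a": "AAAA", "b": "CAGA", "c": "CTGA", "d": "AAAC"}
ancestral = "AAAA"  # root sequence from the slide (assumed ancestral at every site)

def derived_clusters(haps, root):
    """Map each site (1-based) to the set of lineages carrying the derived allele."""
    return {
        i + 1: frozenset(name for name, h in haps.items() if h[i] != root[i])
        for i in range(len(root))
    }

def is_perfect_phylogeny(clusters):
    """Rooted compatibility test: every pair of clusters is nested or disjoint."""
    return all(
        x <= y or y <= x or not (x & y)
        for x, y in combinations(set(clusters.values()), 2)
    )

def newick(taxa, clusters):
    """Build the tree topology as a Newick-like string with sorted children."""
    taxa = frozenset(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    # Clusters strictly inside this node that contain more than one lineage.
    inner = [c for c in set(clusters) if 1 < len(c) < len(taxa) and c <= taxa]
    maximal = [c for c in inner if not any(c < d for d in inner)]
    covered = frozenset().union(*maximal)
    children = [newick(c, inner) for c in maximal] + sorted(taxa - covered)
    return "(" + ",".join(sorted(children)) + ")"

clusters = derived_clusters(snps, ancestral)
print(is_perfect_phylogeny(clusters))
print(newick(snps.keys(), clusters.values()))
```

For the slide's data the root has three children (a, d and the clade of b and c), which is why the perfect phylogeny is non-bifurcating.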
Simplification of Likelihood
G’: genealogical topology implied by haplotypes H
Ignore mutations on genealogy G.
Key Assumption: P(G|T) ≈ P(G’|T)
Inference of population tree T: maximizing P(G’1|T)P(G’2|T)P(G’3|T)…
G’i: gene genealogical topology of the ith locus
From here on, G refers to a genealogical topology.
[Figure: genealogy G with mutations 1, 2, 3 and 4 on its branches, and the topology G’ left after ignoring mutations; both have leaves a, c, b, d]
Genealogical topology G and population tree T
[Figure: gene tree with lineages a, b, c, d drawn inside the population tree with populations A, B, C, D]
Gene lineages b and c coalesce first ⇒ populations B and C are likely to be more closely related.
But not always…
Incomplete lineage sorting: gene tree topology is stochastic.
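To make the stochasticity concrete, here is a small sketch for the simplest case only: three populations with one sampled lineage each, and population tree ((B,C):t,A) with internal branch length t in coalescent units. In that case standard coalescent theory gives probability 1 − (2/3)e^(−t) for the matching gene tree and (1/3)e^(−t) for each of the other two. This is not the STELLS algorithm, which handles general trees; the function names, the grid search and the topology counts are all made up for illustration.

```python
import math

def gene_tree_probs(t):
    """Topology probabilities for population tree ((B,C):t,A) with one
    sampled lineage per population; t is in coalescent units."""
    discordant = math.exp(-t) / 3.0           # each mismatching topology
    return {
        "((b,c),a)": 1.0 - 2.0 * discordant,  # matches the population tree
        "((a,b),c)": discordant,
        "((a,c),b)": discordant,
    }

def log_likelihood(topology_counts, t):
    """Log of the product over loci of P(G_i | T), loci assumed independent."""
    probs = gene_tree_probs(t)
    return sum(n * math.log(probs[g]) for g, n in topology_counts.items())

def best_branch_length(topology_counts, grid):
    """Grid search standing in for a real numerical optimizer."""
    return max(grid, key=lambda t: log_likelihood(topology_counts, t))

# Made-up counts of gene tree topologies over 100 loci (illustration only).
counts = {"((b,c),a)": 60, "((a,b),c)": 20, "((a,c),b)": 20}
grid = [i / 100 for i in range(1, 301)]
t_hat = best_branch_length(counts, grid)
print(t_hat)
```

With a short internal branch the matching topology is barely more likely than the other two, so a single locus says little about the population tree, while many loci together can still pin down the branch length.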
STELLSH: infer population trees from haplotypes
For population tree T and a gene tree topology G:
Gene tree probability P(G|T): probability of observing gene tree topology G for population tree T under coalescent theory.
When G is bifurcating, P(G|T) can be computed (relatively) efficiently by the STELLS algorithm (Wu, Evolution, 2012) and used in inference.
Issue: the perfect phylogeny from haplotypes is usually non-bifurcating.
Gene tree probability for a non-bifurcating topology: sum over all compatible bifurcating topologies. Can be computed more efficiently: Wu, manuscript, 2014.
STELLSH: maximize the probability of all gene topologies by optimizing the topology and branch lengths of the population tree (e.g. by nearest neighbor interchange).
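The sum over compatible bifurcating topologies can be sketched by brute force: enumerate every way of resolving each multifurcating node into binary splits and add up the probabilities. The code below is only that naive enumeration, not the more efficient method of the 2014 manuscript. Trees are nested tuples with string leaves, and `bifurcating_prob` is a hypothetical callable standing in for a gene tree probability routine such as STELLS.

```python
from itertools import combinations, product

def binary_trees(items):
    """Yield every rooted bifurcating tree whose leaves are `items`."""
    if len(items) == 1:
        yield items[0]
        return
    first, rest = items[0], items[1:]
    # Keep `first` on the left so each unordered split is generated once.
    for k in range(len(rest)):
        for picked in combinations(range(len(rest)), k):
            left = [first] + [rest[i] for i in picked]
            right = [rest[i] for i in range(len(rest)) if i not in picked]
            for l in binary_trees(left):
                for r in binary_trees(right):
                    yield (l, r)

def resolutions(tree):
    """Yield all bifurcating refinements of a rooted tree (nested tuples)."""
    if isinstance(tree, str):
        yield tree
        return
    for resolved in product(*(list(resolutions(child)) for child in tree)):
        yield from binary_trees(list(resolved))

def multifurcating_prob(tree, bifurcating_prob):
    """Sum a user-supplied bifurcating-tree probability over all refinements."""
    return sum(bifurcating_prob(g) for g in resolutions(tree))

# Perfect phylogeny from the example haplotypes: root with children a, d, (b,c).
for g in resolutions(("a", "d", ("b", "c"))):
    print(g)
```

A node with k children has (2k−3)!! resolutions, so the count grows quickly, which is the motivation for a more efficient computation.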
Simulation
Population tree: same tree topologies.
Haplotypes: use Hudson’s ms (supports island model)
• Multiple alleles per population per gene
• Various population tree heights (0.1, 0.5 and 1.0 coalescent units)
• Number of loci: 10, 50, 100, 200, 500
Inference
STELLSH: infer population tree from haplotypes
Evaluation
Topological error of inferred population trees
[Figure: inference error plotted against number of loci]
Assume: no migration; no intra-locus recombination. Accuracy: higher with more loci.
Moderate migration or recombination: accurate inference.
Strong migration or high recombination: less accurate.
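The slides do not say which measure of topological error was used. One common choice, assumed here purely for illustration, is the Robinson-Foulds distance on rooted trees: the number of clades present in one tree but not the other. Trees are nested tuples with string leaves, and the two example trees are invented.

```python
def clusters(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found

def rf_distance(tree1, tree2):
    """Rooted Robinson-Foulds distance: size of the symmetric clade difference."""
    return len(clusters(tree1) ^ clusters(tree2))

# Invented example: a simulated tree and an inferred tree that swaps B and C.
true_tree = ((("A", "B"), "C"), "D")
inferred_tree = ((("A", "C"), "B"), "D")
print(rf_distance(true_tree, inferred_tree))
```

A distance of zero means the inferred topology matches the simulated one exactly; averaging the distance over replicates gives an error curve like the one plotted against the number of loci.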
Compare with TreeMix
Simulation data
STELLSH (solid lines): up to 4 alleles per population
TreeMix (dashed lines): up to 100 alleles per population
STELLSH: more accurate than TreeMix, even when TreeMix uses 25 times more data.
Also analyzed part of the 1000 Genomes Project data to infer population trees for 10 populations: CHB, JPT, CHS, CEU, TSI, FIN, GBR, IBS, YRI and LWK.
Conclusion:
• Haplotypes: can be more informative than individual SNPs
• Simplifying the likelihood function may lead to faster algorithms for inference.
Paper: “A Coalescent-based Method for Population Tree Inference with Haplotypes”, Yufeng Wu, submitted for publication, 2014.
Paper: “Coalescent-based Species Tree Inference from Gene Tree Topologies Under Incomplete Lineage Sorting by Maximum Likelihood”, Yufeng Wu, Evolution, v. 66 (3), p. 763-775, 2012.
Research supported by National Science Foundation under grants IIS-0803440 and CCF-1116175.