Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress

Yufeng Wu
Dept. of Computer Science and Engineering
University of Connecticut, USA
Dagstuhl Seminar, 2010
Recombination
• One of the principal genetic forces shaping
sequence variation within species
• Two equal-length sequences generate a third,
new equal-length sequence in the genealogy
• Spatial order is important: different parts of the genome are inherited
from different ancestors.
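[Figure: parental sequences 110001111111001 and 000110000001111 and the recombinant 110000000001111, which joins a prefix of the first parent to a suffix of the second at the breakpoint.]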
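The prefix/suffix picture can be sketched in a few lines of Python. This is only an illustration: the function name `recombine` and the parameter `cut` are hypothetical, and the breakpoint position (after site 5) is inferred from the three strings on the slide rather than stated there.

```python
def recombine(prefix_parent: str, suffix_parent: str, cut: int) -> str:
    """Single-crossover recombinant: the first `cut` sites come from one
    parent, the remaining sites from the other (positions are not shifted)."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= cut <= len(prefix_parent):
        raise ValueError("cut position out of range")
    return prefix_parent[:cut] + suffix_parent[cut:]


# Sequences from the slide; cut=5 is an inferred value, not given in the source.
child = recombine("110001111111001", "000110000001111", 5)
print(child)
```

The recombinant has the same length as its parents, which is why the spatial order of sites matters: each position is inherited from exactly one of the two ancestors.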
Ancestral Recombination Graph (ARG)
[Figure: example genealogies rooted at sequence 00, with mutation and recombination events marked. One panel derives S1 = 00, S2 = 01, S3 = 10, S4 = 10; the other derives S1 = 00, S2 = 01, S3 = 10, S4 = 11.]
Assumption: At most one mutation per site
Network model: beyond the tree model
Reconstruction of Network-based Evolutionary History
Different formulations
Input: DNA sequences (haplotypes) or phylogenetic trees
Biology: meiotic recombination in populations, or reticulate evolutionary processes such as horizontal gene transfer or hybrid speciation
Same objective
Reconstruct the network-based evolutionary history (and related problems)
• Efficiency
• Accuracy
Reconstructing ARGs by Parsimony
[Figure: Kreitman’s data for the Adh locus of D. melanogaster (1983)]
• Input: a set of binary sequences M
• Goal: reconstruct ARGs deriving M
• Parsimony formulation
– minARG: minimize the number of recombination events
– NP-complete (Wang et al.)
The minARG Problem
Structurally constrained ARGs, e.g. galled trees (Wang et al.; Gusfield et al.)
• Simplified ARG topology
Heuristic methods, e.g. the program MARGARITA (Durbin et al.); Song et al.; Parida et al.
Exact minARG by branch and bound (Lyngso, Song and Hein)
Uniform sampling of minARGs, treating each minARG as equally likely (Wu)
Estimating the range of minARGs: lower and upper bounds
minARG for Kreitman’s data
Rmin(M): minimum number of recombinations for M.
L(M): lower bound on Rmin(M)
U(M): upper bound on Rmin(M)
Several lower bounds give L(M) = 7.
U(M) = 7 for Kreitman’s data (Song, Wu and Gusfield).
Thus, Rmin(M) = 7.
Challenge: accurate inference of ARGs
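To make the idea of a lower bound L(M) concrete, here is a minimal sketch of one classic bound, the Hudson-Kaplan (HK) bound. The slides do not say which lower bounds were used, so choosing HK is an assumption, and the names `incompatible` and `hk_lower_bound` are hypothetical. Every pair of sites showing all four gametes forces at least one recombination between them, so the largest number of such intervals that share no inter-site gap is a lower bound on Rmin(M).

```python
from itertools import combinations

ALL_GAMETES = {("0", "0"), ("0", "1"), ("1", "0"), ("1", "1")}


def incompatible(rows, p, q):
    """Four-gamete test: True if sites p and q show 00, 01, 10 and 11."""
    return {(r[p], r[q]) for r in rows} == ALL_GAMETES


def hk_lower_bound(rows):
    """Hudson-Kaplan bound: maximum number of incompatible site pairs
    (p, q) whose intervals do not share any gap between adjacent sites."""
    m = len(rows[0])
    intervals = [(p, q) for p, q in combinations(range(m), 2)
                 if incompatible(rows, p, q)]
    intervals.sort(key=lambda iv: iv[1])  # greedy by right endpoint
    count, last_right = 0, 0
    for left, right in intervals:
        if left >= last_right:  # touching at a site is still disjoint
            count += 1
            last_right = right
    return count


print(hk_lower_bound(["000", "100", "001", "101", "011"]))
```

Kreitman’s data are not reproduced in this transcript, so the sketch is run on a toy matrix only. Stronger bounds exist; HK is shown because it is the simplest to state.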
ARG Induces Local Trees
Local trees: evolutionary history at a genomic position.
Trace backwards in time. At a recombination node, pick the branch passing alleles to the recombinant at this location.
[Figure: an ARG over 4-site binary sequences with mutation and recombination events marked, alongside the local tree near site 3.]
Local Trees Change Across the Genome
Local trees change when moving across recombination breakpoints.
Spatial property: nearby local trees tend to be more similar.
[Figure: an ARG over 4-site binary sequences, alongside the local tree near site 2.]
How good are the inferred ARGs? Compare the inferred local tree topologies with the simulated trees.
Inferring Local Trees
Problem: given binary sequences, infer local tree topologies (one tree for each site, ignoring branch lengths)
Key: local trees have different topologies due to recombination
Trees or network? Do not reconstruct the full network; local trees are very informative
Parsimony-based approaches
• Hein (1990, 1993), Song and Hein (2005)
• Wu (2010): shared topological features in nearby trees
Accuracy: Robinson-Foulds distances between the inferred trees and the simulated trees
Challenge: how to improve the accuracy?
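The accuracy measure can be sketched as follows. This is an assumption-laden illustration: it uses the rooted, clade-based variant of the Robinson-Foulds distance, represents trees as nested tuples of leaf labels, and the names `clades` and `rf_distance` are hypothetical. The slides do not specify the variant or the data structure.

```python
def clades(tree):
    """Return (leaf set, set of non-trivial clades) for a rooted tree
    written as nested tuples, e.g. ((1, 2), (3, 4))."""
    found = set()

    def visit(node):
        if not isinstance(node, tuple):      # a leaf label
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = visit(tree)
    found.discard(all_leaves)                # the root clade is trivial
    return all_leaves, found


def rf_distance(tree1, tree2):
    """Number of clades present in exactly one of the two trees."""
    leaves1, c1 = clades(tree1)
    leaves2, c2 = clades(tree2)
    if leaves1 != leaves2:
        raise ValueError("trees must share the same leaf set")
    return len(c1 ^ c2)


inferred = ((1, 2), (3, 4))
simulated = (((1, 2), 3), 4)
print(rf_distance(inferred, simulated))
```

Because a nonbinary tree simply has fewer clades, the same function also scores partially resolved trees.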
RENT: REfining Neighboring Trees
• Maintain for each SNP site a (possibly nonbinary) tree topology
– Initialize to a tree containing the split induced by the SNP
• Gradually refine the trees by adding new splits
– Splits are found by a set of rules (later)
– Splits added early may be more reliable
• Stop when the trees are binary or enough information has been recovered
A Little Background: Compatibility
M   A B C
a   0 0 0
b   1 0 0
c   0 0 1
d   1 0 1
e   0 1 1
Sites A and B are compatible, but A and C are incompatible.
• Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
• Easily extended to splits.
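The compatibility definition translates directly into a check on pairs of columns. The sketch below uses the matrix M from this slide; the helper names `column` and `compatible` are hypothetical.

```python
ALL_GAMETES = {("0", "0"), ("0", "1"), ("1", "0"), ("1", "1")}


def column(rows, site):
    """Extract one site (column) from a list of equal-length sequences."""
    return [row[site] for row in rows]


def compatible(col_p, col_q):
    """Two sites are compatible unless all four gametes 00, 01, 10, 11
    occur among their row-wise pairs (the four-gamete test)."""
    return set(zip(col_p, col_q)) != ALL_GAMETES


# Matrix M from the slide: rows a..e, sites A, B, C.
M = {"a": "000", "b": "100", "c": "001", "d": "101", "e": "011"}
rows = list(M.values())
A, B, C = (column(rows, i) for i in range(3))

print(compatible(A, B), compatible(A, C))
```

Since a split is just a 0/1 labelling of the sequences, the same function applies to splits by passing the two labellings as columns.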
Fully-Compatible Region: Simple Case
• A region of consecutive SNP sites in which all SNPs are pairwise compatible.
– May indicate that no topology-altering recombination occurred within the region
• Rule: for site s, add the split induced by any SNP in such a region containing s to the tree at s.
– Compatibility is a very strong property and unlikely to arise by chance.
[Figure: three sites A, B, C]
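One way to locate fully-compatible regions is to list every maximal run of consecutive, pairwise compatible sites. The sketch below is an illustration under stated assumptions, not the RENT implementation: the function names are hypothetical and the input is assumed to be binary strings of equal length.

```python
def compatible(rows, p, q):
    """Four-gamete test on binary data: compatible unless all of
    00, 01, 10, 11 appear at sites p and q."""
    return len({(r[p], r[q]) for r in rows}) < 4


def compatible_regions(rows):
    """Maximal intervals (start, end), inclusive, of consecutive sites
    that are pairwise compatible."""
    m = len(rows[0])
    regions = []
    for start in range(m):
        end = start
        while end + 1 < m and all(compatible(rows, p, end + 1)
                                  for p in range(start, end + 1)):
            end += 1
        # Keep the interval only if it is not nested in the previous one.
        if not regions or end > regions[-1][1]:
            regions.append((start, end))
    return regions


def regions_containing(rows, s):
    """Fully-compatible regions whose SNP splits may be added at site s."""
    return [(a, b) for a, b in compatible_regions(rows) if a <= s <= b]


rows = ["000", "100", "001", "101", "011"]
print(compatible_regions(rows))
```

A site can belong to more than one maximal region, as the middle site does in this toy matrix, so the rule may contribute splits from both sides.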
Split Propagation: More General Rule
• Three consecutive sites A, B and C. Sites A and B are incompatible. Does site C matter for the tree at site A?
– The trees at sites A and B are different.
– Suppose site C is compatible with both sites A and B. Then?
– Site C may indicate a subtree shared by the trees at sites A and B.
• Rule: a split propagates in both directions until reaching an incompatible tree.
[Figure: three sites A, B, C]
Reticulate Networks
Gene trees: phylogenetic trees from gene sequences
- Assume: binary and rooted
- Different topologies at different genes
[Figure: sequences of taxa 1 to 4 at two genes. Gene A: 1: 000, 2: 001, 3: 110, 4: 100. Gene B: 1: 000, 2: 101, 3: 010, 4: 001. The corresponding gene trees T and T’ are rooted at ρ.]
Reticulate evolution: one explanation
- Hybrid speciation, horizontal gene transfer
Reticulate network: a directed acyclic graph displaying each of the gene trees
Hybridization event: nodes with in-degree two or more
[Figure: a reticulate network on taxa 1 to 4; keeping the two red edges displays one gene tree, and keeping the two black edges displays the other.]
The Minimum Reticulation Problem
Given: a set of K gene trees G.
Problem: reconstruct reticulate networks with Rmin(G), the minimum number of reticulation events, displaying each gene tree.
NP-complete, even for K = 2
[Figure: three gene trees T1, T2, T3 on taxa 1 to 4 and a reticulate network N displaying them.]
Current approaches:
• Exact methods for the K = 2 case (see Semple et al.)
• Impose topological constraints (e.g. galled networks, see Huson et al.)
Challenge: efficient and accurate reconstruction of reticulate networks for multiple trees.
Close lower and upper bounds for an arbitrary number of trees (Wu, 2010)
Performance of PIRN: Optimal Solution
[Figure: horizontal axis: number of taxa; vertical axis: % of data with LB = UB; K: number of trees; r: level of reticulation]
• Lower and upper bounds often match for many data sets
Performance of PIRN: Gap of Bounds
[Figure: horizontal axis: number of taxa; vertical axis: gap between lower and upper bounds; K: number of trees; r: level of reticulation]
• The gap between the lower and upper bounds is often small for many data sets
Reticulate Network for Five Poaceae Trees
[Figure: five gene trees: ndhF, phyB, rbcL, rpoC2, ITS]
Lower bound: 11
Upper bound: 13
[Figure: the reticulate network, which uses 13 reticulation events (the upper bound)]
Acknowledgement
• More information available at: http://www.engr.uconn.edu/~ywu
• Research supported by the National Science Foundation and the UConn Research Foundation
Coalescent with Recombination
Coalescent theory: defines a probability distribution over genealogies
Likelihood computation for the coalescent with recombination
• Likelihood: sum, over all ARGs, of the probability of each ARG under given parameters
• Challenging: too many ARGs (Lyngso, Song and Hein)
Importance sampling approach: draw samples (ARGs) with respect to some probability distribution
• Works well with no recombination
• Does not work well with recombination
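In symbols, the likelihood and its importance-sampling estimate can be written roughly as below. The notation is an assumption introduced here for illustration: D is the sequence data, G ranges over ARGs, θ stands for the model parameters (for example mutation and recombination rates), and Q is the proposal distribution from which the N sampled ARGs are drawn.

```latex
L(\theta) \;=\; P(D \mid \theta)
          \;=\; \sum_{G} P(D \mid G, \theta)\, P(G \mid \theta)
          \;\approx\; \frac{1}{N} \sum_{i=1}^{N}
              \frac{P(D \mid G_i, \theta)\, P(G_i \mid \theta)}{Q(G_i)},
          \qquad G_i \sim Q .
```

The estimate is only as good as the proposal: if Q rarely produces the ARGs that dominate the sum, a few samples carry almost all the weight, which is one way to read the remark that the approach does not work well with recombination.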
Coalescent-based ARG Sampling
[Figure labels: minARG; probability of ARGs under certain parameters]
Uniform sampling of minARGs (Wu, 2007)
• Treat each minARG as equally likely.
• Algorithm for generating a minARG uniformly at random (exponential time for setup, but polynomial time per sample)
Challenge: develop a more general ARG sampling method that can efficiently sample ARGs approximately according to coalescent probabilities.
A related problem: compute the coalescent likelihood with recombination efficiently.
Recent work: exact computation of the coalescent likelihood under the infinite sites model with no recombination (Wu, 2009)
The Mosaic Model
Assumption: input sequences are descendants of K founder sequences (unknown)
Extant sequences: concatenations of exact copies of founder segments (no shift of position)
• Coloring: assign to each position of a sequence the founder (color) it comes from; the coloring must be consistent
[Figure: input sequences M with K = 2 founders: 0000, 0101, 0111, 1111, 1110; breakpoints marked, 5 in total.]
The Minimum Mosaic Problem
[Figure: inferred founders for data from Rastas and Ukkonen: 20 sequences, 40 sites; the minimum number of breakpoints is 55.]
• Problem: given a set of binary sequences and the number of founders K, find a K-coloring of these sequences that minimizes the number of color changes (recombination breakpoints)
• Also find the K founder sequences (not part of the input)
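The coloring half of the problem is easy once the founders are fixed, which a short dynamic program can illustrate. This sketch covers only that subproblem: the real problem must also search over founder sequences, and it is not the exact method of any of the cited papers. The function name `min_breakpoints` and the founders used below are hypothetical.

```python
def min_breakpoints(sequence, founders):
    """Fewest color changes needed to write `sequence` as a concatenation
    of same-position founder segments; None if no coloring exists."""
    INF = float("inf")
    # cost[k]: best number of changes so far if the current site uses founder k
    cost = [0 if f[0] == sequence[0] else INF for f in founders]
    for pos in range(1, len(sequence)):
        best_prev = min(cost)
        cost = [
            min(cost[k], best_prev + 1) if f[pos] == sequence[pos] else INF
            for k, f in enumerate(founders)
        ]
    best = min(cost)
    return None if best == INF else best


founders = ["0000", "1111"]          # hypothetical founders, K = 2
data = ["0000", "0101", "0111", "1111", "1110"]
per_sequence = [min_breakpoints(s, founders) for s in data]
print(per_sequence, sum(per_sequence))
```

With these assumed founders, the five 4-site strings that appear on the Mosaic Model slide need 5 breakpoints in total. The run time is O(nK) per sequence of length n, so the difficulty of the full problem lies in choosing the founders.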
The Minimum Mosaic Problem
• Introduced by Ukkonen (2002)
• Simple and easy to visualize
• Main known results
– An exponential-time algorithm that runs in polynomial time for K = 2 (Ukkonen 2002)
– An exact method that works for relatively small K and modest-sized data (Wu and Gusfield, 2007)
– Haplovisual program and other extensions by Rastas and Ukkonen (2007)
– Heuristic algorithm by Roli and Blum (2009)
– Lower bounds for the minimum number of breakpoints needed (Wu, 2010)
• Challenges
– Polynomial-time algorithm for K ≥ 3?
– Concrete applications in biology?