Phylogenetic Networks with Constrained Recombination

Download Report

Transcript Phylogenetic Networks with Constrained Recombination

Optimal Phylogenetic Networks
with Constrained and
Unconstrained Recombination
Dan Gusfield
UC Davis
Different parts of this work are joint with Satish Eddhu, Charles
Langley, Dean Hickerson, Yufeng Wu.
Reconstructing the Evolution of
Binary Bio-Sequences
• Perfect Phylogeny (tree) model
• Phylogenetic Networks (DAG) with
recombination
• Phylogenetic Networks with disjoint cycles:
Galled-Trees
• Phylogenetic Networks with unconstrained cycles:
Blobbed-Trees
• Combinatorial Structure and Efficient Algorithms
• Efficiently Computed Lower Bounds on the
number of recombinations needed
Geneological or Phylogenetic
Networks
• The major biological motivation comes from
genetics and attempts to reconstruct the history of
recombination in populations.
• Also relates to phylogenetic-based haplotyping.
• Some of the algorithmic and mathematical results
also have phylogenetic applications, for example
in hybrid speciation, lateral gene transfer.
The Perfect Phylogeny Model for
binary sequences
sites 12345
Ancestral sequence 00000
1
4
Site mutations on edges
3
The tree derives the set M:
2
10100
10100
5
10000
01011
01010
00010
10000
00010
01010
01011
Extant sequences at the leaves
When can a set of sequences be
derived on a perfect phylogeny
with the all-0 root?
Classic NASC: Arrange the sequences in a
matrix. Then (with no duplicate columns),
the sequences can be generated on a unique
perfect phylogeny if and only if no two
columns (sites) contain all four pairs:
0,0 and 0,1 and 1,0 and 1,1
This is the 4-Gamete Test
A richer model
10100
10000
01011
01010
00010
10101 added
pair 4, 5 fails the three
gamete-test. The sites 4, 5
``conflict”.
12345
00000
1
4
3
10100
2
00010
5
10000
0101101010
Real sequence histories often involve recombination.
Sequence Recombination
01011
10100
S
P
5
Single crossover recombination
10101
A recombination of P and S at recombination point 5.
The first 4 sites come from P (Prefix) and the sites
from 5 onward come from S (Suffix).
Network with Recombination
10100
10000
01011
01010
00010
10101 new
12345
00000
1
4
3
2
10100
The previous tree with one
recombination event now derives
all the sequences.
P
00010
5
10000
5
10101
S
0101101010
Multiple Crossover
Recombination
4-crossovers
2-crossovers = ``gene conversion”
Elements of a Phylogenetic
Network (single crossover
recombination)
• Directed acyclic graph.
• Integers from 1 to m written on the edges. Each integer
written only once. These represent mutations.
• A choice of ancestral sequence at the root.
• Every non-root node is labeled by a sequence obtained
from its parent(s) and any edge label on the edge into it.
• A node with two edges into it is a ``recombination node”,
with a recombination point r. One parent is P and one is S.
• The network derives the sequences that label the leaves.
A Phylogenetic Network
00010
a:00010
10010
00000
4
3
1
00100
2
b:10010
P
c:00100
S
3
01100
p
d:10100
5
00101
S
4
01101
f:01101
e:01100
g:00101
Which Phylogenetic Networks
are meaningful?
Given M we want a phylogenetic network that
derives M, but which one?
A: A perfect phylogeny (tree) if possible. As little deviation
from a tree, if a tree is not possible. Use as little recombination
or gene-conversion as possible.
Minimizing recombinations
• Any set M of sequences can be generated by a
phylogenetic network with enough recombinations, and
one mutation per site. This is not interesting or useful.
• However, the number of (observable) recombinations is
small in realistic sets of sequences. ``Observable” depends
on n and m relative to the number of recombinations.
• Two algorithmic problems: given a set of sequences M,
find a phylogenetic network generating M, minimizing the
number of recombinations (Hein’s problem). Find a
network generating M that has some biologicallymotivated structural properties.
Minimization is NP-hard
The problem of finding a phylogenetic network that creates a given set of
sequences M, and minimizes the number of recombinations, is NPhard. (Wang et al 2000) (Semple 2004)
Wang et al. explored the problem of finding a phylogenetic network where
the recombination cycles are required to be node disjoint, if possible.
They gave a sufficient but not a necessary condition to recognize cases
when this is possible. O(nm + n^4) time.
Recombination Cycles
• In a Phylogenetic Network, with a
recombination node x, if we trace two paths
backwards from x, then the paths will
eventually meet.
• The cycle specified by those two paths is
called a ``recombination cycle”.
Galled-Trees
A recombination cycle in a phylogenetic
network is called a “gall” if it shares no
node with any other recombination cycle.
A phylogenetic network is called a “galledtree” if every recombination cycle is a gall.
A galled-tree generating
the sequences generated
by the prior network.
4
3
1
p s
a: 00010
3
b: 10010
c: 00100
d: 10100
2
p
e: 01100
5
4
s
f: 01101
g: 00101
Sales pitch for Galled-Trees
Galled-trees represent a small deviation from true trees.
There are sufficient applications where it is plausible that a galled tree
exists that generates the sequences.
Observable recombinations tend to be recent; block structure of human
DNA; recombination is sparse, so the true history of observable
recombinations may be a galled-tree.
The number of recombinations is never more than m/2. Moreover,
when M can be derived on a galled-tree, the number of recombinations used
is the minimum number over any phylogenetic network, even if multiple
cross-overs at a recombination event are counted as a single recombination.
A galled-tree for M is ``almost unique” - implications for reconstructing the
correct history.
Old (Aug. 2003) Results
• O(nm + n^3)-time algorithm to determine whether
or not M can be derived on a galled-tree with all-0
ancestral sequence.
• Proof that the galled-tree produced by the
algorithm is a “nearly-unique” solution.
• Proof that the galled-tree (if one exists) produced
by the algorithm minimizes the number of
recombinations used, over all phylogeneticnetworks with all-0 ancestral sequence.
New work
We derive the galled-tree results in a more
general setting that addresses unconstrained
recombination cycles and multiple
crossover recombination. This also solves the
problem of finding the ``most tree-like”
network when a perfect phylogeny is not
possible. In this algorithm, no ancestral
sequence is known in advance.
Blobbed-trees: generalizing
galled-trees
• In a phylogenetic network a maximal set of
intersecting cycles is called a blob.
• Contracting each blob results in a directed, rooted
tree, otherwise one of the “blobs” was not
maximal.
• So every phylogenetic network can be viewed as a
directed tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.
Every network is a
tree of blobs.
How do the tree parts
and the blobs relate?
How can we exploit
this relationship?
Ugly tangled
network inside
the blob.
Incompatible Sites
A pair of sites (columns) of M that fail the
4-gametes test are said to be incompatible.
A site that is not in such a pair is compatible.
M
12345
a 00010
b 10010
c 00100
d 10100
e 01100
f 01101
g 00101
Incompatibility Graph
4
1
3
2
5
Two nodes are connected iff the pair
of sites are incompatible, i.e, fail the
4-gamete test.
THE MAIN TOOL: We represent the pairwise
incompatibilities in a incompatibility graph.
Simple Fact
If sites two sites i and j are incompatible,
then the sites must be together on some
recombination cycle whose recombination
point is between the two sites i and j.
(This is a general fact for all phylogenetic
networks.)
Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4;
as are 2, 5.
A Phylogenetic Network
00010
a:00010
10010
00000
4
3
1
00100
2
b:10010
P
c:00100
S
3
01100
p
d:10100
5
00101
S
4
01101
f:01101
e:01100
g:00101
Simple Consequence of the
simple fact
All sites on the same (non-trivial) connected
component of the incompatibility graph
must be on the same blob in any blobbed-tree.
Follows by transitivity.
So we can’t subdivide a blob into a tree-like
structure if it only contains sites from a single
connected component of the incompatibility
graph.
Key Result about Galls: For
galls, the converse of the simple
consequence is also true.
Two sites that are in different (non-trivial) connected
components cannot be placed on the same gall in
any phylogenetic network for M.
Hence, in a galled-tree T for M each gall contains all
and only the sites of one (non-trivial) connected
component of the incompatibility graph. All
compatible sites can be put on edges outside of the
galls.
This was the key to the galled-tree solution.
A galled-tree generating
the sequences generated
by the prior network.
4
Incompatibility Graph
4
3
1
1
3
2
5
p s
a: 00010
2
b: 10010
c: 00100
d: 10100
2
p
e: 01100
5
4
s
f: 01101
g: 00101
Reduced Galled-Trees
• A galled-tree is called reduced if every gall only
contains conflicted sites.
• Theorem: If M can be derived on a galled-tree, it
can be derived on a reduced galled-tree.
• The number of recombination nodes in a reduced
galled-tree equals the number of connected
components of the conflict graph, which is the
minimum number of recombinations possible in
any galled-tree.
The main new result: The
Partition Theorem
For any set of sequences M, there is a blobbed-tree
T(M) that derives M, where each blob contains all
and only the sites in one non-trivial connected
component of the incompatibility graph. The
compatible sites can always be put on edges
outside of any blob. Moreover, the tree part of
T(M) is unique. And it is easy to find the tree part.
This is weaker than the result for galled-trees: it
replaces “must” with “can”.
My proof is direct and self-contained, but the result
may also be derivable from well-established results
about splits, for example, from facts known about
Bunemann graphs. Thanks to Mike Steel for pointing
this out. Can also be derived from earlier work
on splits by Bandelt and Dress.
Algorithmically
• Finding the tree part of the blobbed-tree is easy.
• Determining the sequences labeling the exterior
nodes on any blob is easy.
• Determining a “good” structure inside a blob B is
the problem of generating the sequences of the
exterior nodes of B.
• It is easy to test whether the exterior sequences on
B can be generated with only a single (possibly
multiple-crossover) recombination. The original
galled-tree problem is now just the problem of
testing whether one single-crossover
recombination is sufficient for each blob.
The main open question
• Is there always a blobbed-tree where each blob has
all and only the sites of a single connected
component of the conflict graph, which minimizes
the number of recombinations over all possible
phylogenetic networks for M?
• Proof attempt by splitting blob B into B’ and B”.
The catch is the possibility of a recombination
node in B that is needed in both B’ and B”.
If Yes,
Then any method that computes a lower bound on the number of
needed recombinations should be applied separately on sequences
defined by the sites on each connected component, and the results
added together.
This may be true even if the desired result is not. It is worth
trying to prove this.
How to arrange the sites on a
blob
Given a single connected component of the conflict
graph with k sites, how do we arrange those k sites
to generate the required sequences, using only one
(multiple-crossover) recombination,
or using a multiple-crossover recombination
with the fewest cross-overs?
Arranging the sites
We will describe an O(n^3) time method to arrange
all of the blobs. O(n^2) time is possible with a
more complex method when only single-crossover
recombinations are allowed.
A needed fact in words
Let Q be a gall for the sites on connected-component
C of the conflict graph.
Let M[C] be the matrix M restricted to the sites on
C.
Let LQ[C] be the sequences labeling the nodes of Q,
restricted to the sites on C.
Claim: The two sets of sequences are identical, i.e.,
M[C] = LQ[C].
The idea for arranging the sites of
C on B: via a short movie
4
Q
3
001
010
1
a
101
b
p s
2
110
c, e, f, g
d
4
Q
3
001
010
1
a
101
b
110
d
c, e, f, g
B
4
001
1
a
3
010
101
b: 101
110
c, e, f, g: 010
d
Blob B minus the recombination node is a perfect phylogeny for M[C]
minus the recombinant sequence; all sites are on one or two paths
from the root; and the two end sequences of those paths can
recombine at point r to recreate the recombinant sequence.
The point
If we remove the recombinant node from B,
we have a phylogenetic tree (no cycles) for
the remaining sequences and hence
a perfect phylogenetic tree for the sequences in
M[C] minus the recombinant sequence of B.
The sites in this tree are on one or two paths.
Moreover, the two end sequences on that perfect
phylogeny can recombine to create the removed
recombinant sequence.
The algorithm for arranging a
blob B for C
1.Form the matrix M[C].
2. For each row of M[C], remove the row, see if there
is a perfect phylogeny for the remaining rows.
If yes, see if the sites are in one or two paths, and
the end sequences can generate the removed row
by a recombination.
Fact: Every row that works gives a permitted
Optimality of Galled-Trees
Theorem: (G,H,B,B) The minimum number of recombination nodes
in any phylogenetic network for M is at least the number
of non-trivial connected components of the incompatibility graph.
Hence, when the sequences on each blob on T(M) can be generated
with a single recombination node, the blobbed-tree minimizes
the number of recombination nodes over all phylogenetic
networks and all choices of ancestral sequence. This solves
the root-unknown galled-tree problem in polynomial time.
Code is on the web.
The number of arrangements on a
gall (all-0 ancestral sequence)
Following the algorithmic approach above, one can
prove that the number of arrangements of any gall is
at most three, and this happens only if the gall has two
sites.
If the gall has more than two sites, then the number of
arrangements is at most two.
If the gall has four or more sites, with at least two sites
on each side of the recombination point (not the side of
the gall) then the arrangement is forced and unique.
Practical and Accurate Lower
Bound Computation
• Unconstrained recombination.
• Want efficient computation of lower bounds on the
minimum number of recombinations needed.
Many methods: HK, Haplotype, (history),
Connected Comps, Composite Interval Composite,
Gall-Interval Composite Method, Subsets,
Recmin, Composite ILP. Some are specific for
recombination, and some apply to any reticulation
event.
The grandfather of all lower
bounds - HK 1985
• Arrange the nodes of the incompatibility graph on
the line in order that the sites appear in the
sequence.
• The HK bound is the minimum number of vertical
lines needed to cut every edge in the
incompatibility graph. Weak bound, but widely
used - not only to bound the number of
recombinations, but also to suggest their locations.
HK Lower Bound
1
2
3
4
5
HK Lower Bound = 1
1
2
3
4
5
More general view of HK
Given a set of intervals on the line, and for each interval I, a
number N(I), find the minimum
number of vertical lines so that every interval I intersects at
least N(I) of the vertical lines.
In HK, each incompatibility defines an interval I where N(I) = 1.
Other methods find different intervals and larger N(I) values.
The general problem is easy to solve by a left-to-right myopic
placement of vertical lines: sort the intervals by right end-point;
Process the intervals left to right in that order; when the right
endpoint of an interval I is reached, place there (if needed)
additional vertical so that N(I) lines intersect I.
The general approach is called the Composite Interval Method
(Myers).
New bound 2003
The number of non-trivial connected
components in the incompatibility graph
(Gusfield, Hickerson; Bafna, Bansal) - weak
bound when used alone, but effective when
used with the Composite Interval Method.
CC Bound = 2 in the prior example.
Haplotype Bound (Simon Myers)
• Rh = Number of distinct sequences (rows) Number of distinct sites (columns) -1 (folklore)
• Before computing Rh, remove any site that is
compatible with all other sites. A valid lower
bound results - generally increases the bound.
• Generally really bad bound, often negative, when
used alone, but Very Good when used in the
Composite Interval Method, and Amazing when
used in the Composite Subset Method, and More
than Amazing when used in the Composite ILP
Method.
Composite Interval Method
(Simon Myers)
Compute Rh separately for each of the C(m,2)
intervals of the m sites; Let Rh(I) be the bound for
interval I. This is called a ``local bound”. Compute
the Minimum number of vertical lines so that each
interval I is cut at least Rh(I) times (a trivial
algorithm).
The HK bound is a specialization of this where
Rh(I) is 0 or 1. Also can use the connected
component bound for the local bounds, or take the
best of both in each interval.
Composite Subset Method
• Let S be subset of sites, and Rh(S) be the
haplotype bound for subset S. If the
leftmost site in S is L and the rightmost site
in S is R, then use Rh(S) as a local bound
for interval [S,L].
• Compute Rh(S) on many subsets, and then
use the Composite method to find an overall
bound.
RecMin (Myers)
• World Champion Lower Bound Program
(available) Great Program (Myers). Often
RecMin gives a bound three times as large as HK.
• Uses Composite Subsets with Rh, but limits the
size and the span of the subsets. Default
parameters are s = 6, w = 15 (s = size, w = span).
• Generally, impractical to set s = w = m, so
generally one doesn’t know if increasing the
parameters would increase the bound.
Optimal RecMin Bound (ORB)
• The Optimal RecMin Bound is the lower
bound that RecMin would produce if both
parameters were set to their maximum
values (s = w= m).
• In general, RecMin cannot compute (in
practical time) the ORB.
• Practical computation of the ORB is our
first contribution.
Computing the ORB
• Gross Idea: For each interval I, use ILP to
find a subset of sites that maximizes Rh
over all subsets in interval I. Set N(I) to the
found Rh value, and use these values in the
Composite Method.
• The result is guaranteed to be the ORB. i.e,
the bound one would get by using all 2^m
subsets in RecMin.
Now instead of doing an exponential number of simple
computations (computing Rh for each subset), we have
a quadratic number of (possibly expensive) ILPs to
solve.
Is this a good trade-off in practice? Our experience - yes!
Two critical tricks
1) Formulate the ILP as a ``test set” problem: Find the minimum
number of sites so that each pair of haplotypes differs at one
site at least. Compared to other ILP formulations, this allows large
eliminations of redundant inequalities, and greatly speeds up the
ILP solution.
2) Reduce the number of intervals examined:
I
If the optimal subset in the interval I
L
only spans interval L, then there is no
Need to examine any interval containing
L. Each ILP calls at most 4 other ILPs.
ILP Compsite method efficiency
• Only C(m,2) intervals, but the time for each
is more than for the Hap bound.
• With tricks we need to solve the ILP for
only 0.5% of all the C(m,2) intervals
(empirical result). Shockingly fast in
practice.
Typical Result
With n = 40, m = 100, r = 10 (ms
recombination parameter):
HK gives 4
Composite Intervals with Hap Bound or CC bound gives 6
Recmin Composite Subsets with Hap Bound and subsets of size
up to 20 in windows of size 20: 9 (7 secs)
Composite Subsets with Hap Bound and subsets of size up to
30 in windows of size 30: 10 (takes 10 mins on Powerbook G4)
ILP Composite gets 10 and takes 1 second on Linux Box.
Papers and Software
wwwcsif.cs.ucdavis.edu/~gusfield/