Phylogenetic Networks with Constrained Recombination

Download Report

Transcript Phylogenetic Networks with Constrained Recombination

Reconstructing Ancestral
Recombination Graphs - or
Phylogenetic Networks with
Recombination
Dan Gusfield
UC Davis
Different parts of this work are joint with Satish Eddhu, Charles
Langley, Dean Hickerson, Yun Song, Yufeng Wu.
Reconstructing the Evolution of
Binary Bio-Sequences
• Perfect Phylogeny (tree) model
• Phylogenetic Networks (DAG) with
recombination
• Phylogenetic Networks with disjoint cycles:
Galled-Trees
• Phylogenetic Networks with unconstrained cycles:
Blobbed-Trees
• Combinatorial Structure and Efficient Algorithms
• Efficiently Computed Lower and Upper bounds
on the number of recombinations needed
Geneological or Phylogenetic
Networks
• The major biological motivation comes from
genetics and attempts to reconstruct the history of
recombination in populations.
• Also relates to phylogenetic-based haplotyping.
• Some of the algorithmic and mathematical results
have phylogenetic applications, for example in
hybrid speciation, lateral gene transfer.
Why binary?
Single nucleotide polymorphisms are the key
data that we use. SNPs imply that the sequences
are binary, and also that the order of the sites is
fixed (on a chromosome). This is in contrast to
a set of taxonomic characters, which may be
binary, but where the given order
is arbitrary.
The Perfect Phylogeny Model for
binary sequences
sites 12345
Ancestral sequence 00000
1
4
Site mutations on edges
3
The tree derives the set M:
2
10100
10100
5
10000
01011
01010
00010
10000
00010
01010
01011
Extant sequences at the leaves
The converse problem
Given a set of sequences M we want to find, if possible,
a perfect phylogeny that derives M. Remember that
each site can change state from 0 to 1 only once.
n will denote the number of sequences in M, and m will
denote the length of each sequence in M.
m
M
n
01101001
11100101
10101011
When can a set of sequences be
derived on a perfect phylogeny?
Classic NASC: Arrange the sequences in a
matrix. Then (with no duplicate columns),
the sequences can be generated on a unique
perfect phylogeny if and only if no two
columns (sites) contain all four pairs:
0,0 and 0,1 and 1,0 and 1,1
This is the 4-Gamete Test
A richer model
10100
10000
01011
01010
00010
10101 added
pair 4, 5 fails the three
gamete-test. The sites 4, 5
``conflict”.
12345
00000
1
4
3
10100
2
00010
5
10000
0101101010
Real sequence histories often involve recombination.
Sequence Recombination
01011
10100
S
P
5
Single crossover recombination
10101
A recombination of P and S at recombination point 5.
The first 4 sites come from P (Prefix) and the sites
from 5 onward come from S (Suffix).
Network with Recombination
10100
10000
01011
01010
00010
10101 new
12345
00000
1
4
3
2
10100
The previous tree with one
recombination event now derives
all the sequences.
P
00010
5
10000
5
10101
S
0101101010
Multiple Crossover
Recombination
4-crossovers
2-crossovers = ``gene conversion”
Elements of a Phylogenetic
Network (single crossover
recombination)
• Directed acyclic graph.
• Integers from 1 to m written on the edges. Each integer
written only once. These represent mutations.
• A choice of ancestral sequence at the root.
• Every non-root node is labeled by a sequence obtained
from its parent(s) and any edge label on the edge into it.
• A node with two edges into it is a ``recombination node”,
with a recombination point r. One parent is P and one is S.
• The network derives the sequences that label the leaves.
A Phylogenetic Network
00010
a:00010
10010
00000
4
3
1
00100
2
b:10010
P
c:00100
S
3
01100
p
d:10100
5
00101
S
4
01101
f:01101
e:01100
g:00101
Which Phylogenetic Networks
are meaningful?
Given M we want a phylogenetic network that
derives M, but which one?
A: A perfect phylogeny (tree) if possible. As little deviation
from a tree, if a tree is not possible. Use as little recombination
or gene-conversion as possible.
Minimizing recombinations
• Any set M of sequences can be generated by a
phylogenetic network with enough recombinations, and
one mutation per site. This is not interesting or useful.
• However, the number of (observable) recombinations is
small in realistic sets of sequences. ``Observable” depends
on n and m relative to the number of recombinations.
• Two algorithmic problems: given a set of sequences M,
find a phylogenetic network generating M, minimizing the
number of recombinations (Hein’s problem). Find a
network generating M that has some biologicallymotivated structural properties.
Minimization is NP-hard
The problem of finding a phylogenetic network that creates a given set of
sequences M, and minimizes the number of recombinations, is NPhard. (Wang et al 2000) (Semple 2004)
Wang et al. explored the problem of finding a phylogenetic network where
the recombination cycles are required to be node disjoint, if possible.
They gave a sufficient but not a necessary condition to recognize cases
when this is possible. O(nm + n^4) time.
Recombination Cycles
• In a Phylogenetic Network, with a
recombination node x, if we trace two paths
backwards from x, then the paths will
eventually meet.
• The cycle specified by those two paths is
called a ``recombination cycle”.
Galled-Trees
A recombination cycle in a phylogenetic
network is called a “gall” if it shares no
node with any other recombination cycle.
A phylogenetic network is called a “galledtree” if every recombination cycle is a gall.
A galled-tree generating
the sequences generated
by the prior network.
4
3
1
p s
a: 00010
3
b: 10010
c: 00100
d: 10100
2
p
e: 01100
5
4
s
f: 01101
g: 00101
Sales pitch for Galled-Trees
Galled-trees represent a small deviation from true trees.
There are sufficient applications where it is plausible that a galled tree
exists that generates the sequences.
Observable recombinations tend to be recent; block structure of human
DNA; recombination is sparse, so the true history of observable
recombinations may be a galled-tree.
The number of recombinations is never more than m/2. Moreover,
when M can be derived on a galled-tree, the number of recombinations used
is the minimum number over any phylogenetic network, even if multiple
cross-overs at a recombination event are counted as a single recombination.
A galled-tree for M is ``almost unique” - implications for reconstructing the
correct history.
Old (Aug. 2003) Results
• O(nm + n^3)-time algorithm to determine whether
or not M can be derived on a galled-tree with all-0
ancestral sequence.
• Proof that the galled-tree produced by the
algorithm is a “nearly-unique” solution.
• Proof that the galled-tree (if one exists) produced
by the algorithm minimizes the number of
recombinations used, over all phylogeneticnetworks with all-0 ancestral sequence.
New work
We derive the galled-tree results in a more
general setting that addresses unconstrained
recombination cycles and multiple
crossover recombination. This also solves the
problem of finding the ``most tree-like”
network when a perfect phylogeny is not
possible. In this algorithm, no ancestral
sequence is known in advance.
Blobbed-trees: generalizing
galled-trees
• In a phylogenetic network a maximal set of
intersecting cycles is called a blob.
• Contracting each blob results in a directed, rooted
tree, otherwise one of the “blobs” was not
maximal.
• So every phylogenetic network can be viewed as a
directed tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.
Every network is a
tree of blobs.
How do the tree parts
and the blobs relate?
How can we exploit
this relationship?
Ugly tangled
network inside
the blob.
Incompatible Sites
A pair of sites (columns) of M that fail the
4-gametes test are said to be incompatible.
A site that is not in such a pair is compatible.
M
12345
a 00010
b 10010
c 00100
d 10100
e 01100
f 01101
g 00101
Incompatibility Graph
4
1
3
2
5
Two nodes are connected iff the pair
of sites are incompatible, i.e, fail the
4-gamete test.
THE MAIN TOOL: We represent the pairwise
incompatibilities in a incompatibility graph.
The connected components of
G(M) are very informative
• The number of non-trivial connected components
is a lower-bound on the number of recombinations
needed in any network (Bafna, Bansal; Gusfield,
Hickerson).
• When each blob is a single-cycle (galled-tree case)
all the incompatible sites in a blob must come
from a single connected component C, and that
blob must contain all the sites from C.
Compatible sites need not be inside any blob.
(Gusfield et al 2003-5)
Simple Fact
If sites two sites i and j are incompatible,
then the sites must be together on some
recombination cycle whose recombination
point is between the two sites i and j.
(This is a general fact for all phylogenetic
networks.)
Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4;
as are 2, 5.
A Phylogenetic Network
00010
a:00010
10010
00000
4
3
1
00100
2
b:10010
P
c:00100
S
3
01100
p
d:10100
5
00101
S
4
01101
f:01101
e:01100
g:00101
Simple Consequence of the
simple fact
All sites on the same (non-trivial) connected
component of the incompatibility graph
must be on the same blob in any blobbed-tree.
Follows by transitivity.
So we can’t subdivide a blob into a tree-like
structure if it only contains sites from a single
connected component of the incompatibility
graph.
Key Result about Galls: For
galls, the converse of the simple
consequence is also true.
Two sites that are in different (non-trivial) connected
components cannot be placed on the same gall in
any phylogenetic network for M.
Hence, in a galled-tree T for M each gall contains all
and only the sites of one (non-trivial) connected
component of the incompatibility graph. All
compatible sites can be put on edges outside of the
galls.
This is the key to the galled-tree solution.
A galled-tree generating
the sequences generated
by the prior network.
4
Incompatibility Graph
4
3
1
1
3
2
5
p s
a: 00010
2
b: 10010
c: 00100
d: 10100
2
p
e: 01100
5
4
s
f: 01101
g: 00101
Motivated by the one-one correspondence between galls
and non-trivial connected components, we ask:
To what extent does this one-one correspondence hold
in general blobbed-trees, i.e. with no constraints on
how recombination cycles interweave?
The Decomposition Theorem
(Recomb 2005)
For any set of sequences M, there is a
blobbed-tree T(M) that derives M, where
each blob contains all and only the sites in
one non-trivial connected component of
G(M). The compatible sites can always be
put on edges outside of any blob.
A blobbed-tree with this structure is called fully-decomposed.
General Structure
So, for any set of sequences M, there is a phylogenetic
network T(M) that is fully decomposed.
Moreover, the tree part of T(M) is unique. And it is easy to
find the tree part.
A fully-decomposed
network for the sequences 4
generated by the prior
network.
Incompatibility Graph
4
3
1
1
3
2
5
p s
a: 00010
2
b: 10010
c: 00100
d: 10100
2
p
e: 01100
5
4
s
f: 01101
g: 00101
Moreover
Since all sites from a single connected
component must be together on some blob
in any phylogenetic network, no network is
more decomposed than the fully
decomposed network.
Proof Ideas
Let C be a connected component of G(M).
Define M[C] as the sequences in M
restricted to the sites in C.
M
12345
a 00010
b 10010
c 00100
d 10100
e 01100
f 01101
g 00101
4
C2
C1
1
3
B1
2
5
B2
134
a001
b101
c010
d110
e010
f 010
g010
25
00
00
00
00
10
11
01
a
b
c
d
e
f
g
M[C1]
M[C2]
Faux Proof
Pick one site from each connected component C in G(M) to
``represent” C. No pair of those sites are incompatible, so
by the NASC for a perfect phylogeny, there will be a
perfect phylogeny T for the sites. Expand each node to a
network generating the sequences in M[C].
Incorrect, because the structure of T can be wrong. We need
to use information about all the sites in each C.
Now for each connected component C in G(M), call each distinct
sequence in M[C] a supercharacter, and let W be the indicator
matrix for the supercharacters. So W indicates which rows of M
contain which particular supercharacters.
1
2
3
4
3
3
3
134
a001
b101
c010
d110
e010
f010
g 01 0
M[C1]
5
5
5
5
6
7
8
a
b
c
d
e
f
g
25
00
00
00
00
10
11
01
M[C2]
a
b
c
d
e
f
g
12345678
10001000
01001000
00101000
00011000
00100100
00100010
00100001
W
Proof Ideas
Lemma: No pair of supercharacters are
incompatible.
So by the NASC for a Perfect Phylogeny,
there is a unique perfect phylogeny T for W.
Proof Ideas
For each connected component C of G(M), all
supercharacters that originate from C label edges in T that
are incident with one single node v[C] in T. So, if we
expand each node v[C] to be a network that generates the
supercharacters from C (the sequences in M[C]), and
connect each network correctly to the edges in T, the
resulting network is a fully-decomposed blobbed-tree that
generates M.
Algorithmically, T is easy to find and is the
tree resulting from contracting each blob in
the fully-decomposed blobbed-tree T(M) for
M. T can be constructed from M in
O(nm^2) time.
Broader Biological Applications
Our major interest is in recombination, but
the proof of the decomposition theorem
does not explicitly use recombination. So it
holds for whatever biological phenomena
caused the incompatibility of sites. For
example, back or recurrent mutation, geneconversion, lateral gene transfer etc.
The main open question
The Decomposition Theorem says there is
always a fully-decomposed blobbed-tree for
any M, but
Is there always a fully-decomposed
blobbed-tree that minimizes the number of
recombinations over all possible
phylogenetic networks for M?
We conjecture the answer is yes.
If true, then we can decompose the problem of minimizing the total
number of recombinations into separate problems on each connected
component, and also find lower bounds on the needed number of
recombinations, in each component separately, adding those bounds to
get a valid overall lower bound for M. This computation of lower
bounds is known to be correct for certain lower bounds (Bafna, Bansal
2004).
Progress on Proving the
Conjecture
Definition: If N is a phylogenetic network for M, and a node v in N is
labeled with a sequence in M, then v is said to be visible in N.
Theorem: If every node in N is visible, then there is a fully-decomposed
network for M where the number of recombinations is at most the
number in N.
Corollary: The conjecture is true for any M where the Haplotype or
History lower bounds (S. Myers) on the number of recombinations
needed to generate M, is tight.
Optimality of Galled-Trees
Theorem: (G,H,B,B) The minimum number of recombination nodes
in any phylogenetic network for M is at least the number
of non-trivial connected components of the incompatibility graph.
Hence, when the sequences on each blob on T(M) can be generated
with a single recombination node, the blobbed-tree minimizes
the number of recombination nodes over all phylogenetic
networks and all choices of ancestral sequence. This solves
the root-unknown galled-tree problem in polynomial time.
Code is on the web.
The number of arrangements on a
gall (all-0 ancestral sequence)
By analysing the algorithm to layout the
sites on a gall (not discussed here), one can
prove that the number of arrangements of any gall is
at most three, and this happens only if the gall has two
sites.
If the gall has more than two sites, then the number of
arrangements is at most two.
If the gall has four or more sites, with at least two sites
on each side of the recombination point (not the side of
the gall) then the arrangement is forced and unique.
Computing close bounds on the
minimum number of
recombinations
Dan Gusfield UCD
Y. Song, Y. F. Wu, D. Gusfield (ISMB2005)
D. Gusfield, D. Hickerson (Dis. Appl. Math 2005)
D. Gusfield, V. Bansal (Recomb 2005)
The grandfather of all lower
bounds - HK 1985
• Arrange the nodes of the incompatibility graph on the line
in order that the sites appear in the sequence. This bound
requires a linear order.
• The HK bound is the minimum number of vertical lines
needed to cut every edge in the incompatibility graph.
Weak bound, but widely used - not only to bound the
number of recombinations, but also to suggest their
locations.
Justification for HK
If two sites are incompatible, there must have
been some recombination where the
crossover point is between the two sites.
HK Lower Bound
1
2
3
4
5
HK Lower Bound = 1
1
2
3
4
5
More general view of HK
Given a set of intervals on the line, and for each interval I, a
number N(I), define the composite problem: Find the minimum
number of vertical lines so that every interval I intersects at
least N(I) of the vertical lines.
In HK, each incompatibility defines an interval I where N(I) = 1.
The composite problem is easy to solve by a left-to-right myopic
placement of vertical lines: Sort the intervals by right end-point;
Process the intervals left to right in that order; when the right
endpoint of an interval I is reached, place there (if needed)
additional vertical so that N(I) lines intersect I.
If each N(I) is a ``local” lower bound on the number of
recombinations needed in interval I, then the solution to
the composite problem is a valid lower bound for the
full sequences. The resulting bound is called the composite
bound given the local bounds.
This general approach is called the Composite Method
(Simon Myers 2002).
Haplotype Bound (Simon Myers)
• Rh = Number of distinct sequences (rows) - Number of
distinct sites (columns) -1 (folklore)
• Before computing Rh, remove any site that is compatible
with all other sites. A valid lower bound results - generally
increases the bound.
• Generally really bad bound, often negative, when used on
large intervals, but Very Good when used as local bounds
in the Composite Interval Method, and other methods.
Composite Interval Method using
RH bounds
Compute Rh separately for each of the C(m,2) intervals of
the m sites; let N(I) = Rh(I) be the local lower bound for
interval I. Then compute the composite bound using these
local bounds.
Polynomial time and gives bounds that often double HK in
our simulations.
Composite Subset Method
(Myers)
• Let S be subset of sites, and Rh(S) be the
haplotype bound for subset S. If the
leftmost site in S is L and the rightmost site
in S is R, then use Rh(S) as a local bound
N(I) for interval I = [S,L].
• Compute Rh(S) on many subsets, and then
solve the composite problem to find a
composite bound.
RecMin (Myers)
• World Champion Lower Bound Program (until
now). Often RecMin gives a bound three times as
large as HK.
• Computes Rh on subsets of sites, but limits the
size and the span of the subsets. Default
parameters are s = 6, w = 15 (s = size, w = span).
• Generally, impractical to set s = w = m, so
generally one doesn’t know if increasing the
parameters would increase the bound. (example:
Myers bound of 70 on the LPL data).
Optimal RecMin Bound (ORB)
• The Optimal RecMin Bound is the lower
bound that RecMin would produce if both
parameters were set to their maximum
values (s = w= m).
• In general, RecMin cannot compute (in
practical time) the ORB.
• Practical computation of the ORB is our
first contribution.
Computing the ORB
• Gross Idea: For each interval I, use ILP to find a subset of
sites S that maximizes Rh(S) over all subsets in interval I.
Call the result Opt(I).
• Set N(I) = Opt(I), and compute the composite bound using
those local bounds.
• The composite bound is the ORB. -- the result one would
get by using all 2^m subsets in RecMin, with s = w = m.
We have moved from doing an exponential number of simple
computations (computing Rh for each subset), to solving
a quadratic number of (possibly expensive) ILPs.
Is this a good trade-off in practice? Our experience - very much
so!
How to compute Opt(I) by ILP
Create one variable Xi for each row i; one variable Yj
for each column j in interval I. All variables are 0/1 variables.
Define S(i,i’) as the set of columns where rows i and i’ have different
values. Each column in S(i,i’) is a witness that rows i and i’ differ.
For each pair of rows (i,i’), create the constraint:
Xi + Xi’ <= 1 + ∑ [Yj: j in S(i,i’)]
Objective Function: Maximize ∑ Xi - ∑ Yj -1
Alternate way to compute Opt(I)
by ILP
First remove any duplicate rows. Let N be the number of rows
remaining.
Create one variable Yj for each column j
in interval I. All variables are 0/1 variables.
S(i,i’) as before.
For each pair of rows (i,i’) create the constraint:
1 <= ∑ [Yj: j in S(i,i’)]
Objective Function: Maximize N - ∑(Yj) -1
Finds the smallest number of columns to distinguish the rows.
Two critical tricks
1) Use the second ILP formulation to compute Opt(I). It solves
much faster than the first (why?)
2) Reduce the number of intervals where Opt(I) is computed:
I
L
If the solution to Opt(I) uses columns that
only span interval L, then there is no
need to find Opt(I’) in any interval
I’ containing L. Each ILP computation
directly spawns at most 4 other ILPs.
Apply this idea recursively.
With the second trick we need to find Opt(I)
for only 0.5% - 35% of all the C(m,2)
intervals (empirical result). Surprisingly
fast in practice (with either the GNU solver
or CPLEX).
Bounds Higher Than the Optimal
RecMin Bound
• Often the ORB underestimates the true minimum, e.g.
Kreitman’s ADH data: 6 v. 7
• How to derive sharper bound? Idea: In the composite
method, check if each local bound N(I) = Opt(I) is tight,
and if not, increase N(I) by one.
• Small increases in local bounds can increase the composite
bound.
Bounds Sharper Than Optimal
RecMin Bound
• A set of sequences M is called self-derivable if there is a
network generating M where the sequence at every node
(leaf and intermediate) is in M.
• Observation: The haplotype bound for a set of sequences is
tight if and only if the sequences are self-derivable.
• So for each interval I where Opt(I) is computed, we check
self-derivability of the sequences induced by columns S*,
where S* are the columns specified by Opt(I). Increase
N(I) by 1 if the sequences are not self-derivable.
Algorithm for Self-Derivability
• Solution is easy when there are no mutations --only
recombinations are allowed and one initial pair of
sequences is chosen as ``reached”.
• Two reached sequences S1 and S2 can reach a third
sequence S3, if S3 can be created by recombining S1 and
S2.
• Do BFS to see if all sequences can be reached by
successive application of the ``reach operation”.
• Clearly polynomial time and can be optimized with suffixtrees etc. (Kececiouglu, Gusfield)
Self-derivability Test with
mutations allowed
• For each site i, construct set MUT(i) of sequence pairs (S1,
S2) in M where S1 and S2 differ at only site.
• Try each sequence in M as root (which is the only reached
sequence initially).
• For each root, enumerate all ways of choosing exactly one
ordered pair of sequences (Sp, Sq) from each MUT(i). Sp
is allowed to ``reach” Sq.
• Run the prior self-derivability algorithm with these new
permitted reach relations, to test if all sequences in M can
be reached. If so, then M is self-derivable, otherwise it is
not.
Checking if N(I) should be
increased by two
If the set of sequences is not self-derivable,
we can test if adding one new sequence makes
it self-derivable.
the number of candidate sequences is
polynomial and for each one added to M we
check self-derivability. N(I) should be
increased by two if none of these sets of
sequences is self-derivable.
Program HapBound
• HapBound –S. Checks each Opt(I) subset for selfderivability. Increase N(I) by 1 or 2 if possible. This often
beats ORB and is still practical for most datasets.
• HapBound –M. Explicitly examine each interval directly
for self-derivability. Increase local bound if possible.
Derives lower bound of 7 for Kreitman’s ADH data in this
mode.
HapBound vs. RecMin on LPL
from Clark et al.
Program
Lower Bound
Time
RecMin (default)
59
3s
RecMin –s 25 –w 25
75
7944s
RecMin –s 48 –w 48
No result
5 days
HapBound
75
31s
HapBound -S
78
1643s
2 Ghz PC
Example where RecMin has
difficulity in Finding Optimal Bound
on a 25 by 376 Data Matrix
Program
RecMin default
RecMin –s 30 –w 30
RecMin –s 35 –w 35
RecMin –s 40 –w 40
RecMin –s 45 –w 45
HapBound
HapBound -S
Bound
36
42
43
43
43
44
48
Time
1s
3m 25s
24m 2s
2h 9m 4s
10h 20m 59s
2m 59s
39m 30s
Frequency of HapBound –S
Bound Sharper Than Optimal
RecMin Bound
ms param. Rho=1
Rho=5
Rho=10
Rho=20
Theta=1
0.0%
0.4%
0.5%
1.5%
Theta=5
0.7%
4.0%
10.4%
27.0%
Theta=10 1.4%
9.2%
17.8%
40.4%
Theta=20 1.4%
10.5%
27.8%
45.4%
For every ms parameters, 1000 data sets are used.
Computing Upper Bounds
• The method is an adaptation of the
``history” lower bound (Myers).
• A non-informative column is one with
fewer than two 0’s or fewer than two 1’s.
Single History Computation
1)
2)
3)
Set W = 0
Collapse identical rows together, and remove noninformative columns. Repeat until neither is possible.
Let A be the data at this point. If A is empty, stop, else
remove some row r from A, and set W = W + 1. Go to
step 2).
Note that the choice of r is arbitrary in Step 3), so the
resulting W can vary.
History Lower Bound
Theorem (Myers) Let W* be the minimum W
obtained from all possible single history
computations.
Then W* is a valid lower bound on the
number of recombinations needed.
Naïve time: theta(n!) (RecMin), but can be
reduced to theta(2^n) (Bafna, Bansal).
Converting the History Lower
Bound to an Upper Bound
• Given a set of rows A and a single row r,
define w(r | A - r) as the minimum number
of recombinations needed to create r from
A-r (well defined in our application).
• w(r | A-r) can be computed in linear time by
a greedy-type algorithm.
Upper Bound Computation
1)
2)
3)
Set W = 0
Collapse identical rows together, and remove noninformative columns. Repeat until neither is possible.
Let A be the data at this point. If A is empty, stop, else
remove some row r from A, and set W = W + W(r | A-r).
Go to step 2).
Note that the choice of r is arbitrary in Step 3), so the
resulting W can vary.
This is the Single History Computation, with a change in
step 3).
Note, even a single execution of the upper bound computation
gives a valid upper bound, and a way to construct a
network. This is in contrast to the History Bound which
requires finding the minimum W over all histories.
However, we also would like to find the lowest upper bound
possible with this approach. This can be done in O(2^n)
time by DP. In practice, we can use branch and bound to
find the lowest upper bound, but we have also found that
branching on the best local choice, or randomizing gives
good results.
Branch and Bound
(Branching) In Step 3) choose r to minimize
w(r | A-r) + L(A-r), where L(A-r) is some
fast lower bound on the number of
recombinations needed for the set A-r.
Even HK is good for this purpose.
(Bounding) Let C be the min for an full
solution found so far; If W + L(A) >= C,
then backtrack.
Kreitman’s 1983 ADH Data
•
11 sequences, 43 segregating sites
•
Both HapBound and SHRUB took only a fraction of a
second to analyze this data.
•
Both produced 7 for the number of detected
recombination events
Therefore, independently of all other methods, our lower
and upper bound methods together imply that 7 is the
minimum number of recombination events.
A minimal ARG for Kreitman’s data
SHRUB produces
code that can be
input to an open
source program to
display the
constructed ARG
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
The Human LPL Data (Nickerson et al. 1998)
(88 Sequences, 88 sites)
Our new lower
and upper
bounds
Optimal RecMin
Bounds
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
(We ignored insertion/deletion, unphased sites, and sites with missing data.)
Study on simulated data:
Exact-Match frequency for varying parameters
  = Scaled mutation rate
  = Scaled recombination rate
• n = Number of sequences
n = 25
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Used Hudson’s MS to
generate1000 simulated datasets
for each pair of and 
n = 15
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
For < 5, our lower and upper bounds match more than
95% of the time.
Exact-Match frequency
for varying number of sequences
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Match frequency does not depend on n as much as it does on
 or 
A closer look at the deviation

Average ratio of lower bound to upper bound when
they do not match
For n = 25:
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
The numerical difference between lower and upper bounds
grows as  or  increases, but their ratio is more stable.
Papers and Software
wwwcsif.cs.ucdavis.edu/~gusfield/