Combinatorial Optimization in Computational Biology

Download Report

Transcript Combinatorial Optimization in Computational Biology

Combinatorial Optimization in
Computational Biology
Dan Gusfield
Computer Science, UC Davis
Some Applications of
optimization methods
•
•
•
•
Dynamic Programming in sequence alignment
TSP and Euler paths in DNA sequencing
Integer Programming in Haplotype inference
Graphic Matroid Recognition in Haplotype
inference
• Integer Programming for Virus Identification
• Weighted matching in genetic partition analysis
Apology
This talk focuses (mostly) on my own work, for simplicity of
preparation. But, many other wonderful examples exist
in the literature.
At present, there is no coherent story about the ways that
combinatorial optimization arises in Computational
Biology. Must be alert for meaningful applications.
Topic 1: Sequence Alignment
Intended to show the similarity of two sequences (strings).
Definition: An alignment of two strings S and S’ is obtained
by inserting spaces into, or at the ends of, the two strings, so
that the resulting strings (including the spaces) have the same
lengths.
A
--
-C
T
T
C
C
-A
A
A
Matches, Mismatches and Indels
Two aligned, identical characters in an alignment are a match.
Two aligned, unequal characters are a mismatch.
A character aligned with a space, represents an indel
(insertion/deletion).
A A C T A C T -- C C T A A C A C T -- --- -- C T C C T A C C T -- -- T A C T T T
the bars show mismatches
10 matches, 2 mismatches, 7 indels
Basic Algorithmic Problem
Find an alignment of the two strings that
Maximizes the # matches - # mismatches - # spaces
in the alignment of strings S and S’.
That maximum number defines the similarity of the two
strings.
Also called Optimal Global Alignment
Biologically, under the point mutation model, a mismatch
represents a mutation, while an indel represents a historical
insertion or deletion of a single character.
Solution By Dynamic
Programming
Def: Let V(I, J) be the Maximimum Value obtainable in
any alignment of the first I characters of S and the first
J characters of S’.
If S has N characters and S’ has M characters, then
we want to compute V(N, M), and the associated
alignment.
Let M(I, J) = -1 if S(I) mismatches S’(J) and
M(I, J) = 1 if S(I) matches S’(J)
DP Solution
The DP recurrences are:
V (I,0)  I,
V (0,J)  J,
Base Cases
General Cases
I and J aligned
V (I 1,J 1)  M(I,J)


I opposite a space
V (I,J)  MaxV (I 1,J) 1



V (I,J 1) 1
 J opposite a space
They can be evaluated in O(NM) operations.
Many Variations on the Theme
The most important alignment variant is Optimal Local Alignment:
Given S and S’ find a substring A of S, and a substring B of S’
such that the value of the Global Alignment of A and B is
maximized over all pairs of substrings of S and S’.
The point is to find a pair of substrings of S and S’ which are
highly similar to each other, even if S and S’ are not. Many
biological reasons for this: DNA problems, Protein Domains etc.
Naively, we would need to do O((NM)**2) global alignments,
for a time bound of O((NM)**3) operations overall.
DP solution to Local Alignment
V (I,0)  V (0,J)  0
V (I 1,J 1)  M(I,J)


V
(I
1,J)
1

V (I,J)  MAX
V (I,J 1) 1



0

But the optimal value is no longer V(N, M) as in global alignment.
 Instead the optimal Local Alignment Value is the largest V
value computed over all the NM values. The time is
again O(NM). This is the Smith-Waterman Local Alignment
Algorithm.
Adding Operation Weights
Find an alignment of the two strings that
Maximizes
Vm x (# matches) - Cm x (# mismatches) - Cs x (# spaces)
in the alignment of strings S and S’.
The modification to the DP is immediate, but now the
modeling question is what Vm, Cm and Cs should be.
Those values greatly determine the biological utility of
the alignment.
This leads to Parametric Alignment: Find the optimal alignment
as a function of unknown values of Vm, Cm, and Cs
Parametric Sequence Alignment
As a function of two selected weight parameters Vm, Cm, Cs, Cg,
decompose the two dimensional parameter space into convex
polygons, so that in each polygon, an alignment inside the polygon
is optimal for any point in the polygon, and nowhere else.
The time for this decomposition is O(NM) per polygon.
In the case of global alignment, the number of polygons is bounded
by 0.88(N^(2/3)), where N < M.
package XPARAL to find the polygons is available on the web.
Gusfield, Naor, Stelling in several papers (see the web)
Adding Gaps to the Model
A Gap in an alignment is a contiguous run of spaces in one sequence.
Example: A T T -- -- -- C C --- -- T C T G C C C
has six spaces in three gaps
Now the objective function for alignment is Maximize
Vm x (#matches) - Cm x (#mis) - Cs x (#spaces) - Cg x (#gaps)
This is the affine gap model, the dominant model for studying
protein sequence similarity. The choice of Cs and Cg are critical for the
biological utility. The DP in this case can also be solved in O(NM) time.
Topic 2: Sequencing By
Hybridization
A proposed approach to sequencing DNA
The technology allows you to determine all the K-tuples,
for a fixed K, that are contained in a DNA string S.
The goal is to try to determine the string S from the given
K-tuples.
One can cast this problem as a Hamilton Path problem on
a graph, but it is more productive to view it as an Euler
Path Problem.
This approach was introduced by Pavel Pavzner.
I will focus on the results in
Graph Traversals, Genes and Matroids: An Efficient Special Case
of the Traveling Salesman Problem
D. Gusfield, R. Karp, L. Wang, P. Stelling
Discrete Applied Math 88 special issue on
Computational Molecular Biology, 1998, p. 167-180.
Sequencing by Hybridization
Example: S =
ACTTAAACGCGCACAA
K=3
So we learn that ACT, CTT etc. are in S, but we learn these
in no predictable order.
The Hamilton Path View
ACG
CGC
add a C on the
left
Each node represents a K-tuple, and a directed edge from one
node to another if the second K-tuple can be obtained from the
first by deleting the left character, and adding a right character:
i.e, a shift and add.
Then a Hamilton Path in this graph generates a sequence S’
that has the same K-tuples as S
The Euler Path View
The Euler Path view is more productive, leading to more
efficient algorithms both for the base problem, and for
variants of the problem.
Due to P. Pevzner
Graph G: One node for every k-1 tuple, and one
edge for every observed k-tuple
A
C
AA
AC
C
A
A
CA
A
G
T
TA
A
GC
C
A
TT
T
CT
CG
Observed triples from S: AAA, AAC, ACA, CAA, CAC, ACG, CGC,
GCA, GCC, ACT, CTT, TTA, TAA
Any Euler path in G produces a string which has the same
K-tuples as the original string S. So if there is only one
Euler path in G, then you have reconstructed S.
The problem is that typically there is more than one Euler
path. So how to choose one Euler path as a most promising
one? That is, one that is most likely to correspond to S?
Frequently there is additional information about the K-tuples,
for example the rough distances between K-tuples in S. So,
we can give a value to each pair of successive edges in G,
which roughly reflects the probability that the two edges
would appear successively in the Euler Path for S. Another
way to state this is that we evaluate the goodness of S’
according to the (k+1)-tuples that it contains.
Back to the Hamilton View
Now we are back to the Hamilton Path view, where each
node is a K-tuple of S, but now each
edge has a distance reflecting the goodness of any (K+1)-tuple
that would be generated by a Hamilton path, and we want the
Maximum distance Hamilton Path in the graph.
This sounds hard, but there is structure in this problem: the
graphs are not arbitrary.
The structure of the Hamilton
Graph
The Hamilton Graph for the data, where nodes represent
K-tuples in S, and edges represent (K+1)-tuples in a
solution string S’, is the line di-graph of the Euler Graph
for the data, where nodes represent (K-1) tuples, and
edges represent K-tuples.
How Euler Relates to Hamilton
The Hamilton graph is the
line di-graph L(G) of the Euler Graph G.
Each edge of G becomes a node in L(G), and each edge in L(G)
represents a successive pair of edges in G.
Each red edge has the value of the successive pair of black edges.
So the problem now is to find a Maximum Value TSPath.
In the case that each node has only two in and two out edges
(which happens when using a simplified DNA alphabet),
this TSP problem has a polynomial time solution.
In fact, if L(G) is any line digraph where every node has
in and out degrees at most 2, then even for arbitrary edge
distances, the TSP problem can be solved in O(n log n) time.
Moreover, if every edge value is distinct, then the TSP solution
is unique.
The Hamilton paths for such 2-in, 2-out-degree graphs have
a matroid structure that reveals many features of the TSP solutions.
When degrees are greater than 2, we branch-and-bound down to
cases of degree 2, which we then can solve optimally.
Approximations
When the in and out-degrees are bounded by X, a greedy
algorithm is guaranteed to find a Hamilton path whose
value is within 1/X that of the optimal. So 1/3 or 1/4
for the case of DNA.
Use of this approximation can speed up the branch-andbound until the in and out-degree bound of 2 is reached.
Topic 3: Combinatorial
Algorithms for Haplotype
Inference
True Parsimony and Perfect
Phylogeny
Dan Gusfield
Genotypes and Haplotypes
Each individual has two “copies” of each
chromosome.
At each site, each chromosome has one of two
alleles (states) denoted by 0 and 1 (motivated by
SNPs)
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
Two haplotypes per individual
Merge the haplotypes
2 1 2 1 0 0 1 2 0
Genotype for the individual
Haplotyping Problem
• Biological Problem: For disease association studies,
haplotype data is more valuable than genotype data, but
haplotype data is hard to collect. Genotype data is easy to
collect.
• Computational Problem: Given a set of n genotypes,
determine the original set of n haplotype pairs that
generated the n genotypes. This is hopeless without a
genetic model.
We consider two models in this
talk
• Pure (True) Parsimony Model, Gusfield
unpublished (Note added 3/18/03 technical
report is available on the web, and will
appear in the 2003 CPM conference)
• Perfect Phylogeny Model, Gusfield
RECOMB Proceedings April 2002
Topic 3a: The Pure Parsimony
Objective
For a set of genotypes, find a Smallest set H of
haplotypes, such that each genotype can be
explained by a pair of haplotypes in H.
Example of Parsimony
02120
00100
01110
22110
01110
10110
20120
00100
10110
3 distinct haplotypes
set S has size 3
Pure Parsimony is NP-hard
Earl Hubbel (Affymetrix) showed that Pure Parsimony
is NP-hard.
However, for a range of parameters of current interest
(number of sites and genotypes) a Pure Parsimony
solution can be computed efficiently, using Integer
Linear Programming, and one speed-up trick.
For larger parameters (100 sites and 50 genotypes)
A near-parsimony solution can be found efficiently.
The Conceptual Integer
Programming Formulation
For each genotype (individual) j, create one integer
programming variable Yij for each pair of haplotypes
whose merge creates genotype j. If j has k 2’s, then
This creates 2^(k-1) Y variables.
Create one integer programming variable Xq for
Each distinct haplotype q that appears in one of the
pairs for a Y variable.
Conceptual IP
For each genotype, create an equality that says that
exactly one of its Y variables must be set to 1.
For each variable Yij, whose two haplotypes are
given variables Xq and Xq’, include an inequality
that says that if variable Yij is set to 1, then both
variables Xq and Xq’ must be set to 1.
Then the objective function is to Minimize the
sum of the X variables.
Example
02120 Creates a Y variable Y1 for pair 00100 X1
01110 X2
and a Y variable Y2 for pair
01100 X3
00110 X4
Include the following (in)equalities into the IP
Y1 + Y2 = 1
Y1 - X1 <= 0
Y1 - X2 <= 0
Y2 - X3 <= 0
Y2 - X4 <= 0
The objective function will
include the subexpression
X1 + X2 + X3 + X4
But any X variable is included
exactly once no matter how many
Y variables it is associated with.
Efficiency Trick
Ignore any Y variable and its two X variables if those X
variables are associated with no other Y variable. The
Resulting IP is much smaller, and can be used to find
the optimal to the conceptual IP.
Also, we need not first enumerate all X pairs for a given
genotype, but can efficiently recognize the pairs we
need. Another simple trick.
How Fast? How Good?
Depends on the level of recombination in the underlying
data. Pure Parsimony can be computed in seconds to
minutes for most cases with 50 genotypes and up to 60
sites, faster as the level of recombination increases.
As the level of recombination increases, the accuracy
of the Pure Parsimony Solution falls, but remains within
10% of the quality of PHASE (for comparison).
Topic 3b: Haplotyping via
Perfect Phylogeny
Conceptual Framework and Efficient
(almost linear-time) Solutions
Dan Gusfield
U.C. Davis
The Perfect Phylogeny Model
We assume that the evolution of extant haplotypes can be
displayed on a rooted, directed tree, with the all-0
haplotype at the root, where each site
changes from 0 to 1 on exactly one edge, and each extant
haplotype is created by accumulating the changes on a path
from the root to a leaf, where that haplotype is displayed.
In other words, the extant haplotypes evolved along a
perfect phylogeny with all-0 root.
The Perfect Phylogeny Model
sites 12345
Ancestral haplotype 00000
1
4
Site mutations on edges
3
00010
2
10100
5
10000
01010
01011
Extant haplotypes at the leaves
Justification for Perfect
Phylogeny Model
• In the absence of recombination each haplotype of
any individual has a single parent, so tracing back
the history of the haplotypes in a population gives
a tree.
• Recent strong evidence for long regions of DNA
with no recombination. Key to the NIH haplotype
mapping project. (See NYT October 30, 2002)
• Mutations are rare at selected sites, so are assumed
non-recurrent.
• Connection with coalescent models.
The Haplotype Phylogeny Problem
Given a set of genotypes S, find an explaining set
of haplotypes that fits a perfect phylogeny.
sites
A haplotype pair explains a
genotype if the merge of the
a
S
b
haplotypes creates the
c
genotype. Example: The
Genotype matrix merge of 0 1 and 1 0 explains
2 2.
1
2
0
1
2
2
2
0
The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
a
b
c
1
2
0
1
2
2
2
0
a
a
b
b
c
c
1
1
0
0
0
1
1
2
0
1
0
1
0
0
The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
00
a
b
c
1
2
0
1
2
2
2
0
a
a
b
b
c
c
1
1
0
0
0
1
1
2
0
1
0
1
0
0
1
2
b
00
cc
10 10
a
10
a
01
b
01
The Alternative Explanation
a
b
c
1
2
0
1
2
2
2
0
a
a
b
b
c
c
1
1
0
0
0
1
1
2
1
0
0
1
0
0
No tree
possible
for this
explanation
When does a set of haplotypes to
fit a perfect phylogeny?
Classic NASC: Arrange the haplotypes in a
matrix, two haplotypes for each individual.
Then (with no duplicate columns), the
haplotypes fit a unique perfect phylogeny if
and only if no two columns contain all three
pairs:
0,1 and 1,0 and 1,1
This is the 3-Gamete Test
The Alternative Explanation
a
b
c
1
2
0
1
2
2
2
0
a
a
b
b
c
c
1
1
0
0
0
1
1
2
1
0
0
1
0
0
No tree
possible
for this
explanation
The Tree Explanation Again
a
b
c
1
2
0
1
2
2
2
0
a
a
b
b
c
c
1
1
0
0
0
1
1
2
0
1
0
1
0
0
00
1
2
b
00
cc
a
a
b
01 01
Solving the Haplotype Phylogeny
Problem (PPH)
• Simple Tools based on classical Perfect
Phylogeny Problem.
• Complex Tools based on Graph Realization
Problem (graphic matroid realization).
Simple Tools – treeing the 1’s
1) For any row i in S, the set of 1 entries in row
i specify the exact set of mutations on the
path from the root to the least common
ancestor of the two leaves labeled i, in every
perfect phylogeny for S.
2) The order of those 1 entries on the path is
also the same in every perfect phylogeny for
S, and is easy to determine by “leaf
counting”.
Leaf Counting
In any column c, count two for each 1, and
count one for each 2. The total is the number
of leaves below mutation c, in every perfect
phylogeny for S. So if we know the set of
mutations on a path from the root, we know
their order as well.
1 2 3 4 5 6 7
S
a
b
c
d
1
0
2
2
0
1
2
2
1
0
0
0
0
1
0
0
0
0
2
0
0
0
0
2
0
0
2
0
Count 4 4 2 2 1 1 1
The Subtree of 1’s is Forced
The columns of S with at least one 1 entry
specify a unique upper subtree that must be
in every perfect phylogeny for S.
S
a
b
c
d
1
1
0
2
2
2
0
1
2
2
3
1
0
0
0
4
0
1
0
0
5
0
0
2
0
6
0
0
0
2
7
0
0
2
0
4 4 2
2
1
1
1
1
3
a a
2
4
b b
Corollary: A Useful Sufficient
Condition
If every column of S has at least one 1, then
there is a unique perfect phylogeny for S, if
there is any.
Build the forced tree based on the 1-entries,
then every mutation (site) is in the tree, and
any 2-entries dictate where the remaining
leaves must be attached to the tree.
What to do when the Corollary
does not apply?
• Build the forced tree of 1-entries.
• 2-entries in columns with 1-entries are
done.
• Remaining problem: 2-entries in columns
without any 1-entries
Uniqueness without 1’s
A
0
2
2
B
2
2
0
C
2
0
2
There are 8 haplotype solutions
that can explain the data, but
only one that fits a perfect
phylogeny, i.e. only one solution
to the PPH problem.
More Simple Tools
3) For any row i in S, and any column c, if
S(i,c) is 2, then in every perfect phylogeny
for S, the path between the two leaves
labeled i, must contain the edge with
mutation c.
Further, every mutation c on the path
between the two i leaves must be from
such a column c.
From Row Data to Tree
Constraints
sites
1234567
i:0 1 0 2 2 1 2
i
Subtree for row i data
Root
2
6
The order is
known for the red
mutations
4,5,7
order
i
unknown
The Graph Theoretic Problem
Given a genotype matrix S with n sites, and a
red-blue fork for each row i,
create a directed tree T where each
integer from 1 to n labels exactly one
edge, so that each fork is
i
i contained in T.
Complex Tool: Graph Realization
• Let Rn be the integers 1 to n, and let P be an unordered
subset of Rn. P is called a path set.
• A tree T with n edges, where each is labeled with a unique
integer of Rn, realizes P if there is a contiguous path in T
labeled with the integers of P and no others.
• Given a family P1, P2, P3…Pk of path sets, tree T realizes
the family if it realizes each Pi.
• The graph realization problem generalizes the consecutive
ones problem, where T is a path.
Graph Realization Example
P1: 1, 5, 8
P2: 2, 4
P3: 1, 2, 5, 6
P4: 3, 6, 8
P5: 1, 5, 6, 7
5
1
2
8
4
6
7
3
Realizing Tree T
Graph Realization
Polynomial time (almost linear-time) algorithms
exist for the graph realization problem – Whitney,
Tutte, Cunningham, Edmonds, Bixby, Wagner,
Gavril, Tamari, Fushishige.
The algorithms are not simple; no known
implementations of the near linear-time methods.
But an implementation from UCD CS of a slightly
slower method is available on the web.
Recognizing graphic Matroids
• The graph realization problem is the same
problem as determining if a binary matroid is
graphic, and the algorithms come from that
literature.
• The fastest algorithm is due to Bixby and Wagner.
For n rows and m columns the time is O(nm x
alpha(nm).
• Representation methods due to Cunningham et al.
Reducing PPH to graph
realization
We solve any instance of the PPH problem
by creating appropriate path sets, so that a
solution to the resulting graph realization
problem leads to a solution to the PPH
problem instance.
The key issue: How to encode a fork by path
sets.
From Row Data to Tree
Constraints
sites
1234567
i:0 1 0 2 2 1 2
i
Subtree for row i data
Root
2
6
The order is
known for the red
mutations
4,5,7
order
i
unknown
Encoding a Red path, named g
2
5
7
P1: U, 2
U
P2: U, 2, 5
2
P3: 2, 5
5
forced
P4: 2, 5, 7 In T
7
P5: 5, 7
P6: 5, 7, g
g
P7: 7, g
U is a universal glue edge, used for all red paths.
g is used for all red paths identical to this one – i.e.
different forks with the identical red path.
Encoding and gluing a blue path
For a specific blue path, simply specify a single
path set containing
the integers in the blue path. However, we need to
force that path to be glued to the end of a specific
red path g, or to the root of T.
In the first case, just add g to the path set. In the
second case, add U to the path set.
Encoding and gluing a blue path
to the end of red path g
g
2
5
7
P: g, 4, 9, 18, 20
4, 9, 18, 20
After the Reduction
After all the paths are given to the graph realization
algorithm, and a realizing tree T is obtained
(assuming one is), contract all the glue edges to
obtain a perfect phylogeny for the original
haplotyping problem (PPH) instance.
Conversely, any solution can be obtained in
this way. The reduction takes time O(nm) for n
rows and m columns.
Uniqueness
In 1932, Whitney established the NASC for the uniqueness
of a solution to a graph realization problem:
Let T be the tree realizing the family of required path sets.
For each path set Pi in the family, add an edge in T
between the ends of the path realizing Pi. Call the result
graph G(T).
T is the unique realizing tree, if and only if
G(T) is 3-vertex connected. e.g., G(T) remains connected
after the removal of any two vertices.
Uniqueness of PPH Solution
• Apply Whitney’s theorem, using all the
paths implied by the reduction of the PPH
problem to graph realization.
• (Minor point) Do not allow the removal of
the endpoints of the universal glue edge U.
Multiple Solutions
• In 1933, Whitney showed how multiple solutions to the
graph realization problem are related.
• Partition the edges of G into two, connected graphs, each
with at least two edges, such that the two graphs have
exactly two nodes in common. Then twist one graph
around those nodes.
• Any solution can be transformed to another via a series of
such twists.
Twisting Example
2
1
x
7
2
3
y
4
8
5
6
1
x
3
y
8
4
5
7
6
All the cycle sets are preserved by twisting
Representing all the solutions
All the solutions to the PPH problem can be
implicitly represented in a data structure which
can be built in linear time, once one solution is
known. Then each solution can be generated in
linear time per solution. Method is a small
modification of ones developed by Cunningham
and Edwards, and by Hopcroft and Tarjan.
See Gusfield’s web site for errata.
The biological meaning
Each subgraph in the representation identifies a set of
mutations (columns) whose 0/1 settings are invariant
over all PPH solutions. However, by switching all
0/1 values that had been 2, in all of those mutations,
the relative 0/1
settings between columns in different subgraphs can
be set arbitrarilly, and all PPH solutions can be
generated in this way.
a
a
b
b
c
c
d
d
e
e
1
2
3
4
5
6
7
1
1
2
2
1
1
1
1
2
2
2
2
0
0
2
2
2
2
2
2
2
2
2
2
2
2
2
2
0
0
2
2
0
0
2
2
0
0
0
0
0
0
0
0
0
0
2
2
0
0
0
0
0
0
2
2
0
0
2
2
0
0
2
2
0
0
0
0
0
0
An example.
Each row starts
duplicated for
simplicity.
a
a
b
b
c
c
d
d
e
e
1
2
3
4
6
5
7
1
1
1
0
1
1
1
1
1
0
1
0
0
0
1
0
1
0
1
0
0
1
1
0
0
1
0
1
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
Starting from a PPH
Solution, if all
shaded cells in a
block switch value,
then the result is
also a PPH solution,
and any PPH
solution can be
obtained in this
way, i.e. by
choosing in each
block whether to
switch or not.
Secondary information and
optimization
• The partition shows explicitly what added information is
useful and what is redundant. Information about the
relationship of a pair of columns is redundant if and only if
they are in the same block of the column partition. Apply
this successively as additional information is obtained.
• Problem: Minimize the number of haplotype pairs
(individuals) that need be laboratory determined in order to
find the correct tree.
• Minimize the number of (individual, site1, site2) triples
whose phase relationship needs to be determined, in order
to find the correct tree.
The implicit representation of all solutions provides
a framework for solving these secondary problems,
as well as other problems involving the use of
additional information, and specific tree-selection
criteria.
A Phase-Transition
Problem, as the ratio of sites to genotypes changes,
how does the probability that the PPH solution is
unique change?
For greatest utility, we want genotype data where the
PPH solution is unique.
Intuitively, as the ratio of genotypes to sites increases,
the probability of uniqueness increases.
Frequency of a unique solution with 50
and 100 sites, 5% rule and 2500 datasets
per entry
# geno. Frequency of
unique solution
10
0.0018
20
0.0032
22
0.7646
40
0.7488
42
0.9611
70
0.994
130
0.999
140
1
10
20
22
40
42
60
100
110
0
0
0.78
0.725
0.971
0.983
0.999
1
An alternative solution
Recently, we developed an algorithm that is not based
on graph realization, and which is much easier to understand.
However, it runs in O(nm^2) time. See Gusfield’s website.
Implementations
See wwwcsif.cs.ucdavis.edu/~gusfield/
for papers and software.