Effective Heuristics for NP-Hard Problems Arising in Molecular Biology Richard M. Karp Bangalore, January 5, 2011

Effective Heuristics for NP-Hard
Problems Arising in Molecular
Biology
Richard M. Karp
Bangalore, January 5, 2011
NP-Hard Problems
• The P vs. NP problem: Is finding a solution to
a combinatorial search problem as easy as
checking a solution? The answer is expected to
be “No.”
• NP-Hard Problems: Solvable in polynomial
time only if P=NP.
• General belief: Solving an NP-hard problem
requires worst-case exponential time.
Understanding NP-Hard Problems
Through Worst-Case Analysis
• Exact solution methods: exponential running
time in worst case.
• Polynomial-time approximation algorithms
for optimization problems, yielding a worst-case
upper bound on the ratio between the
cost of an approximate solution and the cost
of an optimal solution. Unfortunately, these
guaranteed approximation ratios are
unrealistically high.
Probabilistic Analysis and Heuristics
• In probabilistic analysis, problem instances are
drawn from simple probability distributions.
Often one can prove excellent performance on
average. However, the probability
distributions may not correspond to real-life
instances.
• Heuristics are typically evaluated empirically on
examples drawn from, or representative of, real-life
instances. Heuristics are often “unreasonably
effective,” for reasons not well understood.
Famous Unreasonably Effective
Heuristics
• Large traveling-salesman problems can be solved
by quick tour construction methods, local
improvement methods or cutting plane methods.
• Local improvement methods find near-optimal
solutions to graph bisection problems.
• Huge satisfiability problems are routinely solved
rapidly by branch-and-bound methods.
• The greedy set cover algorithm typically gives
solutions within a few percent of optimal.
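The greedy set cover rule mentioned in the last bullet can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `greedy_set_cover` and the toy instance are assumptions made for the example.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Index of the subset with the largest overlap with the uncovered elements.
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Toy instance (hypothetical data).
subsets = [{1, 2, 3, 4}, {4, 5, 6}, {6, 7}, {1, 5}, {2, 7}]
print(greedy_set_cover(range(1, 8), subsets))  # [0, 1, 2]
```

The returned list holds indices into `subsets`; ties are broken in favor of the earliest subset.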
NP-Hard Problems Arising in Molecular
Biology and Genetics
• Genome Sequencing
• Global alignment of multiple genomes
• Identifying siblings, cousins, second cousins,
etc. through comparison of genomes
• Finding protein modules containing specified
types of proteins
• Computational discovery of dysregulated
pathways in human diseases
Patterns of Inheritance
• In each region of the genome, each individual
has two haplotypes, one inherited from each
parent. A haplotype is a sequence of alleles.
• The haplotype inherited from a parent is a mosaic
of segments inherited from the parent’s two
haplotypes. Recombination occurs at the
boundaries between segments.
• In a pedigree graph the vertices are individuals
and the edges represent parent-child relations.
Reconstructing Pedigrees
• Given the haplotypes of individuals in the
current generation, we wish to reconstruct the
pedigree that gave rise to that generation and
chart the flow of alleles.
Assumptions of a Generative Model
• Monogamy
• Layered structure: each individual and its mate lie
in generation g, have parents in generation g-1,
and children in generation g+1. Generation 1 is
the founding generation.
• The number of children of each couple is drawn
from a Poisson distribution with mean 2.
• In each haplotype, sites of recombination occur
according to a Poisson process with known rate.
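A rough simulation of the first three assumptions (monogamy, layers, Poisson family sizes) might look as follows. It is only a sketch: recombination is not simulated, couples are formed by random pairing within a generation (which may pair siblings, something the slide does not address), and the names `sample_poisson` and `simulate_pedigree` are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's method: multiply uniform draws until the product drops below e^-mean."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_pedigree(num_founders, num_generations, seed=0):
    """Return (layers, parents): individuals per generation and each child's couple."""
    rng = random.Random(seed)
    parents = {}                      # child id -> (parent id, parent id)
    layers = [list(range(num_founders))]
    next_id = num_founders
    for _ in range(num_generations - 1):
        pool = layers[-1][:]
        rng.shuffle(pool)
        children = []
        # Monogamy: consecutive individuals in the shuffled pool form a couple;
        # with an odd pool size, one individual stays unpaired.
        for a, b in zip(pool[0::2], pool[1::2]):
            for _child in range(sample_poisson(2.0, rng)):
                parents[next_id] = (a, b)
                children.append(next_id)
                next_id += 1
        layers.append(children)
    return layers, parents


layers, parents = simulate_pedigree(num_founders=10, num_generations=4, seed=1)
print([len(layer) for layer in layers])
```

With mean family size 2, the population can shrink or die out in small runs, so later layers may be empty.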
Working Backwards
• We construct the pedigree generation by generation,
working backwards from the current generation.
• It suffices to determine, in each generation, which
individuals are siblings.
• Two alleles are identical by descent (IBD) if they are
inherited from the same allele in the founding
generation.
• To test whether two individuals in generation g are
likely to be siblings, we observe the amount of IBD
between their descendants in the current generation.
Inferring Siblinghood
• Problem: determine which individuals in
generation g are siblings.
• Using IBD, we construct a compatibility graph
with a vertex for each individual in generation
g, and edges indicating pairs of individuals
that are likely to be siblings on the basis of the
IBD of their descendants.
• Problem: Infer the siblinghood graph from the
compatibility graph.
Inferring Siblinghood
• Because of the monogamy assumption, the siblinghood
graph must be a union of cliques.
• Problem: Given a compatibility graph C determine the
“closest” siblinghood graph S.
• The algorithm maintains a partition of the vertices of
C. The parts of the partition are called quasi-cliques.
The score of a partition is A times the number of edges
of C whose end points lie in the same quasi-clique,
minus the number of non-edges of C whose end points
lie in the same quasi-clique. We seek a
partition of maximum score.
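The score defined above can be computed directly from its definition. The sketch below assumes the compatibility graph is given as a list of undirected edges and the partition as a list of vertex sets; the name `partition_score` and the toy graph are illustrative only.

```python
from itertools import combinations


def partition_score(edges, partition, A):
    """A * (edges inside parts) - (non-edges inside parts)."""
    edge_set = {frozenset(e) for e in edges}
    score = 0
    for part in partition:
        for u, v in combinations(part, 2):
            # Each within-part pair either earns A (edge of C) or costs 1 (non-edge).
            score += A if frozenset((u, v)) in edge_set else -1
    return score


# Toy compatibility graph (hypothetical): a triangle 1-2-3 plus a path 3-4-5.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
print(partition_score(edges, [{1, 2, 3}, {4, 5}], A=2))   # 8
print(partition_score(edges, [{1, 2, 3, 4}, {5}], A=2))   # 6
```

Pairs in different parts contribute nothing, so only within-part pairs are enumerated.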
Justifying the Scoring Function
• Assumptions: The compatibility graph C is obtained by
randomly perturbing the siblinghood graph S. S is a
random union of disjoint cliques with sizes uniformly
distributed between 1 and a parameter t.
• If u is adjacent to v in S then u is adjacent to v in C with
probability p; if u is not adjacent to v in S then u is
adjacent to v in C with probability q, where q < p.
• Under these assumptions maximizing the score
produces a siblinghood graph of maximum conditional
probability given C.
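One way to see where a score of the form "A × edges − non-edges" can come from is to write out the likelihood of C given S. The derivation below is a sketch under two added assumptions: pairs of vertices are perturbed independently, and the prior over S is ignored (treated as uniform), so the likelihood stands in for the conditional probability. The slides do not state the value of A; the expression obtained here is what these assumptions yield, with 0 < q < p < 1. The `align*` environment requires `amsmath`.

```latex
% E_in, N_in: edges and non-edges of C with both end points in the same clique of S.
% E, N: total numbers of edges and non-edges of C (fixed once C is given).
\begin{align*}
\Pr[C \mid S] &= p^{E_{\mathrm{in}}}\,(1-p)^{N_{\mathrm{in}}}\,
                 q^{E-E_{\mathrm{in}}}\,(1-q)^{N-N_{\mathrm{in}}},\\
\log \Pr[C \mid S] &= \underbrace{E\log q + N\log(1-q)}_{\text{independent of } S}
   \;+\; E_{\mathrm{in}}\log\frac{p}{q} \;-\; N_{\mathrm{in}}\log\frac{1-q}{1-p},\\
\frac{\log \Pr[C \mid S] - \bigl(E\log q + N\log(1-q)\bigr)}{\log\frac{1-q}{1-p}}
   &= A\,E_{\mathrm{in}} - N_{\mathrm{in}},
   \qquad A = \frac{\log(p/q)}{\log\bigl((1-q)/(1-p)\bigr)}.
\end{align*}
```

Since q < p, both logarithms are positive, so maximizing the likelihood over S is the same as maximizing A·E_in − N_in.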
Heuristic Algorithm
• The heuristic algorithm creates an initial partition
by greedily constructing disjoint quasi-cliques. It
then performs the following local operations to
improve the score:
Move a vertex;
Extract a vertex;
Split a quasi-clique;
Merge two quasi-cliques;
Restructure two quasi-cliques adjacent to a vertex v;
Dynamic programming: given a chain of quasi-cliques,
make an optimal simultaneous move of a
small set of vertices from each quasi-clique in the
chain to its successor quasi-clique.
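A stripped-down local search using only two of these operations (move a vertex, extract a vertex) can be sketched as follows. This is not the algorithm from the talk: it starts from singleton parts rather than a greedy construction, omits split, merge, restructure and the dynamic programming step, and the names `gain_from_joining` and `improve_partition` are hypothetical.

```python
def gain_from_joining(v, part, adj, A):
    """Score contributed by v when it sits with the other vertices of `part`."""
    others = part - {v}
    linked = len(adj[v] & others)
    return A * linked - (len(others) - linked)


def improve_partition(vertices, edges, A):
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)

    parts = [{v} for v in vertices]          # simplification: singleton start
    home = {v: i for i, v in enumerate(vertices)}

    improved = True
    while improved:                          # every accepted move raises the score
        improved = False
        for v in vertices:
            here = home[v]
            stay = gain_from_joining(v, parts[here], adj, A)
            best_gain, best_target = 0, None
            for i, part in enumerate(parts):  # "move a vertex"
                if i != here and part:
                    gain = gain_from_joining(v, part, adj, A) - stay
                    if gain > best_gain:
                        best_gain, best_target = gain, i
            if -stay > best_gain:             # "extract a vertex" into a new part
                parts.append(set())
                best_gain, best_target = -stay, len(parts) - 1
            if best_target is not None:
                parts[here].remove(v)
                parts[best_target].add(v)
                home[v] = best_target
                improved = True
    return [p for p in parts if p]


# Toy graph (hypothetical): clique {1,2,3}, clique {4,5,6} with edge 4-6 deleted,
# and one spurious edge 3-4 added.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (5, 6), (3, 4)]
print(improve_partition([1, 2, 3, 4, 5, 6], edges, A=2))
```

Because only strictly improving moves are accepted and the number of partitions is finite, the loop terminates, though possibly at a local optimum.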
Performance of the Algorithm
Typically the algorithm produces a partition with a
slightly higher score than the “true” partition
from which the compatibility graph was
generated by perturbation. However, the fraction
of vertices placed in the “correct” quasi-clique lies
between 93% and 98%, depending on the
fraction of edges deleted from cliques and the
fraction of edges added between cliques in
creating the compatibility graph C from the
siblinghood graph S.
The Colorful Subgraph Problem
• Input: A graph G and an assignment of a color
to each vertex.
• Find, if one exists, a connected subgraph H
containing exactly one vertex of each color.
• Optimization version:
Minimize a × (number of extra vertices)
+ b × (number of omitted colors)
• The problem is NP-hard, even on planar
graphs.
Interpretation
• In the protein-protein interaction (PPI) graph of a
species the vertices represent proteins and the
edges represent pairs of physically interacting
proteins.
• Given a connected set X of proteins performing a
regulatory function in species A, we seek a similar
connected set of proteins in species B. The color
of each protein in species B indicates its
similarity to a particular protein in X.
Dynamic Programming
• For each vertex v and set of colors S,
determine whether there is a tree containing
exactly one vertex of each color in S, no
vertices of any other color, and containing
vertex v. The computation is recursive,
running through sets S in order of increasing
cardinality. The running time is of order n·3^k,
where n is the number of vertices and k is the
number of colors.
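The recursion can be sketched with bitmask color sets. In this illustration a set S is reachable at vertex v if S can be split into S1 (containing v's color and reachable at v) and S2 (reachable at some neighbor of v); joining the two trees by the edge gives a tree for S. The function name `colorful_tree_exists` and the input encoding (vertices 0..n-1, colors 0..k-1) are assumptions, and this plain version enumerates splits per vertex and neighbor without any tuning toward the stated running time.

```python
def colorful_tree_exists(n, edges, color, k):
    """True if some tree in the graph contains exactly one vertex of each of k colors."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    full = (1 << k) - 1
    # reach[v]: color sets S (as bitmasks) for which a tree through v uses
    # exactly the colors in S, each once.
    reach = [{1 << color[v]} for v in range(n)]

    # Run through color sets in order of increasing cardinality.
    for S in sorted(range(1, full + 1), key=lambda m: bin(m).count("1")):
        if bin(S).count("1") < 2:
            continue
        for v in range(n):
            own = 1 << color[v]
            if not S & own:
                continue
            rest = S & ~own
            S2 = rest                      # S2 ranges over non-empty subsets of rest
            while S2:
                S1 = S ^ S2                # S1 keeps v's color
                if S1 in reach[v] and any(S2 in reach[u] for u in adj[v]):
                    reach[v].add(S)
                    break
                S2 = (S2 - 1) & rest
    return any(full in reach[v] for v in range(n))


# Path 0-1-2-3 with colors 0,1,0,2: vertices 1,2,3 form a colorful tree.
print(colorful_tree_exists(4, [(0, 1), (1, 2), (2, 3)], [0, 1, 0, 2], 3))  # True
```

Two trees with disjoint color sets are automatically vertex-disjoint, which is why joining them by a single edge never creates a cycle.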
Integer Programming plus Constraint
Generation
• We may assume that the desired connected subgraph is a tree T.
• Variables: x(i) = 1 iff vertex i is included in T;
y(e) = 1 iff edge e is included in T.
• Constraints:
Exactly one vertex of each color is included;
Exactly k-1 edges are included in T, where k is the number of colors;
If an edge is included then its endpoints are included;
For each set of colors X, the number of edges of T connecting
two vertices whose colors lie in X is at most |X| - 1.
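Written out as a 0-1 program, the formulation above might read as follows. The notation (c(i) for the color of vertex i, k for the number of colors, V_X for the vertices whose color lies in X, E(V_X) for the edges with both end points in V_X) is introduced here for the sketch. With one vertex per color, T has k vertices and therefore k − 1 edges. The `align*` environment requires `amsmath`.

```latex
\begin{align*}
& x_i \in \{0,1\}\ (i \in V), \qquad y_e \in \{0,1\}\ (e \in E) \\
& \sum_{i\,:\,c(i)=c} x_i = 1 && \text{for each color } c \\
& \sum_{e \in E} y_e = k - 1 \\
& y_e \le x_u, \quad y_e \le x_v && \text{for each edge } e = \{u,v\} \\
& \sum_{e \in E(V_X)} y_e \le |X| - 1 && \text{for each non-empty set of colors } X
\end{align*}
```

The last family has one constraint per set of colors, so there are exponentially many of them; constraint generation presumably starts without them and adds only those violated by the current solution.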
Performance
• An implementation of integer programming
plus constraint generation solves typical
instances with 100 vertices in less than a
minute.
• Using a heuristic not yet implemented, one
can solve typical instances with 100 vertices
by hand in 20-30 minutes.
Heuristic Algorithmic Strategy
• Repeat:
(1) Delete vertices with frequent colors;
(2) In the remaining graph, select a minimal set
of connected components covering all
infrequent colors;
(3) Insert a minimal set of vertices with frequent colors
to restore connectedness and cover all colors.
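Steps (1) and (2) of one pass can be sketched as below. Several choices here are assumptions rather than part of the talk: a color counts as "frequent" when it occurs on more than `max_count` vertices, the "minimal set" of components is approximated by a greedy cover, step (3) is left out, and the name `delete_and_cover` is hypothetical.

```python
from collections import Counter


def delete_and_cover(vertices, edges, color, max_count):
    """Steps (1)-(2): drop frequent colors, then cover the rest with components."""
    counts = Counter(color[v] for v in vertices)
    frequent = {c for c, cnt in counts.items() if cnt > max_count}
    kept = {v for v in vertices if color[v] not in frequent}      # step (1)

    adj = {v: set() for v in kept}
    for u, w in edges:
        if u in kept and w in kept:
            adj[u].add(w)
            adj[w].add(u)

    # Connected components of the remaining graph (depth-first search).
    components, seen = [], set()
    for start in sorted(kept):
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        components.append(comp)

    # Step (2), greedy stand-in: pick components until all infrequent colors are covered.
    needed = {color[v] for v in kept}
    chosen = []
    while needed:
        best = max(components, key=lambda c: len({color[v] for v in c} & needed))
        chosen.append(best)
        needed -= {color[v] for v in best}
    return frequent, chosen


# Toy path 0-1-...-6 (hypothetical colors); "T" occurs three times and is dropped.
color = {0: "a", 1: "b", 2: "T", 3: "c", 4: "T", 5: "a", 6: "T"}
edges = [(i, i + 1) for i in range(6)]
print(delete_and_cover(range(7), edges, color, max_count=2))
```

On the toy path the components after deletion are {0, 1}, {3} and {5}; the greedy cover keeps the first two and discards {5}, whose color is already covered.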
Example on Grid
W E L C O M E T
O T H E W E B S
I T E O F T H E
A N N U A L S Y
M P O S I U M O
N C O M B I N A
T O R I A L P A
T T E R N M A T
After One Iteration
WeBS
F t H
NNUaL SY
P S IU
NComBIN
R I LP
R N
After Two Iterations
W
F
n
A
P
I
COMB
E B
T H
l s Y
u