Ming Li Talk about Bioinformatics


Lecture 24: Coping with NPC and Unsolvable Problems
 When a problem is unsolvable, that's generally very bad
news: it means there is no general algorithm that is
guaranteed to solve the problem in all cases.
 However, it's rare that we actually want the capability of
solving a problem in all possible cases. We can:


 Specialize for particular applications.
 Try heuristics.
 For example, we know that compressing a string optimally is not solvable (we proved it), not even approximately. But everybody compresses data anyway, and some people are making money out of it (see the sketch below).
 Theory and practice are usually very far apart!
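
As a minimal illustration of that gap (a sketch assuming only Python's standard zlib module; the data is made up), an off-the-shelf compressor squeezes redundant data dramatically even though no algorithm can guarantee optimal compression:

```python
import zlib

# No algorithm can compute the shortest possible encoding of a string,
# yet standard compressors do very well on typical, redundant data.
text = b"abracadabra " * 100            # 1200 bytes, highly repetitive
packed = zlib.compress(text, 9)         # level 9 = best compression

print(len(text), "->", len(packed))     # 1200 -> a few dozen bytes
assert zlib.decompress(packed) == text  # lossless round trip
```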
When a problem is NP-hard …
 Similarly, when we prove a problem is NP-complete, that means that no one currently has a polynomial-time algorithm for the problem. But that's absolutely not a reason to give up.
 Theoretical proofs are often deceiving. For optimization problems, we are often willing to settle for solutions that are not the best possible, but come pretty close.
 For example, for the travelling salesman problem,
finding the tour of least cost is nice, but in real life we
would often be content with finding a tour that is close
to optimal. This leads to the idea of approximation
algorithms.
Exhaustive search
 Although exhaustive search is too slow for
large instances of NP-complete problems, as
the solution space can grow exponentially,
there are tricks that can speed up the
computation in many cases.
 For example, although the travelling salesman problem is NP-complete, we can find optimal travelling salesman tours for real-world instances with hundreds or even thousands of cities, by using some search techniques.
Backtracking
 Backtracking and exhaustive search are things we have “avoided” at all costs in this course.
 But are they really that bad?
Example. This often works.
For input
(x1 OR ~x2) AND (~x2 OR x4) AND (x1 OR x2 OR x3).
By setting x1 = 0 and x1 = 1, we reduce it to simpler formulas:

(x1 OR ~x2) AND (~x2 OR x4) AND (x1 OR x2 OR x3)
    x1 = 0  →  (~x2) AND (~x2 OR x4) AND (x2 OR x3)
    x1 = 1  →  (~x2 OR x4)
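
To make the idea concrete, here is a minimal backtracking SAT sketch in Python (the encoding and names like solve are mine, not the lecture's): a literal is a positive integer i for xi or -i for ~xi, and a clause is a set of literals.

```python
def simplify(clauses, lit):
    """Set literal `lit` true: drop satisfied clauses, shrink the rest.
    Returns None if some clause becomes empty (dead branch)."""
    out = []
    for c in clauses:
        if lit in c:
            continue                     # clause already satisfied
        reduced = c - {-lit}             # the opposite literal is now false
        if not reduced:
            return None                  # empty clause: this branch fails
        out.append(reduced)
    return out

def solve(clauses, assignment=()):
    """Backtracking search; returns a satisfying (partial) assignment or None."""
    if not clauses:
        return dict(assignment)          # every clause satisfied
    var = abs(next(iter(clauses[0])))    # branch on a variable of the first clause
    for lit in (var, -var):              # try x = 1, then x = 0
        reduced = simplify(clauses, lit)
        if reduced is not None:
            result = solve(reduced, assignment + ((var, lit > 0),))
            if result is not None:
                return result
    return None                          # both branches failed: backtrack

# (x1 OR ~x2) AND (~x2 OR x4) AND (x1 OR x2 OR x3)
print(solve([{1, -2}, {-2, 4}, {1, 2, 3}]))  # a satisfying assignment, e.g. x1 = 1
```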
Branch and Bound
 Branch-and-bound is a natural idea applied to
optimization problems.
 The idea is that we keep track of the cost of the best solution or partial solution found so far, and reject a partial solution as soon as its estimated cost already exceeds that of the best solution known, since we can sometimes bound the cost of completing a partial solution.
Example. travelling salesman
 Suppose we have a partial solution given by a simple path from a to b, passing through the vertices of S, and denote it by [a, S, b].
 Extend this to a full tour by finding a path [b, V-(S ∪ {a,b}), a]. We do this extension edge by edge, so if there is an edge (b, c) in the graph then [a, S, b] gets replaced by [a, S ∪ {b}, c].
 How do we estimate the cost of a partial solution? Given a partial tour [a, S, b], the remainder of the tour is a path through V-S-{a,b}, plus edges from a and b to this part. Therefore its cost is at least the sum of the least-weight edge from a to V-S-{a,b}, the least-weight edge from b to V-S-{a,b}, and the weight of a minimum spanning tree of V-S-{a,b}, all of which are easy to compute.
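
The following Python sketch (all names and the 4-city distance matrix are hypothetical, not from the lecture) applies exactly this bound to prune partial tours:

```python
import math

def mst_weight(nodes, d):
    """Weight of a minimum spanning tree over `nodes` (Prim's algorithm)."""
    nodes = list(nodes)
    if len(nodes) <= 1:
        return 0
    in_tree, total = {nodes[0]}, 0
    while len(in_tree) < len(nodes):
        w, v = min((d[u][x], x) for u in in_tree for x in nodes if x not in in_tree)
        total += w
        in_tree.add(v)
    return total

def lower_bound(a, b, visited, d, V):
    """The lecture's bound for completing a partial tour from a to b:
    cheapest edge out of a + cheapest edge out of b + MST of the rest."""
    rest = [v for v in V if v not in visited]
    if not rest:
        return d[b][a]                       # only the closing edge remains
    return (min(d[a][v] for v in rest) + min(d[b][v] for v in rest)
            + mst_weight(rest, d))

def tsp(d, V):
    best = [math.inf, None]                  # best tour cost and tour so far

    def extend(path, visited, cost):
        a, b = path[0], path[-1]
        if cost + lower_bound(a, b, visited, d, V) >= best[0]:
            return                           # prune: cannot beat the best tour
        if len(path) == len(V):
            best[0], best[1] = cost + d[b][a], path + [a]
            return
        for c in V:                          # branch on each unvisited city
            if c not in visited:
                extend(path + [c], visited | {c}, cost + d[b][c])

    extend([V[0]], {V[0]}, 0)
    return best[0], best[1]

# hypothetical symmetric 4-city instance
d = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
print(tsp(d, [0, 1, 2, 3]))                  # (18, [0, 1, 3, 2, 0])
```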
Approximation Algorithms.
 If we cannot solve it exactly, we can find
approximate solutions.
 For example, for the travelling salesman
problem, we might settle for a tour that is
within some constant factor of the best.
 For minimization problems, the approximation ratio of an optimization algorithm A is defined to be
cost of A's solution / cost of optimal solution
For example, if A returns a tour of cost 12 on an instance whose optimal tour costs 10, the ratio is 1.2.
Vertex Cover: approx. ratio 2
 A matching in a graph is a subset of the edges such that no
vertex appears in two or more edges.
 A matching is maximal if one cannot add any new edges to it
and still preserve the matching property.
 Maximal matching Alg.: examine the edges consecutively and add each edge to our matching if it is disjoint from the edges already chosen. This runs in polynomial time (a sketch follows below).
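
A sketch of this in Python (the function name and the path example are mine): greedily take a maximal matching and output all matched endpoints.

```python
def vertex_cover_2approx(edges):
    """Greedy maximal matching; the matched endpoints form a vertex cover
    of size at most twice the minimum."""
    matched = set()                      # vertices already used by the matching
    cover = []
    for u, v in edges:                   # examine the edges consecutively
        if u not in matched and v not in matched:
            matched.update((u, v))       # edge is disjoint: add it to the matching
            cover.extend((u, v))         # both endpoints go into the cover
    return cover

# hypothetical example: the path 1-2-3-4-5
print(vertex_cover_2approx([(1, 2), (2, 3), (3, 4), (4, 5)]))
# -> [1, 2, 3, 4]; an optimal cover {2, 4} has size 2, so the ratio here is exactly 2
```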
Ratio-2 Vertex Cover continues …
Clearly
 (1) The number of vertices in any vertex cover of
G is at least as large as the number of edges in
any maximal matching.
 (2) The set of all endpoints of a maximal matching
is a vertex cover.
 So, letting M be the set of edges in a maximal matching and C be a smallest vertex cover, we have |C| ≥ |M| by (1), and 2|M| ≥ |C| by (2), since the 2|M| endpoints of M form a cover. Our algorithm outputs exactly those 2|M| ≤ 2|C| vertices, so it constructs a vertex cover with an approximation ratio bounded by 2.
 Dinur and Safra proved you can’t do better than
1.3606 unless P=NP.
Shortest Common Superstring
 Approximation algorithms are usually simple, but the proofs of their approximation guarantees are usually hard. Here is one example.
 Given n strings s1, s2, … , sn, find the shortest common superstring s. (I.e., each si is a substring of s, and s is the shortest such string.)
 The problem is NP-complete.
 Greedy Algorithm: keep merging the pair of strings with maximum overlap until only one is left (sketched below).
Theorem: This Greedy algorithm is 4 × optimal.
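
A direct Python sketch of this greedy (the names and toy strings are mine; strings contained in others are not specially handled, for simplicity):

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(strings):
    """Repeatedly merge the pair with maximum overlap until one string is left."""
    strings = list(strings)
    while len(strings) > 1:
        # pick the ordered pair (i, j) with the largest overlap
        k, i, j = max((overlap(strings[i], strings[j]), i, j)
                      for i in range(len(strings))
                      for j in range(len(strings)) if i != j)
        merged = strings[i] + strings[j][k:]     # glue, dropping the overlap
        strings = [s for t, s in enumerate(strings) if t not in (i, j)]
        strings.append(merged)
    return strings[0]

print(greedy_scs(["CATG", "ATGC", "TGCA"]))      # "CATGCA"
```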
Widely used in DNA sequencing
 It is widely used in DNA shotgun sequencing (especially with the new generation of sequencers, which promise to sequence fragments ~40k bp long):
 Make many copies (single strand).
 Cut them into fragments of length ~500.
 Sequence each of the fragments.
 Then assemble all fragments into the shortest common superstring by GREEDY: repeatedly merge the pair with maximum overlap until one string remains.
 Dec. 2001 release of the mouse genome: 33 million reads, covering 2.5G bases (10× coverage).
Many have worked on this:
 Many people (well-known scientists, including one author of our textbook) have worked on this and improved the constant from 4 to:
 3
 2.89
 2.81
 2.79
 2.75
 2.66
 2.5
Theorem. GREEDY achieves 4n, where n = opt.
Proof by Picture:
Given S = {s1, … , sm}, construct a graph G:
 Nodes are s1, … , sm.
 Edges: if si overlaps sj, add an edge from si to sj with weight pref(si, sj), the prefix length of si with respect to sj; i.e., |si| = pref(si, sj) + (overlap length of si with sj).
 |SCS(S)| = length of the shortest Hamiltonian cycle in G.
 Greedy Modified: find all cycles of minimum weight in G, then open each cycle and concatenate to obtain the final superstring. (Note: regular greedy has no cycles.)
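
In code, the edge weight pref(si, sj) is just |si| minus the overlap; a small sketch (with hypothetical strings, repeating the overlap routine from the greedy sketch above):

```python
def overlap(a, b):
    """Longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def pref(si, sj):
    """Prefix-graph edge weight, so that |si| = pref(si, sj) + overlap(si, sj)."""
    return len(si) - overlap(si, sj)

S = ["CATG", "ATGC", "TGCA"]                    # hypothetical input strings
G = {(a, b): pref(a, b) for a in S for b in S if a != b}
print(G[("CATG", "ATGC")])                      # 1: the overlap "ATG" has length 3
```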
This minimum cycle exists
 Assume the initial Hamiltonian cycle C has w(C) = n.
 Then merging si with sj is equivalent to breaking C into two cycles C1 and C2. We have:
w(C1) + w(C2) ≤ n
 Proof: We merged (si, sj) because they have maximum overlap. [Figure: in C, si was followed by s' and s'' was followed by sj; adding the edges (si, sj) and (s'', s') and removing (si, s') and (s'', sj) splits C into the cycles C1 and C2.] Since s' and s'' overlap at least that much, the sum of the new (red) edges is no more than the sum of the removed (green) edges:
d(si, sj) + d(s'', s') ≤ d(si, s') + d(s'', sj)
 Continue this process; we end with self-cycles C1, C2, C3, C4, … such that Σ w(Ci) ≤ n.

Then we open the cycles and concatenate:
 Let wi = w(Ci) and Li = |longest string in Ci|.
 |open Ci| ≤ wi + Li.
 We know n ≥ Σ wi.
 Lemma. The longest strings S1 and S2 of two cycles overlap by at most w1 + w2.
 Σ (Li - 2wi) ≤ n, by the Lemma, since all the Li's must appear in the final SCS and consecutive ones overlap by at most wi + wi+1.
 |Greedy'(S)| ≤ Σ (Li + wi) = Σ (Li - 2wi) + Σ 3wi ≤ n + 3n = 4n.
QED
Open Question
 Show Greedy achieves approximation ratio 2.