Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department

Combinatorial Optimization Problems
in Computational Biology
Ion Mandoiu
CSE Department
What Is Computational Biology?
• [G. Lancia] “Study of mathematical and computational problems of
modeling biological processes in the cell, removing experimental
errors from genomic data, interpreting the data and providing theories
about their biological relations”
• Multidisciplinary field at the intersection of computer science, biology,
discrete mathematics, statistics, optimization, chemistry, physics, …
5 Steps to Solving CB Problems
1. Understand biological problem
2. Represent biological data as mathematical objects
(strings, sets, graphs, permutations, …), map biological
relations into mathematical relations, and formulate the
biological question as optimization or feasibility problem
3. Study computational complexity: Polynomial? NP-hard?
4. Develop efficient algorithms
– If in P, find fast and memory efficient exact algorithms
– If NP-hard, find practical exact algorithms and/or algorithms
with provable approximation guarantees
5. Validate algorithms on biological data
Outline
• Shortest Superstring
• Sequencing by Hybridization
• PCR Primer Selection
Shotgun Sequencing
Shortest Superstring
• Given: set of strings s1, s2, …, sn
• Find: shortest string s containing each si as a substring
• Example:
Set of strings: 000, 001, 010, 011, 100, 101, 110, 111
Superstring: 0001110100
• NP-Hard [Maier&Storer77]
Greedy Merging Algorithm
- S = {s1, s2, …, sn}
- While |S| > 1 do
    - Find s, t in S with longest overlap
    - S = ( S \ {s,t} ) ∪ { s overlapped with t to maximum extent }
- Output final string
• Approximation factor no better than 2:
– s1 = ab^k, s2 = b^k c, s3 = b^{k+1}
– Greedy output (if s1 and s2 are merged first): ab^k c b^{k+1}, length = 2k+3
– Optimum: ab^{k+1}c, length = k+3
• Open problem: prove that the greedy superstring is always at most
twice as long as the optimum
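The greedy merging loop above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function names `overlap_len` and `greedy_superstring` are made up for this example, the input is assumed to contain no string that is a substring of another, and ties between equally long overlaps are broken by input order (which is what triggers the bad case on the three-string example).

```python
def overlap_len(s, t):
    """Length of the longest suffix of s that is a prefix of t.

    Assumption: only overlaps shorter than both strings are considered.
    """
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the ordered pair with the longest overlap."""
    pool = list(strings)
    while len(pool) > 1:
        best = (-1, 0, 1)  # (overlap length, index of s, index of t)
        for i in range(len(pool)):
            for j in range(len(pool)):
                if i != j:
                    k = overlap_len(pool[i], pool[j])
                    if k > best[0]:  # strict: the first maximal pair wins ties
                        best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]  # s overlapped with t to maximum extent
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)]
        pool.append(merged)
    return pool[0]


if __name__ == "__main__":
    k = 5
    bad_case = ["a" + "b" * k, "b" * k + "c", "b" * (k + 1)]
    result = greedy_superstring(bad_case)
    print(result, len(result))  # length 2k+3, versus k+3 for the optimum
```

With this tie-breaking rule the first two strings are merged first, so the output has length 2k+3. Another tie-breaking rule could reach the optimum on the same input.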
Overlap & Prefix of 2 strings
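• Overlap of s and t: longest suffix of s that is a prefix of t
• Prefix of s and t: s after removing overlap(s,t)
With k = |overlap(s,t)|:
s = a_1 a_2 … a_{|s|-k} a_{|s|-k+1} … a_{|s|}
t = b_1 … b_k b_{k+1} … b_{|t|}
prefix(s,t) = a_1 … a_{|s|-k}
overlap(s,t) = a_{|s|-k+1} … a_{|s|} = b_1 … b_k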
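A small Python sketch of these two definitions may help. The function names mirror the slide's notation. As an assumption beyond the slide, the overlap is required to be shorter than both strings, so that overlap(s,s) is the longest proper border of s rather than s itself.

```python
def overlap(s, t):
    """Longest suffix of s that is also a prefix of t.

    Assumption: the overlap must be shorter than both s and t.
    """
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return s[-k:]
    return ""


def prefix(s, t):
    """s with overlap(s, t) removed from its end."""
    return s[:len(s) - len(overlap(s, t))]


if __name__ == "__main__":
    s, t = "abbb", "bbbc"
    print(overlap(s, t))     # bbb
    print(prefix(s, t))      # a
    print(prefix(s, t) + t)  # abbbc, the shortest string starting with s and ending with t
```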
Lower Bound on OPT
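OPT = prefix(s1,s2) … prefix(sn-1,sn) prefix(sn,s1) overlap(sn,s1)
where s1, …, sn are numbered in order of their first occurrence in the optimum superstring.
The prefix terms have total length equal to the cost of the tour 1→2→…→n→1 in the prefix graph
(edge i→j has weight |prefix(si,sj)|), so |OPT| ≥ cost of a minimum-weight tour.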
The Cycle Cover Algorithm
• Computing a minimum-weight TSP tour in the prefix graph is NP-hard
• Key idea: lower-bound OPT using a min-weight cycle cover
• For every cycle c = (i1→i2→…→il→i1), σ(c) := prefix(si1,si2) …
prefix(sil,si1) si1 is a superstring of si1, …, sil
• Cycle cover algorithm:
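One common way to state the cycle cover algorithm is: build the prefix graph, find a minimum-weight cycle cover, and output the concatenation of one superstring per cycle (the prefixes around the cycle followed by the cycle's first string). The Python sketch below follows that outline under simplifying assumptions: the names are illustrative, self-loops are allowed, and the cycle cover is found by brute force over all permutations. That is only feasible for a handful of strings. A practical version would use a polynomial-time assignment solver instead.

```python
from itertools import permutations


def overlap_len(s, t):
    """Length of the longest suffix of s that is a prefix of t (shorter than both)."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def prefix(s, t):
    return s[:len(s) - overlap_len(s, t)]


def min_weight_cycle_cover(strings):
    """Brute-force search: a cycle cover is a permutation i -> p[i]."""
    n = len(strings)
    w = [[len(prefix(strings[i], strings[j])) for j in range(n)] for i in range(n)]
    best = min(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))
    weight = sum(w[i][best[i]] for i in range(n))
    # Split the permutation into its cycles.
    seen, cycles = set(), []
    for start in range(n):
        if start not in seen:
            cycle, i = [], start
            while i not in seen:
                seen.add(i)
                cycle.append(i)
                i = best[i]
            cycles.append(cycle)
    return cycles, weight


def cycle_cover_superstring(strings):
    cycles, weight = min_weight_cycle_cover(strings)
    pieces = []
    for cycle in cycles:
        # Prefixes around the cycle, then the cycle's first string.
        around = "".join(
            prefix(strings[cycle[t]], strings[cycle[(t + 1) % len(cycle)]])
            for t in range(len(cycle)))
        pieces.append(around + strings[cycle[0]])
    return "".join(pieces), weight


if __name__ == "__main__":
    kmers = ["000", "001", "010", "011", "100", "101", "110", "111"]
    superstring, weight = cycle_cover_superstring(kmers)
    print(superstring, len(superstring), weight)
```

On the eight binary strings of length 3 the cycle cover weight is 8, below the optimum superstring length of 10, which is consistent with using the cycle cover as a lower bound.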
The Cycle Cover Algorithm
Theorem [Blum,Jiang,Li,Tromp,Yannakakis94]: The cycle cover
algorithm gives a factor 4 approximation.
• Length of output is wt(C) + Σ_i |ri|,
where ri is a “representative” string from cycle ci
• wt(C) ≤ OPT
- If each ri is no longer than wt(ci) ⇒ output within factor 2 of optimum!
- ri can be much longer than wt(ci) (periodic strings!)
- It can be shown that Σ_i |ri| ≤ OPT + 2 wt(C) ⇒ factor 4
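The factor 4 follows by chaining the bounds on this slide. The derivation below assumes the output length is the cycle cover weight plus the total length of the representative strings, and uses the two inequalities wt(C) ≤ OPT and Σ|r_i| ≤ OPT + 2 wt(C):

```latex
\begin{align*}
|\text{output}| &= \mathrm{wt}(C) + \sum_i |r_i| \\
                &\le \mathrm{wt}(C) + \mathrm{OPT} + 2\,\mathrm{wt}(C) \\
                &= \mathrm{OPT} + 3\,\mathrm{wt}(C) \\
                &\le 4\,\mathrm{OPT}.
\end{align*}
```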
Improved Algorithm
Theorem [Blum,Jiang,Li,Tromp,Yannakakis 94]: The improved
algorithm gives factor 3 approximation.
The proof uses the fact that the greedy algorithm achieves at least ½ of the
optimum compression.
Current best approximation factor is 2.596 [Breslauer,Jiang,Jiang97]
Sequencing by Hybridization
• Exploits parallel hybridization in DNA arrays
• All 4^k probes of a certain length k (k = 8 to 10) are
synthesized on the array
• Target DNA hybridizes at locations containing
probes complementary to its k-substrings
• Sequencing by Hybridization (SBH) Problem:
Reconstruct target DNA given its k-length
substrings (spectrum)
Mathematical Formulation of SBH
• SBH is a special case of the shortest superstring: solution
corresponds to a Hamiltonian path (NP-hard to find) in the “prefix
length = 1” graph
• [Pevzner 89] SBH is equivalent to finding an Eulerian path (easy
to find in linear time) in the following graph:
– Vertices are all (k-1)-tuples
– Directed edge between two (k-1)-tuples u and v iff there is a k-length string
in the spectrum whose first k-1 symbols match u and last k-1 symbols match v
• Choose the right mathematical abstraction!
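The Eulerian path formulation can be sketched in Python as follows. This is an illustration under idealized assumptions: the spectrum is error-free, each k-length substring occurs exactly once in the target, and an Eulerian path exists. The function name `sbh_reconstruct` is made up for this example, and the path is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def sbh_reconstruct(spectrum):
    """Reconstruct a string whose k-length substrings are exactly the spectrum."""
    graph = defaultdict(list)   # (k-1)-tuple -> list of successor (k-1)-tuples
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in spectrum:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    starts = [u for u in sorted(balance) if balance[u] == 1]
    start = starts[0] if starts else sorted(balance)[0]
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(spectrum) + 1:
        raise ValueError("spectrum does not admit an Eulerian path")
    # Spell the path: first vertex, then the last symbol of each later vertex.
    return path[0] + "".join(v[-1] for v in path[1:])


if __name__ == "__main__":
    spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
    target = sbh_reconstruct(spectrum)
    print(target)  # one string with exactly this spectrum; it need not be unique
```

Several different strings can share the same spectrum, so the reconstruction is one valid answer rather than the only one.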
Polymerase Chain Reaction
…
Primer Selection Problem
[Figure: the i-th amplification locus, with the forward primer on forward sequence fi and the reverse primer on reverse sequence ri (5' and 3' ends marked); spans labeled L+x]
• Given:
– Pairs of forward/reverse sequences for the n amplification loci
– Primer length k and amplification upper bound L
• Find:
– Minimum set of primers S of length k such that, for each
amplification locus, there are two primers in S hybridizing to the
forward and reverse sequences within a distance of L of each other
Previous Work
• [Pearson et al. 96] Logarithmic approximation factor using
greedy set cover algorithm for a formulation that does not
distinguish between forward and reverse primers
• Similar formulations used by [Linhart&Shamir’02, Souvenir et al.’03]
• To enforce the bound of L on amplification length, forward
and reverse sequences must be truncated to length L/2
• [Fernandes&Skiena’02] model primer selection as a minimum
multicolored subgraph problem:
• Vertices are candidate primers
• Add edge colored by color i between primers u and v if they hybridize
to i-th forward and reverse sequences within a distance of L
• Find minimum size set of vertices inducing edges of all colors
• No non-trivial approximation factor proposed
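The greedy set cover approach mentioned in the first bullet can be sketched in Python. This is a simplified illustration, not the method of any of the cited papers: hybridization is modeled as an exact substring match, forward and reverse sequences are not distinguished, and the function name `greedy_primer_cover` is made up for this example.

```python
def greedy_primer_cover(sequences, k):
    """Pick length-k primers greedily until every sequence is covered.

    Assumption: a primer "hybridizes" to a sequence if it occurs in it
    as an exact substring.
    """
    # Map each candidate primer to the set of sequences it covers.
    covers = {}
    for idx, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            covers.setdefault(seq[pos:pos + k], set()).add(idx)

    uncovered = set(range(len(sequences)))
    chosen = []
    while uncovered:
        if not covers:
            raise ValueError("no candidate primers of length k")
        # Greedy step: take the primer covering the most uncovered sequences.
        best = max(sorted(covers), key=lambda p: len(covers[p] & uncovered))
        if not covers[best] & uncovered:
            raise ValueError("some sequence is shorter than k")
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


if __name__ == "__main__":
    sequences = ["ACGTAC", "TTACGA", "GGGTTT"]
    print(greedy_primer_cover(sequences, 3))
```

Sorting the candidates before taking the maximum only makes tie-breaking deterministic. It does not affect the logarithmic approximation guarantee of greedy set cover.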
Improved Approximations
• [Konwar,M,Russell,Shvartsman 04]
• Logarithmic approximation factor using “potential function” greedy for
the bounded amplification length primer selection problem
• O(L ln n) approximation factor based on randomized rounding for the
minimum multicolored subgraph problem of [Fernandes&Skiena’02]