Transcript Talk

Improved Algorithms for Inferring
the Minimum Mosaic of a Set of
Recombinants
Yufeng Wu and Dan Gusfield
UC Davis
CPM 2007
Recombination
• Recombination: one of the principal genetic forces
shaping sequence variations within species.
• Two equal-length sequences generate a new equal-length
sequence.
Parent 1:    110001111111001
Parent 2:    000110000001111
Recombinant: 11000|0000001111
(prefix from parent 1, suffix from parent 2, breakpoint at |)
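A minimal sketch of the single-breakpoint recombination described on this slide. The function name `recombine` and the 0-based `cut` argument are illustrative choices, not names from the talk.

```python
def recombine(first, second, cut):
    """Return the prefix of `first` (before `cut`) joined to the suffix of `second`."""
    if len(first) != len(second):
        raise ValueError("recombination needs two equal-length sequences")
    return first[:cut] + second[cut:]


# The slide's example: breakpoint after the fifth site.
child = recombine("110001111111001", "000110000001111", 5)
```

The result has the same length as both parents, which is the property the slide states.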
Founders and Mosaic
• Current sequences are descendants of a small
number of founders.
– A current sequence is composed of blocks from the
founders, due to recombination.
– No mutations since formation of founders.
[Figure: founders 000000 and 111111, and the sampled sequences in the current population (among them 001111, 111100 and 011100), drawn as a mosaic of founder blocks with a breakpoint wherever the source founder changes.]
The Minimum Mosaic Problem
• Given a set of aligned binary sequences from the current
population, and assuming the number of founders is
known to be Kf, find a set of founders and a mosaic
with the minimum number of breakpoints.
Example, assuming Kf = 3:
Input sequences:
1101101
1010001
0111111
0110100
1100011
Three founders:
1101111
1010001
0110100
Four breakpoints: the minimum over all possible sets of three founders.
Status of the Minimum Mosaic Problem
• First studied by E. Ukkonen (WABI 2002).
– Dynamic programming method. Not practical when
the number of rows is more than 20 and Kf >2.
• No polynomial-time algorithm was known
even when Kf is small. No NP-completeness
result is known.
• Our results:
– A simple polynomial-time algorithm for Kf = 2 case.
– Exact and practical method for data of medium
range for Kf ≥ 3.
The Two-Founder Case
Input sequences:
010111111
010101100
110000111
110111101
100100101
Remove uniform columns:
0111111
0110100
1100011
1111101
1010001
Study pairs of neighboring columns
Key: at columns 1 and 2, the founders are either 01/10 or 00/11.
There are two rows with 00/11, and three rows with 01/10. So, at least
two breakpoints are needed between columns 1 and 2, with founders as 01/10.
≥ 2 breakpoints between c1 and c2
≥ 2 breakpoints between c2 and c3
The Two-Founder Case (Cont.)
# breakpoints between two neighboring columns (c1 to c7): 2, 2, 2, 1, 2, 2
Founders built from the local founders (columns c1 to c5 shown):
0 1 1 1 1
1 0 0 0 0
At least 2 + 2 + 2 + 1 + 2 + 2 = 11 breakpoints are needed.
On the other hand, we can construct two founders that use the same
locally optimal founders at every pair of columns, and thus 11 breakpoints is the global optimum.
No matter which founder states are chosen for the previous column, we can
always choose the needed founders for the current column.
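The two-founder procedure on these slides can be sketched as follows. This is an illustrative reading of the slides, not the authors' code. The name `two_founder_mosaic` is hypothetical, and fixing founder 1 to 0 in the first kept column is an arbitrary symmetric choice.

```python
def two_founder_mosaic(rows):
    """Return (min_breakpoints, founder1, founder2) over the non-uniform columns."""
    # Uniform columns never force a breakpoint, so drop them first.
    columns = [col for col in zip(*rows) if len(set(col)) > 1]
    if not columns:
        return 0, "", ""
    founder1 = ["0"]  # by symmetry, let founder 1 carry 0 in the first kept column
    total = 0
    for left, right in zip(columns, columns[1:]):
        same = sum(a == b for a, b in zip(left, right))  # rows showing 00 or 11
        diff = len(rows) - same                          # rows showing 01 or 10
        total += min(same, diff)  # the minority rows each need one breakpoint
        if same >= diff:
            founder1.append(founder1[-1])  # local founders 00/11: keep the state
        else:
            founder1.append("1" if founder1[-1] == "0" else "0")  # 01/10: flip
    founder1 = "".join(founder1)
    founder2 = "".join("1" if ch == "0" else "0" for ch in founder1)
    return total, founder1, founder2


rows = ["010111111", "010101100", "110000111", "110111101", "100100101"]
total, f1, f2 = two_founder_mosaic(rows)
```

On the slide's five input rows this gives 11 breakpoints and founders that start 01111 and 10000, in line with the figure.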
Three or More Founders: Assuming
Known Founders
Input sequences:
1101101
1010001
0111111
0110100
1100011
Three founders:
1101111
1010001
0110100
Founder mapping (shown for input sequence 1101101)
With known founders, we can
minimize the breakpoints for each
sequence independently, and thus also minimize
the total number of breakpoints.
For each input sequence, starting
from the left, insert a breakpoint at
the end of the longest segment
matching one founder.
Founder mapping: for each
position c in an input sequence s,
the founder from which s[c] takes its
value.
Example founder mapping: 1101101 takes 11011 from Founder 1, then a breakpoint, then 01 from Founder 2.
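The left-to-right rule on this slide can be sketched as below, under the assumption that every position of the input is matched by at least one founder. The names `greedy_mapping` and `total_breakpoints` are illustrative and do not come from the talk.

```python
def greedy_mapping(seq, founders):
    """Split seq into segments (founder_index, start, end), end exclusive,
    always taking the founder that matches the longest stretch from `start`."""
    segments = []
    pos = 0
    while pos < len(seq):
        best_founder, best_end = None, pos
        for index, founder in enumerate(founders):
            end = pos
            while end < len(seq) and founder[end] == seq[end]:
                end += 1
            if end > best_end:
                best_founder, best_end = index, end
        if best_founder is None:
            raise ValueError(f"no founder matches position {pos}")
        segments.append((best_founder, pos, best_end))
        pos = best_end
    return segments


def total_breakpoints(seqs, founders):
    """Each sequence needs one breakpoint fewer than it has segments."""
    return sum(len(greedy_mapping(s, founders)) - 1 for s in seqs)


inputs = ["1101101", "1010001", "0111111", "0110100", "1100011"]
founders = ["1101111", "1010001", "0110100"]
mapping = greedy_mapping("1101101", founders)
total = total_breakpoints(inputs, founders)
```

With the founders listed on the slide, the first input maps to founder 1 for five positions and founder 2 for the last two, and the total over all five inputs is four breakpoints.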
Enumerating Founders for the Founder-Unknown Case
In reality, founders are not known. A straightforward way is to simply
enumerate all possible sets of founders, and then run the previous
method to find the minimum mosaic.
[Figure: the six founder settings at one column for three founders: 100, 010, 001, 011, 101, 110.]
At each column, there are 2^Kf - 2
founder settings.
Let m be the number of columns. Fully enumerating all possible sets of
founders takes on the order of 2^(m*Kf) time. Infeasible when m or Kf is large.
Need more ideas to develop a practical method. First, we do the
enumeration in the form of search paths in a search tree.
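To make the cost of straightforward enumeration concrete, here is a small brute-force sketch. It tries every founder set and scores each with the left-to-right breakpoint rule. All function names are illustrative, and the approach is only usable for tiny inputs.

```python
from itertools import product


def founder_settings(kf):
    """All 0/1 states of kf founders at one column, excluding all-0 and all-1."""
    return [s for s in product("01", repeat=kf) if len(set(s)) == 2]


def breakpoints(seq, founders):
    """Fewest breakpoints for one sequence, by always taking the longest match."""
    count, pos = 0, 0
    while pos < len(seq):
        end = pos
        for founder in founders:
            stop = pos
            while stop < len(seq) and founder[stop] == seq[stop]:
                stop += 1
            end = max(end, stop)
        if end == pos:
            raise ValueError("no founder matches this position")
        if end < len(seq):
            count += 1
        pos = end
    return count


def brute_force(rows, kf):
    """Try all (2^kf - 2)^m founder sets; return (min_breakpoints, founders)."""
    best = None
    for columns in product(founder_settings(kf), repeat=len(rows[0])):
        founders = ["".join(col[f] for col in columns) for f in range(kf)]
        total = sum(breakpoints(row, founders) for row in rows)
        if best is None or total < best[0]:
            best = (total, founders)
    return best


best_total, best_founders = brute_force(["00", "11", "01"], 2)
```

The loop visits one founder set per combination of column settings, so its running time grows exponentially with the number of columns, which is the blowup the slide warns about.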
Search Paths and Search Tree
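Assume three founders.
[Figure: search tree over founder settings. The root branches on the founder setting at column one; each deeper level (c1, c2, c3, ...) adds the founder setting at the next column, so a search path gives the founder settings up to the current column. Each node is labeled with the total number of breakpoints up to the current column, for example 5 for the founder settings 000 / 011 / 100 up to column 3.]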
On-line computation:
Compute partial
solution up to the
current column for
speedup.
The founder-known method can be
run with partially-known founders!
It works but there is exponential blowup of the
search paths!
Obvious idea to reduce search space: branch and
bound (compute a lower bound and …).
But we found that a different idea is more useful.
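One possible way to organize the on-line computation is sketched below. For each row, keep the fewest breakpoints so far for every founder the row could currently be copying from, and update that table when a search path is extended by one column. This is an assumption about how the partial solution might be stored, not the authors' implementation. All names are illustrative. The only pruning used is the trivial observation that the running total never decreases.

```python
from itertools import product

INF = float("inf")


def founder_settings(kf):
    """All non-uniform 0/1 states of kf founders at one column."""
    return [s for s in product("01", repeat=kf) if len(set(s)) == 2]


def extend(costs, setting, column):
    """Advance the partial solution by one column.

    costs[r][f] is the fewest breakpoints for row r so far if it is currently
    copied from founder f (INF if founder f does not match row r there).
    """
    new_costs = []
    for row_costs, value in zip(costs, column):
        switch = min(row_costs) + 1  # cheapest way to arrive through a breakpoint
        new_costs.append([min(c, switch) if state == value else INF
                          for c, state in zip(row_costs, setting)])
    return new_costs


def search(rows, kf):
    """Depth-first search over founder settings; returns (breakpoints, founders)."""
    columns = list(zip(*rows))
    settings = founder_settings(kf)
    best_total, best_path = INF, None

    def visit(depth, costs, path):
        nonlocal best_total, best_path
        total = sum(min(row_costs) for row_costs in costs)
        if total >= best_total:  # the running total can only grow further down
            return
        if depth == len(columns):
            best_total, best_path = total, path
            return
        for setting in settings:
            visit(depth + 1, extend(costs, setting, columns[depth]), path + [setting])

    # Before the first column every row may start on any founder at no cost.
    visit(0, [[0] * kf for _ in rows], [])
    founders = ["".join(s[f] for s in best_path) for f in range(kf)]
    return best_total, founders


rows = ["1101101", "1010001", "0111111", "0110100", "1100011"]
total, founders = search(rows, 3)
```

On the five-row example used earlier in the talk, this search returns four breakpoints for three founders. The number of search paths still grows exponentially, which is what motivates dropping beaten paths.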
Dropping Search Paths that are Beaten
by Another Search Path
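Assume three founders.
P1 and P2 are two search paths up to column 2.
[Figure: founder configurations of P1 (00 / 10 / 11, 0 breakpoints so far) and P2 (10 / 01 / 11, 6 breakpoints so far). An optimal search path following P2 ends with 40 breakpoints in total. Jumping from P1 onto that path costs <= 5 breakpoints, so P1 can finish with <= 39.]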
Can we say P1 is better than
P2? Not really, because
maybe P2 can lead to fewer
breakpoints later on.
But, suppose the number of
input sequences is 5. We can
then say P1 beats P2 (and
so drop P2). Why?
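A hedged reading of this rule: any row can switch onto any continuation of P2 at the cost of at most one extra breakpoint, so P1 can follow P2's best continuation while paying at most one breakpoint per input row. The sketch below encodes that comparison. The function name is illustrative, and the numbers 0 and 6 are the breakpoint counts the figure appears to show for P1 and P2.

```python
def beats_simple(bkpts_p1, bkpts_p2, num_rows):
    """True if P1 is never worse than P2, even after paying one extra
    breakpoint per row to join whatever continuation P2 would take."""
    return bkpts_p1 + num_rows <= bkpts_p2


drop_p2 = beats_simple(0, 6, 5)      # five input sequences, as on the slide
keep_both = beats_simple(0, 6, 7)    # with seven rows the rule is silent
```

When the test succeeds, every completion of P2 can be matched or improved by a completion of P1, so P2 can be dropped without losing an optimal solution.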
A More Powerful Beaten Rule
Still five input rows. Now we cannot say P1 beats P2 by the previous rule. But remember we
have the founder mapping…
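[Figure: P1 (00 / 10 / 11, 0 breakpoints) and P2 (00 / 01 / 11, 4 breakpoints), with the founder mapping of rows 1 to 5 under each path. Rows 2 and 4 match: these rows have the same founder mappings under P1 and P2, so if there is no breakpoint at P2, there is no breakpoint at P1 either.]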
P1: No extra breakpoints at rows 2, 4
So P1 beats P2 since at most 3
rows need extra breakpoints to get
onto a path from P2, and P2 uses 4
more breakpoints than P1.
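The refined rule can be sketched by counting only the rows whose founder mapping differs between the two paths. As a simplifying assumption, each path is summarized here by one founder index per row (the founder that row is currently copied from). The talk does not specify this representation, and the mapping values below are made up so that rows 2 and 4 match.

```python
def beats_with_mapping(bkpts_p1, bkpts_p2, mapping_p1, mapping_p2):
    """True if P1 can join any continuation of P2 without ending up worse.

    Rows mapped to the same founder in both paths need no extra breakpoint;
    every other row is charged one.
    """
    extra = sum(1 for a, b in zip(mapping_p1, mapping_p2) if a != b)
    return bkpts_p1 + extra <= bkpts_p2


# Hypothetical founder indices for rows 1..5; rows 2 and 4 agree.
mapping_p1 = [0, 1, 2, 2, 0]
mapping_p2 = [1, 1, 0, 2, 1]
p1_beats_p2 = beats_with_mapping(0, 4, mapping_p1, mapping_p2)
```

With three rows needing an extra breakpoint and P2 already four breakpoints behind, the test succeeds, whereas charging all five rows would not allow P2 to be dropped.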
How Practical Is Our Method?
Source of data and
image: UNC Chapel Hill
Five founders
20 rows, 36 columns
UNC’s heuristic solution:
54 breakpoints
Enumerating 2^180
founder states is
impossible!
Our method takes 5 minutes to find an optimal solution:
53 breakpoints. It is also practical for a 50x50 matrix with
four founders.
Open Problems and Software
• Is the minimum mosaic problem NP-complete?
• Is there a polynomial-time algorithm for the
minimum mosaic problem for a small number of
founders (say three to ten)?
• Software available at:
http://wwwcsif.cs.ucdavis.edu/~wuyu
• Thank you.