Transcript Talk
Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants
Yufeng Wu and Dan Gusfield, UC Davis
CPM 2007

Recombination
• Recombination: one of the principal genetic forces shaping sequence variation within species.
• Two equal-length sequences generate a new equal-length sequence.
[Figure: sequences 110001111111001 and 000110000001111 recombine at a breakpoint; the prefix 11000 of the first is joined to the suffix 0000001111 of the second.]

Founders and Mosaic
• Current sequences are descendants of a small number of founders.
– A current sequence is composed of blocks from the founders, due to recombination.
– No mutations since formation of founders.
[Figure: founders 000000 and 111111; sampled sequences in the current population (000000, 111111, 001111, 111100, 011100) drawn as a mosaic of founder blocks, with breakpoints marked.]

The Minimum Mosaic Problem
• Given a set of aligned binary sequences in the current population, and assuming the number of founders is known to be Kf, find a set of founders and the mosaic with the minimum number of breakpoints.
[Figure: example with Kf = 3. Input sequences 1101101, 1010001, 0111111, 0110100, 1100011; three founders 1101111, 1010001, 0110100. Four breakpoints: the minimum over all possible sets of three founders.]

Status of the Minimum Mosaic Problem
• First studied by E. Ukkonen (WABI 2002).
– Dynamic programming method. Not practical when the number of rows is more than 20 and Kf > 2.
• No polynomial-time algorithm was known even when Kf is small. No NP-completeness result is known.
• Our results:
– A simple polynomial-time algorithm for the Kf = 2 case.
– An exact and practical method for data of medium range for Kf ≥ 3.

The Two-Founder Case
[Figure: input rows 110111101, 100100101, 010111111, 010101100, 110000111. Remove uniform columns to get 1111101, 1010001, 0111111, 0110100, 1100011, then study pairs of neighboring columns. Candidate founders at columns c1 and c2: 01/10 or 00/11. Labels: 2 breakpoints between c1 and c2, 2 breakpoints between c2 and c3.]
Key: at columns 1 and 2, the founders are either 01/10 or 00/11. There are two rows with 00/11, and three rows with 01/10.
So at least two breakpoints are needed between columns 1 and 2, with founders 01/10.

The Two-Founder Case (Cont.)
[Figure: number of breakpoints between neighboring columns c1 to c7: 2, 2, 2, 1, 2, 2, with the local founders for each pair. The resulting founders begin 0 1 1 1 1 and 1 0 0 0 0 at c1 to c5.]
At least 2 + 2 + 2 + 1 + 2 + 2 = 11 breakpoints are needed. On the other hand, we can construct two founders that use the same locally optimal founders, and thus 11 breakpoints is the global optimum.
No matter which founder states are chosen for the previous column, we can always choose the needed founders for the current column.

Three or More Founders: Assuming Known Founders
[Figure: input sequences 1101101, 1010001, 0111111, 0110100, 1100011; three founders 1101111, 1010001, 0110100; founder mapping of 1101101, with a breakpoint where it switches from Founder 1 to Founder 2.]
With known founders, we can minimize the breakpoints for each sequence, and thus also minimize the total number of breakpoints.
For each input sequence, starting from the left, insert a breakpoint at the end of the longest segment matching one founder.
Founder mapping: at each position c in any input sequence s, which founder s[c] takes its value from.

Enumerating Founders for the Founder-Unknown Case
In reality, founders are not known. A straightforward way is to enumerate all possible sets of founders, and then run the previous method to find the minimum mosaic.
[Figure: the founder settings at one column for three founders: 100, 010, 001, 011, 101, 110.]
At each column, there are 2^Kf - 2 founder settings. Let m be the number of columns; fully enumerating all possible sets of founders takes O(2^(m*Kf)) time.
Infeasible when m or Kf is large. We need more ideas to develop a practical method. First, we do the enumeration in the form of search paths in a search tree.

Search Paths and Search Tree
Assume three founders; the search tree first branches on the founder setting at column one (c1), then on the setting at c2.
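The founder-known step from the "Three or More Founders: Assuming Known Founders" slide can be sketched as follows. This is a minimal illustration rather than the authors' software: the function names `founder_mapping` and `count_breakpoints` are hypothetical, and the data are the input sequences and founders from the slide's example.

```python
def founder_mapping(seq, founders):
    """Greedy left-to-right scan: always take the longest segment that
    matches a single founder, then start a new segment (a breakpoint).

    Returns a list of (founder_index, start, end) segments, end exclusive.
    """
    segments = []
    pos = 0
    while pos < len(seq):
        best_len, best_founder = 0, None
        for idx, founder in enumerate(founders):
            k = 0
            while pos + k < len(seq) and founder[pos + k] == seq[pos + k]:
                k += 1
            if k > best_len:
                best_len, best_founder = k, idx
        if best_len == 0:
            # No founder carries this value at this column.
            raise ValueError(f"no founder matches position {pos}")
        segments.append((best_founder, pos, pos + best_len))
        pos += best_len
    return segments


def count_breakpoints(rows, founders):
    """Total breakpoints: one fewer than the number of segments per row."""
    return sum(max(len(founder_mapping(r, founders)) - 1, 0) for r in rows)


# Example data taken from the slide (Kf = 3).
ROWS = ["1101101", "1010001", "0111111", "0110100", "1100011"]
FOUNDERS = ["1101111", "1010001", "0110100"]

print(founder_mapping("1101101", FOUNDERS))
print(count_breakpoints(ROWS, FOUNDERS))
```

On the slide's example this reproduces the four breakpoints quoted for the three founders. The scan only ever looks at founder values up to the current position, which is what lets the same routine run on partially known founders inside the search tree.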
[Figure: search tree of founder settings up to column 3 (c3); each search path is labeled with the total number of breakpoints up to the current column.]
On-line computation: compute a partial solution up to the current column for speedup. The founder-known method can be run with partially known founders!
It works, but there is an exponential blowup of the search paths!
An obvious idea to reduce the search space is branch and bound (compute a lower bound and …). But we found that a different idea is more useful.

Dropping Search Paths that are Beaten by Another Search Path
Assume three founders.
[Figure: founder configurations of two search paths, P1 and P2, and an optimal search path following P2. Labels in the figure: 0 bkpts, 6, <= 5 bkpts, <= 39, 40.]
P1 and P2 are two search paths up to column 2.
Can we say P1 is better than P2? Not really, because maybe P2 can lead to fewer breakpoints later on.
But suppose the number of input sequences is 5. We can then say P1 beats P2 (and so drop P2). Why? Each row needs at most one extra breakpoint to get from P1 onto a path that follows P2, so with 5 rows P1 can follow any continuation of P2 at a cost of at most 5 extra breakpoints.

A More Powerful Beaten Rule
Still five input rows. Now we cannot say P1 beats P2. But remember we have the founder mapping…
[Figure: founder configurations of P1 and P2, with the founder mappings of rows 1 to 5 under each. Rows 2 and 4 are marked as matching.]
These two rows have the same founder mappings. If there is no breakpoint at P2, there is no breakpoint at P1 either.
P1: no extra breakpoints at rows 2 and 4.
So P1 beats P2, since at most 3 rows need extra breakpoints to get onto a path from P2, and P2 uses 4 more breakpoints than P1.

How Practical Is Our Method?
[Figure: data set. Source of data and image: UNC Chapel Hill.]
Five founders; 20 rows, 36 columns.
UNC’s heuristic solution: 54 breakpoints.
Enumerating 2^180 founder states is impossible!
Our method takes 5 minutes to find the optimal solution: 53 breakpoints.
It is also practical for a 50x50 matrix with four founders.

Open Problems and Software
• Is the minimum mosaic problem NP-complete?
• Is there a polynomial-time algorithm for the minimum mosaic problem for a small number of founders (say three to ten)?
• Software available at: http://wwwcsif.cs.ucdavis.edu/~wuyu
• Thank you.
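As a recap of the two-founder case from the talk, the column-pair argument can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' released software: the function name `two_founder_mosaic` is hypothetical, the first founder is fixed to 0 at the first non-uniform column (the two founders are complementary, so this loses nothing), and ties are broken in favor of keeping the founder value.

```python
def two_founder_mosaic(rows):
    """Minimum breakpoints for Kf = 2, plus one optimal founder pair.

    For each pair of neighboring non-uniform columns, count the rows that
    keep their value (00/11) and the rows that change it (01/10). Whichever
    group is smaller pays one breakpoint per row.
    """
    n_cols = len(rows[0])
    # Uniform columns carry no information about the founders; drop them.
    cols = [c for c in range(n_cols) if len({r[c] for r in rows}) > 1]
    if not cols:
        return 0, "", ""

    founder_a = ["0"]
    total = 0
    for prev, cur in zip(cols, cols[1:]):
        same = sum(r[prev] == r[cur] for r in rows)
        diff = len(rows) - same
        if diff <= same:
            # Founders keep their values: rows that change pay a breakpoint.
            founder_a.append(founder_a[-1])
            total += diff
        else:
            # Founders flip: rows that keep their value pay a breakpoint.
            founder_a.append("1" if founder_a[-1] == "0" else "0")
            total += same

    a = "".join(founder_a)
    b = "".join("1" if x == "0" else "0" for x in a)
    return total, a, b


# The nine-column input rows from the two-founder slide.
ROWS_2F = ["110111101", "100100101", "010111111", "010101100", "110000111"]
print(two_founder_mosaic(ROWS_2F))
```

On the slide's rows this gives the per-pair counts 2, 2, 2, 1, 2, 2 and hence the 11 breakpoints stated in the talk. The founders are returned over the non-uniform columns only.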