Transcript Slides
CS 6293 Advanced Topics: Current Bioinformatics
Lecture 5: Exact String Matching Algorithms

Overview
• Sequence alignment has two sub-problems:
  – How to score an alignment with errors
  – How to find an alignment with the best score
• Today: exact string matching
  – Does not allow any errors
  – Efficiency (time and space) becomes the sole consideration

Why exact string matching?
• The most fundamental string comparison problem
• Often the core of more complex string comparison algorithms
  – E.g., BLAST
• Often called repeatedly by other methods
  – Usually the most time-consuming part
  – A small improvement can improve overall efficiency considerably

Definitions
• Text: a longer string T (length m)
• Pattern: a shorter string P (length n)
• Exact matching: find all occurrences of P in T
[Figure: P (length n) slid along T (length m); mismatching characters are marked x and y.]

The naïve algorithm
• Align P at the start of T and compare left to right; on a mismatch, shift P right by one position and restart the comparison.
[Figure: successive one-character shifts of P against T, re-comparing from scratch at each alignment.]

Time complexity
• Worst case: O(mn)
• Best case: O(m), e.g., T = aaaaaaaaaaaaaa vs. P = baaaaaaa
• Average case?
  – Alphabet {A, C, G, T}
  – Assume both P and T are random, with all characters equally probable
  – On average, how many characters do we need to compare before giving up on an alignment?

Average case time complexity
• P(mismatch at 1st position) = 3/4
• P(mismatch at 2nd position) = 1/4 × 3/4
• P(mismatch at 3rd position) = (1/4)² × 3/4
• P(mismatch at kth position) = (1/4)^(k−1) × 3/4
• With match probability p = 1/4, the expected number of comparisons per alignment is
  Σ_{k≥1} k (1−p) p^(k−1) = 1/(1−p) = 4/3
• Average complexity: 4m/3
• Not as bad as you might have thought
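To make the naïve algorithm concrete, here is a minimal Python sketch (an illustration added here, not code from the original lecture; the function name is my own). On random 4-letter text it performs roughly 4m/3 character comparisons in total, matching the analysis above, but it degenerates toward O(mn) on repetitive input.

    def naive_match(T, P):
        """Slide P across T one position at a time; report all match positions."""
        m, n = len(T), len(P)
        occurrences = []
        for i in range(m - n + 1):            # each candidate alignment of P in T
            j = 0
            while j < n and T[i + j] == P[j]:
                j += 1                        # extend the match by one character
            if j == n:                        # all n characters matched
                occurrences.append(i)
        return occurrences

    print(naive_match("aacaataaaaataaccttacta", "aataac"))   # -> [9]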
Biological sequences are not random
• T: aaaaaaaaaaaaaaaaaaaaaaaaa
• P: aaaab
• Plus: the 4m/3 average case is still bad for long genomic sequences!
  – Especially if the search has to be done again and again
• Smarter algorithms: O(m + n) in the worst case, sub-linear in practice

How to speed up?
• Pre-process T or P
• Why can pre-processing save us time?
  – It uncovers the structure of T or P
  – It determines when we can skip ahead without missing anything
  – It determines when we can infer the result of character comparisons without doing them
• Example: T = ACGTAXACXTAXACGXAX, P = ACGTACA

Cost for exact string matching
• Total cost = cost(preprocessing) + cost(comparison) + cost(output)
  – Preprocessing is overhead; the output cost is essentially constant
  – Hope: gain > overhead
  – Goal: minimize the total cost

String matching scenarios
• One T and one P
  – Search for a word in a document
• One T and many P, all at once
  – Search for a set of words in a document
  – Spell checking (fixed P)
• One fixed T, many P
  – Search a completed genome for short sequences
• Two (or many) T's, searched for common patterns
• Q: Which one to pre-process?
• A: Always pre-process the shorter sequence, or the one that is used repeatedly

Pre-processing algorithms
• Pattern preprocessing
  – Knuth-Morris-Pratt algorithm (KMP)
  – Aho-Corasick algorithm (multiple patterns)
  – Boyer-Moore algorithm (discussed only if we have time)
    • The choice in most cases
    • Typically sub-linear time
• Text preprocessing
  – Suffix tree
    • Very useful for many purposes

Algorithm KMP: intuitive example 1
• P = abcxabcde is compared against T; a mismatch occurs when comparing P[8] with T[i], after the prefix abcxabc has matched.
• The naïve approach would shift P by one character and re-compare from scratch, several times.
• Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four characters and compare P[4] with T[i], without missing any possible matches.
• Number of comparisons saved: 6

Intuitive example 2
• Again P = abcxabcde, but now the mismatch occurs when comparing P[7] with T[j] (so T[j] is not the c that the pattern expects).
• The naïve approach would shift P by one character six times, re-comparing each time.
• Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six characters and compare T[j] with P[1], without missing any possible matches.
• Number of comparisons saved: 7

KMP algorithm: pre-processing
• Key: the reasoning is done without even knowing what the string T is.
• Only the location of the mismatch in P must be known.
• Pre-processing: for any position i in P, find the longest proper suffix t = P[j..i] of P[1..i] that matches a prefix t' of P, such that the character after t differs from the character after t' (i.e., y ≠ z).
• For each i, let sp(i) = length(t).
[Figure: t' at the front of P and t ending at position i, with following characters z and y.]

KMP algorithm: shift rule
• Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i − sp(i) characters and compare x with z.
• This shift rule can be represented implicitly by creating a failure link from y to z. Meaning: when a mismatch occurs between a character x in T and P[i+1], resume the comparison between x and P[sp(i)+1].

Failure link example
• P = aataac, sp values: 0 1 0 0 2 0
• If a character in T fails to match at position 6, re-compare it with the character at position 3 (= 2 + 1).

Another example
• P = abababc, sp values: 0 0 0 0 0 4 0
• If a character in T fails to match at position 7, re-compare it with the character at position 5 (= 4 + 1).

KMP example using failure links
• P = aataac, T = aacaataaaaataaccttacta
• On each mismatch, P is shifted according to the failure link and the comparison resumes; some comparisons are implicit (already known to match from the previous alignment).

Time complexity analysis
• Each character in T may be compared up to n times, so a lousy analysis gives O(mn) time.
• A more careful analysis breaks the comparisons into two phases:
  – Comparison phase: the first time a character in T is compared to P. The total is exactly m.
  – Shift phase: the first comparison made after each shift. The total is at most m.
• Time complexity: O(2m)

KMP algorithm using a DFA (Deterministic Finite Automaton)
• The failure links can be compiled into a DFA with states 0..n.
• P = aataac: where the failure link says "if a character in T fails to match at position 6, re-compare it at position 3", the DFA says directly "if the next character in T is t after matching 5 characters, go to state 3".
• All other inputs go to state 0.

DFA example
• T: aacaataataataaccttacta
• States visited: 1201234534534560001001
• Each character in T is examined exactly once, so exactly m comparisons are made.
• But the DFA takes longer to pre-process and needs more space to store.

Difference between failure links and DFA
• Failure links
  – Preprocessing time and space are O(n), regardless of alphabet size
  – Comparison time is at most 2m (and at least m)
• DFA
  – Preprocessing time and space are O(n·|Σ|), where |Σ| is the alphabet size
    • May be a problem for very large alphabets
    • For example, when each "character" is a big integer, or a Chinese character
  – Comparison time is always exactly m
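The failure-link idea fits in a few lines of Python (again my own sketch, not the slides' code). For simplicity it computes the weak border values, without enforcing the y ≠ z condition of the strong sp values; the search is still O(m + n).

    def failure_function(P):
        """fail[i] = length of the longest proper suffix of P[:i+1] that is a prefix of P."""
        fail, k = [0] * len(P), 0
        for i in range(1, len(P)):
            while k > 0 and P[i] != P[k]:
                k = fail[k - 1]               # fall back along the failure links
            if P[i] == P[k]:
                k += 1
            fail[i] = k
        return fail

    def kmp_search(T, P):
        """Report all occurrences of P in T in O(m + n) time."""
        fail, k, hits = failure_function(P), 0, []
        for i, c in enumerate(T):
            while k > 0 and c != P[k]:
                k = fail[k - 1]               # resume at sp+1 instead of restarting
            if c == P[k]:
                k += 1
            if k == len(P):                   # full match ending at position i
                hits.append(i - len(P) + 1)
                k = fail[k - 1]
        return hits

    print(failure_function("aataac"))                      # -> [0, 1, 0, 1, 2, 0]
    print(kmp_search("aacaataaaaataaccttacta", "aataac"))  # -> [9]

Note that the weak values for aataac are 0 1 0 1 2 0; the slide's strong sp values (0 1 0 0 2 0) differ only at position 4, where the strong rule avoids one futile re-comparison.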
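The same failure function can be compiled into the DFA of the slides. A sketch under the same assumptions (the alphabet is passed explicitly; failure_function is reused from the previous sketch):

    def kmp_dfa(P, alphabet="acgt"):
        """delta[k][c] = next state after reading c with k characters of P matched."""
        n, fail = len(P), failure_function(P)
        delta = [dict() for _ in range(n + 1)]
        for k in range(n + 1):
            for c in alphabet:
                if k < n and c == P[k]:
                    delta[k][c] = k + 1                    # extend the match
                elif k == 0:
                    delta[k][c] = 0                        # stay at the start state
                else:
                    delta[k][c] = delta[fail[k - 1]][c]    # copy from the fallback state
        return delta

    delta, state, states = kmp_dfa("aataac"), 0, []
    for c in "aacaataataataaccttacta":
        state = delta[state][c]                            # one table lookup per char
        states.append(str(state))
    print("".join(states))    # -> 1201234534534560001001, as traced on the slide

The price is the (n + 1) × |Σ| transition table, which is exactly the O(n·|Σ|) preprocessing space noted above.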
Boyer-Moore algorithm
• Often the algorithm of choice
  – One T and one P
  – We will talk about it later if we have time
  – Sub-linear in practice

The set matching problem
• Find all occurrences of a set of patterns in T
• First idea: run KMP or BM for each P
  – O(km + n)
    • k: number of patterns
    • m: length of the text
    • n: total length of the patterns
• Better idea: combine all the patterns together and search in one run

A simpler problem: spell checking
• A dictionary contains five words:
  – potato
  – poetry
  – pottery
  – science
  – school
• Given a document, check whether each word is (or is not) in the dictionary
  – Words in the document are separated by special characters
  – Relatively easy

Keyword tree for spell checking
• Example text: "This version of the potato gun was inspired by the Weird Science team out of Illinois"
[Figure: keyword tree for potato, poetry, pottery, science, school; the p-words share the prefix po and the s-words share the prefix sc; leaves are numbered 1-5.]
• O(n) time to construct (n: total length of the patterns)
• Search time: O(m) (m: length of the text)
• A common prefix only needs to be compared once
• What if there is no space between words?

Aho-Corasick algorithm
• The basis of the fgrep algorithm
• Generalizes KMP, using failure links
• Example: given the following 4 patterns:
  – potato
  – tattoo
  – theater
  – other
• Example text: potherotathxythopotattooattoo

Keyword tree
[Figure: keyword tree for potato, tattoo, theater, other; leaves numbered 1-4.]
• Searching the keyword tree naïvely (restarting from the root after each failure) takes O(mn) time
  – m: length of the text; n: length of the longest pattern

Keyword tree with failure links
[Figure: the same keyword tree with failure links added, e.g., from the pot prefix of potato to the ot prefix of other, so the scan never re-reads text characters.]
• Following the failure links, a single left-to-right pass over potherotathxythopotattooattoo finds both "other" and "tattoo".

Aho-Corasick algorithm complexity
• O(n) preprocessing and O(m + k) searching
  – n: total length of the patterns
  – m: length of the text
  – k: number of occurrences
• Can create a DFA similar to the one in KMP
  – Requires more space
  – Preprocessing time depends on the alphabet size
  – Search time per character is constant
• Q: Where can this algorithm be used in the previous topics?
• A: BLAST
  – Given a query sequence, we generate many seed sequences (k-mers)
  – Search for exact matches to these seed sequences
  – Extend the exact matches into longer inexact matches
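Before moving on to text preprocessing, here is a compact Python sketch of Aho-Corasick (my own illustration): phase 1 builds the keyword tree, phase 2 adds failure links breadth-first, and the scan then touches each text character once. Output lists are inherited across failure links so that patterns ending inside other matches are not missed.

    from collections import deque

    def build_aho_corasick(patterns):
        """Keyword tree plus breadth-first failure links (unoptimized sketch)."""
        goto, fail, out = [{}], [0], [[]]          # node 0 is the root
        for p in patterns:                         # phase 1: insert into the trie
            s = 0
            for c in p:
                if c not in goto[s]:
                    goto[s][c] = len(goto)
                    goto.append({}); fail.append(0); out.append([])
                s = goto[s][c]
            out[s].append(p)                       # this node spells a whole pattern
        queue = deque(goto[0].values())            # phase 2: failure links via BFS
        while queue:
            u = queue.popleft()
            for c, v in goto[u].items():
                queue.append(v)
                f = fail[u]
                while f and c not in goto[f]:
                    f = fail[f]                    # walk failure links until c extends
                fail[v] = goto[f].get(c, 0)
                out[v] += out[fail[v]]             # inherit matches from the fail target
        return goto, fail, out

    def ac_search(T, patterns):
        goto, fail, out = build_aho_corasick(patterns)
        s, hits = 0, []
        for i, c in enumerate(T):
            while s and c not in goto[s]:
                s = fail[s]
            s = goto[s].get(c, 0)
            for p in out[s]:
                hits.append((i - len(p) + 1, p))
        return hits

    print(ac_search("potherotathxythopotattooattoo",
                    ["potato", "tattoo", "theater", "other"]))
    # -> [(1, 'other'), (18, 'tattoo')]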
Suffix tree
• All the algorithms discussed so far preprocess the pattern(s)
  – Boyer-Moore: fastest in practice; O(m) worst case
  – KMP: O(m)
  – Aho-Corasick: O(m)
• In some cases we may prefer to pre-process T instead
  – Fixed T, varying P
• Suffix tree: basically a keyword tree of all suffixes of T
• T = xabxac has suffixes:
  1. xabxac
  2. abxac
  3. bxac
  4. xac
  5. ac
  6. c
[Figure: suffix tree of xabxac; leaves 1-6 correspond to the six suffixes.]
• Naïve construction: O(m²) using Aho-Corasick
• Smarter: O(m); very technical, with a big constant factor
• Difference from a keyword tree: an internal node is created only where there is a branch

Suffix tree implementation
• Explicitly label the sequence end, e.g., T = xabxa$
  – Otherwise a suffix that is a prefix of another suffix (here xa and a) would end in the middle of an edge
  – One-to-one correspondence between leaves and suffixes
  – |T| leaves, hence < |T| internal nodes
• Implicitly label the edges with (start, end) positions into T, e.g., 2:2 for b and 3:$ for bxa$
  – |Tree(T)| = O(|T| + size(edge labels))
[Figure: suffix tree of xabxa$ with explicit $ leaves, then the same tree with positional edge labels.]

Suffix links
• Similar to the failure links in a keyword tree
• Only internal nodes having branches are linked
[Figure: suffix links followed while matching P = xabcf.]

ST application 1: pattern matching
• Find all occurrences of P = xa in T = xabxac
  – Find the node v in the suffix tree whose path label matches P
  – Traverse the subtree rooted at v to collect the leaf locations
• O(m) to construct the suffix tree (large constant factor)
• O(n) to find v: linear in the length of P instead of T!
• O(k) to collect all the leaves, where k is the number of occurrences
• The asymptotic time is the same as KMP; the suffix tree wins if T is fixed, KMP wins otherwise.

ST application 2: set matching
• Find all occurrences of a set of patterns in T
  – Build a suffix tree from T
  – Match each P against the suffix tree (e.g., P = xab in T = xabxac)
• O(m) to construct the suffix tree (large constant factor)
• O(n) to find v: linear in the total length of the P's
• O(k) to collect all the leaves, where k is the number of occurrences
• The asymptotic time is the same as Aho-Corasick; the suffix tree wins if T is fixed, AC wins if the patterns are fixed, and otherwise it depends on their relative sizes.

ST application 3: repeat finding
• A genome contains many repeated DNA sequences
• Repeat lengths vary from 1 nucleotide to millions
  – Genes may have multiple copies (50 to 10,000)
  – Highly repetitive DNA in some non-coding regions
    • 6 to 10 bp repeated 100,000 to 1,000,000 times
• Problem: find all repeats that are at least k residues long and appear at least p times in the genome

Repeat finding with a suffix tree
• Repeats at least k residues long appearing at least p times:
  – Phase 1 (top-down): compute the label length L from the root to each node
  – Phase 2 (bottom-up): count the number of leaves N descended from each internal node
  – For each node with L ≥ k and N ≥ p, print all its leaves
• O(m) to traverse the tree

Maximal repeat finding
• Example: S = acatgacatt
  1. Right-maximal repeat: S[i+1..i+k] = S[j+1..j+k], but S[i+k+1] ≠ S[j+k+1] (e.g., cat)
  2. Left-maximal repeat: S[i+1..i+k] = S[j+1..j+k], but S[i] ≠ S[j] (e.g., aca)
  3. Maximal repeat: both ends differ, i.e., S[i] ≠ S[j] and S[i+k+1] ≠ S[j+k+1] (e.g., acat)
• In the suffix tree of acatgacatt, with at least 3 bases and 2 occurrences:
  – right-maximal: cat
  – maximal: acat
  – left-maximal: aca
• How to find maximal repeats? Annotate each leaf with its left character; a right-maximal repeat (an internal node) whose leaves have different left characters is maximal.
[Figure: suffix tree of acatgacatt, with each leaf annotated by its left character, e.g., g, c, a.]

ST application 4: word enumeration
• Find all k-mers that occur at least p times
  – Compute (L, N) for each node
    • L: total label length from the root to the node
    • N: number of leaves below it
  – Find nodes v with L ≥ k, L(parent) < k, and N ≥ p
  – Traverse the subtree rooted at v to get the locations
• This can be used in many applications, for example to find words that appear frequently in a genome or a document. (See the sketch below.)
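Here is a small Python sketch of the word-enumeration idea (my own code). For brevity it uses an uncompressed suffix trie, which costs O(m²) space on a string of length m; a real suffix tree compresses unary paths and can be built in O(m) time. The (L, N) logic is the same: prune when N < p, report when the depth reaches k.

    from collections import defaultdict

    class Trie(defaultdict):
        def __init__(self):
            super().__init__(Trie)
            self.starts = []                  # suffix start positions below this node

    def build_suffix_trie(S):
        """Uncompressed suffix trie; N = len(node.starts) at every node."""
        root = Trie()
        for i in range(len(S)):               # insert suffix S[i:]
            node = root
            node.starts.append(i)
            for c in S[i:]:
                node = node[c]
                node.starts.append(i)         # every node on the path gains a leaf
        return root

    def kmers(S, k, p):
        """All length-k substrings occurring >= p times, with their positions."""
        root, found = build_suffix_trie(S), {}
        def walk(node, depth, label):
            if len(node.starts) < p:          # N < p: prune this subtree
                return
            if depth == k:                    # L = k and N >= p: report, stop here
                found[label] = node.starts
                return
            for c, child in node.items():
                walk(child, depth + 1, label + c)
        walk(root, 0, "")
        return found

    print(kmers("acatgacatt", 3, 2))          # -> {'aca': [0, 5], 'cat': [1, 6]}

The reported positions are 0-based; aca and cat are exactly the length-3 repeats occurring twice in the slide's example string acatgacatt.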
Joint suffix tree (JST)
• Build a suffix tree for more than one string
• Two strings S1 and S2; let S* = S1 & S2, where & is a separator character
• Build a suffix tree for S* in time O(|S1| + |S2|)
• The separator will only appear on edges ending in a leaf (why?)

Joint suffix tree example
• S1 = abcd, S2 = abca, S* = abcd&abca$
• Each leaf is labeled (sequence ID, suffix ID); e.g., (2, 1) is the first suffix of S2.
• Suffixes of S1 drag the tail &abca along with them (e.g., the suffix (1, 1) spells abcd&abca); everything from the separator onward is useless and never matched.
[Figure: suffix tree of abcd&abca$, with the useless &abca tails marked.]

To simplify
• Conceptually trim everything from the separator onward on each leaf edge.
• We don't really need to do anything, since all edge labels are implicit (positions into S*).
• The trimmed tree is simply more convenient to look at.

Application 1 of JST: longest common substring
• Longest common substring between two sequences
• Using Smith-Waterman
  – Gap = mismatch = −infinity
  – Quadratic time
• Using a JST
  – Linear time
  – For each internal node v, keep a bit vector B
    • B[1] = 1 if some leaf below v is a suffix of S1 (and similarly B[2] for S2)
  – Bottom-up: find all internal nodes with B[1] = B[2] = 1
  – Report such a node with the longest path label
  – Can be extended to k sequences: just use a bit vector of size k.

Application 2 of JST: exact motif finding
• Given K strings, find all k-mers that appear in at least (or at most) d strings
• Combine the children's bit vectors bottom-up with a bitwise OR
  – E.g., B = BitOR(1010, 0011) = 1011, with cardinality(B) = 3
• Report nodes with L ≥ k, L(parent) < k, and cardinality(B) ≥ d

Application 3 of JST: substring problem for sequence databases
• Given: a fixed database of sequences (e.g., individual genomes)
• Given: a short pattern (e.g., a DNA signature)
• Q: Does this DNA signature belong to any individual in the database?
  – i.e., is the pattern a substring of some sequence in the database?
  – Aho-Corasick doesn't work here (it preprocesses the patterns, not the database)
  – This can also be used to design signatures for individuals
• Build a JST for the database sequences (e.g., abcd and abca)
• Match P (e.g., P1 = cd, P2 = bc) against the JST
• Read the sequence IDs off the descendant leaves

Application 4 of JST: detecting DNA contamination
• When we clone and sequence a genome, DNA from other sources may contaminate the sample; it should be detected and removed
• Given: a fixed database of sequences (the possible contamination sources, e.g., abcd and abca)
• Given: a DNA sequence just obtained (e.g., dbcgaabctacgtctagt)
• Q: Does this DNA contain a long enough substring from the sequences in the database?
• Build a JST for the database sequences, then scan the new sequence using the JST

Suffix tree memory footprint
• The space requirements of suffix trees can become prohibitive
  – |Tree(T)| is about 20|T| in practice
• Suffix arrays provide one solution.

Suffix arrays
• Very space efficient (m integers)
• Pattern lookup is nearly O(n) in practice
  – O(n + log₂ m) worst case with 2m additional integers
  – Independent of the alphabet size!
• Easiest to describe (and construct) using suffix trees
  – Other (slower) methods exist
• Example: T = xabxa. Its suffixes in lexical order are a (5), abxa (2), bxa (3), xa (4), xabxa (1), so Pos = [5, 2, 3, 4, 1].

Suffix array construction
• Build a suffix tree for T$
• Perform a "lexical" depth-first search of the suffix tree, outputting the suffix label of each leaf encountered
• Therefore the suffix array can be constructed in O(m) time. (A simple quadratic alternative appears in the sketch below.)
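As a stand-in for the lexical depth-first search, here is a tiny Python sketch (my own) that builds the suffix array by sorting the suffixes directly. This is O(m² log m) rather than O(m), but it makes the definition of Pos concrete:

    def suffix_array(T):
        """Pos[k] = 1-based start of the k-th lexicographically smallest suffix."""
        return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

    print(suffix_array("xabxa"))   # -> [5, 2, 3, 4, 1]: a, abxa, bxa, xa, xabxa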
Suffix array pattern search
• If P occurs in T, then all the locations of P are consecutive suffixes in Pos.
• Do a binary search in Pos to find P!
  – Compare P with the suffix at Pos(m/2)
  – If P is lexicographically smaller, P lies in the first half of Pos
  – If P is lexicographically larger, P lies in the second half of Pos
  – Iterate!
• Example: T = xabxa$, P = abx; the boundaries L, M, R narrow in on the block of suffixes beginning with P.

Suffix array binary search
• How long does it take to compare P with a suffix of T?
  – O(n) in the worst case!
• So binary search on Pos takes O(n log m) time.
• The worst case will be rare
  – It occurs only if many long prefixes of P appear in T
  – In random or large-alphabet strings, expect to do fewer than log m comparisons
• O(n + log m) running time when combined with an LCP table
  – suffix tree = suffix array + LCP table
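A Python sketch of the binary search over Pos (my own code, the plain O(n log m) version without the LCP refinement). It reuses suffix_array from the earlier sketch and returns all occurrences of P:

    def sa_search(T, P, pos):
        """All start positions of P in T, via binary search on the suffix array."""
        def lower_bound(key):
            lo, hi = 0, len(pos)
            while lo < hi:
                mid = (lo + hi) // 2
                if T[pos[mid] - 1:][:len(key)] < key:   # compare at most |key| chars
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        left = lower_bound(P)                  # first suffix whose prefix >= P
        right = lower_bound(P + "\x7f")        # "\x7f" sorts after any text letter
        return sorted(pos[left:right])         # the block of suffixes starting with P

    pos = suffix_array("xabxa")                # [5, 2, 3, 4, 1], from the earlier sketch
    print(sa_search("xabxa", "xa", pos))       # -> [1, 4]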
Summary
• One T, one P
  – Boyer-Moore is the choice
  – KMP works but is not the best
• One T, many P
  – Aho-Corasick
  – Suffix tree (array)
• One fixed T, many varying P
  – Suffix tree (array)
• Two or more T's
  – Suffix tree, joint suffix tree
• KMP, Boyer-Moore, and Aho-Corasick are alphabet independent; the suffix tree (array) methods are alphabet dependent.

Boyer-Moore algorithm
• Three ideas:
  – Right-to-left comparison
  – Bad character rule
  – Good suffix rule
• Each alignment is compared right to left; on a mismatch, the two rules let us skip some characters without missing any occurrence and resume the comparison further along T.

Bad character rule
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
• Four characters matched (^) before the mismatch (*). What would you do now?
• Shift P so that the mismatched text character t lies under the rightmost t in P:
P:     tpabxab
• If the next mismatch is against a character that does not occur in P at all (like the q's here), P can jump past it entirely:
P:            tpabxab

Basic bad character rule
• Pre-processing (O(n)): record the rightmost position in P of each character. For P = tpabxab:

  char   rightmost position in P
  a      6
  b      7
  p      2
  t      1
  x      5

• Let the mismatch be at position i of P, against text character T(k). Three cases:
  – The rightmost T(k) in P is to the left of i: shift P to align T(k) with that rightmost occurrence. E.g., i = 3, T(k) = t: shift 3 − 1 = 2.
  – T(k) does not occur in P: shift the left end of P to align with T(k+1). E.g., i = 7, T(k) not in P: shift 7 − 0 = 7.
  – The rightmost T(k) in P is to the right of i: shift P by 1. E.g., i = 5, T(k) = a with rightmost position 6: 5 − 6 < 0, so shift 1.

Extended bad character rule
• Find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align it with T(k). E.g., i = 5, T(k) = a, and the nearest a to the left is at position 3: shift 5 − 3 = 2.
• Pre-processing: store for each character all of its positions in P, in decreasing order (for tpabxab: a: 6, 3; b: 7, 4; p: 2; t: 1; x: 5). Still O(n). (A sketch appears at the end of these notes.)
• Best possible: m/n comparisons
• Works better for large alphabets
• In some cases the extended bad character rule alone is sufficiently good
• Worst case: O(mn); the expected time is sub-linear

(Weak) good suffix rule
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
• Here the extended bad character rule shifts P by only 1 (the nearest b left of position 8 is at position 7); the good suffix rule does better.
• When a suffix t of P has matched (here t = ab) but the next comparison x ≠ y fails, shift P so that the rightmost earlier copy t' of t lies under the matched text characters:
P:      qcabdabdab
• Preprocessing: for any suffix t of P, find its rightmost earlier copy t'. How do we find t' efficiently?

(Strong) good suffix rule
• In preprocessing: for any suffix t of P, find the rightmost copy t' of t such that the character to the left of t' differs from the character to the left of t (z ≠ y); only such a copy can possibly lead to a match.
• On the example above, the strong rule shifts P by 6 rather than 3, because the copy of ab at positions 3-4 is preceded by c while the matched suffix ab is preceded by d:
P:         qcabdabdab

Example preprocessing for P = qcabdabdab
• Bad character rule table (which shift applies depends on T at search time):

  char   positions in P
  a      9, 6, 3
  b      10, 7, 4
  c      2
  d      8, 5
  q      1

• Good suffix values (independent of T), as given on the slide:

  position:  1  2  3  4  5  6  7  8  9  10
  P:         q  c  a  b  d  a  b  d  a  b
  value:     0  0  0  0  2  0  0  2  0  0

  (the matched suffix abdab recurs at positions 3-7 preceded by c rather than d, compare dabdab with cabdab; likewise ab recurs at positions 3-4 preceded by c, compare dab with cab)
• At search time, the larger of the shifts given by the (extended) bad character rule and the (strong) good suffix rule is used.

Time complexity of the BM algorithm
• The pre-processing can be done in linear time.
• With the strong good suffix rule, the worst case is O(m) provided P does not occur in T.
  – If P does occur in T, the worst case can still be O(mn), e.g., T = m^100 (the character m repeated 100 times) and P = m^10, unless a modification such as Galil's rule is used.
• The proofs are technical; we skip them.

How to actually do the pre-processing?
• KMP and B-M need similar pre-processing: finding matches between a suffix and a prefix of P.
  – KMP: for each position i, find the prefix copy t' of the suffix t of P[1..i] (with mismatching next characters x ≠ y).
  – B-M: for each suffix t, find its rightmost earlier copy t'.
  – For each i, find the corresponding j, similarly to dynamic programming, starting from i = 2.
• Both can be done in linear time; and since P is usually short, even a more expensive preprocessing may still yield an overall gain.

Fundamental pre-processing
• Z_i: the length of the longest substring starting at position i that matches a prefix of P
  – i.e., t = t', x ≠ y, and Z_i = |t|
• Example: P = aabcaabxaaz has Z = 0 1 0 0 3 1 0 0 2 1 0
• With the Z-values computed, the preprocessing for both KMP and B-M follows in linear time.
• How do we compute the Z-values themselves in linear time?

Computing Z in linear time
• Process k = 2..n left to right, maintaining l and r, the start and end of the rightmost prefix match found so far. All Z-values up to k−1 are already computed, and since P[l..r] equals the prefix P[1..r−l+1], the known value Z_{k−l+1} can tell us about Z_k:
  – Case 1: r < k, i.e., no previous match extends beyond k. Compute Z_k by explicit comparison.
  – Case 2: Z_{k−l+1} < r − k + 1. Then Z_k = Z_{k−l+1}; no comparison is needed.
  – Case 3: Z_{k−l+1} ≥ r − k + 1. Then Z_k ≥ r − k + 1; continue comparing from position r + 1.
• No character inside the current window is compared twice, and each iteration makes at most one mismatch. Therefore the total time is O(n).

Z-preprocessing for B-M and KMP
• KMP: for each j (processed in decreasing order), set sp'(j + Z_j − 1) = Z_j.
• B-M: use the Z-values backwards, i.e., computed on the reversed pattern, to locate the rightmost copies of suffixes.
• Both the KMP and the B-M preprocessing can thus be done in O(n). (Sketches of the bad-character search and of the Z computation follow.)
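Two sketches to close. First, Boyer-Moore with only the extended bad character rule, as described above (my own simplification: the good suffix rule is omitted, so after a full match the pattern shifts by just one):

    from collections import defaultdict

    def bm_bad_char(T, P):
        """Boyer-Moore search using only the extended bad character rule."""
        pos = defaultdict(list)                     # char -> positions in P, descending
        for j, c in enumerate(P):
            pos[c].insert(0, j)
        m, n, hits = len(T), len(P), []
        i = 0                                       # P currently under T[i..i+n-1]
        while i <= m - n:
            j = n - 1
            while j >= 0 and T[i + j] == P[j]:      # right-to-left comparison
                j -= 1
            if j < 0:
                hits.append(i)
                i += 1                              # good suffix rule would shift more
            else:
                c = T[i + j]
                k = next((q for q in pos[c] if q < j), -1)   # nearest c left of j
                i += j - k                          # align it under T[i+j], or skip past
        return hits

    print(bm_bad_char("xpbctbxabpqqaabpqz", "tpabxab"))   # -> [] (no occurrence)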
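Second, the linear-time Z computation itself; this is the standard algorithm just described, written 0-indexed, with [l, r) the rightmost match window found so far:

    def z_values(P):
        """Z[i] = length of the longest substring at i matching a prefix of P."""
        n, Z = len(P), [0] * len(P)            # Z[0] stays 0, as on the slide
        l = r = 0
        for k in range(1, n):
            if k < r:                          # cases 2 and 3: k lies inside [l, r)
                Z[k] = min(r - k, Z[k - l])    # case 2 when Z[k-l] is the smaller
            while k + Z[k] < n and P[Z[k]] == P[k + Z[k]]:
                Z[k] += 1                      # cases 1 and 3: explicit comparisons
            if k + Z[k] > r:                   # found a new rightmost match window
                l, r = k, k + Z[k]
        return Z

    print(z_values("aabcaabxaaz"))             # -> [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]

From these values, sp'(j + Z_j − 1) = Z_j recovers the KMP table, and running the same routine on the reversed pattern gives the rightmost-copy information for Boyer-Moore, both in O(n).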