Transcript Slides
CS 6293 Advanced Topics: Translational Bioinformatics Lectures 3-4: Pair-wise Sequence Alignment Outline • • • • • Biological background Global sequence alignment Local sequence alignment Optional: linear-space alignment algorithm Heuristic alignment: BLAST Evolution at the DNA level C …ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… DNA evolutionary events (sequence edits): Mutation, deletion, insertion Sequence conservation implies function next generation OK OK OK X X Still OK? Why sequence alignment? • Conserved regions are more likely to be functional – Can be used for finding genes, regulatory elements, etc. • Similar sequences often have similar origin and function – Can be used to predict functions for new genes / proteins • Sequence alignment is one of the most widely used computational tools in biology Global Sequence Alignment S T S’ T’ AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition An alignment of two strings S, T is a pair of strings S’, T’ (with spaces) s.t. (1) |S’| = |T’|, and (|S| = “length of S”) (2) removing all spaces in S’, T’ leaves S, T What is a good alignment? Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”? S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC • The score of aligning (characters or spaces) x & y is σ (x,y). • Score of an alignment: • An optimal alignment: one with max score Scoring Function • Sequence edits: – Mutations – Insertions – Deletions Scoring Function: Match: +m Mismatch: -s Gap (indel): -d AGGCCTC AGGACTC AGGGCCTC AGG-CTC ~~~AAC~~~ ~~~A-A~~~ • Match = 2, mismatch = -1, gap = -1 • Score = 3 x 2 – 2 x 1 – 1 x 1 = 3 More complex scoring function • Substitution matrix – Similarity score of matching two letters a, b should reflect the probability of a, b derived from the same ancestor – It is usually defined by log likelihood ratio – Active research area. Especially for proteins. – Commonly used: PAM, BLOSUM An example substitution matrix A C G T A C G T 3 -2 -1 -2 3 -2 -1 3 -2 3 How to find an optimal alignment? • A naïve algorithm: for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 ≤i ≤|A| align all other chars to spaces compute its value S = abcd A = cd T = wxyz B = xz retain the max -abc-d a-bc-d end w--xyz -w-xyz output the retained alignment Analysis • Assume |S| = |T| = n • Cost of evaluating one alignment: ≥n • How many alignments are there: – pick n chars of S,T together – say k of them are in S – match these k to the k unpicked chars of T • Total time: • E.g., for n = 20, time is > 240 >1012 operations Dynamic Programming for sequence alignment Suppose we wish to align x1……xM y1……yN Let F(i,j) = optimal score of aligning x1……xi y1……yj Scoring Function: Match: +m Mismatch: -s Gap (indel): -d Optimal substructure 1 2 i M ... x: 1 y: 2 j N ... • If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then • The alignment between x[1..i] and y[1..j] is also optimal • Easy to prove by contradiction Recursive solution Notice three possible cases: 1. ~~~~~~~ xM ~~~~~~~ yN 2. F(M,N) = F(M-1, N-1) + -s, if not xM aligns to a gap ~~~~~~~ xM ~~~~~~~ 3. m, if xM = yN xM aligns to yN max F(M,N) = F(M-1, N) - d yN aligns to a gap ~~~~~~~ ~~~~~~~ yN F(M,N) = F(M, N-1) - d Recursive solution • Generalize: F(i,j) = max F(i-1, j-1) + (Xi,Yj) F(i-1, j) – d F(i, j-1) – d (Xi,Yj) = m if Xi = Yj, and –s otherwise • Boundary conditions: – F(0, 0) = 0. -jd: y[1..j] aligned to gaps. – F(0, j) = ? -id: x[1..i] aligned to gaps. – F(i, 0) = ? What order to fill? F(0,0) F(i-1, j-1)1 F(i-1, j) 1 i F(i, j-1) 3 2 F(i, j) F(M,N) j What order to fill? F(0,0) F(M,N) Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 1 A 2 T 3 A 0 1 2 A G 3 T 4 A Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = 0 0 j=0 1 A -1 2 T -2 3 A -3 1 2 3 4 A G T A -1 -2 -3 -4 Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 A G T A 0 -1 -2 -3 -4 1 0 -1 -2 1 A -1 2 T -2 3 A -3 3 4 Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Optimal Alignment: F(4,3) = 2 Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Optimal Alignment: F(4,3) = 2 This only tells us the best score Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max i = j=0 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) – d F(i, j-1) – d 1 2 3 A G T A 0 -1 -2 -3 -4 m= 1 s =1 d =1 4 1 A -1 1 0 -1 -2 A 2 T -2 0 0 1 0 A 3 A -3 -1 -1 0 2 Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max i = j=0 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) – d F(i, j-1) – d 1 2 3 A G T A 0 -1 -2 -3 -4 m= 1 s =1 d =1 4 1 A -1 1 0 -1 -2 T A 2 T -2 0 0 1 0 T A 3 A -3 -1 -1 0 2 Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max i = j=0 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) – d F(i, j-1) – d 1 2 3 A G T A 0 -1 -2 -3 -4 m= 1 s =1 d =1 4 1 A -1 1 0 -1 -2 G T A 2 T -2 0 0 1 0 - 3 A -3 -1 -1 0 2 T A Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max i = j=0 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) – d F(i, j-1) – d 1 2 3 A G T A 0 -1 -2 -3 -4 m= 1 s =1 d =1 4 1 A -1 1 0 -1 -2 A G T A 2 T -2 0 0 1 0 A - 3 A -3 -1 -1 0 2 T A Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max i = j=0 1 A 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) – d F(i, j-1) – d 1 2 3 A G T A 0 -1 -2 -3 -4 -1 1 0 -1 -2 m= 1 s =1 d =1 4 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Optimal Alignment: F(4,3) = 2 AGTA ATA Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = 0 0 j=0 1 A -1 2 T -2 3 A -3 1 2 3 4 A G T A -1 -2 -3 -4 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 A G T A 0 -1 -2 -3 -4 1 0 -1 -2 1 A -1 2 T -2 3 A -3 3 4 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 1 A 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Optimal Alignment: F(4,3) = 2 AGTA ATA The Needleman-Wunsch Algorithm 1. Initialization. a. b. c. 2. F(0, 0) F(0, j) F(i, 0) = 0 =-jd =-id Main Iteration. Filling in scores a. For each i = 1……M For each j = 1……N F(i, j) Ptr(i,j) 3. = max = F(i-1,j) – d F(i, j-1) – d F(i-1, j-1) + σ(xi, yj) UP, LEFT DIAG [case 1] [case 2] [case 3] if [case 1] if [case 2] if [case 3] Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment Complexity • Time: O(NM) • Space: O(NM) • Linear-space algorithms do exist (with the same time complexity) Equivalent graph problem S1 = A G T A (0,0) S2 = A 1 1 : a gap in the 2nd sequence : a gap in the 1st sequence 1 T 1 A : match / mismatch 1 Value on vertical/horizontal line: -d Value on diagonal: m or -s (3,4) • Number of steps: length of the alignment • Path length: alignment score • Optimal alignment: find the longest path from (0, 0) to (3, 4) • General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible. A variant of the basic algorithm Scoring scheme: m = s = d: 1 Seq1: CAGCA-CTTGGATTCTCGG || |:||| Seq2: ---CAGCGTGG-------- Seq1: CAGCACTTGGATTCTCGG |||| | | || Seq2: CAGC-----G-T----GG Score = -7 Score = -2 The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1) A variant of the basic algorithm • Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- • Then, we don’t want to penalize gaps in the ends The Overlap Detection variant yN ……………………………… y1 x1 ……………………………… xM Changes: 1. Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2. Termination maxi F(i, N) FOPT = max maxj F(M, j) Different types of overlaps x x y y The local alignment problem Given two strings X = x1……xM, Y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex Y = xxxcde x y X’ = cxde Y’ = c-de Why local alignment • Conserved regions may be a small part of the whole – Global alignment might miss them if flanking “junk” outweighs similar regions • Genes are shuffled between genomes A B C B D D A C Naïve algorithm for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair • Time: O(n2) choices for A, O(m2) for B, O(nm) for DP, so O(n3m3 ) total. Reminder • The overlap detection algorithm – We do not give penalty to gaps at either end Free gap Free gap The local alignment idea • Do not penalize the unaligned regions (gaps or mismatches) • The alignment can start anywhere and ends anywhere • Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment – By resetting alignment score to zero The Smith-Waterman algorithm Initialization: F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d Iteration: F(i, j) = max F(i, j – 1) – d F(i – 1, j – 1) + (xi, yj) The Smith-Waterman algorithm Termination: 1. If we want the best local alignment… FOPT = maxi,j F(i, j) 2. If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 x 0 x 0 c 0 d 0 e 0 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 x 0 0 0 x 0 0 0 c 0 0 0 d 0 0 0 e 0 0 0 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 x 0 0 0 0 x 0 0 0 0 c 0 0 0 2 d 0 0 0 1 e 0 0 0 0 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2 x 0 0 0 0 2 x 0 0 0 0 2 c 0 0 0 2 1 d 0 0 0 1 0 e 0 0 0 0 0 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2 1 x 0 0 0 0 2 1 x 0 0 0 0 2 1 c 0 0 0 2 1 1 d 0 0 0 1 0 3 e 0 0 0 0 0 2 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2 1 0 x 0 0 0 0 2 1 0 x 0 0 0 0 2 1 0 c 0 0 0 2 1 1 0 d 0 0 0 1 0 3 2 e 0 0 0 0 0 2 5 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2 c 0 0 0 2 1 1 0 1 d 0 0 0 1 1 3 2 1 e 0 0 0 0 0 2 5 4 Trace back Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2 c 0 0 0 2 1 1 0 1 d 0 0 0 1 1 3 2 1 e 0 0 0 0 0 2 5 4 Trace back Match: 2 Mismatch: -1 Gap: -1 cxde | || c-de x-de | || xcde a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2 c 0 0 0 2 1 1 0 1 d 0 0 0 1 1 3 2 1 e 0 0 0 0 0 2 5 4 • No negative values in local alignment DP array • Optimal local alignment will never have a gap on either end • Local alignment: “Smith-Waterman” • Global alignment: “Needleman-Wunsch” Analysis • Time: – O(MN) for finding the best alignment – Time to report all alignments depends on the number of sub-opt alignments • Memory: – O(MN) – O(M+N) possible Optional: more efficient alignment algorithms • Given two sequences of length M, N • Time: O(MN) – Ok (?) • Space: O(MN) – bad – 1Mb seq x 1Mb seq = 1000G memory • Can we do better? Bounded alignment Good alignment should appear near the diagonal Bounded Dynamic Programming If we know that x and y are very similar Assumption: Then, xi | yj # gaps(x, y) < k implies |i–j|<k yN ………………………… y1 Bounded Dynamic Programming x1 ………………………… xM Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ (xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k F(i – 1, j) – d, if j < i + k k Termination: same Analysis • Time: O(kM) << O(MN) • Space: O(kM) with some tricks => M 2k M 2k • Given two sequences of length M, N • Time: O(MN) – ok • Space: O(MN) – bad – 1mb seq x 1mb seq = 1000G memory • Can we do better? Linear space algorithm • If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment? Linear space algorithm • When we finish, we know how we have aligned the ends of the sequences XM YN Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN)) (0, 0) M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown. (0,0) (3,0) (3,2) (3,4) (3,6) (6,6) • Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k)) Hirschberg’s idea • Divide and conquer! Y X Forward algorithm Align x1x2…xM/2 with Y M/2 F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk Backward Algorithm Y X M/2 Backward algorithm Align reverse(xM/2+1…xM) with reverse(Y) B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(ykyk+1…yN ) Linear-space alignment Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k) M/2 Linear-space alignment Now, we can find k* maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k* Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2 Linear-space alignment k* M/2 M/2 • Iterate this procedure to the two sub-problems! N-k* Analysis • Memory: O(N) for computation, O(N+M) to store the optimal alignment • Time: – MN for first iteration – k M/2 + (N-k) M/2 = MN/2 for second –… k M/2 M/2 N-k MN MN/2 MN/4 MN + MN/2 + MN/4 + MN/8 + … = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) = 2MN = O(MN) MN/8 Heuristic Local Sequence Alignment: BLAST State of biological databases Sequenced Genomes: Human Mouse Neurospora Tetraodon Drosophila Rice sea squirts 3109 2.7109 4107 3108 1.2108 1.0109 1.6108 Yeast Rat Fugu fish 3.3108 Mosquito 2.8108 Worm Arabidopsis Current rate of sequencing (before new-generation sequencing): 4 big labs 3 109 bp /year/lab 10s small labs Private sectors With new-generation sequencing: Easily generating billions of reads daily 1.2107 2.6109 1.0108 1.2108 Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? Assume we try Smith-Waterman: Our new gene 104 The entire genomic database 1010 - 1011 May take several weeks! Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? - Potential genes - Other functional units Assume we try Smith-Waterman: Our newly sequenced mammal 3109 The entire genomic database 1010 - 1011 > 1000 years ??? BLAST • Basic Local Alignment Search Tool – Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 – The most widely used bioinformatics tool • Which is better: long mediocre match or a few nearby, short, strong matches with the same total score? – Score-wise, exactly equivalent – Biologically, later may be more interesting, & is common – At least, if must miss some, rather miss the former • BLAST is a heuristic algorithm emphasizing the later – speed/sensitivity tradeoff: BLAST may miss former, but gains greatly in speed BLAST • Available at NCBI (National Center for Biotechnology Information) for download and online use. http://blast.ncbi.nlm.nih.gov/ • Along with many sequence databases query Main idea: 1.Construct a dictionary of all the words in the query 2.Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman DB BLAST Original Version …… Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score T (typically T = k) query …… Alignment: Ungapped extensions until score below statistical threshold scan DB Output: All local alignments with score > statistical threshold query BLAST Original Version A C G A A G T A A G G T C C A G T k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC C C C T T C C T G G A T T G C G A Example: Gapped BLAST Added features: • • Pairs of words can initiate alignment Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A A C G A A G T A A G G T C C A G T Example Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins] Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 125138 tacacccagattacaccccga 125158 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 125104 tacacccagattacaccccga 125124 >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 3891 tacacccagattacaccccga 3911 Example Query: Human atoh enhancer, 179 letters [1.5 min] Result: 57 blast hits 1. 2. 3. 4. 5. 6. 7. 8. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres... 355 1e-95 264 4e-68 256 9e-66 78 5e-12 54 7e-05 44 0.068 42 0.27 42 0.27 gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318 Different types of BLAST • blastn: search nucleic acid databases • blastp: search protein databases • blastx: you give a nucleic acid sequence, search protein databases • tblastn: you give a protein sequence, search nucleic acid databases • tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences BLAST cons and pros • Advantages – Fast!!!! – A few minutes to search a database of 1011 bases • Disadvantages – Sensitivity may be low – Often misses weak homologies • New improvement – Make it even faster • Mainly for aligning very similar sequences or really long sequences – E.g. whole genome vs whole genome – Make it more sensitive • PSI-BLAST: iteratively add more homologous sequences • PatternHunter: discontinuous seeds Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower Things we’ve covered so far • Global alignment – Needleman-Wunsch and variants • Local Alignment – Smith-Waterman • Improvement on space and time • Heuristic algorithms – BLAST families • Things we did not cover: – Statistics for sequence alignment – To handle gaps more accurately: affine gap penalty – Multiple sequence alignment