Transcript Slide 1
Lesson 3
Aligning sequences and searching databases

Some Terminology
Matrix = Table
Probability = chance (סיכוי)
Likelihood = plausibility (סבירות)

Global and Local pairwise alignments

Global vs. Local
• Global alignment – finds the best alignment across the entire two sequences.
    ADLGAVFALCDRYFQ
    ||||     |||| |
    ADLGRTQN-CDRYYQ
• Local alignment – finds regions of similarity in parts of the sequences.
    ADLG    CDRYFQ
    ||||    |||| |
    ADLG    CDRYYQ

The sequence similarity is restricted to a single domain
[Figure: PTK2 contains Domain A, a protein tyrosine kinase domain and Domain B; Leukocyte TK contains Domain X and a protein tyrosine kinase domain.]

Which alignment is the correct one?
Sequences:
    AAGTGAATTCGAA
    AGGCTCATTTCTGA
Alignment 1:
    AAG-TGAATT-C-GAA
    AGGCT-CATTTCTGA-
Alignment 2:
    A-AG-TGAATTC--GAA
    AG-GCTCA-TTTCTGA-

Scoring system (naïve)
Perfect match: +1
Mismatch: -2
Indel (gap): -1
Alignment 1:
    AAG-TGAATT-C-GAA
    AGGCT-CATTTCTGA-
    Score = (+1)x9 + (-2)x2 + (-1)x5 = 0
Alignment 2:
    A-AG-TGAATTC--GAA
    AG-GCTCA-TTTCTGA-
    Score = (+1)x8 + (-2)x2 + (-1)x7 = -3
Higher score -> better alignment

DNA scoring matrices
• Uniform substitutions between all nucleotides:
    From \ To    A    G    C    T
    A            2   -2   -6   -6
    G           -2    2   -6   -6
    C           -6   -6    2   -2
    T           -6   -6   -2    2
  Diagonal entries: match. Off-diagonal entries: mismatch.

Scoring gaps (I)
Gap extension penalty < Gap opening penalty

Protein matrices – actual substitutions
The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
    MGYDE
    MGYDE
    MGYEE
    MGYDE
    MGYQE
    MGYDE
    MGYEE
    MGYEE
In the fourth column, E and D are found in 7 / 8 sequences.

PAM Matrices
• Family of matrices: PAM 80, PAM 120, PAM 250
• The number on the PAM matrix represents evolutionary distance
• Larger numbers are for larger distances

Example: PAM 250
Similar amino acids have a greater score.

PAM - limitations
• Based on a single, limited dataset
• Examines proteins with few differences (85% identity)
• Based mainly on small globular proteins, so the matrix is biased

BLOSUM
• Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset
• BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

BLOSUM: Blocks Substitution Matrix
• Based on the BLOCKS database
  – ~2000 blocks from 500 families of related proteins
  – Families of proteins with identical function
• Blocks are short conserved patterns of 3-60 amino acids without gaps
    AABCDA----BBCDA
    DABCDA----BBCBB
    BBBCDA-AA-BCCAA
    AAACDA-A--CBCDB
    CCBADA---DBBDCC
    AAACAA----BBCCC

Example: Blosum62
Derived from blocks where the sequences share at least 62% identity.

PAM vs. BLOSUM
    PAM100 = BLOSUM90
    PAM120 = BLOSUM80
    PAM160 = BLOSUM60
    PAM200 = BLOSUM52
    PAM250 = BLOSUM45
Moving down the list: more distant sequences.

    Amino Acid           Blosum62   PAM 250   Frequency in vertebrates
    W  Tryptophan           11         17        1.30%
    M  Methionine            5          6        1.80%
    H  Histidine             8          6        2.90%
    C  Cysteine              9         12        3.30%
    Y  Tyrosine              7         10        3.30%
    Q  Glutamine             5          4        3.70%
    I  Isoleucine            4          5        3.80%
    F  Phenylalanine         6          9        4%
    R  Arginine              5          6        4.20%
    N  Asparagine            6          2        4.40%
    P  Proline               7          6        5%
    E  Glutamic acid         5          4        5.80%
    D  Aspartic acid         6          4        5.90%
    T  Threonine             5          3        6.20%
    V  Valine                4          4        6.80%
    K  Lysine                5          5        7.20%
    A  Alanine               4          2        7.40%
    G  Glycine               6          5        7.40%
    L  Leucine               4          6        7.60%
    S  Serine                4          2        8.10%

Intermediate summary
1. Scoring system = substitution matrix + gap penalty.
2. Used for both global and local alignment.
3. For amino acids, there are two types of substitution matrices: PAM and Blosum.

Computational Aspects

Many possible alignments
    AAGCTGAATTCGAA
    AGGCTCATTTCTGA

    AAGCTGAATT-C-GAA
    AGGCT-CATTTCTGA-

    AAG-CTGAATT-C-GAA
    AGGCT-CATTT-CTGA-

    AAGCT-GAATT-C-GAA
    A-GGCT-CATTTCTGA-
Which alignment has the best score?
• Two sequences of length 10 have >> 1,000,000 possible alignments
• Two sequences of length 20 have >> 1,000,000,000,000 possible alignments
• Two sequences of length 30 have >> 1,000,000,000,000,000,000 possible alignments

Optimal alignment algorithms
• Needleman-Wunsch (global) [1970]
• Smith-Waterman (local) [1981]
• Two sequences of length 10: 100 computer operations (instead of 1,000,000).
• Two sequences of length 20: 400 computer operations (instead of 1,000,000,000,000).
• Two sequences of length 30: 900 computer operations (instead of 1,000,000,000,000,000,000).

Matrix Representation
Match = 1   Mismatch = -1   Indel = -2
Rows: prefixes of S = AAAC. Columns: prefixes of T = AGC.

                   T:        A     G     C
                        0    1     2     3
    S:        0         0   -2    -4    -6
         A    1        -2    1    -1    -3
         A    2        -4   -1     0    -2
         A    3        -6   -3    -2    -1
         C    4        -8   -5    -4    -1

Each cell holds the best score for aligning a prefix of S with a prefix of T:
• score(AAAC,AGC) = -1
    AAAC
    A-GC
• score(AAA,AG) = -2
    AAA
    A-G
• score(,AG) = -4
    --
    AG
How do we fill in the alignment scores in the matrix? That’s where the algorithm comes into play.

A Useful Link
• http://alggen.lsi.upc.es/docencia/ember/frame-ember.html
  – Gives a step by step illustration of the algorithm for any given pair of sequences.

Homology versus chance similarity

A suggestion
A. Take the two sequences and compute their alignment score.
B. Take one sequence, randomly shuffle it, and compute its score against the second sequence. Repeat 100,000 times.
If the score in A is in the top 5% of the scores in B, the similarity is significant.

Searching databases

Craig Venter’s Cruise
A sequence found on Craig Venter’s cruise:
    …AGGTAGACTAGAGCAGTTAGAACGTTAGTTTA…
Which organism does it come from?
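The matrix-filling step and the shuffling test described in the slides above can be sketched together in a few lines of Python. This is a minimal illustration, not the course's own code: the function names are made up for the example, the scores are the ones from the "Matrix Representation" slides (match 1, mismatch -1, indel -2), and the shuffle test runs far fewer repetitions than the 100,000 suggested on the slide so that it finishes quickly.

```python
import random

# Scoring from the "Matrix Representation" slides.
MATCH, MISMATCH, INDEL = 1, -1, -2


def global_score_matrix(s, t):
    """Fill the Needleman-Wunsch matrix; cell [i][j] scores s[:i] against t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * INDEL  # s[:i] against an empty prefix: i indels
    for j in range(1, cols):
        d[0][j] = j * INDEL
    for i in range(1, rows):
        for j in range(1, cols):
            sub = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            d[i][j] = max(
                d[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                d[i - 1][j] + INDEL,    # s[i-1] against a gap
                d[i][j - 1] + INDEL,    # t[j-1] against a gap
            )
    return d


def global_score(s, t):
    """Best global alignment score: the bottom-right cell of the matrix."""
    return global_score_matrix(s, t)[-1][-1]


def shuffle_p_value(s, t, trials=1000, seed=0):
    """Fraction of shuffled copies of s that score at least as well as s does."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    observed = global_score(s, t)
    letters = list(s)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if global_score("".join(letters), t) >= observed:
            at_least_as_good += 1
    return at_least_as_good / trials


matrix = global_score_matrix("AAAC", "AGC")
for row in matrix:
    print(row)
print("score(AAAC, AGC) =", matrix[-1][-1])
print("shuffle fraction:", shuffle_p_value("AAGTGAATTCGAA", "AGGCTCATTTCTGA", trials=200))
```

Filling the matrix visits each cell once, which is where the 100, 400 and 900 operation counts come from. Under the slide's rule, the similarity would be called significant when the returned fraction is at most 0.05.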
Query: AGGTAGAC
Database:
    GTGAGCAGAGAATAGTTTAAC…
    GAGCTATGTGAGCAGAGAATA…
    CTACGTGAGCAGAGAATAGTT…
    CATAGCTACTATGTGAGCAGA…
    GAGACCAGAGACTACGATAGC…
    CTAAACTGTGAGCAGACTCGT…
    GGGGACAGAGAATAGTTTAAC…
    TAGCTGAGCTATGTGAGCAGA…
    …

Searching a sequence database
The idea: use your sequence as a query to find homologous sequences in a sequence database.
[Figure: a sequence taken from Venter’s trip is used as the query against the database; a matching database sequence is marked as a hit.]

Terminology
• Query sequence – the sequence with which we are searching
• Hit – a sequence found in the database, suspected to be homologous

Protein or DNA search

Query sequence: DNA or protein?
• For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.
• Which is preferable if we want to learn about homology?

Amino acids are better!
• Selection (and hence conservation) works (mostly) at the protein level:
    CTT TCA = Leu-Ser
    TTG AGT = Leu-Ser

Query type
• Nucleotides: a four letter alphabet
• Amino acids: a twenty letter alphabet
• Two random DNA sequences will, on average, have 25% identity
• Two random protein sequences will, on average, have 5% identity

Computation time

Searching a sequence database
Assuming 10 comparisons every second, a full comparison of the query to the database requires 11.5 days.
[Figure: a query compared against a database of 10^7 sequences.]

How do we search a database?
• 11.5 days is OK if we are doing it once.
• 150,000 searches (at least!!) are performed per day.
• >82,000,000 sequence records in GenBank.
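The 11.5 days quoted above follows from simple arithmetic. The sketch below assumes a database of 10^7 sequences, which is the size that reproduces the quoted figure at 10 comparisons per second; the function name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def full_scan_days(n_sequences, comparisons_per_second):
    """Days needed to compare one query against every sequence in the database."""
    return n_sequences / comparisons_per_second / SECONDS_PER_DAY


# Assumed database size: 10**7 sequences, compared at 10 per second.
days = full_scan_days(10**7, 10)
print(f"{days:.2f} days for 10^7 sequences")

# The same rate applied to the GenBank record count quoted on the slide.
print(f"{full_scan_days(82_000_000, 10):.1f} days for 82,000,000 records")
```

The first result is about 11.57 days, in line with the roughly 11.5 days on the slide. Since the time grows linearly with database size, an exhaustive comparison cannot keep up with 150,000 searches per day.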
Heuristic
• Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but gives one that is not too bad) and takes less time than the exact solution.

BLAST
• BLAST - Basic Local Alignment Search Tool
• A heuristic for searching a database for similar sequences

BLAST - underlying hypothesis
• The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them
• The heuristic:
  1. Discard irrelevant sequences
  2. Perform exact local alignment only with the remaining sequences

How do we discard irrelevant sequences quickly?
• Divide the database into words of length w (default: w = 3 for protein and w = 11 for DNA)
• Save the words in a look-up table that can be searched quickly
    AGCTTAGACTAAAGC…
    AGCTTAGACTA
     GCTTAGACTAA
      CTTAGACTAAA
       TTAGACTAAAG
        TAGACTAAAGC
    …

BLAST: discarding sequences
• When the user enters a query sequence, it is also divided into words
• Search the database for consecutive neighboring words

Search for consecutive words
[Figure: neighbor words from the query matched against a database record.]
This is the filtering stage – many unrelated hits are filtered, saving lots of time!

Try to extend the alignment
• Stop extending when the score of the alignment drops X beneath the maximal score obtained so far
• Discard segments with score < S
    AAGACCTAGGCATTAAGCATTTAAGAGA
    GGAAGACAGGCATTAAGCGTCAAAGAGG
Labels in the figure: Score=11, X=4, Score=9, Score=9, Score=7

The result – local alignment
• The result of BLAST will be a series of local alignments between the query and the different hits found
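The word look-up table and the extension rule from the slides above can be sketched as follows. This is a simplified illustration and not BLAST itself: it seeds on exact word matches only, extends in one direction only, reports overlapping seeds separately, and uses an assumed scoring of +1 for a match and -1 for a mismatch. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, w):
    """Look-up table: each length-w word -> list of (record id, offset)."""
    index = defaultdict(list)
    for rec_id, seq in enumerate(database):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((rec_id, pos))
    return index


def extend_right(query, subject, q_pos, s_pos, x_drop, match=1, mismatch=-1):
    """Ungapped extension; stop once the score is x_drop below the best so far."""
    score = best = best_len = 0
    length = 0
    while q_pos + length < len(query) and s_pos + length < len(subject):
        same = query[q_pos + length] == subject[s_pos + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break
    return best, best_len


def search(query, database, w, x_drop, min_score):
    """Seed on shared words, extend each seed, keep segments scoring >= min_score."""
    index = build_word_index(database, w)
    hits = []
    for q_pos in range(len(query) - w + 1):
        word = query[q_pos:q_pos + w]
        for rec_id, s_pos in index.get(word, []):
            score, length = extend_right(query, database[rec_id], q_pos, s_pos, x_drop)
            if score >= min_score:
                hits.append((rec_id, q_pos, s_pos, score, length))
    return hits


# Toy data; w is kept small here only so the example fits on a few lines.
toy_database = ["TTTTACGTACGTTTTT", "GGGGGGGG"]
for hit in search("ACGTACGA", toy_database, w=4, x_drop=2, min_score=7):
    print(hit)
```

Records that share no word with the query are never touched, which is the filtering effect described in the slides. The parameters `x_drop` and `min_score` play the roles of X and S.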