Mathematics and computation behind BLAST and FASTA Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Why string matching? • Biological significance – "We have discovered a new gene that has.
Download ReportTranscript Mathematics and computation behind BLAST and FASTA Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Why string matching? • Biological significance – "We have discovered a new gene that has.
Mathematics and computation behind BLAST and FASTA Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Why string matching? • Biological significance – "We have discovered a new gene that has major effect on liver cancer…" – "We have discovered a new cancer-arresting function of a previously reported reported gene …" • Early applications: Sequence similarity between an oncogene (genes in viruses that cause a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF) • M. D. Waterfield et al. 1983. Nature 304:35-39 • R. F. Doolittle et al. 1983. Science 221:275-227 • Fast computational methods in string matching – FASTA – BLAST – Local pair-wise alignment by dynamic programming Slide 3 FASTA • A commonly used family of alignment and search tools • Generally considered to be more sensitive than BLAST. • Illustration with two fictitious sequences used in the Contig Assembly lecture: Seq1: ACCGCGATGACGAATA Seq2: GAATACGACTGACGATGGA Seq1: ACCGCGATGACGAATA Seq2: GAATACGACTGACGATGGA Slide 4 String Match in FASTA Query Target 1 A G 2 C A 3 C A 4 G T A C G T 1 2 4 8 7 3 6 15 10 5 9 13 11 12 14 16 1 2 3 4 G A A T -3 1 2 -4 -5 -5 -4 -11 -8 -8 -7 -11 -11 -10 -12 -11 -14 -13 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Left Right C G A T G A C G A A T A Move N Move N A C G A C T G A C G A T G G A -1 3 1 6 -2 5 2 7 -3 1 3 3 -4 3 4 3 -5 7 5 6 -6 1 6 3 -7 1 7 3 -8 4 8 5 -9 1 9 2 -10 1 10 2 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 -11 5 11 3 A C G A C T G A C G A T G G A -12 1 12 2 4 4 3 7 7 2 7 11 11 10 14 8 13 14 18 -13 1 13 1 -2 3 1 1 6 -5 5 5 10 8 8 1 11 12 12 -14 1 14 2 -5 1 -2 -2 4 2 2 8 5 5 8 9 9 -15 0 15 0 -8 -5 -5 -5 -2 -1 -1 2 2 2 5 6 6 16 0 -9 -6 -2 1 5 17 0 -11 -8 -4 -1 3 18 1 Left and Right: -n means moving the query left by n sites and n means moving the query right by n sites. Slide 5 Alternative Matched Strings Forw. Back Move N Move N -1 3 1 6 -2 5 2 7 -3 1 3 3 -4 3 4 3 -5 7 5 6 -6 1 6 3 -7 1 7 3 -8 4 8 5 -9 1 9 2 -10 1 10 2 -11 5 11 3 -12 1 12 2 -13 1 13 1 -14 1 14 2 -15 0 15 0 16 0 17 0 18 1 Query: ACCGCGATGACGAATA Target:GAATACGACTGACGATGGA From lecture on contig assembly: One of the three 3rd best Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA From FASTA algorithm: Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Best 2nd best Which one is best based on YOUR judgment? Slide 6 Word length of 2 Query Target 1 A G 2 C A 3 C A 4 G T 5 C A 6 G C 7 A G 8 T A 9 G C 10 A T 11 12 C G G A 13 A C 14 15 16 A T A G A T AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 13 1 7 2 3 6 4 15 8 10 14 5 9 11 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 GA AA AT TA AC CG GA AC CT TG GA AC CG GA AT TG -5 -11 -4 -11 4 3 1 7 2 5 11 10 8 8 8 -8 -11 -5 1 -2 -2 2 2 8 5 1 -11 -5 -5 -1 2 2 Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA 17 18 Left Move G G A -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 17 18 -11 GG GA -12 12 -13 9 -14 6 N 1 2 0 1 4 0 0 1 0 0 4 0 0 0 Right Move N 1 3 2 5 3 1 4 1 5 2 6 1 7 1 8 4 9 1 10 1 11 1 12 1 13 0 14 0 15 0 16 0 17 0 One of the three 2nd best Best Slide 7 BLAST • Adapted from Crane & Raymer 2003 • Motivation: matching short sequences are faster than matching longer ones • Input sequence: AILVPTVIGCTVPT • Algorithm: – Break the query sequence into words: AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT – Discard common words (i.e., words made entirely of common amino acids) – Search for matches against database sequences, assess significance and decide whether to discard to continue with extension using dynamic programming: AILVPTVIGCTVPT MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC • Critical decision: Discard or continue? • The E-value as an answer. Slide 8 Basic stats in string matching • Given PA, PC, PG, PT in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match of the target sequence is: prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2 • Let M be the target sequence length and N be the query sequence length, the “matching operation” can be performed (M – N +1) times, e.g., Query: ATG Target CGATTGCCCG • The probability distribution of the number of matches follows (approximately) a binomial distribution with p = prob and n = (M – N +1) Slide 9 Basic stats in string matching • Probability of having a sequence match: p • Probability of having no match: q = 1-p • Binomial distribution: ( p q) n p n n! n! p n 1q ... p n x q x ... q n (n 1)!1! (n x)! x ! • When np > 50, the binomial distribution can be approximated by the normal distribution with the mean = np and variance = npq ( x ) 2 P( x) 1 e 2 2 2 • When np < 1 and n is very large, binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e., = 2 = np). e x P( x) x! Slide 10 From Binomial to Poisson ( p q) n p n n! n! n! p n 1q ... p n x q x ... p x q n x ... q n (n 1)!1! (n x)! x ! (n x)! x ! P ( n) p n P(n 1) np n 1q n! P(n x) p n x q x (n x)! x ! n! P( x) p x q n x (n x)! x ! P(0) q n n! p x q n x (n x)! x ! n! p xq xqn (n x)! x ! P( x) x n(n 1)(n 2)...(n x 1) p n (1 p ) x! q x n x x np n x p x np (np ) x np pe e e e x! x! x! x! Slide 11 Matching two sequences without gap • Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25. • The probability of finding an exact match of L letters is a = pL = 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST. • M: query length; N: target length, e.g., M = 8, N = 5, L = 3 AACGGTTC CGGTT • A sequence of length L can move at (M – L +1) distinct sites along the query and (N – L +1) distinct sites along the target. • m = (M-L+1) and n = (N-L+1) are called effective lengths of the two sequences. • The expected number of matches with length L is mn2-S, which is called E-value in ungapped BLAST. • S is calculated differently in the gapped BLAST Slide 12 Blast Output (Nuc. Seq.) BLASTN 2.2.4 [Aug-26-2002] ... Query= Seq1 38 Database: MgCDS 480 sequences; 526,317 total letters Sequences producing significant alignments: MG001 1095 bases Score = 34.2 bits (17), Expect = 7e-004 Identities = 35/40 (87%), Gaps = 2/40 (5%) Query: 1 Sbjct: 1 Constant gap penalty vs affine function penalty Score E (bits) Value 34 7e-004 atgaataacg--attatttccaacgacaaaacaaaaccac 38 |||||||||| ||||||||||| |||||| |||||||| atgaataacgttattatttccaataacaaaataaaaccac 40 Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 … effective length of query: 26 effective length of database: 520,557 e E ( E ) x p ( x) x! Typically one would count only 1 GE here. Matches: 35*1 = 35 Mismatches: 3*(-3) = -9 Gap Open: 1*5 = 5 Gap extension: 2*2 =4 R = 35 - 9 - 5 - 4 = 17 S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34 E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04 x p(x) 0 0.999265217 1 0.000734513 2 0.000000270 Slide 13 3 0.000000000 E-Value in BLAST • The e-value is the expected number of random matches that is equally good or better than the reported match. It can be a number near zero or much larger than 1. • It is NOT the probability of finding the reported match. • Only when the e-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide). Slide 14 E-value and P(1) e E ( E ) x p ( x) x! p(1) e E E E (when E 0) 1 P(1) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.00 0.20 0.40 0.60 0.80 1.00 E-value Slide 15 BLAST Programs Program Database Query Typical Uses BLASTN/ME GABLAST Nucleotide Nucleotide MEGABLAST has longer word size than BLASTN BLASTP Protein Protein Query a protein/peptide against a protein database. BLASTX Protein Nucleotide Translate a nuc sequence into a “protein” in six frames and search against a protein database TBLASTN Nucleotide Protein Unannotated nuc sequences (e.g., ESTs) are translated in six frames against which the query protein is searched TBLASTX Nucleotide Nucleotide 6-frame translation of both query and database PHI-BLAST Protein Protein Pattern-hit iterated BLAST PSI-BLAST Protein Protein Position-specific iterated BLAST RPS-BLAST Protein Protein Reverse PSI-BLAST Slide 16 Comparison: BLAST and FASTA • BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with a shorter words). BLAST is faster than FASTA. • For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right. • Both perform dynamic programming for extending the match after the initial match. Slide 17 Optional: BLAST Parameters • Lambda and Karlin-Altschul (K) parameters are important because they directly affect the computation of E value. • Both and K depend on – nucleotide (or aminon acid) frequencies – match-mismatch matrix • All BLAST implementations generally assume that nucleotide (or amino acid) sequences have roughly equal frequencies. • For nucleotide (or amino acid) sequences with strongly biased frequencies, BLAST E value obtained with the assumption can be quite misleading, i.e., one should use appropriate and K. Lambda () and K BLAST output includes lambda () and K. Mathematically, is defined as follows: 4 4 i 1 j 1 pi p j e sij 1 where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or mismatch (when i j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to 4 4 pi p j e sij 4 0.252 e 12 0.252 e3 0.25e 0.75e3 1 i 1 j 1 Now insert different values to the equation above to find which balances the equation (not the trivial solution of = 0) 20 20 i 1 j 1 pi p j e sij 1 (for amino acid sequences) See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K. Finding I: equal , (1, -3) A G C T 0.25 0.25 0.25 0.25 A 0.25 0.0625 0.0625 0.0625 0.0625 G 0.25 0.0625 0.0625 0.0625 0.0625 C 0.25 0.0625 0.0625 0.0625 0.0625 T 0.25 0.0625 0.0625 0.0625 0.0625 1 -3 -3 -3 G -3 1 -3 -3 C -3 -3 1 -3 T -3 -3 -3 1 Match-Mismatch A Lambda 1 0.169893 0.003112 0.003112 0.003112 0.003112 0.169893 0.003112 0.003112 0.003112 0.003112 0.169893 0.003112 0.003112 0.003112 0.003112 0.169893 0.716911 Double-click it, copy to EXCEL and find by using solver. Slide 20 Finding II: Different , (1, -3) A G C T 0.1 0.4 0.4 0.1 A 0.1 0.01 0.04 0.04 0.01 G 0.4 0.04 0.16 0.16 0.04 C 0.4 0.04 0.16 0.16 0.04 T 0.1 0.01 0.04 0.04 0.01 1 -3 -3 -3 G -3 1 -3 -3 C -3 -3 1 -3 T -3 -3 -3 1 A Match-Mismatch Lambda 1 0.027183 0.001991 0.001991 0.000498 0.001991 0.434925 0.007966 0.001991 0.001991 0.007966 0.434925 0.001991 0.000498 0.001991 0.001991 0.027183 0.957075 Finding III: Different , s/v A G C T 0.1 0.4 0.4 0.1 Match-Mismatch A G C T Lambda A 0.1 0.01 0.04 0.04 0.01 G 0.4 0.04 0.16 0.16 0.04 C 0.4 0.04 0.16 0.16 0.04 T 0.1 0.01 0.04 0.04 0.01 1 -1 -3 -3 -1 1 -3 -3 -3 -3 1 -1 -3 -3 -1 1 0.02691 0.014865 0.002053 0.000513 0.014865 0.430554 0.008211 0.002053 1 0.9899 0.002053 0.000513 0.008211 0.002053 0.430554 0.014865 0.014865 0.02691 1.000046 Finding K: equal , (1, -3) A G C T 0.25 0.25 0.25 0.25 A 0.25 0.0625 0.0625 0.0625 0.0625 G 0.25 0.0625 0.0625 0.0625 0.0625 C 0.25 0.0625 0.0625 0.0625 0.0625 T 0.25 0.0625 0.0625 0.0625 0.0625 1 -3 -3 -3 G -3 1 -3 -3 C -3 -3 1 -3 T -3 -3 -3 1 Match-Mismatch A Lambda 1 0.169893 0.003112 0.003112 0.003112 0.003112 0.169893 0.003112 0.003112 0.003112 0.003112 0.169893 0.003112 0.003112 0.003112 0.003112 0.169893 0.716911 Double-click it, copy to EXCEL and find by using solver. Slide 23 Bioinformatics research workflow Accumulation of nucleotide and amino acid sequences: UUCUCAACCAACCAUAAAGAUAU UUCUCUACAAACCACAAAGACAU Storage and annotation of the sequences UUCUCAACCAACCAUAAAGAUAU 1. UUCUCAACCAACCACAAAGACAU UUCUCCACGAACCACAAAGAUAU UUCUCUACAAACCACAAAGAUAU 2. UUCUCAACCAACCACAAAGACAU Functional genomics Systems biology Digital cells Structural annotation with homology search and de novo gene prediction Species-specific gene dictionaries, e.g., yeastgenome.org Functional annotation with gene ontologies UUCUCUACUAACCACAAAGACAU 1. Mutation Selection 2. Adaptation 3. Comparative genomics (the origin of new genes, new features and new species) Phylogenetics (cladogenic process, dating of speciation and gene duplication events) Phylogeny-based inference. 1. 2. 3. 4. Gene/Protein families (e.g., Pfam) Cluster of orthologous genes (e.g., COG) Supermatrix of gene presence/absence Genome-based pair-wise distance distributions Slide 24