Mathematics and computation behind BLAST and FASTA Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca Why string matching? • Biological significance – "We have discovered a new gene that has.

Mathematics and computation behind BLAST and FASTA Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Why string matching? • Biological significance – "We have discovered a new gene that has.

Transcript Mathematics and computation behind BLAST and FASTA Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Why string matching? • Biological significance – "We have discovered a new gene that has.

Mathematics and computation behind
BLAST and FASTA
Xuhua Xia
[email protected]
http://dambe.bio.uottawa.ca
Why string matching?
• Biological significance
– "We have discovered a new gene that has major effect on liver
cancer…"
– "We have discovered a new cancer-arresting function of a previously
reported reported gene …"
• Early applications: Sequence similarity between an oncogene (genes in
viruses that cause a cancer-like transformation of the infected cells), v-sis,
and the platelet-derived growth factor (PDGF)
• M. D. Waterfield et al. 1983. Nature 304:35-39
• R. F. Doolittle et al. 1983. Science 221:275-227
• Fast computational methods in string matching
– FASTA
– BLAST
– Local pair-wise alignment by dynamic programming
Slide 3
FASTA
• A commonly used family of alignment and search
tools
• Generally considered to be more sensitive than
BLAST.
• Illustration with two fictitious sequences used in the
Contig Assembly lecture:
Seq1: ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA
Seq1: ACCGCGATGACGAATA
Seq2:
GAATACGACTGACGATGGA
Slide 4
String Match in FASTA
Query
Target
1
A
G
2
C
A
3
C
A
4
G
T
A C G
T
1
2
4
8
7
3
6 15
10
5
9
13 11 12
14
16
1
2
3
4
G A A
T
-3
1
2 -4
-5 -5 -4 -11
-8 -8 -7
-11 -11 -10
-12 -11
-14 -13
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Left
Right
C G A T G A C G A A T A
Move N Move N
A C G A C T G A C G A T G G A
-1 3
1 6
-2 5
2 7
-3 1
3 3
-4 3
4 3
-5 7
5 6
-6 1
6 3
-7 1
7 3
-8 4
8 5
-9 1
9 2
-10 1
10 2
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
-11 5
11 3
A C G A C T G A C G A T G G A
-12 1
12 2
4 4 3 7 7 2 7 11 11 10 14 8 13 14 18
-13 1
13 1
-2 3 1 1 6 -5 5 5 10 8 8 1 11 12 12
-14 1
14 2
-5 1 -2 -2 4
2 2 8 5 5
8 9 9
-15 0
15 0
-8 -5 -5 -5 -2
-1 -1 2 2 2
5 6 6
16 0
-9
-6
-2
1
5
17 0
-11
-8
-4
-1
3
18 1
Left and Right: -n means moving the query left by n
sites and n means moving the query right by n sites.
Slide 5
Alternative Matched Strings
Forw.
Back
Move N Move N
-1 3
1 6
-2 5
2 7
-3 1
3 3
-4 3
4 3
-5 7
5 6
-6 1
6 3
-7 1
7 3
-8 4
8 5
-9 1
9 2
-10 1
10 2
-11 5
11 3
-12 1
12 2
-13 1
13 1
-14 1
14 2
-15 0
15 0
16 0
17 0
18 1
Query: ACCGCGATGACGAATA
Target:GAATACGACTGACGATGGA
From lecture on contig assembly:
One of the
three 3rd best
Query: ACCGCGATGACGAATA
Target:
GAATACGACTGACGATGGA
From FASTA algorithm:
Query:
ACCGCGATGACGAATA
Target: GAATACGACTGACGATGGA
Query: ACCGCGATGACGAATA
Target:
GAATACGACTGACGATGGA
Best
2nd best
Which one is best based on YOUR judgment?
Slide 6
Word length of 2
Query
Target
1
A
G
2
C
A
3
C
A
4
G
T
5
C
A
6
G
C
7
A
G
8
T
A
9
G
C
10
A
T
11 12
C G
G A
13
A
C
14 15 16
A T A
G A T
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
13
1
7
2
3
6
4
15
8
10
14
5
9
11
12
1
2
3
4
5
6
7
8
9 10 11 12 13 14 15 16
GA AA AT TA AC CG GA AC CT TG GA AC CG GA AT TG
-5 -11 -4 -11
4
3
1
7
2
5 11 10
8 8 8
-8
-11
-5
1 -2 -2
2
2
8
5 1
-11
-5 -5
-1
2
2
Query: ACCGCGATGACGAATA
Target:
GAATACGACTGACGATGGA
Query:
ACCGCGATGACGAATA
Target: GAATACGACTGACGATGGA
17
18
Left
Move
G G A
-1
-2
-3
-4
-5
-6
-7
-8
-9
-10
17 18
-11
GG GA
-12
12
-13
9
-14
6
N
1
2
0
1
4
0
0
1
0
0
4
0
0
0
Right
Move N
1 3
2 5
3 1
4 1
5 2
6 1
7 1
8 4
9 1
10 1
11 1
12 1
13 0
14 0
15 0
16 0
17 0
One of the
three 2nd best
Best
Slide 7
BLAST
• Adapted from Crane & Raymer 2003
• Motivation: matching short sequences are faster than matching
longer ones
• Input sequence: AILVPTVIGCTVPT
• Algorithm:
– Break the query sequence into words:
AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT,
GCTV, CTVP, TVPT
– Discard common words (i.e., words made entirely of common amino
acids)
– Search for matches against database sequences, assess significance and
decide whether to discard to continue with extension using dynamic
programming:
AILVPTVIGCTVPT
MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC
• Critical decision: Discard or continue?
• The E-value as an answer.
Slide 8
Basic stats in string matching
• Given PA, PC, PG, PT in a target (database) sequence, the
probability of a query sequence, say, ATTGCC, having a
perfect match of the target sequence is:
prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2
• Let M be the target sequence length and N be the query
sequence length, the “matching operation” can be performed
(M – N +1) times, e.g.,
Query: ATG
Target CGATTGCCCG
• The probability distribution of the number of matches follows
(approximately) a binomial distribution with p = prob and n =
(M – N +1)
Slide 9
Basic stats in string matching
• Probability of having a sequence match: p
• Probability of having no match: q = 1-p
• Binomial distribution:
( p  q) n  p n 
n!
n!
p n 1q  ... 
p n  x q x  ...  q n
(n  1)!1!
(n  x)! x !
• When np > 50, the binomial distribution can be approximated
by the normal distribution with the mean = np and variance =
npq
( x )

2
P( x) 
1
e
2
2 2
• When np < 1 and n is very large, binomial distribution can be
approximated by the Poisson distribution with mean and
variance equal to np (i.e.,  = 2 = np).
e   x
P( x) 
x!
Slide 10
From Binomial to Poisson
( p  q) n  p n 
n!
n!
n!
p n 1q  ... 
p n  x q x ... 
p x q n  x  ...  q n
(n  1)!1!
(n  x)! x !
(n  x)! x !
P ( n)  p n
P(n  1)  np n 1q
n!
P(n  x) 
p n x q x
(n  x)! x !
n!
P( x) 
p x q n x
(n  x)! x !
P(0)  q n
n!
p x q n x
(n  x)! x !
n!

p xq xqn
(n  x)! x !
P( x) 
x
n(n  1)(n  2)...(n  x  1)  p 
n

(1

p
)
 
x!
q
x
n x x  np n x p x  np (np ) x  np
 

pe 
e 
e e
x!
x!
x!
x!
Slide 11
Matching two sequences without gap
• Assuming equal nucleotide frequencies, the probability of a
nucleotide site in the query sequence matching a site in the
target sequence is p = 0.25.
• The probability of finding an exact match of L letters is a = pL
= 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST.
• M: query length; N: target length, e.g., M = 8, N = 5, L = 3
AACGGTTC
CGGTT
• A sequence of length L can move at (M – L +1) distinct sites
along the query and (N – L +1) distinct sites along the target.
• m = (M-L+1) and n = (N-L+1) are called effective lengths of
the two sequences.
• The expected number of matches with length L is mn2-S,
which is called E-value in ungapped BLAST.
• S is calculated differently in the gapped BLAST
Slide 12
Blast Output (Nuc. Seq.)
BLASTN 2.2.4 [Aug-26-2002]
...
Query= Seq1 38
Database: MgCDS
480 sequences; 526,317 total letters
Sequences producing significant alignments:
MG001 1095 bases
Score = 34.2 bits (17), Expect = 7e-004
Identities = 35/40 (87%), Gaps = 2/40 (5%)
Query: 1
Sbjct: 1
Constant gap penalty vs
affine function penalty
Score
E
(bits) Value
34
7e-004
atgaataacg--attatttccaacgacaaaacaaaaccac 38
|||||||||| ||||||||||| |||||| ||||||||
atgaataacgttattatttccaataacaaaataaaaccac 40
Lambda
K
H
1.37
0.711
1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
…
effective length of query: 26
effective length of database: 520,557
e E ( E ) x
p ( x) 
x!
Typically one would
count only 1 GE here.
Matches: 35*1 = 35
Mismatches: 3*(-3) = -9
Gap Open: 1*5 = 5
Gap extension: 2*2 =4
R = 35 - 9 - 5 - 4 = 17
S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34
E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04
x
p(x)
0
0.999265217
1
0.000734513
2
0.000000270
Slide 13
3
0.000000000
E-Value in BLAST
• The e-value is the expected number of random
matches that is equally good or better than the
reported match. It can be a number near zero or much
larger than 1.
• It is NOT the probability of finding the reported
match.
• Only when the e-value is extremely small can it be
interpreted as the probability of finding 1 match that
is as good as the reported one (see next slide).
Slide 14
E-value and P(1)
e E ( E ) x
p ( x) 
x!
p(1)  e E E  E (when E  0)
1
P(1)
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.00
0.20
0.40
0.60
0.80
1.00
E-value
Slide 15
BLAST Programs
Program
Database
Query
Typical Uses
BLASTN/ME
GABLAST
Nucleotide
Nucleotide
MEGABLAST has longer word size than
BLASTN
BLASTP
Protein
Protein
Query a protein/peptide against a protein
database.
BLASTX
Protein
Nucleotide
Translate a nuc sequence into a “protein”
in six frames and search against a protein
database
TBLASTN
Nucleotide
Protein
Unannotated nuc sequences (e.g., ESTs)
are translated in six frames against which
the query protein is searched
TBLASTX
Nucleotide
Nucleotide
6-frame translation of both query and
database
PHI-BLAST
Protein
Protein
Pattern-hit iterated BLAST
PSI-BLAST
Protein
Protein
Position-specific iterated BLAST
RPS-BLAST
Protein
Protein
Reverse PSI-BLAST
Slide 16
Comparison: BLAST and FASTA
• BLAST starts with exact string matching, while
FASTA starts with inexact string matching (or exact
string matching with a shorter words). BLAST is
faster than FASTA.
• For the examples given, both BLAST and FASTA
will find the same best match, i.e., shifting the query
sequence by 2 sites to the right.
• Both perform dynamic programming for extending
the match after the initial match.
Slide 17
Optional: BLAST Parameters
• Lambda  and Karlin-Altschul (K) parameters are important
because they directly affect the computation of E value.
• Both  and K depend on
– nucleotide (or aminon acid) frequencies
– match-mismatch matrix
• All BLAST implementations generally assume that nucleotide
(or amino acid) sequences have roughly equal frequencies.
• For nucleotide (or amino acid) sequences with strongly biased
frequencies, BLAST E value obtained with the assumption
can be quite misleading, i.e., one should use appropriate  and
K.
Lambda () and K
BLAST output includes lambda () and K. Mathematically,  is defined as follows:
4
4

i 1 j 1
pi p j e
sij 
1
where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or
mismatch (when i  j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the
simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to
4 4
 pi p j e
sij 
 4  0.252 e  12  0.252 e3  0.25e  0.75e3  1
i 1 j 1
Now insert different  values to the equation above to find which  balances the equation
(not the trivial solution of  = 0)
20 20

i 1 j 1
pi p j e
sij 
1
(for amino acid sequences)
See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.
Finding  I: equal , (1, -3)
A
G
C
T
0.25
0.25
0.25
0.25
A
0.25
0.0625
0.0625
0.0625
0.0625
G
0.25
0.0625
0.0625
0.0625
0.0625
C
0.25
0.0625
0.0625
0.0625
0.0625
T
0.25
0.0625
0.0625
0.0625
0.0625
1
-3
-3
-3
G
-3
1
-3
-3
C
-3
-3
1
-3
T
-3
-3
-3
1
Match-Mismatch
A
Lambda
1
0.169893 0.003112 0.003112 0.003112
0.003112 0.169893 0.003112 0.003112
0.003112 0.003112 0.169893 0.003112
0.003112 0.003112 0.003112 0.169893 0.716911
Double-click it, copy to EXCEL and find  by using solver.
Slide 20
Finding  II: Different , (1, -3)
A
G
C
T
0.1
0.4
0.4
0.1
A
0.1
0.01
0.04
0.04
0.01
G
0.4
0.04
0.16
0.16
0.04
C
0.4
0.04
0.16
0.16
0.04
T
0.1
0.01
0.04
0.04
0.01
1
-3
-3
-3
G
-3
1
-3
-3
C
-3
-3
1
-3
T
-3
-3
-3
1
A
Match-Mismatch
Lambda
1
0.027183 0.001991 0.001991 0.000498
0.001991 0.434925 0.007966 0.001991
0.001991 0.007966 0.434925 0.001991
0.000498 0.001991 0.001991 0.027183 0.957075
Finding  III: Different , s/v
A
G
C
T
0.1
0.4
0.4
0.1
Match-Mismatch
A
G
C
T
Lambda
A
0.1
0.01
0.04
0.04
0.01
G
0.4
0.04
0.16
0.16
0.04
C
0.4
0.04
0.16
0.16
0.04
T
0.1
0.01
0.04
0.04
0.01
1
-1
-3
-3
-1
1
-3
-3
-3
-3
1
-1
-3
-3
-1
1
0.02691
0.014865
0.002053
0.000513
0.014865
0.430554
0.008211
0.002053
1
0.9899
0.002053 0.000513
0.008211 0.002053
0.430554 0.014865
0.014865 0.02691 1.000046
Finding K: equal , (1, -3)
A
G
C
T
0.25
0.25
0.25
0.25
A
0.25
0.0625
0.0625
0.0625
0.0625
G
0.25
0.0625
0.0625
0.0625
0.0625
C
0.25
0.0625
0.0625
0.0625
0.0625
T
0.25
0.0625
0.0625
0.0625
0.0625
1
-3
-3
-3
G
-3
1
-3
-3
C
-3
-3
1
-3
T
-3
-3
-3
1
Match-Mismatch
A
Lambda
1
0.169893 0.003112 0.003112 0.003112
0.003112 0.169893 0.003112 0.003112
0.003112 0.003112 0.169893 0.003112
0.003112 0.003112 0.003112 0.169893 0.716911
Double-click it, copy to EXCEL and find  by using solver.
Slide 23
Bioinformatics research workflow
Accumulation of nucleotide
and amino acid sequences:
UUCUCAACCAACCAUAAAGAUAU
UUCUCUACAAACCACAAAGACAU
Storage and annotation of
the sequences
UUCUCAACCAACCAUAAAGAUAU
1.
UUCUCAACCAACCACAAAGACAU
UUCUCCACGAACCACAAAGAUAU
UUCUCUACAAACCACAAAGAUAU
2.
UUCUCAACCAACCACAAAGACAU
Functional genomics
Systems biology
Digital cells
Structural annotation
with homology search
and de novo gene
prediction
Species-specific gene
dictionaries, e.g.,
yeastgenome.org
Functional annotation
with gene ontologies
UUCUCUACUAACCACAAAGACAU
1.
Mutation
Selection
2.
Adaptation
3.
Comparative genomics (the
origin of new genes, new
features and new species)
Phylogenetics (cladogenic
process, dating of speciation
and gene duplication events)
Phylogeny-based inference.
1.
2.
3.
4.
Gene/Protein families (e.g.,
Pfam)
Cluster of orthologous genes
(e.g., COG)
Supermatrix of gene
presence/absence
Genome-based pair-wise
distance distributions
Slide 24

Mathematics and computation behind BLAST and FASTA Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Why string matching? • Biological significance – "We have discovered a new gene that has.

Transcript Mathematics and computation behind BLAST and FASTA Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Why string matching? • Biological significance – "We have discovered a new gene that has.

Directory