Molecular Evolution of Pasteurella multocida During

Download Report

Transcript Molecular Evolution of Pasteurella multocida During

Mathematics and computation behind
BLAST and FASTA
Xuhua Xia
[email protected]
http://dambe.bio.uottawa.ca
Bioinformatics-enabled research
Sequence variation:
Difference in
Difference in
UUCUCUACAAACCACAAAGACAU
1. coding sequences
1. protein abundance
UUCUCAACCAACCAUAAAGAUAU
2. Regulatory sequences
2. protein structure
UUCUCAACCAACCACAAAGACAU
3. transcription
3. cellular localization
4. splicing
4. protein interaction
partners
UUCUCAACCAACCAUAAAGAUAU
UUCUCCACGAACCACAAAGAUAU
UUCUCUACAAACCACAAAGAUAU
UUCUCAACCAACCACAAAGACAU
UUCUCUACUAACCACAAAGACAU
5. translation
6. translated sequences
Difference in
Difference in biochemical
function
1. susceptibility to
diseases
Personalized medicine
Conservation strategies
2. response to medicine
Difference in phenotype
3. Fitness (survival and
reproductive success)
1. morphological
2. physiological
Evolutionary mechanisms
...
Nurturing
environment
3. behavioural
Slide 2
Why string matching?
• Efficient search against large sequence databases
• Practical significance from early applications
– Sequence similarity between an oncogene (genes in viruses that cause
a cancer-like transformation of the infected cells), v-sis, and the
platelet-derived growth factor (PDGF)
• M. D. Waterfield et al. 1983. Nature 304:35-39
• R. F. Doolittle et al. 1983. Science 221:275-227
– Contig assembly
– Functional annotation by homology search
• Fast computational methods in string matching
– FASTA
– BLAST
– Local pair-wise alignment by dynamic programming
Slide 3
Basic stats in string matching
• Given PA, PC, PG, PT in a target (database) sequence, the
probability of a query sequence, say, ATTGCC, having a
perfect match of the target sequence is:
prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2
• Let M be the target sequence length and N be the query
sequence length, the “matching operation” can be performed
(M – N +1) times, e.g.,
Query: ATG
Target CGATTGCCCG
• The probability distribution of the number of matches follows
(approximately) a binomial distribution with p = prob and n =
(M – N +1)
Slide 4
Basic stats in string matching
• Probability of having a sequence match: p
• Probability of having no match: q = 1-p
• Binomial distribution:
( p  q) n  p n 
n!
n!
p n 1q  ... 
p n  x q x  ...  q n
(n  1)!1!
(n  x)! x !
• When np > 50, the binomial distribution can be approximated
by the normal distribution with the mean = np and variance =
npq
( x )

2
P( x) 
1
e
2
2 2
• When np < 1 and n is very large, binomial distribution can be
approximated by the Poisson distribution with mean and
variance equal to np (i.e.,  = 2 = np).
e   x
P( x) 
x!
Slide 5
From Binomial to Poisson
( p  q) n  p n 
n!
n!
n!
p n 1q  ... 
p n  x q x ... 
p x q n  x  ...  q n
(n  1)!1!
(n  x)! x !
(n  x)! x !
P ( n)  p n
P(n  x) 
n!
p x q n x
(n  x)! x !
n!

p xq xqn
(n  x)! x !
P(0)  q n
n(n  1)(n  2)...(n  x  1)  p 
n

(1

p
)
 
x!
q
x
n x x  np n x p x  np (np ) x  np
 

pe 
e 
e e
x!
x!
x!
x!
P(n  1)  np n 1q
n!
p n x q x
(n  x)! x !
n!
P( x) 
p x q n x
(n  x)! x !
P( x) 
x
Slide 6
Matching two sequences without gap
• Assuming equal nucleotide frequencies, the probability of a
nucleotide site in the query sequence matching a site in the
target sequence is p = 0.25.
• The probability of finding an exact match of L letters is a = pL
= 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST.
• M: query length; N: target length, e.g., M = 8, N = 5, L = 3
AACGGTTC
CGGTT
• A sequence of length L can move at (M – L +1) distinct sites
along the query and (N – L +1) distinct sites along the target.
• m = (M-L+1) and n = (N-L+1) are called effective lengths of
the two sequences.
• The expected number of matches with length L is mn2-S,
which is called E-value in ungapped BLAST.
• S is calculated differently in the gapped BLAST
Slide 7
Blast Output (Nuc. Seq.)
BLASTN 2.2.4 [Aug-26-2002]
...
Query= Seq1 38
Database: MgCDS
480 sequences; 526,317 total letters
Sequences producing significant alignments:
MG001 1095 bases
Score = 34.2 bits (17), Expect = 7e-004
Identities = 35/40 (87%), Gaps = 2/40 (5%)
Query: 1
Sbjct: 1
Constant gap penalty vs
affine function penalty
Score
E
(bits) Value
34
7e-004
atgaataacg--attatttccaacgacaaaacaaaaccac 38
|||||||||| ||||||||||| |||||| ||||||||
atgaataacgttattatttccaataacaaaataaaaccac 40
Lambda
K
H
1.37
0.711
1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
…
effective length of query: 26
effective length of database: 520,557
e E ( E ) x
p( x) 
x!
Typically one would
count only 1 GE here.
Matches: 35*1 = 35
Mismatches: 3*(-3) = -9
Gap Open: 1*5 = 5
Gap extension: 2*2 =4
R = 35 - 9 - 5 - 4 = 17
S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34
E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04
x
p(x)
0
0.999265217
1
0.000734513
2
0.000000270
Slide 8
3
0.000000000
Lambda () and K
BLAST output includes lambda () and K. Mathematically,  is defined as follows:
4
4
 pi p j e
sij 
1
i 1 j 1
where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or
mismatch (when i  j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the
simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to
4
4
 pi p j e
sij 
20 20
sij 
 4  0.252 e  12  0.252 e3  0.25e  0.75e3  1
i 1 j 1
 pi p j e
1
(for amino acid sequences)
i 1 j 1
See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.
E-Value in BLAST
• The e-value is the expected number of random
matches that is equally good or better than the
reported match. It can be a number near zero or much
larger than 1.
• It is NOT the probability of finding the reported
match.
• Only when the e-value is extremely small can it be
interpreted as the probability of finding 1 match that
is as good as the reported one (see next slide).
Slide 10
E-value and P(1)
e E ( E ) x
p( x) 
x!
p(1)  e E E  E (when E  0)
1
P(1)
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.00
0.20
0.40
0.60
0.80
1.00
E-value
Slide 11
Gapped BLAST
• Adapted from Crane & Raymer 2003
• Input sequence: AILVPTVIGCTVPT
• Algorithm:
– Break the query sequence into words:
AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT,
GCTV, CTVP, TVPT
– Discard common words (i.e., words made entirely of common amino
acids)
– Search for matches against database sequences, assess significance and
decide whether to discard to continue with extension using dynamic
programming:
AILVPTVIGCTVPT
MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC
Slide 12
BLAST Programs
Program
Database
Query
Typical Uses
BLASTN/ME
GABLAST
Nucleotide
Nucleotide
MEGABLAST has longer word size than
BLASTN
BLASTP
Protein
Protein
Query a protein/peptide against a protein
database.
BLASTX
Protein
Nucleotide
Translate a nuc sequence into a “protein”
in six frames and search against a protein
database
TBLASTN
Nucleotide
Protein
Unannotated nuc sequences (e.g., ESTs)
are translated in six frames against which
the query protein is searched
TBLASTX
Nucleotide
Nucleotide
6-frame translation of both query and
database
PHI-BLAST
Protein
Protein
Pattern-hit iterated BLAST
PSI-BLAST
Protein
Protein
Position-specific iterated BLAST
RPS-BLAST
Protein
Protein
Reverse PSI-BLAST
Slide 13
FASTA
• Another commonly used family of alignment and
search tools
• Generally considered to be more sensitive than
BLAST.
• Illustration with two fictitious sequences used in the
Contig Assembly lecture:
Seq1: ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA
Seq1: ACCGCGATGACGAATA
Seq2:
GAATACGACTGACGATGGA
Slide 14
String Match in FASTA
Query
Target
1
A
G
2
C
A
3
C
A
4
G
T
A C G
T
1
2
4
8
7
3
6 15
10
5
9
13 11 12
14
16
1
2
3
4
G A A
T
-3
1
2 -4
-5 -5 -4 -11
-8 -8 -7
-11 -11 -10
-12 -11
-14 -13
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Left
Right
C G A T G A C G A A T A
Move N Move N
A C G A C T G A C G A T G G A
-1 3
1 6
-2 5
2 7
-3 1
3 3
-4 3
4 3
-5 7
5 6
-6 1
6 3
-7 1
7 3
-8 4
8 5
-9 1
9 2
-10 1
10 2
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
-11 5
11 3
A C G A C T G A C G A T G G A
-12 1
12 2
4 4 3 7 7 2 7 11 11 10 14 8 13 14 18
-13 1
13 1
-2 3 1 1 6 -5 5 5 10 8 8 1 11 12 12
-14 1
14 2
-5 1 -2 -2 4
2 2 8 5 5
8 9 9
-15 0
15 0
-8 -5 -5 -5 -2
-1 -1 2 2 2
5 6 6
16 0
-9
-6
-2
1
5
17 0
-11
-8
-4
-1
3
18 1
Left and Right: -n means moving the query left by n
sites and n means moving the query right by n sites.
Slide 15
Alternative Matched Strings
Forw.
Back
Move N Move N
-1 3
1 6
-2 5
2 7
-3 1
3 3
-4 3
4 3
-5 7
5 6
-6 1
6 3
-7 1
7 3
-8 4
8 5
-9 1
9 2
-10 1
10 2
-11 5
11 3
-12 1
12 2
-13 1
13 1
-14 1
14 2
-15 0
15 0
16 0
17 0
18 1
Query: ACCGCGATGACGAATA
Target:GAATACGACTGACGATGGA
From lecture on contig assembly:
One of the
three 3rd best
Query: ACCGCGATGACGAATA
Target:
GAATACGACTGACGATGGA
From FASTA algorithm:
Query:
ACCGCGATGACGAATA
Target: GAATACGACTGACGATGGA
Query: ACCGCGATGACGAATA
Target:
GAATACGACTGACGATGGA
Best
2nd best
Which one is best based on YOUR judgment?
Slide 16
Word length of 2
Query
Target
1
A
G
2
C
A
3
C
A
4
G
T
5
C
A
6
G
C
7
A
G
8
T
A
9
G
C
10
A
T
11 12
C G
G A
13
A
C
14 15 16
A T A
G A T
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
13
1
7
2
3
6
4
15
8
10
14
5
9
11
12
1
2
3
4
5
6
7
8
9 10 11 12 13 14 15 16
GA AA AT TA AC CG GA AC CT TG GA AC CG GA AT TG
-5 -11 -4 -11
4
3
1
7
2
5 11 10
8 8 8
-8
-11
-5
1 -2 -2
2
2
8
5 1
-11
-5 -5
-1
2
2
Query: ACCGCGATGACGAATA
Target:
GAATACGACTGACGATGGA
Query:
ACCGCGATGACGAATA
Target: GAATACGACTGACGATGGA
17
18
Left
Move
G G A
-1
-2
-3
-4
-5
-6
-7
-8
-9
-10
17 18
-11
GG GA
-12
12
-13
9
-14
6
N
1
2
0
1
4
0
0
1
0
0
4
0
0
0
Right
Move N
1 3
2 5
3 1
4 1
5 2
6 1
7 1
8 4
9 1
10 1
11 1
12 1
13 0
14 0
15 0
16 0
17 0
One of the
three 2nd best
Best
Slide 17
Comparison: BLAST and FASTA
• BLAST starts with exact string matching, while
FASTA starts with inexact string matching (or exact
string matching with a shorter words). BLAST is
faster than FASTA.
• For the examples given, both BLAST and FASTA
will find the same best match, i.e., shifting the query
sequence by 2 sites to the right.
• Both perform dynamic programming for extending
the match after the initial match.
Slide 18
Optional: BLAST Parameters
• Lambda  and Karlin-Altschul (K) parameters are important
because they directly affect the computation of E value.
• Both  and K depend on
– nucleotide (or aminon acid) frequencies
– match-mismatch matrix
• All BLAST implementations generally assume that nucleotide
(or amino acid) sequences have roughly equal frequencies.
• For nucleotide (or amino acid) sequences with strongly biased
frequencies, BLAST E value obtained with the assumption
can be quite misleading, i.e., one should use appropriate  and
K.
Case 1: equal , (-3,1)
A
G
C
T
0.25
0.25
0.25
0.25
Match-Mismatch
A
G
C
T
Lambda
A
0.25
0.0625
0.0625
0.0625
0.0625
G
0.25
0.0625
0.0625
0.0625
0.0625
C
0.25
0.0625
0.0625
0.0625
0.0625
T
0.25
0.0625
0.0625
0.0625
0.0625
1
-3
-3
-3
-3
1
-3
-3
-3
-3
1
-3
-3
-3
-3
1
0.246963
0.001013
0.001013
0.001013
0.001013
0.246963
0.001013
0.001013
0.001013
0.001013
0.246963
0.001013
1.37407
0.001013
0.001013
0.001013
0.246963 1.000007
Case 2: Different , (-3, 1)
A
G
C
T
0.1
0.4
0.4
0.1
Match-Mismatch
A
G
C
T
Lambda
A
0.1
0.01
0.04
0.04
0.01
G
0.4
0.04
0.16
0.16
0.04
C
0.4
0.04
0.16
0.16
0.04
T
0.1
0.01
0.04
0.04
0.01
1
-3
-3
-3
-3
1
-3
-3
-3
-3
1
-3
-3
-3
-3
1
1
1.0501
0.028579 0.001714 0.001714 0.000428
0.001714 0.45727 0.006854 0.001714
0.001714 0.006854 0.45727 0.001714
0.000428 0.001714 0.001714 0.028579 0.999972
Case 3: Different , s/v
A
G
C
T
0.1
0.4
0.4
0.1
Match-Mismatch
A
G
C
T
Lambda
A
0.1
0.01
0.04
0.04
0.01
G
0.4
0.04
0.16
0.16
0.04
C
0.4
0.04
0.16
0.16
0.04
T
0.1
0.01
0.04
0.04
0.01
1
-1
-3
-3
-1
1
-3
-3
-3
-3
1
-1
-3
-3
-1
1
0.02691
0.014865
0.002053
0.000513
0.014865
0.430554
0.008211
0.002053
1
0.9899
0.002053 0.000513
0.008211 0.002053
0.430554 0.014865
0.014865 0.02691 1.000046
K: case 1
0.25
0.25
0.25
0.25
A
0.25
0.0625
0.0625
0.0625
0.0625
G
0.25
0.0625
0.0625
0.0625
0.0625
C
0.25
0.0625
0.0625
0.0625
0.0625
T
0.25
0.0625
0.0625
0.0625
0.0625
-2
0
-1
0
0
1
Match
Mismatch
1
-3
-3
0.75
0
0.25
Type '=karlin(-3,1,true,true,true)' to compute the BLAST parameters. The
Lambda = 1.3741 H = 1.3072 K = 0.7106
K: Case 2
0.1
0.4
0.4
0.1
A
0.1
0.01
0.04
0.04
0.01
G
0.4
0.04
0.16
0.16
0.04
C
0.4
0.04
0.16
0.16
0.04
T
0.1
0.01
0.04
0.04
0.01
-2
0
-1
0
0
1
Match
1
Transition
-1
Transversion
-3
-3
0.5
0.16
Lambda = 0.9898 H = 0.7705 K = 0.4891
0.34
Bioinformatics research workflow
Accumulation of nucleotide
and amino acid sequences:
UUCUCAACCAACCAUAAAGAUAU
UUCUCUACAAACCACAAAGACAU
Storage and annotation of
the sequences
UUCUCAACCAACCAUAAAGAUAU
1.
UUCUCAACCAACCACAAAGACAU
UUCUCCACGAACCACAAAGAUAU
UUCUCUACAAACCACAAAGAUAU
2.
UUCUCAACCAACCACAAAGACAU
Functional genomics
Systems biology
Digital cells
Structural annotation
with homology search
and de novo gene
prediction
Species-specific gene
dictionaries, e.g.,
yeastgenome.org
Functional annotation
with gene ontologies
UUCUCUACUAACCACAAAGACAU
1.
Mutation
Selection
2.
Adaptation
3.
Comparative genomics (the
origin of new genes, new
features and new species)
Phylogenetics (cladogenic
process, dating of speciation
and gene duplication events)
Phylogeny-based inference.
1.
2.
3.
4.
Gene/Protein families (e.g.,
Pfam)
Cluster of orthologous genes
(e.g., COG)
Supermatrix of gene
presence/absence
Genome-based pair-wise
distance distributions
Slide 25