Transcript Slides

CS 6293 Advanced Topics:
Current Bioinformatics
Lectures 3-4: Pair-wise Sequence
Alignment
Outline
•
•
•
•
•
Biological background
Global sequence alignment
Local sequence alignment
Optional: linear-space alignment algorithm
Heuristic alignment: BLAST
Evolution at the DNA level
C
…ACGGTGCAGTCACCA…
…ACGTTGC-GTCCACCA…
DNA evolutionary events (sequence edits):
Mutation, deletion, insertion
Sequence conservation implies function
next generation
OK
OK
OK
X
X
Still OK?
Why sequence alignment?
• Conserved regions are more likely to be
functional
– Can be used for finding genes, regulatory elements,
etc.
• Similar sequences often have similar origin and
function
– Can be used to predict functions for new genes /
proteins
• Sequence alignment is one of the most widely
used computational tools in biology
Global Sequence Alignment
S
T
S’
T’
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
An alignment of two strings S, T is a pair of strings S’,
T’ (with spaces) s.t.
(1) |S’| = |T’|, and (|S| = “length of S”)
(2) removing all spaces in S’, T’ leaves S, T
What is a good alignment?
Alignment:
The “best” way to match the letters of one
sequence with those of the other
How do we define “best”?
S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• The score of aligning (characters or
spaces) x & y is σ (x,y).
• Score of an alignment:
• An optimal alignment: one with max score
Scoring Function
• Sequence edits:
– Mutations
– Insertions
– Deletions
Scoring Function:
Match:
+m
Mismatch:
-s
Gap (indel):
-d
AGGCCTC
AGGACTC
AGGGCCTC
AGG-CTC
~~~AAC~~~
~~~A-A~~~
• Match = 2, mismatch = -1, gap = -1
• Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
More complex scoring function
• Substitution matrix
– Similarity score of matching two letters a, b
should reflect the probability of a, b derived
from the same ancestor
– It is usually defined by log likelihood ratio
– Active research area. Especially for proteins.
– Commonly used: PAM, BLOSUM
An example substitution matrix
A
C
G
T
A
C
G
T
3
-2
-1
-2
3
-2
-1
3
-2
3
How to find an optimal alignment?
• A naïve algorithm:
for all subseqs A of S, B of T s.t. |A| = |B| do
align A[i] with B[i], 1 ≤i ≤|A|
align all other chars to spaces
compute its value
S = abcd A = cd
T = wxyz B = xz
retain the max
-abc-d a-bc-d
end
w--xyz -w-xyz
output the retained alignment
Analysis
• Assume |S| = |T| = n
• Cost of evaluating one alignment: ≥n
• How many alignments are there:
– pick n chars of S,T together
– say k of them are in S
– match these k to the k unpicked chars of T
• Total time:
• E.g., for n = 20, time is > 240 >1012 operations
Dynamic Programming for
sequence alignment
Suppose we wish to align
x1……xM
y1……yN
Let F(i,j) = optimal score of aligning
x1……xi
y1……yj
Scoring Function:
Match:
+m
Mismatch:
-s
Gap (indel):
-d
Elements of dynamic programming
• Optimal sub-structures
– Optimal solutions to the original problem
contains optimal solutions to sub-problems
• Overlapping sub-problems
– Some sub-problems appear in many solutions
• Memorization and reuse
– Carefully choose the order that sub-problems
are solved
Optimal substructure
1
2
i
M
...
x:
1
y:
2
j
N
...
• If x[i] is aligned to y[j] in the optimal alignment
between x[1..M] and y[1..N], then
• The alignment between x[1..i] and y[1..j] is also
optimal
• Easy to prove by contradiction
Recursive solution
Notice three possible cases:
1.
~~~~~~~ xM
~~~~~~~ yN
2.
F(M,N) = F(M-1, N-1) +
-s, if not
xM aligns to a gap
~~~~~~~ xM
~~~~~~~ 
3.
m, if xM = yN
xM aligns to yN
max
F(M,N) = F(M-1, N) - d
yN aligns to a gap
~~~~~~~ 
~~~~~~~ yN
F(M,N) = F(M, N-1) - d
Recursive solution
• Generalize:
F(i,j) = max
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
(Xi,Yj) = m if Xi = Yj, and –s otherwise
• Boundary conditions:
– F(0, 0) = 0.
-jd: y[1..j] aligned to gaps.
– F(0, j) = ?
-id: x[1..i] aligned to gaps.
– F(i, 0) = ?
What order to fill?
F(0,0)
F(i-1, j-1)1
F(i-1, j)
1
i
F(i, j-1)
3
2
F(i, j)
F(M,N)
j
What order to fill?
F(0,0)
F(M,N)
Example
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
1
A
2
T
3
A
0
1
2
A
G
3
T
4
A
Example
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
0
0
j=0
1
A
-1
2
T
-2
3
A
-3
1
2
3
4
A
G
T
A
-1
-2
-3
-4
Example
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
A
G
T
A
0
-1
-2
-3
-4
1
0
-1
-2
1
A
-1
2
T
-2
3
A
-3
3
4
Example
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
Example
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal Alignment:
F(4,3) = 2
Example
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal
Alignment:
F(4,3) = 2
This only tells us
the best score
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
A
2
T
-2
0
0
1
0
A
3
A
-3
-1
-1
0
2
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
T A
2
T
-2
0
0
1
0
T A
3
A
-3
-1
-1
0
2
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
G T A
2
T
-2
0
0
1
0
-
3
A
-3
-1
-1
0
2
T A
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
A G T A
2
T
-2
0
0
1
0
A -
3
A
-3
-1
-1
0
2
T A
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
1
A
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
-1
1
0
-1
-2
m= 1
s =1
d =1
4
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal Alignment:
F(4,3) = 2
AGTA
ATA
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
0
0
j=0
1
A
-1
2
T
-2
3
A
-3
1
2
3
4
A
G
T
A
-1
-2
-3
-4
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
A
G
T
A
0
-1
-2
-3
-4
1
0
-1
-2
1
A
-1
2
T
-2
3
A
-3
3
4
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
1
A
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal Alignment:
F(4,3) = 2
AGTA
ATA
The Needleman-Wunsch
Algorithm
1.
Initialization.
a.
b.
c.
2.
F(0, 0)
F(0, j)
F(i, 0)
= 0
=-jd
=-id
Main Iteration. Filling in scores
a.
For each i = 1……M
For each j = 1……N
F(i, j)
Ptr(i,j)
3.
= max
=
F(i-1,j) – d
F(i, j-1) – d
F(i-1, j-1) + σ(xi, yj)
UP,
LEFT
DIAG
[case 1]
[case 2]
[case 3]
if [case 1]
if [case 2]
if [case 3]
Termination. F(M, N) is the optimal score, and from Ptr(M, N) can
trace back optimal alignment
Complexity
• Time:
O(NM)
• Space:
O(NM)
• Linear-space algorithms do exist (with the
same time complexity)
Equivalent graph problem
S1 =
A
G
T
A
(0,0)
S2 =
A
1
1
: a gap in the 2nd sequence
: a gap in the 1st sequence
1
T
1
A
: match / mismatch
1
Value on vertical/horizontal line: -d
Value on diagonal: m or -s
(3,4)
• Number of steps: length of the alignment
• Path length: alignment score
• Optimal alignment: find the longest path from (0, 0) to (3, 4)
• General longest path problem cannot be found with DP. Longest path on this
graph can be found by DP since no cycle is possible.
Question
• If we change the scoring scheme, will the
optimal alignment be changed?
– Old: Match = 1, mismatch = gap = -1
– New: match = 2, mismatch = gap = 0
– New: Match = 2, mismatch = gap = -2?
Question
• What kind of alignment is represented by
these paths?
A
A
A
A
A
B
B
B
B
B
C
C
C
C
C
A-
A--
--A
-A-
-A
BC
-BC
BC-
B-C
BC
Alternating gaps are impossible if –s > -2d
A variant of the basic algorithm
Scoring scheme: m = s = d: 1
Seq1: CAGCA-CTTGGATTCTCGG
|| |:|||
Seq2: ---CAGCGTGG--------
Seq1: CAGCACTTGGATTCTCGG
||||
| |
||
Seq2: CAGC-----G-T----GG
Score = -7
Score = -2
The first alignment may be biologically more realistic in
some cases (e.g. if we know s2 is a subsequence of s1)
A variant of the basic algorithm
• Maybe it is OK to have an unlimited # of
gaps in the beginning and end:
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
• Then, we don’t want to penalize gaps in
the ends
The Overlap Detection variant
yN ……………………………… y1
x1 ……………………………… xM
Changes:
1.
Initialization
For all i, j,
F(i, 0) = 0
F(0, j) = 0
2.
Termination
maxi F(i, N)
FOPT = max
maxj F(M, j)
Different types of overlaps
x
x
y
y
The local alignment problem
Given two strings
X = x1……xM,
Y = y1……yN
Find substrings x’, y’ whose similarity (optimal
global alignment value) is maximum
e.g. X = abcxdex
Y = xxxcde
x
y
X’ = cxde
Y’ = c-de
Why local alignment
• Conserved regions may be a small part of the
whole
– Global alignment might miss them if flanking “junk”
outweighs similar regions
• Genes are shuffled between genomes
A
B
C
B
D
D
A
C
Naïve algorithm
for all substrings X’ of X and Y’ of Y
Align X’ & Y’ via dynamic programming
Retain pair with max value
end ;
Output the retained pair
• Time: O(n2) choices for A, O(m2) for B,
O(nm) for DP, so O(n3m3 ) total.
Reminder
• The overlap detection algorithm
– We do not give penalty to gaps at either end
Free gap
Free gap
The local alignment idea
• Do not penalize the unaligned regions (gaps or
mismatches)
• The alignment can start anywhere and ends anywhere
• Strategy: whenever we get to some low similarity region
(negative score), we restart a new alignment
– By resetting alignment score to zero
The Smith-Waterman algorithm
Initialization: F(0, j) = F(i, 0) = 0
0
F(i – 1, j) – d
Iteration: F(i, j) = max
F(i, j – 1) – d
F(i – 1, j – 1) + (xi, yj)
The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment…
FOPT = maxi,j F(i, j)
2. If we want all local alignments scoring > t
For all i, j find F(i, j) > t, and trace back
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
x
0
x
0
c
0
d
0
e
0
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
x
0
0
0
x
0
0
0
c
0
0
0
d
0
0
0
e
0
0
0
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
x
0
0
0
0
x
0
0
0
0
c
0
0
0
2
d
0
0
0
1
e
0
0
0
0
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
x
0
0
0
0
2
x
0
0
0
0
2
c
0
0
0
2
1
d
0
0
0
1
0
e
0
0
0
0
0
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
x
0
0
0
0
2
1
x
0
0
0
0
2
1
c
0
0
0
2
1
1
d
0
0
0
1
0
3
e
0
0
0
0
0
2
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
0
x
0
0
0
0
2
1
0
x
0
0
0
0
2
1
0
c
0
0
0
2
1
1
0
d
0
0
0
1
0
3
2
e
0
0
0
0
0
2
5
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
c
0
0
0
2
1
1
0
1
d
0
0
0
1
1
3
2
1
e
0
0
0
0
0
2
5
4
Trace back
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
c
0
0
0
2
1
1
0
1
d
0
0
0
1
1
3
2
1
e
0
0
0
0
0
2
5
4
Trace back
Match: 2
Mismatch: -1
Gap: -1
cxde
| ||
c-de
x-de
| ||
xcde
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
c
0
0
0
2
1
1
0
1
d
0
0
0
1
1
3
2
1
e
0
0
0
0
0
2
5
4
• No negative values in local alignment DP
array
• Optimal local alignment will never have a
gap on either end
• Local alignment: “Smith-Waterman”
• Global alignment: “Needleman-Wunsch”
Analysis
• Time:
– O(MN) for finding the best alignment
– Time to report all alignments depends on the
number of sub-opt alignments
• Memory:
– O(MN)
– O(M+N) possible
Optional: more efficient alignment
algorithms
• Given two sequences of length M, N
• Time: O(MN)
– ok
• Space: O(MN)
– bad
– 1Mb seq x 1Mb seq = 1000G memory
• Can we do better?
Bounded alignment
Good alignment should
appear near the
diagonal
Bounded Dynamic
Programming
If we know that x and y are very similar
Assumption:
Then,
xi
|
yj
# gaps(x, y) < k
implies
|i–j|<k
yN ………………………… y1
Bounded Dynamic
Programming
x1 ………………………… xM Initialization:
F(i,0), F(0,j) undefined for i, j > k
Iteration:
For i = 1…M
For j = max(1, i – k)…min(N, i+k)
F(i – 1, j – 1)+ (xi, yj)
F(i, j) = max F(i, j – 1) – d, if j > i – k
F(i – 1, j) – d, if j < i + k
k
Termination:
same
Analysis
• Time: O(kM) << O(MN)
• Space: O(kM) with some tricks
=>
M
2k
M
2k
• Given two sequences of length M, N
• Time: O(MN)
– ok
• Space: O(MN)
– bad
– 1mb seq x 1mb seq = 1000G memory
• Can we do better?
Linear space algorithm
• If all we need is the alignment score but
not the alignment, easy!
We only need to
keep two rows
(You only need one row,
with a little trick)
But how do we get
the alignment?
Linear space algorithm
• When we finish, we know how we have
aligned the ends of the sequences
XM
YN
Naïve idea: Repeat on
the smaller subproblem
F(M-1, N-1)
Time complexity:
O((M+N)(MN))
(0, 0)
M/2
(M, N)
Key observation: optimal alignment (longest path) must use
an intermediate point on the M/2-th row. Call it (M/2, k),
where k is unknown.
(0,0)
(3,0)
(3,2)
(3,4)
(3,6)
(6,6)
• Longest path from (0, 0) to (6, 6) is
max_k (LP(0,0,3,k) + LP(6,6,3,k))
Hirschberg’s idea
• Divide and conquer!
Y
X
Forward algorithm
Align x1x2…xM/2 with Y
M/2
F(M/2, k) represents the
best alignment between
x1x2…xM/2 and y1y2…yk
Backward Algorithm
Y
X
M/2
Backward algorithm
Align reverse(xM/2+1…xM)
with reverse(Y)
B(M/2, k) represents the best
alignment between
reverse(xM/2+1…xM) and
reverse(ykyk+1…yN )
Linear-space alignment
Using 2 (4) rows of space, we can compute
for k = 1…N, F(M/2, k), B(M/2, k)
M/2
Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + B(M/2, k)
Also, we can trace the path exiting column M/2 from k*
Conclusion: In O(NM) time, O(N) space, we found
optimal alignment path at row M/2
Linear-space alignment
k*
M/2
M/2
• Iterate this procedure to the two sub-problems!
N-k*
Analysis
• Memory: O(N) for computation, O(N+M) to
store the optimal alignment
• Time:
– MN for first iteration
– k M/2 + (N-k) M/2 = MN/2 for second
–…
k
M/2
M/2
N-k
MN
MN/2
MN/4
MN + MN/2 + MN/4 + MN/8 + …
= MN (1 + ½ + ¼ + 1/8 + 1/16 + …)
= 2MN = O(MN)
MN/8
Heuristic Local Sequence
Alignment: BLAST
State of biological databases
Sequenced Genomes:
Human
Mouse
Neurospora
Tetraodon
Drosophila
Rice
sea squirts
3109
2.7109
4107
3108
1.2108
1.0109
1.6108
Yeast
Rat
Fugu fish 3.3108
Mosquito 2.8108
Worm
Arabidopsis
Current rate of sequencing (before new-generation sequencing):
4 big labs  3 109 bp /year/lab
10s small labs
Private sectors
With new-generation sequencing:
Easily generating billions of reads daily
1.2107
2.6109
1.0108
1.2108
Some useful applications of
alignments
Given a newly discovered gene,
- Does it occur in other species?
Assume we try Smith-Waterman:
Our
new
gene
104
The entire genomic database
1010 - 1011
May take several weeks!
Some useful applications of
alignments
Given a newly sequenced organism,
- Which subregions align with other organisms?
- Potential genes
- Other functional units
Assume we try Smith-Waterman:
Our newly
sequenced
mammal
3109
The entire genomic database
1010 - 1011
> 1000 years ???
BLAST
• Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
– The most widely used bioinformatics tool
• Which is better: long mediocre match or a few nearby,
short, strong matches with the same total score?
– Score-wise, exactly equivalent
– Biologically, later may be more interesting, & is common
– At least, if must miss some, rather miss the former
• BLAST is a heuristic algorithm emphasizing the later
– speed/sensitivity tradeoff: BLAST may miss former, but gains
greatly in speed
BLAST
• Available at NCBI (National Center for
Biotechnology Information) for download and
online use. http://blast.ncbi.nlm.nih.gov/
• Along with many sequence databases
query
Main idea:
1.Construct a dictionary of all the words in
the query
2.Initiate a local alignment for each word
match between query and DB
Running Time: O(MN)
However, orders of magnitude faster
than Smith-Waterman
DB
BLAST  Original Version
……
Dictionary:
All words of length k (~11 for DNA, 3
for proteins)
Alignment initiated between words of
alignment score  T (typically T = k)
query
……
Alignment:
Ungapped extensions until score
below statistical threshold
scan
DB
Output:
All local alignments with score
> statistical threshold
query
BLAST  Original Version
A C G A A G T A A G G T C C A G T
k = 4, T = 4
The matching word GGTC
initiates an alignment
Extension to the left and
right with no gaps until
alignment falls < 50%
Output:
GTAAGGTCC
GTTAGGTCC
C C C T T C C T G G A T T G C G A
Example:
Gapped BLAST
Added features:
•
•
Pairs of words can
initiate alignment
Extensions with
gaps in a band
around anchor
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
C T G A T C C T G G A T T G C G A
A C G A A G T A A G G T C C A G T
Example
Query: gattacaccccgattacaccccgattaca (29 letters)
[2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
sequences) 1,726,556 sequences; 8,074,398,388 total letters
>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence,
complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21
(95%) Strand = Plus / Plus
Query:
Sbjct:
4
tacaccccgattacaccccga 24
||||||| |||||||||||||
125138 tacacccagattacaccccga 125158
Score = 34.2 bits (17),
Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus
Query:
Sbjct:
4 tacaccccgattacaccccga 24
||||||| |||||||||||||
125104 tacacccagattacaccccga 125124
>gi|28173089|gb|AC104321.7|
Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence,
complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21
(95%) Strand = Plus / Plus
Query:
Sbjct:
4 tacaccccgattacaccccga 24
||||||| |||||||||||||
3891 tacacccagattacaccccga 3911
Example
Query: Human atoh enhancer, 179 letters
[1.5 min]
Result: 57 blast hits
1.
2.
3.
4.
5.
6.
7.
8.
gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc...
gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch...
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc...
gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene...
gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo...
gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O...
gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ...
gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...
355 1e-95
264 4e-68
256 9e-66
78 5e-12
54 7e-05
44 0.068
42 0.27
42 0.27
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517
Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%),
Gaps = 2/177 (1%) Strand = Plus / Plus
Query:
3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203
Query:
63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
|||||||||||||||||||||||||| ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262
Query:
123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318
Different types of BLAST
• blastn: search nucleic acid databases
• blastp: search protein databases
• blastx: you give a nucleic acid sequence,
search protein databases
• tblastn: you give a protein sequence,
search nucleic acid databases
• tblastx: you give a nucleic sequence,
search nucleic acid database, implicitly
translate both into protein sequences
BLAST cons and pros
• Advantages
– Fast!!!!
– A few minutes to search a database of 1011 bases
• Disadvantages
– Sensitivity may be low
– Often misses weak homologies
• New improvement
– Make it even faster
• Mainly for aligning very similar sequences or really long sequences
– E.g. whole genome vs whole genome
– Make it more sensitive
• PSI-BLAST: iteratively add more homologous sequences
• PatternHunter: discontinuous seeds
Variants of BLAST
NCBI-BLAST: most widely used version
WU-BLAST: (Washington University BLAST): another popular version
Optimized, added features
MEGABLAST:
Optimized to align very similar sequences. Linear gap penalty
BLAT: Blast-Like Alignment Tool
BlastZ:
Optimized for aligning two genomes
PSI-BLAST:
BLAST produces many hits
Those are aligned, and a pattern is extracted
Pattern is used for next search; above steps iterated
Sensitive for weak homologies
Slower
Pattern hunter
• Instead of exact matches of consecutive
matches of k-mer, we can
• look for discontinuous matches
– My query sequence looks like:
• ACGTAGACTAGCAGTTAAG
– Search for sequences in database that match
• AXGXAGXCTAXC
• X stands for don’t care
• Seed: 101011011101
Pattern hunter
• A good seed may give you both a higher
sensitivity and higher specificity
• You may think 110110110110 is the best seed
– Because mutation in the third position of a codon
often doesn’t change the amino acid
– Best seed is actually 110100110010101111
• Empirically determined
• How to design such seed is an open problem
• May combine multiple random seeds
Things we’ve covered so far
• Global alignment
– Needleman-Wunsch and variants
• Local Alignment
– Smith-Waterman
• Improvement on space and time
• Heuristic algorithms
– BLAST families
• Things we did not cover:
– Statistics for sequence alignment
– To handle gaps more accurately: affine gap penalty
– Multiple sequence alignment