Lesson 3
Aligning sequences and searching
databases
1
Some Terminology
Matrix = Table
Probability = סיכוי (Hebrew: chance)
Likelihood = סבירות (Hebrew: plausibility)
Global and Local pairwise alignments
5
Global vs. Local
• Global alignment – finds the best alignment
across the entire two sequences.
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
• Local alignment – finds regions of similarity in
parts of the sequences.
ADLG
||||
ADLG
CDRYFQ
|||| |
CDRYYQ
6
The sequence similarity is restricted to a single domain
[Figure: PTK2 = Domain A + protein tyrosine kinase domain + Domain B; Leukocyte TK = Domain X + protein tyrosine kinase domain]
7
Which alignment is the correct one?
AAGTGAATTCGAA
AGGCTCATTTCTGA
AAG-TGAATT-C-GAA
AGGCT-CATTTCTGA-
A-AG-TGAATTC--GAA
AG-GCTCA-TTTCTGA-
8
Scoring system (naïve)
Perfect match: +1
Mismatch: -2
Indel (gap): -1
AAG-TGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x5 = 0
A-AG-TGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x8 + (-2)x2 + (-1)x7 = -3
Higher score -> Better alignment
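The naive scoring system can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the function name `score_alignment` is hypothetical, and any column that contains a gap is counted as one indel.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-2, indel=-1):
    """Score two equal-length alignment rows column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += indel      # gap in either row
        elif a == b:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


# First alignment from the slide: 9 matches, 2 mismatches, 5 indels.
print(score_alignment("AAG-TGAATT-C-GAA", "AGGCT-CATTTCTGA-"))  # 0
```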
9
DNA scoring matrices
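• Uniform substitutions between all nucleotides:
From \ To    A    G    C    T
A            2   -2   -6   -6
G           -2    2   -6   -6
C           -6   -6    2   -2
T           -6   -6   -2    2
Match: diagonal entries. Mismatch: off-diagonal entries.
In the matrix shown, A<->G and C<->T mismatches score -2 and all other mismatches score -6.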
10
Scoring gaps (I)
Gap extension penalty < Gap opening penalty
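One common way to realize "extension cheaper than opening" is an affine gap penalty: the first position of a gap pays the opening penalty and every further position pays the smaller extension penalty. The sketch below assumes that model. The function name and the values 10 and 1 are hypothetical and do not come from the slides.

```python
def gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine penalty for one gap of the given length (assumed model)."""
    if length <= 0:
        return 0
    # The first gap position pays gap_open, each further one pays gap_extend.
    return gap_open + (length - 1) * gap_extend


# One gap of length 3 is penalized less than three separate gaps of length 1.
print(gap_penalty(3))       # 12
print(3 * gap_penalty(1))   # 30
```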
11
Protein matrices – actual substitutions
The idea: Given an alignment of a large number of
closely related sequences we can score the relation
between amino acids based on how frequently they
substitute each other
MGYDE
MGYDE
MGYEE
MGYDE
MGYQE
MGYDE
MGYEE
MGYEE
In the fourth column
E and D are found in 7 / 8
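The column counts behind this observation can be reproduced with a short sketch. The eight sequences are the ones on the slide, read row by row; the helper name `column_counts` is hypothetical.

```python
from collections import Counter

# The eight aligned sequences from the slide.
sequences = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
             "MGYQE", "MGYDE", "MGYEE", "MGYEE"]


def column_counts(seqs, col):
    """Count how often each amino acid appears in one alignment column."""
    return Counter(seq[col] for seq in seqs)


fourth = column_counts(sequences, 3)   # columns are 0-based, so 3 is the fourth
print(fourth["D"] + fourth["E"], "/", len(sequences))  # 7 / 8
```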
12
PAM Matrices
• Family of matrices PAM 80, PAM 120, PAM 250
• The number on the PAM matrix represents
evolutionary distance
• Larger numbers are for larger distances
13
Example: PAM 250
Similar amino acids have greater
score
14
PAM - limitations
• Based on only a single, limited dataset
• Examines proteins with few differences (at least 85%
identity)
• Based mainly on small globular proteins, so the
matrix is biased
15
BLOSUM
• Henikoff and Henikoff (1992) derived a set of
matrices based on a much larger dataset
• BLOSUM observes significantly more
replacements than PAM, even for infrequent
pairs
16
BLOSUM: Blocks Substitution Matrix
• Based on BLOCKS database
– ~2000 blocks from 500 families of related proteins
– Families of proteins with identical function
• Blocks are short conserved patterns of 3-60 amino acids without gaps
AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC
17
Example : Blosum62
Derived from blocks where the sequences
share at least 62% identity
18
PAM vs. BLOSUM
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45
Going down the list: more distant sequences
19
Amino Acid           Blosum62   PAM 250   Frequency in vertebrates
W  Tryptophan           11         17       1.30%
M  Methionine            5          6       1.80%
H  Histidine             8          6       2.90%
C  Cysteine              9         12       3.30%
Y  Tyrosine              7         10       3.30%
Q  Glutamine             5          4       3.70%
I  Isoleucine            4          5       3.80%
F  Phenylalanine         6          9       4%
R  Arginine              5          6       4.20%
N  Asparagine            6          2       4.40%
P  Proline               7          6       5%
E  Glutamic acid         5          4       5.80%
D  Aspartic acid         6          4       5.90%
T  Threonine             5          3       6.20%
V  Valine                4          4       6.80%
K  Lysine                5          5       7.20%
A  Alanine               4          2       7.40%
G  Glycine               6          5       7.40%
L  Leucine               4          6       7.60%
S  Serine                4          2       8.10%
Intermediate summary
1. Scoring system =
substitution matrix + gap penalty.
2. Used for both global and local alignment
3. For amino acids, there are two types of
substitution matrices: PAM and Blosum
21
Computational Aspects
22
Many possible alignments
AAGCTGAATTCGAA
AGGCTCATTTCTGA
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
AAG-CTGAATT-C-GAA
AGGCT-CATTT-CTGA-
Which alignment has the best
score?
• Two sequences of length 10 have
>> 1,000,000 possible alignments
• Two sequences of length 20 have
>> 1,000,000,000,000 possible
alignments
• Two sequences of length 30 have
>> 1,000,000,000,000,000,000
possible alignments
AAGCT-GAATT-C-GAA
A-GGCT-CATTTCTGA-
23
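To get a feeling for how fast the number of alignments grows, here is a rough sketch that counts them. It rests on an assumption that is not stated on the slides: an alignment is built column by column, and each column is either letter/letter, letter/gap or gap/letter. Under that convention, which is only one of several ways to count, the numbers are consistent with the ">>" bounds on the slide. The function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of gapped alignments of sequences of lengths n and m
    (counting convention assumed, see the text above)."""
    if n == 0 or m == 0:
        return 1  # only one way left: pair every remaining letter with a gap
    return (count_alignments(n - 1, m - 1)    # last column: letter / letter
            + count_alignments(n - 1, m)      # last column: letter / gap
            + count_alignments(n, m - 1))     # last column: gap / letter


for length in (10, 20, 30):
    print(length, count_alignments(length, length))
```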
Optimal alignment algorithms
• Needleman-Wunsch (global) [1970]
• Smith-Waterman (local) [1981]
• Two sequences of length 10: about 10 x 10 = 100 computer
operations, one per matrix cell (instead of 1,000,000).
• Two sequences of length 20: about 400 computer
operations (instead of 1,000,000,000,000).
• Two sequences of length 30: about 900 computer
operations (instead of 1,000,000,000,000,000,000).
24
Matrix Representation
Match = 1
Mismatch = -1
Indel = -2
S \ T           A    G    C
           0    1    2    3
      0    0   -2   -4   -6
A     1   -2    1   -1   -3
A     2   -4   -1    0   -2
A     3   -6   -3   -2   -1
C     4   -8   -5   -4   -1
score(AAAC,AGC) = -1
AAAC
A-GC
25
Matrix Representation
Match = 1
Mismatch = -1
Indel = -2
S \ T           A    G    C
           0    1    2    3
      0    0   -2   -4   -6
A     1   -2    1   -1   -3
A     2   -4   -1    0   -2
A     3   -6   -3   -2   -1
C     4   -8   -5   -4   -1
score(AAA,AG) = -2
AAA
A-G
26
Matrix Representation
Match = 1
Mismatch = -1
Indel = -2
S \ T           A    G    C
           0    1    2    3
      0    0   -2   -4   -6
A     1   -2    1   -1   -3
A     2   -4   -1    0   -2
A     3   -6   -3   -2   -1
C     4   -8   -5   -4   -1
score(,AG) = -4
--
AG
27
Matrix Representation
Match = 1
Mismatch = -1
Indel = -2
S \ T           A    G    C
           0    1    2    3
      0    0   -2   -4   -6
A     1   -2    1   -1   -3
A     2   -4   -1    0   -2
A     3   -6   -3   -2   -1
C     4   -8   -5   -4   -1
How do we fill in the alignment scores in the matrix?
That’s where the algorithm comes into play
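As a rough sketch of how the matrix can be filled, the code below follows the Needleman-Wunsch recurrence with the scores from the slide (match = 1, mismatch = -1, indel = -2). Each cell takes the best of three moves: diagonal (align two letters), up (gap in T) or left (gap in S). The function name is hypothetical, and the traceback that recovers the alignment itself is left out.

```python
def needleman_wunsch_matrix(s, t, match=1, mismatch=-1, indel=-2):
    """Fill the global alignment score matrix; rows follow s, columns follow t."""
    rows, cols = len(s) + 1, len(t) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * indel          # prefix of s aligned to nothing
    for j in range(1, cols):
        F[0][j] = j * indel          # prefix of t aligned to nothing
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = F[i - 1][j] + indel     # letter of s against a gap
            left = F[i][j - 1] + indel   # letter of t against a gap
            F[i][j] = max(diagonal, up, left)
    return F


F = needleman_wunsch_matrix("AAAC", "AGC")
for row in F:
    print(row)
print(F[4][3])  # score(AAAC,AGC) = -1
```

The bottom-right cell holds the score of the best global alignment, and every other cell holds the score for the corresponding pair of prefixes.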
28
A Useful Link
• http://alggen.lsi.upc.es/docencia/ember/frame-ember.html
– Gives a step by step illustration of the algorithm
for any given pair of sequences.
29
Homology versus chance similarity
30
A suggestion
A. Take the two sequences -> compute the alignment score.
B. Take one sequence, randomly shuffle it -> find its
score with the second sequence. Repeat 100,000
times.
If the score in A is in the top 5% of the
scores in B -> the similarity is significant.
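The suggestion can be sketched as a small shuffling test. Several things here are assumptions rather than part of the slides: the alignment score is a simple global score with match = 1, mismatch = -1, indel = -2, the function names are hypothetical, and the default number of shuffles is far below 100,000 to keep the example fast.

```python
import random


def global_score(s, t, match=1, mismatch=-1, indel=-2):
    """Best global alignment score, keeping only one matrix row in memory."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * indel]
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur.append(max(diagonal, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def shuffle_test(seq_a, seq_b, n_shuffles=1000, seed=0):
    """Return the real score (step A) and the fraction of shuffled scores
    that are at least as high (step B)."""
    rng = random.Random(seed)
    observed = global_score(seq_a, seq_b)
    letters = list(seq_a)
    at_least_as_high = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)   # same composition, random order
        if global_score("".join(letters), seq_b) >= observed:
            at_least_as_high += 1
    return observed, at_least_as_high / n_shuffles


observed, fraction = shuffle_test("AGCTTAGACTAAAGCTTGCA", "AGCTTAGACTAAAGCTTGCA", n_shuffles=200)
print(observed, fraction)
```

If the returned fraction is below 0.05, the real score is in the top 5% of the shuffled scores.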
31
Searching databases
32
Craig Venter’s Cruise
A sequence found in Craig Venter’s cruise:
…AGGTAGACTAGAGCAGTTAGAACGTTAGTTTA…
Which organism does it come from?
Query: AGGTAGAC
Database:
GTGAGCAGAGAATAGTTTAAC…
GAGCTATGTGAGCAGAGAATA…
CTACGTGAGCAGAGAATAGTT…
CATAGCTACTATGTGAGCAGA…
GAGACCAGAGACTACGATAGC…
CTAAACTGTGAGCAGACTCGT…
GGGGACAGAGAATAGTTTAAC…
TAGCTGAGCTATGTGAGCAGA…
…
…
Searching a sequence database
The idea: Use your sequence as a query to find
homologous sequences in a sequence database
Database
A sequence taken
from Venter’s trip
37
Searching a sequence database
Database
query
38
Searching a sequence database
Database
query
hit
39
Terminology
• Query sequence - the sequence with which
we are searching
• Hit – a sequence found in the database,
suspected to be homologous
40
Protein or DNA search
41
Query sequence: DNA or protein?
• For coding sequences, we can use the DNA
sequence or the protein sequence to search
for similar sequences.
• Which is preferable if we want to learn about
homology?
42
Amino acids are better!
• Selection (and hence conservation) works
(mostly) at the protein level:
CTTTCA = Leu-Ser
TTGAGT = Leu-Ser
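The example can be checked with a tiny sketch. The codon table below is deliberately incomplete: it contains only the four codons that appear on the slide. The function names are hypothetical.

```python
# Only the codons needed for the slide example (standard genetic code).
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a coding DNA string codon by codon."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "-".join(CODON_TABLE[codon] for codon in codons)


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(translate("CTTTCA"), translate("TTGAGT"))   # Leu-Ser Leu-Ser
print(identity("CTTTCA", "TTGAGT"))               # only 1 of 6 nucleotides is identical
```

The two DNA sequences share a single nucleotide position, yet they encode the same two amino acids.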
43
Query type
• Nucleotides: a four letter alphabet
• Amino acids: a twenty letter alphabet
• Two random DNA sequences will, on
average, have 25% identity
• Two random protein sequences will, on
average, have 5% identity
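These two percentages follow if every letter of the alphabet is assumed to be equally frequent, which is a simplification because real nucleotide and amino acid frequencies are not uniform. A minimal sketch with a hypothetical function name:

```python
def expected_identity(frequencies):
    """Chance that two independently drawn letters are identical."""
    return sum(p * p for p in frequencies)


dna = [1 / 4] * 4        # assumed uniform nucleotide frequencies
protein = [1 / 20] * 20  # assumed uniform amino acid frequencies
print(round(expected_identity(dna), 4))      # 0.25
print(round(expected_identity(protein), 4))  # 0.05
```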
44
Computation time
45
Searching a sequence database
Assuming 10 comparisons per second, a full comparison
of the query to the database requires 11.5 days.
Database: 10^7 sequences
query
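The estimate is simple arithmetic: with 10^7 database sequences and 10 comparisons per second, the search takes 10^6 seconds. A sketch of the calculation (the function name is hypothetical):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def search_days(n_sequences, comparisons_per_second):
    """Days needed to compare one query against every database sequence."""
    return n_sequences / comparisons_per_second / SECONDS_PER_DAY


# 10^7 sequences at 10 comparisons per second: a little over 11.5 days.
print(search_days(10**7, 10))
```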
46
How do we search a database?
• 11.5 days is ok if we are doing it once.
• 150,000 searches (at least!!) are performed
per day. >82,000,000 sequence records in
GenBank.
47
Heuristic
• Definition: a heuristic is a method for
solving a problem that does not
guarantee an exact solution (but gives
a reasonably good one) and takes much
less time than the exact solution
48
BLAST
• BLAST - Basic Local Alignment Search
Tool
• A heuristic for searching a database for similar
sequences
49
BLAST - underlying hypothesis
• The underlying hypothesis: when two
sequences are similar, there are short
ungapped regions of high similarity between
them
• The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment only with the
remaining sequences
50
How do we discard irrelevant sequences
quickly?
• Divide the database into words of length w
(default: w = 3 for protein and w = 11 for DNA)
• Save the words in a look-up table that can be
searched quickly
AGCTTAGACTAAAGC…
AGCTTAGACTA
GCTTAGACTAA
CTTAGACTAAA
TTAGACTAAAG
TAGACTAAAGC
…
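The look-up table can be sketched with a dictionary that maps each word to the places where it occurs. This is only an illustration of the idea, not how BLAST is actually implemented. The names `build_word_index` and `lookup_query_words` and the record name `rec1` are hypothetical.

```python
from collections import defaultdict


def build_word_index(records, w):
    """Map every word of length w to the (record, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in records.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def lookup_query_words(query, index, w):
    """Find all exact word matches between the query and the indexed records."""
    hits = []
    for q_pos in range(len(query) - w + 1):
        for name, db_pos in index.get(query[q_pos:q_pos + w], []):
            hits.append((q_pos, name, db_pos))
    return hits


index = build_word_index({"rec1": "AGCTTAGACTAAAGC"}, w=11)
print(sorted(index))                                    # the five words from the slide
print(lookup_query_words("TTAGACTAAAG", index, w=11))   # [(0, 'rec1', 3)]
```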
51
BLAST: discarding sequences
• When the user enters a query sequence, it is
also divided into words
• Search the database for consecutive
neighboring words
52
Search for consecutive words
[Figure: neighbor words from the query located next to each other on a database record]
This is the filtering stage: many unrelated hits are
filtered, saving lots of time!
53
Try to extend the alignment
• Stop extending when the score of the
alignment drops X beneath the maximal score
obtained so far
• Discard segments with score < S
AAGACCTAGGCATTAAGCATTTAAGAGA
GGAAGACAGGCATTAAGCGTCAAAGAGG
X=4
Scores labeled along the extension: Score=11, Score=9, Score=9, Score=7
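The extension step can be sketched as follows. Several things are assumptions, since the slide does not give the scoring values: match = +1 and mismatch = -2 (the naive scheme from earlier in the lesson), no gaps, and extension stops once the running score is X or more below the best score seen in that direction. The function names and the seed position are hypothetical.

```python
def extend_one_way(query, subject, qi, si, step, x_drop, match, mismatch):
    """Extend without gaps in one direction; return (best gain, columns kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # dropped X beneath the maximal score obtained so far
        qi += step
        si += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, w, x_drop=4,
                min_score=0, match=1, mismatch=-2):
    """Extend a word hit of length w both ways; discard it if the score < S."""
    seed_score = sum(match if query[q_start + k] == subject[s_start + k] else mismatch
                     for k in range(w))
    right, r_len = extend_one_way(query, subject, q_start + w, s_start + w,
                                  +1, x_drop, match, mismatch)
    left, l_len = extend_one_way(query, subject, q_start - 1, s_start - 1,
                                 -1, x_drop, match, mismatch)
    total = seed_score + left + right
    if total < min_score:
        return None  # segment discarded
    return total, q_start - l_len, q_start + w + r_len  # score, begin, end (exclusive)


query = "AAGACCTAGGCATTAAGCATTTAAGAGA"
subject = "GGAAGACAGGCATTAAGCGTCAAAGAGG"
# Start from the shared word "CAT" at position 10 (0-based) in both sequences.
print(extend_seed(query, subject, 10, 10, w=3))  # (11, 7, 18)
```

Under these assumptions the extension recovers the 11-letter identical region AGGCATTAAGC with score 11 and stops on both sides once the mismatches pull the score down by X.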
54
The result – local alignment
• The result of BLAST will be a series of local
alignments between the query and the
different hits found
55