Lesson 2 - The George S. Wise Faculty of Life Sciences

Download Report

Transcript Lesson 2 - The George S. Wise Faculty of Life Sciences

Lessons 3-4
Sequence homology and
alignment
1
Homology
 Similarity
between characters due to
a common ancestry
2
Sequence homology
 Similarity
between sequences that
results from a common ancestor
VLSPAVKWAKVGAHAAGHG
VLSEAVLWAKVEADVAGHG
Basic
assumption:
Sequence homology →
similar structure/function
3
Sequence alignment
Alignment: Comparing two (pairwise)
or more (multiple) sequences.
Searching for a series of identical or
similar characters in the sequences.
4
Homology
 Ortholog
– homolog with similar
function (via speciation)
 Paralog – homolog which arose from
gene duplication
Common use:
Orthologs –
2 homologs
from different
species
Paralogs –
2 homologs
within the
same species
5
How close?
 Rule
of thumb:
 Proteins are homologous if 25%
identical (length >100)
 DNA sequences are homologous if
70% identical
6
Twilight zone
<
20% identity in proteins – may be
homologous and may not be….
 (Note that 5% identity will be
obtained completely by chance!)
7
Why sequence alignment?
Predict characteristics of a
protein –
use the structure/function of known
proteins for predicting the
structure/function of an unknown
protein
8
Sequence modifications
Sequences change in the course of evolution
due to random mutations
Three types of mutations:
1.
2.
3.
Insertion - an insertion of a letter or several letters to the
sequence. AAGA AAGTA
Deletion - deleting a letter (or more) from the sequence.
AAGA AGA
Substitution - replacing a sequence letter by another.
AAGA AACA
Insertion or Deletion = Indel
9
Local vs. Global
 Global
Global
alignment:
forces
alignment in
regions which
differ
alignment – finds the best
alignment across the entire two
sequences.
ADLGAVFALCDRYFQ
||||
|||| |
ADLGRTQN-CDRYYQ
 Local
Local
alignment will
return only
regions of
good
alignment
alignment – finds regions of
similarity in parts of the sequences.
ADLG
||||
ADLG
CDRYFQ
|||| |
CDRYYQ
10
When global and when local?
11
Global alignment
 PTK2
protein tyrosine kinase 2 of
human and rhesus monkey
12
Protein tyrosine kinase domain
13
Protein tyrosine kinase domain
Human PTK2 and leukocyte tyrosine
kinase
 Both function as tyrosine kinases, in
completely different contexts


Ancient duplication
14
Global alignment of PTK and LTK
15
Local alignment of PTK and LTK
16
Pairwise alignment
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
17
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes:
2 mismatches
4 indels (gap)
10 perfect matches
18
Choosing an alignment:
 Many
different alignments are
possible:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Which alignment is better?
19
Alignment scoring - scoring of
sequence similarity:

Assumes independence
– Each position considered separately

Score at each position
– Positive if identical
– Negative if different or gap

Score of an alignment is the sum of
position scores
– Can be positive or negative
20
Example - naïve scoring
system:
 Perfect
match: +1
 Mismatch: -2
 Indel (gap): -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGAScore: = (+1)x10 + (-2)x2 + (-1)x4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGAScore: = (+1)x9 + (-2)x2 + (-1)x6 = -1
Higher score  Better alignment
21
Scoring system:
The choice of +1,-2, and -1 scores is
quite arbitrary
 Different scoring systems  different
alignments
 Scoring systems implicitly represent a
particular theory of evolution

– Some mismatches are more plausible
 Transition
vs. Transversion
 LysArg
≠
LysCys
– Gap extension Vs. Gap opening
22
Scoring matrix
Representing the
scoring system as a
table or matrix n  n
(n is the number of
letters the alphabet
contains. n=4 for
nucleotides, n=20 for
amino acids)
 symmetric

A
G
C
A
2
G
-6
2
C
-6
-6
2
T
-6
-6
-6
T
2
23
DNA scoring matrices

Uniform substitution in all
nucleotides:
From
To
A
A
2
G
-6
2
C
-6
-6
2
T
-6
-6
-6
Mismatch
G
C
Match
T
2
24
DNA scoring matrices
Can take into account biological
phenomena such as:
 Transition-transversion
25
Amino Acid Scoring Matrices

Take into account physico-chemical properties
26
Amino Acid Substitutions Matrices

Actual substitutions:
– Based on empirical data
– Commonly used by many bioinformatic
programs
– PAM & BLOSUM
27
Protein matrices – actual
substitutions
The idea: Given an alignment of a large number of
closely related sequences we can score the relation
between amino acids based on how frequently they
substitute each other
M
M
M
M
M
M
M
M
G
G
G
G
G
G
G
G
Y
Y
Y
Y
Y
Y
Y
Y
D
D
E
D
Q
D
E
E
E
E
E
E
E
E
E
E
In the fourth column
E and D are found in 7 / 8
28
PAM Matrix - Point Accepted
Mutations

Based on a database of 1,572 changes in
71 groups of closely related proteins (85%
identity)
– Alignment was easy
Counted the number of the substitutions
per amino acid pair (20 x 20)
 Found that common substitutions occurred
between chemically similar amino acids

29
PAM Matrices



Family of matrices PAM 80, PAM 120,
PAM 250
The number on the PAM matrix
represents evolutionary distance
Larger numbers are for larger distances
30
Example: PAM 250
Similar amino acids have greater
score
31
PAM - limitations
 Only
one original dataset
 Examining
proteins with few
differences (85% identity)
 Based
mainly on small globular
proteins so the matrix is biased
32
BLOSUM
 Henikoff
and Henikoff (1992) derived
a set of matrices based on a much
larger dataset
 BLOSUM
observes significantly more
replacements than PAM, even for
infrequent pairs
33
BLOSUM: Blocks Substitution
Matrix
 Based
on BLOCKS database
– ~2000 blocks from 500 families of
related proteins
– Families of proteins with identical
function
AABCDA----BBCDA
 Blocks
are short
conserved patterns of
3-60 aa long
without gaps
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC
34
BLOSUM
 Each
block represent sequences
alignment with different identity
percentage
 For
each block the amino-acid
substitution rates were calculated to
create BLOSUM matrix
35
BLOSUM Matrices
 BLOSUMn
is based on sequences
that shared at least n percent
identity
 BLOSUM62
represents closer
sequences than BLOSUM45
36
Example : Blosum62
derived from block where the sequences
clustered share at least 62% identity
37
PAM vs. BLOSUM
PAM100
PAM120
PAM160
PAM200
PAM250
=
=
=
=
=
BLOSUM90
BLOSUM80
BLOSUM60
BLOSUM52
BLOSUM45
More distant sequences
38
Scoring system =
substitution matrix +
gap penalty
39
Gap penalty
We penalize gaps
 Scoring for gap opening & for extension:

– Gap-extension penalty < gap-open penalty
40
Optimal alignment algorithms
 Needleman-Wunsch
(global)
 Smith-Waterman (local).
41
Alignment Search Space



The “search space” (number of possible gapped
alignments) for optimally aligning two sequences
is exponential in the length of the sequences, n.
If n=100, there are
100100 = 10200 =
100,000,000,000,000,000,000,000,000,000,000,000,000,000,0
00,000,000,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,0
00,000,000,000,000,000,000,000,000,000 different
alignments.
Average protein length is about n=250!
42
Searching databases
43
Searching a database
 Using
a sequence as the query to
find homologous sequences in the
database
44
DNA or protein?
 For
coding sequences, we can use
the DNA sequence or the protein
sequence to search for similar
sequences.
 Which is preferable?
45
Protein is better!
 Selection
(and hence conservation)
works on the protein level:
CTTTCA =
TTGAGT =
Leu-Ser
Leu-Ser
46
Query type
 Nucleotides:
a four letter alphabet
 Amino acids: a twenty letter
alphabet
• Two random DNA sequences will
share on average 25% of identity
• Two random protein sequences will
share on average 5% of identity
47
Conclusions
 Using
the amino acid sequence is
preferable for homology search
 Why
use a nucleotide sequence after
all?
 No ORF found, e.g. newly sequenced
genome
 No similar protein sequences were found
 Specific DNA databases are available
(EST)
48
Some terminology
 Query
sequence - the sequence with
which we are searching
 Hit – a sequence found in the
database, suspected as homologous
49
How do we search a database?
 Assume
we perform pairwise
alignment of the query against all
the sequences in the database
 Exact pairwise alignment is O(mn) ≈
O(n2)
(m – length of sequence 1,
n – length of sequence 2)
50
How much time will it take?
O(n2) computations per search.
 Assume n=200, so we have 40,000
computations per search
 Size of database - ~60 million entries
 2.4 x 1012 computations for each
sequence search we perform!
 Assume each computation takes 10-6
seconds  24,000 seconds ≈ 6.66 hours
for each sequence search
 150,000 searches (at least!!) are
performed per day

51
Conclusion
 Using
the exact comparison pairwise
alignment algorithm between query
and all DB entries – too slow
52
Heuristic
 Definition:
a heuristic is a design to
solve a problem that does not
provide exact solution (but is not
too bad) and reduces the time
complexity of the exact solution
53
BLAST
 BLAST
- Basic Local Alignment and
Search Tool
 A heuristic for searching a database
for similar sequences
54
DNA or Protein
 All
types of searches are possible.
Query:
DNA
Protein
Database:
DNA
Protein
blastn – nuc vs. nuc
blastp – prot vs. prot
blastx – translated query vs. protein database
tblastn – protein vs. translated nuc. DB
tblastx – translated query vs. translated database
Translated
databases:
trEMBL
genPept
55
BLAST - underlying hypothesis


1.
2.
The underlying hypothesis: when
two sequences are similar there are
short ungapped regions of high
similarity between the two
The heuristic:
Discard irrelevant sequences
Perform exact local alignment with
remaining sequences
56
How do we discard irrelevant
sequences quickly?
 Divide
the database into words of
length w (w = 3 for protein and w =
7 for DNA)
 Save the words in a look-up table
that can be searched quickly
WTDFGYPAILKGGTAC
WTD
TDF
DFG
FGY
GYP
…
57
BLAST: discarding sequences
 When
the user gives a query
sequence, divide it also into words
 Search the database for consecutive
neighbor words
58
Neighbour words
 neighbor
words are defined
according to a scoring matrix (e.g.
BLOSUM62 for proteins) with a
certain cutoff level
GFB
GFC (20)
GPC (11)
WAC (5)
59
Search for consecutive words
Neighbor word
Look for a seed: hits on
the same diagonal
which can be connected
Database record
At least 2 hits on the
same diagonal with
distance which is
smaller than a
predetermined cutoff
Query
This is the filtering stage
– many unrelated hits
are filtered, saving lots
of time!
60
Try to extend alignment
 Stop
extending when the score of the
alignment drops X beneath maximal
score obtained this far
 Discard segments with score < S
ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
Score=15
X=4
Score=17
Score=14
62
The result – local alignment
 The
result of BLAST will be a series
of local alignments between the
query and the different hits found
63
E-value
 The
number of times we will
theoretically find an alignment with a
score ≥ Y of a random sequence vs.
a random database
Theoretically,
we could trust
any result
with an
E-value ≤ 1
In practice – BLAST uses estimations.
E-values of 10-4 and lower indicate a
significant homology.
E-values between 10-4 and 10-2 should
be checked (similar domains, maybe
non-homologous).
E-values between 10-2 and 1 are
suspicious…
64
Filtering low complexity
 Low
complexity regions : e.g.,
Proline rich areas (in protein), Alu
repeats (in DNA)
 Regions of low complexity generate
high score of alignment BUT – this
does not indicate homology
65
Solution
 In
BLAST there is masking of lowcomplexity regions in the query
sequence (such regions are
represented as XXXXX in query)
66
Multiple sequence alignment
67
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-ATLVCLISDFYPGA--VTVAWKADS-AALGCLVKDYFPEP--VTVSWNSG--VSLTCLVKGFYPSD--IAVEWWSNG--
Like pairwise alignment BUT compare n
sequences instead of 2
Rows represent individual sequences
Columns represent ‘same’ position
May be gaps in some sequences
68
MSA & Evolution
MSA can give you a picture of the
forces that shape evolution!
 Important
amino acids or nucleotides
are not “allowed” to mutate
 Less important positions change
more easily
69
Conserved positions
 Columns
where all the sequences
contain the same amino acids or
nucleotides
 Important for the function or
structure
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
70
Consensus Sequence
 The
consensus sequence holds the
most frequent character of the
alignment at each column
A
T
C
T
T
G
T
A
A
C
T
T
G
T
A
A
C
T
T
C
T
A
A
C
T
T
G
T
71
Profile
1
2
3
4
5
6
A
T
C
T
T
G
T
A
1
0.67
0
0
.
.
A
A
C
T
T
G
T
T
0
0.33
1
1
.
.
A
A
C
T
T
C
T
C
0
0
0
0
.
.
G
0
0
0
0
.
.
Profile =
PSSM – Position Specific Score Matrix
72
Alignment methods
 Progressive
alignment (Clustal)
 Iterative alignment (mafft, muscle)
 All methods today are an
approximation strategy (heuristic
algorithm), yield a possible
alignment, but not necessarily the
best one
73
Progressive alignment
First step:
A
B
C
D
Compute the pairwise
alignments for all against all
(6 pairwise alignments)
the similarities are stored in a
table
A
B
C
D
A
B
11
C
3
1
D
2
2
10
74
Second step:
A
B
C
D
A
cluster the sequences to create a tree
(guide tree):
B
11
C
3
1
D
2
2
10
guide
•Represents the order The
in which
pairs of tree is imprecise
sequences are to be aligned
and is NOT the tree which
•similar sequences are neighbors in the
truly describes the
tree
•distant sequences are distant from each
A
relationship
between
the
other in the tree
sequences!
B
C
D
75
Third step:
Align most similar pairs
A
B
C
D
Align the alignments as if each
of them was a single sequence
(replace with a single
consensus sequence or use a
profile)
76
Alignment of alignments
M Q T F
L H T W
L Q S W
X
Y
M
L
L
L
M
Q
H
Q
-
T
T
S
T
T
I
I
F
W
W
F
W
L T I F
M T I W
77
Iterative alignment
A
B
C
D
Pairwise distance
table
A
B
C
A
Guide tree
B
11
C
3
1
D
2
2
D
Iterate until the
MSA doesn’t
change
10
MSA
A
B
C
D
78
Searching for remote homologs
 Sometimes
BLAST isn’t enough.
 Large protein family, and BLAST only
gives close members. We want more
distant members
 PSI-BLAST
 Profile
HMMs
79
PSI-BLAST
 Position
Specific Iterated BLAST
Regular blast
Construct profile from
blast results
Blast profile search
Final results
80
PSI-BLAST
 Advantage:
PSI-BLAST looks for seq.s
that are close to ours, and learns from
them to extend the circle of friends
 Disadvantage: if we found a WRONG
sequence, we will get to unrelated
sequences (contamination). This gets
worse and worse each iteration
81
Profile HMM
 Similar
to PSI-BLAST: also uses a
profile
 Takes into account:
 Dependence among sites (if site n is
conserved, it is likely that site n+1 is
conserved  part of a domain
 The probability of a certain column in
an alignment
82
PSI BLAST vs profile HMM
PSI BLAST
Profile HMM
Less exact
More exact
Faster
Slower
83
Case study:
Using homology searching
 The
human kinome
84
Kinases and phosphatases
85
Multi-tasking enzymes
Signal transduction
 Metabolism
 Transcription
 Cell-cycle
 Differentiation
 Function of nervous and
immune system
…
 And more

86
How many kinases in the human
genome?
 1970’s,
advent of cloning and
sequencing
 Predicted: 1000 kinases in the
human genome
87
How many kinases in the human
genome?
 2001
– human genome sequence …
 As well – databases of Genbank,
Swissprot, and dbEST
 How
do we know how many kinases
are out there?
88
The human kinome

1.
2.
In 2002, Manning, Whyte, Martinez,
Hunter and Sudarsanam set out to:
Search and cross-reference all
these databases for all kinases
Characterize all found kinases
89
ePKs and aPKs
Eukaryotic
protein kinases
(majority)
Sequence homology
of the catalytic
domain; additional
regulatory domains
are non-homologous
Atypical protein
kinases
No sequence
homology to ePKs;
some aPK subfamilies
have structural
90
similarity to ePKs
The search
 Several
profiles were built:
based on the catalytic domain of
(a) 70 known ePKs
(b) each subfamily of known aPKs
 HMM-profile
searches and PSI-BLAST
searches were performed
91
The result…
 478
apKs
 40 ePKs
 Total
of 518 kinases in the human
genome (half of the prediction in the
1970’s)
92
Classifying the kinases
1.
2.
Classification based on the catalytic
domain
Classification based on the
regulatory domains
189 sub-families of kinases
93
Comparison to other species
 209
subfamilies of ePKs in human,
worm, yeast and fly
94
The human genome has x2 kinases (in
number) as fly or worm. Many are aPKs.
 Most of them are receptor tyrosine kinases
(RTKs)

Nervous system
 Immune system
 Hemopoiesis

95
The discovery of new kinases: a
new front for battling human
diseases
96
Correlating with human diseases
 160
kinases mapped to amplicons
seen in tumors
 80 kinases mapped to amplicons in
other major illnesses
 Usually kinases are over-expressed
in cancer and other diseases
97
Correlating with human diseases
6
kinase inhibitors have been
approved till today for the use
against cancer
 >70 other inhibitors are in clinical
trial
98