Dr. Kim Henrick - European Bioinformatics Institute

EMBL-EBI
PATTERNS
Kim Henrick
EMBL-EBI
Terri Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/
Motifs and domains
- Motif: a simple combination of a few consecutive
  secondary structure elements with a specific geometric
  arrangement (e.g., helix-loop-helix). May have a
  specific biological function.
- Domain: the fundamental unit of structure folding and
  evolution. It combines several secondary structure elements and
  motifs packed in a compact globular structure. A
  domain can fold independently into a stable 3D
  structure, and may have a specific function.
- Domain family: proteins that share a domain (possibly
  in combination with other domains)
- Protein family: proteins that have the same combination
  of domains
Profiles & Motifs are Useful
Helped identify active site of HIV protease
Helped identify SH2/SH3 class of STPs
Helped identify important GTP oncoproteins
Helped identify hidden leucine zipper in HGA
Used to scan for lectin binding domains
Regularly used to predict T-cell epitopes
Domains are More Useful
Rules of Thumb
- Sequence pattern-based motifs should be
  determined from no fewer than 5 multiply
  aligned sequences
- A good degree of sequence divergence is
  needed. If “S” is the %similarity (expressed as a fraction) and “N”
  is the no. of sequences, then 1 - S^N > 0.95
- A good sequence pattern should have no
  fewer than 8 defined amino acid positions
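The divergence rule can be checked numerically. This is a minimal sketch, assuming S is supplied as a fraction between 0 and 1; the function names are illustrative and not taken from the slides.

```python
def divergence_score(similarity: float, n_sequences: int) -> float:
    """Return 1 - S**N, with S given as a fraction (assumption of this sketch)."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be a fraction between 0 and 1")
    if n_sequences < 1:
        raise ValueError("need at least one sequence")
    return 1.0 - similarity ** n_sequences


def is_divergent_enough(similarity: float, n_sequences: int,
                        threshold: float = 0.95) -> bool:
    """Apply the rule of thumb 1 - S^N > 0.95."""
    return divergence_score(similarity, n_sequences) > threshold


# Five sequences at 50% similarity pass; five at 80% similarity do not.
print(is_divergent_enough(0.5, 5))
print(is_divergent_enough(0.8, 5))
```

Under this reading, highly similar sequences need to be more numerous before the rule is satisfied.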
Representations of protein families
- Regular expression
- Position specific scoring matrices (profiles)
- Hidden Markov Models
- Probabilistic suffix trees
- Sparse Markov transducers
Pattern recognition methods
- These methods classify proteins into families
  - the basis of the methods is multiple sequence
    alignment
- They depend on developing a representation of
  conserved elements of alignments that may be
  diagnostic of structure or function, whether
  from
  - homologous sequence families
  - sequences that share some
    structural/functional domains
Regular expressions/patterns
These are derived from single conserved regions,
which are reduced to consensus expressions for
db searches
- they are minimal expressions, so sequence
  information is lost
- the more divergent the sequences used, the
  fuzzier & more poorly discriminating the pattern
  becomes

Alignment        Pattern
GAVDFIALCDRYF
GPIDFVCFCERFY    G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
GRVEFLNRCDRYY
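To make the pattern notation concrete, the sketch below translates a pattern written in the slide's notation into a Python regular expression and applies it to the three aligned sequences. The helper name `pattern_to_regex` is hypothetical, and only the elements used on these slides are handled (literal residues, X, [..], {..}, and repeat counts written as X2 or X(2)).

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\)|(\d+))?")


def pattern_to_regex(pattern: str) -> str:
    """Translate a hyphen-separated sequence pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        core, count_paren, count_bare = m.groups()
        if core in ("X", "x"):
            regex = "."                       # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            regex = core                      # literal residue or [..] class
        count = count_paren or count_bare
        if count:
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)


ALIGNMENT = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
REGEX = pattern_to_regex("G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2")

for seq in ALIGNMENT + ["GAVDFIALCDRYW"]:
    print(seq, bool(re.fullmatch(REGEX, seq)))
```

The last sequence differs from the first by a single residue at the final position and is rejected outright, however similar it is otherwise.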
Regular expressions/patterns
Patterns do not tolerate similarity
sequences either match or not, regardless of
how similar they are
matching is a binary ‘on-off’ event & frequently
misses true matches
single-motif methods are very hit-or-miss –
how do you know if you've encoded
the ‘best’ region?
PROSITE
G_PROTEIN_RECEPTOR; PATTERN
PS00237;
G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=919(919); /POS=869(869); /FALSE_POS=50(50); /FALSE_NEG=70;
/PARTIAL=49; /UNKNOWN=0(0)
- This represents an apparent 18% error rate
  - the actual rate is probably higher
- Thus, a match to a pattern is not necessarily true
  & a mismatch is not necessarily false!
- False negatives are a fundamental limitation of this type
  of pattern matching
  - if you don't know what you're looking for,
    you'll never know you missed it!
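The slide does not state how the 18% figure was obtained. One reading that reproduces it, offered here only as an assumption, counts false positives, false negatives and partial matches against the total number of hits:

```python
def apparent_error_rate(total: int, false_pos: int, false_neg: int,
                        partial: int) -> float:
    # Assumed definition (not given on the slide): problem cases / total hits.
    return (false_pos + false_neg + partial) / total


# Figures taken from the PS00237 statistics quoted above.
rate = apparent_error_rate(total=919, false_pos=50, false_neg=70, partial=49)
print(f"{rate:.1%}")
```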
Regular expressions/rules
- Regular expression patterns are most effective when
  applied to highly-conserved, family-specific motifs
- It is often possible to identify shorter generic patterns that
  are characteristic of common functional sites

Functional site                     Rule
N-glycosylation                     N-{P}-[ST]-{P}
Protein kinase C phosphorylation    [ST]-X-[RK]
Casein kinase II phosphorylation    [ST]-X2-[DE]

- Such features result from convergence to a common
  property
  - glycosylation sites, phosphorylation sites, etc.
- They cannot be used for family diagnosis & don't
  discriminate
  - they can only be used to suggest whether a certain
    functional site might exist (which must then
    be tested by experiment)
  - such patterns are termed rules
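The three rules above can be applied to a sequence with ordinary regular expressions. In this minimal sketch the regex translations are written by hand, the sequence is a made-up toy example, and a lookahead is used so that overlapping sites are all reported.

```python
import re

# Hand-translated versions of the three rules listed above.
RULES = {
    "N-glycosylation": r"N[^P][ST][^P]",
    "Protein kinase C phosphorylation": r"[ST].[RK]",
    "Casein kinase II phosphorylation": r"[ST].{2}[DE]",
}


def find_sites(sequence: str, rule_regex: str) -> list:
    """Return 1-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile("(?=" + rule_regex + ")")
    return [m.start() + 1 for m in lookahead.finditer(sequence)]


toy_sequence = "MNGTAESTRKSLDE"  # invented for illustration only
for name, regex in RULES.items():
    print(name, find_sites(toy_sequence, regex))
```

A hit only suggests that a site might exist; as the slide notes, it still has to be tested by experiment.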
Diagnostic limitations of short motifs
- Consider the sequence motif Asp-Ala-Val-Ile-Asp (DAVID)
  - results of db searching for such a sequence will differ,
    depending on whether we search for exact or permissive
    ‘fuzzy’ matches

Pattern                           Matches (number of matches in OWL31.1)
D-A-V-I-D                         99
D-A-V-I-[DEQN]                    252
[DEQN]-A-V-I-[DEQN]               925
[DEQN]-A-[VLI]-I-[DEQN]           2,739
[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]    51,506
D-A-V-E                           1,493

- Use of fuzzy regular expressions has the potential
  advantage of being able to recognise more distant
  relationships
  - & the inherent disadvantage that more matches will be
    made by chance, making it difficult to separate
    out true matches from noise
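Why fuzzier or shorter patterns collect more chance matches can be illustrated with a back-of-envelope calculation. The sketch below assumes, purely for illustration, that all 20 amino acids are equally likely at every position (real databases are not uniform, so the numbers will not reproduce the OWL counts). The function name is hypothetical.

```python
from fractions import Fraction


def chance_probability(pattern: str, alphabet_size: int = 20) -> Fraction:
    """Probability that a random window matches, under a uniform residue model."""
    prob = Fraction(1)
    for element in pattern.split("-"):
        # "[DEQN]" allows 4 residues; a single letter allows exactly 1.
        allowed = len(element) - 2 if element.startswith("[") else 1
        prob *= Fraction(allowed, alphabet_size)
    return prob


exact = chance_probability("D-A-V-I-D")
for pattern in ["D-A-V-I-D", "D-A-V-I-[DEQN]", "[DEQN]-A-V-I-[DEQN]",
                "[DEQN]-A-[VLI]-I-[DEQN]", "[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]",
                "D-A-V-E"]:
    ratio = chance_probability(pattern) / exact
    print(f"{pattern:32s} {ratio} x the exact pattern")
```

Each relaxed position multiplies the chance of a random hit, and dropping a position altogether (D-A-V-E) has a similar effect.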
Fingerprints
Fingerprints are groups of motifs excised from
alignments & used for iterative db searching
- no weighting scheme is used
- searches depend only on residue frequencies
- resulting scoring matrices are thus sparse
Each motif trawls the database independently
- search results are correlated to determine which
  sequences match all the motifs & which match only
  partially
- no information is thrown away
Iteration refines the fingerprint & increases its
potency
- fingerprints are diagnostically more powerful
  than regular expressions
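An unweighted, frequency-only motif matrix can be sketched in a few lines. This is an illustration of the idea rather than the fingerprint software itself: the motif is the three-sequence alignment shown earlier in the deck, the target sequence is invented, and the function names are hypothetical.

```python
from collections import Counter


def frequency_matrix(motif_rows):
    """One Counter per motif column; unseen residues implicitly score 0 (sparse)."""
    length = len(motif_rows[0])
    if any(len(row) != length for row in motif_rows):
        raise ValueError("motif rows must all have the same length")
    return [Counter(row[i] for row in motif_rows) for i in range(length)]


def window_score(matrix, window):
    """Sum of raw residue counts, with no weighting scheme applied."""
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_hit(matrix, sequence):
    """Return (score, 0-based offset) of the best-scoring window."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max((window_score(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


motif = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
matrix = frequency_matrix(motif)
print(best_hit(matrix, "MKTGAVDFIALCDRYFLLS"))
```

In a fingerprint, each motif would be scanned in this way independently and the hits then correlated across motifs.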
Profiles
Profiles are scoring tables derived from full alignments
these define which residues are allowed at given
positions
which positions are conserved & which degenerate
which positions, or regions, can tolerate insertions
the scoring system is intricate, & may include
evolutionary weights, results from structural
studies, & data implicit in the alignment
variable penalties are specified to weight against
INDELs occurring in core secondary structure elements
Profiles
Within a profile, fields contain position-specific
scores for insert & match positions
in conserved regions, INDELs aren't totally
forbidden, but are strongly impeded by large
penalties defined in a DEFAULT field
these are superseded by more permissive
values in gapped regions
the inherent complexity of profiles renders
them highly potent discriminators, but they are
time-consuming to derive
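The match scores of a profile can be sketched as position-specific log-odds values. The following is a simplified illustration, assuming an ungapped alignment, a uniform background distribution and a pseudocount of 1; all three are assumptions of this sketch. Real profiles also carry the position-specific gap penalties described above, which are omitted here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_profile(rows, pseudocount=1.0):
    """Build one {residue: log2 odds} table per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # uniform background (assumption)
    total = len(rows) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for i in range(len(rows[0])):
        counts = Counter(row[i] for row in rows)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(profile, segment))


rows = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
profile = log_odds_profile(rows)
print(round(profile_score(profile, rows[0]), 2))
print(round(profile_score(profile, "A" * 13), 2))
```

Unlike a raw frequency matrix, every residue gets a score at every position, so near misses are penalised rather than simply scoring zero.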
Hidden Markov Models
HMMs are similar in concept to profiles
they are probabilistic models consisting of interconnecting states
essentially, linear chains of match, delete or insert
states
Hidden Markov Models
- Match states are assigned to conserved
  columns in an alignment
- Insert states allow for insertions relative to
  match states
- Delete states allow match positions to be
  skipped
Thus, building an HMM requires each position
in an alignment to be assigned to match, delete
or insert states
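The assignment step can be sketched with a simple heuristic: treat a column as a match column when fewer than half of its rows are gaps, and everything else as insert. The 50% threshold and the toy alignment are assumptions of this sketch, not something specified on the slides.

```python
def assign_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Return a list of booleans: True where a column is treated as a match column."""
    n_rows = len(alignment)
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("alignment rows must all have the same length")
    is_match = []
    for i in range(width):
        gaps = sum(row[i] in gap_chars for row in alignment)
        is_match.append(gaps / n_rows < max_gap_fraction)
    return is_match


def state_path(row, is_match, gap_chars="-."):
    """Spell out the M (match), D (delete) and I (insert) states used by one row."""
    path = []
    for residue, match_column in zip(row, is_match):
        is_gap = residue in gap_chars
        if match_column:
            path.append("D" if is_gap else "M")
        elif not is_gap:
            path.append("I")   # residue in a non-match column
    return "".join(path)


alignment = ["ACD-EF", "AC--EF", "ACDWEF", "A-D-EF"]
columns = assign_columns(alignment)
for row in alignment:
    print(row, state_path(row, columns))
```

Counting how often each state and transition is used across the rows would then give the raw material for the model's probabilities.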
Hidden Markov Models
HMMs usually perform well, but can be overtrained
they may also suffer if created from
automatic iterative processes
if it once accepts a false match,
an HMM becomes corrupt
Probabilistic Suffix Trees
Identify short significant contiguous segments
Do not require multiple alignment
Induce a probability distribution on the next symbol to
appear right after the segment (short-term memory)
Variable memory length
More efficient than order L Markov chains
Longer memory length compared to
first-order HMMs, and easier to learn
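The variable-memory idea can be illustrated with a much-simplified context model: count which symbol follows each short segment, keep only segments seen often enough, and predict from the longest retained segment that ends the history. This is a sketch of the principle only; real probabilistic suffix trees use additional pruning and smoothing, and the parameter names and toy sequences here are assumptions.

```python
from collections import Counter, defaultdict


def train_contexts(sequences, max_length=3, min_count=2):
    """Count next-symbol occurrences after every context of length 0..max_length."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(max_length + 1):
                if i - k < 0:
                    break
                counts[seq[i - k:i]][symbol] += 1
    # Keep the empty context plus any context observed at least min_count times.
    return {ctx: c for ctx, c in counts.items()
            if ctx == "" or sum(c.values()) >= min_count}


def next_symbol_distribution(model, history, max_length=3):
    """Use the longest retained context that is a suffix of the history."""
    for k in range(min(max_length, len(history)), -1, -1):
        context = history[len(history) - k:]
        if context in model:
            total = sum(model[context].values())
            return {s: n / total for s, n in model[context].items()}
    return {}


model = train_contexts(["ABABAB", "ABAB"])
print(next_symbol_distribution(model, "XA"))
```

The history "XA" was never seen in training, so the model falls back to the shorter context "A"; this fallback is what makes the memory length variable.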
Which method is best?
The range of methods available leads to familiar
problems
- which should we use?
- which is the most reliable?
- which is the most comprehensive?
None of the pattern-recognition techniques is infallible
- each has its optimum area of application
None of the resulting pattern databases is complete
- none is the best
- bearing in mind the diagnostic strengths & weaknesses of
  the different approaches, & keeping biological significance
  in mind, the best strategy is to use them all
Pattern recognition & prediction
In investigating the meaning of sequences, 2
distinct analytical approaches have emerged
pattern recognition is used to detect similarity
between sequences & hence to infer related
structures & functions
ab initio prediction is used to deduce structure,
& to infer function, directly from sequence
These methods are different & shouldn’t be
confused!
Pattern recognition & prediction
Sequence- & structure-based pattern
recognition methods demand that some
characteristic has been seen before & housed
in a db
Prediction methods remove the need for
template dbs because deductions are made
directly from sequence
fact & fiction
- Sequence pattern recognition is easier to achieve,
  & is much more reliable, than fold recognition
  - which is only ~40-50% reliable even in expert hands
- Prediction is still not possible
  - & is unlikely to be so for decades to come (if ever)
- Structural genomics will yield representative
  structures for more proteins in future
  - structures of new sequences will be determined by
    modelling
  - prediction will become an academic exercise
But, to debunk a popular myth, knowing structure
alone does not inherently tell us function
fact & fiction
- Prediction methods don’t work because we don’t
  fully understand the Folding Problem
  - we can’t read the language sequences use to create their
    folds
- But, with sequence analysis techniques, we can try
  to find similarities between new sequences & those
  in dbs
  - whose structures & functions we hope have been elucidated
- This is straightforward at high levels of identity, but
  below 50% it is difficult to establish relationships
  reliably
- Analyses can be pursued with decreasing certainty
  down to ~20% identity, where results may look plausible to the eye,
  but are no longer statistically significant
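The identity levels quoted above refer to the percentage of aligned positions carrying the same residue. A minimal sketch follows, assuming the two sequences are already aligned and that positions with a gap in either sequence are excluded from the count (conventions for the denominator vary; this is one common choice, not necessarily the one behind the figures on the slide).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap_chars: str = "-.") -> float:
    """Percentage of identical residues over gap-free aligned positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in gap_chars or b in gap_chars:
            continue   # skip positions with a gap in either sequence
        compared += 1
        identical += (a == b)
    return 100.0 * identical / compared if compared else 0.0


# Two rows of the alignment shown earlier in the deck.
print(round(percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY"), 1))
```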
TERMINOLOGY
Homology & analogy
The term homology is confounded & abused in the
literature!
- sequences are homologous if they’re related by
  divergence from a common ancestor
- analogy relates to the acquisition of common features
  from unrelated ancestors via convergent evolution
  - e.g., β-barrels occur in soluble serine proteases & integral
    membrane porins; chymotrypsin & subtilisin share
    groups of catalytic residues, with near identical spatial
    geometries, but no other similarities
Homology & analogy
Homology is not a measure of similarity & is not
quantifiable
it is an absolute statement that sequences have a
divergent rather than a convergent relationship
the phrases "the level of homology is high" or "the
sequences show 50% homology", or any like them, are
strictly meaningless!
This is not just a semantic issue
loose use muddies thinking about evolutionary
relationships
A terminology muddle
- In comparing 3D structures, exactly the same
  arguments apply
  - structures may be similar, as denoted by RMS
    positional deviation between compared atomic
    positions
  - common evolutionary origin remains a
    hypothesis, until supported by other evidence
  - homology among similar structures is a
    hypothesis
  - this may be correct or mistaken, but their
    similarity is a fact, no matter how it is
    interpreted
- Similarity of sequence or structure is just that: similarity
- Homology connotes a common
  evolutionary origin
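The RMS positional deviation mentioned above is a plain numerical measure of similarity and says nothing about ancestry. A minimal sketch, assuming the two structures have already been superimposed and that their atoms are paired in the same order (the coordinates below are invented):

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(structure_a, structure_b))
```

A low value records that the structures are similar; whether that similarity reflects homology remains a separate hypothesis.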
Classification of homologs
- Orthologs – Two genes from two
  different species that derive from a
  single gene in the last common ancestor
  of the species.
- Paralogs – Two genes that derive from a
  single gene that was duplicated within a
  genome.
Orthology & paralogy
Among homologous sequences we can
distinguish
orthologues - largely perform the same function
in different species
paralogues - perform different but related
functions in one organism
Orthology & paralogy
Studying orthologues opens the way to molecular
palaeontology
e.g., using phylogenetic trees to show cross-species
relationships
Paralogues shed light on underlying evolutionary
mechanisms
paralogous proteins are thought to have arisen from
single genes via successive duplication events
duplicated genes follow separate evolutionary
pathways & new specificities evolve through variation
& adaptation
Such complexity presents real challenges for
sequence analysis
Classification of homologs
- Inparalogs - paralogs that evolved by
  gene duplication after the speciation
  event.
- Outparalogs - paralogs that evolved by
  gene duplication before the speciation
  event.
Challenges for sequence analysis
Much of the challenge is in getting the biology right
complicated by orthology vs paralogy
Following a db search, it may be unclear how much
functional annotation can be legitimately inherited
by a query
source of numerous annotation errors in dbs
propagation could lead to an error catastrophe
Challenges for sequence analysis
Further complications result from the modular
nature of proteins
modules are autonomous folding units, used as
protein building blocks - like Lego bricks, they
can confer a variety of functions on the parent
protein, either by multiple combinations of the
same module, or via different modules to form
mosaics
Automatic systems don’t distinguish orthologues
from paralogues & don’t consider the modular
nature of proteins
Challenges for sequence analysis
- Identifying evolutionary links between sequences is useful
  - this often implies a shared function
- Arguably, prediction of function from sequence is of more
  immediate value than the prediction of structure
- However, between distantly-related proteins, structure is
  more conserved than the underlying sequences
  - thus, some relationships are only apparent at the structural
    level
- Such relationships can't be detected by even the most
  sensitive sequence comparison methods
  - there is thus a theoretical limit to the effectiveness of
    sequence analysis methods and a region of identity where
    sequence comparisons fail completely to detect structural
    similarity
What can we learn from them?
- Orthologous proteins are evolutionary, and
  typically functional, counterparts in
  different species.
- Paralogous proteins are important for
  detecting lineage-specific adaptations.
- Both can reveal information on a
  specific species or a set of species.
Databases of protein domains
- Prosite: http://www.expasy.ch/prosite/
- Pfam: http://www.sanger.ac.uk/Software/Pfam/
- Blocks: http://www.blocks.fhcrc.org/
- ProDom: http://prodes.toulouse.inra.fr/prodom/doc/prodom.html
- Prints: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
- Domo: http://www.infobiogen.fr/services/domo/
- InterPro: http://www.ebi.ac.uk/interpro/
- Smart: http://smart.embl-heidelberg.de/
- eMotif: http://dna.stanford.edu/identify
Integrating Pattern Databases
MetaFam
IProClass
CDD
InterPro
Reference
Domains, motifs, and clusters in the protein universe
Jinfeng Liu & Burkhard Rost
Current Opinion in Chemical Biology, Vol 7 No 1 2003