Dr. Kim Henrick - European Bioinformatics Institute
EMBL-EBI
PATTERNS
Kim Henrick
EMBL-EBI
Terri Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/
Motifs and domains
Motif: a simple combination of a few consecutive
secondary structure elements with a specific geometric
arrangement (e.g., helix-loop-helix). May have a
specific biological function.
Domain: the fundamental unit of structure folding and
evolution. It combines several secondary elements and
motifs packed in a compact globular structure. A
domain can fold independently into a stable 3D
structure, and may have a specific function.
Domain family: proteins that share a domain (possibly
in combination with other domains)
Protein family: proteins that have the same combination
of domains
Profiles & Motifs are Useful
Helped identify active site of HIV protease
Helped identify SH2/SH3 class of STPs
Helped identify important GTP oncoproteins
Helped identify hidden leucine zipper in HGA
Used to scan for lectin binding domains
Regularly used to predict T-cell epitopes
Domains are More Useful
Rules of Thumb
Sequence pattern-based motifs should be
determined from no fewer than 5 multiply
aligned sequences
A good degree of sequence divergence is
needed. If “S” is the similarity (as a fraction) and “N”
is the no. of sequences, then 1 - S^N > 0.95
A good sequence pattern should have no
fewer than 8 defined amino acid positions
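The divergence rule can be checked numerically. The sketch below assumes the rule reads 1 - S^N > 0.95 with S expressed as a fraction between 0 and 1; the function names are illustrative and not from the slides.

```python
import math

def divergence(similarity, n_sequences):
    """Return 1 - S^N, with S given as a fraction strictly between 0 and 1."""
    return 1.0 - similarity ** n_sequences

def min_sequences(similarity, threshold=0.95):
    """Smallest N for which 1 - S^N exceeds the threshold."""
    if not 0.0 < similarity < 1.0:
        raise ValueError("similarity must be a fraction between 0 and 1")
    # 1 - S^N > t  <=>  S^N < 1 - t  <=>  N > log(1 - t) / log(S)
    bound = math.log(1.0 - threshold) / math.log(similarity)
    return math.floor(bound) + 1

print(min_sequences(0.5))  # 5
print(min_sequences(0.8))  # 14
```

Under this reading, the more similar the sequences are, the more of them are needed before the alignment counts as sufficiently divergent.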
Representations of protein families
Regular expression
Position specific scoring matrices (profiles)
Hidden Markov Models
Probabilistic suffix trees
Sparse Markov transducers
Pattern recognition methods
These methods classify proteins into families
the basis of the methods is multiple sequence
alignment
They depend on developing a representation of
conserved elements of alignments that may be
diagnostic of structure or function, whether
from
homologous sequence families
sequences that share some
structural/functional domains
Regular expressions/patterns
These are derived from single conserved regions,
which are reduced to consensus expressions for
db searches
they are minimal expressions, so sequence
information is lost
the more divergent the sequences used, the
more fuzzy & poorly discriminating the pattern
becomes
Alignment:
GAVDFIALCDRYF
GPIDFVCFCERFY
GRVEFLNRCDRYY
Pattern:
G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
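To make the on-off nature of pattern matching concrete, the pattern above can be translated into an ordinary regular expression. This is a minimal sketch: pattern_to_regex is a hypothetical helper, and it handles only the notation seen on these slides (X, residue classes in [], excluded residues in {}, and repeat counts written as X2 or X(2)).

```python
import re

def pattern_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # One residue, class or exclusion, optionally followed by a repeat count.
        m = re.fullmatch(
            r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])\(?(\d+(?:,\d+)?)?\)?", element
        )
        if m is None:
            raise ValueError(f"unrecognised element: {element!r}")
        core, count = m.groups()
        if core in ("X", "x"):
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

regex = pattern_to_regex("G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2")
for seq in ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY", "GAVDFIALCDRYW"]:
    print(seq, bool(re.fullmatch(regex, seq)))
```

The three aligned sequences match, while the fourth (a made-up variant with a single substitution in the last position) does not, however similar it is to the others.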
Regular expressions/patterns
Patterns do not tolerate similarity
sequences either match or not, regardless of
how similar they are
matching is a binary ‘on-off’ event & frequently
misses true matches
single-motif methods are very hit-or-miss –
how do you know if you've encoded
the ‘best’ region?
PROSITE
G_PROTEIN_RECEPTOR; PATTERN
PS00237;
G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=919(919); /POS=869(869); /FALSE_POS=50(50); /FALSE_NEG=70;
/PARTIAL=49; /UNKNOWN=0(0)
This represents an apparent 18% error rate
the actual rate is probably higher
Thus, a match to a pattern is not necessarily true
& a mis-match is not necessarily false!
False-negatives are a fundamental limitation to this type
of pattern matching
if you don't know what you're looking for,
you'll never know you missed it!
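The slide does not say how the 18% figure was obtained. One reading that reproduces it, offered here only as an assumption, counts false positives, false negatives and partial matches against the total number of hits. The sketch also computes sensitivity and precision from the same counts; the variable and function names are illustrative.

```python
# Counts taken from the PS00237 entry quoted above.
total_hits = 919
true_positives = 869
false_positives = 50
false_negatives = 70
partial = 49

def apparent_error_rate(fp, fn, part, total):
    """Assumed definition: all problematic matches divided by total hits."""
    return (fp + fn + part) / total

error = apparent_error_rate(false_positives, false_negatives, partial, total_hits)
sensitivity = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)

print(f"apparent error rate: {error:.1%}")   # 18.4%
print(f"sensitivity: {sensitivity:.1%}")     # 92.5%
print(f"precision: {precision:.1%}")         # 94.6%
```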
Regular expressions/rules
Regular expression patterns are most effective when
applied to highly-conserved, family-specific motifs
It is often possible to identify shorter generic patterns that
are characteristic of common functional sites
Functional site                     Rule
N-glycosylation                     N-{P}-[ST]-{P}
Protein kinase C phosphorylation    [ST]-X-[RK]
Casein kinase II phosphorylation    [ST]-X2-[DE]
Such features result from convergence to a common
property
glycosylation sites, phosphorylation sites, etc.
They cannot be used for family diagnosis & don't
discriminate
they can only be used to suggest whether a certain
functional site might exist (which must then
be tested by experiment)
such patterns are termed rules
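Because rules are so short, scanning for them is simple. The sketch below hand-translates the three rules listed above into Python regular expressions and reports every position where each one matches, including overlapping hits. The test sequence is made up for illustration, and scan_rules is a hypothetical helper.

```python
import re

# {P} ("anything but proline") becomes [^P]; X becomes "." (any residue).
RULES = {
    "N-glycosylation": "N[^P][ST][^P]",
    "Protein kinase C phosphorylation": "[ST].[RK]",
    "Casein kinase II phosphorylation": "[ST].{2}[DE]",
}

def scan_rules(sequence):
    """Return {rule name: [1-based start positions]}, overlapping hits included."""
    hits = {}
    for name, regex in RULES.items():
        # A zero-width lookahead lets consecutive matches overlap.
        hits[name] = [m.start() + 1 for m in re.finditer(f"(?={regex})", sequence)]
    return hits

for name, positions in scan_rules("MNGSAKSTRDSLEE").items():
    print(name, positions)
# N-glycosylation [2]
# Protein kinase C phosphorylation [4, 7]
# Casein kinase II phosphorylation [7, 11]
```

A 14-residue made-up sequence already yields five hits, which illustrates why such matches only suggest candidate sites.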
Diagnostic limitations of short motifs
Consider the sequence motif Asp-Ala-Val-Ile-Asp (DAVID)
results of db searching for such a sequence will differ,
depending on whether we search for exact or permissive
‘fuzzy’ matches
Pattern                           Matches (in OWL31.1)
D-A-V-I-D                         99
D-A-V-I-[DEQN]                    252
[DEQN]-A-V-I-[DEQN]               925
[DEQN]-A-[VLI]-I-[DEQN]           2,739
[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]    51,506
D-A-V-E                           1,493
Use of fuzzy regular expressions has the potential
advantage of being able to recognise more distant
relationships
& the inherent disadvantage that more matches will be
made by chance, making it difficult to separate
out true matches from noise
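The growth in chance matches can be estimated with a back-of-envelope model. The sketch assumes all 20 residues are equally likely and independent at every position, which real databases do not satisfy, so it indicates the trend only and is not meant to reproduce the OWL counts. chance_probability is a hypothetical helper.

```python
from math import prod

def chance_probability(pattern, alphabet_size=20):
    """Probability that a random window matches, under uniform residue usage."""
    sizes = []
    for element in pattern.split("-"):
        if element.startswith("["):
            sizes.append(len(element) - 2)   # residues listed between the brackets
        elif element in ("X", "x"):
            sizes.append(alphabet_size)      # wildcard matches anything
        else:
            sizes.append(1)                  # a single fixed residue
    return prod(size / alphabet_size for size in sizes)

exact = chance_probability("D-A-V-I-D")
for pattern in ["D-A-V-I-[DEQN]",
                "[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]",
                "D-A-V-E"]:
    ratio = chance_probability(pattern) / exact
    print(f"{pattern}: about {ratio:.0f}x more likely by chance than D-A-V-I-D")
```

Under these assumptions the fuzziest five-residue pattern is about 288 times more likely to match by chance than the exact one, and dropping a single position (D-A-V-E) raises the chance rate 20-fold.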
Fingerprints
Fingerprints are groups of motifs excised from
alignments & used for iterative db searching
no weighting scheme is used
searches depend only on residue frequencies
resulting scoring matrices are thus sparse
Each motif trawls the database independently
search results are correlated to determine which
sequences match all the motifs & which match only
partially
no information is thrown away
Iteration refines the fingerprint & increases its
potency
fingerprints are diagnostically more powerful
than regular expressions
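The correlation step can be sketched as follows. This is a simplified illustration, not the actual fingerprinting software: it assumes each motif search has already produced a set of matching sequence identifiers, and it ignores motif order and spacing. The identifiers and the function name are made up.

```python
def correlate_hits(motif_hits):
    """Split sequences into full matches (all motifs) and partial matches.

    motif_hits maps each motif name to the set of sequence ids it matched.
    Returns (full, partial), where partial maps id -> number of motifs matched.
    """
    n_motifs = len(motif_hits)
    counts = {}
    for hits in motif_hits.values():
        for seq_id in hits:
            counts[seq_id] = counts.get(seq_id, 0) + 1
    full = {seq_id for seq_id, c in counts.items() if c == n_motifs}
    partial = {seq_id: c for seq_id, c in counts.items() if c < n_motifs}
    return full, partial

# Hypothetical search results for a three-motif fingerprint.
hits = {
    "motif1": {"seqA", "seqB", "seqC"},
    "motif2": {"seqA", "seqB"},
    "motif3": {"seqA", "seqD"},
}
full, partial = correlate_hits(hits)
print(sorted(full))             # ['seqA']
print(sorted(partial.items()))  # [('seqB', 2), ('seqC', 1), ('seqD', 1)]
```

Keeping the partial matches, rather than discarding them, is what lets a fingerprint flag divergent family members that a single pattern would miss.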
Profiles
Profiles are scoring tables derived from full alignments
these define which residues are allowed at given
positions
which positions are conserved & which degenerate
which positions, or regions, can tolerate insertions
the scoring system is intricate, & may include
evolutionary weights, results from structural
studies, & data implicit in the alignment
variable penalties are specified to weight against
INDELs occurring in core secondary structure elements
Profiles
Within a profile, fields contain position-specific
scores for insert & match positions
in conserved regions, INDELs aren't totally
forbidden, but are strongly impeded by large
penalties defined in a DEFAULT field
these are superseded by more permissive
values in gapped regions
the inherent complexity of profiles renders
them highly potent discriminators, but they are
time-consuming to derive
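A heavily simplified position-specific scoring table illustrates the idea. The sketch assumes an ungapped alignment, a uniform background frequency and a fixed pseudocount; real profiles add evolutionary weights and position-specific gap penalties as described above. All names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} table per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the position-specific scores of a segment of the same length."""
    return sum(column[aa] for column, aa in zip(profile, segment))

alignment = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
profile = build_profile(alignment)
print(round(score(profile, "GAVDFIALCDRYF"), 2))   # family member: positive
print(round(score(profile, "W" * 13), 2))          # unrelated segment: negative
```

Unlike a regular expression, the score degrades gradually as a sequence departs from the alignment instead of switching from match to mismatch.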
Hidden Markov Models
HMMs are similar in concept to profiles
they are probabilistic models consisting of interconnecting states
essentially, linear chains of match, delete or insert
states
Hidden Markov Models
Match states are assigned to conserved
columns in an alignment
Insert states allow for insertions relative to
match states
Delete states allow match positions to be
skipped
Thus, building an HMM requires each position
in an alignment to be assigned to match, delete
or insert states
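The state assignment can be sketched with a simple heuristic: columns with few gaps become match columns, and the rest become insert columns. The 50% gap threshold and the toy alignment are assumptions made for illustration, as are the function names.

```python
def assign_columns(alignment, max_gap_fraction=0.5):
    """Label each alignment column 'M' (match) or 'I' (insert)."""
    n_rows = len(alignment)
    labels = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / n_rows
        labels.append("M" if gap_fraction <= max_gap_fraction else "I")
    return labels

def state_path(row, labels):
    """State path for one aligned row.

    In a match column a residue is M and a gap is D (the position is skipped);
    in an insert column a residue is I and a gap contributes nothing.
    """
    path = []
    for symbol, label in zip(row, labels):
        if label == "M":
            path.append("M" if symbol != "-" else "D")
        elif symbol != "-":
            path.append("I")
    return "".join(path)

alignment = ["GA-VD", "G-KVD", "GA-V-", "GA-VD"]   # made-up toy alignment
labels = assign_columns(alignment)
print("".join(labels))                   # MMIMM
for row in alignment:
    print(row, state_path(row, labels))
# GA-VD MMMM
# G-KVD MDIMM
# GA-V- MMMD
# GA-VD MMMM
```

Counting how often each state and transition occurs along these paths is what would then yield the model's emission and transition probabilities.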
Hidden Markov Models
HMMs usually perform well, but can be overtrained
they may also suffer if created from
automatic iterative processes
once it accepts a false match,
an HMM becomes corrupt
Probabilistic Suffix Trees
Identify short significant contiguous segments
Do not require multiple alignment
Induces a probability distribution on the next symbol to
appear right after the segment (short term memory)
Variable memory length
More efficient than order L Markov chains
Longer memory length compared to
first-order HMMs, and easier to learn
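The variable-memory idea can be illustrated with a flat table of contexts standing in for the tree. This is a rough sketch, not a full probabilistic suffix tree: it omits the pruning and smoothing a real implementation would use, and the names, the min_count cut-off and the toy training string are all assumptions.

```python
from collections import Counter, defaultdict

def train(sequences, max_memory=3):
    """Count which symbol follows every context of length 0..max_memory."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(max_memory + 1):
                if i - k < 0:
                    break
                counts[seq[i - k:i]][symbol] += 1
    return counts

def next_symbol_distribution(counts, history, min_count=2):
    """Predict from the longest suffix of the history seen often enough."""
    for start in range(len(history) + 1):
        context = history[start:]
        seen = counts.get(context)
        if seen and sum(seen.values()) >= min_count:
            total = sum(seen.values())
            return context, {s: c / total for s, c in seen.items()}
    return "", {}

counts = train(["ABABAB"], max_memory=2)
print(next_symbol_distribution(counts, "AB"))  # ('AB', {'A': 1.0})
print(next_symbol_distribution(counts, "XB"))  # ('B', {'A': 1.0})
print(next_symbol_distribution(counts, "X"))   # ('', {'A': 0.5, 'B': 0.5})
```

The memory length adapts to the data: a long context is used when it has been seen often enough, and the model falls back to shorter ones otherwise. No multiple alignment is involved at any point.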
Which method is best?
The range of methods available leads to familiar
problems
which should we use?
which is the most reliable?
which is the most comprehensive?
None of the pattern-recognition techniques is infallible
each has its optimum area of application
None of the resulting pattern databases are complete
none is the best
bearing in mind the diagnostic strengths & weaknesses of
the different approaches, & keeping biological significance
in mind, the best strategy is to use them all
Pattern recognition & prediction
In investigating the meaning of sequences, 2
distinct analytical approaches have emerged
pattern recognition is used to detect similarity
between sequences & hence to infer related
structures & functions
ab initio prediction is used to deduce structure,
& to infer function, directly from sequence
These methods are different & shouldn’t be
confused!
Pattern recognition & prediction
Sequence- & structure-based pattern
recognition methods demand that some
characteristic has been seen before & housed
in a db
Prediction methods remove the need for
template dbs because deductions are made
directly from sequence
fact & fiction
Sequence pattern recognition is easier to achieve,
& is much more reliable, than fold recognition
which is only ~40-50% reliable even in expert hands
Prediction is still not possible
& is unlikely to be so for decades to come (if ever)
Structural genomics will yield representative
structures for more proteins in future
structures of new sequences will be determined by
modelling
prediction will become an academic exercise
But, to debunk a popular myth, knowing structure
alone does not inherently tell us function
fact & fiction
Prediction methods don’t work because we don’t
fully understand the Folding Problem
we can’t read the language sequences use to create their
folds
But, with sequence analysis techniques, we can try
to find similarities between new sequences & those
in dbs
whose structures & functions we hope have been elucidated
This is straightforward at high levels of identity, but
below 50% it is difficult to establish relationship
reliably
Analyses can be pursued with decreasing certainty down to
~20% identity, where results may look plausible to the eye,
but are no longer statistically significant
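Percent identity is straightforward to compute once two sequences are aligned. The sketch below uses one common convention, counting identical residues over aligned columns in which neither sequence has a gap; other conventions use a different denominator and give different values. The function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

# Two of the aligned sequences from the regular-expression example.
print(round(percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY"), 1))  # 38.5
```

At 38.5% these two 13-residue segments already sit below the 50% level mentioned above, even though they come from a hand-picked conserved region.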
TERMINOLOGY
Homology & analogy
The term homology is confounded & abused in the
literature!
sequences are homologous if they’re related by
divergence from a common ancestor
analogy relates to the acquisition of common features
from unrelated ancestors via convergent evolution
e.g.,
β-barrels occur in soluble serine proteases & integral
membrane porins; chymotrypsin & subtilisin share
groups of catalytic residues, with near identical spatial
geometries, but no other similarities
Homology & analogy
Homology is not a measure of similarity & is not
quantifiable
it is an absolute statement that sequences have a
divergent rather than a convergent relationship
the phrases "the level of homology is high" or "the
sequences show 50% homology", or any like them, are
strictly meaningless!
This is not just a semantic issue
loose use muddies thinking about evolutionary
relationships
A terminology muddle
In comparing 3D structures, exactly the same
arguments apply
structures may be similar, as denoted by RMS
positional deviation between compared atomic
positions
common evolutionary origin remains a
hypothesis, until supported by other evidence
homology among similar structures is a
hypothesis
This may be correct or mistaken, but their
similarity is a fact, no matter how it is
interpreted
Similarity of sequence or structure is just that: similarity
Homology connotes a common
evolutionary origin
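The RMS positional deviation mentioned above is a purely geometric quantity, which is why it measures similarity and says nothing about homology. The sketch assumes the two structures have already been superimposed and that equivalent atoms are paired in the same order; finding the optimal superposition is a separate step not shown here. The coordinates are made up.

```python
import math

def rmsd(coords_a, coords_b):
    """RMS deviation between paired (x, y, z) positions, already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))  # 1.0
```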
Classification of homologs
Orthologs – Two genes from two
different species that derive from a
single gene in the last common ancestor
of the species.
Paralogs – Two genes that derive from a
single gene that was duplicated within a
genome.
Orthology & paralogy
Among homologous sequences we can
distinguish
orthologues - largely perform the same function
in different species
paralogues - perform different but related
functions in one organism
Orthology & paralogy
Studying orthologues opens the way to molecular
palaeontology
e.g., using phylogenetic trees to show cross-species
relationships
Paralogues shed light on underlying evolutionary
mechanisms
paralogous proteins are thought to have arisen from
single genes via successive duplication events
duplicated genes follow separate evolutionary
pathways & new specificities evolve through variation
& adaptation
Such complexity presents real challenges for
sequence analysis
Classification of homologs
Inparalogs - paralogs that evolved by
gene duplication after the speciation
event.
Outparalogs - paralogs that evolved by
gene duplication before the speciation
event.
Challenges for sequence analysis
Much of the challenge is in getting the biology right
complicated by orthology vs paralogy
Following a db search, it may be unclear how much
functional annotation can be legitimately inherited
by a query
source of numerous annotation errors in dbs
propagation could lead to an error catastrophe
Challenges for sequence analysis
Further complications result from the modular
nature of proteins
modules are autonomous folding units, used as
protein building blocks - like Lego bricks, they
can confer a variety of functions on the parent
protein, either by multiple combinations of the
same module, or via different modules to form
mosaics
Automatic systems don’t distinguish orthologues
from paralogues & don’t consider the modular
nature of proteins
Challenges for sequence analysis
Identifying evolutionary links between sequences is useful
this often implies a shared function
Arguably, prediction of function from sequence is of more
immediate value than the prediction of structure
However, between distantly-related proteins, structure is
more conserved than the underlying sequences
thus, some relationships are only apparent at the structural
level
Such relationships can't be detected by even the most
sensitive sequence comparison methods
there is thus a theoretical limit to the effectiveness of
sequence analysis methods and a region of identity where
sequence comparisons fail completely to detect structural
similarity
What can we learn from them?
Orthologous proteins are evolutionary, and
typically functional, counterparts in
different species.
Paralogous proteins are important for
detecting lineage-specific adaptations.
Both of them can reveal information on a
specific species or a set of species.
Databases of protein domains
Prosite
http://www.expasy.ch/prosite/
Pfam
http://www.sanger.ac.uk/Software/Pfam/
Blocks
http://www.blocks.fhcrc.org/
ProDom
http://prodes.toulouse.inra.fr/prodom/doc/prodom.html
Prints
http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
Domo
http://www.infobiogen.fr/services/domo/
InterPro
http://www.ebi.ac.uk/interpro/
Smart
http://smart.embl-heidelberg.de/
eMotif
http://dna.stanford.edu/identify
Integrating Pattern Databases
MetaFam
IProClass
CDD
InterPro
Reference
Domains, motifs, and clusters in the protein universe
Jinfeng Liu & Burkhard Rost
Current Opinion in Chemical Biology, Vol 7 No 1 2003