Domain analysis, motifs and repeats
Download
Report
Transcript Domain analysis, motifs and repeats
Classifying the protein universe
SynapseAssociated
Protein 97
Wu et al, 2002. EMBO J 19:5740-5751
Domain Analysis and Protein Families
Introduction
What are protein families?
Protein families
Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
Protein Families
Protein families are defined by homology:
In a family, everyone is related to everyone
Everybody in a family shares a common
ancestor:
Protein family 1
Protein family 2
Homology versus Similarity
Homologous proteins have similar 3D structures and (usually) share
common ancestry:
1chg
1chg
Superfamily:
Trypsin-like
Serine
Proteases
1sgt
1chg and 1sgt 31% identity, 43% similarity
We can infer homology from similarity!
1sgt
Homology versus Similarity
But Homologous proteins may not share
sequence similarity:
1chg
1chg
Superfamily:
Trypsin-like
Serine
Proteases
1sgc
1chg and 1sgc 15% identity, 25% similarity
We cannot infer similarity from homology
1sgc
Homology versus Similarity
Similar sequences may not have structural
similarity:
2baa
1chg
2baa
1chg
1chg and 2baa 30% similarity, 140/245 aa
We cannot assume homology from similarity!
Homology versus Similarity
Summary
Sequences can be similar without being homologous
Sequences can be homologous without being similar
Evolution /
Homology
BLAST
Similarity
Families ??
Domain Analysis and Protein Families
Introduction
What are protein families?
Protein families
Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
Description of a Protein Family
Let’s assume we know some members of a
protein family
What is common to them all?
Multiple alignment!
Techniques for searching sequence databases to
Some common strategies to uncover common
domains/motifs of biological significance that
categorize a protein into a family
• Pattern - a deterministic syntax that describes
multiple combinations of possible residues within a
protein string
• Profile - probabilistic generalizations that assign to
every segment position, a probability that each of
the 20 aa will occur
• Intermediate sequence search - link many profile
searches
Motif Description of a Protein
Family
Regular expressions:
........C.............S...L..I..DRY..I.......................W...
I
E W V
/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /
Automated Motif Discovery
Given a set of sequences:
GIBBS Sampler
http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
MEME
http://meme.sdsc.edu/meme/
PRATT
http://www.ebi.ac.uk/pratt
TEIRESIAS
http://cbcsrv.watson.ibm.com/Tspd.html
Combinatorial output!
Automated Profile Generation
Any multiple alignment is a profile!
PSIBLAST
Algorithm:
Start from a single query sequence
Perform BLAST search
Build profile of neighbours
Repeat from 2 …
Very sensitive method for database search
PSI-BLAST
Position Specific Iterative Blast
PSI-Blast profile models only positions in the query sequence
Query
Profile1
Profile2
After n iterations
...
Threshold for
inclusion in profile
HMMs
Hidden Markov Models are Statistical methods
that consider all the possible combinations of
matches, mismatches, and gaps to generate a
consensus (Higgins, 2000)
•Sequence ordering and alignments are not
necessary at the onset (but in many cases
alignments are recommended)
More the number of sequences better the models.
One can Generate a model (profile/PSSM), then
search a database with it (Eg: PFAM)
HMM libraries
PFAM
http://www.sanger.ac.uk/Pfam
The Pfam database is a large collection of protein
families, each represented by multiple sequence
alignments and hidden Markov models (HMMs).
Pfam-A entries are high quality, manually curated
families.
Pfam-B entries are generated automatically.
GTG steps
1. Generate alignment trace graph
•
•
•
Nodes = residues
Edges = aligned in PSI-Blast library
Unweighted
2. Edge weighting
•
Using consistency
3. Clustering
•
•
Driven by consistency
Single site occupancy rule
4. Post-processing
•
•
Generate non-redundant set of inter-cluster edges
Identify sub-trees with conserved residues
Alignment trace graph
Residues
more residues
Protein 1
Protein 2
Protein 3
Protein 4
Protein 5
-Graph representation of input pairwise alignment data
-Vertices = residues
-Edges = aligned in a pairwise alignment from input library
Consistency = neighbour overlap
i
Weight = intersection / union
j
GTG – global trace graph
Input: PSI-Blast all versus all alignments in
NRDB40
Output: superalignment of all proteins
Applications
Pairwise alignment of query and target sequences
Transitive sequence database searching (fast)
Tracking conserved residues (feature space)
Protein 1
Protein 2
Protein 3
Protein 4
Protein 5
Alignment trace graph
Protein 1
Protein 2
Protein 3
Protein 4
Protein 5
Cluster 1
Cluster 2
Edge weight = consistency (fraction of common neighbours)
Cluster ≈ hypothetical column of multiple alignment (single site occupancy)
‘Motif tracking’
A
G
A
K
K
H
A
A
consistency
K
K
K
K
K
K
A
consistency
consistency
Each vertex is labelled with source protein and position in sequence.
Motifs are subtrees enriched in one particular amino acid type.
Remote homolog detection
based on GTG alignment score
Lindahl benchmark, superfamily level
*SPARKS-0
GTG-DEEP
*FOLDpro
*SP3
GTG-LOCAL
*Prospect II
top-1
top-5
*Fugue
Blastlink
SAM-T98
PSI-Blast
Ssearch
HMMer
0
20
40
60
% correct
GTG clustering is informative; detect as many remote homologs as threading methods
Summary
Super-families form elongated clusters in
“protein space”
Profile models fluctuations around an equilibrium point
Consistency ~ path model
Exploits multiple profile models
Discriminative in database searching
Global trace graph data structure
Feature space for pattern discovery
http://ekhidna.biocenter.helsinki.fi/gtg/start
Relationships between families
Pfam clans
A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
Superfamily
http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/hmm.html
The sequence search method uses a library (covering all proteins of
known structure) consisting of 1539 SCOP superfamilies from
classes a to g. Each superfamily is represented by a group of
hidden Markov models.
Pfam-squared
Based on GTG comparisons of representative sequences from each
PFAM-A family against all PFAM-A families.
Rules of thumb: motif score>1000 means probably related, motif
score >500 means possibly related, score <500 means dubious
Benchmarking a motif/profile
You have a description of a protein family,
and you do a database search…
Are all hits truly members of your protein
family?
TP: true positive
Benchmarking:
TN: true negative
Result
Dataset
FP: false positive
FN: false negative
family member
not a family member
unknown
Benchmarking a motif/profile
Precision / Selectivity
Precision = TP / (TP + FP)
Sensitivity / Recall
Sensitivity = TP / (TP + FN)
Balancing both:
Precision ~ 1, Recall ~ 0: easy but useless
Precision ~ 0, Recall ~ 1: easy but useless
Precision ~ 1, Recall ~ 1: perfect but very difficult
Domain Analysis and Protein Families
Introduction
What are protein families?
Protein families
Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
The Modular Architecture of
Proteins
BLAST search of a multi-domain protein
Phosphoglycerate kinase
Triosephosphate isomerase
What are domains?
Functional - from experiments:
example: Decay Accelerating Factor
(DAF) or CD55
Has six domains (units):
4x Sushi domain (complement
regulation)
1x ST-rich ‘stalk’
1x GPI anchor (membrane attachment)
PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696
There is only so much we can
conclude…
Classifying domains [To aid structure
prediction
(predict
structural
domains,
molecular function of the domain)]
Classifying complete sequences (predicting
molecular function of proteins, large scale
annotation)
Majority of proteins are multi-domain proteins.
What are domains?
Mobile – Sequence Domains:
Protein 1
Protein 2
Mobile module
Protein 3
Protein 4
Domains are...
...evolutionary building blocks:
Families of evolutionarily-related sequence segments
Domain assignment often coupled with classification
With one or more of the following properties:
Globular
Independently foldable
Recurrence in different contexts
To be precise,
we say: “protein family”
we mean: “protein domain family”
Example: global alignment
Phthalate dioxygenase
reductase (PDR_BURCE)
Toluene - 4 monooxygenase electron
transfer component
(TMOF_PSEME)
Global alignment fails!
Only aligns largest domain.
Sometimes even more complex!
PGBM_HUMAN: “Basement membrane-specific heparan sulphate
proteoglycan core protein precursor”
980
1960
2940
3920
4391
45 domains of 9 different type, according to PFam
http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html
Properties of domains
Most domains: size approx 75 – 200 residues
So, you have a sequence...
...look it up in existing database
INTERPRO: http://www.ebi.ac.uk/interpro
...search against existing family descriptions
PFAM: http://www.sanger.ac.uk/Software/Pfam
INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/