Domain analysis, motifs and repeats

Classifying the protein universe
Synapse-Associated Protein 97
Wu et al, 2002. EMBO J 19:5740-5751
Domain Analysis and Protein Families
Introduction
What are protein families?
Protein families
Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
Protein Families
Protein families are defined by homology:
• In a family, everyone is related to everyone
• Everybody in a family shares a common ancestor
[Figure: Protein family 1 and Protein family 2]
Homology versus Similarity
• Homologous proteins share common ancestry and (usually) have similar 3D structures:
[Figure: structures of 1chg and 1sgt. Superfamily: trypsin-like serine proteases]
• 1chg and 1sgt: 31% identity, 43% similarity
• We can infer homology from similarity!
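Identity and similarity percentages like the ones quoted for these structure pairs can be computed from a pairwise alignment. The sketch below is only an illustration: the similarity groups are a hypothetical simplification (real tools usually count positive scores in a substitution matrix such as BLOSUM62), and the denominator used here, the aligned non-gap columns, is one of several conventions in use.

```python
# Hypothetical similarity groups, chosen for illustration only.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def _similar(a, b):
    """True if two residues are identical or fall in the same group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(aln_a, aln_b):
    """Return (% identity, % similarity) over aligned, non-gap columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0.0
    ident = sum(a == b for a, b in pairs)
    simil = sum(_similar(a, b) for a, b in pairs)
    return 100.0 * ident / len(pairs), 100.0 * simil / len(pairs)


# Toy alignment (not taken from 1chg or 1sgt).
print(identity_and_similarity("ACDK-L", "ACEKGI"))
```

Similarity is always at least as high as identity, because every identical column also counts as similar.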
Homology versus Similarity
But Homologous proteins may not share
sequence similarity:
1chg
1chg
Superfamily:
Trypsin-like
Serine
Proteases
1sgc
1chg and 1sgc  15% identity, 25% similarity
We cannot infer similarity from homology
1sgc
Homology versus Similarity
Similar sequences may not have structural similarity:
[Figure: structures of 2baa and 1chg]
• 1chg and 2baa: 30% similarity, 140/245 aa
• We cannot assume homology from similarity!
Homology versus Similarity
Summary
• Sequences can be similar without being homologous
• Sequences can be homologous without being similar
[Diagram: Evolution / Homology, BLAST, Similarity, Families ??]
Description of a Protein Family
Let’s assume we know some members of a
protein family
What is common to them all?
Multiple alignment!
Some common strategies for searching sequence databases to uncover domains/motifs of biological significance that categorize a protein into a family:
• Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string
• Profile - a probabilistic generalization that assigns, to every position of a segment, a probability that each of the 20 amino acids will occur
• Intermediate sequence search - links many profile searches
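To make the profile idea concrete, here is a minimal sketch that turns a toy multiple alignment into per-position amino acid probabilities. The function name `build_profile`, the pseudocount value and the toy sequences are assumptions for illustration; production tools also weight sequences and use background frequencies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column probabilities for the 20 amino acids (gaps are ignored)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        # Pseudocounts keep unseen residues at a small non-zero probability.
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


toy_alignment = ["DRYLI", "DRWLV", "ERYMI"]  # hypothetical sequences
profile = build_profile(toy_alignment)
print(round(profile[1]["R"], 3))  # the fully conserved R column
```

Unlike a pattern, the profile never forbids a residue outright; it only makes it less probable.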
Motif Description of a Protein Family
Regular expressions:
........C.............S...L..I..DRY..I.......................W...
Alternative residues (shown beneath the consensus in the original figure): I for L, E for D, W for Y, V for I
/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10,12} W /
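The pattern above maps directly onto a regular expression in most programming languages. This sketch uses Python's `re` module and assumes that `x` means "any residue" and that the final gap is a range of 10 to 12 residues. The test sequence is synthetic, built only to satisfy the pattern.

```python
import re

# x{n} becomes .{n}; bracketed alternatives carry over unchanged.
MOTIF = re.compile(r"C.{13}S.{3}[LI].{2}I.{2}[DE]R[YW].{2}[IV].{10,12}W")


def find_motif(sequence):
    """Return (start, end) of the first motif match, or None."""
    match = MOTIF.search(sequence)
    return (match.start(), match.end()) if match else None


# Synthetic sequence: two leading residues, one motif instance, two trailing.
toy = ("MK" + "C" + "A" * 13 + "S" + "A" * 3 + "L" + "A" * 2 + "I" + "A" * 2
       + "DRY" + "A" * 2 + "I" + "A" * 11 + "W" + "GG")

print(find_motif(toy))                        # span of the match
print(find_motif(toy.replace("DRY", "DKY")))  # None: R is required
```

A pattern is deterministic: a single substitution at a fixed position (R to K here) removes the hit entirely. That is the main weakness of patterns compared with profiles.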
Automated Motif Discovery
Given a set of sequences:
• GIBBS Sampler: http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
• MEME: http://meme.sdsc.edu/meme/
• PRATT: http://www.ebi.ac.uk/pratt
• TEIRESIAS: http://cbcsrv.watson.ibm.com/Tspd.html
• Combinatorial output!
Automated Profile Generation
• Any multiple alignment is a profile!
• PSI-BLAST
Algorithm:
1. Start from a single query sequence
2. Perform BLAST search
3. Build profile of neighbours
4. Repeat from 2 …
Very sensitive method for database search
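The iteration can be illustrated with a deliberately simplified toy: ungapped, equal-length sequences, a uniform background and an arbitrary score threshold. None of this reflects the real PSI-BLAST implementation, which uses gapped alignments, E-value thresholds and sequence weighting. The function names and the toy database are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background: a simplifying assumption


def build_profile(members, pseudocount=1.0):
    """Log-odds score per column and residue from ungapped, equal-length members."""
    profile = []
    for col in range(len(members[0])):
        counts = Counter(seq[col] for seq in members)
        total = len(members) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: math.log(((counts[aa] + pseudocount) / total) / BACKGROUND)
                        for aa in AMINO_ACIDS})
    return profile


def score(profile, seq):
    """Sum of per-column log-odds scores for one sequence."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def iterative_search(query, database, threshold, max_iterations=5):
    """Search, rebuild the profile from the hits, and repeat until stable."""
    members = [query]
    for _ in range(max_iterations):
        profile = build_profile(members)
        hits = [s for s in database if score(profile, s) >= threshold]
        new_members = [query] + [s for s in hits if s != query]
        if set(new_members) == set(members):
            break  # converged: no change in the set of included sequences
        members = new_members
    return members


database = ["ACDEFG", "ACDEFH", "ACDKLH", "WWWWWW"]  # toy data
print(iterative_search("ACDEFG", database, threshold=2.5))
```

In this toy run, "ACDKLH" scores below the threshold against the query-only profile. It is picked up in the second round, once "ACDEFH" has been added to the profile. The same mechanism can also pull in unrelated sequences, which is why the inclusion threshold matters.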
PSI-BLAST
Position-Specific Iterative BLAST
• The PSI-BLAST profile models only positions in the query sequence
[Figure: Query, Profile1, Profile2, ... after n iterations; a threshold controls inclusion in the profile]
HMMs
• Hidden Markov Models are statistical methods that consider all possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)
• Sequence ordering and alignments are not necessary at the onset (but in many cases alignments are recommended)
• The more sequences, the better the models
• One can generate a model (profile/PSSM), then search a database with it (e.g. Pfam)
HMM libraries
 PFAM
http://www.sanger.ac.uk/Pfam
The Pfam database is a large collection of protein
families, each represented by multiple sequence
alignments and hidden Markov models (HMMs).
Pfam-A entries are high quality, manually curated
families.
Pfam-B entries are generated automatically.
GTG steps
1. Generate alignment trace graph
  • Nodes = residues
  • Edges = aligned in PSI-Blast library
  • Unweighted
2. Edge weighting
  • Using consistency
3. Clustering
  • Driven by consistency
  • Single site occupancy rule
4. Post-processing
  • Generate non-redundant set of inter-cluster edges
  • Identify sub-trees with conserved residues
Alignment trace graph
[Figure: residues of Protein 1 to Protein 5 drawn as vertices, with edges between aligned residues]
- Graph representation of input pairwise alignment data
- Vertices = residues
- Edges = aligned in a pairwise alignment from input library
Consistency = neighbour overlap
Weight = intersection / union (of the neighbour sets of vertices i and j)
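Read literally, "intersection / union" of the neighbour sets is a Jaccard index. The sketch below computes it for two residues i and j. The adjacency dictionary, the vertex labels and the choice not to count i and j themselves as neighbours are assumptions for illustration, not details taken from the GTG software.

```python
def consistency(neighbours, i, j):
    """Edge weight for (i, j): shared neighbours divided by all neighbours."""
    a = neighbours.get(i, set())
    b = neighbours.get(j, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical vertices labelled "protein:position".
neighbours = {
    "P1:10": {"P2:12", "P3:9", "P4:11"},
    "P2:12": {"P1:10", "P3:9", "P4:11", "P5:7"},
}

print(consistency(neighbours, "P1:10", "P2:12"))  # 2 shared out of 5 in the union
```

Two residues that are aligned to the same set of other residues get a weight near 1. An edge supported by only a single pairwise alignment gets a weight near 0.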
GTG – global trace graph
• Input: PSI-Blast all-versus-all alignments in NRDB40
• Output: superalignment of all proteins
• Applications:
  - Pairwise alignment of query and target sequences
  - Transitive sequence database searching (fast)
  - Tracking conserved residues (feature space)
[Figure: alignment trace graph of Protein 1 to Protein 5, with vertices grouped into Cluster 1 and Cluster 2]
Edge weight = consistency (fraction of common neighbours)
Cluster ≈ hypothetical column of multiple alignment (single site occupancy)
‘Motif tracking’
[Figure: tree of vertices labelled with amino acids (A, G, K, H), with edges labelled by consistency]
Each vertex is labelled with source protein and position in sequence.
Motifs are subtrees enriched in one particular amino acid type.
Remote homolog detection based on GTG alignment score
Lindahl benchmark, superfamily level
[Bar chart: % correct (axis 0 to 60) at top-1 and top-5 for *SPARKS-0, GTG-DEEP, *FOLDpro, *SP3, GTG-LOCAL, *Prospect II, *Fugue, Blastlink, SAM-T98, PSI-Blast, Ssearch and HMMer]
GTG clustering is informative; it detects as many remote homologs as threading methods.
Summary
• Super-families form elongated clusters in “protein space”
  - Profile models fluctuations around an equilibrium point
• Consistency ~ path model
  - Exploits multiple profile models
  - Discriminative in database searching
• Global trace graph data structure
  - Feature space for pattern discovery
http://ekhidna.biocenter.helsinki.fi/gtg/start
Relationships between families
• Pfam clans
  - A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
• Superfamily
  - http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/hmm.html
  - The sequence search method uses a library (covering all proteins of known structure) consisting of 1539 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models.
• Pfam-squared
  - Based on GTG comparisons of representative sequences from each Pfam-A family against all Pfam-A families.
  - Rules of thumb: motif score > 1000 means probably related, motif score > 500 means possibly related, score < 500 means dubious
Benchmarking a motif/profile
• You have a description of a protein family, and you do a database search…
• Are all hits truly members of your protein family?
• Benchmarking:
  - TP: true positive
  - FP: false positive
  - TN: true negative
  - FN: false negative
[Figure: search result overlaid on the dataset; sequences marked as family member, not a family member, or unknown]
Benchmarking a motif/profile
Precision / Selectivity
Precision = TP / (TP + FP)
Sensitivity / Recall
Sensitivity = TP / (TP + FN)
Balancing both:
Precision ~ 1, Recall ~ 0: easy but useless
Precision ~ 0, Recall ~ 1: easy but useless
Precision ~ 1, Recall ~ 1: perfect but very difficult
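As a worked example of the two formulas, the sketch below counts TP, FP and FN from a set of search hits and a set of known family members. The identifiers are made up, and sequences of unknown status are simply treated as non-members here, which is an assumption that can understate precision.

```python
def precision_recall(hits, family_members):
    """Return (precision, recall) for a search result against known members."""
    hits, family_members = set(hits), set(family_members)
    tp = len(hits & family_members)   # members that were found
    fp = len(hits - family_members)   # hits that are not members
    fn = len(family_members - hits)   # members that were missed
    precision = tp / (tp + fp) if hits else 0.0
    recall = tp / (tp + fn) if family_members else 0.0
    return precision, recall


# Hypothetical identifiers.
hits = {"seqA", "seqB", "seqC", "seqD"}
family = {"seqA", "seqB", "seqC", "seqE", "seqF", "seqG"}

print(precision_recall(hits, family))  # TP = 3, FP = 1, FN = 3
```

Tightening the search threshold usually moves hits from FP to TN but also from TP to FN, so precision goes up while recall goes down.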
The Modular Architecture of
Proteins
BLAST search of a multi-domain protein
Phosphoglycerate kinase
Triosephosphate isomerase
What are domains?
 Functional - from experiments:
example: Decay Accelerating Factor
(DAF) or CD55
Has six domains (units):
 4x Sushi domain (complement
regulation)
 1x ST-rich ‘stalk’
 1x GPI anchor (membrane attachment)
 PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696
There is only so much we can conclude…
• Classifying domains (to aid structure prediction: predict structural domains, molecular function of the domain)
• Classifying complete sequences (predicting molecular function of proteins, large-scale annotation)
• The majority of proteins are multi-domain proteins.
What are domains?
• Mobile – Sequence Domains:
[Figure: Protein 1 to Protein 4, with a mobile module highlighted]
Domains are...
 ...evolutionary building blocks:
Families of evolutionarily-related sequence segments
Domain assignment often coupled with classification
 With one or more of the following properties:
Globular
Independently foldable
Recurrence in different contexts
 To be precise,
we say: “protein family”
we mean: “protein domain family”
Example: global alignment
• Phthalate dioxygenase reductase (PDR_BURCE)
• Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)
Global alignment fails! It only aligns the largest domain.
Sometimes even more complex!
PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”
[Figure: domain architecture along the sequence; axis ticks at residues 980, 1960, 2940, 3920 and 4391]
45 domains of 9 different types, according to Pfam
http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html
Properties of domains
Most domains: size approx 75 – 200 residues
So, you have a sequence...
• ...look it up in an existing database
  - INTERPRO: http://www.ebi.ac.uk/interpro
• ...search against existing family descriptions
  - PFAM: http://www.sanger.ac.uk/Software/Pfam
  - INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/
