protws3 4658

Structural Genomics:
Case studies in assigning function from structure
James D Watson
Structural Genomics Collaborators
MCSG – Midwest Center for Structural Genomics
SPINE – Structural Proteomics in Europe
SGC – Structural Genomics Consortium
Structural Genomics Aims
Pathogens and disease
Human Proteins
Automation / High Throughput
Coverage of Fold Space
Proteins: known sequences and 3D structures
~1.3m non-redundant protein sequences
[Figure: sample sequence fragments, e.g. MRTKSPGDSKFHEITKTPPKNQVSNS…, MIVISGENVDIAELTDFLCAA…, PPRIPYSMVGPCCVFLMHH…]
5,500 non-redundant structures
~260,000 homology models
~10% unknown
3D structures of ~16,000 carefully selected proteins
Protein Function
Protein function has many definitions:
• Biochemical Function - The biochemical role of the protein, e.g. serine protease
• Biological Function - The role of the protein in the cell/organism, e.g. digestion, blood clotting, fertilisation
Function through homology
Motif searches
Sequence similarity
Active Site Templates
HTH motifs
Surface comparison
Structural Similarity
Template Methodology
Use 3D templates to describe the active site of the enzyme - analogous to 1-D sequence motifs such as PROSITE, but in 3-D (Wallace et al. 1997)
• defines a functional site
• search a new structure for a functional site
• search a database of structures for similar clusters
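The idea of searching a new structure for a functional site can be sketched as follows. This is a minimal illustration only, not the published template code: it assumes one representative point per residue, uses a superposition-free distance rmsd in place of the superposition rmsd quoted on later slides, and all names and coordinates (`search_template`, the toy residues) are hypothetical.

```python
from itertools import permutations
from math import dist, sqrt


def distance_rmsd(template_xyz, candidate_xyz):
    """RMS difference between corresponding inter-point distances."""
    n = len(template_xyz)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    squared = [
        (dist(template_xyz[i], template_xyz[j])
         - dist(candidate_xyz[i], candidate_xyz[j])) ** 2
        for i, j in pairs
    ]
    return sqrt(sum(squared) / len(squared))


def search_template(template, residues, cutoff):
    """Return (distance_rmsd, residue_indices) for every match within cutoff.

    template and residues are lists of (residue_type, (x, y, z)).
    """
    wanted_types = [res_type for res_type, _ in template]
    template_xyz = [xyz for _, xyz in template]
    hits = []
    for combo in permutations(range(len(residues)), len(template)):
        # Residue types must match the template in order.
        if [residues[k][0] for k in combo] != wanted_types:
            continue
        score = distance_rmsd(template_xyz, [residues[k][1] for k in combo])
        if score <= cutoff:
            hits.append((score, combo))
    return sorted(hits)


# Toy data (hypothetical coordinates, in Å).
TEMPLATE = [("SER", (0, 0, 0)), ("ARG", (4, 0, 0)), ("GLU", (0, 3, 0))]
QUERY = [
    ("GLU", (10, 13, 10)),
    ("ALA", (12, 12, 12)),
    ("SER", (10, 10, 10)),
    ("ARG", (14, 10, 10)),
    ("SER", (30, 30, 30)),
]
HITS = search_template(TEMPLATE, QUERY, cutoff=1.0)
```

A distance-based measure cannot tell a site from its mirror image, which is one reason a real template search would superpose atoms instead.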
SiteSeer’s “reverse” templates
[Figure: 3-residue templates numbered 1, 2, 3, … 9, …, shown with the query structure]
Problems with template methods
• Too many hits (hundreds, thousands or even tens of thousands)
• Use of rmsd rarely discriminates true from false positives
• Local distortion in structure may give a large rmsd
• Top hit rarely the correct hit – even in “obvious” cases
An example
PDB code: 1hsk
UDP-N-acetylenolpyruvoylglucosamine reductase (MURB), E.C.1.1.1.158
Contains the 3D template that characterises this enzyme class (template residues: Ser, Arg, Glu)
Sequence identity to template’s representative structure (1mbb) is 28%
Enzyme active site templates
Hits for 1hsk
[Figure: match of 1hsk to its own template (Ser, Arg, Glu), rmsd=2.19Å]

Hit    E.C. number      Rmsd     Enzyme
1.     E.C.1.3.99.2     0.76Å    Acyl-CoA dehydrogenase
2.     E.C.4.2.1.20     0.76Å    Tryptophan synthase α-subunit
3.     E.C.3.2.1.73     1.19Å    Glycosyl hydrolases, family 17
4.     E.C.3.2.1.73     1.21Å    Glycosyl hydrolases, family 16
5.     E.C.4.1.2.13     1.25Å    Fructose-bisphosphate aldolase (class I)
…
102.   E.C.1.1.1.158    2.19Å    UDP-N-acetylmuramate dehydrogenase
…
386.   …                3.94Å    …
Comparison of template environments
Template structure – 1mbb; Query structure – 1hsk
[Figure sequence, shown for both structures:
 1. Match to template: Ser, Arg, Glu
 2. Identical residues in neighbourhood
 3. Similar residues in neighbourhood]
Results for 1hsk

Hit   E.C. number      Rmsd    Score    Enzyme
1.    E.C.1.1.1.158    2.08    209.1    UDP-N-acetylmuramate dehydrogenase
2.    E.C.3.2.1.14     2.13    146.0    Chitinase A (chitodextrinase; 1,4-beta-poly-N-acetylglucosaminidase; poly-beta-glucosaminidase)
3.    E.C.3.2.1.17     1.92    142.4    Turkey lysozyme
4.    E.C.3.2.1.17     1.89    138.7    Hen lysozyme
5.    E.C.3.5.1.26     1.47    132.3    Aspartylglucosylaminidase
6.    E.C.3.2.1.3      1.54    131.1    Glucan 1,4-alpha-glucosidase
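To make the effect of the score concrete, the six results listed for 1hsk can be ranked both ways. The sketch below only re-sorts the numbers from the slide; the helper name `rank_of` is illustrative, and because only the six top-scoring hits are listed, the rmsd rank is a rank among those six rather than among all hits.

```python
# (E.C. number, rmsd in Å, score) for the six listed hits for 1hsk.
RESULTS = [
    ("1.1.1.158", 2.08, 209.1),
    ("3.2.1.14", 2.13, 146.0),
    ("3.2.1.17", 1.92, 142.4),
    ("3.2.1.17", 1.89, 138.7),
    ("3.5.1.26", 1.47, 132.3),
    ("3.2.1.3", 1.54, 131.1),
]


def rank_of(ec_number, rows, key, reverse=False):
    """1-based position of the first row with this E.C. number after sorting."""
    ordered = sorted(rows, key=key, reverse=reverse)
    return 1 + next(i for i, row in enumerate(ordered) if row[0] == ec_number)


# Lower rmsd is better; higher score is better.
RANK_BY_RMSD = rank_of("1.1.1.158", RESULTS, key=lambda row: row[1])
RANK_BY_SCORE = rank_of("1.1.1.158", RESULTS, key=lambda row: row[2], reverse=True)
```

Among these six hits the correct enzyme class would come fifth on rmsd alone but first on score.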
ProFunc – function from 3D structure
[Diagram: analyses contributing to the Function prediction]
• Homologous sequences of known function
• Functional sequence motifs, e.g. Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]
• Residue conservation analysis
• Template based methods
• Homologous structures of known function
• Binding site identification and analysis
• HTH-motifs, Electrostatics, Surface comparison, Nests … etc
Large scale analysis
• Created an edited version of the target database from the PDB – only those with status “In PDB”
• Extract all PDB codes for each Structural Genomics group
• Extract ‘prior’ knowledge (Header, Title, Jrnl, etc.)
• Find any associated GOA annotation
• Classify each structure by whether function is “known”, “unknown” or “limited info”
• Run ProFunc in a batch process on all codes (~560)
• Extract summary results from each analysis
• Compare to prior knowledge and estimate success
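The last two steps amount to tallying, per method, how each structure's result compares with its prior annotation. A minimal sketch of that tally is given below; the outcome labels follow the categories used in the result charts, while the function name and the toy input are hypothetical and do not reproduce any real results.

```python
from collections import Counter

OUTCOMES = ("No Hits", "Different Function", "Same Function")


def outcome_percentages(results):
    """Turn (method, outcome) pairs into per-method percentages."""
    counts = {}
    for method, outcome in results:
        if outcome not in OUTCOMES:
            raise ValueError(f"unexpected outcome: {outcome!r}")
        counts.setdefault(method, Counter())[outcome] += 1
    return {
        method: {
            outcome: 100.0 * tally[outcome] / sum(tally.values())
            for outcome in OUTCOMES
        }
        for method, tally in counts.items()
    }


# Toy input: one (method, outcome) pair per analysed structure (made up).
TOY_RESULTS = [
    ("SSM", "Same Function"),
    ("SSM", "Same Function"),
    ("SSM", "Different Function"),
    ("SSM", "No Hits"),
    ("HTH motif", "No Hits"),
    ("HTH motif", "No Hits"),
]
SUMMARY = outcome_percentages(TOY_RESULTS)
```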
Number of deposits to the TargetDB by Structural Genomics group (Total of 577 unique entries), March 2004

Group    Deposits
TB       26
SECSG    19
RIKEN    124
S2F      37
PSF      4
NYSGC    73
BCSG     35
CESG     6
JCSG     59
NESG     83
MCSG     117
PDB Blast
• Run query sequences against the PDB using BLAST
• Filtered out those matches released AFTER the query sequence
• Any hits are ignored from subsequent analyses
[Pie chart: No Matches vs Significant Hits (> 30% Seq ID); segment values 37% and 63%]
• Still get significant matches – why?
  – Target selection criteria
  – Released within months of SG target
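The date filter described on this slide can be sketched as below. It assumes each BLAST hit has already been parsed into a record with a sequence identity and a PDB release date; the `BlastHit` type, the field names and the toy hits are hypothetical, and only the 30% identity threshold comes from the slide.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BlastHit:
    pdb_code: str            # placeholder identifiers in the toy data below
    percent_identity: float  # sequence identity to the query, in percent
    released: date           # PDB release date of the matched entry


def prior_significant_hits(hits, query_released, min_identity=30.0):
    """Keep hits above the identity threshold that were not released after the query."""
    return [
        hit for hit in hits
        if hit.released <= query_released and hit.percent_identity > min_identity
    ]


# Toy data: made-up hits for a query released on 1 March 2003.
QUERY_RELEASED = date(2003, 3, 1)
TOY_HITS = [
    BlastHit("hitA", 45.0, date(2001, 6, 12)),  # earlier and significant: kept
    BlastHit("hitB", 62.0, date(2003, 9, 30)),  # released after the query: dropped
    BlastHit("hitC", 22.0, date(1999, 1, 5)),   # below 30% identity: dropped
]
KEPT = prior_significant_hits(TOY_HITS, QUERY_RELEASED)
```

A match released only a short time before the query would still pass this filter, which fits the slide's observation that significant matches remain.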
InterPro Scan
• InterPro scan on proteins of known function
[Pie chart: No Matches vs Significant Hits]
• Cannot “backdate” the InterPro database
• Essentially picking up itself
Function of query structure “known”
[Bar chart, 0% to 100%, one bar per method: SSM, SiteSeer, DNA, Ligand, Enzyme, HTH motif. Categories: No Hits, Different Function, Same Function]
Limited Functional Info
[Bar chart, 0% to 100%, one bar per method: SSM, SiteSeer, DNA, Ligand, Enzyme, HTH motif. Categories: No Hits, Different Function, Same Function, New Function]
Unknown Function
[Bar chart, 0% to 100%, one bar per method: SSM, SiteSeer, DNA, Ligand, Enzyme, HTH motif, InterPro. Categories: No Hits, Hit Unknown Function, New Function]
The Good, the Not So Good and the Ugly
Three examples show the varying levels of information
that can be retrieved from structures:
1. New functional assignment
2. Possible function identified
3. Function remains unknown
The Good: BioH structure (MCSG)
One very strong hit: Ser-His-Asp catalytic triad of the lipases with rmsd=0.28Å (template cut-off is 1.2Å)
Function Discovered
Experimentally confirmed by hydrolase assays
Novel carboxylesterase acting on short acyl chain substrates
The Not So Good: APC1040 (MCSG)
• Assigned as a probable glutaminase
• Most methods suggest β-lactamase activity
• No match to Prosite patterns
Function being assayed
APC1040 (residues 70-85): F-T-M-Q-S-I-S-K-V-I-S-F-I-A-A-C
Class A pattern:          [FY]-x-[LIVMFY]-x-S-[TV]-x-K-x(4)-[AGLM]-x(2)-[LC]
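The comparison between the APC1040 segment and the Class A pattern can be checked position by position. The sketch below handles only the fixed-length subset of PROSITE syntax that appears here (single residues, `x`, `[...]`, `{...}` and fixed repeats such as `x(4)`); the function names are illustrative and this is not how ProFunc itself scans patterns.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, plus optional (n).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\))?")


def expand_positions(pattern):
    """Expand a fixed-length PROSITE pattern into one regex per residue position."""
    positions = []
    for element in pattern.replace(" ", "").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, count = match.group(1), int(match.group(2) or 1)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        positions.extend([core] * count)
    return positions


def mismatches(pattern, segment):
    """Return the 1-based positions where the segment violates the pattern."""
    positions = expand_positions(pattern)
    if len(positions) != len(segment):
        raise ValueError("segment and pattern lengths differ")
    return [
        index
        for index, (regex, residue) in enumerate(zip(positions, segment), start=1)
        if not re.fullmatch(regex, residue)
    ]


CLASS_A = "[FY]-x-[LIVMFY]-x-S-[TV]-x-K-x(4)-[AGLM]-x(2)-[LC]"
APC1040_70_85 = "FTMQSISKVISFIAAC"
BAD_POSITIONS = mismatches(CLASS_A, APC1040_70_85)
FULL_MATCH = re.fullmatch("".join(expand_positions(CLASS_A)), APC1040_70_85)
```

With the segment starting at residue 70, the two mismatching positions (6 and 13) correspond to Ile 75, where the pattern requires T or V, and Ile 82, where it requires A, G, L or M. The segment is therefore close to the pattern without strictly matching it.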
The Ugly: MT0777 (MCSG)
Hypothetical protein from Methanobacterium thermoautotrophicum
• No sequence motifs
• Residue conservation is poor
• Fold associated with many functions (Rossmann fold)
• Template methods fail
Function Unknown
Future Work
• Improvements to scoring system and additional templates
• Further utilisation of SOAP services as they become available (e.g. KEGG API service)
• Possible adaptation to use as part of a larger workflow or in LIMS systems (Taverna and MyGrid)
• More truly predictive analyses being developed (e.g. electrostatics, ligand prediction, catalytic residue prediction)
Detection of DNA-binding proteins (with HTH motif) using structural motifs and electrostatics
(Hugh Shanahan)
● Combine electrostatics with HTH structural templates.
● Can detect HTH DNA-binding proteins only.
● 1/3 of DNA-binding protein families have an HTH motif.
● Use linear predictor as discriminant.
● True positive rate (~80%) is comparable to that of more complicated methods.
● Very low (< 0.01%) false positive rate.
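A linear predictor used as a discriminant is a weighted sum of features compared with a threshold. The sketch below shows only that general form: the two features (template rmsd and an electrostatic score), the weights and the bias are all made-up placeholders and are not the values or features of the method summarised above.

```python
def linear_score(features, weights, bias):
    """Weighted sum of the features plus a bias term."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def is_predicted_dna_binding(features, weights, bias, threshold=0.0):
    """Classify as DNA-binding when the linear score exceeds the threshold."""
    return linear_score(features, weights, bias) > threshold


# Hypothetical feature order: (HTH template rmsd in Å, electrostatic score).
# Hypothetical weights: a low rmsd and a high electrostatic score favour binding.
WEIGHTS = (-1.0, 0.5)
BIAS = 0.0

GOOD_FIT = is_predicted_dna_binding((0.8, 4.0), WEIGHTS, BIAS)
POOR_FIT = is_predicted_dna_binding((2.5, 1.0), WEIGHTS, BIAS)
```

In practice the weights would be fitted to known DNA-binding and non-binding structures, and the threshold chosen to trade true positives against false positives.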
Ligand Prediction
Can active site geometry, shape, physicochemical properties etc. be used to predict the preferred ligand class?
Active Site & Ligand description/fingerprinting methods:
• Spherical Harmonics
• Hybrid Ellipsoids
Spherical Harmonics
(Richard Morris)
Spherical t-designs
The computation of Legendre polynomials of high order requires a robust integration scheme
Hybrid Ellipsoids
(Rafael Najmanovich)
• Every shape can be modelled by a set of hybrid ellipsoids
• The parameters describe location and a,b,c of the ellipsoid and a smear factor
• Similar parameters mean similar active sites and ligands
Predicting Catalytic Residues
(Alex Gutteridge)
• Aims:
  • To predict the location of the active site in an enzyme structure.
  • To predict the catalytic residues of an enzyme.
• How?
  • Train a neural network to identify catalytic residues.
  • Cluster high scoring residues to find the active site.
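The clustering step can be illustrated with a simple single-linkage grouping of residues whose score passes a threshold. This sketch assumes that per-residue scores (for example from a trained network) and one coordinate per residue are already available; the thresholds, residue names and coordinates are hypothetical, and the actual clustering used in the work may differ.

```python
from math import dist


def cluster_high_scoring(residues, min_score, max_distance):
    """Group high-scoring residues that lie within max_distance of each other.

    residues maps a residue name to (score, (x, y, z)).
    Returns clusters as sorted name lists, largest cluster first.
    """
    selected = [name for name, (score, _) in residues.items() if score >= min_score]
    unvisited = set(selected)
    clusters = []
    for seed in selected:
        if seed not in unvisited:
            continue
        unvisited.discard(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [
                name for name in unvisited
                if dist(residues[current][1], residues[name][1]) <= max_distance
            ]
            for name in near:
                unvisited.discard(name)
                cluster.append(name)
                frontier.append(name)
        clusters.append(sorted(cluster))
    return sorted(clusters, key=len, reverse=True)


# Toy data: made-up scores in [0, 1] and coordinates in Å.
TOY_RESIDUES = {
    "SER45": (0.90, (0, 0, 0)),
    "HIS120": (0.80, (3, 0, 0)),
    "ASP98": (0.85, (3, 4, 0)),
    "LYS200": (0.70, (40, 0, 0)),
    "GLY10": (0.10, (1, 0, 0)),
}
CLUSTERS = cluster_high_scoring(TOY_RESIDUES, min_score=0.5, max_distance=6.0)
```

The largest cluster would then be taken as the candidate active site, and an isolated high scorer such as the toy LYS200 is left in a cluster of its own.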
Workflows and Taverna
(Tom Oinn)
• Most procedures used now follow a workflow type scheme
• Taverna allows users to pick elements from services to create their own workflows for automation of complex sets of procedures.
• Removes the need to write complex scripts
Beta 9 release available at: http://taverna.sourceforge.net/
Acknowledgements
Janet Thornton
Christine Orengo
Roman Laskowski – ProFunc
Richard Morris – InterPro search, Spherical Harmonics
Gail Bartlett, Craig Porter – Enzyme Templates
Alex Gutteridge – Catalytic Residue Prediction
Sue Jones – HTH motifs
Hugh Shanahan – DNA binding, Electrostatics
Jonathan Barker – JESS
Hannes Ponstingl – PITA
Rafael Najmanovich – Hybrid Ellipsoids
Martin Senger, Siamak Sobhany – SOAP, Tom Oinn – Taverna
Annabel Todd and Russell Marsden – UCL
MCSG consortium for lots of structures, plus many more at EBI and UCL
Work was supported by NIH grant (GM 62414) and by the US DoE under contract (W-31-109-Eng-38)