How to - University of Sheffield
Bridging cheminformatics and bioinformatics using protein structures
Edith Chan
Inpharmatica
London
10 April 2001
SELECTING THE BEST TARGETS
High Validity and Drugability Requires a Unifying Informatics Framework
Disease-association doesn’t make a protein a target - requires validation as point of intervention in pathway
Having good biological rationale doesn’t make a protein tractable to chemistry (drugable)
Genomics, HTS and Combichem have increased numerical throughput many hundred fold - overload of poorly integrated data, shortfall in productivity
Target Validation Process (Bioinformatics): Disease → Target
Drug Discovery Process (Cheminformatics): Target Selection → Leads → Clinic
Inpharmatica’s protein structure focus
- uniquely placed to assess both parameters
BIOPENDIUM AND CHEMATICA
[Figure: Genome Data (DNA sequence) → Biopendium: protein target validation and selection → Target Structure → Chematica: drug discovery → Lead Hypotheses]
STRUCTURE-BASED METHODS FIND MANY HOMOLOGUES (AND PUTATIVE TARGETS) NOT DETECTABLE FROM SEQUENCE SIMILARITY
Biochemical function and drugability defined by 3D structure, not sequence - structure is better conserved
Test Sequence: AHHLDRPGHNMCEAGFWQPILL
[Figure: % SEQUENCE ID axis from 100% to 0, comparing Standard Approaches (marked at 30%), Advanced Approaches and Inpharmatica]
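The "% sequence ID" scale in the figure refers to the percentage of identical residues between two aligned sequences. A minimal sketch of that calculation follows, assuming the two sequences are already aligned to equal length with `-` marking gaps and that identity is counted over non-gap columns only (conventions differ). The function name and the second sequence are hypothetical, invented purely for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are not counted (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


test_sequence = "AHHLDRPGHNMCEAGFWQPILL"  # test sequence from the slide
# Hypothetical aligned homologue, made up for this example.
homologue = "AHKLDRPG-NMCDAGFWQPVLL"
identity = percent_identity(test_sequence, homologue)
```

Sequence-only searches become unreliable as this number falls, which is the motivation the slide gives for structure-based comparison.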
BIOPENDIUM
Inputs - all public (or proprietary) protein data
Proprietary methods
Genome-Threader
QBI-Y-Blast
Reverse Search Maximisation
Massive computation
1 million cpu hour set of calculations employing the most advanced algorithms (1100 processor farm)
Applied to 600,000 sequences, 14,000 structures + bound ligands
Yields 670m precalculated protein relationships
Query results in 15 minutes vs. two weeks with traditional bioinformatics in an Oracle database
Protein Information: Structures, Sequences, Bound ligands, Families, Functions
THE INPHARMATICA BIOPENDIUM
[Figure: Biopendium data flow]
Data sources: Genbank, Taxonomy, Swissprot, Prints, Prosite, PDB, Enzyme; proprietary sequences (ORF prediction); proprietary structures
Link complementary data in the 7 resources
Processing: mask sequences; processed PDB to XMAS data
Searches: pairwise sequence searches; profile based searches; threading based approaches
Precalculated data for 600,000 protein sequences (scores and alignments for each hit)
Relational Database
Inpharmatica Workbench: Ligplot ligand interaction editor; interactive sequence alignment editor; Inpharmatica enhanced RasMol 3D viewer
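Storing precalculated scores and alignments in a relational database turns a query into an indexed lookup rather than a fresh search. The sketch below illustrates the idea with Python's built-in `sqlite3` module standing in for the Oracle database mentioned in the slides. The table name, column names, identifiers and rows are all hypothetical and do not describe the actual Biopendium schema.

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE relationship (
        query_id  TEXT NOT NULL,
        hit_id    TEXT NOT NULL,
        method    TEXT NOT NULL,   -- e.g. pairwise, profile, threading
        score     REAL NOT NULL,
        alignment TEXT NOT NULL
    )
    """
)
# The index is what makes per-sequence lookups fast.
conn.execute("CREATE INDEX idx_query ON relationship (query_id)")

# Made-up precalculated hits.
rows = [
    ("seqA", "struct1", "pairwise", 120.0, "AHHLD/AHKLD"),
    ("seqA", "struct2", "threading", 35.5, "AHHLD/GHQLE"),
    ("seqB", "struct1", "profile", 60.0, "MKTAY/MRTAY"),
]
conn.executemany("INSERT INTO relationship VALUES (?, ?, ?, ?, ?)", rows)


def hits_for(query_id):
    """Return all stored hits for one sequence, best score first."""
    cursor = conn.execute(
        "SELECT hit_id, method, score FROM relationship "
        "WHERE query_id = ? ORDER BY score DESC",
        (query_id,),
    )
    return cursor.fetchall()


seq_a_hits = hits_for("seqA")
```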
DRUGABLE TARGET DISCOVERY
Finding a novel brain metalloprotease
BIOPENDIUM: novel brain protein identified
CHEMATICA: drugable site identified
CHEMATICA IS…
Modules: Site Identification; Site Mapping; Fragment Mapping; Pharmacophore Generation
Database of putative/known binding sites
•site mapping and pharmacophore generation
•similarity searching/clustering of sites
•large scale virtual screening resource
Chemical annotation of PDB
•ligand 2-D structures
•‘real’ ligand structures
Gene Family Data Views
•gene family structures
•consensus family analysis
Site identification - how are sites in a protein structure delineated?
a. A sphere is placed between the VDW surfaces of each atom pair.
b. Any neighbouring atoms penetrating the sphere cause its size to be reduced.
c. Repeat for all possible atom pairs.
d. Generate a surface around the surviving spheres to define the site region.
SURFNET: A program for visualizing molecular surfaces, cavities and intermolecular interactions.
Laskowski R A (1995), J. Mol. Graph., 13, 323-330.
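Steps a to c above can be sketched roughly as follows. This is an illustrative reading of the slide, not the SURFNET code itself: the function name, the input layout and the `min_radius` cutoff used to decide which spheres "survive" are assumptions, and step d (surface generation) is not covered.

```python
import math
from itertools import combinations


def gap_spheres(atoms, min_radius=1.0):
    """atoms: list of (x, y, z, vdw_radius). Returns [(centre, radius), ...]."""
    spheres = []
    for i, j in combinations(range(len(atoms)), 2):  # step c: all atom pairs
        xa, ya, za, ra = atoms[i]
        xb, yb, zb, rb = atoms[j]
        a, b = (xa, ya, za), (xb, yb, zb)
        separation = math.dist(a, b)
        gap = separation - ra - rb
        if gap <= 0:
            continue  # VDW surfaces touch or overlap, no room for a sphere
        # Step a: sphere midway between the two VDW surfaces.
        radius = gap / 2.0
        t = (ra + radius) / separation
        centre = tuple(p + t * (q - p) for p, q in zip(a, b))
        # Step b: shrink until no neighbouring atom penetrates the sphere.
        for k, (xk, yk, zk, rk) in enumerate(atoms):
            if k == i or k == j:
                continue
            radius = min(radius, math.dist(centre, (xk, yk, zk)) - rk)
        # Assumed survival rule: discard spheres that became too small.
        if radius >= min_radius:
            spheres.append((centre, radius))
    return spheres


# Three made-up atoms: two facing each other, one crowding the gap.
demo_atoms = [
    (0.0, 0.0, 0.0, 1.5),
    (6.0, 0.0, 0.0, 1.5),
    (3.0, 2.8, 0.0, 1.5),
]
spheres = gap_spheres(demo_atoms)
```

In this toy case only the sphere between the first two atoms survives, reduced in size by the third atom.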
Physical Parameters of the clefts
8 largest sites are stored together with their physical parameters
Volume
Hydrophobic content
Polar content
surface accessibility
……
In total - 20 parameters calculated.
Prediction of binding/active sites
Rule driven:
use of Neural Nets (a) on a training set of 100 ligand/protein PDBs
Validation:
success rate = 90% on an extended set of 500 PDBs
(a) backpropagation net - 7-5-1 network
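A 7-5-1 backpropagation network has 7 inputs, one hidden layer of 5 units and a single output. The sketch below shows one way such a network could be written in plain Python. The slide does not say which seven inputs were used, nor the activation function, learning rate or loss, so sigmoid units, squared error, the class name and the two training vectors are all assumptions for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Net751:
    """Feed-forward net: 7 inputs -> 5 hidden sigmoid units -> 1 sigmoid output."""

    def __init__(self, n_in=7, n_hidden=5, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update on the loss 0.5 * (out - target)**2."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights as they were before the update.
        delta_hidden = [delta_out * w * h * (1.0 - h)
                        for w, h in zip(self.w2, hidden)]
        for j, h in enumerate(hidden):
            self.w2[j] -= lr * delta_out * h
        self.b2 -= lr * delta_out
        for j, dh in enumerate(delta_hidden):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
            self.b1[j] -= lr * dh


def total_loss(net, samples):
    return sum(0.5 * (net.forward(x)[1] - t) ** 2 for x, t in samples)


# Two made-up feature vectors: target 1.0 = binding site, 0.0 = not.
samples = [
    ([0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7], 1.0),
    ([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2], 0.0),
]
net = Net751()
loss_before = total_loss(net, samples)
for _ in range(2000):
    for x, t in samples:
        net.train_step(x, t)
loss_after = total_loss(net, samples)
```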
How is the XSITE potential derived?
•3-D distributions of 20 different atom types about the 20 amino acids are calculated.
•No assumption of energy terms.
X-SITE: use of empirically derived atomic packing preferences to identify favourable interaction regions in the binding sites of proteins.
Laskowski R A, Thornton J M, Humblet C & Singh J (1996), Journal of Molecular Biology, 259, 175-201.
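The idea of an empirical distribution with no energy terms can be illustrated by simply counting where each atom type is observed around each residue type. The sketch below assumes contact coordinates have already been transformed into the residue's local reference frame and bins them on a coarse grid. The function name, grid spacing, atom-type labels and coordinates are hypothetical, and the published X-SITE procedure is considerably more detailed.

```python
from collections import defaultdict


def accumulate_distribution(contacts, grid_spacing=0.5):
    """contacts: iterable of (residue_name, atom_type, (x, y, z)) with
    coordinates in the residue's local frame. Returns, for each
    (residue, atom_type) pair, the fraction of observations per grid cell."""
    counts = defaultdict(lambda: defaultdict(int))
    for residue, atom_type, (x, y, z) in contacts:
        cell = (
            round(x / grid_spacing),
            round(y / grid_spacing),
            round(z / grid_spacing),
        )
        counts[(residue, atom_type)][cell] += 1
    distributions = {}
    for key, cells in counts.items():
        total = sum(cells.values())
        # Purely empirical: normalised observation counts, no energy model.
        distributions[key] = {cell: n / total for cell, n in cells.items()}
    return distributions


# Made-up observations for illustration only.
contacts = [
    ("ASP", "N.am", (2.9, 0.1, 0.0)),
    ("ASP", "N.am", (3.1, -0.1, 0.1)),
    ("ASP", "N.am", (0.0, 3.0, 0.0)),
    ("LEU", "C.ar", (4.0, 0.0, 0.0)),
]
distributions = accumulate_distribution(contacts)
asp_nitrogen = distributions[("ASP", "N.am")]
```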
Data Set Used
(1) 521 non-homologous protein chains* from PDB that satisfy
no two sequences share > 20% sequence identity
resolution < 1.8 Å
R factor < 0.2
AND
(2) 376 protein-ligand PDB structures for studying additional atom types other than those from peptides and proteins, such as Cl, F.
Note: The PDB has about 14K entries!
*cullpdb_pc20_res1.8_R0.2_d001130_chains521 (R. Dunbrack, Jr.)
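The three selection criteria for set (1) can be expressed as a filter. The sketch below is only an illustration: the culled list cited above was produced by its own procedure, whereas this code assumes a simple greedy pass that prefers better-resolved chains. The record layout, chain identifiers, sequences and the toy identity function are all hypothetical.

```python
def toy_identity(seq_a, seq_b):
    """Toy ungapped percent identity over the shorter length (illustration only)."""
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    return 100.0 * sum(a == b for a, b in zip(seq_a, seq_b)) / n


def cull_chains(chains, identity, max_identity=20.0,
                max_resolution=1.8, max_r_factor=0.2):
    """Keep chains with resolution < 1.8 A and R factor < 0.2 such that
    no two kept chains share more than 20% sequence identity."""
    quality = [c for c in chains
               if c["resolution"] < max_resolution
               and c["r_factor"] < max_r_factor]
    # Assumed tie-break: consider the best-resolved chains first.
    quality.sort(key=lambda c: c["resolution"])
    kept = []
    for chain in quality:
        if all(identity(chain["sequence"], k["sequence"]) <= max_identity
               for k in kept):
            kept.append(chain)
    return kept


# Made-up records; identifiers are not real PDB entries.
chains = [
    {"id": "c1", "resolution": 1.5, "r_factor": 0.18, "sequence": "AHHLDRPGHN"},
    {"id": "c2", "resolution": 1.2, "r_factor": 0.15, "sequence": "AHHLDRPGHQ"},
    {"id": "c3", "resolution": 2.5, "r_factor": 0.19, "sequence": "MKTWYEQLVS"},
    {"id": "c4", "resolution": 1.7, "r_factor": 0.25, "sequence": "GSTNVCIFWY"},
    {"id": "c5", "resolution": 1.6, "r_factor": 0.17, "sequence": "MKTWYEQLVS"},
]
kept_ids = [c["id"] for c in cull_chains(chains, toy_identity)]
```

Here `c3` fails the resolution cutoff, `c4` fails the R factor cutoff, and `c1` is dropped as too similar to the better-resolved `c2`.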
Projecting XSITE distributions onto the predicted binding site
Application of XSITE distributions to side-chains making up the calculated protein binding site
How is the pharmacophore generated?
a. Compare the XSITE predictions generated for the different probe atoms at a 3D grid of densities encompassing the region of the binding site.
b. The higher the value at a given grid-point, the higher the likelihood of finding that type of atom at that location.
c. For each probe atom, a “best” map is derived.
d. The net result is a new set of 3D grid maps, one per probe atom, holding only those regions where that atom scored higher than the others.
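Steps a to d amount to a per-grid-point comparison across probe maps. A minimal sketch follows, assuming each probe's predictions are stored as a dictionary from grid point to value and that a point with tied top scores is assigned to no probe. The function name, probe labels, tie rule and numbers are illustrative assumptions.

```python
def best_maps(probe_maps):
    """probe_maps: {probe: {grid_point: value}}.
    Returns one map per probe holding only the grid points where that
    probe scored strictly higher than every other probe."""
    best = {probe: {} for probe in probe_maps}
    grid_points = set()
    for values in probe_maps.values():
        grid_points.update(values)
    for point in grid_points:
        scores = {probe: values.get(point, 0.0)
                  for probe, values in probe_maps.items()}
        top_probe = max(scores, key=scores.get)
        top_score = scores[top_probe]
        n_top = sum(1 for s in scores.values() if s == top_score)
        # Assumed rule: ties and all-zero points are left unassigned.
        if top_score > 0 and n_top == 1:
            best[top_probe][point] = top_score
    return best


# Made-up prediction values on three grid points.
probe_maps = {
    "donor N": {(0, 0, 0): 0.8, (1, 0, 0): 0.2},
    "acceptor O": {(0, 0, 0): 0.3, (1, 0, 0): 0.6, (2, 0, 0): 0.4},
    "aromatic C": {(2, 0, 0): 0.4},
}
pharmacophore = best_maps(probe_maps)
```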
What is fragment mapping?
a. In-built database of more than 100 small molecule fragments: the most common functional groups, representing the common building blocks that satisfy drug-like elements used in chemistry.
b. Privileged structures from companies.
[Figure: example fragments: t-butyl, ethyl, phenyl, tBoc, naphthyl, di-phenyl, bi-phenyl, carbonyl, carboxyl, acetamide, acetic acid, furan, thiophene, oxazole, sulfonyl, thiazole, pyrrole, cyano, triazine, thiadiazole, piperazine, sulfonamide, imidazole, cyclohexyl, thiazolidine, methylamine, mercapto, methanol]
How is fragment mapping done?
• Each atom in a fragment is assigned one of the 20 atom types (e.g. C.ar).
• Each fragment is placed at every grid-point within the binding site and subjected to 300 rotations.
• At each rotation a score is calculated using the appropriate X-SITE predictions for the atom types that the fragment contains.
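The placement, rotation and scoring loop described above might look like the sketch below. The slide does not say how the 300 rotations are chosen or how atom positions are matched to the X-SITE grid, so random rotations and nearest-grid-cell lookup are assumptions here, as are the function names, the two-atom fragment and the score values.

```python
import math
import random


def random_rotation(rng):
    """Random 3x3 rotation matrix built from a normalised random quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate(matrix, v):
    return tuple(sum(matrix[r][c] * v[c] for c in range(3)) for r in range(3))


def map_fragment(fragment, site_maps, grid_points, spacing=1.0,
                 n_rotations=300, seed=0):
    """fragment: [(atom_type, (x, y, z))] centred on the origin.
    site_maps: {atom_type: {grid_cell: score}} (X-SITE-style predictions).
    Returns (best_score, best_grid_point, best_rotation_matrix)."""
    rng = random.Random(seed)
    rotations = [random_rotation(rng) for _ in range(n_rotations)]
    best = (float("-inf"), None, None)
    for point in grid_points:            # every grid-point in the site
        for matrix in rotations:         # 300 rotations per grid-point
            score = 0.0
            for atom_type, offset in fragment:
                pos = [p + o for p, o in zip(point, rotate(matrix, offset))]
                cell = tuple(round(c / spacing) for c in pos)
                # Look up the prediction for this atom's own type.
                score += site_maps.get(atom_type, {}).get(cell, 0.0)
            if score > best[0]:
                best = (score, point, matrix)
    return best


# Made-up two-atom fragment and prediction maps.
fragment = [("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0))]
site_maps = {"C": {(0, 0, 0): 1.0}, "O": {(1, 0, 0): 2.0}}
grid_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
best_score, best_point, best_matrix = map_fragment(fragment, site_maps, grid_points)
```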
CHEMATICA
Gene Family Data Views
Curated, high-quality annotation and presentation of important ‘drugable’ gene families
NHRs, kinases, caspases, GPCRs,….
Contains ligand structure information
Contains crystal environment classification
Automatic alerts for newly released structures
Multiple structure comparison options
CHEMATICA
Consensus Family Analysis
Size and topology of binding sites for MMP-1 & MMP-8 are similar, but detailed interactions differ
Spheres signify negative charge requirement in different areas of the binding pockets
provides potential for specificity
[Figure panels: MMP-1, MMP-8, MMP-13, MMP-3]
Validation Study
Two sets of data taken from the literature
1) GOLD (Jones, Willett, Glen, Leach and Taylor)
Genetic Optimization for Ligand Docking
(71% success rate in ligand binding mode in 100 PDBs)
our method - 70%
2) SUPERSTAR (Verdonk, Cole and Taylor)
Empirical method for interactions in proteins
(~67% success rate for original 4 probes in 122 PDBs)
our method - 84%
1. Jones et al. J. Mol. Biol. (1997) 267, 727-748
2. Verdonk et al. J. Mol. Biol. (1999) 289, 1093-1108
Acknowledgements
Inpharmatica
Alex Michie
John Overington
Simon Skidmore
UCL
Roman Laskowski
Adrian Shepherd
Janet Thornton