Transcript Slide 1

ChEMBL –
Large-Scale Open Access Data
for Drug Discovery
John Overington
EMBL-EBI
[email protected]
Private to Public Domain Transfer
• Five year strategic award from Wellcome Trust
• Large-scale Drug Discovery Structure Activity Relationship (SAR) data
• Linking small molecule structures to ‘targets’ and pharmacological activities – Chemogenomics/Chemical Biology
• ‘Open Access’, ‘User Friendly’, ‘Translational’, ‘Free’
• Multiple access mechanisms
  • Full database download, web front-ends, web services
• Actively support ad hoc sabbaticals (academic and commercial) at EMBL-EBI
ChEMBL Research Strategy
• Comprehensively catalogue historical drug discovery
  • Include successes and failures
  • Drugs can be small molecules, recombinant proteins, siRNA, etc.
• Derive rules for drug discovery ‘success’ from these data
  • Target selection and prioritisation
  • Lead discovery, optimisation, candidate selection
Drug Discovery Process (simplified)
[Figure: pipeline running from Discovery to Development, with the stages and data volumes listed below]
• Target Discovery: target identification; microarray profiling; target validation; assay development; biochemistry; clinical/animal disease models
• Lead Discovery: high-throughput screening (HTS); fragment-based screening; focused libraries; screening collection
• Lead Optimisation: medicinal chemistry; structure-based drug design; selectivity screens; ADMET screens; cellular/animal disease models; pharmacokinetics
• Preclinical Development: toxicology; in vivo safety pharmacology; formulation; dose prediction
• Clinical Trials: Phase 1 (PK, tolerability); Phase 2 (efficacy); Phase 3 (safety & efficacy)
• Launch
• Indication discovery & expansion
• Use
• Med. Chem. SAR: >450,000 distinct compounds, ~25,000 distinct lead series
• Clinical Candidates: ~12,000 candidates
• Drugs: ~1,300 drugs
ChEMBL: Launched Drugs
• Database of all approved drugs
• Chemistry and sequence ‘aware’
• Contents
• Small molecules and biological therapeutics
• USANs, INNs, research codes, other synonyms
• Pharmaceutical properties, prodrugs, dosage, form, etc
• PK data and metabolites, black box warnings, etc.
• 1,378 chemically distinct ‘drugs’, 324 distinct molecular targets
• Controlled vocabulary indications dictionary and hierarchy
New Drugs 2006-2009
[Figure: chart of new drugs by type; categories: Enzyme, mAb, Peptide, Other, Protein, Synthetic small molecule, Natural Product]
ChEMBL: Launched Drugs
Nat. Rev. Drug Disc., 5, pp. 993-996 (2006)
ChEMBL: Drug Dosage
[Figure: histogram of drug counts (y-axis 0 to 80) against binned log10 mole dose (x-axis -8.4 to -2.32, bin width 0.32). Labels on the plot: ‘nmol’, ‘mmol’, ‘~150-200mmol’, ‘Steroids, thyroids’, ‘Metformin, Hydroxyurea’]
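The x-axis of this histogram is a dose expressed in moles, on a log10 scale, in bins 0.32 wide starting at -8.4. A minimal sketch of that conversion follows. The function names are hypothetical, and the metformin figures used in the usage note (a 500 mg dose, molecular weight 129.16 g/mol) are illustrative values that do not come from the slides.

```python
import math


def log10_mole_dose(dose_mg: float, mol_weight: float) -> float:
    """Return log10 of a dose in moles, given the dose in mg and MW in g/mol."""
    moles = (dose_mg / 1000.0) / mol_weight
    return math.log10(moles)


def bin_lower_edge(value: float, start: float = -8.4, width: float = 0.32) -> float:
    """Return the lower edge of the histogram bin that contains `value`.

    The defaults mirror the axis ticks in the slide (-8.4, -8.08, -7.76, ...).
    """
    index = math.floor((value - start) / width)
    return round(start + index * width, 2)


if __name__ == "__main__":
    # Illustrative input only: 500 mg of metformin, MW 129.16 g/mol.
    value = log10_mole_dose(500, 129.16)
    print(round(value, 2), bin_lower_edge(value))
```

With these inputs the dose is roughly 3.9 mmol, which falls in the bin starting at -2.64, towards the high-dose end of the axis.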
Affinity Of Drugs For Their Targets
• Retrieved Ki, Kd, IC50, EC50, pA2, … endpoints for drugs against their ‘efficacy targets’
[Figure: histogram of Frequency (0 to 400) against -log10 affinity (2 to 12). Axis equivalents: 2 = 10 mM, 3 = 1 mM, 4 = 100 µM, 5 = 10 µM, 6 = 1 µM, 7 = 100 nM, 8 = 10 nM, 9 = 1 nM, 10 = 100 pM, 11 = 10 pM, 12 = 1 pM]
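The -log10 affinity scale puts endpoints reported in different concentration units on one axis. A minimal sketch of the conversion is below. The function name and unit table are hypothetical, and it assumes the endpoint is a plain concentration (Ki, Kd, IC50, EC50). pA2 is already a negative logarithm and would not go through this function.

```python
import math

# Conversion factors from common concentration units to molar.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def p_affinity(value: float, unit: str) -> float:
    """Return -log10 of an affinity given as a concentration in `unit`."""
    if value <= 0:
        raise ValueError("affinity must be a positive concentration")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


if __name__ == "__main__":
    # The thrombin example elsewhere in these slides quotes Ki = 4.5 nM.
    print(round(p_affinity(4.5, "nM"), 2))
```

On this scale the 4.5 nM thrombin Ki sits between the 10 nM (8) and 1 nM (9) tick marks.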
Function for Drug Efficacy/Affinity
• Empirical function that estimates the probability of in vivo activity for a compound with acceptable PK characteristics as a function of target affinity
[Figure: P(efficacy) (0.0 to 1.0) against -log10 Affinity (0 to 12), with concentration ranges marked mM, µM, nM, pM]
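The slide does not say how the empirical function was derived. One plausible construction, assumed here and not taken from the slides, is the empirical cumulative distribution of -log10 affinities observed for known drugs: the value at a given affinity is the fraction of drugs that are no more potent than that. The sketch below uses made-up affinities purely for illustration.

```python
from bisect import bisect_right
from typing import Callable, Iterable


def make_empirical_cdf(p_affinities: Iterable[float]) -> Callable[[float], float]:
    """Build an empirical CDF over -log10 affinity values.

    Assumption: the CDF of drug affinities serves as a stand-in for the
    P(efficacy) curve; the slides do not confirm this construction.
    """
    ordered = sorted(p_affinities)
    if not ordered:
        raise ValueError("need at least one affinity value")
    n = len(ordered)

    def cdf(x: float) -> float:
        # Fraction of observed drugs with -log10 affinity <= x.
        return bisect_right(ordered, x) / n

    return cdf


if __name__ == "__main__":
    # Hypothetical -log10 affinities, not data from ChEMBL.
    curve = make_empirical_cdf([5.2, 6.1, 7.0, 7.4, 8.0, 8.3, 8.9, 9.5])
    for x in (4, 6, 8, 10):
        print(x, curve(x))
```

A step function like this rises from 0 to 1 across the affinity range, which matches the general shape of the axes in the figure; any smoothing or fitting used in the original work is not reproduced here.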
ChEMBL: Clinical Candidates
• Database of clinical development candidates
• Contains ~10,000 2-D structures
• Estimated size ~35-45,000 compounds
• Work in progress
• Deeper coverage of key gene families
• e.g. Protein kinases, 184 distinct clinical candidates
[Figure: bar chart of clinical candidates by target (y-axis 0 to 90); labelled targets: VEGFR, PDGFR, p38a, C-Kit, Aurora, ErbB, CDK]
[Figure: kinase clinical candidates by highest phase: Launched, III, II, I]
Industry Productivity
File Registration number vs USAN date
[Figure: file registration number (0 to 800,000) plotted against USAN date (1960 to 2010)]
Industry Productivity
[Figure: bar chart by file registration number range, in bins of 100,000 from 1-100,000 to 700,001-800,000 (y-axis 0 to 70)]
• 64 USANs/100,000 compounds
• 16 Drugs/100,000 compounds
• 1.9 USANs/100,000 compounds
• 0.4 Drugs/100,000 compounds
• USAN assignment typically at entry to phase 3
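The per-100,000 rates come from grouping compounds by their file registration number into ranges 100,000 wide. A minimal sketch of that binning follows. The function names and the input numbers are hypothetical, and it assumes every number in a range corresponds to a registered compound, so that the count in a bin is also the rate per 100,000 compounds.

```python
from collections import Counter
from typing import Dict, Iterable

BIN_WIDTH = 100_000


def bin_label(reg_number: int) -> str:
    """Return the range label (e.g. '100,001-200,000') for a registration number."""
    if reg_number < 1:
        raise ValueError("registration numbers start at 1")
    lower = ((reg_number - 1) // BIN_WIDTH) * BIN_WIDTH + 1
    return f"{lower:,}-{lower + BIN_WIDTH - 1:,}"


def usans_per_bin(reg_numbers: Iterable[int]) -> Dict[str, int]:
    """Count compounds that received a USAN in each registration-number range."""
    return dict(Counter(bin_label(n) for n in reg_numbers))


if __name__ == "__main__":
    # Hypothetical registration numbers of compounds that reached a USAN.
    print(usans_per_bin([5, 99_999, 100_000, 100_001, 750_000]))
```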
ChEMBL: SAR data
• Bioactive compounds
• Link through to validated synthetic routes and assay protocols
• Bidirectionally linking compounds to/from targets
• Built from 12 primary journals
• J.Med.Chem., Bioorg.Med.Chem., PNAS, JBC, Bioorg.Med.Chem.Letts., Eur.J.Med.Chem., DMD, Xenobiotica, Nature, Science, AACR, J.Nat.Prod.
• StARlite 1 – June 2001
• StARlite 31 – August 2008
[Figure: 2-D chemical structure of the example compound]
Compound
>Thrombin (Homo sapiens)
Ki=4.5 nM
Target
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRK
GNLERECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEG
NCAEGLGTNYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRN
PDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQ
CVPDRGQQYQGRLAVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNP
DGDEEGVWCYVAGKPGDFGYCDLNYCEEAVEEETGDGLDEDSDRAIEGRTATS
EYQTFFNPRTFGSGEADCGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAE
IGMSPWQVMLFRKSPQELLCGASLISDRWVLTAAHCLLYPPWDKNFTENDLLV
RIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLKKPVAFSDY
IHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQVVNLPI
VERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWY
QMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE
StARLITe
Bioactivity
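The compound, bioactivity and target in this example form one record that can be reached from either end. A minimal sketch of such a bidirectional index is below. The class names, field names and the compound identifier `CMPD-1` are hypothetical and do not reflect the actual StARlite schema; only the thrombin target and the Ki of 4.5 nM come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Bioactivity:
    compound_id: str   # hypothetical identifier, not a real StARlite key
    target_name: str
    endpoint: str      # e.g. "Ki", "IC50"
    value: float
    unit: str


class ActivityIndex:
    """Index bioactivity records by compound and by target."""

    def __init__(self) -> None:
        self._by_compound: Dict[str, List[Bioactivity]] = defaultdict(list)
        self._by_target: Dict[str, List[Bioactivity]] = defaultdict(list)

    def add(self, record: Bioactivity) -> None:
        # Store the same record under both keys so lookups work in both directions.
        self._by_compound[record.compound_id].append(record)
        self._by_target[record.target_name].append(record)

    def targets_of(self, compound_id: str) -> List[str]:
        return sorted({r.target_name for r in self._by_compound.get(compound_id, [])})

    def compounds_for(self, target_name: str) -> List[str]:
        return sorted({r.compound_id for r in self._by_target.get(target_name, [])})


if __name__ == "__main__":
    index = ActivityIndex()
    index.add(Bioactivity("CMPD-1", "Thrombin (Homo sapiens)", "Ki", 4.5, "nM"))
    print(index.targets_of("CMPD-1"))
    print(index.compounds_for("Thrombin (Homo sapiens)"))
```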
Drug Optimisation
[Figure: structural lineage of the nitroimidazole and azole drugs, shown as 2-D structures grouped under the labels ‘Imidazole’, ‘triazole’, ‘Prototype’, ‘1st generation’, ‘2nd generation’, ‘3rd generation’ and ‘4th generation’]
• Azomycin (1956): Streptomyces natural product, trichomonacidal, ‘toxic’
• Metronidazole 1962
• Tinidazole 1970
• Clotrimazole 1970
• Miconazole 1970
• Econazole 1972
• Ketoconazole 1978
• Terconazole 1980
• Sulconazole 1980
• Bifonazole 1981
• Itraconazole 1984
• Fluconazole 1988
• Voriconazole 2002
• Fosfluconazole 2004
• Posaconazole 2005
After W. Sneader
ChEMBL SAR Contents
• Abstracted from 26,299 papers from 12 journals
• Monthly update cycle - optimised curation pipeline
• Autocuration tools – clean up and index other large SAR datasets
• Updates and ongoing curation cover all data, not only data from new articles
• 521,237 compound records
• 440,055 distinct compound structures
• 5,439 targets
• 3,512 protein molecular targets
• ~2,200 orthologous targets (1,644 human)
• 1,936,969 experimental bioactivities
Counts refer to StARlite release 31
Interface and Searching
Rule-based Optimisation – Bioisosteres
• Identify data-driven ‘rational’ lead-optimisation strategies
• Useful in automated design
• e.g. Replacement of carboxylic acid
• Reflect synthetic ease and expectation for functional effect
[Figure: workflow and result for carboxylic acid replacement]
• Search StARLITe for functional group
• Search for all ‘contexts’ where acid has been replaced
• Retrieve assay value (ΔIC50)
[Plot: Frequency (%) (0 to 60) against effect on affinity (-log10 IC50) (-6 to 6); labelled replacements: sulphonamide, tetrazole, sulphonic acid, ester]
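The workflow on this slide (find the functional group, find contexts where it was replaced, retrieve the assay values) can be sketched as a comparison of matched pairs. The code below is a minimal illustration under stated assumptions: the input is a list of (replacement name, IC50 of the acid compound, IC50 of the replaced compound) in nM measured in the same assay, the function names are hypothetical, the IC50 values are invented, and a positive result means the replacement is more potent than the acid.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (replacement name, acid IC50 in nM, replacement IC50 in nM)
Pair = Tuple[str, float, float]


def p_ic50(ic50_nm: float) -> float:
    """Return -log10 of an IC50 given in nM."""
    return 9.0 - math.log10(ic50_nm)


def replacement_effects(pairs: Iterable[Pair]) -> Dict[str, List[float]]:
    """Group the change in -log10 IC50 by replacement type."""
    effects: Dict[str, List[float]] = defaultdict(list)
    for name, acid_ic50, replacement_ic50 in pairs:
        effects[name].append(p_ic50(replacement_ic50) - p_ic50(acid_ic50))
    return dict(effects)


def mean_effects(pairs: Iterable[Pair]) -> Dict[str, float]:
    """Return the average effect on affinity for each replacement type."""
    return {name: mean(values) for name, values in replacement_effects(pairs).items()}


if __name__ == "__main__":
    # Invented matched pairs, for illustration only.
    example = [("tetrazole", 100, 10), ("tetrazole", 50, 50), ("ester", 10, 1000)]
    print(mean_effects(example))
```

Plotting the per-replacement lists as histograms would give a figure of the same kind as the one on the slide, with one distribution per replacement group.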
Typical Compound Collection - Novartis
[Figure: 2-D structures of the ring fragments in the collection]
Fragments shown: benzene, pyridine, piperidine, piperazine, cyclohexane, imidazole, naphthalene, thiophene, quinoline, cyclopropane, benzimidazole, benzthiazole, benzodioxole, pyran, quinazoline, triazole, adamantane, pyrrolidine, thiazole, pyrrole, imidazoline, isoxazole, tetrahydroisoquinoline, benzofuran, pyrazole, furan, indole, morpholine, pyrimidine, cyclopentane, purine, tetrazole, triazine, tetrahydrofuran, isoquinoline
Ertl, Koch and Roggo, Novartis
Screening File Comparison - Novartis
[Figure: scatter plot of Novartis rank (0 to 35) against StARLITe rank (0 to 35), with regions marked ‘Depleted fragments’ and ‘Enriched fragments’; labelled points: tetrazole, tetrahydrofuran, purine, pyrrolidine, pyrazole, morpholine, pyrimidine, piperidine, benzene, pyridine]
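The comparison plots each fragment's frequency rank in one collection against its rank in the other, so fragments off the diagonal are over- or under-represented. A minimal sketch of that ranking is below. The function names and the fragment counts are hypothetical, and the sign convention (a positive difference means the fragment ranks lower, so is less common, in the first collection) is an assumption made for this example.

```python
from typing import Dict


def rank_fragments(counts: Dict[str, int]) -> Dict[str, int]:
    """Rank fragments by descending count (rank 1 = most frequent).

    Ties are broken alphabetically so the result is deterministic.
    """
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_differences(counts_a: Dict[str, int], counts_b: Dict[str, int]) -> Dict[str, int]:
    """Return rank in collection A minus rank in collection B for shared fragments."""
    ranks_a = rank_fragments(counts_a)
    ranks_b = rank_fragments(counts_b)
    shared = ranks_a.keys() & ranks_b.keys()
    return {name: ranks_a[name] - ranks_b[name] for name in sorted(shared)}


if __name__ == "__main__":
    # Invented fragment counts, for illustration only.
    collection_a = {"benzene": 900, "pyridine": 400, "morpholine": 150, "tetrazole": 20}
    collection_b = {"benzene": 800, "pyridine": 300, "tetrazole": 120, "morpholine": 60}
    print(rank_differences(collection_a, collection_b))
```

Fragments with a difference of zero lie on the diagonal of the plot; the others fall into the enriched or depleted regions depending on which collection is taken as the reference.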
Genome-Scale Druggability Assessment
Nat. Rev. Drug. Disc., 8, pp. 900-907 (2008)
Nature 460, 352-358 (2009)
• Now possible to rapidly map chemical intervention points onto genomic data
• In ‘real time’ as gene model is developed
• Develop therapeutic hypotheses for expert review/analysis/validation
• Reuse existing drugs/clinical candidates in new contexts
• Anticipate required optimisation (comparative modelling, etc)
Indication Discovery
Marks et al., Lancet, 367, pp. 668-678 (2006)
• Map chemical biology/pharmacology data onto microarray datasets
• Rapid path to clinic and patient benefit
• Develop therapeutic hypotheses for expert review/analysis/validation
• Reuse existing drugs/clinical candidates in new contexts
The ChEMBL-og - www.chemblog.org