Transcript Slide 1
ChEMBL – Large-Scale Open Access Data for Drug Discovery
John Overington, EMBL-EBI
[email protected]

Private to Public Domain Transfer
• Five year strategic award from Wellcome Trust
• Large-scale Drug Discovery Structure Activity Relationship (SAR) data
• Linking small molecule structures to ‘targets’ and pharmacological activities
  – Chemogenomics/Chemical Biology
• ‘Open Access’, ‘User Friendly’, ‘Translational’, ‘Free’
• Multiple access mechanisms
  – Full database download, web front-ends, web services
• Actively support ad hoc sabbaticals (academic and commercial) at EMBL-EBI

ChEMBL Research Strategy
• Comprehensively catalogue historical drug discovery
  – Include successes and failures
  – Drugs can be small molecules, recombinant proteins, siRNA, etc.
• Derive rules for drug discovery ‘success’ from these data
  – Target selection and prioritisation
  – Lead discovery, optimisation, candidate selection

Drug Discovery Process (simplified)
• Target Discovery: Target identification; Microarray profiling; Target validation; Assay development; Biochemistry; Clinical/Animal disease models
• Lead Discovery: High-throughput Screening (HTS); Fragment-based screening; Focused libraries; Screening collection
• Lead Optimisation: Medicinal Chemistry; Structure-based drug design; Selectivity screens; ADMET screens; Cellular/Animal disease models; Pharmacokinetics
• Preclinical Development: Toxicology; In vivo safety pharmacology; Formulation; Dose prediction
• Clinical Trials: Phase 1 (PK, tolerability); Phase 2 (Efficacy); Phase 3 (Safety & Efficacy)
• Launch; Indication Discovery & expansion
• Discovery: Med. Chem. SAR, >450,000 distinct compounds, ~25,000 distinct lead series
• Development: Clinical Candidates, ~12,000 candidates
• Use: Drugs, ~1,300 drugs

ChEMBL: Launched Drugs
• Database of all approved drugs
• Chemistry and sequence ‘aware’
• Contents
  – Small molecules and biological therapeutics
  – USANs, INNs, research codes, other synonyms
  – Pharmaceutical properties, prodrugs, dosage, form, etc.
  – PK data and metabolites, black box warnings, etc.
• 1,378 chemically distinct ‘drugs’, 324 distinct molecular targets
• Controlled vocabulary indications dictionary and hierarchy

New Drugs 2006-2009
[Figure: new drugs by type. Categories: Enzyme, mAb, Peptide, Other Protein, Synthetic small molecule, Natural Product.]

ChEMBL: Launched Drugs
Nat. Rev. Drug Disc., 5, pp. 993-996 (2006)

ChEMBL: Drug Dosage
[Figure: histogram of drug doses. x-axis: binned log10 mole dose, -8.4 to -2.32; y-axis ticks: 0 to 80. Annotations: nmol, µmol and mmol dose regions; peak at ~150-200 µmol; ‘Steroids, thyroids’ at the low-dose end; ‘Metformin, Hydroxyurea’ at the high-dose end.]

Affinity Of Drugs For Their Targets
• Retrieved Ki, Kd, IC50, EC50, pA2, … endpoints for drugs against their ‘efficacy targets’
[Figure: histogram. x-axis: -log10 affinity, 2 to 12 (10 mM to 1 pM); y-axis: Frequency, 0 to 400.]

Function for Drug Efficacy/Affinity
• Empirical function that estimates the probability of in vivo activity for a compound with acceptable PK characteristics as a function of target affinity
[Figure: P(efficacy), 0.0 to 1.0, vs -log10 Affinity, 0 to 12, with mM, µM, nM and pM regions marked.]

ChEMBL: Clinical Candidates
• Database of clinical development candidates
• Contains ~10,000 2-D structures
• Estimated size ~35-45,000 compounds
• Work in progress
• Deeper coverage of key gene families
  – e.g. Protein kinases, 184 distinct clinical candidates
[Figure: clinical candidates by target. Labelled targets: VEGFR, PDGFR, p38a, C-Kit, Aurora, ErbB, CDK; y-axis ticks: 0 to 90.]
[Figure: kinase clinical candidates by highest phase. Categories: Launched, III, II, I.]

Industry Productivity
[Figure: file registration number (0 to 800,000) vs USAN date (1960 to 2010).]

Industry Productivity
[Figure: bar chart by file registration number range, in bins of 100,000 from 1 to 800,000; y-axis ticks: 0 to 70. Annotated values: 64 USANs/100,000 compounds and 16 Drugs/100,000 compounds; 1.9 USANs/100,000 compounds and 0.4 Drugs/100,000 compounds.]
USAN assignment typically at entry to phase 3

ChEMBL: SAR data
• Bioactive compounds
• Link through to validated synthetic routes and assay protocols
• Bidirectionally linking compounds to/from targets
• Built from 12 primary journals
  – J.Med.Chem., Bioorg.Med.Chem., PNAS, JBC, Bioorg.Med.Chem.Letts., Eur.J.Med.Chem., DMD, Xenobiotica, Nature, Science, AACR, J.Nat.Prod.
• StARlite 1 – June 2001
• StARlite 31 – August 2008

[Figure: StARlite links a Compound (2-D structure) to a Target (sequence) through a Bioactivity, here Ki = 4.5 nM.]
>Thrombin (Homo sapiens)
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRK
GNLERECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEG
NCAEGLGTNYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRN
PDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQ
CVPDRGQQYQGRLAVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNP
DGDEEGVWCYVAGKPGDFGYCDLNYCEEAVEEETGDGLDEDSDRAIEGRTATS
EYQTFFNPRTFGSGEADCGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAE
IGMSPWQVMLFRKSPQELLCGASLISDRWVLTAAHCLLYPPWDKNFTENDLLV
RIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLKKPVAFSDY
IHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQVVNLPI
VERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWY
QMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE

Drug Optimisation
[Figure: imidazole and triazole drugs, drawn as structures and grouped as prototype and 1st to 4th generation. After W. Sneader.]
• Prototype: Azomycin (1956), Streptomyces natural product, trichomonacidal, ‘toxic’
• Drugs shown, with year: Metronidazole 1962, Tinidazole 1970, Clotrimazole 1970, Miconazole 1970, Econazole 1972, Ketoconazole 1978, Terconazole 1980, Sulconazole 1980, Bifonazole 1981, Itraconazole 1984, Fluconazole 1988, Voriconazole 2002, Fosfluconazole 2004, Posaconazole 2005

ChEMBL SAR Contents
• Abstracted from 26,299 papers from 12 journals
• Monthly update cycle - optimised curation pipeline
• Autocuration tools – clean up and index other large SAR datasets
• Updates and ongoing curation process all data, not simply new article data
• 521,237 compound records
• 440,055 distinct compound structures
• 5,439 targets
• 3,512 protein molecular targets
• ~2,200 orthologous targets (1,644 human)
• 1,936,969 experimental bioactivities
Counts refer to StARlite release 31

Interface and Searching

Rule-based Optimisation – Bioisosteres
• Identify data-driven ‘rational’ lead-optimisation strategies
• Useful in automated design
• e.g. Replacement of carboxylic acid
• Reflect synthetic ease and expectation for functional effect
[Figure: workflow. Search StARlite for functional group (carboxylic acid); search for all ‘contexts’ where acid has been replaced; retrieve assay value (ΔIC50).]
[Figure: histogram. x-axis: effect on affinity (-log10 IC50), -6 to 6; y-axis: Frequency (%), 0 to 60. Replacements shown: sulphonamide, tetrazole, sulphonic acid, ester.]

Typical Compound Collection - Novartis
[Figure: ring fragments drawn as structures. Ertl, Koch and Roggo, Novartis.]
benzene, pyridine, piperidine, piperazine, cyclohexane, imidazole, naphthalene, thiophene, quinoline, cyclopropane, benzimidazole, benzthiazole, benzodioxole, pyran, quinazoline, triazole, adamantane, pyrrolidine, thiazole, pyrrole, imidazoline, isoxazole, tetrahydroisoquinoline, benzofuran, pyrazole, furan, indole, morpholine, pyrimidine, cyclopentane, purine, tetrazole, triazine, tetrahydrofuran, isoquinoline

Screening File Comparison - Novartis
[Figure: scatter plot of Novartis rank vs StARlite rank, both 0 to 35, with regions marked ‘Depleted fragments’ and ‘Enriched fragments’. Labelled fragments: tetrazole, tetrahydrofuran, purine, pyrrolidine, pyrazole, morpholine, pyrimidine, piperidine, benzene, pyridine.]

Genome-Scale Druggability Assessment
Nat. Rev. Drug. Disc., 8, pp. 900-907 (2008)
Nature 460, 352-358 (2009)
• Now possible to rapidly map chemical intervention points onto genomic data
  – In ‘real time’ as gene model is developed
• Develop therapeutic hypotheses for expert review/analysis/validation
  – Reuse existing drugs/clinical candidates in new contexts
  – Anticipate required optimisation (comparative modelling, etc.)

Indication Discovery
Marks et al., Lancet, 367, pp. 668-678 (2006)
• Map chemical biology/pharmacology data onto microarray datasets
  – Rapid path to clinic and patient benefit
• Develop therapeutic hypotheses for expert review/analysis/validation
  – Reuse existing drugs/clinical candidates in new contexts

The ChEMBL-og - www.chemblog.org
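
Worked example: the -log10 affinity scale
The affinity and bioisostere slides both plot values on a -log10 scale. The following is a minimal Python sketch of that conversion, assuming affinities are supplied in nanomolar. The function names `p_affinity` and `delta_p` are illustrative assumptions and are not part of ChEMBL or StARlite.

```python
import math


def p_affinity(value_nm: float) -> float:
    """Return -log10 of an affinity (Ki, Kd, IC50, ...) given in nanomolar.

    The value is first converted to molar, so 1 nM maps to 9 and 1 uM to 6.
    """
    if value_nm <= 0:
        raise ValueError("affinity must be a positive concentration")
    return -math.log10(value_nm * 1e-9)


def delta_p(before_nm: float, after_nm: float) -> float:
    """Change in -log10 affinity when one group replaces another.

    Positive values mean the replacement improved affinity (hypothetical
    helper for the kind of matched comparison shown on the bioisostere slide).
    """
    return p_affinity(after_nm) - p_affinity(before_nm)


# Thrombin example from the SAR data slide: Ki = 4.5 nM, about 8.35 on this scale.
thrombin_pki = p_affinity(4.5)
print(round(thrombin_pki, 2))
```

On this scale, a replacement that takes an IC50 from 100 nM to 10 nM gives `delta_p(100.0, 10.0)` of about 1, that is, one log unit of improvement.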