NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space
Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
U.S. Government Chemical Databases and Open Chemistry

Small Molecule Databases
• since the early 2000s, the number of databases “publishing” small molecules has grown enormously, e.g. PubChem, ChemSpider, ChEMBL, DrugBank
  – what is the overlap, and how many small molecules are there currently?
• ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms)
• growing number of chemical structure identifiers (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …)

Chemical Identifier Resolver
[Figure: a chemical structure surrounded by the identifier and representation types handled by the resolver: SMILES, SYBYL Line Notation, CAS Registry Number, chemical names, GIF image, SD File, ChemNavigator SID, CML, FDA UNII, NCI/CADD Identifiers, NSC number, InChI/InChIKey, MRV, PubChem SID/CID, ChemSpider ID, Chemical Formula, ChEBI ID, PDB Ligand ID]

NCI/CADD Web Resources: Chemical Identifier Resolver
• works as a resolver for different chemical structure identifiers
• allows one to convert a given structure identifier into another representation or structure identifier
• first beta release: July 2009
• current release (beta 4): April 2011
• http://cactus.nci.nih.gov/chemical/structure

NCI/CADD Web Resources: Chemical Identifier Resolver
• it is usable via a simple URL API:
    http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
  XML format:
    http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
• example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas
    response: 204255-11-8 (MIME type: text/plain)
• if a request is not resolvable: HTTP 404 status message

NCI/CADD Public Web Resources: Chemical Identifier Resolver
resolver: http://cactus.nci.nih.gov/chemical/structure
"identifier" (accepted input):
• chemical names
• IUPAC names (by OPSIN)
• CAS numbers
• SMILES strings
• IUPAC InChI/InChIKeys
• NCI/CADD Identifiers
• CACTVS HASHISY
• NSC number
• PubChem SID
• ChemSpider ID
• ChemNavigator SID
• ZINC
• FDA UNII
"representation" (requested output):
• /smiles
• /names, /iupac_name
• /cas
• /inchi, /stdinchi
• /inchikey, /stdinchikey
• /ficts, /ficus, /uuuuu
• /image
• /file, /sdf
• /mw, /monoisotopic_mass
• /formula
• /twirl, /3d
• /urls
• /chemspider_id
• /pubchem_sid
• /chemnavigator_sid

NCI/CADD Web Resources: Chemical Identifier Resolver
[Figure: request workflow]
• http request: carries the identifier and the requested representation
• detection of the identifier type
• identifier is a full structure representation (e.g. SMILES, InChI): the requested structure representation is calculated with CACTVS
• identifier is a hashed structure representation (e.g. InChIKey), trivial name etc. (e.g. CAS number, chemical name): the structure is obtained by a database lookup in the NCI/CADD Chemical Structure Database (CSDB), then the requested representation is calculated
• http response: e.g. InChI, GIF image, delivered with the matching MIME type
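
The URL scheme above can be exercised with a few lines of Python. This is a minimal sketch, not the resolver's official client: the function names are hypothetical, percent-encoding the identifier is an assumption, and the sample XML string only mirrors the request/data/item layout that the talk shows for name resolution.

```python
import xml.etree.ElementTree as ET
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation, as_xml=False):
    """Compose BASE_URL/"identifier"/"representation", optionally with /xml."""
    # Percent-encoding the identifier is an assumption, not taken from the slides.
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return f"{url}/xml" if as_xml else url


def resolve(identifier, representation, timeout=10):
    """Fetch the text/plain answer; return None if the request is not resolvable."""
    try:
        with urlopen(build_url(identifier, representation), timeout=timeout) as reply:
            return reply.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:  # the slides name HTTP 404 as the "not resolvable" status
            return None
        raise


def parse_xml_reply(xml_text):
    """Return (resolver, value) pairs from a <request><data><item> reply."""
    root = ET.fromstring(xml_text)
    return [
        (data.get("resolver"), item.text)
        for data in root.findall("data")
        for item in data.findall("item")
    ]


# Shortened sample reply in the layout shown in the talk (one <data> block only).
SAMPLE_REPLY = """<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>"""

print(build_url("Tamiflu", "cas"))
print(parse_xml_reply(SAMPLE_REPLY))
```

Calling resolve("Tamiflu", "cas") would issue the example request from the slide; it is left out of the demonstration lines because it needs network access.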

Chemical Identifier Resolver: Resolving Chemical Names
http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir

<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>

Chemical Identifier Resolver: Chemical Structure Database (CSDB)
• ChemNavigator iResearch Library: compilation of commercially available screening compounds from ~330 international chemistry suppliers
• PubChem: database including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider …
• Commercial sources / others: Asinex, Comgenex, eMolecules, ChEMBL, …
[Pie chart of CSDB sources] ChemNav. iResearch Lib.
~56%, PubChem ~38%, others ~6%
currently:
• ~150 chemical structure databases
• ~120 million structure records
• ~81.6 million unique structures by NCI/CADD FICuS Identifier
• ~84 million unique structures by Std. InChIKey

NCI/CADD Structure Identifiers: FICTS, FICuS, uuuuu

Unique Representation of Chemical Structures: NCI/CADD Structure Identifiers
• based on hashcodes calculated by the chemoinformatics toolkit CACTVS
  [Figure: structure drawing with its hashcode 9850FD9F9E2B4E25]
• CACTVS hashcodes:
  • represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
  • high sensitivity to structural features of a compound
  • change if connectivity changes

Unique Representation of Chemical Structures: NCI/CADD Structure Identifiers
[Figure: identifier calculation pipeline; format labels shown: Molfile, SDF, SMILES, ChemDraw cdx, PDB, database]
• original structure record
• structure normalization
• parent structure
• hashcode calculation (E_HASHISY)
• NCI/CADD Identifier (FICTS, FICuS, uuuuu)
• calculation of a set of parent structures with different sensitivity to chemical features
• representation of chemical structures on different levels

Unique Representation of Chemical Structures: NCI/CADD Structure Identifiers
Sensitivity of each identifier to chemical features (upper-case letter: sensitive; "u": not sensitive):

          Fragments      Isotopes       Charges        Tautomers      Stereo
FICTS     sensitive      sensitive      sensitive      sensitive      sensitive
FICuS     sensitive      sensitive      sensitive      not sensitive  sensitive
uuuuu     not sensitive  not sensitive  not sensitive  not sensitive  not sensitive

[Figure: example structure drawn as a sodium salt, with its three identifiers]
4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27
format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>

[Figure: histidine and eight related structure drawings: stereoisomers, tautomer, salt (Na+), charged form, isotope (15N), and “errors”]

FICTS identifiers of the nine drawings (in the order extracted from the slide; groups shown: stereoisomers, tautomer, salt, charged form, histidine, isotope, “errors”):
E92E4BA2869F3611-FICTS
6C16DE2351F9FF50-FICTS
8A7AD1EB498CC76A-FICTS
E5F83F10C5DB080A-FICTS
A3DAE0788050DDE4-FICTS
9850FD9F9E2B4E25-FICTS
E5F83F10C5DB080A-FICTS
B2FDA68AEDA06DB9-FICTS
9850FD9F9E2B4E25-FICTS

FICuS identifiers of the same nine drawings (same order):
E92E4BA2869F3611-FICuS
9850FD9F9E2B4E25-FICuS
8A7AD1EB498CC76A-FICuS
E5F83F10C5DB080A-FICuS
A3DAE0788050DDE4-FICuS
9850FD9F9E2B4E25-FICuS
E5F83F10C5DB080A-FICuS
B2FDA68AEDA06DB9-FICuS
9850FD9F9E2B4E25-FICuS
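
The identifier layout <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum> can be taken apart with a small parser. This is a sketch under assumptions: the field widths (two characters each for version and checksum) are inferred from the three example identifiers on the slide, the class and function names are hypothetical, and the checksum is only extracted, not verified, because the slides do not say how it is computed.

```python
import re
from typing import NamedTuple


class NciCaddIdentifier(NamedTuple):
    hashcode: int   # E_HASHISY value as a 64-bit unsigned integer
    tag: str        # FICTS, FICuS, or uuuuu
    version: str
    checksum: str   # kept as text; the checksum algorithm is not given in the slides


# Field widths are an assumption based on the examples shown in the talk.
_PATTERN = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$")


def parse_identifier(text):
    """Split a full NCI/CADD identifier string into its four fields."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized NCI/CADD identifier: {text!r}")
    hexcode, tag, version, checksum = match.groups()
    return NciCaddIdentifier(int(hexcode, 16), tag, version, checksum)


for example in (
    "4A122D094098B50D-FICTS-01-1D",
    "0E26B623DF7FAD30-FICuS-01-70",
    "9850FD9F9E2B4E25-uuuuu-01-27",
):
    parsed = parse_identifier(example)
    # Format the integer back to 16 hex digits to show the round trip.
    print(parsed.tag, f"{parsed.hashcode:016X}", parsed.version, parsed.checksum)
```

Storing the hashcode as an integer reflects the slide's description of it as a 64-bit unsigned number; keeping the tag separate makes it easy to compare the same record across the FICTS, FICuS, and uuuuu levels.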

uuuuu identifiers of the nine histidine-related drawings (stereoisomers, tautomer, salt, charged form, histidine, isotope, “errors”):
• eight drawings are labeled 9850FD9F9E2B4E25-uuuuu
• the ninth label reads 9850FD9F9E2B4E25-FICuS on the slide (same hashcode)

Std. InChIKeys of the same nine drawings (in the order extracted from the slide):
HNDVDQJCIGZPNO-RXMQYKEDSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-YFKPBYRVSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M
HNDVDQJCIGZPNO-CDYZYAPPSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N

NCI/CADD Chemical Structure Database: Structure Normalization
[Figure: stack of original records]
• 119.8 million original structure records in CSDB
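
Grouping original records under shared parent identifiers can be illustrated with the nine histidine-related drawings from the comparison slides. The sketch below simply counts distinct identifiers per identifier type; the values are copied from the slides (hashcodes without the tag suffix), while the variable and function names are hypothetical. For the uuuuu level only the hashcode is used, since all nine labels carry the same one.

```python
# Hashcodes / keys as they appear on the comparison slides, in slide order.
RECORD_IDS = {
    "FICTS": [
        "E92E4BA2869F3611", "6C16DE2351F9FF50", "8A7AD1EB498CC76A",
        "E5F83F10C5DB080A", "A3DAE0788050DDE4", "9850FD9F9E2B4E25",
        "E5F83F10C5DB080A", "B2FDA68AEDA06DB9", "9850FD9F9E2B4E25",
    ],
    "FICuS": [
        "E92E4BA2869F3611", "9850FD9F9E2B4E25", "8A7AD1EB498CC76A",
        "E5F83F10C5DB080A", "A3DAE0788050DDE4", "9850FD9F9E2B4E25",
        "E5F83F10C5DB080A", "B2FDA68AEDA06DB9", "9850FD9F9E2B4E25",
    ],
    "uuuuu": ["9850FD9F9E2B4E25"] * 9,
    "Std. InChIKey": [
        "HNDVDQJCIGZPNO-RXMQYKEDSA-N", "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
        "HNDVDQJCIGZPNO-YFKPBYRVSA-N", "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
        "UHPNKBYGGMJTIM-UHFFFAOYSA-M", "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    ],
}


def count_unique(ids_by_type):
    """Number of distinct identifiers per identifier type."""
    return {name: len(set(ids)) for name, ids in ids_by_type.items()}


UNIQUE_COUNTS = count_unique(RECORD_IDS)
for name, count in UNIQUE_COUNTS.items():
    print(f"{name}: {count} unique identifiers for {len(RECORD_IDS[name])} records")
```

For these nine records the counts come out as 7 (FICTS), 6 (FICuS), 1 (uuuuu), and 5 (Std. InChIKey), which is the same kind of collapse the database-wide normalization figures describe at much larger scale.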

NCI/CADD Chemical Structure Database: Structure Normalization
[Figure: original records grouped step by step under FICTS, FICuS, and uuuuu parent structures]
• 119.8 million original structure records in CSDB
• 83.1 million FICTS parent structures
• 81.6 million FICuS parent structures
• 76.2 million uuuuu parent structures

NCI/CADD Chemical Structure Database: Structure Normalization
[Figure: original records grouped under FICTS, FICuS, and uuuuu parent structures; the FICuS and uuuuu sets are marked “tautomer-invariant”]
• 119.8 million original structure records in CSDB
• 83.1 million FICTS parent structures
• 81.6 million FICuS parent structures (tautomer-invariant)
• 76.2 million uuuuu parent structures (tautomer-invariant)

Tautomer Analysis
How much “chemical space” is “just generated” by drawing tautomers?

NCI/CADD Chemical Structure Database: Tautomer Analysis
• CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
• rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
• the rule set is systematically applied to the original structure (and to all tautomers that have been generated in previous steps)
• tautomer generation is limited to 1000 SMIRKS transform operations per structure
• all tautomers are ranked by a scoring function
• the highest-ranked tautomer is defined as the canonical tautomer
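
The enumeration procedure listed above can be sketched as a bounded worklist search. This is not the CACTVS implementation: the structures are placeholder strings, the two "rules" are toy lookup tables standing in for SMIRKS transforms, and the scores are invented for illustration. Only the control flow (apply every rule to every known form, stop after a fixed number of transform operations, rank, take the top form) follows the slide.

```python
from collections import deque


def enumerate_tautomers(start, rules, score, max_ops=1000):
    """Apply every rule to the start form and to all forms found so far.

    Returns (ranked_forms, exhaustive). `exhaustive` is False when the
    operation limit was reached while rule applications were still pending.
    """
    seen = {start}
    queue = deque([start])
    ops = 0
    exhaustive = True
    while queue and exhaustive:
        current = queue.popleft()
        for rule in rules:
            if ops >= max_ops:
                exhaustive = False  # limit hit with work left to do
                break
            ops += 1
            for product in rule(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(seen, key=lambda form: (-score(form), form))
    return ranked, exhaustive


# Toy stand-ins for transform rules: each maps a form to the forms it can yield.
TOY_SHIFT_1 = {"form_a": ["form_b"], "form_b": ["form_a"]}
TOY_SHIFT_2 = {"form_b": ["form_c"], "form_c": ["form_b"]}
TOY_RULES = [lambda form: TOY_SHIFT_1.get(form, []),
             lambda form: TOY_SHIFT_2.get(form, [])]
TOY_SCORES = {"form_a": 1.0, "form_b": 3.0, "form_c": 2.0}

ranked, exhaustive = enumerate_tautomers("form_a", TOY_RULES, TOY_SCORES.get)
canonical = ranked[0]  # the highest-ranked form plays the role of the canonical tautomer
print(ranked, exhaustive, canonical)
```

Returning the `exhaustive` flag alongside the ranking makes it possible to report how many structures hit the operation limit, which is the kind of figure the talk quotes for non-exhaustive enumerations.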

NCI/CADD Chemical Structure Database: Tautomer Analysis
• 21 SMIRKS transform rules:
  rule 1: 1.3 (thio)keto/(thio)enol
  rule 2: 1.5 (thio)keto/(thio)enol
  rule 3: simple (aliphatic) imine
  rule 4: special imine
  rule 5: 1.3 aromatic heteroatom H shift
  rule 6: 1.3 heteroatom H shift
  rule 7: 1.5 (aromatic) heteroatom H shift (1)
  rule 8: 1.5 aromatic heteroatom H shift (2)
  rule 9: 1.7 (aromatic) heteroatom H shift
  rule 10: 1.9 (aromatic) heteroatom H shift
  rule 11: 1.11 (aromatic) heteroatom H shift
  rule 12: furanones
  rule 13: ketene/ynol exchange
  rule 14: ionic nitro/aci-nitro
  rule 15: pentavalent nitro/aci-nitro
  rule 16: oxime/nitroso
  rule 17: oxime/nitroso via phenol
  rule 18: cyanic/iso-cyanic acids
  rule 19: formamidinesulfinic acids
  rule 20: isocyanides
  rule 21: phosphonic acids

NCI/CADD Chemical Structure Database: Tautomer Analysis
• starting from the set of FICuS parent structures, we systematically generated all tautomers based on the set of 21 SMIRKS rules available in CACTVS
• 70.6 million FICuS parent structures (2009 DB version)
• 680 million tautomers generated
• for 1.7% of the FICuS parent structures the enumeration was not exhaustive

NCI/CADD Chemical Structure Database: Tautomer Analysis
[Figure: histogram; x-axis: tautomeric overlap within each individual database release (%), 0.0 to 2.0; y-axis: number of database releases (frequency), 0 to 90]
• average: ~0.3% of original structure records

NCI/CADD Chemical Structure Database: Tautomer Analysis
[Figure: histogram; x-axis: tautomeric overlap within each individual database release (%), 0.0 to 2.0; y-axis: number of database releases (frequency), 0 to 90; bars annotated with database names: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox database, Specs, Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider]
• average: ~0.3% of original structure records

NCI/CADD Chemical Structure Database: Tautomer Analysis
[Figure: histogram; x-axis: occurrence of “tautomerism-critical” molecules within each individual database release (%), 0.5 to 24.5; y-axis: number of database releases (frequency), 0 to 30]
• average: ~9.5% of FICuS parent structures
• measured as the percentage of FICuS parent structures in each database release that occur somewhere in CSDB with a conflict

Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
[Figure: structure drawing of HPMBP]
• HPMBP is used in liquid membranes (selective removal of metal ions)
• selectivity and efficiency depend on the tautomeric form of HPMBP, which itself depends on the solvent and the concentration of HPMBP
He, D.; Li Z.; Ma M.; Huang J.; Yang Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947

Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
[Figure: tautomer drawings of HPMBP with potential stereo centers marked R/S and E/Z; one drawing is labeled “canonical tautomer by CACTVS”]
• CACTVS generates 7 tautomers
• 5 tautomers have a potential stereo center on atoms or bonds

Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
[Figure: the same tautomer drawings, annotated with CAS Registry Numbers]
• 3 tautomers have CAS Registry Numbers assigned:
  33064-14-1: 49 references
  4551-69-1: 859 references (no stereo)
  127117-31-1: 3 references (Z)

Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
[Figure: the same tautomer drawings, annotated with occurrences in databases indexed in CSDB]
• 12 databases
• 16 databases (no stereo), 3 databases (R), 2 databases (S)
• 6 databases
• 1 database (no stereo)

Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
[Figure: the tautomer drawings annotated with the names of the databases in which each form occurs; the name lists below are matched to the counts by their length]
• 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
• 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
• 3 databases (R): ChemSpider, ECOTOX, ZINC
• 2 databases (S): ChemSpider, ZINC
• 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
• 1 database (no stereo): ChemDB

Scaffold Analysis

NCI/CADD Chemical Structure Database: Scaffold Analysis
[Figure: example molecule with the scaffolds derived from it]
• molecular scaffold tree (level 1, level 2): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
• simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
• archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893

NCI/CADD Chemical Structure Database: Scaffold Analysis
• CSDB uuuuu compound set: 76.2 million

NCI/CADD Chemical Structure Database: Scaffold Analysis
CSDB uuuuu compound set (76.2 million) reduced to:
• molecular scaffold tree (level 1, level 2, …): 8.1 million scaffolds
• simple scaffold: 6.8 million scaffolds
• archetype scaffold: 0.8 million scaffolds

NCI/CADD Chemical Structure Database: Scaffold Analysis
[Figure: number of unique scaffolds per hierarchy level of the molecular scaffold tree (8.1 million scaffolds from the 76.2 million uuuuu compound set); x-axis: Hierarchy Level, 0 to 10; y-axes: Number of Unique Scaffolds (in millions), 0 to 8.0, and Number of unique structures (in millions), 0 to 80.0]

Atom Neighborhoods

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
[Figure: example molecule with its MNA descriptors]
MNA level 1:
  HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2:
  C(C(CC-H)C(CC-C)-H(C))
  C(C(CC-H)C(CN-H)-H(C))
  C(C(CC-H)C(CN-H)-C(C-O-O))
  C(C(CC-H)N(CC)-H(C))
  C(C(CC-C)N(CC)-H(C))
  N(C(CN-H)C(CN-H))
  -H(C(CC-H))
  -H(C(CN-H))
  -H(-O(-H-C))
  -C(C(CC-C)-O(-H-C)-O(-C))
  -O(-H(-O)-C(C-O-O))
  -O(-C(C-O-O))
Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.
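
The level-1 descriptors listed above can be reproduced with a few lines of code if each descriptor is read as the atom's element symbol followed by the symbols of its direct neighbors. This is a simplified sketch with stated assumptions: the example graph is nicotinic acid, chosen because it is consistent with the slide's level-1 list (the structure drawing itself did not survive extraction); the neighbor ordering by atomic number is inferred from strings such as CHCN and OHC; and the function name is hypothetical. Bond orders and the chain marks used at level 2 are ignored.

```python
# Ordering of neighbor symbols; atomic-number order is an assumption
# that happens to match the descriptor strings shown on the slide.
ELEMENT_RANK = {"H": 1, "C": 6, "N": 7, "O": 8}


def mna_level1(atoms, bonds):
    """Return {atom index: level-1 descriptor} for a molecular graph.

    atoms: {index: element symbol}; bonds: iterable of (index, index) pairs.
    """
    neighbors = {index: [] for index in atoms}
    for a, b in bonds:
        neighbors[a].append(atoms[b])
        neighbors[b].append(atoms[a])
    return {
        index: atoms[index]
        + "".join(sorted(neighbors[index], key=ELEMENT_RANK.__getitem__))
        for index in atoms
    }


# Assumed example: nicotinic acid with explicit hydrogens.
ATOMS = {
    1: "N", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C",   # pyridine ring
    7: "C", 8: "O", 9: "O",                            # carboxylic acid group
    10: "H", 11: "H", 12: "H", 13: "H", 14: "H",
}
BONDS = [
    (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1),    # ring bonds
    (3, 7), (7, 8), (7, 9),                            # acid group
    (2, 10), (4, 11), (5, 12), (6, 13), (9, 14),       # hydrogens
]

DESCRIPTORS = mna_level1(ATOMS, BONDS)
UNIQUE_LEVEL1 = sorted(set(DESCRIPTORS.values()))
print(UNIQUE_LEVEL1)
```

Under these assumptions the 14 atoms yield nine distinct level-1 descriptors, the same nine strings as in the slide's list. Counting distinct strings over a whole compound set is what produces database-wide totals of unique MNAs.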

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
CSDB uuuuu compound set: 76.2 million

Unique MNAs:
           unique MNAs   relationships   MNAs per uuuuu parent structure
level 1    13,426        1.3 billion     ~17
level 2    918,516       2.3 billion     ~30

surprising: 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider

NCI/CADD Web Resources: Chemical Structure Web Services
[Figure: architecture diagram; components shown: Chemical Identifier Resolver, further NCI/CADD web services, Chemical Structure Web Services (accessed via http), CACTVS, NCI/CADD Chemical Structure Database (CSDB), external (web) services, other software packages (e.g. OPSIN)]

NCI/CADD Web Resources: Chemical Identifier Resolver
[Slide showing software and sites alongside the resolver]
• avogadro.openmolecules.net/
• Symyx Draw Resolver: http://www.akosgmbh.eu/globalsearch/index.htm, http://www.symyx.com/
• webel.py - A Cinfony module: http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html
• gChem Virtual Molecular Model Kit: http://chemagic.com/web_molecules/script_page_large.aspx
• CACTVS: http://www.xemistry.com
• IUPHAR DATABASE: http://www.iuphar-db.org

Work in progress …
Chemical Structure Lookup Service II

Acknowledgments
Thanks to all database providers!
• CADD Group, CBL, NCI: Igor Filippov
• University of Cambridge: Daniel Lowe, Peter Murray-Rust
• ChemNavigator: Scott Hutton, Tad Hurst
• ChemSpider: Antony Williams, Valery Tkachenko
• Noel O'Boyle (University College Cork, Ireland)
• Richard Apodaca (Metamolecular)
• Hans-Juergen Himmler
Our web site: http://cactus.nci.nih.gov

NCI/CADD Web Resources: Chemical Identifier Resolver
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog

Acknowledgments - Software
• CACTVS
• ChemWriter
• Peter Ertl
• Python Web Framework
• Python SQL library
• Javascript library