NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space

Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory,
NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
U.S. Government Chemical Databases and Open Chemistry
Small Molecule Databases
• since the early 2000s, the number of databases “publishing”
small molecules has grown enormously (e.g. PubChem,
ChemSpider, ChEMBL, DrugBank): what is the overlap, and how
many small molecules are there currently?
• ambiguities in the representation of small molecules (e.g.
tautomerism, salts, ionic resonance forms)
• growing number of chemical structure identifiers
(InChI/InChIKey, PubChem SID/CID, ChemSpider ID,
ChEBI ID, …)
Chemical Identifier Resolver
SMILES
SYBYL Line Notation
CAS Registry Number
chemical names
GIF image
SD File
ChemNavigator SID
CML
chemical structure
FDA UNII
NCI/CADD Identifiers
NSC number
InChI/InChIKey
MRV
PubChem SID/CID
ChemSpider ID
Chemical Formula
ChEBI ID
PDB Ligand ID
NCI/CADD Web Resources
Chemical Identifier Resolver
Works as a resolver for different chemical structure identifiers.
Allows one to convert a given structure identifier into another
representation or structure identifier.
• first beta release: July 2009
• current release (beta 4): April 2011
http://cactus.nci.nih.gov/chemical/structure
NCI/CADD Web Resources
Chemical Identifier Resolver
• it is usable through a simple URL API:
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
• XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
• example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas
returns 204255-11-8 (MIME type: text/plain)
• if a request is not resolvable: HTTP 404 status message
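The URL scheme above can be sketched as a small Python client. This is a minimal illustration, not the authors' code: the function names `build_url` and `resolve` are made up for this example, and percent-encoding the identifier is an assumption about how special characters should be passed.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation, xml=False):
    """Assemble a resolver URL from an identifier and a representation."""
    # Assumption: percent-encode the identifier so characters such as '#'
    # or '/' in SMILES strings and names do not break the URL path.
    parts = [BASE_URL, quote(identifier, safe=""), representation]
    if xml:
        parts.append("xml")
    return "/".join(parts)


def resolve(identifier, representation, timeout=10):
    """Return the plain-text answer, or None if the request is not resolvable."""
    url = build_url(identifier, representation)
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:  # the slide states that unresolvable requests give 404
            return None
        raise
```

Calling `resolve("Tamiflu", "cas")` would issue the request shown in the slide example; it needs network access and depends on the service being available.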
NCI/CADD Public Web Resources
Chemical Identifier Resolver
“identifier” (input):
• chemical names
• IUPAC names (by OPSIN)
• CAS numbers
• SMILES strings
• IUPAC InChI/InChIKeys
• NCI/CADD Identifiers
• CACTVS HASHISY
• NSC number
• PubChem SID
• ChemSpider ID
• ChemNavigator SID
• ZINC
• FDA UNII
resolver: http://cactus.nci.nih.gov/chemical/structure
“representation” (output):
• /smiles
• /names, /iupac_name
• /cas
• /inchi, /stdinchi
• /inchikey, /stdinchikey
• /ficts, /ficus, /uuuuu
• /image
• /file, /sdf
• /mw, /monoisotopic_mass
• /formula
• /twirl, /3d
• /urls
• /chemspider_id
• /pubchem_sid
• /chemnavigator_sid
NCI/CADD Web Resources
Chemical Identifier Resolver
[Figure: request processing workflow]
• http request (identifier, representation), followed by detection of the identifier type
• identifier is a full structure representation (e.g. SMILES, InChI): the structure is handed to CACTVS
• identifier is a hashed structure representation (e.g. InChIKey), trivial name, etc. (e.g. CAS number, chemical name): database lookup of the structure in the NCI/CADD Chemical Structure Database (CSDB)
• CACTVS: calculation of the requested structure representation
• http response (e.g. InChI, GIF image) with the corresponding MIME type
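The branching in this workflow can be sketched as a tiny dispatcher. The regular expressions below are illustrative guesses, not the resolver's actual detection rules, and SMILES detection is left out because it needs a real parser; `detect_type` and `route` are hypothetical names.

```python
import re

# Illustrative patterns only (assumption): the real detection logic is not
# shown on the slide.
PATTERNS = [
    ("inchikey", re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")),
    ("inchi", re.compile(r"^InChI=1S?/")),
    ("cas", re.compile(r"^\d{2,7}-\d{2}-\d$")),
]

# Identifier types that already encode the full structure.
FULL_STRUCTURE_TYPES = {"inchi"}


def detect_type(identifier):
    """Return the first matching identifier type, falling back to 'name'."""
    for type_name, pattern in PATTERNS:
        if pattern.match(identifier):
            return type_name
    return "name"


def route(identifier):
    """Decide whether the structure is computed directly or looked up."""
    type_name = detect_type(identifier)
    if type_name in FULL_STRUCTURE_TYPES:
        return type_name, "calculate directly"
    return type_name, "database lookup"
```

For instance, `route("204255-11-8")` takes the lookup branch, while an InChI string takes the direct branch.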
Chemical Identifier Resolver
Resolving Chemical Names
http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
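A response of this shape can be read with Python's standard XML parser. The sketch below embeds a copy of the slide's response as `SAMPLE`; the helper name `results_by_resolver` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Copy of the response shown on the slide (quotes normalized).
SAMPLE = """<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>"""


def results_by_resolver(xml_text):
    """Map each resolver name to the list of items it returned."""
    root = ET.fromstring(xml_text)
    return {
        data.get("resolver"): [item.text for item in data.findall("item")]
        for data in root.findall("data")
    }


results = results_by_resolver(SAMPLE)
# True when every resolver returned the same structure(s).
all_agree = len({tuple(items) for items in results.values()}) == 1
```

Comparing the per-resolver answers this way is one simple check on whether the name resolvers agree.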
Chemical Identifier Resolver
Chemical Structure Database (CSDB)
• ChemNavigator iResearch Library (~56%)
compilation of commercially available
screening compounds from ~330 international chemistry suppliers
• PubChem database (~38%)
including Open NCI database,
EPA DSSTox databases, NIAID HIV
databases, NIST Webbook,
NLM ChemIDplus, ChemSpider …
• Commercial Sources / others (~6%)
Asinex, Comgenex, eMolecules,
ChEMBL, …
currently:
• ~150 chemical structure databases
• ~120 million structure records
• ~81.6 million unique structures by NCI/CADD FICuS Identifier
• ~84 million unique structures by Std. InChIKey
NCI/CADD Structure Identifiers
FICTS, FICuS, uuuuu
Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
• based on hashcodes calculated by the chemoinformatics
toolkit CACTVS
[Figure: example structure with its CACTVS hashcode 9850FD9F9E2B4E25]
• CACTVS hashcodes:
  - represent a chemical structure uniquely as a
    16-digit hexadecimal number (64-bit unsigned)
  - high sensitivity to structural features of a compound
  - change if connectivity changes
Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
[Figure: identifier calculation workflow]
• original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB; read from SDF, SMILES, or a database)
• structure normalization
• parent structure
• hashcode calculation (E_HASHISY)
• NCI/CADD Identifier (FICTS, FICuS, uuuuu)
• calculation of a set of parent structures with different
sensitivity to chemical features
• representation of chemical structures on different levels
Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
sensitive (+) / not sensitive (-)
          Fragments  Isotopes  Charges  Tautomers  Stereo
FICTS         +         +         +         +         +
FICuS         +         +         +         -         +
uuuuu         -         -         -         -         -
[Figure: example structure (a sodium salt) with its three identifiers]
4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27
<CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
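The identifier layout above can be taken apart with a few lines of Python. This is a sketch under stated assumptions: the checksum algorithm is not given on the slide, so it is only carried along and not verified, and reading a lowercase `u` in the tag as "not sensitive to this feature" is an interpretation of the tag names. `parse_identifier` and `sensitivity` are hypothetical helper names.

```python
from typing import NamedTuple

# Feature order spelled out by the letters F, I, C, T, S.
FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereo")


class NciCaddIdentifier(NamedTuple):
    hashcode: int   # 64-bit unsigned value of the 16-digit hex hashcode
    tag: str        # FICTS, FICuS or uuuuu
    version: str
    checksum: str   # kept as text; its algorithm is not described here


def parse_identifier(text):
    """Split '<hashcode>-<tag>-<version>-<checksum>' into its fields."""
    hash_hex, tag, version, checksum = text.split("-")
    if len(hash_hex) != 16:
        raise ValueError("expected a 16-digit hexadecimal hashcode")
    return NciCaddIdentifier(int(hash_hex, 16), tag, version, checksum)


def sensitivity(tag):
    """Assumption: a lowercase 'u' marks a feature the identifier ignores."""
    return {feature: letter != "u" for feature, letter in zip(FEATURES, tag)}
```

For example, `parse_identifier("0E26B623DF7FAD30-FICuS-01-70").tag` gives `"FICuS"`, and `sensitivity("FICuS")` marks only tautomers as ignored.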
[Figure, shown five times: histidine together with alternative drawings of it (stereoisomers, tautomer, salt, charged form, isotope-labeled form, and drawing “errors”), first unlabeled and then labeled with each identifier type]
• FICTS: 7 distinct hashcodes across the drawings (E92E4BA2869F3611, 6C16DE2351F9FF50, 8A7AD1EB498CC76A, E5F83F10C5DB080A, A3DAE0788050DDE4, B2FDA68AEDA06DB9, 9850FD9F9E2B4E25)
• FICuS: 6 distinct hashcodes (6C16DE2351F9FF50 no longer appears; 9850FD9F9E2B4E25 takes its place)
• uuuuu: a single hashcode for all drawings (9850FD9F9E2B4E25)
• Std. InChIKey: 5 distinct keys (HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-CDYZYAPPSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M)
NCI/CADD Chemical Structure Database
Structure Normalization
[Figure, built up in five steps: original records grouped into FICTS, FICuS, and uuuuu parent structures]
• 119.8 million original structure records in CSDB
• 83.1 million FICTS parent structures
• 81.6 million FICuS parent structures (tautomer-invariant)
• 76.2 million uuuuu parent structures
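The effect of each normalization level can be put into percentages with simple arithmetic on the counts from the slides. The snippet below is only a worked calculation; the helper name `reduction_percent` is made up for this example.

```python
# Counts in millions, as given on the structure normalization slides.
COUNTS_MILLIONS = {
    "original records": 119.8,
    "FICTS": 83.1,
    "FICuS": 81.6,
    "uuuuu": 76.2,
}


def reduction_percent(before, after):
    """Percentage by which a count shrinks when going from before to after."""
    return round(100.0 * (before - after) / before, 1)


total = COUNTS_MILLIONS["original records"]
for level in ("FICTS", "FICuS", "uuuuu"):
    print(level, reduction_percent(total, COUNTS_MILLIONS[level]), "% fewer than original records")

# Share of FICTS parent structures that merge once tautomers are ignored.
tautomer_merge = reduction_percent(COUNTS_MILLIONS["FICTS"], COUNTS_MILLIONS["FICuS"])
```

On these numbers, going from FICTS to FICuS removes a little under 2% of the parent structures, which is the part of the collection that differs only by tautomeric form.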
Tautomer Analysis
How much “chemical space” is “just generated” by drawing tautomers?
NCI/CADD Chemical Structure Database
Tautomer Analysis
• CACTVS: generation of all formal tautomers for a given organic
compound (prototropic tautomerism)
• rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
• rule set is systematically applied to the original structure
(and all tautomers that have been generated in previous steps)
• tautomer generation is limited to 1000 SMIRKS transform
operations/structure
• all tautomers are ranked by a scoring function
• the highest ranked tautomer is defined as the
canonical tautomer
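The enumeration procedure described above can be sketched generically. This is not the CACTVS implementation: the transforms here are plain Python callables standing in for SMIRKS rules, structures are plain strings, and the scoring function is a placeholder. Only the control flow (repeated rule application, an operation budget, ranking) follows the bullets.

```python
from collections import deque


def enumerate_tautomers(start, transforms, max_operations=1000):
    """Apply every transform to the start structure and to all products.

    Returns (tautomers, exhaustive); exhaustive is False if the operation
    budget ran out while transforms were still pending.
    """
    seen = {start}
    queue = deque([start])
    operations = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if operations >= max_operations:
                return seen, False
            operations += 1
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical_tautomer(tautomers, score):
    """Pick the highest-scoring tautomer; ties are broken by the string itself."""
    return max(tautomers, key=lambda tautomer: (score(tautomer), tautomer))


# Toy stand-ins for two transform rules (hypothetical, for illustration only).
def keto_to_enol(structure):
    return ["enol"] if structure == "keto" else []


def enol_to_keto(structure):
    return ["keto"] if structure == "enol" else []


forms, exhaustive = enumerate_tautomers("keto", [keto_to_enol, enol_to_keto])
best = canonical_tautomer(forms, score=lambda form: 1 if form == "keto" else 0)
```

Returning the `exhaustive` flag alongside the set makes it possible to report how often the budget was hit, which is the kind of statistic the slides quote for non-exhaustive enumerations.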
NCI/CADD Chemical Structure Database
Tautomer Analysis
• 21 SMIRKS transform rules:
rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: ketene/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxime/nitroso
rule 17: oxime/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids
NCI/CADD Chemical Structure Database
Tautomer Analysis
starting from the set of FICuS parent structures,
we systematically generated all tautomers based on the
21 SMIRKS rule set available in CACTVS
• 70.6 million FICuS parent structures (2009 DB version)
• generated 680 million tautomers
• for 1.7% of the FICuS parent structures
the enumeration was not exhaustive
NCI/CADD Chemical Structure Database
Tautomer Analysis
[Figure: histogram of the tautomeric overlap within each individual database release (%); x-axis: overlap from 0.0 to 2.0%, y-axis: number of database releases (frequency)]
Databases labeled in the plot: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox database, Specs, Ambinter, BIND, BindingDB, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider
average: ~0.3% of original structure records
NCI/CADD Chemical Structure Database
Tautomer Analysis
[Figure: histogram of the occurrence of “tautomerism-critical” molecules within each individual database release (%); x-axis: bins from 0.5 to 24.5%, y-axis: number of database releases (frequency)]
• plotted value: percentage of FICuS parent structures in each database release
occurring somewhere in CSDB with a conflict
• average: ~9.5% of FICuS parent structures
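One way to compute such a per-database percentage is sketched below. It rests on an assumed reading of “conflict”: a FICuS parent structure counts as conflicted when it is linked to more than one FICTS identifier anywhere in the collection, i.e. different tautomers of it were registered. The record layout and the function name `conflict_rate` are hypothetical.

```python
from collections import defaultdict


def conflict_rate(records):
    """records: iterable of (database, ficus_id, ficts_id) tuples.

    Returns, per database, the percentage of its FICuS parent structures
    that map to more than one FICTS identifier somewhere in the collection.
    """
    ficts_by_ficus = defaultdict(set)
    ficus_by_database = defaultdict(set)
    for database, ficus_id, ficts_id in records:
        ficts_by_ficus[ficus_id].add(ficts_id)
        ficus_by_database[database].add(ficus_id)
    conflicted = {
        ficus_id for ficus_id, variants in ficts_by_ficus.items() if len(variants) > 1
    }
    return {
        database: 100.0 * len(parents & conflicted) / len(parents)
        for database, parents in ficus_by_database.items()
    }


# Toy data: P1 is drawn as two different tautomers in databases A and B.
TOY_RECORDS = [
    ("A", "P1", "T1a"),
    ("B", "P1", "T1b"),
    ("A", "P2", "T2"),
    ("B", "P2", "T2"),
    ("B", "P3", "T3"),
]
rates = conflict_rate(TOY_RECORDS)
```

Note that a structure counts as conflicted for a database even when the conflicting drawing lives in a different database, which matches the slide's wording "occurring somewhere in CSDB".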
Example for a Tautomer “Conflict”
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
[Figure: structure of HPMBP]
• HPMBP is used in liquid membranes
(selective removal of metal ions)
• selectivity and efficiency depend on
the tautomeric form of HPMBP,
which itself depends on the solvent and
the concentration of HPMBP
He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP.
1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947
[Figure, shown in four build steps: the tautomers of HPMBP generated by CACTVS, annotated with CAS Registry Numbers and with the databases in which each form occurs]
• CACTVS generates 7 tautomers, one of which is the canonical tautomer by CACTVS
• 5 tautomers have potential stereo centers on atoms or bonds (R/S, E/Z)
• 3 tautomers have CAS Registry Numbers assigned: 33064-14-1 (49 references), 4551-69-1 (859 references, no stereo), 127117-31-1 (3 references, (Z))
• occurrences in databases indexed in CSDB, per drawn form: 12 databases; 16 databases (no stereo), 3 databases (R), 2 databases (S); 6 databases; 1 database (no stereo)
• databases named on the slide: ACD 3D, ACX, Ambinter, BindingDB, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemNavigator, ChemSpider, DiscoveryGate, ECOTOX, EPA GCES, MLSMR, NCI Open Database, NIAID, NIST MS-Lib, NLM ChemIDplus, Scripps Screening Center, Sigma-Aldrich, Thomson Pharma, ZINC
Scaffold Analysis
NCI/CADD Chemical Structure Database
Scaffold Analysis
[Figure: an example molecule with its molecular scaffold tree (level 1, level 2, …), its simple scaffold, and its archetype scaffold]
• molecular scaffold tree: Schuffenhauer et al., J. Chem. Inf. Model. 2007, 47, 47-58
• simple scaffold: Bemis et al., J. Med. Chem. 1996, 39, 2887-2893
• archetype scaffold: Bemis et al., J. Med. Chem. 1996, 39, 2887-2893
• CSDB uuuuu compound set: 76.2 million structures
• molecular scaffold tree: 8.1 million scaffolds
• simple scaffold: 6.8 million scaffolds
• archetype scaffold: 0.8 million scaffolds
[Figure: number of unique scaffolds per hierarchy level; x-axis: hierarchy level (0 to 10), y-axes: number of unique scaffolds (in millions, 0 to 8.0) and number of unique structures (in millions, 0 to 80.0)]
Atom Neighborhoods
NCI/CADD Chemical Structure Database
Multilevel Neighborhoods of Atoms (MNA)
[Figure: example molecule with its atom descriptors]
MNA level 1
HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))
Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.
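A simplified level-1 neighborhood string can be generated from a molecular graph as sketched below. This is an illustration, not the published MNA algorithm: each descriptor is the atom's element followed by its neighbors' elements, with hydrogen first and the rest alphabetical, an ordering assumed here because it reproduces strings such as OHC and CHCN from the slide. Ring markers and level 2 are not handled.

```python
def mna_level1(elements, bonds):
    """Return one simplified level-1 descriptor per atom.

    elements: list of element symbols, hydrogens included explicitly.
    bonds: list of (atom_index, atom_index) pairs.
    """
    neighbors = {index: [] for index in range(len(elements))}
    for first, second in bonds:
        neighbors[first].append(elements[second])
        neighbors[second].append(elements[first])
    descriptors = []
    for index, element in enumerate(elements):
        # Assumed ordering: hydrogen first, then alphabetical by symbol.
        ordered = sorted(neighbors[index], key=lambda symbol: (symbol != "H", symbol))
        descriptors.append(element + "".join(ordered))
    return descriptors


# Formic acid, H-C(=O)-O-H, with explicit hydrogens (toy input).
ELEMENTS = ["C", "O", "O", "H", "H"]
BONDS = [(0, 1), (0, 2), (0, 3), (2, 4)]
formic_acid_mna = mna_level1(ELEMENTS, BONDS)
```

Collecting the distinct strings over a whole compound set is what yields a count of unique neighborhoods per level.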
• CSDB uuuuu compound set: 76.2 million structures
• unique MNAs, level 1: 13,426 (1.3 billion relationships, ~17 MNAs per uuuuu parent structure)
• unique MNAs, level 2: 918,516 (2.3 billion relationships, ~30 MNAs per uuuuu parent structure)
• surprising: 424,784 MNAs (level 2) are exclusive to a set of
1.3 million structures in ChemSpider
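The per-structure averages follow directly from the totals on the slide, as the short calculation below shows. Pairing 1.3 billion relationships with level 1 and 2.3 billion with level 2 is an assumption drawn from the order in which the numbers appear.

```python
STRUCTURES = 76.2e6  # uuuuu parent structures in CSDB

# Assumed pairing of relationship totals with MNA levels.
RELATIONSHIPS = {"level 1": 1.3e9, "level 2": 2.3e9}
UNIQUE_MNAS = {"level 1": 13_426, "level 2": 918_516}
CHEMSPIDER_EXCLUSIVE_LEVEL2 = 424_784


def mnas_per_structure(level):
    """Average number of MNA relationships per uuuuu parent structure."""
    return RELATIONSHIPS[level] / STRUCTURES


def chemspider_exclusive_share():
    """Percentage of unique level-2 MNAs found only in the ChemSpider subset."""
    return 100.0 * CHEMSPIDER_EXCLUSIVE_LEVEL2 / UNIQUE_MNAS["level 2"]


for level in RELATIONSHIPS:
    print(level, round(mnas_per_structure(level)), "MNAs per structure")
print(round(chemspider_exclusive_share(), 1), "% of level-2 MNAs are ChemSpider-exclusive")
```

Put as a share, nearly half of all unique level-2 neighborhoods come from a comparatively small ChemSpider subset, which is what makes the observation surprising.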
NCI/CADD Web Resources
Chemical Structure Web Services
[Figure: architecture of the Chemical Structure Web Services]
• the Chemical Identifier Resolver and other NCI/CADD web services are reached over http
• they communicate with external (web) services
• they build on CACTVS, the NCI/CADD Chemical Structure Database (CSDB),
and other software packages, e.g. OPSIN
NCI/CADD Web Resources
Chemical Identifier Resolver
avogadro.openmolecules.net/
Symyx Draw Resolver
http://www.akosgmbh.eu/globalsearch/index.htm
http://www.symyx.com/
webel.py - A Cinfony module
http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html
gChem
Virtual Molecular Model Kit
http://chemagic.com/web_molecules/script_page_large.aspx
CACTVS
http://www.xemistry.com
IUPHAR DATABASE
http://www.iuphar-db.org
Work in progress …
Chemical Structure Lookup Service II
Acknowledgments
Thanks to all database providers!
CADD Group, CBL, NCI
Igor Filippov
University of Cambridge
Daniel Lowe
Peter Murray-Rust
ChemNavigator
Scott Hutton
Tad Hurst
ChemSpider
Antony Williams
Valery Tkachenko
Noel O’Boyle (University College Cork, Ireland)
Richard Apodaca (Metamolecular)
Hans-Juergen Himmler
Our web site:
http://cactus.nci.nih.gov
NCI/CADD Web Resources
Chemical Identifier Resolver
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog
Acknowledgments - Software
CACTVS
ChemWriter
Peter Ertl
Python Web Framework
Python SQL library
Javascript library