SMILES 2
C371 Lecture
Based on Dr. David Wild’s C571
Presentations
Fall 2004
Linear Notations
• Represent the atoms, bonds, and connectivity as
a linear text string
• SMILES
– Concise
– Originally designed for manual command line entry
into text-only systems
– Now widely used
• Can be input to a spreadsheet cell, on one line
of a text file, or in an Oracle database text field
• System to generate canonical form of SMILES
Review of SMILES
• Atoms represented by normal chemical
symbols (uppercase for aliphatics,
lowercase for aromatic)
• Adjacent atoms imply single bonds
• Use = for double, # for triple bonds
• Hydrogens usually implicit
• Parentheses imply branching
• Ring closure indicated by numbers
SMILES Review (cont’d)
• Can make Hydrogens explicit
• Non-organic atoms are put in square
brackets, e.g., [Xe]
• Charged species also in square brackets
with a + or -, e.g., [Na+] or [O-]
• Unknown atoms indicated by a *
• Tetrahedral stereochemistry represented by @
(anticlockwise) and @@ (clockwise)
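The review rules can be made concrete with a rough tokenizer sketch in Python. It assumes only the features listed on these slides (bracket atoms, the common organic-subset atoms, bonds, branches, ring closures, "." and "*"); it is a simplification for illustration, not a full SMILES grammar, and the names TOKEN and tokenize are made up here.

```python
import re

# One alternative per SMILES feature from the review slides.
TOKEN = re.compile(
    r"(\[[^\]]+\]"    # bracket atom, e.g. [Na+], [nH], [Xe]
    r"|Br|Cl"         # two-letter organic-subset atoms
    r"|[BCNOPSFI]"    # one-letter aliphatic atoms (uppercase)
    r"|[bcnops]"      # aromatic atoms (lowercase)
    r"|\*"            # unknown atom
    r"|[-=#:/\\]"     # bond symbols
    r"|[()]"          # branch open / close
    r"|%\d{2}|\d"     # ring-closure numbers
    r"|\.)"           # separator between disconnected parts
)


def tokenize(smiles):
    """Split a SMILES string into tokens; reject characters not covered above."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognized characters in SMILES: %r" % smiles)
    return tokens


# Tyrosine: every token here happens to be a single character.
print(tokenize("NC(Cc1ccc(O)cc1)C(=O)O"))
# Charged species stay together as one bracket-atom token.
print(tokenize("[Na+].[O-]C"))
```

Trying longer alternatives first (brackets, then Cl and Br) keeps "Cl" from being read as a carbon followed by an unknown letter.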
SMILES for Tyrosine
NC(Cc1ccc(O)cc1)C(=O)O
[Figure: structure diagram of tyrosine]
SMILES for Acetaminophen
(Tylenol)
CC(=O)Nc1ccc(O)cc1
SMILES for Isatin
O=c2[nH]c1ccccc1c2=O
[Figure: structure diagram of isatin]
Canonicalizing SMILES –
Morgan Algorithm
• Each atom starts with a connectivity value: how
many (non-hydrogen) atoms it is connected to
• That value is replaced by the sum of the
connectivity values of its neighbors
• This is repeated until the number of
different values no longer increases
• Atoms are numbered in decreasing order
of connectivity value
– In case of a tie, other properties are used
(e.g. atomic number, bond order, etc).
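The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the molecule is given as a dictionary mapping each heavy atom to a list of its neighbors; the function name morgan_values is made up here, and the tie-breaking step is left out.

```python
def morgan_values(adjacency):
    """Return Morgan extended-connectivity values for {atom: [neighbors]}."""
    # Initial value: the number of neighbors of each atom.
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # Replace each value by the sum of the neighbors' current values.
        new_values = {atom: sum(values[n] for n in nbrs)
                      for atom, nbrs in adjacency.items()}
        new_classes = len(set(new_values.values()))
        # Stop once another round no longer gives more distinct values,
        # and keep the values from the previous round.
        if new_classes <= n_classes:
            return values
        values, n_classes = new_values, new_classes


# n-Pentane as a chain of five carbons: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
values = morgan_values(pentane)
# Number atoms in decreasing order of value; ties are left in input order
# here, where the real algorithm would consult other atom properties.
order = sorted(values, key=values.get, reverse=True)
print(values, order)
```

For pentane the initial values distinguish only end atoms from chain atoms; one round of summing separates the central carbon as well, and a second round would merge classes again, so the iteration stops there.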
Canonicalizing SMILES –
CANGEN
• Two-stage procedure used by Daylight
• First stage, CANON, generates a canonical
connection table using a modified version
of the Morgan Algorithm that produces a
tree structure
• Second stage, GENES, creates a unique
SMILES using a depth-first search of the
molecular graph tree output by CANON
• More information: J. Chem. Inf. Comput. Sci. 1989, 29, 97-101
Representing reactions
CH4 + 2O2 → CO2 + 2H2O
• Need to identify the 2D arrangement of
products and reagents and distinguish between them
– Possibly record which starting material atoms
map to which product atoms
• Other information (e.g., yield, equilibrium
constants, conditions) generally stored
separately
• Not all reactions specified stoichiometrically
Simple Reaction SMILES
CH4 + 2O2 → CO2 + 2H2O
Reaction SMILES: C.O=O>>O=C=O.O
• Each reagent and product represented as
SMILES
• Reagents on the left of a “>>”; products on
the right
• Individual reagents and products are
separated by a “.”
Reaction SMILES example
Reaction SMILES: C.O=O>O=[O+]-[O-]>O=C=O.O
• Agents specified between the two “>” characters
Reaction SMILES example
Reaction SMILES: C(=O)Cl.NC>>C(=O)NC.Cl
• Note implicit hydrogens
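The layout of a reaction SMILES can be illustrated with a short Python sketch that splits the string into its three roles. It is a simplified illustration: the function name is made up here, and splitting on "." assumes every dot separates two distinct species, which does not hold for salts written as one dot-separated entry.

```python
def parse_reaction_smiles(rxn):
    """Split 'reactants>agents>products' into three lists of SMILES strings."""
    fields = rxn.split(">")
    if len(fields) != 3:
        raise ValueError("a reaction SMILES needs exactly two '>' characters")
    # An empty field (as in 'A>>B') means there are no molecules in that role.
    return tuple(field.split(".") if field else [] for field in fields)


reactants, agents, products = parse_reaction_smiles("C.O=O>O=[O+]-[O-]>O=C=O.O")
print(reactants)  # left of the first '>'
print(agents)     # between the two '>'
print(products)   # right of the second '>'
```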
Atom-mapping SMIRKS
representation
SMIRKS:
[C:1](=[O:2])[Cl:3].[H:99][N:4]([H:100])[C:0]>>[C:1](=[O:2])[N:4]([H:100])[C:0].[Cl:3][H:99]
• Each reactant atom gets a tag (e.g., “C”
becomes “[C:1]”) which maps to the atom with the same
tag in the products.
• Hydrogens are explicit
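One quick consistency check on a mapped reaction is that every tag on the reactant side reappears on the product side. The Python sketch below does this with a regular expression; the helper names are made up here, and the check looks only at the tag numbers, not at the chemistry.

```python
import re

# An atom map tag is ':<digits>' immediately before the closing bracket.
MAP_TAG = re.compile(r":(\d+)\]")


def atom_maps(side):
    """Return the sorted atom-map numbers found on one side of a reaction."""
    return sorted(int(tag) for tag in MAP_TAG.findall(side))


def maps_balanced(smirks):
    """True if reactants and products carry exactly the same map numbers."""
    reactants, _agents, products = smirks.split(">")
    return atom_maps(reactants) == atom_maps(products)


smirks = ("[C:1](=[O:2])[Cl:3].[H:99][N:4]([H:100])[C:0]"
          ">>[C:1](=[O:2])[N:4]([H:100])[C:0].[Cl:3][H:99]")
print(atom_maps(smirks.split(">")[0]))
print(maps_balanced(smirks))
```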
Daylight RS/SMIRKS Sites
• Basic reaction representation (Reaction
SMILES)
– http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
• SMIRKS introduction
– http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html
• SMIRKS theory
– http://www.daylight.com/dayhtml/doc/theory/theory.rxn.html
• SMIRKS depicter
– http://www.daylight.com/daycgi_tutorials/react.cgi
Representing generic structures
• A generic structure is one which, by
ambiguity, represents a (possibly infinite)
set of possible structures
• Ambiguity usually takes the form of “R”
groups
• Originally used for representing patents
• Now used for representing combinatorial
libraries too
• Also known as Markush Structures
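To make the idea of one generic structure standing for a set of specific structures concrete, here is a small Python sketch that enumerates a library from a template. The {R1}/{R2} placeholder syntax is simply Python string formatting chosen for this illustration, not a standard Markush notation, and the template and R-group lists are hypothetical.

```python
from itertools import product


def enumerate_library(template, r_groups):
    """Yield one SMILES per combination of R-group choices."""
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        # Substitute each chosen fragment into its placeholder.
        yield template.format(**dict(zip(names, combo)))


# Hypothetical generic structure: an amino acid with a variable side chain
# (R1) and an optional methyl ester (R2; the empty string leaves the acid).
template = "NC({R1})C(=O)O{R2}"
r_groups = {"R1": ["C", "CO"], "R2": ["", "C"]}
library = list(enumerate_library(template, r_groups))
print(library)
```

Plain string substitution only works when the fragments contain no ring-closure numbers that clash with the template, which is one reason real systems store generic structures in dedicated formats.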
Specifying a substructure query
with SMARTS
• SMARTS: a superset of SMILES extended
to allow partial structures (substructures)
and optional parts of molecules to be
represented
• Simple example
*C(=O)O
where the * matches any single atom, i.e. the point
where the group attaches to the rest of the molecule
• More information:
– http://www.daylight.com/meetings/summerschool01/course/basics/smarts.html
– http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
SMARTS special characters
(examples)
*    Any atom
a    Aromatic atom
A    Aliphatic atom
R    Ring atom
rn   Atom in ring of size n
Hn   n attached hydrogens
Xn   n total connections

~    Any bond
:    Aromatic bond
@    Any ring bond

&    Logical AND
;    Logical AND (low prec.)
,    Logical OR
!    Logical NOT
SMARTS examples
[!C;R]     Any atom in a ring that is not aliphatic Carbon
[O;H1]     Hydroxyl group (-OH)
c:c        Two carbons joined by an aromatic bond
C~N        Carbon and nitrogen attached by any bond
*C(=O)O    Carboxyl Group
Try out a SMARTS search
• DepictMatch:
– http://www.daylight.com/cgi-bin/contrib/depictmatch.cgi
• Enter a set of SMILES and a SMARTS, and any
part of each SMILES structure that matches the SMARTS
is highlighted
• As an example, we’ll use the sample dataset
described on the following two slides, with
*C(=O)O (carboxyl group) and
[R]C(=O)O (carboxyl attached to a ring atom) as our SMARTS
Sample dataset
Acetaminophen
Chlorpromazine
Alprenolol
Amphetamine
Captopril
Diclofenac
Gabapentin
Salicylate
Sample Dataset SMILES file
• CC(=O)Nc1ccc(O)cc1 Acetaminophen
• CC(C)NCC(O)COc1ccccc1CC=C Alprenolol
• CC(N)Cc1ccccc1 Amphetamine
• CC(CS)C(=O)N1CCCC1C(=O)O Captopril
• CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13 Chlorpromazine
• OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl Diclofenac
• NCC1(CC(=O)O)CCCCC1 Gabapentin
• COC(=O)c1ccccc1O Salicylate
Web / Oracle Systems
• Advantages
– Single database for structures and data
– No software to install on client machines (except
maybe plug-ins like Chime)
– Not dependent on (expensive) contract with MDL
– Highly customizable
• Disadvantages
– Requires extensive web-based interface software to
be written, for registration, searching, etc
– Company will have to maintain system internally
– Requires current ISIS system to be abandoned
Chemistry Cartridges
• Daylight DayCart
– http://www.daylight.com/products/daycart.html
• Tripos Auspyx
– http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/auspyx.html
• Accelrys Accord for Oracle
– http://www.accelrys.com/accord/oracle.html
• MDL Direct
– http://www.mdl.com/products/framework/rel_chemistry_server/index.jsp
• IDBS ActivityBase
– http://www.id-bs.com/products/abase/
• JChem Cartridge
– http://www.jchem.com
Example - DayCart
• Store SMILES as string (VARCHAR2) in Oracle
database
• Cartridge provides extra functions and
extensions to functions for searching based on
chemical structures
• Structure search implemented by EXACT
function
• Substructure search implemented by MATCHES
function
• Similarity search implemented by TANIMOTO
and EUCLID functions
Measuring similarity between
molecules
• Similar Property Principle: “Molecules with
similar structure are likely to have similar
biological activity”
• Generally the Tanimoto Coefficient or
Euclidean Distance between fingerprints is
used
Fingerprint Similarity – Tanimoto
• Also known as Jaccard Coefficient
• ‘1’s in common / ‘1’s set in either fingerprint
• 0’s are treated as not significant
• Similarity is between 0 (dissimilar) and 1 (same)
• Good cutoff for likely biologically similar molecules
is 0.7 or 0.8
Tanimoto Similarity = c / (#a + #b - c)
c = ‘1’s in common
#a = ‘1’s in fingerprint A
#b = ‘1’s in fingerprint B
• Example:
A 101101011
B 011101101
c = 4
#a = 6
#b = 6
Tanimoto Similarity = 4 / (6 + 6 - 4) = 0.5
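The worked example can be checked with a few lines of Python. This sketch assumes fingerprints are given as strings of '0' and '1' characters of equal length; returning 0.0 when neither fingerprint has any bit set is a convention chosen here, not something the slides specify.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity c / (#a + #b - c) for two bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")  # '1's in fingerprint A
    b = fp_b.count("1")  # '1's in fingerprint B
    # c counts positions where both fingerprints have a '1'.
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    if a + b - c == 0:
        return 0.0  # no bits set anywhere; convention assumed for this sketch
    return c / (a + b - c)


print(tanimoto("101101011", "011101101"))  # the slide example: 4 / 8
```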
Fingerprint similarity –
Euclidean
• Pythagorean distance
• For binary dimensions, equivalent to the square
root of the Hamming distance (i.e. square root of
the number of bits that are different)
• 0’s are treated as significant
• Smaller values mean more similar
• Example:
A          101101011
B          011101101
Different? xx    xx
Euclidean distance = sqrt(4) = 2.0
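The same pair of fingerprints can be run through a short Python sketch of the binary Euclidean distance. As before, fingerprints are assumed to be equal-length strings of '0' and '1', and the function name is chosen for this illustration.

```python
import math


def euclidean(fp_a, fp_b):
    """Euclidean distance between two bit strings of equal length."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # For binary values each differing bit adds exactly 1 to the squared
    # distance, so the sum below is the Hamming distance.
    hamming = sum(1 for x, y in zip(fp_a, fp_b) if x != y)
    return math.sqrt(hamming)


print(euclidean("101101011", "011101101"))  # four differing bits
```

Unlike the Tanimoto coefficient, a position where both fingerprints are 0 counts as agreement here, which is what "0’s are treated as significant" means in practice.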
Oracle table Test for sample
dataset
Smiles                            Name            LogP
------                            ----            ----
CC(=O)Nc1ccc(O)cc1                Acetaminophen   0.27
CC(C)NCC(O)COc1ccccc1CC=C         Alprenolol      2.81
CC(N)Cc1ccccc1                    Amphetamine     1.76
CC(CS)C(=O)N1CCCC1C(=O)O          Captopril       0.84
CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13  Chlorpromazine  5.20
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
COC(=O)c1ccccc1O                  Salicylate      2.60
DayCart structure search using
SQL
select * from Test where
exact(Smiles, 'CC(N)Cc1ccccc1') = 1;
Smiles          Name         LogP
------          ----         ----
CC(N)Cc1ccccc1  Amphetamine  1.76
DayCart substructure search
select * from Test where
matches(Smiles, '*C(=O)O') = 1;
Smiles                          Name        LogP
------                          ----        ----
CC(CS)C(=O)N1CCCC1C(=O)O        Captopril   0.84
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl  Diclofenac  4.02
NCC1(CC(=O)O)CCCCC1             Gabapentin  -1.37
COC(=O)c1ccccc1O                Salicylate  2.60
Substructure search for
carboxylic acid
[Figure: structures of the eight sample compounds: Acetaminophen, Chlorpromazine, Alprenolol, Amphetamine, Captopril, Diclofenac, Gabapentin, Salicylate]
DayCart substructure / value
search
select * from Test where
(matches(Smiles, '*C(=O)O') = 1)
AND (LogP > 1.0);
Smiles                          Name        LogP
------                          ----        ----
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl  Diclofenac  4.02
COC(=O)c1ccccc1O                Salicylate  2.60
DayCart similarity search
Query structure: Aspirin, CC(=O)Oc1ccccc1C(=O)O
select * from TEST where
tanimoto(SMILES, 'CC(=O)Oc1ccccc1C(=O)O') > 0.6;
SMILES              NAME           LOGP
------              ----           ----
COC(=O)c1ccccc1O    Salicylate     2.60
CC(=O)Nc1ccc(O)cc1  Acetaminophen  0.27
CC(N)Cc1ccccc1      Amphetamine    1.76
Similarity search for carboxylic
acid
[Figure: structures of the eight sample compounds: Acetaminophen, Alprenolol, Amphetamine, Captopril, Chlorpromazine, Diclofenac, Gabapentin, Salicylate]
More examples of DayCart
http://www.daylight.com/meetings/summerschool02/course/admin/daycart_hints.html