Similarity Methods
C371
Fall 2004
Limitations of Substructure Searching/3D
Pharmacophore Searching
• Need to know what you are looking for
• Compound is either there or not
– Don’t get a feel for the relative ranking of the
compounds
• Output size can be a problem
Similarity Searching
• Look for compounds that are most similar
to the query compound
• Each compound in the database is ranked
• In other application areas, the technique is
known as pattern matching or signature
analysis
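A minimal sketch of ranking a database against a query compound. The feature sets, the compound names, and the scoring function (shared features divided by all distinct features) are illustrative assumptions, not part of the original slides.

```python
def shared_fraction(features_a, features_b):
    """Shared features divided by all distinct features (assumed scoring function)."""
    total = len(features_a | features_b)
    return len(features_a & features_b) / total if total else 0.0


def rank_by_similarity(query, database, score=shared_fraction, top_n=None):
    """Score every compound against the query; return (name, score) pairs, best first."""
    scored = [(name, score(query, features)) for name, features in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]  # top_n=None keeps the full ranking


# Hypothetical feature sets; real fingerprints would come from a chemistry toolkit.
database = {
    "mol_1": {1, 2, 3, 4},
    "mol_2": {1, 2, 5},
    "mol_3": {7, 8},
}
query = {1, 2, 3}
ranking = rank_by_similarity(query, database)
print(ranking)
```

Because every compound receives a score, the user can cut the list wherever they like by passing `top_n`.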
Similar Property Principle
• Structurally similar molecules usually have
similar properties, e.g., biological activity
• Known also as “neighborhood behavior”
• Examples: morphine, codeine, heroin
• Define: in silico
– Using computational techniques as a
substitute for or complement to experimental
methods
Advantages of Similarity Searching
• One known active compound becomes the
search key
• User sets the limits on output
• Possible to re-cycle the top answers to
find other possibilities
• Subjective determination of the degree of
similarity
Applications of Similarity Searching
• Evaluation of the uniqueness of proposed
or newly synthesized compounds
• Finding starting materials or intermediates
in synthesis design
• Handling of chemical reactions and
mixtures
• Finding the right chemicals for one’s
needs, even if not sure what is needed.
Subjective Nature of Similarity
Searching
• No hard and fast rules
• Numerical descriptors are used to
compare molecules
• A similarity coefficient is defined to
quantify the degree of similarity
• Similarity and dissimilarity rankings can be
different in principle
Similarity and Dissimilarity
“Consider two objects A and B, a is the number of
features (characteristics) present in A and
absent in B, b is the number of features absent
in A and present in B, c is the number of features
common to both objects, and d is the number of
features absent from both objects. Thus, c and
d measure the present and the absent matches,
respectively, i.e., similarity; while a and b
measure the corresponding mismatches, i.e.,
dissimilarity.” (Chemoinformatics; A Textbook
(2003), p. 304)
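The four counts in the quotation can be tallied directly from two binary feature vectors. This is a small sketch; the function name and the example vectors are made up for illustration.

```python
def feature_counts(bits_a, bits_b):
    """Return (a, b, c, d) for two equal-length 0/1 vectors, as defined in the quotation."""
    if len(bits_a) != len(bits_b):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(bits_a, bits_b))
    a = sum(1 for x, y in pairs if x == 1 and y == 0)  # present in A, absent in B
    b = sum(1 for x, y in pairs if x == 0 and y == 1)  # absent in A, present in B
    c = sum(1 for x, y in pairs if x == 1 and y == 1)  # present in both
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # absent from both
    return a, b, c, d


# Hypothetical six-feature vectors.
A = [1, 1, 0, 1, 0, 0]
B = [1, 0, 1, 1, 0, 0]
counts = feature_counts(A, B)
print(counts)
```

The four counts always add up to the vector length, which is a quick sanity check on any implementation.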
2D Similarity Measures
• Commonly based on “fingerprints,” binary
vectors with 1 indicating the presence of
the fragment and 0 the absence
• Could relate structural keys, hashed
fingerprints, or continuous data (e.g.,
topological indexes that take into account
size, degree of branching, and overall
shape)
Tanimoto Coefficient
• Tanimoto Coefficient of similarity for
Molecules A and B:
$$S_{AB} = \frac{c}{a + b - c}$$
a = number of bits set to 1 in A, b = number of bits set to 1 in B, c =
number of 1 bits common to both
Range is 0 to 1.
A value of 1 means the fingerprints are identical; it does not mean the
molecules are identical.
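A short sketch of the Tanimoto calculation on binary fingerprints. Note that a and b here count all bits set in each molecule, which differs from the a and b of the quotation on the previous slide. The example bit vectors are invented, and returning 0.0 when neither fingerprint has a bit set is an assumed convention.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity c / (a + b - c) for two equal-length 0/1 vectors."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(bits_a)                                   # bits set in A
    b = sum(bits_b)                                   # bits set in B
    c = sum(x & y for x, y in zip(bits_a, bits_b))    # bits set in both
    denominator = a + b - c
    return c / denominator if denominator else 0.0    # assumed convention for empty input


# Hypothetical 8-bit fingerprints: a = 4, b = 4, c = 3, so S = 3 / 5.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]
print(tanimoto(fp_a, fp_b))
print(tanimoto(fp_a, fp_a))  # identical fingerprints give 1.0
```

Two different molecules that happen to set exactly the same bits would also score 1.0, which is why a value of 1 says nothing certain about molecular identity.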
Similarity Coefficients
• Tanimoto coefficient is most widely used
for binary fingerprints
• Others:
– Dice coefficient
– Cosine similarity
– Euclidean distance
– Hamming distance
– Soergel distance
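The listed coefficients can all be written in terms of a, b, and c for binary fingerprints. The sketch below uses the standard binary forms, with fingerprints represented as sets of "on" bit positions (an assumed representation) and invented example data.

```python
import math


def coefficients(fp_a, fp_b):
    """Binary-fingerprint forms of the listed coefficients (fingerprints as sets of on-bits)."""
    a, b = len(fp_a), len(fp_b)
    c = len(fp_a & fp_b)
    if a == 0 or b == 0:
        raise ValueError("both fingerprints need at least one bit set")
    mismatches = a + b - 2 * c  # bits set in exactly one of the two fingerprints
    return {
        "tanimoto": c / (a + b - c),
        "dice": 2 * c / (a + b),
        "cosine": c / math.sqrt(a * b),
        "hamming_distance": mismatches,
        "euclidean_distance": math.sqrt(mismatches),
        "soergel_distance": mismatches / (a + b - c),
    }


# Hypothetical fingerprints: a = 4, b = 4, c = 3.
values = coefficients({0, 1, 3, 6}, {0, 3, 4, 6})
for name, value in values.items():
    print(f"{name}: {value:.3f}")
```

For binary data the Soergel distance is the complement of the Tanimoto coefficient, so the two always sum to 1.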
Distance Between Pairs of
Molecules
• Used to define dissimilarity of molecules
• Regards a common absence of a feature
as evidence of similarity
When is a distance coefficient a
metric?
• Distance values must be zero or positive
– Distance from an object to itself must be zero
• Distance values must be symmetric
• Distance values must obey the triangle
inequality: $D_{AB} \le D_{AC} + D_{BC}$
• Distance between non-identical objects
must be greater than zero.
• Dissimilarity = distance in the n-dimensional descriptor space
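The metric conditions can be spot-checked numerically on a handful of objects. The sketch below is only an empirical check on a sample, not a proof. It compares the Soergel distance with one minus the Dice coefficient, using three invented feature sets; the function names are hypothetical.

```python
from itertools import product


def soergel(x, y):
    """Soergel distance for sets: 1 - |intersection| / |union|."""
    union = len(x | y)
    return 0.0 if union == 0 else 1.0 - len(x & y) / union


def one_minus_dice(x, y):
    """Complement of the Dice coefficient for sets."""
    total = len(x) + len(y)
    return 0.0 if total == 0 else 1.0 - 2.0 * len(x & y) / total


def looks_like_metric(objects, dist, tol=1e-12):
    """Check the four metric conditions on every pair and triple of the sample."""
    for x, y in product(objects, repeat=2):
        d = dist(x, y)
        if d < 0:                          # non-negativity
            return False
        if x == y and abs(d) > tol:        # distance to itself is zero
            return False
        if x != y and d <= tol:            # distinct objects are separated
            return False
        if abs(d - dist(y, x)) > tol:      # symmetry
            return False
    for x, y, z in product(objects, repeat=3):
        if dist(x, y) > dist(x, z) + dist(y, z) + tol:  # triangle inequality
            return False
    return True


sample = [frozenset({1}), frozenset({2}), frozenset({1, 2})]
print(looks_like_metric(sample, soergel))         # passes on this sample
print(looks_like_metric(sample, one_minus_dice))  # triangle inequality fails
```

On this sample the Dice complement gives a distance of 1 between {1} and {2} but only 1/3 from each to {1, 2}, so the direct route is longer than the detour.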
Size Dependency of the Measures
• Small molecules often have lower
similarity values using Tanimoto
• Tanimoto normalizes for molecular size
through the denominator:
$$S_{AB} = \frac{c}{a + b - c}$$
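A small numerical illustration of the size effect, using invented feature sets: a single differing feature costs far more similarity in a small fingerprint than in a large one.

```python
def tanimoto_sets(x, y):
    """Tanimoto similarity for fingerprints stored as non-empty sets of on-bits."""
    return len(x & y) / len(x | y)


# Each pair differs by exactly one feature in each molecule (hypothetical bit numbers).
small_a, small_b = {1, 2, 3}, {1, 2, 4}
large_a, large_b = set(range(30)), set(range(29)) | {99}

small_score = tanimoto_sets(small_a, small_b)  # 2 common bits out of 4 distinct
large_score = tanimoto_sets(large_a, large_b)  # 29 common bits out of 31 distinct
print(small_score, large_score)
```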
Other 2D Descriptor Methods
• Similarity can be based on continuous
whole molecule properties, e.g. logP,
molar refractivity, topological indexes.
• Usual approach is to use a distance
coefficient, such as Euclidean distance.
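A sketch of a Euclidean distance over whole-molecule descriptors. The descriptor values below are invented, and standardizing each descriptor before measuring distance is an assumption added here so that a descriptor with a large numeric range does not dominate.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Scale each descriptor column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


def euclidean(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical [logP, molar refractivity] values for three compounds.
names = ["mol_1", "mol_2", "mol_3"]
raw = [[1.2, 40.0], [1.3, 42.0], [4.5, 95.0]]
scaled = standardize(raw)

d12 = euclidean(scaled[0], scaled[1])
d13 = euclidean(scaled[0], scaled[2])
print(f"{names[0]}-{names[1]}: {d12:.3f}, {names[0]}-{names[2]}: {d13:.3f}")
```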
Maximum Common Subgraph
Similarity
• Another approach: generate alignment between
the molecules (mapping)
• Define MCS: largest set of atoms and bonds in
common between the two structures.
• An NP-complete problem (NP: nondeterministic
polynomial time): very computer intensive; in the
worst case, the algorithm has exponential
computational complexity
• Tricks are used to cut down on the computer
usage
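To make the cost concrete, here is a deliberately naive exhaustive search for tiny molecules. It is a sketch under several assumptions: only heavy atoms are listed, bond orders are ignored, the common subgraph is counted in atoms, and the matched atoms need not be connected. Practical programs use much smarter algorithms.

```python
from itertools import combinations, permutations


def mcs_size(labels1, bonds1, labels2, bonds2):
    """Atom count of the largest common induced subgraph, found by brute force."""
    edges1 = {frozenset(bond) for bond in bonds1}
    edges2 = {frozenset(bond) for bond in bonds2}
    n1, n2 = len(labels1), len(labels2)
    for size in range(min(n1, n2), 0, -1):              # try the largest size first
        for atoms1 in combinations(range(n1), size):
            for atoms2 in permutations(range(n2), size):
                # Mapped atoms must carry the same element label.
                if any(labels1[i] != labels2[j] for i, j in zip(atoms1, atoms2)):
                    continue
                mapping = dict(zip(atoms1, atoms2))
                # Every pair must be bonded in both molecules or in neither.
                if all(
                    (frozenset((i, k)) in edges1)
                    == (frozenset((mapping[i], mapping[k])) in edges2)
                    for i, k in combinations(atoms1, 2)
                ):
                    return size
    return 0


# Heavy-atom graphs: ethanol (C-C-O), 1-propanol (C-C-C-O), dimethyl ether (C-O-C).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
ether = (["C", "O", "C"], [(0, 1), (1, 2)])

print(mcs_size(*ethanol, *propanol))
print(mcs_size(*ether, *ethanol))
```

The nested loops over subsets and orderings grow factorially with the number of atoms, which is the exponential worst case the slide refers to.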
Maximum Common Subgraph
Reduced Graph Similarity
• A structure’s key features are condensed
while retaining the connections between
them
• Can identify structures with similar binding
characteristics, but different underlying
skeletons
• Smaller number of nodes speeds up
searching
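A sketch of the condensing step. The assignment of atoms to feature nodes ("ring", "linker", "acid") is supplied by hand here and is purely hypothetical; real reduced-graph schemes define their own node types and assignment rules.

```python
def reduce_graph(bonds, node_of_atom):
    """Collapse atoms into feature nodes, keeping only connections between different nodes."""
    reduced_edges = set()
    for i, j in bonds:
        node_i, node_j = node_of_atom[i], node_of_atom[j]
        if node_i != node_j:
            reduced_edges.add(frozenset((node_i, node_j)))
    return set(node_of_atom.values()), reduced_edges


# Molecule 1: six-membered ring, one-atom linker, two-atom acid group (hypothetical).
bonds_1 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (5, 6), (6, 7), (7, 8)]
nodes_1 = {0: "ring", 1: "ring", 2: "ring", 3: "ring", 4: "ring", 5: "ring",
           6: "linker", 7: "acid", 8: "acid"}

# Molecule 2: five-membered ring, two-atom linker, one-atom acid group (hypothetical).
bonds_2 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (4, 5), (5, 6), (6, 7)]
nodes_2 = {0: "ring", 1: "ring", 2: "ring", 3: "ring", 4: "ring",
           5: "linker", 6: "linker", 7: "acid"}

reduced_1 = reduce_graph(bonds_1, nodes_1)
reduced_2 = reduce_graph(bonds_2, nodes_2)
print(reduced_1 == reduced_2)
```

The two atom-level skeletons differ, yet both collapse to the same three-node graph, and a three-node comparison is far cheaper than a comparison over every atom.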
3D Similarity
• Aim is often to identify structurally different
molecules
• 3D methods require consideration of the
conformational properties of molecules
Tanimoto Coefficient to Find
Compounds Similar to Morphine
3D: Alignment-Independent
Methods
• Descriptors: geometric atom pairs and
their distances, valence and torsion
angles, atom triplets
• Consideration of conformational flexibility
greatly increases the compute time
• Relatively fewer pharmacophoric
fingerprints than 2D fingerprints
– Result: Low similarity values using Tanimoto
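A sketch of one alignment-independent descriptor: counts of atom pairs binned by interatomic distance. Because it uses only distances, it is unchanged by rotating or translating the molecule. The coordinates, the 1.0 bin width, and the function names are illustrative assumptions, and a single rigid conformation is used.

```python
import math
from collections import Counter
from itertools import combinations


def distance_fingerprint(atoms, bin_width=1.0):
    """Count atom pairs by (element, element, distance bin); atoms are (element, (x, y, z))."""
    counts = Counter()
    for (elem1, pos1), (elem2, pos2) in combinations(atoms, 2):
        first, second = sorted((elem1, elem2))
        distance_bin = int(math.dist(pos1, pos2) // bin_width)
        counts[(first, second, distance_bin)] += 1
    return counts


def rotate_z_and_shift(atoms, angle, shift):
    """Rotate about the z axis by angle (radians), then translate by shift."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    moved = []
    for element, (x, y, z) in atoms:
        new_pos = (cos_a * x - sin_a * y + shift[0],
                   sin_a * x + cos_a * y + shift[1],
                   z + shift[2])
        moved.append((element, new_pos))
    return moved


# Hypothetical three-atom geometry (arbitrary length units).
atoms = [("C", (0.0, 0.0, 0.0)), ("O", (1.4, 0.0, 0.0)), ("N", (0.0, 2.5, 0.3))]
original = distance_fingerprint(atoms)
moved = distance_fingerprint(rotate_z_and_shift(atoms, 1.0, (3.0, -2.0, 0.5)))
print(original == moved)
```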
Pharmacophore
• A structural abstraction of the interactions
between various functional group types in
a compound
• Described by a spatial representation of
these groups as centers (or vertices) of
geometrical polyhedra, together with
pairwise distances between centers
• http://www.ma.psu.edu/~csb15/pubs/searle.pdf
3D: Alignment Methods
• Require consideration of the degrees of
freedom related to the conformational
flexibility of the molecules
• Goal: determine the alignment where
similarity measure is at a maximum
3D: Field-Based Alignment
Methods
• Consideration of the electron density of
the molecules
– Requires quantum mechanical calculation:
costly
– Property not sufficiently discriminatory
3D: Gnomonic Projection Methods
• Molecule positioned at the center of a
sphere and properties projected on the
surface
• Sphere approximated by a tessellated
icosahedron or dodecahedron
• Each triangular face is divided into a
series of smaller triangles
Finding the Optimal Alignment
• Need a mechanism for exploring the
orientational (and conformational) degrees
of freedom to determine the optimal
alignment where the similarity is
maximized
• Methods: simplex algorithm, Monte Carlo
methods, genetic algorithms
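A heavily simplified Monte Carlo sketch of the search. The assumptions are substantial: molecules are rigid sets of 2D points, only rotation about the origin is explored, similarity is a Gaussian overlap normalized in the style of a Carbo index, and the search is plain random sampling with a fixed seed. Real methods also handle translation, 3D rotation, and conformers.

```python
import math
import random


def rotate(points, angle):
    """Rotate 2D points about the origin by angle (radians)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(cos_a * x - sin_a * y, sin_a * x + cos_a * y) for x, y in points]


def overlap(p, q):
    """Sum of Gaussian overlaps between every point in p and every point in q."""
    return sum(math.exp(-((px - qx) ** 2 + (py - qy) ** 2))
               for px, py in p for qx, qy in q)


def similarity(p, q):
    """Overlap normalized by the self-overlaps, so a perfect alignment scores 1."""
    return overlap(p, q) / math.sqrt(overlap(p, p) * overlap(q, q))


def monte_carlo_align(fixed, mobile, trials=2000, seed=0):
    """Sample random rotation angles and keep the one with the highest similarity."""
    rng = random.Random(seed)
    best_angle, best_score = 0.0, similarity(fixed, mobile)
    for _ in range(trials):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        score = similarity(fixed, rotate(mobile, angle))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score


# Hypothetical point set; the mobile copy starts rotated by 40 degrees.
fixed = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (-1.5, -0.5)]
mobile = rotate(fixed, math.radians(40))
start_score = similarity(fixed, mobile)
angle, score = monte_carlo_align(fixed, mobile)
print(f"start {start_score:.3f}, best {score:.3f} at {math.degrees(angle):.1f} degrees")
```

Each extra degree of freedom (translation, the other rotation axes, torsion angles) multiplies the space to be sampled, which is why smarter optimizers such as simplex or genetic algorithms are used in practice.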
Evaluation of Similarity Methods
• Generally, 2D methods are more effective
than 3D
– 2D methods may be artificially enhanced
because of database characteristics (close
analogs)
– Incomplete handling of conformational
flexibility in 3D databases
• Best to use data fusion techniques,
combining methods
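One simple way to combine methods is to add up the rank each compound receives from each method and re-sort. This rank-sum rule is only one of several possible fusion rules; the rankings and names below are invented, and every ranking is assumed to contain the same compounds.

```python
def fuse_by_rank_sum(*rankings):
    """Combine best-first rankings by summing each compound's rank positions."""
    totals = {}
    for ranking in rankings:
        for position, name in enumerate(ranking, start=1):
            totals[name] = totals.get(name, 0) + position
    # Lowest total rank first; ties are broken alphabetically.
    return sorted(totals, key=lambda name: (totals[name], name))


# Hypothetical best-first rankings from a 2D method and a 3D method.
ranking_2d = ["m1", "m2", "m3", "m4"]
ranking_3d = ["m3", "m1", "m4", "m2"]
fused = fuse_by_rank_sum(ranking_2d, ranking_3d)
print(fused)
```

A compound ranked well by both methods (here m1) rises to the top, while one favored by only a single method is pulled toward the middle.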
For additional information . . .
• See Dr. John Barnard’s lecture at:
http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt