Similarity Methods C371 Fall 2004 Limitations of Substructure Searching/3D Pharmacophore Searching • Need to know what you are looking for • Compound is either there.
Download ReportTranscript Similarity Methods C371 Fall 2004 Limitations of Substructure Searching/3D Pharmacophore Searching • Need to know what you are looking for • Compound is either there.
Similarity Methods C371 Fall 2004 Limitations of Substructure Searching/3D Pharmacophore Searching • Need to know what you are looking for • Compound is either there or not – Don’t get a feel for the relative ranking of the compounds • Output size can be a problem Similarity Searching • Look for compounds that are most similar to the query compound • Each compound in the database is ranked • In other application areas, the technique is known as pattern matching or signature analysis Similar Property Principle • Structurally similar molecules usually have similar properties, e.g., biological activity • Known also as “neighborhood behavior” • Examples: morphine, codeine, heroin • Define: in silico – Using computational techniques as a substitute for or complement to experimental methods Advantages of Similarity Searching • One known active compound becomes the search key • User sets the limits on output • Possible to re-cycle the top answers to find other possibilities • Subjective determination of the degree of similarity Applications of Similarity Searching • Evaluation of the uniqueness of proposed or newly synthesized compounds • Finding starting materials or intermediates in synthesis design • Handling of chemical reactions and mixtures • Finding the right chemicals for one’s needs, even if not sure what is needed. Subjective Nature of Similarity Searching • No hard and fast rules • Numerical descriptors are used to compare molecules • A similarity coefficient is defined to quantify the degree of similarity • Similarity and dissimilarity rankings can be different in principle Similarity and Dissimilarity “Consider two objects A and B, a is the number of features (characteristics) present in A and absent in B, b is the number of features absent in A and present in B, c is the number of features common to both objects, and d is the number of features absent from both objects. Thus, c and d measure the present and the absent matches, respectively, i.e., similarity; while a and b measure the corresponding mismatches, i.e., dissimilarity.” (Chemoinformatics; A Textbook (2003), p. 304) 2D Similarity Measures • Commonly based on “fingerprints,” binary vectors with 1 indicating the presence of the fragment and 0 the absence • Could relate structural keys, hashed fingerprints, or continuous data (e.g., topological indexes that take into acount size, degree of branching, and overall shape) Tanimoto Coefficient • Tanimoto Coefficient of similarity for Molecules A and B: SAB = c _ a+b–c a = bits set to 1 in A, b = bits set to 1 in B, c = number of 1 bits common to both Range is 0 to 1. Value of 1 does not mean the molecules are identical. Similarity Coefficients • Tanimoto coefficient is most widely used for binary fingerprints • Others: – Dice coefficient – Cosine similarity – Euclidean distance – Hamming distance – Soergel distance Distance Between Pairs of Molecules • Used to define dissimilarity of molecules • Regards a common absence of a feature as evidence of similarity When is a distance coefficient a metric? • Distance values must be zero or positive – Distance from an object to itself must be zero • Distance values must be symmetric • Distance values must obey the triangle inequality: DAB ≤ DAC + DBC • Distance between non-identical objects must be greater than zero. • Dissimilarity = distance in the ndimensional descriptor space Size Dependency of the Measures • Small molecules often have lower similarity values using Tanimoto • Tanimoto normalizes the degree of size in the denominator: SAB = c _ a+b–c Other 2D Descriptor Methods • Similarity can be based on continuous whole molecule properties, e.g. logP, molar refractivity, topological indexes. • Usual approach is to use a distance coefficient, such as Euclidean distance. Maximum Common Subgraph Similarity • Another approach: generate alignment between the molecules (mapping) • Define MCS: largest set of atoms and bonds in common between the two structures. • A Non-Polynomial- (NP)-complete problem: very computer intensive; in the worst case, the algorithm will have an exponential computational complexity • Tricks are used to cut down on the computer usage Maximum Common Subgraph Reduced Graph Similarity • A structure’s key features are condensed while retaining the connections between them • Cen ID structures with similar binding characteristics, but different underlying skeletons • Smaller number of nodes speeds up searching 3D Similarity • Aim is often to identify structurally different molecules • 3D methods require consideration of the conformational properties of molecules Tanimoto Coefficient to Find Compounds Similar to Morphine 3D: Alignment-Independent Methods • Descriptors: geometric atom pairs and their distances, valence and torsion angles, atom triplets • Consideration of conformational flexibility increases greatly the compute time • Relatively fewer pharmacophoric fingerprints than 2D fingerprints – Result: Low similarity values using Tanimoto Pharmacophore • A structural abstraction of the interactions between various functional group types in a compound • Described by a spatial representation of these groups as centers (or vertices) of geometrical polyhedra, together with pairwise distances between centers • http://www.ma.psu.edu/~csb15/pubs/searle.pdf 3D: Alignment Methods • Require consideration of the degrees of freedom related to the conformational flexibility of the molecules • Goal: determine the alignment where similarity measure is at a maximum 3D: Field-Based Alignment Methods • Consideration of the electron density of the molecules – Requires quantum mechanical calculation: costly – Property not sufficiently discriminatory 3D: Gnomonic Projection Methods • Molecule positioned at the center of a sphere and properties projected on the surface • Sphere approximated by a tessellated icosahedron or dodecahedron • Each triangular face is divided into a series of smaller triangles Finding the Optimal Alignment • Need a mechanism for exploring the orientational (and conformational) degrees of freedon for determining the optimal alignment where the similarity is maximized • Methods: simplex algorithm, Monte Carlo methods, genetic alrogithms Evaluation of Similarity Methods • Generally, 2D methods are more effective that 3D – 2D methods may be artificially enhanced because of database characteristics (close analogs) – Incomplete handling of conformational flexibility in 3D databases • Best to use data fusion techniques, combining methods For additional information . . . • See Dr. John Barnard’s lecture at: http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt