Selecting Diverse Sets of Compounds
C371 Fall 2004
Review
• Similar Property Principle: If structurally similar compounds are likely to exhibit similar activity, then maximum coverage of the activity space should be achieved by selecting a structurally diverse set of compounds.

Techniques
• High-Throughput Screening (HTS)
• Combinatorial Chemistry
• Early attempts led to large libraries, but with little variability among the molecules created
• Need a way to identify subsets of compounds for synthesis, purchase, or testing

Chemical Diversity
• No unambiguous definition
• Need to quantify the degree of diversity of a subset of compounds
• Four main approaches:
  – Cluster analysis
  – Dissimilarity-based methods
  – Cell-based methods
  – Use of optimization techniques

CLUSTER ANALYSIS
• Aim is to divide a group into clusters so that objects within a cluster are similar, but objects in different clusters are dissimilar
• Many algorithms for doing this
  – Hierarchical methods seem to be better than non-hierarchical
• Sometimes called a “distance-based” approach to compound selection, because distance is measured between pairs of compounds

Key Steps in Cluster Analysis
• Generate descriptors for each compound
• Calculate the similarity or distance between all pairs of compounds
• Use a clustering algorithm to group the compounds
• Select a representative subset by taking one or more compounds from each cluster

“Distance”
• 1 - S, where S is the similarity coefficient
  – When molecules are represented by binary descriptors
• Euclidean distance
  – When molecules are represented by physicochemical properties

Characteristics of Clustering Methods
• Non-overlapping: each object is in one cluster only (most methods use this approach)
  – Hierarchical methods
  – Non-hierarchical methods
• Overlapping: an object can be in more than one cluster
• Efficiency and effectiveness issues: some approaches have very intensive computational requirements

Hierarchical Clustering
• Clusters increase in size, with each compound in its own cluster (a singleton) at one extreme and all compounds in one cluster at the other
  – Agglomerative methods start at the bottom and merge similar clusters
    • Ward’s method: clusters are formed to minimize the variance (i.e., the sum of the squared deviations from the mean)
    • Others: centroid method and the median method
  – Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data

Selecting the Appropriate Number of Clusters
• Need a cutoff level in the hierarchy at which the clusters are examined
  – Jaccard statistic of two clusters, C1 and C2:
      $$J(C_1, C_2) = \frac{a}{a + b + c}$$
    where a is the number of compounds found in both clusters, b is the number in cluster 1 but not cluster 2, and c is the number in cluster 2 but not cluster 1
  – Same form as the Tanimoto coefficient

Non-Hierarchical Clustering
• Compounds are clustered without forming a hierarchical relationship
• Methods:
  – Single-pass: assigns a compound to a cluster according to a cut-off value
    • Problem: results are not reproducible, i.e., they depend on the order in which the molecules are processed
  – Nearest neighbor: Jarvis-Patrick clustering
  – Relocation: K-means

DISSIMILARITY-BASED SELECTION METHODS
• Attempt to identify a diverse set of compounds directly
• Based on calculating distances or dissimilarities between compounds

Basic Algorithm for Dissimilarity-Based Selection Methods
• Decide on a desired size, n, of the final subset
• Select a compound and place it in the subset
• Calculate the dissimilarity between each of the remaining compounds and those in the subset
• Choose the next compound as the one most dissimilar to those in the subset
• If there are fewer than n compounds in the subset, repeat the dissimilarity calculation and selection until n is reached
• Complexity varies as the square of n

CELL-BASED METHODS
• Operate within a pre-defined low-dimensional chemistry space, not dependent on the particular set of molecules being examined
• Compounds are allocated to cells according to their molecular properties
• Methods are very fast, with a time complexity of O(N), but restricted to low-dimensional space
  – Good for very large data sets
  – Example properties: MW, logP, polarity, shape, hydrogen bonding, aromatic interactions

BCUT Descriptors
• Matrix representation of molecules
• Atomic properties used for the diagonal
  – Atomic charges, polarizabilities, hydrogen bonding
• Connectivity used for the off-diagonals
  – 2D graph or interatomic distances from 3D

Partitioning Using Pharmacophore Keys
• Each potential 3- or 4-point pharmacophore is considered to constitute a cell
• A given molecule could be in more than one cell
• Promiscuous molecules: those that contain a large number of pharmacophores, e.g., very flexible molecules

OPTIMIZATION METHODS
• Techniques for sampling large sets of molecules
• May want to spread the compounds evenly in space
• Techniques: Monte Carlo, simulated annealing
• Selective replacement

CONCLUSIONS
• Some research suggests that compounds with a Tanimoto similarity of at least 0.85 have a 30% to 80% chance of sharing the same biological activity
• No clear consensus on which screening approach is best
• Faster computer techniques (e.g., parallel computing) may help
• Descriptors used must be related to biological activity
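
Worked sketch: dissimilarity-based selection

The basic dissimilarity-based selection algorithm from the slides can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation. It rests on several assumptions that are not in the slides: fingerprints are stored as Python sets of on-bit indices, distance is 1 - S with S the Tanimoto coefficient, the first compound seeds the subset, and "most dissimilar to those in the subset" is read as the MaxMin rule (the candidate whose nearest subset member is farthest away). The names `tanimoto_distance` and `select_diverse` are hypothetical.

```python
def tanimoto_distance(fp_a, fp_b):
    """Return 1 - S, where S is the Tanimoto similarity of two bit sets."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - common / union


def select_diverse(fingerprints, n, seed_index=0):
    """Pick n compound indices using MaxMin dissimilarity-based selection."""
    if not 0 < n <= len(fingerprints):
        raise ValueError("n must be between 1 and the number of compounds")

    subset = [seed_index]
    # nearest[i] = distance from compound i to its closest subset member
    nearest = [tanimoto_distance(fp, fingerprints[seed_index])
               for fp in fingerprints]

    while len(subset) < n:
        # Only compounds not yet chosen are candidates.
        candidates = [i for i in range(len(fingerprints)) if i not in subset]
        # MaxMin: take the candidate farthest from its nearest subset member.
        best = max(candidates, key=lambda i: nearest[i])
        subset.append(best)
        # Update the nearest-member distances with the newly added compound.
        for i, fp in enumerate(fingerprints):
            nearest[i] = min(nearest[i],
                             tanimoto_distance(fp, fingerprints[best]))
    return subset


# Toy library of five made-up fingerprints (illustrative data only).
library = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {1, 2, 10}, {20, 21}]
print(select_diverse(library, 3))  # [0, 2, 4]
```

In the toy run, compounds 2 and 4 share no bits with the seed or with each other, so they are picked ahead of the near-duplicate compound 1. Other readings of "most dissimilar" (for example, the largest summed distance to the subset) would only change the line that chooses `best`.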