Selecting Diverse Sets of Compounds
C371 Fall 2004


Review
• Similar Property Principle: If structurally
similar compounds are likely to exhibit
similar activity, then maximum coverage of
the activity space should be achieved by
selecting a structurally diverse set of
compounds
• High-Throughput Screening (HTS)
• Combinatorial Chemistry
• Early attempts led to large libraries, but
little variability in the molecules created
• Need a way to identify subsets of
compounds for synthesis, purchase, or
testing
Chemical Diversity
• No unambiguous definition
• Need to quantify the degree of diversity of
a subset of compounds
• Four main approaches:
– Cluster analysis
– Dissimilarity-based methods
– Cell-based methods
– Use of optimization techniques
Cluster Analysis
• Aim is to divide a group of objects into clusters so that
objects within a cluster are similar, while objects in
different clusters are dissimilar
• Many algorithms for doing this
– Hierarchical methods seem to perform better than non-hierarchical ones
• Sometimes called a “distance-based” approach
to compound selection, because distance is
measured between pairs of compounds
Key Steps in Cluster Analysis
• Generate descriptors for each compound
• Calculate the similarity or distance
between all compounds
• Use a clustering algorithm to group the
compounds
• Select a representative subset by taking
one or more compounds from each cluster
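The last step above, choosing representatives, can be sketched as follows. This is a minimal illustration, not a method the slides prescribe: it assumes cluster labels and a distance matrix are already available, and it picks the medoid (the member with the smallest total distance to the rest of its cluster). The names `pick_representatives`, `labels`, and `distance` are hypothetical.

```python
from collections import defaultdict

def pick_representatives(labels, distance):
    """Return {cluster id: index of the medoid compound}.

    labels[i] is the cluster id of compound i; distance is a symmetric
    matrix of precomputed distances (both are assumed inputs).
    """
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    representatives = {}
    for label, members in clusters.items():
        # The medoid minimizes the summed distance to the other members.
        representatives[label] = min(
            members, key=lambda i: sum(distance[i][j] for j in members)
        )
    return representatives

# Made-up example: five compounds in two clusters.
labels = [0, 0, 0, 1, 1]
distance = [
    [0.0, 0.2, 0.4, 0.9, 0.8],
    [0.2, 0.0, 0.2, 0.8, 0.9],
    [0.4, 0.2, 0.0, 0.9, 0.9],
    [0.9, 0.8, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
print(pick_representatives(labels, distance))
```

Taking the medoid is only one option; picking a random member or several members per cluster fits the same step.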
• 1 - S, where S is the similarity coefficient
– When molecules are represented by binary
descriptors (e.g., fingerprints)
• Euclidean distance
– When molecules are represented by
physicochemical properties
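The two distance measures above can be sketched in a few lines. The example assumes the Tanimoto coefficient as the similarity S and represents a binary fingerprint as the set of its "on" bit positions. The function names and sample values are illustrative only.

```python
import math

def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    union = len(bits_a | bits_b)
    # Treating two empty fingerprints as identical is a convention chosen here.
    return len(bits_a & bits_b) / union if union else 1.0

def tanimoto_distance(bits_a, bits_b):
    """Distance defined as 1 - S."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)

def euclidean_distance(props_a, props_b):
    """Euclidean distance between two property vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(props_a, props_b)))

fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
print(tanimoto_distance(fp1, fp2))   # 2 shared bits out of 5 distinct bits
print(euclidean_distance((3.0, 0.0), (0.0, 4.0)))
```

In practice physicochemical properties are often scaled before Euclidean distances are computed, so that no single property dominates. That step is omitted here.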
Characteristics of Clustering
• Non-overlapping: each object in one
cluster only (Most use this approach)
– Hierarchical methods
– Non-hierarchical methods
• Overlapping: object can be in more than
one cluster
• Efficiency and effectiveness issues: some
approaches have very intensive
computational requirements
Hierarchical Clustering
• Clusters increase in size up the hierarchy, with each compound in
its own cluster (a singleton) at one extreme and all compounds in
a single cluster at the other
– Agglomerative methods start at the bottom and merge
similar clusters
• Ward’s method: clusters are formed to minimize the variance
(i.e., the sum of the squared deviations from the mean)
• Others: centroid method and the median method
– Divisive hierarchical clustering starts with all
compounds in a single cluster and partitions the data
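A naive sketch of agglomerative clustering with Ward's criterion follows. It starts from singletons and repeatedly merges the pair of clusters whose merger gives the smallest increase in within-cluster sum of squares. It assumes compounds are points in a small property space. The function names are hypothetical, and the brute-force search is for illustration rather than for large data sets.

```python
def centroid(points):
    """Mean position of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def ward_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the clusters merge."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    squared = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(cluster_a), len(cluster_b)
    return na * nb / (na + nb) * squared

def ward_clustering(points, n_clusters):
    """Merge singletons bottom-up until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up two-property descriptors for five compounds.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0)]
for cluster in ward_clustering(points, 3):
    print(cluster)
```

A divisive method would run in the opposite direction, starting from one cluster holding all points and splitting it.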
Selecting the Appropriate Number
of Clusters
• Need a cutoff value at which you are going to
examine the molecules
– Jaccard statistic of two clusters, C1 and C2:
$J(C_1, C_2) = \dfrac{a}{a + b + c}$
where a is the number of compounds found in both
clusters, b is the number found in C1 but not C2,
and c is the number found in C2 but not C1
– Same as the Tanimoto coefficient
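As a small worked example, the Jaccard statistic can be computed directly from cluster memberships. The compound identifiers are made up and the function name `jaccard` is illustrative.

```python
def jaccard(cluster_1, cluster_2):
    """Jaccard statistic a / (a + b + c) for two clusters of compound ids."""
    c1, c2 = set(cluster_1), set(cluster_2)
    a = len(c1 & c2)   # compounds in both clusters
    b = len(c1 - c2)   # compounds only in the first cluster
    c = len(c2 - c1)   # compounds only in the second cluster
    total = a + b + c
    return a / total if total else 0.0

print(jaccard({"m1", "m2", "m3", "m4"}, {"m3", "m4", "m5"}))  # a=2, b=2, c=1
```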
Non-Hierarchical Clustering
• Compounds are clustered without forming
a hierarchical relationship
• Methods:
– single-pass assigns a compound to a cluster
according to a cut-off value
• Problem: results depend on the order in which the molecules
are processed, so different orderings can give different clusters
– nearest neighbor: Jarvis-Patrick clustering
– relocation: K-means
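The order dependence of the single-pass method can be seen in a short sketch. It assumes one common variant: each compound joins the first cluster whose first member (its "leader") lies within the cut-off, and otherwise starts a new cluster. The one-dimensional descriptor values are made up.

```python
def single_pass(values, cutoff):
    """Cluster 1-D descriptor values in a single pass over the input order."""
    clusters = []  # each cluster is a list whose first element is the leader
    for value in values:
        for cluster in clusters:
            if abs(value - cluster[0]) <= cutoff:
                cluster.append(value)
                break
        else:
            # No existing leader was close enough: start a new cluster.
            clusters.append([value])
    return clusters

# The same three compounds in two different orders.
print(single_pass([1.0, 2.0, 3.0], cutoff=1.0))
print(single_pass([2.0, 1.0, 3.0], cutoff=1.0))
```

The first ordering yields two clusters and the second yields one, although the compounds and the cut-off are identical.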
Dissimilarity-Based Methods
• Attempt to identify a diverse set of
compounds directly
• Based on calculating distances or
dissimilarities between compounds
Basic Algorithm for Dissimilarity-Based Selection Methods
• Decide on a desired size, n, of a final subset
• Select a compound and place it in the subset
• Calculate the dissimilarity between each of the
other compounds and those in the subset
• Choose the next compound as the one most
dissimilar to those in the subset
• If the subset has fewer than n compounds, repeat the
dissimilarity calculation and selection until it has n
• Complexity varies as the square of n
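The basic algorithm above can be sketched as follows. The slides do not say how the dissimilarity between one compound and the whole subset is measured. This sketch assumes the MaxMin convention (the distance to the nearest compound already selected); summing the distances (MaxSum) is another common choice. The function name and the seed compound are illustrative assumptions.

```python
def maxmin_select(distance, n, seed_index=0):
    """Pick n compound indices using MaxMin dissimilarity-based selection."""
    total = len(distance)
    selected = [seed_index]
    # nearest[i] = distance from compound i to its closest selected compound
    nearest = list(distance[seed_index])
    while len(selected) < min(n, total):
        candidates = [i for i in range(total) if i not in selected]
        # Choose the compound most dissimilar to the current subset.
        best = max(candidates, key=lambda i: nearest[i])
        selected.append(best)
        for i in range(total):
            nearest[i] = min(nearest[i], distance[best][i])
    return selected

# Made-up 1-D descriptor values; distances are absolute differences.
values = [0, 1, 2, 10, 11]
distance = [[abs(a - b) for b in values] for a in values]
print(maxmin_select(distance, 3))
```

Keeping the running `nearest` list avoids recomputing every distance to the subset on each pass.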
Cell-Based Methods
• Operate within a pre-defined low-dimensional
chemistry space, not dependent on the
particular set of molecules being examined
• Compounds are allocated to cells according to
their molecular properties
• Methods are very fast with a time complexity of
O(N), but restricted to low-dimensional space
– good for very large data sets
– Example properties: MW, logP, polarity, shape, hydrogen
bonding, aromatic interactions
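A minimal sketch of cell-based partitioning: each property axis is cut into a fixed number of bins, and a compound's cell is the tuple of its bin indices. The axis ranges, bin counts, and property values below are made up for illustration. The single pass over the compounds is what gives the O(N) behaviour.

```python
from collections import defaultdict

def assign_cells(compounds, axes):
    """Map each compound to a grid cell.

    compounds: {name: {property: value}}
    axes: {property: (low, high, n_bins)}, fixed in advance and
          independent of the compounds being examined.
    """
    cells = defaultdict(list)
    for name, props in compounds.items():
        cell = []
        for prop, (low, high, n_bins) in axes.items():
            fraction = (props[prop] - low) / (high - low)
            # Clamp so out-of-range values fall into the edge bins.
            cell.append(min(n_bins - 1, max(0, int(fraction * n_bins))))
        cells[tuple(cell)].append(name)
    return dict(cells)

axes = {"mw": (0.0, 600.0, 3), "logp": (-2.0, 6.0, 4)}
compounds = {
    "A": {"mw": 150.0, "logp": 0.5},
    "B": {"mw": 180.0, "logp": 1.2},
    "C": {"mw": 450.0, "logp": 4.5},
    "D": {"mw": 320.0, "logp": -0.5},
}
cells = assign_cells(compounds, axes)
print(cells)
# A diverse subset can then take one compound from each occupied cell.
print([members[0] for members in cells.values()])
```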
BCUT Descriptors
• Matrix representation of molecules
• Atomic properties used for diagonal
– Atomic charges, polarizabilities, hydrogen-bonding
abilities
• Connectivity used for the off-diagonals
– 2D graph or interatomic distances from 3D
structure
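The matrix construction above can be sketched as follows. BCUT values are conventionally taken as the lowest and highest eigenvalues of such a matrix. The off-diagonal weights used here (0.1 for bonded pairs, 0.001 otherwise) and the atomic property values are illustrative assumptions, not values from the slides. The small Jacobi eigenvalue routine is included only to keep the example free of dependencies.

```python
import math

def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, ascending."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Angle that zeroes the (p, q) element.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def bcut_values(atom_properties, bonds, bonded=0.1, nonbonded=0.001):
    """Return (lowest, highest) eigenvalue of a BCUT-style matrix."""
    n = len(atom_properties)
    matrix = [[nonbonded] * n for _ in range(n)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = bonded     # connectivity off-diagonals
    for i, value in enumerate(atom_properties):
        matrix[i][i] = value                     # atomic property on the diagonal
    eigenvalues = symmetric_eigenvalues(matrix)
    return eigenvalues[0], eigenvalues[-1]

# Hypothetical partial charges for a three-atom chain 0-1-2.
charges = [-0.3, 0.1, 0.2]
print(bcut_values(charges, bonds=[(0, 1), (1, 2)]))
```

Repeating the calculation with a different atomic property on the diagonal gives further descriptor pairs, which can serve as axes of a low-dimensional chemistry space.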
Partitioning Using Pharmacophores
• Each potential 3- or 4-point
pharmacophore is considered to constitute
a cell
• A given molecule could be in more than
one cell
• Promiscuous molecules: those that contain
a large number of pharmacophores, e.g.,
very flexible molecules
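A rough sketch of how one molecule can occupy several 3-point pharmacophore cells: every triple of features is enumerated, and the three inter-feature distances are binned. The feature types, coordinates, and 2.0-unit bin width are hypothetical. The cell key also ignores chirality, which a real implementation may need to handle.

```python
import math
from itertools import combinations

def pharmacophore_cells(features, bin_width=2.0):
    """Return the set of 3-point pharmacophore cells for one conformation.

    features: list of (feature_type, (x, y, z)) tuples.
    """
    cells = set()
    for triple in combinations(features, 3):
        edges = []
        for (type_a, xyz_a), (type_b, xyz_b) in combinations(triple, 2):
            distance_bin = int(math.dist(xyz_a, xyz_b) // bin_width)
            edges.append((tuple(sorted((type_a, type_b))), distance_bin))
        # Sorting the edges makes the key independent of feature order.
        cells.add(tuple(sorted(edges)))
    return cells

features = [
    ("donor", (0.0, 0.0, 0.0)),
    ("acceptor", (3.0, 0.0, 0.0)),
    ("aromatic", (0.0, 4.0, 0.0)),
    ("hydrophobe", (3.0, 4.0, 0.0)),
]
cells = pharmacophore_cells(features)
print(len(cells))
```

For a flexible molecule, the cells from each conformation would be merged into one set, which is how such molecules come to occupy a large number of cells.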
Optimization Techniques
• Techniques for sampling large sets of
compounds
• May want to spread the compounds
evenly in space
• Techniques: Monte Carlo, simulated
annealing
• Selective replacement
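One way to picture an optimization-based selection is a simulated annealing search over subsets. The sketch below swaps one selected compound for an unselected one at each step and accepts worse subsets with a probability that shrinks as the temperature falls. The diversity score (sum of pairwise distances), the linear cooling schedule, and all parameter values are assumptions made for illustration.

```python
import math
import random

def diversity(subset, distance):
    """Sum of pairwise distances within the subset (one possible score)."""
    return sum(distance[i][j] for k, i in enumerate(subset) for j in subset[k + 1:])

def anneal_subset(distance, n, steps=2000, start_temp=1.0, seed=0):
    """Search for a diverse subset of size n (requires n < number of compounds)."""
    rng = random.Random(seed)
    pool = list(range(len(distance)))
    current = rng.sample(pool, n)
    current_score = diversity(current, distance)
    best, best_score = list(current), current_score
    for step in range(steps):
        temperature = start_temp * (1.0 - step / steps) + 1e-9
        # Propose replacing one selected compound with an unselected one.
        outside = [i for i in pool if i not in current]
        candidate = list(current)
        candidate[rng.randrange(n)] = rng.choice(outside)
        score = diversity(candidate, distance)
        delta = score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = list(candidate), score
    return sorted(best), best_score

# Made-up 1-D descriptor values; distances are absolute differences.
values = [0, 1, 2, 3, 10, 20]
distance = [[abs(a - b) for b in values] for a in values]
subset, score = anneal_subset(distance, 3)
print(subset, score)
```

A different score, for example one that rewards even coverage of the space, can be substituted for `diversity` without changing the search loop.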
• Some research suggests that compounds with a Tanimoto
similarity of 0.85 or higher have between a 30% and an
80% chance of sharing the same biological
activity
• No clear consensus on which screening
approach is best
• Faster computer techniques (e.g., parallel
computing) may help
• Descriptors used must be related to biological
activity