Significance of Molecular Similarity Measures for Use in


SIMILARITY & DIVERSITY
SEARCHING OF CHEMICAL
DATABASES
Naomie Salim
Universiti Teknologi Malaysia
Drug Discovery Value Chain
The organisation and processing of chemical data
Drug Development Process
develop
assay
10,000’s
compounds
lead
identification
lead
optimisation
clinical
trials
1 drug
to market
Time from synthesis to
product:
1990s (through 1996): 14.9
years
S.R. Shulman & M.
Manocchia,
Pharmacoeconomics,
September, 1997
Associated cost of bringing
new drug to market estimated
well over $500 million
M.L. Lee & K.M. Payne,
American Pharmaceutical
Review, 1999, 1:55
Computer-Aided Drug
Design
• 3-D target structure unknown
  - random screening if no actives are known
  - similarity searching
  - pharmacophore mapping (LI)
  - pattern recognition methods (LI)
  - QSAR (2D & 3D) (LO)
  - combinatorial library design (LI & LO)

• Structure-based drug design (LI)
  - docking
  - de novo design

Cheminformatics research
in UTM
• Mainly concerned with
development & application of
computer techniques
 Similarity
searching
 Retrieval of diverse compounds
from chemical libraries
Drug Lead Optimization
• When a promising drug molecule
has been found in a drug discovery
program, the next step is to optimize
the structure and properties of the
potential drug.
 Search
for chemical compounds with
similar structure or properties to a
known compound.
Rationale For Using Similarity
Information
Similar property
principle

•
structurally similar
molecules are likely to have
similar properties
Given an active target
molecule, a similarity
search can identify
further molecules in the
database for testing
Property P2
•
Property P1
Representing chemicals: eg.
Connection tables -2D
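[Structure diagram: a 10-atom molecule with atoms numbered 1 to 10]

Atom | Element | Bonded atoms (s = single bond, d = double bond)
1 | O | 2d
2 | C | 1d, 3s, 7s
3 | N | 2s, 4s
4 | C | 3s, 5d
5 | N | 4d, 6s
6 | C | 5s, 7d, 10s
7 | C | 2s, 6d, 8s
8 | C | 7s, 9d
9 | N | 8d, 10s
10 | N | 6s, 9s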
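A connection table of this kind maps naturally onto a dictionary keyed by atom number. The sketch below is illustrative only: it transcribes the ten atoms and their bonds as read from the slide, and the helper names `is_symmetric` and `count_bonds` are assumptions, not part of any cheminformatics toolkit.

```python
# Connection table transcribed from the slide: atom number -> (element, {neighbour: bond type}).
# "s" marks a single bond and "d" a double bond, following the slide's notation.
CONNECTION_TABLE = {
    1: ("O", {2: "d"}),
    2: ("C", {1: "d", 3: "s", 7: "s"}),
    3: ("N", {2: "s", 4: "s"}),
    4: ("C", {3: "s", 5: "d"}),
    5: ("N", {4: "d", 6: "s"}),
    6: ("C", {5: "s", 7: "d", 10: "s"}),
    7: ("C", {2: "s", 6: "d", 8: "s"}),
    8: ("C", {7: "s", 9: "d"}),
    9: ("N", {8: "d", 10: "s"}),
    10: ("N", {6: "s", 9: "s"}),
}


def is_symmetric(table):
    """Check that every bond is recorded from both of its atoms with the same type."""
    return all(
        table[neighbour][1].get(atom) == bond
        for atom, (_, bonds) in table.items()
        for neighbour, bond in bonds.items()
    )


def count_bonds(table):
    """Each bond appears twice in a redundant connection table, so halve the total."""
    return sum(len(bonds) for _, bonds in table.values()) // 2
```

The symmetry check is a cheap way to catch transcription errors, since a redundant connection table lists each bond once per end.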
Bit string similarity
measure
• The bit string similarity
measure is currently the
most widely used approach
for
database
searching
[Downs and Willett, 1996].
• Sub-structural descriptors
encoded
in
bit
string
representation are capable
of encapsulating the activity
and physical properties of
the
molecules
they
characterised [Martin et al.,
1998].
The Tanimoto coefficient as
the coefficient of choice
• Tanimoto and Cosine coefficients
performed better than distance measures
[Willett and Winterman, 1986]
• The Tanimoto coefficient is faster to calculate because it does not involve a square root
• Tanimoto involves a normalisation factor
that helps lessen molecular size effects
Tanimoto : $\frac{a}{a+b+c}$
Cosine : $\frac{a}{\sqrt{(a+b)(a+c)}}$
Euclidean Distance :
n = total bit positions in bit-strings, a = bits set in both, b, c = bits set in only one
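As a rough illustration of how these three measures are computed from two bit strings, the sketch below counts a, b, c and d and applies the Tanimoto and Cosine forms listed in the coefficient table later in the deck. The slide's own Euclidean formula did not survive extraction, so the unnormalised binary form sqrt(b + c) used here is an assumption, and all function names are hypothetical.

```python
from math import sqrt


def counts(x, y):
    """Return (a, b, c, d) for two equal-length sequences of 0/1 bits."""
    if len(x) != len(y):
        raise ValueError("bit strings must have the same length")
    a = sum(1 for i, j in zip(x, y) if i and j)      # set in both
    b = sum(1 for i, j in zip(x, y) if i and not j)  # set only in x
    c = sum(1 for i, j in zip(x, y) if j and not i)  # set only in y
    d = len(x) - a - b - c                           # set in neither
    return a, b, c, d


def tanimoto(x, y):
    a, b, c, _ = counts(x, y)
    return a / (a + b + c) if a + b + c else 0.0


def cosine(x, y):
    a, b, c, _ = counts(x, y)
    denominator = sqrt((a + b) * (a + c))
    return a / denominator if denominator else 0.0


def euclidean_distance(x, y):
    # Assumption: plain binary Euclidean distance, not normalised by n.
    _, b, c, _ = counts(x, y)
    return sqrt(b + c)
```

Tanimoto needs only integer counts and one division, whereas Cosine and Euclidean each need a square root, which is the speed argument made on the slide.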
Tanimoto the best coefficient to
use for molecular similarity ?
• The binary Tanimoto coefficient shows a significant
statistical preference for certain values, at
around 0.3 [Godden et al., 1999]
• Distribution of binary Tanimoto coefficient
values tends to shift towards lower values as
number of bits in the query bit-string
decreases [Lajiness, 1997; Flower, 1998;
Dixon and Koehler, 1999]
• Rankings of coefficients have high variations
between datasets [Willett and Winterman,
1986]
Among approaches taken to
overcome problem
• Modification of the Tanimoto
coefficient [Filiminov et al., 1999;
Fligner et al., 2002]
• Combination of different similarity
coefficients into new coefficients [Dixon
and Koehler, 1999]
• Data fusion [Salim et al., 2003]
Approach taken in our
study
• Performance comparison of several
coefficients taken from the general
literature of information retrieval
• Fusion of the rankings obtained from
those coefficients
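The slide does not say which fusion rule was applied to the rankings. A minimal sketch, assuming a simple sum-of-ranks rule (an assumption, as are the function names), could look like this:

```python
def rank_positions(scores):
    """Map each compound id to its rank (1 = most similar) under one coefficient."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {compound: position for position, compound in enumerate(ordered, start=1)}


def fuse_by_rank_sum(score_tables):
    """Fuse several rankings of the same compounds.

    score_tables: list of {compound_id: similarity}, one dict per coefficient.
    Returns compound ids ordered by the sum of their ranks, lowest sum first;
    ties are broken by compound id so the output is deterministic.
    """
    totals = {}
    for scores in score_tables:
        for compound, position in rank_positions(scores).items():
            totals[compound] = totals.get(compound, 0) + position
    return sorted(totals, key=lambda compound: (totals[compound], compound))
```

Working on ranks rather than raw scores sidesteps the fact that coefficients such as Forbes and Russell/Rao produce values on very different scales.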
Similarity measures studied

n = total bit positions in bit-strings (n = a + b + c + d),
a = bits set in both,
b = bits set in only one of the molecules,
c = bits set in only the other,
d = bits not set in either

Jaccard/Tanimoto : $\frac{a}{a+b+c}$
Dice : $\frac{2a}{2a+b+c}$
Russell/Rao : $\frac{a}{n}$
Sokal/Sneath(1) : $\frac{a}{a+2b+2c}$
Kulczynski(1) : $\frac{a}{b+c}$
Simple Matching : $\frac{a+d}{n}$
Hamann : $\frac{a+d-b-c}{n}$
Sokal/Sneath(2) : $\frac{2a+2d}{a+d+n}$
Rogers/Tanimoto : $\frac{a+d}{b+c+n}$
Sokal/Sneath(3) : $\frac{a+d}{b+c}$
Baroni-Urbani/Buser : $\frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}$
Ochiai/Cosine : $\frac{a}{\sqrt{(a+b)(a+c)}}$
Kulczynski(2) : $\frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)}$
Forbes : $\frac{na}{(a+b)(a+c)}$
Fossum : $\frac{n\left(a-\frac{1}{2}\right)^{2}}{(a+b)(a+c)}$
Simpson : $\frac{a}{\min(a+b,\ a+c)}$
Pearson : $\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$
Yule : $\frac{ad-bc}{ad+bc}$
McConnaughey : $\frac{a^{2}-bc}{(a+b)(a+c)}$
Stiles : $\log_{10}\frac{n\left(|ad-bc|-\frac{n}{2}\right)^{2}}{(a+b)(a+c)(b+d)(c+d)}$
Dennis : $\frac{ad-bc}{\sqrt{n(a+b)(a+c)}}$
Mean Manhattan : $\frac{b+c}{n}$
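Several of the coefficients above differ only in whether they use d and n. The sketch below evaluates a handful of them from the four counts; the formulas are the standard literature definitions as reconstructed here, the function name is hypothetical, and zero denominators are not handled.

```python
from math import sqrt


def selected_coefficients(a, b, c, d):
    """Evaluate a few of the studied coefficients from the counts a, b, c, d.

    Assumes all denominators are non-zero (at least one bit set in each molecule).
    """
    n = a + b + c + d
    return {
        "Jaccard/Tanimoto": a / (a + b + c),
        "Russell/Rao": a / n,
        "Simple Matching": (a + d) / n,
        "Ochiai/Cosine": a / sqrt((a + b) * (a + c)),
        "Forbes": n * a / ((a + b) * (a + c)),
        "Mean Manhattan": (b + c) / n,  # a distance: smaller means more similar
    }
```

For example, `selected_coefficients(2, 1, 1, 4)` gives a Tanimoto value of 0.5 but a Russell/Rao value of only 0.25, because Russell/Rao divides by the full string length n rather than by the bits actually set.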
Clusters of coefficients
 {Jaccard/Tanimoto, Dice, Sokal/Sneath(1), Kulczynski(1)}
 {Russell/Rao}
 {Simple Matching, Hamann, Sokal/Sneath(2), Rogers/Tanimoto,
Sokal/Sneath(3), Mean Manhattan}
 {Baroni-Urbani/Buser}
 {Ochiai/Cosine}
 {Kulczynski(2), McConnaughey}
 {Forbes}
 {Fossum}
 {Simpson}
 {Pearson}
 {Yule}
 {Stiles}
 {Dennis}
Datasets used


30 activities from MDDR
21 activities from ID Alert
Bit string used



BCI bit strings
Daylight fingerprints
UNITY 2D bit strings
Performance measure

Average number of actives in top
400 structures
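The performance measure can be sketched as follows, assuming each query produces a list of compound identifiers sorted from most to least similar. The function names and the list-of-rankings input format are assumptions made for illustration.

```python
def actives_in_top(ranked_ids, active_ids, k=400):
    """Count how many known actives appear among the first k ranked compounds."""
    active = set(active_ids)
    return sum(1 for compound in ranked_ids[:k] if compound in active)


def average_actives_in_top(rankings, active_ids, k=400):
    """Average the top-k active count over several rankings, one per query structure."""
    if not rankings:
        raise ValueError("at least one ranking is required")
    return sum(actives_in_top(r, active_ids, k) for r in rankings) / len(rankings)
```

The cut-off k defaults to 400 to match the slide; any other cut-off, such as the top 5% used later in the deck, can be passed explicitly.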
What coefficient is the
best for similarity
searching ?
Rankings of coefficients (MDDR)

Target | Avg Num Bit Set | Std Dev Bit Set | Best | Tan | Rus | SM | Bar | Cos | Kul | For | Fos | Sim | Pea | Yul | Sti | Den

Unity 2D (992 bits)
Enkephalinase inhibitor | 163 | 36.78 | For | 8 | 13 | 2 | 5 | 9 | 11 | 1 | 10 | 12 | 5 | 4 | 7 | 3
Prolylendopeptidase inhibitor | 184 | 33.19 | Den | 6 | 13 | 5 | 2 | 7 | 9 | 11 | 8 | 12 | 3 | 10 | 3 | 1
Dopamine (D2) antagonist | 198 | 43.26 | Den | 9 | 13 | 6 | 4 | 7 | 10 | 11 | 8 | 12 | 2 | 5 | 3 | 1
Vasodilator | 206 | 62.00 | Den | 6 | 13 | 10 | 5 | 7 | 9 | 11 | 8 | 12 | 2 | 4 | 2 | 1
Thromboxane antagonist | 209 | 37.24 | Den | 6 | 13 | 9 | 2 | 7 | 9 | 11 | 8 | 12 | 2 | 5 | 4 | 1
Excitatory Amino acid inhibitor | 209 | 59.23 | Bar/Pea | 7 | 13 | 10 | 1 | 5 | 8 | 11 | 6 | 12 | 1 | 9 | 3 | 4
Potassium channel blocker | 220 | 49.41 | Cos | 4 | 11 | 10 | 7 | 1 | 3 | 13 | 2 | 12 | 6 | 8 | 5 | 9
Leukotriene D4 antagonist | 224 | 46.18 | Bar | 2 | 13 | 10 | 1 | 3 | 7 | 12 | 4 | 11 | 5 | 9 | 5 | 8
Factor Xa inhibitor | 240 | 55.97 | Kul | 7 | 12 | 10 | 6 | 2 | 1 | 13 | 2 | 11 | 5 | 9 | 4 | 8
Endothelin ETA antagonist | 272 | 60.77 | Rus | 3 | 1 | 10 | 6 | 3 | 5 | 13 | 2 | 12 | 8 | 11 | 6 | 9

BCI 1052 (1052 bits)
Melatonin agonist | 193 | 60.11 | For | 10 | 13 | 4 | 5 | 8 | 11 | 1 | 8 | 12 | 7 | 2 | 6 | 3
Adrenergic (alpha2) agonist | 219 | 46.65 | Bar | 7 | 13 | 8 | 1 | 9 | 11 | 2 | 10 | 12 | 5 | 3 | 6 | 4
Antiischemic, Myocardial | 229 | 85.52 | For | 12 | 13 | 4 | 6 | 10 | 11 | 1 | 9 | 3 | 8 | 2 | 7 | 5
Adrenoceptor (beta3) agonist | 255 | 76.25 | Pea/Sti | 7 | 13 | 10 | 8 | 4 | 6 | 11 | 3 | 12 | 1 | 9 | 1 | 5
Neutral Endopeptidase inhibitor | 255 | 79.00 | Fos | 5 | 13 | 10 | 8 | 2 | 6 | 12 | 1 | 11 | 4 | 9 | 3 | 7
H+/K+ - ATPase inhibitor | 276 | 73.63 | Sti | 4 | 13 | 11 | 5 | 6 | 8 | 10 | 7 | 12 | 2 | 9 | 1 | 3
IL-8 inhibitor | 281 | 81.08 | Kul | 8 | 13 | 12 | 9 | 2 | 1 | 11 | 2 | 10 | 6 | 5 | 4 | 7
Xanthine | 299 | 69.15 | Cos/Fos | 3 | 11 | 13 | 7 | 1 | 4 | 12 | 1 | 10 | 6 | 9 | 5 | 8
Phosphodiesterase inhibitor | 332 | 77.86 | Tan | 1 | 9 | 12 | 5 | 3 | 4 | 13 | 2 | 11 | 5 | 10 | 5 | 8
Adenosine (A2) antagonist | 348 | 102.76 | Rus | 7 | 1 | 10 | 11 | 8 | 2 | 13 | 8 | 3 | 4 | 12 | 4 | 6
Rankings of coefficients (ID Alert)

Target | Avg Num Bit Set | Std Dev Bit Set | Best | Tan | Rus | SM | Bar | Cos | Kul | For | Fos | Sim | Pea | Yul | Sti | Den

Unity 2D (Average BS = 207, Std. Dev. = 71.11)
PAF antagonist | 167 | 66.59 | For | 8 | 13 | 3 | 2 | 9 | 11 | 1 | 9 | 12 | 5 | 5 | 5 | 4
TXA2 antagonist | 184 | 76.94 | For | 7 | 5 | 4 | 9 | 10 | 6 | 1 | 10 | 2 | 10 | 3 | 10 | 7
HMG CoA reductase inhibitor | 204 | 70.02 | Cos/Fos | 3 | 13 | 10 | 5 | 1 | 3 | 8 | 1 | 10 | 6 | 8 | 6 | 10
Angiotensin II antagonist | 211 | 71.30 | Cos/Fos | 3 | 13 | 10 | 5 | 1 | 3 | 12 | 1 | 10 | 8 | 6 | 7 | 9
Tyrosine kinase inhibitor | 218 | 70.81 | Rus | 2 | 1 | 11 | 6 | 2 | 4 | 13 | 4 | 9 | 7 | 11 | 7 | 9
ACAT inhibitor | 232 | 72.51 | Rus | 6 | 1 | 11 | 7 | 4 | 2 | 13 | 3 | 5 | 9 | 12 | 8 | 10
Cephalosporin | 245 | 82.89 | Rus | 7 | 1 | 9 | 9 | 2 | 2 | 13 | 4 | 7 | 4 | 12 | 4 | 11

BCI 1052 (Average BS = 96, Std. Dev. = 31.80)
Cyclooxygenase inhibitor | 77 | 30.25 | For | 6 | 6 | 2 | 5 | 10 | 9 | 1 | 10 | 3 | 10 | 3 | 10 | 6
Antibacterial | 88 | 27.71 | For | 11 | 13 | 3 | 4 | 5 | 10 | 1 | 8 | 11 | 8 | 2 | 5 | 5
Immunosuppressant | 95 | 39.42 | Sim | 11 | 3 | 5 | 13 | 7 | 6 | 2 | 8 | 1 | 10 | 4 | 9 | 12
NMDA antagonist | 97 | 32.86 | Rus | 2 | 1 | 12 | 5 | 4 | 9 | 13 | 3 | 10 | 5 | 10 | 5 | 5
Fungicide | 100 | 36.91 | Rus | 10 | 1 | 13 | 11 | 6 | 3 | 5 | 7 | 2 | 8 | 4 | 9 | 12
Progestogen antagonist | 117 | 34.18 | Rus | 4 | 1 | 4 | 4 | 4 | 2 | 4 | 4 | 2 | 4 | 4 | 4 | 4
Kappa opioid agonist | 119 | 35.82 | Rus | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2

Daylight (Average BS = 269, Std. Dev. = 111.59)
Antiarrhythmic agent | 163 | 72.69 | For | 12 | 13 | 3 | 4 | 4 | 4 | 1 | 4 | 4 | 4 | 1 | 4 | 4
Fibrinogen antagonist | 213 | 89.78 | For | 12 | 13 | 2 | 7 | 9 | 7 | 1 | 9 | 9 | 4 | 2 | 4 | 4
Calcium channel blocker | 252 | 88.62 | Den | 10 | 13 | 7 | 3 | 7 | 5 | 11 | 9 | 12 | 2 | 4 | 6 | 1
Potassium channel blocker | 275 | 98.90 | Tan | 1 | 12 | 10 | 6 | 2 | 4 | 11 | 3 | 12 | 5 | 9 | 6 | 8
Acetylcholinesterase inhibitor | 307 | 79.81 | Bar/Pea/Sti | 7 | 4 | 12 | 1 | 7 | 4 | 13 | 7 | 10 | 1 | 10 | 1 | 4
PDE inhibitor | 322 | 125.48 | Rus | 5 | 1 | 5 | 2 | 7 | 7 | 7 | 7 | 2 | 7 | 13 | 7 | 2
Muscarinic M1 agonist | 337 | 93.66 | Rus | 5 | 1 | 5 | 5 | 5 | 5 | 5 | 2 | 2 | 5 | 5 | 2 | 5
Summarising ranks from both databases

Performance | Tan | Rus | SM | Bar | Cos | Ku2 | For | Fos | Sim | Pea | Yul | Sti | Den
Number of times in top 3 ranking | 15 | 15 | 8 | 14 | 20 | 16 | 15 | 21 | 11 | 13 | 12 | 11 | 13
Number of times in bottom 3 ranking | 11 | 29 | 19 | 8 | 8 | 10 | 34 | 8 | 35 | 7 | 14 | 7 | 8
Ranking based on times in top 3 and bottom 3 | 7 | 11 | 10 | 3 | 2 | 3 | 12 | 1 | 13 | 3 | 9 | 7 | 6
Average ranking | 6.06 | 8.61 | 8.14 | 5.41 | 4.98 | 5.76 | 8.43 | 5.25 | 8.86 | 5.08 | 6.84 | 4.94 | 5.71
Ranking based on average ranking | 8 | 12 | 10 | 5 | 2 | 7 | 11 | 4 | 13 | 3 | 9 | 1 | 6
Avg. G-H score | 0.1832 | 0.1443 | 0.1673 | 0.1827 | 0.1839 | 0.1824 | 0.1541 | 0.1838 | 0.1485 | 0.1845 | 0.1763 | 0.1845 | 0.1828
Ranking based on G-H scores | 5 | 13 | 10 | 7 | 3 | 8 | 11 | 4 | 12 | 2 | 9 | 1 | 6
Is any coefficient consistently better than Tanimoto?
How do the coefficients perform when compared to Tanimoto?

Overall comparison with Tanimoto (1372 cases, across 51 activities, 2 databases, 3 fingerprints)

Coefficient | Greater | Equal | Less | Greater-Less
Bar | 383 | 603 | 386 | -3
Pea | 325 | 692 | 355 | -30
Sti | 319 | 703 | 350 | -31
Cos | 260 | 813 | 299 | -39
Fos | 256 | 808 | 308 | -52
Den | 341 | 615 | 416 | -75
Kul | 317 | 626 | 429 | -112
Yul | 392 | 427 | 553 | -161
SM | 387 | 354 | 631 | -244
For | 388 | 288 | 696 | -308
Sim | 381 | 271 | 720 | -339
Rus | 360 | 308 | 704 | -344
Average Improvement over Tanimoto (number of actives among top 400 in 51 activities, 2 databases, 3 fingerprints)

Coefficient | Avg. Imp.
Bar | 0.007
Pea | 0.005
Sti | 0.005
Cos | 0.003
Fos | 0.001
Kul | 0.000
Den | -0.004
Yul | -0.021
SM | -0.050
For | -0.085
Sim | -0.127
Rus | -0.140
Is there any relationship
between performance of
coefficients and number
of bits set in active
compounds ?
Considering activities with the lowest z-score in terms of number of bits set

Target | Avg Num Bit Set | Std Dev Bit Set | % of Bit Set | z-score of Bit Set | Num. of Act. | Best Ind. | Tan | Rus | SM | Bar | Cos | Kul | For | Fos | Sim | Pea | Yul | Sti | Den
Antiarrhythmic agent | 163 | 72.69 | 7.96 | -0.95 | 6 | For/Yul | 12 | 13 | 3 | 4 | 4 | 4 | 1 | 4 | 4 | 4 | 1 | 4 | 4
Melatonin agonist | 193 | 60.11 | 9.42 | -0.865 | 132 | For | 10 | 13 | 4 | 5 | 8 | 11 | 1 | 8 | 12 | 7 | 2 | 6 | 3
Enkephalinase inhibitor | 163 | 36.78 | 16.4315 | -0.831 | 100 | For | 8 | 13 | 2 | 5 | 9 | 11 | 1 | 10 | 12 | 5 | 4 | 7 | 3
Adrenergic (alpha2) agonist | 219 | 46.65 | 10.69 | -0.626 | 113 | Bar | 7 | 13 | 8 | 1 | 9 | 11 | 2 | 10 | 12 | 5 | 3 | 6 | 4
Cyclooxygenase inhibitor | 77 | 30.25 | 7.32 | -0.60 | 22 | For | 6 | 6 | 2 | 5 | 10 | 9 | 1 | 10 | 3 | 10 | 3 | 10 | 6
PAF antagonist | 167 | 66.59 | 16.83 | -0.56 | 31 | For | 8 | 13 | 3 | 2 | 9 | 11 | 1 | 9 | 12 | 5 | 5 | 5 | 4
Prolylendopeptidase inhibitor | 184 | 33.19 | 18.5484 | -0.535 | 294 | Den | 6 | 13 | 5 | 2 | 7 | 9 | 11 | 8 | 12 | 3 | 10 | 3 | 1
Antiischemic, Myocardial | 229 | 85.52 | 11.18 | -0.534 | 135 | For | 12 | 13 | 4 | 6 | 10 | 11 | 1 | 9 | 3 | 8 | 2 | 7 | 5
Fibrinogen antagonist | 213 | 89.78 | 10.40 | -0.50 | 19 | For | 12 | 13 | 2 | 7 | 9 | 7 | 1 | 9 | 9 | 4 | 2 | 4 | 4
Dopamine (D2) antagonist | 198 | 43.26 | 19.9597 | -0.337 | 187 | Den | 9 | 13 | 6 | 4 | 7 | 10 | 11 | 8 | 12 | 2 | 5 | 3 | 1
TXA2 antagonist | 184 | 76.94 | 18.55 | -0.32 | 26 | For | 7 | 5 | 4 | 9 | 10 | 6 | 1 | 10 | 2 | 10 | 3 | 10 | 7
Squalene Synthetase inhibitor | 92 | 36.28 | 8.75 | -0.3 | 415 | Yul | 12 | 13 | 10 | 11 | 9 | 3 | 2 | 8 | 4 | 6 | 1 | 5 | 6
Adrenoceptor (beta3) agonist | 255 | 76.25 | 12.45 | -0.294 | 271 | Pea/Sti | 7 | 13 | 10 | 8 | 4 | 6 | 11 | 3 | 12 | 1 | 9 | 1 | 5
Neutral Endopeptidase inhibitor | 255 | 79 | 12.45 | -0.294 | 209 | Fos | 5 | 13 | 10 | 8 | 2 | 6 | 12 | 1 | 11 | 4 | 9 | 3 | 7
Antibacterial | 88 | 27.71 | 8.37 | -0.25 | 35 | For | 11 | 13 | 3 | 4 | 5 | 10 | 1 | 8 | 11 | 8 | 2 | 5 | 5
Vasodilator | 206 | 62 | 20.7661 | -0.224 | 407 | Den | 6 | 13 | 10 | 5 | 7 | 9 | 11 | 8 | 12 | 2 | 4 | 2 | 1

Best three coefficients for each activity are highlighted
Forbes appears in this best three list in 11 out of 17 cases
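The slides do not define the z-score explicitly. The sketch below assumes it is the activity's average number of bits set, standardised against the mean and standard deviation of bits set for the whole fingerprint dataset. This assumption reproduces the tabulated values when the Unity 2D figures quoted later in the deck (average 221.84, standard deviation 70.78) are used; the function name is hypothetical.

```python
def bits_set_z_score(activity_mean_bits, dataset_mean_bits, dataset_std_bits):
    """Standardise an activity class's average bits set against the whole dataset."""
    if dataset_std_bits <= 0:
        raise ValueError("standard deviation must be positive")
    return (activity_mean_bits - dataset_mean_bits) / dataset_std_bits


# Enkephalinase inhibitor, Unity 2D: matches the -0.831 listed in the table.
enkephalinase_z = bits_set_z_score(163, 221.84, 70.78)

# Prolylendopeptidase inhibitor, Unity 2D: matches the listed -0.535.
prolylendopeptidase_z = bits_set_z_score(184, 221.84, 70.78)
```

A negative z-score therefore marks an activity class whose actives are smaller, in bits set, than the typical database structure.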
Considering activities with the highest z-score in terms of number of bits set

Target | Avg Num Bit Set | Std Dev Bit Set | % of Bit Set | z-score of Bit Set | Num. of Act. | Best Ind. | Tan | Rus | SM | Bar | Cos | Kul | For | Fos | Sim | Pea | Yul | Sti | Den
CRF antagonist | 107 | 16.49 | 10.17 | 0.221 | 254 | Tan | 1 | 11 | 10 | 8 | 2 | 6 | 13 | 3 | 12 | 4 | 9 | 5 | 7
Factor Xa inhibitor | 240 | 55.97 | 24.1935 | 0.2566 | 412 | Kul | 7 | 12 | 10 | 6 | 2 | 1 | 13 | 2 | 11 | 5 | 9 | 4 | 8
Acetylcholinesterase inhibitor | 307 | 79.81 | 14.99 | 0.34 | 18 | Bar/Pea/Sti | 7 | 4 | 12 | 1 | 7 | 4 | 13 | 7 | 10 | 1 | 10 | 1 | 4
ACAT inhibitor | 232 | 72.51 | 23.39 | 0.35 | 54 | Rus | 6 | 1 | 11 | 7 | 4 | 2 | 13 | 3 | 5 | 9 | 12 | 8 | 10
Systemic Lupus Erythematosus, agent for | 111 | 30.15 | 10.55 | 0.36 | 99 | Rus | 9 | 1 | 12 | 7 | 3 | 6 | 13 | 2 | 10 | 5 | 11 | 4 | 7
Phosphodiesterase inhibitor | 332 | 77.86 | 16.21 | 0.4137 | 130 | Tan | 1 | 9 | 12 | 5 | 3 | 4 | 13 | 2 | 11 | 5 | 10 | 5 | 8
PDE inhibitor | 322 | 125.5 | 15.72 | 0.47 | 19 | Rus | 5 | 1 | 5 | 2 | 7 | 7 | 7 | 7 | 2 | 7 | 13 | 7 | 2
Cephalosporin | 245 | 82.89 | 24.70 | 0.53 | 20 | Rus | 7 | 1 | 9 | 9 | 2 | 2 | 13 | 4 | 7 | 4 | 12 | 4 | 11
Adenosine (A2) antagonist | 348 | 102.8 | 16.99 | 0.5608 | 106 | Rus | 7 | 1 | 10 | 11 | 8 | 2 | 13 | 8 | 3 | 4 | 12 | 4 | 6
Vasopressin V1 antagonist | 119 | 19.33 | 11.31 | 0.6379 | 125 | Kul | 6 | 10 | 11 | 8 | 2 | 1 | 13 | 2 | 12 | 5 | 9 | 4 | 7
Progestogen antagonist | 117 | 34.18 | 11.12 | 0.66 | 10 | Rus | 4 | 1 | 4 | 4 | 4 | 2 | 4 | 4 | 2 | 4 | 4 | 4 | 4
Endothelin ETA antagonist | 272 | 60.77 | 27.4194 | 0.7087 | 158 | Rus | 3 | 1 | 10 | 6 | 3 | 5 | 13 | 2 | 12 | 8 | 11 | 6 | 9
Kappa opioid agonist | 119 | 35.82 | 11.31 | 0.72 | 6 | Rus | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
Oxazolidinone | 123 | 12.1 | 11.69 | 0.7769 | 306 | Bar | 2 | 9 | 10 | 1 | 3 | 8 | 13 | 4 | 12 | 6 | 11 | 5 | 7
Neurokinin antagonist | 125 | 20.03 | 11.88 | 0.8464 | 201 | Rus | 2 | 1 | 10 | 8 | 4 | 5 | 13 | 3 | 12 | 7 | 11 | 6 | 9
Anthracycline | 126 | 14.89 | 11.98 | 0.8812 | 160 | Tan/Bar/Cos/Kul/Fos/Pea/Sti/Den | 1 | 10 | 11 | 1 | 1 | 1 | 13 | 1 | 12 | 1 | 9 | 1 | 1
Penicillin | 159 | 24.93 | 15.11 | 2.0278 | 79 | Rus | 5 | 1 | 11 | 9 | 4 | 2 | 13 | 3 | 12 | 6 | 10 | 6 | 8

Best three coefficients for each activity are highlighted
Russell/Rao appears in this best three list in 10 out of 17 cases
Considering activities with the medium z-score in terms of number of bits set

Target | Avg Num Bit Set | Std Dev Bit Set | % of Bit Set | z-score of Bit Set | Num. of Act. | Best Ind. | Tan | Rus | SM | Bar | Cos | Kul | For | Fos | Sim | Pea | Yul | Sti | Den
Thromboxane antagonist | 209 | 37.24 | 21.0685 | -0.181 | 499 | Den | 6 | 13 | 9 | 2 | 7 | 9 | 11 | 8 | 12 | 2 | 5 | 4 | 1
Excitatory Amino acid inhibitor | 209 | 59.23 | 21.0685 | -0.181 | 122 | Bar/Pea | 7 | 13 | 10 | 1 | 5 | 8 | 11 | 6 | 12 | 1 | 9 | 3 | 4
Calcium channel blocker | 252 | 88.62 | 12.30 | -0.15 | 46 | Den | 10 | 13 | 7 | 3 | 7 | 5 | 11 | 9 | 12 | 2 | 4 | 6 | 1
Reverse Transcriptase inhibitor | 97 | 23.56 | 9.22 | -0.126 | 518 | Bar | 6 | 13 | 10 | 1 | 7 | 10 | 8 | 9 | 12 | 3 | 5 | 4 | 2
H+/K+ - ATPase inhibitor | 276 | 73.63 | 13.48 | -0.101 | 638 | Sti | 4 | 13 | 11 | 5 | 6 | 8 | 10 | 7 | 12 | 2 | 9 | 1 | 3
Antiischemic agent | 99 | 22.76 | 9.41 | -0.057 | 148 | For | 12 | 13 | 4 | 6 | 10 | 11 | 1 | 9 | 3 | 8 | 2 | 7 | 5
IL-8 inhibitor | 281 | 81.08 | 13.72 | -0.055 | 148 | Kul | 8 | 13 | 12 | 9 | 2 | 1 | 11 | 2 | 10 | 6 | 5 | 4 | 7
HMG CoA reductase inhibitor | 204 | 70.02 | 20.56 | -0.04 | 30 | Cos/Fos | 3 | 13 | 10 | 5 | 1 | 3 | 8 | 1 | 10 | 6 | 8 | 6 | 10
Immunosuppressant | 95 | 39.42 | 9.03 | -0.03 | 69 | Sim | 11 | 3 | 5 | 13 | 7 | 6 | 2 | 8 | 1 | 10 | 4 | 9 | 12
Potassium channel blocker | 220 | 49.41 | 22.1774 | -0.026 | 187 | Cos | 4 | 11 | 10 | 7 | 1 | 3 | 13 | 2 | 12 | 6 | 8 | 5 | 9
Leukotriene D4 antagonist | 224 | 46.18 | 22.5806 | 0.0305 | 465 | Bar | 2 | 13 | 10 | 1 | 3 | 7 | 12 | 4 | 11 | 5 | 9 | 5 | 8
NMDA antagonist | 97 | 32.86 | 9.22 | 0.03 | 52 | Rus | 2 | 1 | 12 | 5 | 4 | 9 | 13 | 3 | 10 | 5 | 10 | 5 | 5
Potassium channel blocker | 275 | 98.9 | 13.43 | 0.05 | 29 | Tan | 1 | 12 | 10 | 6 | 2 | 4 | 11 | 3 | 12 | 5 | 9 | 6 | 8
Angiotensin II antagonist | 211 | 71.3 | 21.27 | 0.06 | 115 | Cos/Fos | 3 | 13 | 10 | 5 | 1 | 3 | 12 | 1 | 10 | 8 | 6 | 7 | 9
Xanthine | 299 | 69.15 | 14.60 | 0.1102 | 107 | Cos/Fos | 3 | 11 | 13 | 7 | 1 | 4 | 12 | 1 | 10 | 6 | 9 | 5 | 8
Fungicide | 100 | 36.91 | 9.51 | 0.13 | 101 | Rus | 10 | 1 | 13 | 11 | 6 | 3 | 5 | 7 | 2 | 8 | 4 | 9 | 12
Tyrosine kinase inhibitor | 218 | 70.81 | 21.98 | 0.15 | 23 | Rus | 2 | 1 | 11 | 6 | 2 | 4 | 13 | 4 | 9 | 7 | 11 | 7 | 9

Best three coefficients for each activity are highlighted
Cos appears in this best three list in 8 out of 17 cases, Tan and Fos 7/11
Distribution of number of bits set in top 5%
structures obtained through similarity searching
with 21 5HT4 Agonist targets
[Figure: Count plotted against Number of bits set for the Cos, Rus and For coefficients]
Sample 5HT4 Agonists with very different
ranks using the Russell/Rao and the Forbes
coefficients.
[Structure diagrams of four 5HT4 agonists]

Compound | Bits set | Rank Rus | Rank For
1 | 169 | 146 | 10
2 | 161 | 406 | 45
3 | 308 | 28 | 1036
4 | 299 | 40 | 1004
Average number of bits set in different
similarity percentiles
[Figure: Count plotted against Number of bits set for Enkephalinase inhibitor, Leukotriene D4 antagonist and Endothelin ETA antagonist]
[Figures: Average number of bits set plotted against Percentile (0.05 to 0.95) for the Tanimoto, Russell/Rao, Cosine, Forbes and Stiles coefficients, one panel per activity]
Can combination of
coefficients give better
performance?
Combination of coefficients has shown
improvement over use of single coefficients
[Figure: four panels plotting Actives in top 5% against Number of fused coeffs (1 to 12): (a) 5HT4 agonists, (b) HIV-1 protease inhibitors, (c) ACE inhibitors, (d) Benzodiazepine agonists. Series: Best sum of actives, Average sum of actives, Median sum of actives, Best single, Average single, Median single]
Figure 3. Performance of fusions of coefficients compared to performance of single
coefficients. Compounds were characterised by Unity 2D bit strings.
What coefficients to include in
combinations?
Three main scenarios ….
Case 1: For, SM the best, Rus the worst
Eg : Enkephalinase inhibitor, 2-fusion

Coefficient | Sum of Rank | Position
For | 6832.5 | 1
SM | 8696.5 | 2
Den | 10372.5 | 3
Bar | 10814.5 | 4
Yul | 11120.5 | 5
Pea | 11327 | 6
Tan | 12497.5 | 7
Cos | 12781 | 8
Fos | 12844.5 | 9
Kul | 13308 | 10
Sti | 13522 | 11
Sim | 17212 | 12
Rus | 21950.5 | 13

Case 2: Tan, Fos, Cos, Bar the best
Eg : Leukotriene D4 antagonist, 2-fusion

Coefficient | Sum of Rank | Position
Tan | 8866 | 1
Fos | 9850.5 | 2
Cos | 9919.5 | 3
Bar | 10200 | 4
Rus | 10218.5 | 5
Kul | 11295 | 6
Pea | 11473.5 | 7
Den | 12662.5 | 8
Sti | 12829 | 9
Yul | 13849 | 10
SM | 14141 | 11
Sim | 16038 | 12
For | 16506.5 | 13

Case 3: Rus the best, SM, For the worst
Eg : Endothelin ETA antagonists, 2-fusion

Coefficient | Sum of Rank | Position
Rus | 7406 | 1
Tan | 9950.5 | 2
Cos | 9957.5 | 3
Fos | 10005.5 | 4
Kul | 10416 | 5
Bar | 10971.5 | 6
Pea | 12071 | 7
Sti | 12824.5 | 8
Den | 13351 | 9
Yul | 13822.5 | 10
Sim | 14141.5 | 11
SM | 14624.5 | 12
For | 17823.5 | 13
Corresponding to three different
situations ….
[Figure: Count plotted against Number of bits set for Endothelin ETA antagonist, Leukotriene D4 antagonist and Enkephalinase inhibitor]
Typical Scenario ….
Target
Avg Num
Best
Bit Set
BestInd Single Tan
Melatonin agonist
193
Best 2
Best 2 Comb
Best 3
Best 3 Comb
Best 4
Best 4 Comb
For
74.10
70.00
75.00 SM,Den
75.29 SM, Kul,For
75.33
Bar
45.90
43.90
46.52 SM, Kul
46.71 SM,Cos,For
46.48
For
15.67
8.90
23.76 SM,For,Yul
23.52
Adrenergic (alpha2) agonist 219
Antiischemic, Myocardial
Adrenoceptor (beta3)
agonist
Neutral Endopeptidase
inhibitor
H+/K+ - AT Pase inhibitor
229
255
Pea/Sti
Fos
92.52
93.81 T an,Fos
94.14 Rus,Bar,For
94.90
150.24 148.29 150.57 T an,Den
150.95 Rus,SM, Sim 151.29
281
299
Cos/Fos
Phosphodiesterase inhibitor 332
T an
Adenosine (A2) antagonist
93.57
276
Kul
Xanthine
129.00 Rus,SM, For 131.57
255
Sti
IL-8 inhibitor
24.24 SM,For
Fos,Pea or
127.10 124.29 127.19 Cos,Sti
348
Rus
41.48
39.38
42.38 Sim,Pea
42.48 T an,Sim,Yul
42.38
SM,For,Yul,
Sti
SM, Bar,
Cos, For
SM,Kul,For,
Fos
Rus,SM,
For,Sim
Rus,Cos,
Kul,For
Rus,SM,
For,Sim
Kul,Sim,
Pea,Yul
T an,Rus,
Sim,Cos/Kul
/Fos or
Rus,Cos,Fos,
Sim
Any fusion
between
T an,Cos,Kul,
106.57 106.52 106.57 Fos
106.86 Rus,Bar,Sim 106.81
Rus,Kul,Pea
or
T an,Rus,
31.48 31.48 33.52 Rus,Pea
32.62 T an,Rus,Bar 32.29 Cos,Fos
Rus,Kul,Sim,
32.10 29.00 33.38 Rus,Sim
33.00 Rus,Kul,Sim
32.19 Cos/Fos
The best coefficient over all
fusions (51 activities, 2
databases, 3 bit strings)?
2-coefficient fusion

Coefficient | Sum of Rank | Position
Tan | 746448.5 | 1
Bar | 751827.5 | 2
Fos | 753762.5 | 3
Cos | 760313.5 | 4
Kul | 777431 | 5
Pea | 783451.5 | 6
Den | 791224 | 7
Yul | 816033 | 8
SM | 827390.5 | 9
For | 839259 | 10
Rus | 842147 | 11
Sti | 854540.5 | 12
Sim | 865398.5 | 13

3-coefficient fusion

Coefficient | Sum of Rank | Position
Tan | 18544277 | 1
Bar | 18685068 | 2
Fos | 18703788 | 3
Cos | 18822593 | 4
Kul | 19010199 | 5
Pea | 19265897 | 6
Den | 19315527 | 7
Yul | 19614883 | 8
Rus | 19706964 | 9
SM | 19725423 | 10
For | 19779910 | 11
Sim | 20155363 | 12
Sti | 20870873 | 13
Which combinations of coefficients
are the best (based on ordinal
values)?

Overall 10 best 2-combinations

Comb. | Ordinal value | Sum of mean active
Bar,Fos | 52642 | 2295.51
Tan,Bar | 53216 | 2288.23
Bar,Ku2 | 53366.5 | 2299.08
Bar,Cos | 53420.5 | 2293.66
Tan,Fos | 53651.5 | 2292.50
Tan,Cos | 54164 | 2292.27
Tan,Ku2 | 54694 | 2292.50
Cos,Fos | 55267.5 | 2290.12
Tan,Pea | 55388 | 2279.00
Tan,Den | 55460.5 | 2280.80

Overall 10 best 3-combinations

Comb. | Ordinal value | Sum of mean active
Tan,Bar,Fos | 212726 | 2294.24
Bar,Ku2,Fos | 213374 | 2302.60
Bar,Cos,Ku2 | 213680 | 2301.88
Tan,Bar,Ku2 | 214694 | 2297.60
Tan,Bar,Cos | 214700 | 2293.60
Bar,Cos,Fos | 215678 | 2296.99
Tan,Cos,Fos | 219859 | 2290.41
Tan,For,Fos | 220272 | 2262.36
Bar,Fos,Yul | 220831 | 2270.84
Tan,Ku2,Fos | 220836 | 2291.47

Overall 10 best 4-combinations

Combination | Ordinal value | Sum of mean active
Tan,Bar,Cos,Ku2 | 588521.5 | 2296.46
Tan,Bar,Cos,Fos | 590130.5 | 2296.12
Bar,Cos,Ku2,Fos | 590610.5 | 2300.60
Tan,Bar,Ku2,Fos | 590784 | 2296.17
Tan,Ku2,For,Fos | 602540.5 | 2283.06
Bar,Cos,Fos,Yul | 604382 | 2279.33
Tan,Cos,Ku2,Fos | 605015 | 2288.87
Tan,SM,Ku2,Fos | 605220.5 | 2270.85
Tan,Bar,Fos,Yul | 606512.5 | 2279.13
Tan,Cos,For,Fos | 606617.5 | 2280.78
How do combinations
compare with single
coefficients ?
Average G-H Scores

Overall 10 best singles

Coef. | Avg. GHS
Sti | 0.1845
Pea | 0.1845
Cos | 0.1839
Fos | 0.1838
Tan | 0.1832
Den | 0.1828
Bar | 0.1827
Kul | 0.1824
Yul | 0.1763
SM | 0.1673

Overall 10 best 2-fusions

Comb. | Avg. GHS
Bar,Fos | 0.1853
Tan,Fos | 0.1851
Bar,Kul | 0.1850
Tan,Kul | 0.1850
Tan,Cos | 0.1846
Tan,Bar | 0.1845
Cos,Fos | 0.1845
Bar,Cos | 0.1844
Kul,Fos | 0.1843
Cos,Kul | 0.1842

Overall 10 best 3-fusions

Comb. | Avg. GHS
Bar,Kul,Fos | 0.1854
Bar,Cos,Kul | 0.1853
Tan,Bar,Fos | 0.1850
Tan,Bar,Kul | 0.1850
Tan,Kul,Fos | 0.1848
Tan,Cos,Kul | 0.1848
Bar,Cos,Fos | 0.1847
Tan,Bar,Cos | 0.1845
Tan,Cos,Fos | 0.1844
Cos,Kul,Fos | 0.1843

Overall 10 best 4-fusions

Combination | Avg. GHS
Tan,Bar,Cos,Kul | 0.1845
Tan,Bar,Kul,Fos | 0.1844
Bar,Cos,Kul,Fos | 0.1844
Tan,Cos,Kul,Fos | 0.1841
Tan,Bar,Cos,Fos | 0.1840
Tan,Bar,Fos,Sim | 0.1835
Tan,Bar,Cos,Sim | 0.1827
Tan,Cos,Kul,Den | 0.1827
Tan,Kul,Fos,Den | 0.1827
Tan,Kul,Pea,Den | 0.1826
How do combinations
compare with Tanimoto ?
Combination | Number of times better | Number of times equal | Number of times less | Number of times better minus number of times less
For,Fos | 453 | 402 | 517 | -64
Fos,Sim | 442 | 347 | 583 | -141
SM,Fos | 438 | 404 | 530 | -92
SM,For,Fos | 437 | 378 | 557 | -120
For,Fos,Sim | 430 | 407 | 535 | -105
Tan,Fos | 218 | 967 | 187 | 31
Tan,Cos | 239 | 905 | 228 | 11
Tan,Bar,Cos,Fos | 254 | 883 | 235 | 19
Tan,Bar,Cos | 281 | 851 | 240 | 41
Tan,Bar,Fos | 293 | 837 | 242 | 51
Tan,Bar,Fos | 293 | 837 | 242 | 51
Tan,Bar | 324 | 766 | 282 | 42
Tan,Bar,Cos | 281 | 851 | 240 | 41
Bar,Fos | 351 | 708 | 313 | 38
Tan,Fos | 218 | 967 | 187 | 31
Which have overall best
improvement over Tanimoto:
2-fusions, 3-fusions, 4-fusions
or certain single coefficients ?
Improvement Over Tanimoto
(Best 10 of single and fusions)

Overall 10 best singles

Coef. | Avg. Imp.
Bar | 0.007
Pea | 0.005
Sti | 0.005
Cos | 0.003
Fos | 0.001
Kul | 0.000
Den | -0.004
Yul | -0.021
SM | -0.050
For | -0.085

Overall 10 best 2-fusions

Comb. | Avg. Imp.
Bar,Fos | 0.039
Tan,Fos | 0.034
Tan,Bar | 0.032
Tan,Kul | 0.032
Fos,Sti | 0.029
Bar,Kul | 0.029
Tan,Cos | 0.026
Bar,Cos | 0.026
Kul,Fos | 0.025
Fos,Yul | 0.025

Overall 10 best 3-fusions

Combination | Avg. Imp.
Tan,Bar,Fos | 0.033
Tan,Bar,Sim | 0.033
Bar,Kul,Fos | 0.030
Tan,Kul,Fos | 0.030
Bar,Cos,Kul | 0.029
Tan,Cos,Kul | 0.028
Bar,Fos,Sim | 0.028
Tan,Bar,Kul | 0.028
Bar,Cos,Fos | 0.026
Tan,For,Fos | 0.025

Overall 10 best 4-fusions

Combination | Avg. Imp.
Tan,Bar,Fos,Sim | 0.027
Bar,Cos,Kul,Fos | 0.017
Tan,Cos,Kul,Fos | 0.017
Tan,SM,Fos,Sim | 0.017
Tan,Bar,Kul,Fos | 0.016
Tan,SM,Bar,Fos | 0.016
Tan,Bar,Cos,Kul | 0.015
Bar,Cos,Fos,Sim | 0.015
Tan,SM,Kul,Fos | 0.015
Tan,Bar,Cos,Sim | 0.015
Which combinations are
best?
Target
Combination of good ind. Combination inv. Rus. Comb. inv. Rus & good ind. Comb. inv. For & good ind.
coefs.
& For.
coef.
coef.
Avg.
Tan,
num. Std.
Tan, Bar,
Rus, Rus, Rus,
Rus, Rus,
For, For,
of bits dev. of Bar, Tan, Cos, Bar, Cos, Rus, For, For, For, Rus, Rus, Rus, Tan, Cos, For, Cos, For, Tan, Cos,
set
bits set Fos Cos Sti Fos Ku2 For Cos Tan Sti Tan Cos Sti Cos Sti Tan For Sti Cos Sti
Unity 2D (Average bits set = 221.84, std. dev. = 70.78)
Enkephalinase inhibitor
163
36.78 0.03 -0.01 -0.01 0.01 0.01 -0.23 -0.17 -0.16 -0.18 -0.40 -0.44 -0.43 -0.31 -0.34 0.11 0.12 0.13 0.07 0.06
Prolylendopeptidase inhibitor 184
33.19
0.02 0.00 0.00 0.02 0.01 -0.09 -0.06 -0.06 -0.06 -0.30 -0.34 -0.28 -0.21 -0.20 0.06 0.05 0.02 0.05 0.04
Dopamine (D2) antagonist
198
43.26 0.05 0.01
Vasolidator
206
62
0.01 0.00
Thromboxane antagonist
209
37.24 0.02 0.00
Excitatory amino acid
209
59.23 0.02 0.00
inhibitor
Potassium channel blocker
220
49.41 -0.01 0.01
Leukotriene D4 antagonist
224
46.18 0.01 0.00
Factor Xa inhibitor
240
55.97 0.01 0.01
Endothelin ETA antagonist
272
60.77 -0.07 0.00
BCI 1052 (Average bits set = 100.64, std. dev. = 28.78)
0.02 0.03 -0.05 -0.08 -0.07 -0.07 -0.07 -0.27 -0.30 -0.25 -0.19 -0.19 0.12 0.12 0.08 0.11 0.09
-0.01 0.00 -0.01 -0.01 0.00 -0.02 0.00 -0.11 -0.14 -0.09 -0.08 -0.07 0.02 0.01 -0.02 0.02 -0.01
0.01 0.01 0.01 0.01 0.02 0.01 0.02 -0.17 -0.19 -0.15 -0.12 -0.10 0.04 0.04 0.01 0.05 0.04
-0.16 0.02 0.02 0.01 0.01 0.01 -0.18 -0.14 -0.15 -0.26 -0.11 -0.24 0.02 0.01 -0.19 0.03 -0.17
-0.03
0.00
0.01
-0.09
-0.01
0.01
0.01
-0.04
0.01
0.01
0.02
-0.03
0.06
0.03
0.00
-0.07
0.05
0.03
0.02
-0.05
0.04
0.03
0.02
-0.05
0.01
0.02
0.00
-0.12
0.04
-0.08
-0.08
0.11
0.02
-0.09
-0.08
0.09
0.03
-0.03
-0.04
0.08
0.05
-0.05
-0.03
0.10
0.02
-0.03
-0.02
0.05
-0.11
-0.01
-0.08
-0.24
-0.11
-0.03
-0.09
-0.25
-0.16
-0.12
-0.15
-0.39
-0.06
0.00
-0.04
-0.17
-0.12
-0.03
-0.07
-0.26
Squalene synthetase inhibitor 92
36.28
0.01 0.03 -0.04 0.01 0.03 0.01 0.02 0.01 -0.08 -0.04 -0.04 -0.14 0.01 -0.10 0.02 0.04 -0.03 0.03 -0.03
Reverse transcriptase inhibitor 97
23.56
0.03 0.00 -0.07 0.02 0.02 -0.04 -0.03 -0.02 -0.09 -0.20 -0.21 -0.26 -0.14 -0.20 0.09 0.09 0.01 0.07 -0.01
Antiischemic agent
CRF antagonist
Agent for systemic lupus
erythematosus
Vasopressin V1 antagonist
Oxazolidinone
Neurokinin antagonist
Anthracycline
Penicillin
99
107
22.76
16.49
0.04 0.00 -0.01 0.03 0.03 0.04 0.05 0.05 0.01 -0.10 -0.14 -0.16 -0.07 -0.12 0.09 0.09 0.05 0.11 0.05
-0.02 0.00 -0.06 -0.01 -0.01 0.03 0.03 0.03 -0.02 -0.04 -0.05 -0.08 -0.02 -0.06 -0.08 -0.10 -0.16 -0.05 -0.11
111
30.15
0.01 0.01 -0.02 0.01 0.01 0.02 0.01 0.01 -0.02 -0.04 -0.04 -0.07 -0.02 -0.05 -0.01 -0.01 -0.05 0.00 -0.03
119
123
125
126
159
19.33
12.1
20.03
14.89
24.93
-0.01
0.02
-0.05
0.00
-0.02
0.00
0.01
0.00
0.00
0.00
-0.21
-0.01
-0.04
-0.15
-0.07
0.00
0.02
-0.03
0.00
-0.01
0.00
0.01
-0.03
0.00
-0.01
-0.04
0.16
-0.11
0.00
-0.13
-0.02
0.16
-0.09
0.00
-0.09
-0.03
0.18
-0.09
0.00
-0.09
-0.24
0.14
-0.11
-0.15
-0.15
-0.02
-0.04
0.12
0.00
0.02
-0.01
-0.02
0.13
0.00
0.02
-0.22
0.06
0.12
-0.15
-0.06
-0.01
0.01
0.10
0.00
0.02
-0.22
0.05
0.09
-0.15
-0.06
-0.11
-0.05
-0.28
-0.01
-0.24
-0.11
-0.16
-0.32
-0.01
-0.26
-0.33
-0.28
-0.36
-0.16
-0.32
-0.08
0.01
-0.23
0.00
-0.15
-0.29
-0.10
-0.27
-0.15
-0.22
Probability-based similarity
searching
• Vector space similarity models (VSM), currently the most
widely used, do not incorporate the
importance of a particular fragment based on
information gathered from previously known
active and inactive compounds.
• In the probability-based models (PM), formal
probability theory and statistics are used to
estimate the probability that a structure is active
(relevant) or non-active (non-relevant) to the
query.
• In PM, structures whose relevance probability
exceeds their non-relevance probability are
ranked in decreasing order of their relevance.
Probability-based models
• The Binary Independence Retrieval (BIR) model
  - based on the presence or absence of independently
    distributed bits in active and inactive structures.
  - the probability of any given bit occurring in a structure is
    independent of the probability of occurrence of any
    other bit, whether in active structures or inactive
    structures.
• The Binary Dependence (BD) model
  - assumes the probability of any given bit occurring in an
    active structure is dependent on the probability of any
    other bit occurring in an active structure, and similarly
    for inactive structures.
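The slides do not give the BIR scoring formula. A minimal sketch, assuming the standard binary independence weighting from information retrieval (a log odds ratio per bit, summed over the bits set in a structure), is shown below. The function names and the toy probabilities are illustrative only.

```python
from math import log


def bir_weights(p_active, p_inactive):
    """Per-bit log odds weights under the independence assumption.

    p_active[i]   = probability that bit i is set in an active structure
    p_inactive[i] = probability that bit i is set in an inactive structure
    All probabilities must lie strictly between 0 and 1.
    """
    return [
        log(p * (1 - q) / (q * (1 - p)))
        for p, q in zip(p_active, p_inactive)
    ]


def bir_score(bits, weights):
    """Score a structure by summing the weights of the bits it has set."""
    return sum(weight for bit, weight in zip(bits, weights) if bit)
```

Bits that are more common in actives than in inactives get positive weights, so structures carrying them move up the ranking; bits equally common in both classes contribute nothing.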
Query fusion in probability
models
• Results and information obtained from previous
queries are used in subsequent queries
• Based on these compounds, the probability that
bit b_i appears in an active structure and the
probability that bit b_i appears in an inactive
structure are computed for each bit i.
  - This information is used to obtain the ranking score
    function (RSV) for the second set.
• The same procedure is repeated for subsequent
searches on other datasets
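The per-bit probabilities mentioned above can be estimated by counting how often each bit is set among the actives and inactives found in an earlier search. The sketch below is an assumption about how this might be done: the additive smoothing constant of 0.5, which keeps the estimates away from 0 and 1, is a common convention and is not taken from the slides, and the function name is hypothetical.

```python
def bit_probabilities(bit_strings, smoothing=0.5):
    """Estimate, for each bit position, the probability that the bit is set.

    bit_strings: non-empty list of equal-length 0/1 sequences from one class
                 (for example, the actives retrieved by a previous query).
    smoothing:   pseudo-count added to avoid probabilities of exactly 0 or 1.
    """
    if not bit_strings:
        raise ValueError("at least one bit string is required")
    total = len(bit_strings)
    length = len(bit_strings[0])
    return [
        (sum(bs[i] for bs in bit_strings) + smoothing) / (total + 2 * smoothing)
        for i in range(length)
    ]
```

Running this once on the known actives and once on the known inactives yields the two probability vectors needed to score the next dataset.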
Results for probability-based
searching (Aids dataset)
Results for probability-based
searching (cont.)
AVERAGE NO. OF ACTIVES AT TOP 5% OF LIST
[Bar chart: No of actives structure retrieved for the BIR and BD models, with No fusion and With query fusion; data labels on the bars: 141, 133, 129 and 120]
Compound selection
• More compounds are available than can
be screened cost-effectively
• Compound selection techniques can be
used to
select compounds for screening
 choose compounds to purchase from external
suppliers
 design combinatorial libraries

Virtual Screening Strategies
• Diversity
 lead
generation libraries for
screening against a number of
targets
 a structurally diverse library
should cover biological activity
space well
Diverse Subset Selection
• Dissimilarity-based compound selection
(DBCS)
• Clustering
• Partitioning/Cell-based methods
• Optimisation techniques
DBCS - Dissimilarity-based
methods
• Compounds are selected directly based
on distance
• Dissimilarity-based methods aim
to identify a diverse subset of n
molecules from a dataset
containing N molecules by
iteratively selecting compounds
that are as dissimilar as possible to
those that have already been
selected.
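The iterative selection described above can be sketched with the MaxMin variant, in which each new compound is the one whose nearest already-selected neighbour is furthest away. The slides do not say which variant was used, so MaxMin, the Tanimoto-based distance and the choice of the first compound as seed are all assumptions; fingerprints are represented here as Python sets of set-bit positions.

```python
def tanimoto(x, y):
    """Tanimoto similarity between two sets of set-bit positions."""
    union = len(x | y)
    return len(x & y) / union if union else 1.0


def maxmin_select(fingerprints, n, seed_index=0):
    """Pick n mutually dissimilar compounds; returns their indices in pick order."""
    selected = [seed_index]
    # Distance from every compound to its nearest selected compound so far.
    nearest = [1.0 - tanimoto(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(selected) < min(n, len(fingerprints)):
        candidate = max(
            (i for i in range(len(fingerprints)) if i not in selected),
            key=lambda i: nearest[i],
        )
        selected.append(candidate)
        for i, fp in enumerate(fingerprints):
            distance = 1.0 - tanimoto(fp, fingerprints[candidate])
            nearest[i] = min(nearest[i], distance)
    return selected
```

Keeping the running `nearest` list means each iteration needs only one pass over the dataset rather than a comparison against every selected compound.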
[Figure: five selected compounds, numbered 1 to 5, plotted in a property space with axes Property P1 and Property P2]
Selecting Diverse Subsets
• If similar molecules
have similar
properties, then will
dissimilar molecules
have dissimilar
properties?
Does Rational Selection Really
Perform Better Than
Random?
 Taylor [Taylor, 1995] : dissimilarity based selection
worse than random.
 Young et al. [Young et al., 1996] : rational selection
performed no better than random
 Spencer [Spencer, 1997] : maximum dissimilarity
algorithm identified no more actives than random
 Dixon and Koehler [Dixon and Koehler, 1999] :
number of biological targets covered frequently no
more than that obtained with random
Could This Be Attributed To
Weaknesses of Similarity Measures
in Measuring Dissimilarity ?
• Does the similar property principle also imply
that structurally dissimilar compounds
have dissimilar activity?
• Can bit string based similarity measures be
used effectively as a quantitative measure of
dissimilarity?
What Have Been Found
Thus Far
• Non-intuitive results of bit
string based measures,
especially at low similarity
levels
  - Size factor (Lajiness, 1997; Flower, 1997; Dixon and
    Koehler, 1999)
  - Shape factor (Flower, 1997)
  - Global similarity (Flower, 1997)
• Molecules with low similarity values are not
necessarily dissimilar from one another
What have been found
thus far (cont.)
• Statistical preference of values
(Godden et al., 1999)
  - only a certain range of values is occupied
  - the distribution peaks at around 0.3 to 0.4, with some
    discrete peaks
  - these statistically preferred values are not outside
    the range of values normally used for
    dissimilarity measures; thus dissimilarity
    measures can be influenced by chance occurrence
    or biased by statistically preferred Tanimoto
    values
What we want to know …
Is there a threshold similarity value (or distance) between two compounds up to which they still have a meaningful biological similarity?
If that is so, then we can say that …
The DBCS approach of finding the most dissimilar molecule is not very useful, because at low levels of similarity the values do not have much meaning and compounds might as well be selected randomly.
Now, how do we find out about this?
• Check whether the distribution of bit string based similarity values can be reproduced by randomly generated bit strings, especially at lower similarity values
• If it can, then the range of similarity values over which the real and random distributions are the same does not carry much meaning
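A minimal sketch of how such a distribution might be built: all pairwise Tanimoto values in a dataset are counted into equal-width bins. The 25 bins (width 0.04) follow the axis ticks of the frequency plots in this deck; the set-of-bit-positions representation and the names `tanimoto_histogram` and `toy` are assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def tanimoto_histogram(fingerprints, n_bins=25):
    """Count every pairwise Tanimoto value into n_bins equal-width bins over [0, 1]."""
    hist = [0] * n_bins
    for a, b in combinations(fingerprints, 2):
        value = tanimoto(a, b)
        # A value of exactly 1.0 is placed in the last bin
        hist[min(int(value * n_bins), n_bins - 1)] += 1
    return hist


toy = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {7, 8, 9}]
hist = tanimoto_histogram(toy)
print(hist)
```

The same function can be run on a real dataset and on each random dataset, and the resulting histograms compared bin by bin.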
Real life datasets used
• ID Alert, a database of 11607 investigational drugs of various activities.
• Aids99, a selection of 5772 compounds tested for Aids, which contains various actives, moderately actives and inactives.
Types of bit strings tested
• BCI bit strings
  • structural key based (dictionary used: BCI 1052)
  • folded into 1052 bits
• Daylight fingerprints
  • hashed fingerprints
  • length = 2048 bits
• Unity 2-D
  • hashed + structural key based
  • length = 992 bits
Types of random bit strings tested
• Purely randomly generated bit strings.
• Randomly generated bit string dataset with priority given to bits that occur more frequently in the real dataset
• Randomly generated bit string dataset with co-occurrence of bits derived from the real dataset taken into account
• Randomly generated bit string dataset with priority given to bits that occur more frequently in the real dataset and co-occurrence of bits derived from the real dataset taken into account
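The first two kinds of random dataset could be generated roughly as below. This is a sketch under stated assumptions: each random string is given a fixed number of set bits (`bits_set`), frequency priority is modelled as sampling bit positions in proportion to how often they are set in the real data, and add-one smoothing keeps every position selectable. The slides do not give the actual procedure, and the co-occurrence variants are not sketched here. All function and variable names are hypothetical.

```python
import random


def purely_random(n_strings, length, bits_set, rng):
    """Each string sets `bits_set` positions chosen uniformly at random."""
    return [set(rng.sample(range(length), bits_set)) for _ in range(n_strings)]


def frequency_weighted(real, length, bits_set, rng):
    """One random string per real string; frequent bits are more likely to be set."""
    if bits_set > length:
        raise ValueError("bits_set cannot exceed the bit string length")
    counts = [1] * length  # add-one smoothing (an assumption)
    for fp in real:
        for bit in fp:
            counts[bit] += 1
    positions = range(length)
    generated = []
    for _ in real:
        chosen = set()
        # Draw with replacement until enough distinct positions are set
        while len(chosen) < bits_set:
            chosen.add(rng.choices(positions, weights=counts)[0])
        generated.append(chosen)
    return generated


rng = random.Random(0)  # fixed seed so the sketch is repeatable
real = [{0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 5, 6}]  # toy "real" fingerprints
plain = purely_random(len(real), 16, 4, rng)
weighted = frequency_weighted(real, 16, 4, rng)
print(plain)
print(weighted)
```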
Similarity measures studied
• Tanimoto coefficient
• Complement of Euclidean
distance
• Cosine coefficient
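For two bit strings with a and b bits set and c bits in common, the three measures can be written as in the sketch below. The Tanimoto and cosine forms are the standard binary ones; normalising the Euclidean distance by the square root of the string length before taking the complement is an assumption, since the slides do not state which normalisation was used. Function names are hypothetical.

```python
import math


def bit_counts(x, y):
    """Return (a, b, c): bits set in x, bits set in y, bits set in both."""
    a = sum(x)
    b = sum(y)
    c = sum(1 for i, j in zip(x, y) if i and j)
    return a, b, c


def tanimoto(x, y):
    a, b, c = bit_counts(x, y)
    return c / (a + b - c) if (a + b - c) else 1.0


def cosine(x, y):
    a, b, c = bit_counts(x, y)
    return c / math.sqrt(a * b) if a and b else 0.0


def euclidean_complement(x, y):
    """1 - normalised Euclidean distance; the sqrt(n) normalisation is an assumption."""
    a, b, c = bit_counts(x, y)
    return 1.0 - math.sqrt((a + b - 2 * c) / len(x))


x = [1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0, 1, 0]
print(tanimoto(x, y), cosine(x, y), euclidean_complement(x, y))
```

Here a = 4, b = 4 and c = 3, so the three values are 3/5, 3/4 and 1 - sqrt(2/8).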
3 way comparisons done
• Line graphs
• Chi-square
• Kolmogorov-Smirnov
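The two statistics can be sketched for binned frequency distributions as follows. This is an illustration only: the bin counts are made up, the random histogram is treated as the expected distribution for the chi-square statistic, and the Kolmogorov-Smirnov statistic is computed on binned data, which only approximates the statistic on raw values. How the tests were actually applied in the study is not stated in the slides.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square over histogram bins; bins with zero expected count are skipped."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def ks_statistic(hist_a, hist_b):
    """Largest gap between the cumulative distributions of two binned histograms."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    cum_a = cum_b = 0
    gap = 0.0
    for count_a, count_b in zip(hist_a, hist_b):
        cum_a += count_a
        cum_b += count_b
        gap = max(gap, abs(cum_a / total_a - cum_b / total_b))
    return gap


# Made-up bin counts, for illustration only
real_hist = [5, 20, 40, 25, 10]
random_hist = [10, 30, 40, 15, 5]
chi2 = chi_square_statistic(real_hist, random_hist)
ks = ks_statistic(real_hist, random_hist)
print(chi2, ks)
```

Turning either statistic into a significance decision additionally needs the degrees of freedom or sample sizes and the corresponding critical values.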
BCI Fingerprints, Using Tanimoto
[Figures: Comparison of Tanimoto frequency distributions for bit strings in Aids99 and ID Alert, BCI 1052 dictionary (total = 1052 bits). x-axis: Tanimoto values (0 to 1); y-axis: frequency. Series: Real, Random, RandomBP, RandomCR, RandomBPCR (legend: Real, Random, Ran. + Freq., Ran. + Corr., Ran. + Freq. + Corr.).]
Panels, by number of bits set:
• 60-79 bits set: Aids99, 1553 compounds; ID Alert, 2038 compounds
• 130-149 bits set: Aids99, 253 compounds; ID Alert, 1080 compounds
• 160-179 bits set: Aids99, 33 compounds; ID Alert, 226 compounds
The ability of random bit strings to obtain a similar distribution of intermolecular similarity values
• All frequencies generated by the real life datasets are significantly different from frequencies obtained from randomly generated bit strings
  • for all ranges of values
  • for all similarity measures
  • confirmed by Chi-square and Kolmogorov-Smirnov values
Average % of actives covered in percentiles
of Tanimoto structural similarity ranking in
the AIDS dataset.
All similarity values do have some meaning
• The distribution of bit string based similarity measures cannot be reproduced by randomly generated bit strings, even at lower similarity values
• Randomly generated bit strings produce a significantly different distribution from the real dataset
Clustering
• Group molecules into clusters so that
  • molecules within a cluster are similar
  • molecules from different clusters are dissimilar
• Choose one or more from each cluster
• Ward’s clustering has been used as the industry standard
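Once cluster labels are available from any clustering method, the "choose one from each cluster" step might look like the sketch below. Picking the medoid (the member with the highest total Tanimoto similarity to the rest of its cluster) is an assumption; the slides do not say how representatives were chosen. The clustering itself is not shown, and `pick_representatives` and the toy data are hypothetical.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_representatives(fingerprints, labels):
    """Return {cluster label: index of the medoid compound in that cluster}."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    representatives = {}
    for label, members in clusters.items():
        # Medoid: member most similar, in total, to all members of its cluster
        representatives[label] = max(
            members,
            key=lambda i: sum(tanimoto(fingerprints[i], fingerprints[j]) for j in members),
        )
    return representatives


fps = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {7, 8, 9}, {7, 8, 10}]
labels = [0, 0, 0, 1, 1]  # cluster assignments from some clustering method
reps = pick_representatives(fps, labels)
print(reps)
```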
Soft-computing based clustering
• Fuzzy c-means clustering
• Genetic-algorithm based clustering
• Kohonen Self Organizing Maps
• Neural Gas Network
• Enhanced Neural Gas Network
• Fuzzy C-means
• Fuzzy Gustafson Kessel
• Hierarchical Fuzzy C-means
• Fuzzy Kohonen Self Organizing Maps
Performance Measure
• Proportion of actives in the active cluster subset
• A cluster is active if it contains at least one active molecule
• The set of all active clusters is called the active cluster subset
• All singletons are omitted from the active cluster subset
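The measure defined above can be sketched as a short function: collect the non-singleton clusters that contain at least one active, then divide the number of actives in them by their total number of members. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def proportion_of_actives(labels, is_active):
    """Proportion of actives in the active cluster subset, with singletons omitted."""
    clusters = defaultdict(list)
    for label, active in zip(labels, is_active):
        clusters[label].append(active)
    actives = members = 0
    for flags in clusters.values():
        # Keep only non-singleton clusters with at least one active molecule
        if len(flags) > 1 and any(flags):
            actives += sum(flags)
            members += len(flags)
    return actives / members if members else 0.0


# Toy example: cluster 1 has no actives, cluster 2 is an active singleton
labels = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
is_active = [True, False, False, False, False, True, True, True, False, False]
score = proportion_of_actives(labels, is_active)
print(score)
```

In the toy example, clusters 0 and 3 form the active cluster subset: 3 actives among 7 molecules.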
Comparison of Ward’s clustering, fuzzy c-means and fuzzy Ward’s clustering for compound selection, based on % of actives in active clusters in the Aids dataset, binary representation
Comparison of Ward’s clustering, GA-based clustering and the combination of Ward’s and GA, based on the proportion of active structures in the Aids dataset, binary representation
[Figure: proportion of actives in active clusters (y-axis, 0.2 to 0.4) against number of compounds in active clusters (x-axis: 916, 887, 883, 833, 746, 696, 672, 640) for Ward's, Ward's and GA, and GA.]
Dataset
• MDL’s MDDR database
  • 1388 molecules from 7 biological classes
  • Dragon for computation of TIs
  • BCI dictionary and toolkits have been used for bit strings
  • PCA has been used
  • A data matrix of 1388 x 10
  • BCI implementation of Ward's and Group Average
  • Matlab implementation for neural networks

Table 1: Groupings and some characteristics of the dataset

S.No  Activity (No. of molecules)

1  Interacting on 5HT receptor
   Potentially useful in the treatment of depression, anxiety, hypertension, eating disorders, obesity, drug abuse, cluster headache, migraine, obsessive compulsive, and associated vascular disorders, panic attacks, agoraphobia eating, urinary incontinence and impotence.
   • 5HT antagonists: 48
   • 5HT1 agonists: 66
   • 5HT1C agonists: 57
   • 5HT1D agonists: 100

2  Antidepressants
   Potentially useful as an antiepileptic, antiparkinsonian, neuroprotective, antidepressant, antispastic and/or hypnotic agent. Some of the compounds may be useful in the treatment of dopamine-related CNS disorders such as Parkinson's disease and schizophrenia.
   • Mao A inhibitors: 84
   • Mao B inhibitors: 174

3  Antiparkinsonians
   Potentially useful in the treatment of septic shock, congestive heart failure and hypertension and in the prevention of acute renal failure.
   • Dopamine (D1) agonists: 32
   • Dopamine (D2) agonists: 104

4  Antiallergic/antiasthmatic
   Most of these are used as antiinflammatory, antiasthmatic and antiischemic agents. However, adenosine (A3) antagonists are useful as a tool for the pharmacological characterization of the human A3 receptor.
   • Adenosine A3 antagonists: 73
   • Leukotriene B4 antagonists: 150

5  Agents for Heart Failure
   Potentially useful as a bronchodilator, smooth muscle relaxant or cardiotonic agent, accelerator of hormone secretion, platelet aggregation inhibitor, etc.
   • Phosphodiesterase inhibitors: 100

6  Antiarrhythmics
   Most of the potassium channel blockers block the cardiac ion channel carrying the rapid component of the delayed rectifier potassium current.
   • Potassium channel blockers: 100
   • Calcium channel blockers: 100

7  Antihypertensives
   • ACE inhibitors: 100
   • Adrenergic (alpha 2) blockers: 100

Total molecules: 1388
Summary of Clustering using Topological Indices
[Figure: % proportion of actives (y-axis, 12 to 26) against number of clusters (x-axis, 10 to 100) for FCM, GK, GPAV, WARD's, F-SOM, SOM, SOM_POW, NG, ENG and HFC.]
Summary of Clustering using BCI Bit Strings
[Figure: % proportion of actives (y-axis, 0 to 50) against number of clusters (x-axis, 10 to 100) for Wards10, GpAv10, SOM, NG, ENG and ART1.]
Acknowledgements
• Jehan Zeb Shah, Shahrin Huspi, Rosmayati Mohamed, Abo Obaida, Willa Welmina Geoffrey, Peter Willett, John Holliday
• Barnard Chemical Information Ltd., United Kingdom – provision of cheminformatics software
• Krebs Institute of Biomolecular Research, United Kingdom – research collaborator, external supervision for research student
• Daylight Chemical Information Systems, Inc., Mission Viejo, California – provision of cheminformatics software and data
• CyberChem Group, Fakulti Sains UTM – provision of cheminformatics software and data
Thank you