Virtual screening using active set dependent optimization


Slide 1
Optimized Virtual Screening
Miklós Vargyas
Zsuzsanna Szabó
György Pirok
Ferenc Csizmadia
Matthias Steger
Modest von Korff
ChemAxon Ltd.
AXOVAN AG
Allschwil, Switzerland
(Axovan is now Actelion.)
Slide 2
Drug research
Is it searching for a needle in a haystack?
corporate database
structures found
Slide 3
Drug research
Find something similar to a fistful of needles
query structures (known actives)
corporate database (targets)
structures found (virtual hits)
Slide 4
Molecular similarity
What is it?
Chemical, pharmacological or biological properties of two compounds
match.
The more features two molecules have in common, the higher their similarity.
Chemical
Pharmacophore
Slide 5
Molecular similarity
How to calculate it?
Quantitative assessment of similarity/dissimilarity of structures
• needs a numerically tractable form
• molecular descriptors, fingerprints, structural keys
Sequences/vectors of bits or numeric values that can be compared by distance functions or similarity metrics.
Euclidean distance:
$$E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Tanimoto coefficient:
$$T(x, y) = \frac{B(x \,\&\, y)}{B(x) + B(y) - B(x \,\&\, y)}$$
where B(x) is the number of bits set in x.
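As a minimal sketch of the two measures named on this slide, the following Python functions compute the Euclidean distance between two numeric vectors and the Tanimoto coefficient between two bit fingerprints. Storing fingerprints as Python integers and returning 1.0 for two empty fingerprints are assumptions made for illustration only.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def bit_count(bits):
    """Number of bits set in a non-negative integer (B(x) on the slide)."""
    return bin(bits).count("1")


def tanimoto(x, y):
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    common = bit_count(x & y)
    union = bit_count(x) + bit_count(y) - common
    # Assumed convention: two empty fingerprints count as identical.
    return common / union if union else 1.0


distance = euclidean([0, 3], [4, 0])
similarity = tanimoto(0b1100, 0b0110)
```

Here `distance` is 5.0 and `similarity` is 1/3, because the two fingerprints share one set bit out of three distinct set bits.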
Slide 6
Molecular descriptors
Example 1: chemical fingerprint
hashed binary fingerprint
• encodes topological properties of the chemical graph: connectivity, edge label (bond type), node label (atom type)
• allows the comparison of two molecules with respect to their chemical structure
Construction
1. find all 0, 1, …, n step walks in the chemical graph
2. generate a bit array for each walk with a given number of bits set
3. merge the bit arrays with a logical OR operation
Slide 7
Molecular descriptors
Example 1: chemical fingerprint
Example: CH3 – CH2 – OH, walks from the first carbon atom

length   walk       bit array
0        C          1010000000
1        C–H        0001010000
1        C–C        0001000100
2        C–C–H      0001000010
2        C–C–O      0100010000
3        C–C–O–H    0000011000

merge bit arrays for the first carbon atom: 1111011110
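The merge step of this example can be sketched in a few lines of Python. The six bit arrays are copied from the slide. The `walk_to_bits` helper is hypothetical: the slide does not say which hash function is used, so SHA-256 stands in purely for illustration and will not reproduce the slide's bit patterns.

```python
import hashlib


def walk_to_bits(walk, n_bits=10, bits_per_walk=2):
    """Hypothetical hashing step: map a walk string to at most
    bits_per_walk set bits in an n_bits-wide array (stored as an int)."""
    digest = hashlib.sha256(walk.encode("utf-8")).digest()
    bits = 0
    for k in range(bits_per_walk):
        bits |= 1 << (digest[k] % n_bits)
    return bits


def merge(bit_arrays):
    """Merge bit arrays with a logical OR."""
    merged = 0
    for bits in bit_arrays:
        merged |= bits
    return merged


# Bit arrays of the walks from the first carbon atom, as listed on the slide.
slide_arrays = [
    "1010000000",  # C
    "0001010000",  # C-H
    "0001000100",  # C-C
    "0001000010",  # C-C-H
    "0100010000",  # C-C-O
    "0000011000",  # C-C-O-H
]
merged_str = format(merge(int(s, 2) for s in slide_arrays), "010b")
```

`merged_str` equals `1111011110`, the merged array given on the slide.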
Slide 8
Molecular descriptors
Example 1: chemical fingerprint
0100010100010100010000000001101010011010100000010100000000100000
0100010100010100010000000001101010011010100000000100000000100000
Slide 9
Molecular descriptors
Example 2: pharmacophore fingerprint
• encodes pharmacophore properties of molecules as frequency counts of pharmacophore point pairs at a given topological distance
• allows the comparison of two molecules with respect to their pharmacophore
Construction
1. map pharmacophore point types to atoms
2. calculate the length of the shortest path between each pair of atoms
3. assign a histogram to every pharmacophore point pair and count the frequency of the pair with respect to its distance
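The three construction steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the molecule is given as a hypothetical adjacency dictionary, point types are single letters (A, D, H) with untyped atoms omitted, shortest paths are found by breadth-first search, and distances above `max_distance` are ignored.

```python
from collections import Counter, deque
from itertools import combinations


def shortest_paths(adjacency, start):
    """Breadth-first search: bond-count distance from start to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def pharmacophore_fingerprint(adjacency, types, max_distance=6):
    """Count pharmacophore point pairs per topological distance."""
    counts = Counter()
    dist_from = {atom: shortest_paths(adjacency, atom) for atom in adjacency}
    for a, b in combinations(sorted(adjacency), 2):
        type_a, type_b = types.get(a), types.get(b)
        if type_a is None or type_b is None:
            continue  # atoms without a pharmacophore type are skipped
        d = dist_from[a].get(b)
        if d is None or d > max_distance:
            continue
        # Pair label independent of atom order, e.g. "DA", "HA", "HD".
        pair = "".join(sorted((type_a, type_b), reverse=True))
        counts[(pair, d)] += 1
    return counts


# Toy chain of four atoms: donor - hydrophobic - hydrophobic - acceptor.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
types = {0: "D", 1: "H", 2: "H", 3: "A"}
toy_fp = pharmacophore_fingerprint(adjacency, types)
```

For the toy chain, the donor and acceptor are three bonds apart, so `toy_fp[("DA", 3)]` is 1.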
Slide 10
Molecular descriptors
Example 2: pharmacophore fingerprint
Pharmacophore point type based coloring of atoms: acceptor (A), donor (D), hydrophobic (H), none.
[Figure: two histograms; x-axis: pharmacophore point pairs AA, DA, DD, HA, HD, HH, each at topological distances 1 to 6; y-axis: frequency count, 0 to 12]
Slide 11
Virtual screening using fingerprints
Individual query structure
[Diagram: the query is encoded as a query fingerprint; its proximity to each target fingerprint determines which targets are reported as hits]
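A minimal sketch of screening with a single query fingerprint: every target whose dissimilarity to the query stays within a threshold is reported as a hit. The choice of 1 − Tanimoto as the proximity measure, the threshold value and the 8-bit toy fingerprints are assumptions for illustration.

```python
def tanimoto_dissimilarity(x, y):
    """1 - Tanimoto coefficient for fingerprints given as bit strings."""
    a, b = int(x, 2), int(y, 2)
    common = bin(a & b).count("1")
    union = bin(a).count("1") + bin(b).count("1") - common
    return 1.0 - (common / union if union else 1.0)


def screen(query, targets, threshold):
    """Return (target index, dissimilarity) for all targets within
    the threshold, most similar first."""
    hits = []
    for index, target in enumerate(targets):
        d = tanimoto_dissimilarity(query, target)
        if d <= threshold:
            hits.append((index, d))
    return sorted(hits, key=lambda hit: hit[1])


query = "11001010"
targets = ["11001010", "11001000", "00110101", "10001010"]
hits = screen(query, targets, threshold=0.3)
hit_indices = [index for index, _ in hits]
```

With these toy values the first, second and fourth targets are hits; the third shares no bits with the query and is rejected.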
Slide 12
Virtual screening using fingerprints
Multiple query structures
[Diagram: the fingerprints of the query structures are combined into a single hypothesis fingerprint; its proximity to each target fingerprint determines which targets are reported as hits]
Slide 13
Hypothesis fingerprints
Advantages
• allows faster operation
• compiles the features common to the individual actives
Hypothesis types

Active 1   0     2    7      1    0    1    6    4    0    0      9    0
Active 2   1     6    0      4    3    3    1    2    2    0      5    1
Active 3   2     4    4      1    0    2    5    3    4    3      4    5

Minimum    0     2    0      1    0    1    1    2    0    0      4    0
Average    1     4    3.67   2    1    2    4    3    2    1.33   6    2
Median     1.5   4    5.5    1    0    2    5    3    3    0      5    3
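The hypothesis types can be sketched column by column. The three active fingerprints below are taken from the table. The Median row is not a plain median: its values are consistent with a rule, inferred here and not stated on the slide, that returns 0 when the feature is absent from at least half of the actives and otherwise takes the median of the non-zero counts. The average is assumed to be the plain column mean.

```python
from statistics import median

actives = [
    [0, 2, 7, 1, 0, 1, 6, 4, 0, 0, 9, 0],  # Active 1
    [1, 6, 0, 4, 3, 3, 1, 2, 2, 0, 5, 1],  # Active 2
    [2, 4, 4, 1, 0, 2, 5, 3, 4, 3, 4, 5],  # Active 3
]


def minimum_hypothesis(fingerprints):
    """Column-wise minimum of the active fingerprints."""
    return [min(column) for column in zip(*fingerprints)]


def average_hypothesis(fingerprints):
    """Column-wise arithmetic mean (assumed definition)."""
    return [sum(column) / len(column) for column in zip(*fingerprints)]


def median_hypothesis(fingerprints):
    """Median with special handling of absent features (inferred rule):
    0 if at most half of the actives have the feature, otherwise the
    median of the non-zero counts."""
    result = []
    for column in zip(*fingerprints):
        nonzero = [value for value in column if value != 0]
        if 2 * len(nonzero) <= len(column):
            result.append(0)
        else:
            result.append(median(nonzero))
    return result


minimum_row = minimum_hypothesis(actives)
median_row = median_hypothesis(actives)
```

Under these assumptions `minimum_row` and `median_row` reproduce the Minimum and Median rows of the table. How ties (exactly half of the actives lacking a feature) should be handled cannot be read from a three-active example, so that branch is a guess.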
Slide 14
Hypothesis fingerprints
Minimum
Advantages
• strict conditions for hits if actives are fairly similar
Disadvantages
• false results with asymmetric metrics
• misses common features of highly diverse sets
• very sensitive to one missing feature
Average
Advantages
• captures common features of more diverse active sets
Disadvantages
• less selective if actives are very similar
Median
Advantages
• captures common features of more diverse active sets
• specific treatment of the absence of a feature
• less sensitive to outliers
Disadvantages
• less selective if actives are very similar
Slide 15
Does this work?
                        Pharmacophore fingerprint   Chemical fingerprint
Active set     Size     Tanimoto     Euclidean      Tanimoto     Euclidean
5-HT3          12       20.14        12.55          776.19       461.44
ACE            89       1.99         1.42           3.71         1.74
Angiotensin2   10       22.80        27.81          183.45       173.91
Beta2          50       3.59         1.52           7.52         2.65
D2             13       61.25        27.64          302.52       155.61
delta          20       109.53       11.66          114.48       56.22
Ftp            35       50.92        46.88          571.50       575.16
mGluR1         18       70.47        5.59           347.72       130.14
NPY-5          139      1.09         1.00           1.46         1.44
Thrombin       8        2.46         2.56           3.71         1.67
Slide 16
Then why do we need optimization?
Too many hits
Slide 17
Then why do we need optimization?
Inconsistent dissimilarity values
0.57
0.47
0.55
Slide 18
What can be optimized?
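Parameterized metrics
Scaled, asymmetric Tanimoto:
$$D_{\mathrm{Tanimoto}}(x, y) = 1 - \frac{\sum_i s_i \min(x_i, y_i)}{\alpha \left( \sum_i x_i - \sum_i s_i \min(x_i, y_i) \right) + (1 - \alpha) \left( \sum_i y_i - \sum_i s_i \min(x_i, y_i) \right) + \sum_i s_i \min(x_i, y_i)}$$
$\alpha \in [0, 1]$: asymmetry factor
$s_i \in \mathbb{N}$: scaling factor
Weighted, asymmetric Euclidean:
$$D_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{x_i > y_i} w_i \, \alpha \, (x_i - y_i)^2 + \sum_{x_i \le y_i} w_i \, (1 - \alpha) \, (x_i - y_i)^2}$$
$\alpha \in [0, 1]$: asymmetry factor
$w_i \in [0, 1]$: weights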
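The two parameterized metrics can be sketched as plain Python functions. Several details are assumptions: the symbols in the source formulas are partly lost, so the sketch gives the factor α to positions where x_i > y_i and 1 − α to the rest, takes a square root as in the basic Euclidean distance, and returns 0 for two all-zero vectors in the Tanimoto case.

```python
import math


def scaled_asymmetric_tanimoto(x, y, s, alpha):
    """Scaled, asymmetric Tanimoto dissimilarity.
    s: per-position scaling factors, alpha: asymmetry factor in [0, 1]."""
    common = sum(si * min(xi, yi) for xi, yi, si in zip(x, y, s))
    denominator = (
        alpha * (sum(x) - common)
        + (1 - alpha) * (sum(y) - common)
        + common
    )
    # Assumed convention: two all-zero vectors have dissimilarity 0.
    return 1.0 - common / denominator if denominator else 0.0


def weighted_asymmetric_euclidean(x, y, w, alpha):
    """Weighted, asymmetric Euclidean dissimilarity.
    w: per-position weights in [0, 1], alpha: asymmetry factor in [0, 1].
    Assumption: alpha applies where x_i > y_i, (1 - alpha) elsewhere."""
    total = 0.0
    for xi, yi, wi in zip(x, y, w):
        factor = alpha if xi > yi else 1 - alpha
        total += wi * factor * (xi - yi) ** 2
    return math.sqrt(total)


x = [1, 0, 2]
y = [1, 1, 0]
ones = [1, 1, 1]
d_tanimoto = scaled_asymmetric_tanimoto(x, y, ones, alpha=0.5)
d_euclid_x_only = weighted_asymmetric_euclidean(x, y, ones, alpha=1.0)
d_euclid_y_only = weighted_asymmetric_euclidean(x, y, ones, alpha=0.0)
```

With α = 1 only the positions where the first vector exceeds the second contribute to the Euclidean value; with α = 0 only the remaining positions do. This is what makes the metric asymmetric in its two arguments.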
Slide 19
Optimization of metrics
Step 1: optimize parameters for maximum enrichment
Step 2: validate metrics over an independent test set
[Diagram labels: known actives, selected targets, query set, training set, test set]
Slide 20
Optimization of metrics
Step 1: optimize parameters for maximum enrichment
[Diagram: the query set yields a query fingerprint (for example 1111100010000100001000101000), which is screened against the training set; outputs: target hits and active hits]
Slide 21
Optimization of metrics
One step of the algorithm
[Diagram: variables v1, v2, v3, …, vi, …, vn; legend: potential variable value, temporarily fixed value, final value, running variable value]
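The slide gives no pseudocode. One possible reading of the legend, offered only as an assumption, is a coordinate-wise search: one running variable tries each potential value while the others are temporarily fixed, and the best value found is kept. The sketch below uses a toy objective; in the setting of these slides the objective would presumably be the enrichment measured on the training set.

```python
def coordinate_search(objective, initial, candidates, sweeps=2):
    """Coordinate-wise maximisation (assumed reading of the diagram).
    objective: function of a list of variable values, to be maximised.
    candidates: potential values tried for each running variable."""
    values = list(initial)
    best = objective(values)
    for _ in range(sweeps):
        for i in range(len(values)):
            # Variable i is the running variable; the rest stay fixed.
            for candidate in candidates:
                trial = values[:i] + [candidate] + values[i + 1:]
                score = objective(trial)
                if score > best:
                    best, values = score, trial
    return values, best


# Toy objective with a known optimum (hypothetical, for illustration).
target = [0.2, 0.8, 0.5]


def toy_objective(values):
    return -sum((v - t) ** 2 for v, t in zip(values, target))


grid = [k / 10 for k in range(11)]
best_values, best_score = coordinate_search(toy_objective, [0.0] * 3, grid)
```

On this separable toy objective the search recovers the target values in a single sweep. A real enrichment objective need not be separable, which is why more than one sweep is allowed.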
Slide 22
Optimization of metrics
Step 2: validate metrics over an independent test set
[Diagram: the query set yields a query fingerprint (for example 1111100010000100001000101000), which is screened against the test set; outputs: target hits and active hits]
Slide 23
Results
Similar structures get closer
Dissimilarity values before optimization: 0.57, 0.47, 0.55
Dissimilarity values after optimization: 0.20, 0.06, 0.28
Slide 24
Results
Hit set size reduction
Active set: 18 mGlu-R1 antagonists
Target set: 10000 randomly selected drug-like structures + 7 spikes
Metric                              Enrichment   Test hits   Random hits
Tanimoto
  Basic                             70.47        5.43        172.00
  Scaled                            7.63         6.00        1101.71
  Asymmetric                        99.36        5.29        106.00
  Scaled Asymmetric                 11.94        5.86        731.14
Euclidean
  Basic                             5.59         5.43        1456.57
  Normalized                        11.33        5.14        791.29
  Asymmetric Normalized             18.58        4.71        368.71
  Weighted Normalized               296.30       4.14        27.57
  Weighted Asymmetric Normalized    281.30       3.43        17.00
Slide 25
Results
Improvement by optimization
Active set     Size    Euclidean   Optimized   Improvement ratio
5-HT3          12      12.55       239.24      49.26
ACE            89      1.42        6.50        4.64
Angiotensin2   10      27.81       85.45       11.15
Beta2          50      1.52        24.70       17.42
D2             13      27.64       123.25      11.19
delta          20      11.66       243.57      69.11
Ftp            35      46.88       71.54       5.35
mGluR1         18      5.59        296.30      70.93
NPY-5          139     1.00        3.22        3.25
Thrombin       8       2.56        4.57        2.62
Slide 26
Results
Active Hit Distribution
• offers a more intuitive way to evaluate the efficiency of screening
• based on sorting random set hits and known actives by dissimilarity value and counting the number of random set hits preceding each active in the sorted list
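The counting procedure described above can be sketched directly. The dissimilarity values are made up for illustration, and treating a random hit with exactly the same value as an active as not preceding it is an assumption.

```python
from bisect import bisect_left


def active_hit_distribution(active_scores, random_scores):
    """For each known active, in order of increasing dissimilarity,
    count the random-set hits that precede it in the sorted list."""
    ordered_random = sorted(random_scores)
    distribution = []
    for score in sorted(active_scores):
        # bisect_left counts random hits strictly below the active's
        # score; ties are assumed not to precede the active.
        distribution.append(bisect_left(ordered_random, score))
    return distribution


# Illustrative dissimilarity values (not taken from the slides).
active_scores = [0.30, 0.10, 0.35]
random_scores = [0.05, 0.20, 0.25, 0.40]
distribution = active_hit_distribution(active_scores, random_scores)
```

`distribution` is `[1, 3, 3]`: one random hit precedes the first active, three precede the second and the third. An ideal metric would give all zeros.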
[Figure: sorted list with dissimilarity values 0.014, 0.015, 0.017, 0.020, 0.022, 0.023, 0.027, 0.041, 0.043; plot axes: number of virtual hits vs. number of actives]
Slide 27
Results
ACE (pharmacophore similarity)
[Plot: number of hits (logarithmic axis, 1 to 10000) vs. number of actives among the hits (1 to 16); series: Euclidean, Optimized Euclidean]
Slide 28
Results
NPY-5 (pharmacophore similarity)
[Plot: number of hits (logarithmic axis, 1 to 10000) vs. number of active hits (1 to 49); series: Tanimoto, Euclidean, Optimized, Ideal]
Slide 29
Results
β2-adrenoceptor (pharmacophore similarity)
[Plot: number of hits (logarithmic axis, 1 to 10000) vs. number of active hits (1 to 18); series: Tanimoto, Euclidean, Optimized, Ideal]
Slide 30
Results
Structural or pharmacophore fingerprint?
Active set     Size    Chemical    Pharmacophore   Diversity*
5-HT3          12      692.21      239.24          0.30
ACE            89      4.29        6.50            0.56
Angiotensin2   10      190.76      85.45           0.40
Beta2          50      10.98       24.70           0.50
D2             13      358.10      123.25          0.30
delta          20      249.40      243.57          0.32
Ftp            35      575.16      71.54           0.30
mGluR1         18      350.86      296.30          0.37
NPY-5          139     1.52        3.22            0.47
Thrombin       8       3.59        4.57            0.46
* Average 1-Tanimoto coefficient between each pair of compounds in the active
set, based on chemical fingerprint.
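The diversity measure defined in the footnote can be sketched as the mean of 1 − Tanimoto over all pairs of fingerprints in an active set. Representing fingerprints as Python integers and the toy 4-bit values are assumptions for illustration.

```python
from itertools import combinations


def tanimoto(x, y):
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    common = bin(x & y).count("1")
    union = bin(x).count("1") + bin(y).count("1") - common
    return common / union if union else 1.0


def diversity(fingerprints):
    """Average 1 - Tanimoto over every pair of fingerprints in the set."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("diversity needs at least two fingerprints")
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy active set: two identical fingerprints and one partly overlapping.
toy_set = [0b1100, 0b0110, 0b1100]
toy_diversity = diversity(toy_set)
```

For the toy set the three pairwise dissimilarities are 2/3, 0 and 2/3, so `toy_diversity` is 4/9.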
Slide 31
Results
Scaffold hopping
Slide 32
Acknowledgements
Contributors: Nóra Máté
Szilárd Dóránt
Bernard Przybylski (Axovan)
The research was supported by
(Axovan is now part of Actelion.)