A Probabilistic Approach to High Throughput Drug Discovery

CHEMICAL COMPUTING GROUP INC.

Introduction and Motivation
Probability Modeling in Drug Discovery
Representation of Chemical Structures (Descriptors)
Focused Combinatorial Library Design
Summary and Outlook
Copyright © 2000 Chemical Computing Group Inc.
All Rights Reserved.
High Throughput Screening

Large-scale automation of biological assays (HTS)
    Use robotics to perform 10,000 to 100,000 screens per day
    Brute-force approach to drug discovery: “rapidly screen all compounds”

Noteworthy drawbacks to HTS:
    Economics: $1-$5 per assay (provided large collections are assayed)
    Logistics: compound formatting, inventory systems and other overhead
    Precision Loss: effectively “binary” measurement: active/inactive (pass/fail)
    High Error Rate: assay, synthesis failure, sample degradation, registration

Resulting effects:
    Quality for quantity tradeoff: lots of low quality data
    High level of noise (error) in data makes interpretation very difficult

HTS has gained acceptance and is routinely used to generate lead compounds for drug discovery projects
Sources of Compounds for HTS

Initial screening libraries (first libraries used in project)
    Historical “in-house” collection of compounds augmented with compounds purchased from external suppliers
    1 million+ compounds available means initial screening library must be designed (diversity retained using fewer compounds)
    Receptor biased initial screening libraries are a possibility

Follow-up libraries
    Parallel synthesis / combinatorial chemistry is an excellent source of large numbers of (new) compounds
    Synthesis of “all” analogs around a lead structure exhibits poor diversity but is very good for “local” exploration and lead follow-up

External screening compound purchasing and in-house combinatorial chemistry efforts have gained acceptance and are routinely used in lead generation and follow-up
High Throughput Discovery Cycle

Brute-force HTS not practical
    At least 10 trillion stable drug candidates
    At 1 billion screens per day, more than 27 years are needed to screen all 10 trillion

A discovery cycle can be used to reduce total screens
    Use HTS data to affect the selection of compounds to screen next
    Scale-up of the traditional experimental discovery cycle

[Figure: the discovery cycle: Parallel Synthesis → HTS Bioassay → HTS Data Analysis → Focused Library Design → Parallel Synthesis]
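The brute-force estimate on this slide can be checked with a few lines of arithmetic. This is only a sanity check of the quoted figures (10 trillion candidates, 1 billion screens per day); the variable names are illustrative.

```python
# Figures quoted on the slide.
candidates = 10 * 10**12        # "at least 10 trillion" stable drug candidates
screens_per_day = 10**9         # 1 billion screens per day

days = candidates / screens_per_day
years = days / 365.25           # average calendar year length

print(f"{days:,.0f} days = {years:.1f} years")
```

Even at a screening rate far above the 10,000 to 100,000 screens per day quoted for HTS robotics, exhaustive screening takes decades, which is the motivation for a data-driven discovery cycle.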
Required Technology for HTD Cycle

High Throughput Screening facility

Parallel synthesis and combinatorial chemistry capabilities

Methodology for automatically analyzing HTS data
    Humans find it difficult to interpret large amounts of noisy data
    Automatic HTS QSAR technology necessary for HTD cycle

Methodology for designing focused combinatorial libraries
    HTS QSAR results are used to bias a combinatorial library towards activity
    ADME properties and other design criteria should be taken into account

Meaningful representation of compounds
    Collection of molecular descriptors meaningful across projects (avoid time consuming variable selection procedures)
    Definition of a “chemistry space” for diversity studies (design of initial screening libraries)
Probability Modeling in Drug Discovery
Probabilistic Formalism (Bayesian Inference)

Step 1:
Write all observables as a joint probability
density; e.g., Pr (A,B,C)

Step 2:
Decompose density using probability theory
and Bayes’ theorem until components are
measurable; e.g.,
Pr (A,B,C) = Pr (B | A,C) Pr (C | A) Pr (A)

Step 3:
Model each component in product from a
database or experimental data set

Step 4:
Make predictions or estimates using
computed model of Pr(A,B,C)
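A minimal sketch of Step 2, assuming three binary observables A, B and C with a made-up joint distribution: it checks numerically that the product Pr(B | A,C) Pr(C | A) Pr(A) reproduces Pr(A,B,C). The random weights and helper names are illustrative and not part of the original method.

```python
import itertools
import random

random.seed(0)

# Toy joint distribution over three binary observables (made-up numbers).
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {abc: w / total for abc, w in zip(outcomes, weights)}

NAMES = ("A", "B", "C")

def marginal(**fixed):
    """Sum the joint over all outcomes consistent with the fixed values."""
    return sum(
        p for abc, p in joint.items()
        if all(abc[NAMES.index(k)] == v for k, v in fixed.items())
    )

def chain_rule(a, b, c):
    """Pr(B | A,C) * Pr(C | A) * Pr(A), each factor built from marginals."""
    pr_a = marginal(A=a)
    pr_c_given_a = marginal(A=a, C=c) / pr_a
    pr_b_given_ac = marginal(A=a, B=b, C=c) / marginal(A=a, C=c)
    return pr_b_given_ac * pr_c_given_a * pr_a

# Largest discrepancy between the joint and its decomposition.
max_error = max(abs(joint[o] - chain_rule(*o)) for o in outcomes)
print(max_error)
```

In practice each factor would be modeled from data (Step 3) rather than read off a known joint; the check only shows that the decomposition itself loses nothing.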
Probabilities in Speech Recognition

Successful speech recognizers select (predict) an output word sequence from an input waveform by maximizing the joint likelihood Pr (WAVE, WORDS)
    This is used (in part) to solve the isophonetic word sequence problem; e.g., “imadam” can be “I’m Adam” or “I’m a Dam” or “eye mad am”

Pr (WAVE, WORDS) = Pr (WAVE | WORDS) Pr (WORDS)
    Pr(WORDS) is the prior probability of a word sequence (utterance)
    Pr(WAVE | WORDS) is used to score the waveform under the assumption or hypothesis that the word sequence is WORDS
    Build model of Pr(WORDS) by training on, say, 500,000,000 words of newspaper text (the prior knowledge)

Pr(WORDS) effectively depresses the importance of unlikely utterances in favor of more plausible statements (real phrases)
Probabilities in Drug Discovery

Notation: Y = active (0/1), D = drugable (0/1), S = structure

Decompose:
Pr(Y, D, S) = Pr(D | Y, S) Pr(Y | S) Pr(S)
    Pr(D | Y, S): drugable given active structure (approximated by “is drug-like” efforts)
    Pr(Y | S): activity assuming structure (probabilistic QSAR efforts)

Product of probabilities balances competing goals
    Classification alone (e.g., RP) is not enough: weighted outcomes needed
    Methodology similar to “soft” classification problems or fuzzy logic
    Any method of probability modeling is valid (e.g., histogram, analytic)

Approximations introduced can be clearly identified
    e.g., Pr(D | Y, S) ≈ Pr(D | S): drugability is independent of activity (!?)
Pr(Y|X) via Binary QSAR

If Y is “binary activity” and X is a descriptor vector then

$$\Pr(Y = 1 \mid X = x) = \left[ 1 + \frac{\Pr(X = x \mid Y = 0)\,\Pr(Y = 0)}{\Pr(X = x \mid Y = 1)\,\Pr(Y = 1)} \right]^{-1}$$

[Figure: Bayes' theorem combines the prior Pr(Y) (active/inactive) with the descriptor distributions Pr(X|Y) over X1 ... Xn to give Pr(Y|X)]
Pathology of Binary QSAR is reasonable

If new structure is outside the training set then Pr(Y=1), the hit rate, is
used to make predictions (no other information available)
Distribution Estimates

Four distributions in formula are of two types
    Pr(Y=0), Pr(Y=1): prior probability of inactive/active
    Pr(X=x|Y=0), Pr(X=x|Y=1): probability of ligand assuming inactive/active

Modeling assumption: uncorrelated is treated as independent!
    Decompose multi-dimensional distribution into a product

    $$\Pr(X = x \mid Y = y) = \prod_i \Pr(X_i = x_i \mid Y = y)$$

    Estimate 2n+2 distributions instead of original four

Binary QSAR Algorithm
    Compute descriptor vectors: d_i
    De-correlate descriptors: x_i = Q(d_i - u)
    Estimate distributions from {x_i, y_i}: Pr (X = x | Y = y)
    Assemble p (x): Pr (Y = 1 | X = x)
    Predict for new descriptors d: p (Q (d - u))
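The five algorithm steps can be sketched as follows. This is a minimal illustration, not the published implementation. It assumes Q is a PCA whitening transform, u is the descriptor mean, and each one-dimensional distribution is a plain histogram with add-one smoothing. The function names, the bin count and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_binary_qsar(D, y, n_bins=8, eps=1e-9):
    """Estimate the priors and the per-component class-conditional histograms."""
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=int)
    u = D.mean(axis=0)
    # De-correlate: rotate onto principal axes and scale to unit variance.
    evals, evecs = np.linalg.eigh(np.cov(D - u, rowvar=False))
    keep = evals > eps
    Q = (evecs[:, keep] / np.sqrt(evals[keep])).T
    X = (D - u) @ Q.T
    n = X.shape[1]
    # Interior bin edges shared by both classes, one set per component.
    edges = [np.linspace(X[:, i].min(), X[:, i].max(), n_bins + 1)[1:-1]
             for i in range(n)]
    # Smoothed priors Pr(Y=0), Pr(Y=1).
    prior = np.array([(np.sum(y == c) + 1) / (len(y) + 2) for c in (0, 1)])
    # hist[c, i, b] approximates Pr(X_i in bin b | Y = c).
    hist = np.ones((2, n, n_bins))
    for c in (0, 1):
        for i in range(n):
            b = np.searchsorted(edges[i], X[y == c, i])
            hist[c, i] += np.bincount(b, minlength=n_bins)
    hist /= hist.sum(axis=2, keepdims=True)
    return {"u": u, "Q": Q, "edges": edges, "prior": prior, "hist": hist}

def predict_active(model, d):
    """Pr(Y=1 | X=x) for a new descriptor vector d, with x = Q(d - u)."""
    x = model["Q"] @ (np.asarray(d, dtype=float) - model["u"])
    log_like = np.log(model["prior"]).copy()
    for i, xi in enumerate(x):
        b = np.searchsorted(model["edges"][i], xi)
        log_like += np.log(model["hist"][:, i, b])
    # 1 / (1 + Pr(x|0)Pr(0) / (Pr(x|1)Pr(1))), evaluated in log space.
    return 1.0 / (1.0 + np.exp(log_like[0] - log_like[1]))

# Synthetic data: 400 "inactives" and 40 "actives" shifted in two descriptors.
rng = np.random.default_rng(0)
inactive = rng.normal(size=(400, 4))
active = rng.normal(size=(40, 4)) + np.array([2.5, 2.5, 0.0, 0.0])
D = np.vstack([inactive, active])
D[:, 2] += 0.8 * D[:, 0]            # introduce correlation between descriptors
y = np.array([0] * 400 + [1] * 40)

model = fit_binary_qsar(D, y)
p_hi = predict_active(model, [2.5, 2.5, 2.0, 0.0])   # near the synthetic actives
p_lo = predict_active(model, [0.0, 0.0, 0.0, 0.0])   # near the synthetic inactives
print(p_hi, p_lo)
```

Working in log space avoids underflow when many one-dimensional factors are multiplied. The smoothing keeps every bin probability nonzero, so a single empty bin cannot force a prediction to exactly 0 or 1.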
Experience with Binary QSAR

Fundamental methodology publication (robustness study)
    Biocomputing: Proceedings of the 1999 Pacific Symposium, World Scientific Publishing, Singapore, 1999

Example literature data sets (non-HTS data)
    Estrogen receptor (Gao et al.; J. Chem. Info. Comput. Sci., 1999, 36)
    O-acyltransferase (ACAT) (Labute et al.; in press)

Example industrial data sets (HTS assays)
    ArQule: 24,000 cpds., ~200 active, accuracy 93% on inactives, 60% on actives
    Pharmacopeia: 24,000 cpds., accuracy >90% on inactives, >90% on actives
    SmithKline Beecham: 80,000 cpds., ~100 active, accuracy 90% on actives

Best success story: Pharmacia & Upjohn
    Binary QSAR model used to select building blocks in combi-chem library
    Improved activity from μM to nM (factor of 1000)
Combined Design Model for HTD Cycle

Use Binary QSAR method twice, once for activity model
and once for drugability model


Train drugability model Pr (D | X) on WDI/ACD for drug-like/non-drug-like
or on specific data sets (e.g., blood-brain barrier permeability)
Complete model of activity and drugability is the product
Pr(D | X) Pr(Y | X) which approximates Pr(D, Y | S)
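A small sketch of how the product model trades off the two goals. The compound names and probabilities are hypothetical; in practice both numbers would come from the two Binary QSAR models.

```python
def design_score(p_drugable, p_active):
    """Approximate Pr(D, Y | S) by the product Pr(D | X) * Pr(Y | X)."""
    return p_drugable * p_active

# Hypothetical candidates: (Pr(D | X), Pr(Y | X)).
candidates = {
    "cpd_A": (0.90, 0.20),   # drug-like but probably inactive
    "cpd_B": (0.50, 0.60),   # moderate on both criteria
    "cpd_C": (0.10, 0.95),   # probably active but not drug-like
}

ranked = sorted(candidates, key=lambda name: design_score(*candidates[name]),
                reverse=True)
print(ranked)
```

The balanced compound ranks first, since a product rewards candidates that do reasonably well on both criteria over those that excel on only one.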
[Figure: HTS Data (BioAssay) → Binary QSAR → Activity Model; Drugability Data (e.g., BBB or drug-like) → Binary QSAR → ADME Model; Activity Model + ADME Model → Design Model → Library Design → Combinatorial Library]
Representation of Chemical Structures (Descriptors)
A Brief History of QSAR

Original philosophy (Hansch & Leo): Use a fixed set of meaningful molecular properties to describe a wide variety of biological phenomena

Linear regression used to determine SAR
    The determination of linear relationships is basic science
    Statistical regression framework used to assess significance of SAR

Proliferation of descriptors
    Early successes led to the introduction of a vast array of descriptors
    In principle, any number calculable from a chemical structure can be used as a molecular descriptor for SAR determination

Over-determination of SAR
    Multitude of descriptors leads to a need for schemes for variable elimination
    3D methods treat each grid-point in field representation as a descriptor
Fundamental Notions

Use a fixed set of descriptors for diversity and QSAR/QSPR
    A meaningful chemistry space should not require customization
    In QSAR/QSPR automatic variable selection can be dangerous
    Make direct use of Hansch & Leo thinking (build on their experience)

Model 3D properties from 2D (connectivity) information
    3D information from 2D connectivity = 2½ D descriptors
    HTS QSAR and large-scale diversity require fast calculation times
    2D topological descriptors too weak, 3D descriptors too expensive
    Use approximate atomic surface areas as fundamental representation
    Complement substructure keys (stay property-based for class-hopping)

Intended applications
    QSAR/QSPR models, linear and nonlinear, early and late in project
    Chemistry space for library design
Exposed Van der Waals Surface Area (VSA)

Calculate exposed Van der Waals surface area for each
atom by subtracting off surface area inside neighbors
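[Figure: an isolated atom of radius r has surface area 4πr²; overlap with one bonded neighbor A reduces the exposed area to 4πr² - C_A; overlap with two neighbors A and B reduces it to 4πr² - C_A - C_B]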
Correction factors to sphere formula depend on atomic
radii and inter-atomic distances
Connection Table VSA Calculation

Neglect
    Non-bonded neighbors (small molecules have little NB contact)
    Interaction between angles (1-3 interactions)
    Stretch of bond lengths (use ideal bond length)

[Figure: atom A of radius r overlapping a bonded neighbor of radius s at inter-atomic distance d]

$$x_i = \frac{r^2 - s_i^2 + d_i^2}{2 d_i}$$

$$V_A = 2\pi r \left[ 2r - \sum_i (r - x_i) \right]$$

Parameters
    Radii: Van der Waals (or solvation)
    Inter-atomic distances: Ideal bond lengths

Define V_i to be the exposed VSA of atom i.
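A minimal sketch of the connection-table calculation, reading the slide's formulas as x_i = (r² - s_i² + d_i²) / (2 d_i) and V_A = 2πr [2r - Σ_i (r - x_i)]. Each bonded neighbor removes a spherical cap of height r - x_i from atom A. The clamping of cap heights and the radii and bond length in the example are assumptions for illustration, not parameters from the source.

```python
import math

def exposed_vsa(r, neighbors):
    """Approximate exposed van der Waals surface area of one atom.

    r         -- radius of the atom
    neighbors -- iterable of (s, d): radius of a bonded neighbor and the
                 ideal bond length to it
    """
    buried_height = 0.0
    for s, d in neighbors:
        # Distance from the atom center to the plane where the spheres meet.
        x = (r * r - s * s + d * d) / (2.0 * d)
        # Height of the cap lost to this neighbor, clamped to the physically
        # meaningful range (the clamp is an assumption of this sketch).
        buried_height += min(max(r - x, 0.0), 2.0 * r)
    # V_A = 2*pi*r * [2r - sum_i (r - x_i)], floored at zero.
    return max(2.0 * math.pi * r * (2.0 * r - buried_height), 0.0)

# Hypothetical values: two atoms of radius 1.7 bonded at distance 1.5.
isolated = exposed_vsa(1.7, [])
bonded = exposed_vsa(1.7, [(1.7, 1.5)])
print(isolated, bonded)
```

With no neighbors the function returns the full sphere area 4πr². Because only radii and ideal bond lengths enter, the areas can be computed from the connection table alone, without generating a 3D conformation.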
Quality of Approximate VSA Calculation
Data set of 1,947 conformations
    MOE 2D → 3D converter, MMFF94 force field, 0.01 RMS gradient
    Molecular weights in [300,1600] range

VDW Surface Area
    3D dot calculation

Accuracy
    r = 0.9856, r2 = 0.9666
    <10% error
    Largest errors on steroids and other fused ring systems

[Figure: scatter plot of Approximate VSA (0 to 1600) against Van der Waals Surface Area (0 to 1500)]
Subdivision of VSA by Properties

Given an atomic property value P_i for each atom i

Bin P_i by ranges and sum V_i

    P_i range:     [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)
    Descriptors:   D1     D2     D3     D4     D5     D6

[Figure: example molecule with atoms C1, O2, C3, C4, C5, C6, N7 and C8, each with an exposed area V_i and a property value P_i; each descriptor D_k is the sum of the V_i of the atoms whose P_i falls in range k (for instance V3 + V8 or V4 + V5 when two atoms share a range)]
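The binning step can be sketched in a few lines. The areas and property values below are hypothetical, chosen only so that two pairs of atoms share a bin; the half-open ranges [lo, hi) follow the slide.

```python
def vsa_descriptors(areas, props, edges):
    """Sum exposed areas V_i over atoms whose property P_i lies in each [lo, hi) bin."""
    descriptors = [0.0] * (len(edges) - 1)
    for v, p in zip(areas, props):
        for k in range(len(edges) - 1):
            if edges[k] <= p < edges[k + 1]:
                descriptors[k] += v
                break   # each atom contributes to at most one bin
    return descriptors

# Hypothetical 8-atom molecule: V_i (areas) and P_i (property values).
areas = [10.0, 8.0, 6.0, 4.0, 5.0, 7.0, 9.0, 3.0]
props = [5.9, 1.2, 2.5, 4.5, 4.1, 3.3, 0.2, 2.9]
edges = [0, 1, 2, 3, 4, 5, 6]

D = vsa_descriptors(areas, props, edges)
print(D)
```

When every atom's property falls inside the covered range, the descriptors sum to the total exposed surface area, so the set partitions the molecular surface by property value.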
8 Molar Refractivity Descriptors

Wildman & Crippen SMR model of Molar Refractivity
    Specific attention paid to calculation of atomic contributions
    Protonation state taken as-is from structure (specific species)

Property bins derived from ~50,000 structures
    8 descriptors result: SMR_VSAk
    Each bin is approximately equally populated over training set

Wildman, S.A., Crippen, G.M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39(5), 868-873 (1999).
10 LogP (octanol/water) Descriptors

Wildman & Crippen SlogP model of LogP
    Specific attention paid to calculation of atomic contributions
    Protonation state taken as-is from structure (specific species)

Property bins derived from ~50,000 structures
    10 descriptors: SlogP_VSAk
    Each bin is approximately equally populated over training set

Wildman, S.A., Crippen, G.M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39(5), 868-873 (1999).
SMR_VSA and SlogP_VSA Inter-correlation

Correlation Analysis
    SMR and SlogP descriptors weakly correlated
    Test made on ~2000 small molecules not used in definition of descriptors
    Displayed values are r values (not r2)

Descriptors encode “orthogonal” molecular properties
14 Partial Charge Descriptors

Gasteiger (PEOE) partial charge model
    Approximation to local pKa
    Electrostatic interactions
    Similar to Jurs descriptors

14 descriptors result from uniform interval boundaries
    Weak correlation

Stanton, D., Jurs, P. Anal. Chem., 62, 2323 (1990)
Gasteiger, J., Marsili, M. Iterative Partial Equalization of Orbital Electronegativity - A Rapid Access to Atomic Charges. Tetrahedron, 36, 3219 (1980)
Encoding of Traditional Descriptors

Traditional descriptors modeled with VSA descriptors
    1,932 small organic molecules with weights in (28,800)
    SlogP_VSA, SMR_VSA and PEOE_VSA descriptors calculated
    Principal components regression models for 64 traditional descriptors

Fit value listed on the slide for each traditional descriptor, in descending order:

    chi0      0.99    chi0v_C   0.97    b_ar      0.89    b_1rotN   0.78
    Kier1     0.99    KierA1    0.97    Kier2     0.89    b_double  0.77
    vdw_area  0.99    a_hyd     0.96    vsa_pol   0.89    b_rotN    0.77
    vdw_vol   0.99    a_nC      0.96    vsa_acc   0.88    a_ICM     0.73
    vsa_hyd   0.99    a_nH      0.96    diameter  0.87    vsa_don   0.73
    a_count   0.98    a_nO      0.95    VadjEq    0.87    KierFlex  0.69
    a_heavy   0.98    b_heavy   0.95    a_nN      0.86    balabanJ  0.61
    a_IC      0.98    chi1_C    0.95    KierA2    0.86    a_nP      0.60
    apol      0.98    chi1v_C   0.95    radius    0.86    Kier3     0.57
    b_count   0.98    SlogP     0.95    VdistMa   0.86    a_nCl     0.56
    chi0v     0.98    a_acc     0.94    wienPath  0.85    KierA3    0.55
    chi1      0.98    chi1v     0.94    wienPol   0.84    a_nS      0.53
    SMR       0.98    Weight    0.93    VadjMa    0.82    b_1rotR   0.50
    b_single  0.97    a_aro     0.91    VdistEq   0.82    density   0.49
    bpol      0.97    a_don     0.91    vsa_oth   0.82    b_rotR    0.48
    chi0_C    0.97    zagreb    0.91    a_nF      0.80    b_triple  0.46
Boiling Point

Data set
    Exp. boiling point (K)
    298 small molecules
    18 descriptors: SlogP_VSA(10), SMR_VSA(8)

PCA regression
    r2 = 0.96, RMSE = 15.53
    Leave-one-out: r2 = 0.94, RMSE = 21.37
    Random leave-100-out: r2 = 0.94

[Figure: scatter plot of boiling points (K); both axes run from 200 to 700]
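The property models on this and the following slides report a fitted r2 and RMSE alongside leave-one-out values. The sketch below shows one way such figures can be produced with principal components regression. It uses synthetic data rather than the boiling point set, the function names are hypothetical, and r2 is computed as 1 - SSE/SST, which may differ from the definition used for the slides.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the leading principal components of X."""
    mu, ybar = X.mean(axis=0), y.mean()
    # Principal axes of the centered descriptor matrix via SVD.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[:n_components].T
    T = (X - mu) @ V                      # component scores
    coef, *_ = np.linalg.lstsq(T, y - ybar, rcond=None)
    return mu, ybar, V @ coef             # coefficients in descriptor space

def pcr_predict(model, X):
    mu, ybar, beta = model
    return ybar + (X - mu) @ beta

def r2_rmse(y, yhat):
    resid = y - yhat
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    r2 = float(1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2))
    return r2, rmse

def leave_one_out(X, y, n_components):
    """Refit without each sample in turn and predict the held-out sample."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        model = pcr_fit(X[mask], y[mask], n_components)
        preds[i] = pcr_predict(model, X[i:i + 1])[0]
    return r2_rmse(y, preds)

# Synthetic descriptors driven by three latent factors (illustration only).
rng = np.random.default_rng(1)
latent = rng.normal(size=(60, 3))
X = latent @ rng.normal(size=(3, 6)) + 0.05 * rng.normal(size=(60, 6))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=60)

model = pcr_fit(X, y, n_components=3)
fit_r2, fit_rmse = r2_rmse(y, pcr_predict(model, X))
loo_r2, loo_rmse = leave_one_out(X, y, n_components=3)
print(fit_r2, fit_rmse, loo_r2, loo_rmse)
```

A leave-one-out r2 close to the fitted r2, as reported for most of the property models here, suggests the regression is not simply memorizing the training set.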
Free Energy of Solvation in Water

Data set
    Exp. ΔGs (kcal/mol)
    291 small molecules
    12 descriptors: PEOE_VSA(3), SlogP_VSA(7), SMR_VSA(2)

PCA regression
    r2 = 0.90, RMSE = 0.78
    Leave-one-out: r2 = 0.89, RMSE = 0.82
    Random leave-100-out: r2 = 0.88

[Figure: scatter plot of solvation free energies (kcal/mol); both axes run from -10 to 4]

Viswanadhan, V.N., Ghose, A.K., Singh, U.C., Wendoloski, J.J.; Prediction of Solvation Free Energies of Small Organic Molecules: Additive-Constitutive Models Based on Molecular Fingerprints and Atomic Constants; J. Chem. Inf. Comput. Sci., 39, 405-412 (1999)
Thermodynamic Solubility in Water

Data set
    Exp. logW at 25ºC
    1,438 small molecules
    32 descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)

PCA regression
    r2 = 0.75, RMSE = 2.4
    Leave-one-out: r2 = 0.74, RMSE = 2.5

[Figure: scatter plot of logW values; both axes run from -15 to 20]

Syracuse Research Corporation, 6225 Running Ridge Road, North Syracuse, NY 13212.
URL: http://www.syyres.com.
Vapor Pressure

Data set
    Exp. vapor pressure at 25ºC
    1,771 small molecules
    32 descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)

PCA regression
    r2 = 0.88, RMSE = 2.1
    Leave-one-out: r2 = 0.87, RMSE = 2.2

[Figure: scatter plot of vapor pressure values; both axes run from -30 to 20]

Syracuse Research Corporation, 6225 Running Ridge Road, North Syracuse, NY 13212.
URL: http://www.syyres.com.
Compound Classification with Binary QSAR

Can Binary QSAR separate inhibitor classes using SlogP_VSAk and SMR_VSAk descriptors?

Data: 455 compounds active against one of 7 targets

Results (classification accuracy)
    Class 1: Serotonin receptor ligands             98.7%  p=0.003
    Class 2: Benzodiazepine receptor ligands        96.7%  p=0.043
    Class 3: Carbonic anhydrase II inhibitors       96.5%  p=0.290
    Class 4: Cyclooxygenase-2 (Cox-2) inhibitors    98.7%  p=0.001
    Class 5: H3 antagonists                         98.7%  p=0.014
    Class 6: HIV protease inhibitors                98.7%  p=0.012
    Class 7: Tyrosine Kinase inhibitors             99.1%  p=0.002

Labute, P. Binary QSAR: A New Method for Quantitative Structure Activity Relationships. Biocomputing: Proceedings of the 1999 Pacific Symposium, World Scientific Publishing, Singapore (1999)
Compound Classification with CART

Learning set for CART (recursive partitioning)
    455 compounds active against one of 7 targets
    1,942 “random” organic compounds
    SlogP_VSA, SMR_VSA descriptors

Classification accuracy (32 node tree, depth 5)
    Class 1: Serotonin receptor ligands             84.5%  p=0.07
    Class 2: Benzodiazepine receptor ligands        49.1%  p=0.30
    Class 3: Carbonic anhydrase II inhibitors       92.5%  p=0.27
    Class 4: Cyclooxygenase-2 (Cox-2) inhibitors    96.8%  p=0.01
    Class 5: H3 antagonists                         82.7%  p=0.03
    Class 6: HIV protease inhibitors                85.4%  p=0.02
    Class 7: Tyrosine Kinase inhibitors             91.4%  p=0.01

Xue, L., Godden, J., Gao, H., Bajorath, J. Identification of a Preferred Set of Molecular Descriptors for Compound Classification Based on Principal Component Analysis. J. Chem. Inf. Comput. Sci., 39, p699 (1999)
Thrombin, Trypsin and Factor Xa Activity

Data and descriptors
    Exp. pKi data for 72 analogs against Thrombin, Trypsin and Factor Xa
    Descriptors: subsets of SlogP_VSAk, SMR_VSAk, PEOE_VSAk
    Principal components regression

Thrombin (10 descr.): r2 = 0.65, RMSE = 0.61
Trypsin (9 descr.): r2 = 0.72, RMSE = 0.47
Factor Xa (15 descr.): r2 = 0.69, RMSE = 0.35

[Figure: structure of the analog scaffold and three scatter plots of pKi values (axes from 2 to 9), one per enzyme]

Bohm, M., Sturzebecher, J., Klebe, G. J. Med. Chem., 42, p458-477 (1999).
Blood-Brain Barrier Permeability

Data set
    Exp. logBBB partition
    75 molecules (charged)
    14 descriptors: PEOE_VSA(3), SlogP_VSA(6), SMR_VSA(5)

PCA regression
    r2 = 0.83, RMSE = 0.32
    Leave-one-out: r2 = 0.73, RMSE = 0.43

[Figure: scatter plot of logBBB values; both axes run from -2.5 to 1.5]

Luco, J.M. Prediction of the Brain-Blood Distribution of a Large Set of Drugs from Structurally Derived Descriptors Using Partial Least Squares (PLS) Modeling. J. Chem. Inf. Comput. Sci., 39, 396-404 (1999)
Advantages

Orthogonality
    SlogP_VSA, SMR_VSA and PEOE_VSA exhibit weak correlation
    Binary QSAR and Recursive Partitioning methodologies benefit
    Less reliance on Principal Components Analysis

Relevance
    SlogP_VSA, SMR_VSA and PEOE_VSA useful for QSAR/QSPR
    SlogP_VSA, SMR_VSA and PEOE_VSA used successfully in HTS QSAR
    Pharmacokinetic, “drug-like” and ADME properties modeled reasonably

Additivity
    VSA conversion of logP and MR to non-whole molecule properties
    VSA descriptors are “group” additive (useful for combinatorial designs)
    Fundamental units are surface areas for all descriptors (Euclidean space)
    More continuous than simple atom/fragment counts
Focused Combinatorial Library Design
Combinatorial Library Design

Objective: Select reagents for combinatorial synthesis to
minimize iterations in High Throughput Discovery Cycle

Combinatorial Library: all combinations of R1, R2, R3 and R4 groups.

Select building blocks that bias library towards drug-like actives

Use non-enumerative techniques to score building blocks in large virtual libraries
    4 connection points with 1000 R-groups each = 10^12 compounds
    Enumeration impractical. Can’t even store compounds!
    Use statistical sampling to score building blocks
    Use Binary QSAR model as focusing agent

[Figure: example scaffold with substitution points R1, R2, R3 and R4]
Building Block Scoring Methodology

Estimate probability that building block is in active compound:

$$\Pr(X_i = r_{ij} \mid \text{active}) = \frac{\Pr(\text{active} \mid X_i = r_{ij})}{\sum_k \Pr(\text{active} \mid X_i = r_{ik})}$$

    The Pr(active | X_i = r_ij) terms are estimated with the Binary QSAR model

Use random sampling to estimate terms in formula to avoid enumeration of entire virtual library
    Randomly choose a central group and R-groups
    Construct virtual product and calculate product descriptors

Focused library design using Binary QSAR model
    “Count” the number of times reagents appear in active compounds
    Binary QSAR model used to estimate activity of virtual products
    Select top scoring reagents for pure combinatorial design
Building Block Scoring Algorithm
Step 1: Select random building blocks
Step 2: Construct virtual product
Step 3: Calculate product descriptors, e.g., (2.1, 3.2, 4.3, 0, 5.3, 2.8, ...)
Step 4: Estimate probability of product activity with the Binary QSAR model, e.g., p = 0.63
Step 5: Add probability to building blocks: each building block in the product receives +0.63
Step 6: Repeat, then output building block scores, normalized per position:

$$a_{ij} \leftarrow \frac{a_{ij}}{\sum_k a_{ik}}$$

[Figure: flowchart of the steps above, illustrated with a scaffold bearing R1, R2 and R3 groups]
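The sampling loop can be sketched as follows. This is a minimal illustration under stated assumptions: `predict_activity` is a hypothetical stand-in for "construct virtual product, calculate descriptors, apply the Binary QSAR model", reagents are sampled uniformly at each position, and the toy reagents are plain numbers rather than chemical structures. The choice of a central group is omitted.

```python
import math
import random

def score_building_blocks(reagents, predict_activity, n_samples=20000, seed=0):
    """Accumulate predicted activity per reagent over random virtual products.

    reagents         -- one list of candidate reagents per substitution position
    predict_activity -- callable mapping a tuple of chosen reagents to Pr(active)
    """
    rng = random.Random(seed)
    scores = [[0.0] * len(position) for position in reagents]
    for _ in range(n_samples):
        # Select random building blocks and form the virtual product.
        choice = [rng.randrange(len(position)) for position in reagents]
        product = tuple(reagents[i][j] for i, j in enumerate(choice))
        p = predict_activity(product)
        # Add the product's probability to every building block it contains.
        for i, j in enumerate(choice):
            scores[i][j] += p
    # Normalize per position: a_ij <- a_ij / sum_k a_ik.
    return [[a / (sum(row) or 1.0) for a in row] for row in scores]

# Toy stand-in model: activity rises with the sum of reagent "contributions".
def toy_model(product):
    return 1.0 / (1.0 + math.exp(-(sum(product) - 2.5)))

reagents = [[0.0, 1.0, 2.0], [0.0, 3.0]]
scores = score_building_blocks(reagents, toy_model)
print(scores)
```

The cost depends on the number of samples, not on the size of the virtual library, which is what makes the approach usable when the full library cannot be enumerated.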
cGMP PDE V Data Set
Cyclic GMP phosphodiesterase (human type 5) inhibitors

[Figure: the scaffold classes in the data set, with substitution points R and R1 to R4]

263 compounds from literature
    263 imidazopyrines, quinazolines, 1,3-bis(cyclopropylmethyl)xanthines

1,534 random “inactives” added

IC50: 0.5 nM through 100+ μM
Binary QSAR Model for cGMP PDE V

Binary QSAR setup
    Descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)
    Threshold to simulate HTS data: active = -log IC50 > 0 (1 μM)
    Random 10% of data separated for validation set (9 active, 171 inactive)
    Remaining 90% of data used for training (52 active, 1568 inactive)

Results
    Training set accuracy: 69.2% on active, 98.4% on inactive (p=0.000227)
    Leave-one-out accuracy: 55.8% on active, 98.4% on inactive
    10-fold block cross-validation gave similar results
    Validation set accuracy: 55.6% on active, 98.2% on inactive

Resulting statistically significant model is assumed to “understand” the original data set
    Individual predictions of activity are suspect but a collection of predictions is meaningful: e.g., estimate the number of actives in a library
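One way to read "a collection of predictions is meaningful" is that the predicted probabilities can be summed to give an expected number of actives, rather than thresholding each compound individually. The probabilities below are hypothetical.

```python
def expected_actives(probabilities):
    """Expected number of actives: the sum of the per-compound Pr(active)."""
    return sum(probabilities)

def thresholded_actives(probabilities, cutoff=0.5):
    """Count of compounds individually classified as active at the cutoff."""
    return sum(1 for p in probabilities if p >= cutoff)

# Hypothetical Binary QSAR outputs for a six-compound library.
library = [0.63, 0.30, 0.05, 0.80, 0.02, 0.40]

expected = expected_actives(library)
hard_count = thresholded_actives(library)
print(expected, hard_count)
```

The sum also credits the fractional evidence from compounds below the cutoff, so it uses every prediction instead of discarding the uncertain ones.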
Quinazoline Virtual Library Definition
Quinazoline scaffold

[Figure: quinazoline scaffold with substitution points R1, R2, R3 and R4]

R1: CH2-(2-thienyl), benzyl, CH2-benzyl, phenyl, CH2-(3-pyridyl), CH2-(2-furanyl), 2-pyridyl, 3-(5-Me-isoxazolyl), 2-ClPh, 3-ClPh, 4-ClPh, 3-pyridyl, 1-pyrrolyl, propyl, 3-CH3OPh, CH2CH2-2(3-Me-pyrrolyl), 3-NO2Ph, H, CH2(CH2)4OH, CH2(cPr), 4-(CO2Me)Ph, c-pentyl, c-hexyl, CH2-(2-THF), CH2(CH2)2OCOCH2CH3
R2: 1-imidazolyl, 2-pyridyl, 3-pyridyl, H, 4-pyridyl, 2-thienyl, 2-furyl, Cl, 4-morpholine, c-hexyl, 4-Me-1-piperazinyl, styrenyl
R3: H, CH3, F, Cl, Br, CH2OH, CH2SH, CH3SO2, NO2, HCC
R4: H, Me

27 x 12 x 10 x 2 = 6,480 products in virtual library
Quinazoline Library R-Group Scores
    Score  R1                               Score  R2                   Score  R3       Score  R4
    0.121  CH2-(2-thienyl)                  0.191  1-imidazolyl         0.197  Me       0.960  H
    0.120  benzyl                           0.160  2-pyridyl            0.192  H        0.040  Me
    0.109  CH2-benzyl                       0.145  3-pyridyl            0.142  Cl
    0.083  phenyl                           0.127  H                    0.120  F
    0.081  CH2-(3-pyridyl)                  0.111  4-pyridyl            0.112  CH3O
    0.062  CH2-(2-furanyl)                  0.099  2-thienyl            0.093  Br
    0.061  2-pyridyl                        0.076  2-furyl              0.065  CH3S
    0.041  3-(5-Me-isoxazolyl)              0.072  Cl                   0.042  CH3SO2
    0.038  2-ClPh                           0.016  4-morpholine         0.024  NO2
    0.037  3-ClPh                           0.003  c-hexyl              0.013  NCC
    0.036  4-ClPh                           0.001  4-Me-1-piperazinyl
    0.034  3-pyridyl
    0.032  1-pyrrolyl
    0.023  3-CH3OPh
    0.015  CH2CH2-2(3-Me-pyrrolyl)
    0.015  3-NO2Ph
    0.011  H
    0.008  CH2(CH2)4OH
    0.007  CH2(cPr)
    0.007  4-(CO2Me)Ph
    0.006  CH2-(c-hexyl)
    0.005  3-(CO2Me)Ph, c-pentyl, c-hexyl
    0.004  CH2-(2-THF)
    0.003  CH2(CH2)2OCOCH2CH3

Use median cutoff at each position (selected R-groups shown in blue on the original slide)

Retain only high-scoring R-groups that account for 50% of the probability
Resulting Focused Quinazoline Library

3 x 3 x 5 = 45 compounds (0.7% of original library)
    R1: CH2-(2-thienyl), benzyl, CH2-benzyl, phenyl, CH2-(3-pyridyl)
    R2: 1-imidazolyl, 2-pyridyl, 3-pyridyl
    R3: H, Me, Cl

SAR for R1
    -CH2- spacer in top scoring groups agrees with experiment
    Benzyl group agrees with experiment

SAR for R2
    1-imidazolyl as top scoring group agrees with experiment
    2-, 3-pyridyl groups agree with experiment

SAR for R3
    Preference for small hydrophobic groups agrees with experiment

[Figure: focused quinazoline scaffold with the retained R1, R2 and R3 groups]
Summary and Outlook
Complete Methodology for HTDD
[Figure: HTS Data (BioAssay) → Binary QSAR → Activity Model; Drugability Data (e.g., BBB or drug-like) → Binary QSAR → ADME Model; Activity Model + ADME Model → Design Model → Library Design → Combinatorial Library]

VSA Descriptors: information rich low dimensional space
    Applicable for activity models and ADME models

Binary QSAR: probabilistic QSAR
    QSAR models from HTS data (directly) and ADME data

Reagent Scoring: probabilistic scoring
    Non-enumerative technique that uses Binary QSAR models to focus
Important Ideas

Binary QSAR
    Shift away from regression-based techniques: more robust to errors
    Predictions are “soft”: best suited to collection-based predictions
    Probability models can be combined (e.g., ADME & Potency)

VSA Descriptors
    Orthogonal, group-additive descriptors reminiscent of Hansch & Leo
    Wide applicability gives rise to meaningful chemistry space
    Less reliance on variable selection procedures (fewer false correlations)

Reagent Scoring
    Non-enumerative techniques can handle huge virtual libraries
    Resulting score complements other criteria (cost, availability, etc.)
    Sample + Reject + Estimate procedure can incorporate arbitrary filters
    Statistically significant model = “understanding” of data
Future Work and Availability

Generation II VSA descriptors
    Improve fundamental VSA approximation: can do better than 10% error
    Handle “protonics”: average over tautomeric and protonation states
    Direct polarizability model might replace logP + MR

Extend probability decomposition to include receptor (T)
    Pr(Y, D, S, T) = Pr(D | Y, S, T) Pr(Y | S, T) Pr(T, S)
    Pr(D | Y, S, T): drugable for T or T’s type
    Pr(Y | S, T): probabilistic QSAR
    Pr(T, S): bioinformatics?

All methodology available in MOE version 2000.02