Leave-cluster-out crossvalidation is appropriate for scoring
functions derived on diverse protein datasets
Christian Kramer, Peter Gedeck
Novartis Institutes for Biomedical Research, Basel, Switzerland
Leave-cluster-out crossvalidation
PDBbind core set & RFscore performance
[Figure: leave-cluster-out crossvalidation scheme with paired training and validation sets (Train 1-5, Validation 1-5).]
For the leave-cluster-out crossvalidation we suggest
the following clustering scheme: all clusters with more
than nine members are kept as individual clusters (A-W).
Clusters with four to nine members are merged into one
group (X), clusters with two or three members into a
second group (Y), and all singletons into a third group (Z).
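
The grouping rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the input format (one raw cluster id per complex) and the function name `assign_cv_groups` are hypothetical, and assigning the letters A-W in order of decreasing cluster size is an assumption.

```python
from collections import Counter
from string import ascii_uppercase


def assign_cv_groups(cluster_ids):
    """Map raw cluster ids to crossvalidation group labels (A-W, X, Y, Z).

    cluster_ids: one raw cluster id per complex (hypothetical input format).
    """
    sizes = Counter(cluster_ids)
    # Clusters with more than nine members keep their own letter.
    # Assumption: letters are handed out in order of decreasing size.
    large = sorted((c for c, n in sizes.items() if n > 9),
                   key=lambda c: (-sizes[c], str(c)))
    if len(large) > 23:
        raise ValueError("more than 23 large clusters; letters A-W exhausted")
    letter = {c: ascii_uppercase[i] for i, c in enumerate(large)}

    def label(cluster):
        n = sizes[cluster]
        if n > 9:
            return letter[cluster]
        if n >= 4:
            return "X"  # clusters with four to nine members
        if n >= 2:
            return "Y"  # clusters with two or three members
        return "Z"      # singletons

    return [label(c) for c in cluster_ids]
```

The returned labels can then serve directly as the group key when one group at a time is left out for validation.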
Cluster proximities
Multidimensional scaling of the RFscore descriptor space
shows that complexes from the same protein family
indeed cluster together. A flexible learning algorithm
should therefore be well able to recognize protein family
membership.
[Figure: free energy of binding [kcal/mol] by cluster.]
[Figure: predicted vs. measured values, one panel per cluster (A-Z).]
Biological Target                      Cluster  #samples       R     R²   RMSE
HIV Protease                           A             188    0.11   0.01   1.91
Trypsin                                B              74    0.73   0.53   1.04
Carbonic Anhydrase                     C              57    0.56   0.31   1.68
Thrombin                               D              52    0.37   0.14   2.03
PTP1B (Protein Tyrosine Phosphatase)   E              32    0.63   0.40   1.02
Factor Xa                              F              32    0.19   0.04   1.76
Urokinase                              G              29    0.78   0.61   0.95
Different similar Transporters         H              29   -0.12   0.01   1.17
c-AMP Dependent Kinase (PKA)           I              17    0.54   0.29   1.26
Beta-Glucosidase                       J              17    0.59   0.35   1.13
Antibodies                             K              16    0.58   0.34   1.57
Casein Kinase II                       L              16    0.44   0.19   1.10
Ribonuclease                           M              15    0.18   0.03   1.20
Thermolysin                            N              14    0.68   0.46   1.09
CDK2 Kinase                            O              13    0.64   0.41   1.11
Glutamate receptor 2                   P              13   -0.20   0.04   1.16
P38 Kinase                             Q              13    0.79   0.62   0.59
Beta-secretase 1                       R              12    0.93   0.86   1.51
tRNA-guanine transglycosylase          S              12    0.12   0.01   1.08
Endothiapepsin                         T              11    0.60   0.36   1.34
Alpha-mannosidase 2                    U              10   -0.17   0.03   1.88
Carboxypeptidase A                     V              10    0.78   0.61   1.71
Penicillopepsin                        W              10   -0.42   0.18   2.22
All Clusters with 4-9 complexes        X             387    0.56   0.31   1.63
All Clusters with 2-3 complexes        Y             340    0.53   0.28   1.61
Singletons                             Z             321    0.44   0.19   1.75
The PDBbind09 cluster alphabet
The PDBbind07 core set can be predicted with RMSE
= 1.58, R² = 0.59 and R = 0.77.
The core set was assembled from a clustering of the
PDBbind07 database according to BLAST similarities:
the most active complex, the least active complex and
the complex closest to the average activity were extracted
from each cluster with at least 4 members.
This means that for every validation set entry
there is at least one entry from the same protein
family in the training set.
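
The selection rule described above can be illustrated with a short Python sketch. The tuple layout `(complex_id, cluster_id, activity)` and the function name `select_core_set` are hypothetical, and "most active" is assumed to mean the largest activity value (as it would for a pKd-like scale).

```python
from collections import defaultdict


def select_core_set(complexes, min_members=4):
    """Pick up to three representatives per sufficiently large cluster.

    complexes: list of (complex_id, cluster_id, activity) tuples
               (hypothetical layout; higher activity = more active).
    """
    by_cluster = defaultdict(list)
    for entry in complexes:
        by_cluster[entry[1]].append(entry)

    core = []
    for members in by_cluster.values():
        if len(members) < min_members:
            continue  # small clusters contribute nothing to the core set
        mean = sum(m[2] for m in members) / len(members)
        most_active = max(members, key=lambda m: m[2])
        least_active = min(members, key=lambda m: m[2])
        closest_to_mean = min(members, key=lambda m: abs(m[2] - mean))
        for pick in (most_active, least_active, closest_to_mean):
            if pick not in core:  # avoid duplicates if two criteria coincide
                core.append(pick)
    return core
```

Because at most three complexes are taken from a cluster of at least four, every selected complex leaves at least one relative from the same cluster behind in the training set, which is exactly the overlap the text points out.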
Composition of the PDBbind09 database
The PDBbind09 refined set consists of 1741
complexes in 561 clusters (90% BLAST similarity).
The distribution of cluster population is shown below.
[Figure: distribution of cluster population; axis label: # complexes per cluster.]
Learning algorithm
The Random Forest as implemented in R with default
settings was used.
Descriptors
The RFscore descriptors as published by Ballester
and Mitchell[1] were used for all models. For every ligand
atom type [C,N,O,F,P,S,Cl,Br,I], all protein atoms of type
[C,N,O,S] within a distance of 12 Å are counted and summed
to give 4x9 = 36 atom-pair descriptors.
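
As a rough illustration of the counting scheme described above, the following Python sketch builds the 36 atom-pair counts from plain coordinate lists. It is not the published RF-score implementation: the atom representation `(element, (x, y, z))`, the function name `atom_pair_counts`, the ordering of the output vector, and the treatment of atoms exactly at the cutoff are all assumptions.

```python
import math
from itertools import product

LIGAND_ELEMENTS = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]
PROTEIN_ELEMENTS = ["C", "N", "O", "S"]


def atom_pair_counts(ligand_atoms, protein_atoms, cutoff=12.0):
    """Count protein-ligand atom pairs within `cutoff` Angstrom.

    Each atom is an (element, (x, y, z)) tuple (hypothetical layout).
    Returns a list of 4 x 9 = 36 counts, protein element varying slowest.
    """
    pairs = list(product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS))
    counts = {pair: 0 for pair in pairs}
    for l_elem, l_xyz in ligand_atoms:
        if l_elem not in LIGAND_ELEMENTS:
            continue  # e.g. hydrogens are ignored
        for p_elem, p_xyz in protein_atoms:
            if p_elem not in PROTEIN_ELEMENTS:
                continue
            # Assumption: atoms exactly at the cutoff are included.
            if math.dist(l_xyz, p_xyz) <= cutoff:
                counts[(p_elem, l_elem)] += 1
    return [counts[pair] for pair in pairs]
```

The resulting vector carries no information about the protein sequence itself, only about element-type contacts around the ligand.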
Performance for each cluster after
leave-cluster-out crossvalidation
Dataset & methods
Dataset
PDBbind07[2] was used to reproduce the RF-score
results. The PDBbind09 refined set was used to
demonstrate leave-cluster-out crossvalidation.
The range of activities within protein families is smaller
than the total range of activities. To avoid predictions
that benefit from protein-family recognition, we suggest
leave-cluster-out crossvalidation.
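
A minimal Python sketch of the leave-cluster-out loop follows. The function names and the `fit`/`predict` callables are hypothetical, and the mean-value learner is only a placeholder so that the example runs without extra libraries; it is not the Random Forest used for the results reported here.

```python
def leave_cluster_out_splits(groups):
    """Yield (left_out_label, train_indices, validation_indices) per group."""
    for label in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != label]
        valid = [i for i, g in enumerate(groups) if g == label]
        yield label, train, valid


def cross_validate(X, y, groups, fit, predict):
    """Return {group label: [(measured, predicted), ...]}.

    fit(X_train, y_train) -> model and predict(model, X_valid) -> list
    are supplied by the caller (hypothetical interface).
    """
    results = {}
    for label, train, valid in leave_cluster_out_splits(groups):
        model = fit([X[i] for i in train], [y[i] for i in train])
        predicted = predict(model, [X[i] for i in valid])
        results[label] = list(zip([y[i] for i in valid], predicted))
    return results


# Placeholder learner (assumption): always predicts the training-set mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, X_valid):
    return [model for _ in X_valid]
```

The key property is that no complex from the left-out cluster ever appears in the training indices, so a model cannot score well simply by recognizing the protein family.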
Introduction
Empirical rescoring functions for predicting protein-ligand
interaction energies can be trained on
large diverse collections of crystal structure
geometries augmented with binding data, such as the
PDBbind or the BindingMOAD database.
In a recent publication, remarkable success has been
demonstrated in predicting the free energy of
interaction based on atom counts in a 12 Å radius
around the ligand.[1]
However, the quality of prediction depends strongly on
the composition of the training and validation sets. We
suggest a generally applicable validation strategy that
is not prone to protein-family recognition pitfalls.
Table 1: Leave-cluster-out crossvalidation results on the PDBbind09 refined set.
Average R² = 0.21, average RMSE = 1.60
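
For reference, the three statistics reported per cluster can be computed as in the Python sketch below. R² is taken here to be the squared Pearson correlation, which is consistent with the reported value pairs (for example R = 0.73 and R² = 0.53), but this is an assumption about how the values were derived. The function names are hypothetical.

```python
import math


def pearson_r(measured, predicted):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        return float("nan")
    return cov / math.sqrt(var_m * var_p)


def rmse(measured, predicted):
    """Root mean squared error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def summarize(measured, predicted):
    """Return R, R2 (squared Pearson R, an assumption) and RMSE."""
    r = pearson_r(measured, predicted)
    return {"R": r, "R2": r * r, "RMSE": rmse(measured, predicted)}
```

Note that squaring R hides its sign: a cluster with a negative correlation between measured and predicted values still shows a positive R².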
Target-specific scoring functions
If crystal structures with corresponding activities are
available, target-specific scoring functions can be
generated. For the four largest clusters, we generated
scoring functions within each cluster and validated them
with standard out-of-bag crossvalidation.
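
The idea behind out-of-bag validation can be sketched as follows: each model in a bagged ensemble is trained on a bootstrap sample, and every complex is predicted only by those models that did not see it during training. The Python below shows these mechanics only. The function names, the `fit`/`predict` interface and the mean-value base learner are assumptions made to keep the example self-contained; a Random Forest implementation provides this estimate internally with decision trees as base learners.

```python
import random


def out_of_bag_predictions(X, y, fit, predict, n_estimators=100, seed=0):
    """Average, for each sample, the predictions of models that never saw it.

    Returns one value per sample, or None if a sample was in every bag.
    """
    rng = random.Random(seed)
    n = len(y)
    sums = [0.0] * n
    hits = [0] * n
    for _ in range(n_estimators):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap, with replacement
        in_bag = set(bag)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        model = fit([X[i] for i in bag], [y[i] for i in bag])
        for i, p in zip(oob, predict(model, [X[i] for i in oob])):
            sums[i] += p
            hits[i] += 1
    return [s / h if h else None for s, h in zip(sums, hits)]


# Placeholder base learner (assumption): predicts the mean of its bag.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, X_valid):
    return [model for _ in X_valid]
```

Within a single protein family this estimate is a fair test of a target-specific model, since training and validation complexes are meant to come from the same family.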
Validation Set:                          Out-of-bag within cluster     Cluster left out
Biological Target    Cluster  #samples       R     R²   RMSE             R     R²   RMSE
HIV Protease         A             188    0.67   0.45   1.17          0.11   0.01   1.91
Trypsin              B              74    0.86   0.74   0.71          0.73   0.53   1.04
Carbonic Anhydrase   C              57    0.62   0.38   1.64          0.56   0.31   1.68
Thrombin             D              52    0.69   0.48   1.09          0.37   0.14   2.03
Conclusion
• The advent of large, diverse datasets of protein-ligand
complexes makes it possible to generate scoring
functions with a QSAR-type fitting procedure.
• Global scoring functions must be validated with
protein-ligand complexes from protein families that
are not present in the training set. Otherwise
the validation looks overoptimistic (R² = 0.59 vs. R²
= 0.21).
• Target-specific scoring functions can be much more
predictive than global scoring functions, even when
trained with the same descriptors.
References
[1] Ballester, P.J. & Mitchell, J.B.O. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 26, 1169-1175 (2010).
[2] Cheng, T., Li, X., Li, Y., Liu, Z. & Wang, R. Comparative Assessment of Scoring
Functions on a Diverse Test Set. Journal of Chemical Information and Modeling 49,
1079-1093 (2009).
Acknowledgments
CK thanks the Novartis Education Office for a
Presidential Postdoc Fellowship.