QSAR Modelling of
Carcinogenicity for
Regulatory Use in Europe
Natalja Fjodorova, Marjana Novič,
Marjan Vračko, Marjan Tušar,
National Institute of Chemistry,
Ljubljana, Slovenia
CAESAR MEETING, 17.11.2008,
BERLIN, GERMANY
Overview
• Carcinogenic potency prediction: state of the art
• Data and methods used for
modeling by NIC_LJU
• Statistical performance of obtained
models and their evaluation
• Some findings about structural
alerts
• Conclusion
Carcinogenic potency
prediction: state of the art
The QSAR models can be divided into two
families:
• congeneric (for certain classes of chemicals);
external prediction accuracy for rodent
carcinogenicity is 58 to 71%
• noncongeneric (for different classes of
chemicals); accuracy is around 65%.
Further studies are required to improve the
predictive reliability for noncongeneric chemicals.
Ref.: Romualdo Benigni, Cecilia Bossa, Tatiana Netzeva, Andrew Worth.
Collection and Evaluation of (Q)SAR Models for Mutagenicity and
Carcinogenicity. EUR 22772 EN, 2007.
• The chemicals involved in the
study belong to different
chemical classes
(noncongeneric substances)
• The work addresses
industrial chemicals, with reference to
the REACH initiative. The aim is to
cover as much of chemical space as
possible
Carcinogenicity prediction
in the scope of the CAESAR project
Present state:
• compilation of dataset for carcinogenicity
• cross-checking of structures
• calculation of descriptors
• selection of descriptors
• development of models for carcinogenicity
• investigation of structural alerts (SA): ongoing
Dataset:
805 chemicals were extracted from rodent
carcinogenicity study findings for 1481 chemicals
taken from the Distributed Structure-Searchable
Toxicity (DSSTox) Public Database Network
http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html
derived from the Lois Gold Carcinogenic Potency Database
(CPDBAS)
Response:
for quantitative models
TD50_Rat: carcinogenic potency in rat
(expressed in mmol/kg body wt/day)
for qualitative models
yes/no principle
P: positive (active)
NP: not positive (inactive)
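The quantitative response is expressed in mmol/kg body wt/day. As a small illustration of what that unit implies, the sketch below converts a TD50 given in mg/kg body wt/day by dividing by the molecular weight. The function name and the example values are hypothetical and are not taken from the dataset.

```python
def td50_mg_to_mmol(td50_mg_per_kg_day, molecular_weight_g_per_mol):
    """Convert TD50 from mg/kg body wt/day to mmol/kg body wt/day."""
    if molecular_weight_g_per_mol <= 0:
        raise ValueError("molecular weight must be positive")
    # mg divided by g/mol gives mmol
    return td50_mg_per_kg_day / molecular_weight_g_per_mol


# Hypothetical compound: TD50 = 50 mg/kg/day, molecular weight = 100 g/mol
td50_mmol = td50_mg_to_mmol(50.0, 100.0)
print(td50_mmol)
```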
Training and test sets
805 chemicals were split into a
training set (644 chemicals) and a
test set (161 chemicals)
(done at the Helmholtz Centre for Environmental
Research – UFZ, Germany)
Distribution of active (P) and
inactive (NP) chemicals in the
total, training and test sets
Descriptors:
254 MDL descriptors calculated by MDL
QSAR software,
254MDLdes_806carcinogenicity.rar file
835 Dragon descriptors calculated by
DRAGON software,
Dragon_Carc.xls file
88 CODESSA descriptors calculated
using CODESSA software
88_CODESSA_descr_Cancer.xls file
Descriptors used for modeling
Model CARC_NIC_CPANN_01
27 MDL descriptors provided by NIC_LJU
(method for variable selection: Kohonen network and PCA).
Model CARC_NIC_CPANN_02
18 DRAGON and MDL descriptors were taken
from one of the best models (CARC_CSL_KNN_05) developed by
CSL. The goal was to compare results obtained for carcinogenicity
prediction using different methods.
Model CARC_NIC_CPANN_03
34 CODESSA descriptors were taken from one
of the best models (CARC_CSL_KNN_02) developed by CSL.
(methods for variable selection for models 2 and 3: cross-correlation
matrix, multicollinearity technique, Fisher ratio and genetic algorithm)
Counter Propagation Artificial Neural Network
Step 1: mapping of molecule Xs
(vector representing structure)
into the Kohonen layer
Step 2: correction of weights in
both the Kohonen and the
Output layer
Step 3: prediction of the four-dimensional
target (toxicity) Ts = carcinogenicity
Model input parameters
• Minimal correction factor: 0.01
• Maximum correction factor: 0.5
• Number of neurons in x direction: 35
• Number of neurons in y direction: 35
• Number of learning epochs: 100, 200, 400, 600, 800, 1000,
1200, 1400, 1600, 1800
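The three steps and the input parameters listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the software used for the CARC_NIC_CPANN models. The linear decay of the correction factor, the shrinking triangular neighbourhood, the non-toroidal square map and all function names are assumptions made for the sketch. The toy data at the end are invented.

```python
import random


def winner(kohonen, x):
    """Step 1: index (i, j) of the neuron closest to x (squared Euclidean distance)."""
    best, best_d = None, float("inf")
    for i, row in enumerate(kohonen):
        for j, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (i, j), d
    return best


def train_cpann(X, T, nx=35, ny=35, epochs=100, eta_max=0.5, eta_min=0.01, seed=0):
    """Train Kohonen and output layers; returns both weight maps."""
    rng = random.Random(seed)
    n_in, n_out = len(X[0]), len(T[0])
    kohonen = [[[rng.random() for _ in range(n_in)] for _ in range(ny)] for _ in range(nx)]
    output = [[[0.0] * n_out for _ in range(ny)] for _ in range(nx)]
    total, step = epochs * len(X), 0
    for _ in range(epochs):
        for x, t in zip(X, T):
            frac = step / max(total - 1, 1)
            eta = eta_max - (eta_max - eta_min) * frac  # assumed: linear decay
            radius = max(nx, ny) * (1.0 - frac)         # assumed: shrinking neighbourhood
            ci, cj = winner(kohonen, x)
            # Step 2: correct weights in both layers around the winner
            for i in range(nx):
                for j in range(ny):
                    dist = max(abs(i - ci), abs(j - cj))
                    if dist > radius:
                        continue
                    h = eta * (1.0 - dist / (radius + 1.0))  # assumed: triangular function
                    kw, ow = kohonen[i][j], output[i][j]
                    for k in range(n_in):
                        kw[k] += h * (x[k] - kw[k])
                    for k in range(n_out):
                        ow[k] += h * (t[k] - ow[k])
            step += 1
    return kohonen, output


def predict(kohonen, output, x):
    """Step 3: read the output-layer weights at the winning neuron."""
    i, j = winner(kohonen, x)
    return output[i][j]


# Invented toy data: two well-separated groups with targets 0 and 1
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
T = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]
kohonen, output = train_cpann(X, T, nx=4, ny=4, epochs=50)
low = predict(kohonen, output, [0.05, 0.05])[0]
high = predict(kohonen, output, [0.95, 0.95])[0]
print(low, high)
```

The defaults mirror the 35 x 35 map and the correction factors 0.5 and 0.01 from the slide, but the toy run uses a 4 x 4 map because a pure-Python loop over a full-size map and hundreds of descriptors would be slow.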
Statistical evaluation of models
Confusion matrix for two classes
True positive (TP) True negative (TN)
False positive (FP) False negative (FN)
Accuracy (AC) = (TN+TP)/(TN+TP+FN+FP)
Sensitivity (SE) = TP/(TP+FN)
Specificity (SP) = TN/(TN+FP)
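The three formulas above translate directly into code. The sketch below is a plain implementation of those definitions. The function name and the counts in the example are hypothetical and are not results from the models.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    sensitivity = tp / (tp + fn)  # share of actual positives predicted positive
    specificity = tn / (tn + fp)  # share of actual negatives predicted negative
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only
ac, se, sp = confusion_metrics(tp=8, tn=6, fp=4, fn=2)
print(ac, se, sp)
```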
Statistical performance of models
[Figure: Threshold vs. wrong prediction rate for test set (model 1). x-axis: threshold, 0.00 to 1.00; y-axis: wrong prediction rate, 0.00 to 1.00; curves: FP_rate and FN_rate.]
Raising the threshold from 0 to 1 decreases the
number of false positives and increases the number of
false negatives.
This tendency is common to all three of our models (1, 2 and 3).
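The opposite movement of the two error types can be reproduced with a simple threshold sweep. The sketch below assumes that FP_rate means FP/(FP+TN) and FN_rate means FN/(FN+TP), and that a compound is called positive when its model output is at or above the threshold. The slides do not state either convention. The scores and labels are invented.

```python
def error_rates(scores, labels, threshold):
    """FP rate and FN rate when score >= threshold is predicted positive (assumed rule)."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    negatives = labels.count(0)
    positives = labels.count(1)
    return fp / negatives, fn / positives


# Invented model outputs and true classes (1 = P, 0 = NP)
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

# One (threshold, FP rate, FN rate) row per threshold from 0.0 to 1.0
sweep = [(t / 10, *error_rates(scores, labels, t / 10)) for t in range(0, 11)]
for threshold, fp_rate, fn_rate in sweep:
    print(f"{threshold:.1f}  FP_rate={fp_rate:.2f}  FN_rate={fn_rate:.2f}")
```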
[Figure: Threshold vs. accuracy, SE and SP for test set (model 1). x-axis: threshold, 0.00 to 1.00; y-axis: accuracy, SE and SP, 0.00 to 1.00; curves: SE, SP, ACC. Marked point: threshold = 0.45, accuracy = 0.68, SE = 0.71, SP = 0.65.]
In the figure we have marked the maximum accuracy and
corresponding thresholds. For model 1 the optimal threshold
is equal to 0.45. In this case accuracy has a maximal value
of 0.68, sensitivity is 0.71 and specificity is 0.65.
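Locating the threshold of maximum accuracy, as marked in the figure, amounts to a grid search. The sketch below is one possible way to do it, assuming the same rule as before (output at or above the threshold means positive). The function names, the grid and the data are hypothetical. When several thresholds tie, this version returns the lowest one.

```python
def accuracy_at(scores, labels, threshold):
    """Accuracy when score >= threshold is predicted positive (assumed rule)."""
    correct = sum(1 for s, y in zip(scores, labels) if (s >= threshold) == (y == 1))
    return correct / len(labels)


def best_threshold(scores, labels, grid):
    """Return (threshold, accuracy) with maximal accuracy; ties go to the lowest threshold."""
    t_best = max(grid, key=lambda t: accuracy_at(scores, labels, t))
    return t_best, accuracy_at(scores, labels, t_best)


# Invented model outputs and true classes (1 = P, 0 = NP)
scores = [0.10, 0.20, 0.30, 0.45, 0.50, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
grid = [i / 10 for i in range(11)]

threshold, accuracy = best_threshold(scores, labels, grid)
print(threshold, accuracy)
```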
[Figure: Threshold vs. accuracy, SE and SP for test set (model 2). x-axis: threshold, 0.00 to 1.00; y-axis: accuracy, SE and SP, 0.00 to 1.00; curves: SE, SP, ACC. Marked point: threshold = 0.6, accuracy = 0.70, SE = 0.69, SP = 0.72.]
For model 2 the optimal threshold for the test set is 0.6, where
accuracy reaches its maximal value of 0.70. Sensitivity at this
point is 0.69 and specificity is 0.72.
[Figure: Threshold vs. accuracy, SE and SP for test set (model 3). x-axis: threshold, 0.00 to 1.00; y-axis: accuracy, SE and SP, 0.00 to 1.00; curves: SE, SP, ACC. Marked point: threshold = 0.5, accuracy = 0.68, SE = 0.70, SP = 0.62.]
For model 3 the optimal threshold is 0.5, the maximum
accuracy is 0.68, sensitivity is 0.70 and specificity is 0.62.
Changing the threshold shifts the balance between sensitivity and specificity.
It may be used to increase the number of correctly predicted carcinogens or
non-carcinogens.
[Figure: ROCs for CARC_NIC_CPANN models 01, 02 and 03. x-axis: false positive rate (1-specificity), 0.0 to 1.0; y-axis: true positive rate (sensitivity), 0.0 to 1.0; curves: training and test sets for models 01, 02 and 03.]
A model with no predictive ability yields the diagonal line.
The closer the curve tends towards (0,1), the more accurate the predictions are.
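A ROC curve and its area can be computed from model outputs without any plotting library. The sketch below lowers the threshold through each score in turn and integrates with the trapezoidal rule. It assumes all scores are distinct (tied scores would need to be grouped). The function names and data are invented for illustration.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit compounds from highest to lowest score; assumes distinct scores
    for _, y in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented model outputs and true classes (1 = P, 0 = NP)
scores = [0.10, 0.20, 0.30, 0.45, 0.50, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
area = auc(roc_points(scores, labels))
print(area)
```

A model with no predictive ability gives an area near 0.5 (the diagonal), and a curve that passes through (0,1) gives an area of 1.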
Accuracy of prediction and area under
the curve (AUC) (models 1,2,3)
Study of structural alerts for our
dataset, collected using the Benigni
rules in the Toxtree program
• We have extracted the following alerts for
our dataset of 805 compounds
• GA: genotoxic alerts
• nGA: non-genotoxic alerts
• NA: no carcinogenic alerts
• Then we calculated how many
chemicals with each type of alert fall into the NP (not positive) and P (positive) classes.
P (positive) and NP (not
positive) relate only to
results for rats
For substances with
GA, about 2/3 belong to
P and about 1/3 to
NP
For substances with nGA,
about half
belong to P and
half to NP
For substances with NA (no carcinogenic alerts),
about 2/3 belong to NP
and 1/3 belong to
P
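The fractions quoted above come from counting P and NP compounds within each alert class. A minimal sketch of that tally is shown below. The function name is hypothetical and the eight records are invented so that they roughly mirror the stated proportions. They are not compounds from the dataset.

```python
from collections import Counter


def positive_fraction_by_alert(records):
    """records: iterable of (alert_class, outcome) pairs with outcome 'P' or 'NP'."""
    totals, positives = Counter(), Counter()
    for alert, outcome in records:
        totals[alert] += 1
        if outcome == "P":
            positives[alert] += 1
    return {alert: positives[alert] / totals[alert] for alert in totals}


# Invented records for illustration only
records = [
    ("GA", "P"), ("GA", "P"), ("GA", "NP"),
    ("nGA", "P"), ("nGA", "NP"),
    ("NA", "NP"), ("NA", "NP"), ("NA", "P"),
]
fractions = positive_fraction_by_alert(records)
print(fractions)
```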
Needs for future investigations
Conclusion
• Quantitative models with the dependent variable tumorigenic dose TD50 for rats have shown low
predictive power, with a correlation coefficient for
the test set of less than 0.5.
• Conversely, qualitative models demonstrated
excellent internal performance
(accuracy of the training set is 91-93%) and
good external performance (accuracy of the test
set is 68-70%, sensitivity is 69-73% and
specificity 63-72%).
• Changing the threshold shifts the balance between
sensitivity and specificity. It may be used to
increase the number of correctly predicted
carcinogens or non-carcinogens.