CHEMOMETRIC METHODOLOGIES FOR THE MODELLING OF
HETEROGENEOUS CHEMICALS TOXICITY: DATASET
REPRESENTATIVITY AS THE ABSOLUTE ESSENTIAL
Paola Gramatica1, Viviana Consonni2, Manuela Pavan2, Pamela Pilutti1 and Ester Papa1
1 QSAR and Environmental Chemistry Research Unit - INSUBRIA University (Varese - ITALY)
2 Milano Chemometrics & QSAR Research Group - Milano Bicocca University (Milano - ITALY)
e.mail: [email protected]; web: http//dipbsf.uninsubria.it/qsar/
ABSTRACT
The BEAM EU research project focuses on the risk assessment of mixture toxicity. A data set of 124 heterogeneous chemicals of high concern as environmental pollutants has been studied for toxicity on Scenedesmus vacuolatus. Several chemometric techniques were applied to the experimental toxicity data with the aim of developing a “universal” QSAR able to describe and predict the toxicity of structurally heterogeneous and dissimilarly acting chemicals. The chemical structure of the compounds was described with several types of theoretical molecular descriptors calculated by the software DRAGON [1]. The Genetic Algorithm approach was used as the Variable Subset Selection method applied to OLS regression. In order to verify the predictive capability of the developed QSAR models, a training set was selected by Experimental Design. OLS models have been developed on 76 chemicals selected as the training set for the two parameters “a” (correlated with EC50 values) and “b” (steepness) of the Weibull model. Counter Propagation Artificial Neural Network (CP-ANN) approaches were also used to verify the utility of non-linear techniques. The methodologies used, applied to the overall data set of 124 chemicals, showed unsatisfactory performance in validation, demonstrating that a “universal” QSAR model is not possible when chemicals are significantly different in structure and mode of action. This highlights the essential need for data set representativity for the successful application of QSAR. Moreover, QSAR models on the limited data sets of more similar compounds, in both structure and mode of action, show high predictive performance.
MATERIALS and METHODS
Experimental data
The QSAR models have been developed on the EC50 values of 124 chemicals with defined mode of action, tested experimentally for toxicity on Scenedesmus vacuolatus by the research group of Prof. Grimme (Bremen University, EU project BEAM EVK1-1999-00012), and on the two parameters “a” and “b” of the Weibull model. Parameter “a” expresses the location of the sigmoidal toxicity curve and is tightly correlated with the EC50 value, while parameter “b” expresses the steepness of the toxicity curve. The chemicals in this data set are currently in common use as antifouling agents, antioxidants, bactericides, chemotherapeutics, disinfectants, fungicides, herbicides, insecticides, tools in physiological research and industrial chemicals.
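The poster does not state the Weibull equation itself. As a minimal sketch, assume the parameterisation commonly used for algal concentration-response curves, E = 1 - exp(-exp(a + b log10(c))). This functional form, the function names and the numeric values below are assumptions for illustration only. Under this form the link between “a” and EC50 can be made explicit:

```python
import math

def weibull_effect(conc, a, b):
    """Fractional effect at concentration `conc` (assumed Weibull form)."""
    return 1.0 - math.exp(-math.exp(a + b * math.log10(conc)))

def ec50_from_weibull(a, b):
    """Concentration giving 50% effect: solve 0.5 = 1 - exp(-exp(a + b*x))."""
    log10_ec50 = (math.log(math.log(2.0)) - a) / b
    return 10.0 ** log10_ec50

# Hypothetical parameter values, not taken from the BEAM data set.
a, b = 1.0, 2.0
ec50 = ec50_from_weibull(a, b)
effect_at_ec50 = weibull_effect(ec50, a, b)
print(ec50, effect_at_ec50)
```

For a fixed positive steepness “b”, a larger “a” shifts the curve to lower concentrations, which is consistent with “a” being described as a location parameter tightly correlated with EC50.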
Molecular descriptors
The molecular descriptors were calculated by the software DRAGON [1]. A total of 1500 molecular descriptors of different kinds were used to describe the chemical diversity of the compounds. The descriptor typology is:
0D: Constitutional descriptors.
1D: Empirical, Functional groups, Properties, Atom-centred fragments descriptors.
2D: Autocorrelations, Topological, Molecular walk counts, Galvez topological charge indices, BCUT descriptors.
3D: Geometrical, Randic molecular profiles, WHIM, GETAWAY, RDF, 3D-MoRSE, Charge descriptors.
In addition, five quantum-chemical descriptors (HOMO, LUMO and HOMO-LUMO gap energies, heat of formation and ionization potential Ei,v), calculated by MOPAC (PM3 method), and experimental log Kow were always added as molecular descriptors.
Chemometric methods
Multiple Linear Regression analysis and Variable Selection were performed with the software MOBY-DIGS [2], using the Ordinary Least Squares (OLS) regression method and Genetic Algorithm Variable Subset Selection (GA-VSS). In order to verify the predictive capability of the developed QSAR models, a test set was selected by an Experimental Design procedure with the software DOLPHIN [3]. Regression diagnostic tools, such as residual plots and Williams plots, were used to check the quality of the best models and to define their applicability with regard to the chemical domain. Counter Propagation Artificial Neural Network (CP-ANN) approaches were also used to verify the utility of non-linear techniques. For a stronger evaluation of model applicability for prediction on new chemicals, external validation (verified by Q2ext) of all models is also recommended [4] and was performed here.
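The validation statistics named in this section can be illustrated with a short NumPy sketch. It is not the MOBY-DIGS implementation: the data are synthetic, the function names are hypothetical, and the Q2ext formula (test-set PRESS divided by the sum of squares around the training-set mean) is one common definition, assumed here because the poster does not spell it out.

```python
import numpy as np

def fit_ols(X, y):
    """OLS coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def r2_fit(X, y):
    """Fitted R2 on the training objects."""
    res = y - predict(fit_ols(X, y), X)
    return 1.0 - np.sum(res ** 2) / np.sum((y - y.mean()) ** 2)

def q2_loo(X, y):
    """Leave-one-out Q2: each object is predicted by a model fitted without it."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_ols(X[keep], y[keep])
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def q2_ext(X_tr, y_tr, X_te, y_te):
    """External Q2 (assumed definition): test PRESS vs. spread around the training mean."""
    coef = fit_ols(X_tr, y_tr)
    press = np.sum((y_te - predict(coef, X_te)) ** 2)
    return 1.0 - press / np.sum((y_te - y_tr.mean()) ** 2)

# Synthetic stand-in for a descriptor matrix: 101 objects, 6 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(101, 6))
beta = np.array([1.0, 0.5, 1.5, 1.0, -2.5, -1.4])   # hypothetical coefficients
y = 0.2 + X @ beta + rng.normal(scale=0.3, size=101)

# A random 76/25 split is used here only for brevity; the poster selects
# the split by experimental design.
X_tr, y_tr, X_te, y_te = X[:76], y[:76], X[76:], y[76:]
r2_val = r2_fit(X_tr, y_tr)
q2_val = q2_loo(X_tr, y_tr)
q2_ext_val = q2_ext(X_tr, y_tr, X_te, y_te)
print(r2_val, q2_val, q2_ext_val)
```

For OLS the leave-one-out Q2 can never exceed the fitted R2, so a large gap between the two is a first warning sign of an unstable model.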
QSAR MODELLING ON THE OVERALL DATASET
Did not work!!!

Ordinary Least Squares regression by Genetic Algorithm Variable Selection (OLS - GA)
Not satisfactory model

Response     N. Obj.  N.Var  Variables                                          Q2     R2
Log1/EC50    124      7      IC5-nOCON-nNHRPh-C032-N072-MLOGKow2-LogKowExp      56.25  61.49
Weibull “a”  121      7      IC5-nCONR2-nOCON-nNHRPh-N072-MLOGKow2-LogKowExp    56.91  62.81
Weibull “b”  122      7      AAC-GGI3-GATS1p-n#CR-nCOOR-H050-Hy                 48.3   53.52

Unfortunately, the obtained models were found to be unsatisfactory due to their low predictive capability (even after the elimination of some outliers). A “universal” QSAR model is not possible when the chemicals are significantly different in both structure and mode of action. For this reason, we decided to model the EC50 data for a reduced data set of 101 chemicals, including only the chemicals with the more represented modes of action: amino acid biosynthesis, DNA synthesis and function, lipid biosynthesis, photosynthetic electron transport, steroid biosynthesis and unspecific action.

Regression by Counter Propagation Artificial Neural Networks (CP-ANN)
Not satisfactory model

Response     N.Tr.  N.Test  N.Var  Variables
Log1/EC50    70     47      13     Principal components
R2, Q2 and Q2EXT (values in the order they appear in the transcript): 64.6, 50.7, 36.6

A CP-ANN approach was applied to the experimental EC50 toxicity values of a selected training set of 70 chemicals in order to develop a QSAR regression model with a non-linear technique. The 13 significant principal components of the molecular descriptors were used as predictor variables. The best model was obtained with a map of 8x8 neurons and 50 learning epochs. The obtained model turned out to be unsatisfactory due to its low predictive power.
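To make the CP-ANN idea concrete, here is a minimal counter-propagation sketch in NumPy: a Kohonen map is trained on the inputs, and a parallel output layer stores the response associated with each neuron. It is an illustration under simplifying assumptions (linearly decaying learning rate, shrinking square neighbourhood, synthetic data), not the software or settings used for the poster, except for the 8x8 map and 50 epochs quoted above.

```python
import numpy as np

def train_cpann(X, y, side=8, epochs=50, seed=0):
    """Train a side x side counter-propagation map; returns (input weights, output weights)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.uniform(X.min(axis=0), X.max(axis=0), size=(side * side, p))
    out = np.full(side * side, y.mean())
    rows, cols = np.divmod(np.arange(side * side), side)
    grid = np.column_stack([rows, cols])              # neuron positions on the map
    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        lr = 0.5 * (1.0 - frac) + 0.01 * frac         # learning rate decays linearly
        radius = (side / 2.0) * (1.0 - frac)          # neighbourhood shrinks to the winner
        for i in rng.permutation(n):
            winner = np.argmin(np.sum((W - X[i]) ** 2, axis=1))
            d = np.max(np.abs(grid - grid[winner]), axis=1)
            h = np.where(d <= radius, lr * (1.0 - d / (radius + 1.0)), 0.0)
            W += h[:, None] * (X[i] - W)              # Kohonen (input) layer update
            out += h * (y[i] - out)                   # output layer update
    return W, out

def predict_cpann(W, out, X):
    """Prediction = output weight of the winning neuron for each object."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return out[np.argmin(d2, axis=1)]

# Synthetic stand-in: 70 training objects described by 13 component scores.
rng = np.random.default_rng(1)
X = rng.normal(size=(70, 13))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=70)

W, out = train_cpann(X, y, side=8, epochs=50)
pred = predict_cpann(W, out, X)
print(pred[:5])
```

With 64 neurons and 70 training objects, most neurons represent one object or none, so a map of this size can fit the training set closely while generalising poorly. That is one possible reading of the low predictive power reported above.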
QSAR MODELLING ON A MORE REPRESENTATIVE DATASET
Now it works!!!
Satisfactory predictive power

OLS model obtained on selected training set

Response    N.Tr.  N.Test  N.Var  Variables                                 Q2     R2     Q2LMO(50%)  Q2EXT
Log1/EC50   101    --      6      nCONR2-nCONN-nNHRPh-KLOGKow-PJI3-HATS3u   77.06  80.2   75.58       --
Log1/EC50   76     25      6      nCONR2-nCONN-nNHRPh-KLOGKow-PJI3-HATS3u   76.44  80.71  74.2        79.5

Log1/EC50 = 0.24 + 0.97 nNHRPh + 0.44 KLOGKow + 1.52 nCONR2 + 1.04 nCONN - 2.49 PJI3 - 1.36 HATS3u

The best models with good predictive power, on the 101 chemicals and on the split training set, are based on the same molecular descriptors: counts of different nitrogen-containing groups (nCONR2, nCONN, nNHRPh), calculated LogKow (KLOGKow), a 3D shape descriptor (PJI3) and a 3D GETAWAY autocorrelation descriptor (HATS3u). The regression line of the externally validated model is reported (outliers for the training and test set chemicals are highlighted).

[Figure: predicted vs. experimental log1/EC50 for the training set and test set; labelled chemicals: Metribuzin, Bensulfuron-methyl, 2,4-D, Cyproconazole, Enoxacin]

Regression by Counter Propagation Artificial Neural Networks (CP-ANN)

Response    N.Tr.  N.Test  N.Var  Variables
Log1/EC50   101    -       4      KlogKow-nCONR2-nCONN-nNHRPh
Log1/EC50   76     25      4      KlogKow-nCONR2-nCONN-nNHRPh
R2, Q2 and Q2EXT (values in the order they appear in the transcript): 88.4, 80.4, -, 84.3, 89.5, 67.5

The CP-ANN approach was applied to the experimental EC50 toxicity values of a reduced data set of 101 chemicals, which includes only the chemicals with the more represented modes of action. As predictor variables we used the four descriptors most frequently present in the population of OLS models. The best model was obtained with a map of 8x8 neurons and 100 learning epochs.
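As a worked example, the OLS equation reported in this section (Log1/EC50 = 0.24 + 0.97 nNHRPh + 0.44 KLOGKow + 1.52 nCONR2 + 1.04 nCONN - 2.49 PJI3 - 1.36 HATS3u) can be evaluated for one compound. The coefficients are those of the poster; the descriptor values and the function name are hypothetical and do not correspond to any chemical in the data set.

```python
# Coefficients as reported on the poster for the six-descriptor OLS model.
INTERCEPT = 0.24
COEFFS = {
    "nNHRPh": 0.97,
    "KLOGKow": 0.44,
    "nCONR2": 1.52,
    "nCONN": 1.04,
    "PJI3": -2.49,
    "HATS3u": -1.36,
}

def predict_log_inv_ec50(descriptors):
    """Evaluate the linear model for one compound given its descriptor values."""
    return INTERCEPT + sum(c * descriptors[name] for name, c in COEFFS.items())

# Hypothetical descriptor values, chosen only to show the arithmetic.
example = {"nNHRPh": 1, "KLOGKow": 2.5, "nCONR2": 0, "nCONN": 1,
           "PJI3": 0.8, "HATS3u": 0.5}
value = predict_log_inv_ec50(example)
print(round(value, 3))  # 0.24 + 0.97 + 1.10 + 0 + 1.04 - 1.992 - 0.68 = 0.678
```

A prediction of this kind is only meaningful for compounds inside the chemical domain of the model, which the poster checks with the leverage approach.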
SUBSETS OF CHEMICALS WITH THE SAME MODE OF ACTION

OLS model on photosynthetic electron transport inhibitors (49 chemicals)

Response     N. Obj.  N.Var  Variables                   Q2     R2     Q2LMO(30%)
Log1/EC50    49       5      TIE-S3K-IVDM-T(Cl..Cl)-MR   73.46  79.99  70.45
Weibull “a”  49       5      Xu-X2A-GATS3v-HATS0e-MR     79.39  84.46  78.4
Weibull “b”  49       5      nR06-BEHe3-R4u-C026-MR      74.81  81.61  73.1

Weibull “a” = 19.16 + 0.62 MR - 1.85 Xu - 117.23 X2A + 6.03 GATS3v + 7.18 HATS0e
[Figure: predicted vs. experimental Weibull “a”]

OLS model on steroid biosynthesis inhibitors (17 chemicals)

Response     N. Obj.  N.Var  Variables             Q2     R2     Q2LMO(30%)
Log1/EC50    17       3      H4v-HTe-C-004         88.72  93.45  83.18
Weibull “a”  17       3      ATS4m-H4v-H8v         83.42  90.85  81.58
Weibull “b”  16       3      GATS1v-GATS3e-R3u+    78.78  86.4   76.82

log1/EC50 = 1.68 + 7.46 H4v - 0.73 C-004 - 0.16 HTe
[Figure: predicted vs. experimental log1/EC50]

OLS model on compounds with unspecific mode of action (18 chemicals)

Response     N. obj  N. var  Variables              Q2     R2     Q2LMO(30%)
Log1/EC50    18      3       GGI10-nCs-LogKowExp    80.11  89.03  76.93
Weibull “a”  17      3       nC-H8v-R2e+            78.85  86.8   74.17
Weibull “b”  16      3       PJI2-HOMA-C040         83.7   90.29  77.42

Weibull “b” = 5.95 + 3.17 HOMA - 3.31 C-040 - 3.30 PJI2
[Figure: predicted vs. experimental Weibull “b”]

Chemicals labelled in the plots: Penconazole, Naphthalene, Buturon, Metribuzin, Parathion.
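The Q2LMO(30%) values reported for the subset models come from leave-many-out cross-validation, in which a fraction of the objects is repeatedly left out and predicted. The sketch below shows one common way to compute such a statistic. The pooling of PRESS over random rounds, the number of rounds and the synthetic data are assumptions; the exact procedure of the software used for the poster is not described in the source.

```python
import numpy as np

def q2_lmo(X, y, frac_out=0.3, n_rounds=200, seed=0):
    """Leave-many-out Q2: repeatedly hold out `frac_out` of the objects,
    refit OLS on the rest and pool the prediction errors."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_out = max(1, int(round(frac_out * n)))
    press, ss = 0.0, 0.0
    for _ in range(n_rounds):
        idx = rng.permutation(n)
        out, keep = idx[:n_out], idx[n_out:]
        A = np.column_stack([np.ones(len(keep)), X[keep]])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        pred = coef[0] + X[out] @ coef[1:]
        press += np.sum((y[out] - pred) ** 2)
        ss += np.sum((y[out] - y[keep].mean()) ** 2)
    return 1.0 - press / ss

# Synthetic stand-in for a 49-object, 5-descriptor subset model.
rng = np.random.default_rng(2)
X = rng.normal(size=(49, 5))
beta = np.array([1.0, -1.0, 0.5, 2.0, -0.5])      # hypothetical coefficients
y = 1.0 + X @ beta + rng.normal(scale=0.3, size=49)

q2_lmo_30 = q2_lmo(X, y, frac_out=0.3)
print(q2_lmo_30)
```

Because each round removes many objects at once, this statistic is a harsher stability check than leave-one-out, which matters for the small subsets (16 to 18 objects) modelled here.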
CONCLUSIONS
The QSAR models obtained on reduced data sets, selected for representativity and for similarity of mode of action, are all of good quality. Their predictive performance and stability have been verified by internal validation (Q2 and Q2LMO). The chemical domain of applicability of the proposed models for new chemicals must always be verified by the leverage approach, taking into account that some of these models have been developed on relatively small data sets.
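A leverage check for a new chemical can be sketched as follows. The leverage is the diagonal of the hat matrix extended to new objects, and the warning threshold h* = 3(p + 1)/n (p model descriptors, n training objects) is a widely used convention in Williams plots. The threshold is assumed here because the poster does not state it, and the data and function names are hypothetical.

```python
import numpy as np

def leverages(X_train, X_new):
    """Leverage of each new object with respect to the training design matrix."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    B = np.column_stack([np.ones(len(X_new)), X_new])
    core = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", B, core, B)

def warning_leverage(n_train, n_vars):
    """Assumed cut-off h* = 3 (p + 1) / n."""
    return 3.0 * (n_vars + 1) / n_train

# Synthetic stand-in: 17 training objects, 3 descriptors.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(17, 3))
X_new = np.array([[0.0, 0.0, 0.0],      # close to the training centroid
                  [10.0, 10.0, 10.0]])  # far outside the training domain

h_train = leverages(X_train, X_train)
h_new = leverages(X_train, X_new)
h_star = warning_leverage(len(X_train), X_train.shape[1])
inside_domain = h_new <= h_star
print(h_new, h_star, inside_domain)
```

With only 17 training objects and 3 descriptors the cut-off is about 0.71, so the check flags extrapolations readily, in line with the caution above about models built on small data sets.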
All the proposed models are based on different molecular descriptors, mainly theoretical, encoding different features of the chemical structures related to the modelled end-points. The logKow parameter is selected only in the models for unspecific mode of action (probably because it is related to baseline toxicity) and in the global models, thus demonstrating that other molecular descriptors more closely related to the chemical structure are able to describe and predict the toxicity.
Financially supported by the Commission of the European Union (BEAM EVK1-1999-00012)
REFERENCES
[1] Todeschini R., Consonni V. and Pavan M. DRAGON, version 2.1-2002 (WINDOWS/PC); Milano, Italy. Program for the calculation of molecular descriptors from HyperChem, Tripos, MDL file, SYBYL molfile formats from ChemOffice and Tripos molecular design software. Free download available at: http://www.disat.unimib.it/chm
[2] Todeschini R. Moby Digs/Evolution, rel. 2.0, Talete, Milano, Italy.
[3] Todeschini R. and Mauri A. 2000. DOLPHIN - Software for Optimal Distance-Based Experimental Design, rel. 1.1 for Windows, Talete srl, Milan (Italy).
[4] Tropsha A., Gramatica P. and Gombar V.K. 2003. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. Quant. Struct.-Act. Relat. 22.