LINEAR MODELLING AND PREDICTION OF BIOCONCENTRATION FACTOR
(BCF) BY THEORETICAL MOLECULAR DESCRIPTORS
Papa Ester - Gramatica Paola
Dep.Struct.Funct.Biol. - QSAR Research Unit - University of Insubria ( Varese - Italy )
Web: http://fisio.dipbsf.uninsubria.it/qsar/
e-mail: [email protected]
ABSTRACT
Bioconcentration by aquatic biota is an important factor in assessing the environmental behaviour and potential hazard
of a chemical, mainly for Persistent Bioaccumulative and Toxic compounds (PBTs). Since the experimental
determination of BCF values is expensive and time consuming, estimation methods have been widely used to supply
missing data. Log P (log Kow) is the most widely used physicochemical descriptor for modelling bioconcentration, but for
highly hydrophobic chemicals non-linear models must be applied. Analogous results have been obtained by modelling
with connectivity indices and polarity correction factors. In this study, the application of the Genetic Algorithm as
Variable Subset Selection (GA-VSS) to a wide set (more than 800) of molecular descriptors covering different structural
aspects, such as 1D-constitutional, 2D-topological and 3D descriptors (i.e. WHIM and GETAWAY descriptors), produces
highly predictive models of BCF in fish for 238 non-ionic organic compounds. The best linear regression model (by
Ordinary Least Squares (OLS) regression), in which log Kow was not selected as a molecular descriptor, was
validated for its predictivity by leave-one-out, leave-more-out and external validation (the optimal and
most representative test set was selected by the Experimental Design technique). The approach shows that a good
model (Q2ext = 87.7) can be obtained without using log Kow or introducing polarity correction factors, simply by applying
theoretical molecular descriptors calculable from the molecular structure.
INTRODUCTION
Bioconcentration is the process of accumulation of water-borne chemicals by fish and other aquatic animals through non-dietary
routes, i.e. by absorption from the water via the respiratory surface and/or the skin (1,2). The Bioconcentration Factor (BCF) is
defined, for a specific compound, as the equilibrium ratio of the chemical concentration in the exposed organism to the concentration
of the dissolved chemical in the aquatic environment. BCF can therefore be used as an estimate of a chemical's tendency to
accumulate in an aquatic organism, and its estimation is a crucial task in the identification and control of chemicals such as Persistent
Bioaccumulative and Toxic compounds (PBTs). The bioconcentration of chemicals is usually estimated by correlating their BCFs with
hydrophobicity, but difficulties arise in modelling extremely hydrophobic and large chemicals. Because of these problems, different
approaches using theoretical molecular descriptors of different kinds have been applied, with the principal aim of taking into account
the many structural aspects of a molecule that can be relevant in determining bioaccumulation. The objective of the present study is to
propose new QSAR models for BCF prediction, validated by internal and external validation and applicable to a wide range of organic
compounds of different chemical structures. Finally, the BCF values predicted by these models are compared with those obtained by
the Molecular Connectivity Indices (MCI)-based models of Lu et al. (3) and the Kow-based models of Meylan et al. (4), applied by the U.S.
EPA (BCFWIN), in order to verify the reliability and predictive performance of the different estimation models.
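As a small illustration of the BCF definition given above (equilibrium concentration in the organism divided by the dissolved concentration in water), the following Python sketch computes BCF and log BCF. The function name and the two concentration values are hypothetical and are not taken from the data set.

```python
import math


def bioconcentration_factor(c_organism: float, c_water: float) -> float:
    """Equilibrium ratio of the concentration in the organism to the
    dissolved concentration in water (both in consistent units)."""
    if c_organism < 0 or c_water <= 0:
        raise ValueError("concentrations must be non-negative, and c_water > 0")
    return c_organism / c_water


# Hypothetical values for illustration only (e.g. mg/kg in fish, mg/L in water).
bcf = bioconcentration_factor(c_organism=2.5, c_water=0.005)
log_bcf = math.log10(bcf)
print(f"BCF = {bcf:.0f}, log BCF = {log_bcf:.2f}")
```

The models discussed in this work predict the logarithmic value, log BCF, rather than BCF itself.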
MATERIALS and METHODS
EXPERIMENTAL DATA
In this work we used BCF data measured in fish for 238 non-ionic compounds, collected from an extensive literature review by Lu et al. (3). Because our goal is a comparison with that work, no effort was made to verify these data: only
acrolein was deleted from the original data set, as it was an outlier.
MOLECULAR DESCRIPTORS
The molecular structures of the studied compounds were described by several molecular descriptors calculated with the software DRAGON of Todeschini et al. (5). A total of 1166 molecular descriptors of different kinds were calculated to describe
the chemical diversity of the compounds. Descriptors with constant values and pair-correlated descriptors (with a correlation of 1) were excluded, leaving 965 DRAGON descriptors as candidates for the variable selection by GA.
The descriptor typology is:
0D: constitutional descriptors (atom and group counts).
1D: functional groups, atom-centred fragments and empirical descriptors.
2D: BCUTs, Galvez indices from the adjacency matrix, walk counts, various autocorrelations from the molecular graph and topological descriptors.
3D: Randic molecular profiles from the geometry matrix, WHIMs (6-7), GETAWAY (8) and geometrical descriptors.
In addition, 5 quantum-chemical descriptors (HOMO, LUMO, deltaHOMO-LUMO, energies and ionization potential, calculated by MOPAC with the PM3 method (9)) and log Kow (taken from the EPIWIN package (10)) were used.
The Genetic Algorithm was applied to this pool of descriptors (965 + 5 + 1 = 971) after eliminating 237 molecular descriptors that were individually unrelated to the response. The final set of molecular descriptors used as input thus consists of 734 descriptors.
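The first two reduction steps described in this section (dropping constant descriptors and one member of each pair with a correlation of 1) can be sketched in Python with NumPy as below. This is a minimal sketch, not the DRAGON or MOBY DIGS procedure: the function name, the tolerance `corr_tol`, the use of the absolute correlation and the toy matrix are all assumptions. The later elimination of descriptors unrelated to the response is not sketched, because the criterion used is not stated.

```python
import numpy as np


def prefilter_descriptors(X, names, corr_tol=1e-12):
    """Drop constant columns, then the later member of every pair of
    columns whose absolute Pearson correlation equals 1 (within corr_tol)."""
    X = np.asarray(X, dtype=float)

    # Step 1: a descriptor with zero range is constant and carries no information.
    varying = np.ptp(X, axis=0) > 0
    X = X[:, varying]
    names = [n for n, keep in zip(names, varying) if keep]

    # Step 2: keep a column only if it is not perfectly correlated
    # with any column that has already been kept.
    r = np.atleast_2d(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(abs(r[j, i]) < 1.0 - corr_tol for i in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]


# Toy matrix (hypothetical values): "a" is constant and "c" equals 2 * "b".
X_toy = [[1, 2, 4, 5],
         [1, 3, 6, 1],
         [1, 5, 10, 2]]
X_reduced, kept_names = prefilter_descriptors(X_toy, ["a", "b", "c", "d"])
print(kept_names)
```

Keeping the first member of each correlated pair is an arbitrary choice made here for simplicity; either member carries the same information for a linear model.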
CHEMOMETRIC METHODS
Multiple Linear Regression analysis and variable selection were performed with the software MOBY DIGS (11), using the Ordinary Least Squares (OLS) regression method and GA-VSS (Genetic Algorithm - Variable Subset Selection) (12). All models were
validated by the leave-one-out (LOO) and leave-more-out (LMO) procedures and by scrambling of the responses (13-14).
External validations (13-16) were performed on two validation sets obtained by splitting the original data at 50% and 75% with the Experimental Design procedure, applying the software DOLPHIN (17).
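The internal validation steps named above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the MOBY DIGS implementation: Q2 is taken here as 1 - PRESS/TSS (a common definition), the function names are hypothetical, and the data are synthetic. Leave-more-out and the GA-based variable selection are not covered.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def r2(X, y):
    """Fitted R2 = 1 - RSS/TSS on the full data."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ fit_ols(X, y)
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def q2_loo(X, y):
    """Leave-one-out Q2 = 1 - PRESS/TSS: each object is predicted
    by a model fitted without it."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef = fit_ols(X[mask], y[mask])
        pred = coef[0] + X[i] @ coef[1:]
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def y_scrambling(X, y, n_perm=50, seed=0):
    """Q2(LOO) of models refitted on randomly reordered responses."""
    rng = np.random.default_rng(seed)
    return np.array([q2_loo(X, rng.permutation(y)) for _ in range(n_perm)])


# Synthetic demonstration data (not the BCF data set).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(40, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=40)

q2_real = q2_loo(X_demo, y_demo)
q2_scrambled = y_scrambling(X_demo, y_demo, n_perm=20)
print(f"R2 = {r2(X_demo, y_demo):.3f}, Q2(LOO) = {q2_real:.3f}")
print(f"mean Q2(LOO) after Y scrambling = {q2_scrambled.mean():.3f}")
```

A model free of chance correlation is expected to lose its predictivity when the responses are scrambled, so the scrambled Q2 values should fall well below the Q2 of the real model. The values returned here are fractions, whereas this work reports them as percentages.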
RESULTS AND DISCUSSION
[Figure: SPLITTING of the ORIGINAL DATA SET by applying EXPERIMENTAL DESIGN. Principal Component Analysis on the structure of the 238 chemicals (PC1, PC2), showing the Training - Test splitting.]
[Plot labels: Training; Test; High Leverage: 7, 116, 175, 189; Outliers: 107, 123, 100, 158.]
REGRESSION LINE of the MODEL obtained on a SELECTED TRAINING SET of 179 CHEMICALS
[Figure: Log BCF model, Training 179 mol. (calculated vs. experimental BCF values); Training and Test chemicals are marked.]
log BCF = -17.58 + 1.69 IDDM - 0.45 nHAcc + 15.65 MATS2m - 0.36 GATS2e - 1.64 H6p
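The five-descriptor equation reported for the training set of 179 chemicals (log BCF as a linear function of IDDM, nHAcc, MATS2m, GATS2e and H6p) can be evaluated with a few lines of Python. This is only a sketch: the coefficients are those read from the poster, the function name is hypothetical, and the descriptor values in the example are invented placeholders rather than DRAGON output for a real compound.

```python
# Coefficients of the 179-chemical model as read from the poster.
INTERCEPT = -17.58
COEFFICIENTS = {
    "IDDM": 1.69,
    "nHAcc": -0.45,
    "MATS2m": 15.65,
    "GATS2e": -0.36,
    "H6p": -1.64,
}


def predict_log_bcf(descriptors):
    """Evaluate the linear model for one compound.

    `descriptors` maps each descriptor name to its numerical value.
    """
    missing = sorted(set(COEFFICIENTS) - set(descriptors))
    if missing:
        raise KeyError(f"missing descriptor values: {missing}")
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )


# Placeholder descriptor values, for illustration only.
example = {"IDDM": 12.0, "nHAcc": 2, "MATS2m": 0.1, "GATS2e": 1.0, "H6p": 0.5}
print(f"predicted log BCF = {predict_log_bcf(example):.3f}")
```

As with any QSAR model, a prediction is meaningful only for compounds whose descriptor values lie within the range spanned by the training set.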
On the basis of the structural information represented by all the molecular descriptors used, and also taking into account the
BCF responses, the original data set was split by applying the Experimental Design procedure with the software DOLPHIN (17),
to obtain a training set of 179 molecules and a validation set of 59 chemicals (or alternatively a training set and a test set of
119 molecules each). This design guarantees that the training and validation sets have well-balanced structural diversity and are
also representative of the entire range of the biological response.
[Figure: Log BCF response distribution, splitting Training - Test. Histogram of the number of observations per log BCF interval.]
The usefulness of QSAR models lies mainly in their possible predictive applications. For this purpose, several validation steps are
necessary to avoid overestimating the predictive power of the models and to verify their predictivity:
• Leave-one-out, using the QUIK rule (Q Under Influence of K (18)) to avoid chance correlation.
• Stronger validation using the leave-more-out procedure (25-50%).
• Y scrambling (permutation testing by recalculating models for randomly reordered response).
• Use of external validation verified by Q2ext.
The molecular descriptors most frequently selected by the Genetic Algorithm as the most informative and predictive of a chemical's tendency to
bioconcentrate are related to the dimension of the chemical and to the distribution of polar atoms in the molecule. As expected, the dimensional
descriptors (MATS2m (19), IDDM (20)) have positive signs in the proposed models, explaining the bioconcentration tendency of bigger
molecules, while the descriptors with negative signs, which account for both polarity factors (H6p (8), GATS2e (21)) and the possibility of forming
hydrogen bonds (nHAcc (22)), explain the tendency of more polar chemicals to partition into water.
Table 1 - Model Performances
ID  N°obj.  N°var.  Variables                     R2     Q2LOO  Q2LMO25%  Q2LMO50%  Q2EXT  SDEP   SDEC
1   238     4       IDDM nHAcc MATS2m GATS2e      81.69  80.5   80.80     80.34     -      0.601  0.588
2   179     5       IDDM nHAcc MATS2m GATS2e H6p  79.50  78.0   77.69     77.16     87.7   0.601  0.508
3   119     4       IVDM MATS2m L3m nHAcc         79.21  77.3   77.23     76.37     83.5   0.622  0.595
Table 2 - Comparison with other models
Model   N°obj.  N°var.  Variables                     R2     RMS   Q2LOO  Q2LMO25%  Q2LMO50%  Q2Adj.20%  Q2EXT
BCF     179     5       IDDM nHAcc MATS2m GATS2e H6p  79.50  0.58  78.0   77.69     77.16     -          87.7
BCFWIN  238     1+PCF*  LogKow + PCF                  68.61  0.77  -      -         -         -          -
MCIs    239     5+PCF*  (0)2 (1)0.5 2 3c 0 + PCF      81.0   0.59  -      -         -         81.4       -
*PCF = Polarity Correction Factors
CONCLUSIONS
• A new predictive model for BCF is proposed.
• This model is based only on theoretical molecular descriptors.
• Genetic Algorithm is applied for Variable Subset Selection.
Our linear models are clearly more predictive than the BCFWIN logKow-based model (10), whose predictivity has not even been verified,
and are moreover simpler than the MCIs model (3). In fact, the latter uses 5 connectivity indices and 8 correction factors, giving
a 13-dimensional non-linear model that depends strongly on the studied data set through the choice of polar functional groups.
Comparing the residuals of the different models shows that the logKow-based model has the largest RMS, while the MCI-based model and our new models show similar performance.
• Strong validations demonstrate the stability of the models.
• BCF values can also be predicted for new chemicals (even those not yet synthesised).
REFERENCES
(1) Veith, G.D.; DeFoe, D.L.; Bergstedt, B.V. J. Fish. Res. Board Can. 1979, 36, 1040-1048;
(2) Barron, M.G. Environ. Sci. Technol. 1990, 24, 1612-1618;
(3) Lu, X.; Tao, S.; Hu, H.; Dawson, R.W. Chemosphere 2000, 41, 1675-1688;
(4) Meylan, W.M.; Howard, P.H.; Boethling, R.S.; Aronson, D.; Printup, H.; Gouchie, S. Environ. Toxicol. Chem. 1999, 18, 664-672;
(5) Todeschini, R.; Consonni, V.; Pavan, E. 2001. DRAGON - Software for the calculation of molecular descriptors, rel. 1.12 for Windows.
Free download available at http://www.disat.unimib/chm.;
(6) Todeschini, R.; Lasagni, M.; Marengo, E. J. Chemometrics 1994, 8, 263-273;
(7) Todeschini, R.; Gramatica, P. Quant. Struct.-Act. Relat. 1997, 16, 113-119;
(8) Consonni, V.; Todeschini, R.; Pavan, M. J. Chem. Inf. Comput. Sci. 2002, in press;
(9) CHEM 3D - CambridgeSoft, 1997, MA, USA;
(10) BCFWIN v. 2.14 in EPIWIN Package, 2000, U.S. EPA;
(11) Todeschini, R. 2001. Moby Digs - Software for multilinear regression analysis and variable subset selection by Genetic Algorithm, rel. 2.3 for Windows,
Talete srl, Milan (Italy);
(12) Leardi, R.; Boggia, R.; Terrile, M. J. Chemom. 1992, 6, 267-281;
(13) Wold, S.; Eriksson, L. Chemometric Methods in Molecular Design, 1995, VCH, Germany, 309-318;
(14) Shi, L.M.; Fang, H.; Tong, W.; Wu, J.; Perkins, R.; Blair, R.M.; Branham, W.S.; Dial, S.L.; Moland, C.L.; Sheehan, D.M. J. Chem. Inf. Comput. Sci.
2001, 41, 186-195;
(15) Cramer, R.D.; Patterson, D.E.; Bunce, J.D. J. Am. Chem. Soc. 1988, 110, 5959-5967;
(16) Golbraikh, A.; Tropsha, A. J. Mol. Graph. Model. 2002, 20, 269-276;
(17) Todeschini, R.; Mauri, A. 2000. DOLPHIN - Software for Optimal Distance-based Experimental Design, rel. 1.1 for Windows, Talete srl, Milan (Italy);
(18) Todeschini, R.; Maiocchi, A.; Consonni, V. Chemom. Intell. Lab. Syst. 1999, 46, 13-29;
(19) Moran, P.A.P. Biometrika 1950, 37, 17-23;
(20) Bonchev, D. Information Theoretic Indices for Characterization of Chemical Structures, 1983, Research Studies Press, Chichester (U.K.), p. 249;
(21) Geary, R.C. Incorp. Statist. 1954, 5, 115-145;
(22) Todeschini, R.; Consonni, V. 2000. Handbook of Molecular Descriptors, Wiley-VCH, Weinheim (Germany), p. 667.