QSAR MODELLING OF THE BIODEGRADATION BY HOLISTIC MOLECULAR DESCRIPTORS

P. Gramatica (1), M. Pavan (1), F. Consolaro (1), V. Consonni (2) and R. Todeschini (2)

(1) QSAR Research Unit, Dept. of Structural and Functional Biology, University of Insubria, Varese, ITALY

(2) Milano Chemometrics & QSAR Research Group, Dept. of Environmental Sciences, University of Milano Bicocca, Milano, ITALY

E-mail: [email protected]

Web-site: http://fisio.dipbsf.uninsubria.it/dbsf/qsar/QSAR.html

INTRODUCTION

The environmental fate of a chemical is closely related to its biodegradability. A reliable prediction of biodegradation would greatly aid in planning the synthesis of chemicals for environmental uses.

In recent years, many approaches have been developed to model biodegradation data for predictive purposes. Most of them are based on quantitative structure-biodegradability relationships (QSBR), and mainly on a structure representation by molecular fragments (e.g. functional groups, number of atoms).

Our approach to predicting biodegradability is based on a holistic representation of a chemical, using a set of molecular descriptors that account not only for local characteristics of a structure but also for its general aspects, which allows extension to multifunctional heterogeneous compounds. Because of the great variability of biodegradation data and the difficulty of defining a single end-point, we have applied our descriptors to different aspects of biodegradation: regression modelling of BOD, ThOD and degradation rate constants, and classification according to various biodegradability criteria.

MOLECULAR DESCRIPTORS

The molecular structure has been represented by a wide set of 657 molecular descriptors calculated by the software DRAGON [1]:

• constitutional descriptors (56)
• walk counts (20)
• Galvez index (21)
• charge descriptors (7)
• molecular profiles (40)
• 3D-MoRSE descriptors (160)
• GETAWAY descriptors (196)
• topological descriptors (69)
• BCUT descriptors (7)
• 2D autocorrelation descriptors
• aromaticity descriptors (4)
• geometrical descriptors (18)
• WHIM descriptors (99) [2]
• empirical descriptors (3)

[1] R. Todeschini and V. Consonni, DRAGON - Software for the calculation of molecular descriptors, Talete s.r.l., Milan (Italy) 2000. Download: http://www.disat.unimib.it/chm

[2] R. Todeschini and P. Gramatica, 3D-modelling and prediction by WHIM descriptors. Part 5. Theory development and chemical meaning of the WHIM descriptors, Quant. Struct.-Act. Relat., 16 (1997) 113-119.

REGRESSION MODELS

The regression models have been applied to different data sets: 43 alcohols, ketones and aromatic compounds; 28 alcohols and ketones; 15 anilines and phenols; 17 PCBs and 43 heterogeneous compounds.

Our representation of a chemical is based on 670 molecular descriptors, so an effective variable selection strategy is necessary. GA-VSS (Genetic Algorithm - Variable Subset Selection) was applied to the whole set of descriptors to select the most relevant variables for modelling the biodegradation end-points by Ordinary Least Squares (OLS) regression.
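The idea of GA-driven subset selection for OLS can be illustrated with a minimal sketch. It is not the GA-VSS implementation used for the poster: the fitness function (leave-one-out Q2), the steady-state replacement scheme, the parameter names (`max_vars`, `pop_size`, `mutation_rate`) and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out Q2 of an OLS model with intercept (hat-matrix shortcut)."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                      # hat matrix of the OLS fit
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))   # residuals of LOO predictions
    press = np.sum(loo_resid ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def ga_vss(X, y, max_vars=3, pop_size=30, generations=200,
           mutation_rate=0.3, seed=0):
    """Toy steady-state GA over descriptor subsets of size 1..max_vars."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    cache = {}

    def fitness(subset):
        if subset not in cache:
            cache[subset] = q2_loo(X[:, list(subset)], y)
        return cache[subset]

    def random_subset():
        k = int(rng.integers(1, max_vars + 1))
        return tuple(sorted(rng.choice(n_desc, size=k, replace=False).tolist()))

    population = [random_subset() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        # Crossover: pool the descriptors of two parents from the better half.
        i, j = rng.choice(pop_size // 2, size=2, replace=False)
        pool = sorted(set(population[i]) | set(population[j]))
        k = int(rng.integers(1, min(max_vars, len(pool)) + 1))
        child = set(rng.choice(pool, size=k, replace=False).tolist())
        # Mutation: drop one descriptor and add a randomly chosen one.
        if rng.random() < mutation_rate:
            child.pop()
            child.add(int(rng.integers(n_desc)))
        child = tuple(sorted(child))
        # Replace the worst individual if the child is new and better.
        if child not in population and fitness(child) > fitness(population[-1]):
            population[-1] = child
    best = max(population, key=fitness)
    return best, fitness(best)

# Synthetic demo (not the poster's data): only descriptors 2 and 7 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 20))
y = 2.0 * X[:, 2] - 3.0 * X[:, 7] + rng.normal(scale=0.1, size=40)
best, q2 = ga_vss(X, y)
print("selected descriptors:", best, "Q2 LOO: %.3f" % q2)
```

Using a cross-validated statistic as fitness, rather than R2, penalises subsets that merely fit the training data, which matters when hundreds of descriptors compete for two or three model slots.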

The regression models obtained have satisfactory predictive power. All the models have also been validated on an external test set, obtained by splitting the original data set into representative training and test sets using different approaches based on structural similarity.
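The poster does not say which similarity-based splitting approaches were used. As one common possibility, the sketch below uses Kennard-Stone style max-min distance selection in descriptor space; the function name `kennard_stone`, the Euclidean distance and the random two-descriptor data are assumptions for illustration only.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Return (train_idx, test_idx) chosen by max-min distance selection.

    Assumes n_train >= 2 and that the compounds are not all identical.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all compounds.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start from the two most distant compounds.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from each remaining compound to its nearest selected one.
        nearest = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Hypothetical demo: 152 compounds described by two (already scaled) descriptors.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(152, 2))
train_idx, test_idx = kennard_stone(descriptors, n_train=77)
print(len(train_idx), "training compounds,", len(test_idx), "test compounds")
```

Selecting the training set this way spreads it over the whole descriptor space, so the test compounds fall inside the region the model was built on. In practice the descriptors would be autoscaled first so that no single descriptor dominates the distance.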

BEST MODEL PARAMETERS

| Response | 5day BOD | molBOD/mol | ThOD % | K biodeg Anil-Phen. |
|---|---|---|---|---|
| obj | 43 | 28 | 28 | 15 |
| var | 3 | 2 | 2 | 2 |
| Q2 LOO (%) | 77.1 | 80.7 | 78.5 | 95.6 |
| Q2 LMO (%) | 77.1 | 79.9 | 77.8 | 95.4 |
| R2 (%) | 80.4 | 84.3 | 81.9 | 97.2 |
| SDEP | 0.611 | 0.677 | 5.29 | 0.263 |
| SDEC | 0.564 | 0.601 | 4.847 | 0.212 |
| F (deg) | 53.42 (39) | 67.38 (25) | 56.73 (25) | 204.88 (12) |
| s | 0.592 | 0.645 | 5.13 | 0.237 |
| Model descriptors | BENm3, MAXDP, HATS6u | HATS5v, R8m+ | BENe6, Ds | nN, ARR |
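The reported statistics can be cross-checked against each other with a little arithmetic. The sketch below assumes the usual OLS relations, which the poster does not state: degrees of freedom n - p - 1, F = (R2/p) / ((1 - R2)/(n - p - 1)), SDEC = sqrt(RSS/n) and s = sqrt(RSS/(n - p - 1)). It also reads 0.212 as the SDEC of the fourth model.

```python
import math

# (objects n, variables p, R2 %, SDEC, reported F, reported deg, reported s)
models = {
    "5day BOD":            (43, 3, 80.4, 0.564, 53.42, 39, 0.592),
    "molBOD/mol":          (28, 2, 84.3, 0.601, 67.38, 25, 0.645),
    "ThOD %":              (28, 2, 81.9, 4.847, 56.73, 25, 5.13),
    "K biodeg Anil-Phen.": (15, 2, 97.2, 0.212, 204.88, 12, 0.237),
}

results = {}
for name, (n, p, r2_pct, sdec, f_rep, deg_rep, s_rep) in models.items():
    r2 = r2_pct / 100.0
    dof = n - p - 1                          # residual degrees of freedom
    f_calc = (r2 / p) / ((1.0 - r2) / dof)   # F statistic implied by R2
    s_calc = sdec * math.sqrt(n / dof)       # standard error implied by SDEC
    results[name] = (dof, f_calc, s_calc)
    print(f"{name:20s} deg {dof:2d} (rep. {deg_rep})  "
          f"F {f_calc:7.2f} (rep. {f_rep})  s {s_calc:.3f} (rep. {s_rep})")
```

Under these assumptions the degrees of freedom match the reported values exactly, and the recomputed F and s agree with the reported ones to within about 2%, which is consistent with R2 and SDEC being rounded.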

BIODEGRADABILITY CLASSIFICATION

Different chemometric methods (CART, K-NN and RDA) were used to classify 296 chemicals of environmental concern according to several literature biodegradability criteria, with satisfactory results. The best subset of variables was selected by Genetic Algorithm (GA-VSS) on logistic regression (Rlog), a regression method useful when the possible values of the dependent variable Y are restricted, and by PLS-DA, which confirmed the results previously obtained. The literature criteria disagree in most cases, so we had to compare them in order to find a new general classification criterion for the compounds studied; the comparison was carried out as shown in the scheme below. All the models, developed on a suitably selected training set, have been validated internally (ER) and externally (ER ext).

Training set selection procedure

Data set: 296 compounds
• Available biodegradability data: 152 compounds (SPLITTING)
    • Training set: 77 compounds
    • Test set: 75 compounds (PREDICTION)
• Not available biodegradability data: 144 compounds (PREDICTION)

[Figure: plots with fitted-equation fragments; not recoverable from the transcript.]

Descriptors in the regression models:
• HATS5v: leverage-weighted autocorrelation of lag 5 (weighted by atomic van der Waals volumes)
• R8m+: R maximal autocorrelation of lag 8 (weighted by atomic masses)
• BENe6: negative Burden eigenvalue n. 6 (weighted by atomic Sanderson electronegativities)
• Ds: WHIM total accessibility index (weighted by atomic electrotopological states)

BEST MODEL PARAMETERS

| Method | CART | LDA | RDA | CP-ANN |
|---|---|---|---|---|
| var | 4 | 8 | 7 | 7 |
| ER% | 9.1 | 9.1 | 10.4 | 0 |
| Model descriptors | ATS3p, MAXDP, Dm, Mor04v | nN, nX, P1u, Dm, MEC, ATS3p, Ku, Mor04v | nN, MAXDP, P1u, Dm, Mor03m, ATS3p, Mor04v | nN, nX, P2u, MAXDP, Mor04v, ATS2p, ATS3p |

ER cv % and ER ext %: 12.9, 10.4, 13.3, 6.6, 6.6, 16 (the assignment of these values to the individual methods is not recoverable from the transcript).

CONCLUSIONS

Different kinds of holistic molecular descriptors appear relevant in the modelling of biodegradability. In both the regression and the classification models, molecular descriptors that take into account global structural properties of the molecules have been selected by Genetic Algorithm as correlated to biodegradability, in some cases in addition to local descriptors.

Linear Discriminant Analysis (LDA) model:

variables: nX, nN, P1u, Ku, Dm, ATS2p, MEC, Mor04v

No Model Error Rate % (NOMER): 32.5

Confusion matrix in fitting (ER% = 9.09)

| Assigned classes | A priori class 1 | A priori class 2 | Assigned objects n. |
|---|---|---|---|
| 1 | 50 | 5 | 55 |
| 2 | 2 | 20 | 22 |
| Objects n. | 52 | 25 | |

Confusion matrix in prediction (ER ext % = 6.6)

| Predicted classes | A priori class 1 | A priori class 2 | Assigned objects n. |
|---|---|---|---|
| 1 | 57 | 3 | 60 |
| 2 | 2 | 13 | 15 |
| Objects n. | 59 | 16 | |
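The error rates quoted for the LDA model follow directly from the two confusion matrices. The short sketch below recomputes them; it assumes that NOMER is the error rate obtained by assigning every object to the most populated a priori class, a definition the poster does not spell out but which reproduces the reported value.

```python
# Rows: assigned class (1, 2); columns: a priori class (1, 2).
fitting = [[50, 5],
           [2, 20]]
prediction = [[57, 3],
              [2, 13]]

def error_rate(cm):
    """Percentage of objects lying off the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return 100.0 * (total - correct) / total

def no_model_error_rate(cm):
    """Error rate when all objects go to the largest a priori class (assumed definition)."""
    class_sizes = [sum(col) for col in zip(*cm)]
    return 100.0 * (sum(class_sizes) - max(class_sizes)) / sum(class_sizes)

er_fit = error_rate(fitting)             # 7 of 77 objects misclassified
er_ext = error_rate(prediction)          # 5 of 75 objects misclassified
nomer = no_model_error_rate(fitting)     # 25 of 77 objects in the smaller class

print(round(er_fit, 2))   # 9.09
print(round(er_ext, 2))   # 6.67
print(round(nomer, 1))    # 32.5
```

Comparing ER with NOMER shows how much the model improves on the trivial rule: about 9% of errors in fitting against 32.5% with no model.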

• nX: n. of halogen atoms
• nN: n. of Nitrogen atoms
• P1u: 1st component shape directional WHIM index
• Ku: global shape WHIM index
• Dm: total accessibility WHIM index
• ATS2p: autocorrelation index of a topological structure
• MEC: molecular eccentricity
• Mor04v: 3D-MoRSE-signal 04 (weighted by atomic van der Waals volumes)