QSPR predictions of heat of fusion of organic compounds using Bayesian regularized artificial neural networks

Mohammad Goodarzi,a Tao Chenb and Matheus P. Freitasc,*

a Department of Chemistry, Faculty of Sciences and Young Researchers Club, Islamic Azad University, Arak Branch, Arak, Markazi, Iran
b School of Chemical and Biomedical Engineering, Nanyang Technological University, 62 Nanyang Drive, Singapore 637459, Singapore
c Departamento de Química, Universidade Federal de Lavras, CP 3037, 37200-000, Lavras, MG, Brazil
Abstract

Computational approaches for predicting the properties of environmental pollutants have great potential for rapid environmental risk assessment and management at reduced experimental cost. A quantitative structure-property relationship (QSPR) study was conducted to predict the heat of fusion of a set of organic compounds that have adverse effects on the environment. The forward selection (FS) strategy was used for descriptor selection. We examined the feasibility of multiple linear regression (MLR), artificial neural networks (ANN) and Bayesian regularized artificial neural networks (BRANN) as linear and nonlinear methods. The QSPR models were validated with an external set of compounds that were not used in the model development stage. All models reliably predicted the heat of fusion of the organic compounds under study, with the most accurate results obtained by the BRANN model.

Keywords: heat of fusion; QSPR; forward selection; MLR; BRANN
1. Introduction
In the last two decades, human life has been increasingly affected by environmental pollution, in which various organic pollutants, such as benzene derivatives, phenolic derivatives and organic acids, have been recognized to play a major role. These compounds are important environmental contaminants because of their high toxicity, widespread occurrence, and capability of long-distance transfer, precipitation and accumulation in the environment [1]. They affect the growth and decay of plants and the health of humans and animals. Their adverse effects on human health, some of which are highly detrimental, have already been documented in the literature [2-4]. The heat of fusion has been correlated with the concentration of polycyclic aromatic hydrocarbons [5] as a key thermodynamic property for the Freundlich equation [6]. It is defined as the sum of the heat of melting and the heats of all polymorphic transitions, i.e. the amount of heat required to convert a unit mass of a solid at its melting point into a liquid without an increase in temperature. In contrast to other thermodynamic properties, the heat of fusion is difficult to estimate accurately by group contribution methods [7-9]. One of the major goals in the energetic materials field is to predict the performance, sensitivity, and physical and thermodynamic properties of materials prior to their actual synthesis. Quantitative structure-activity/property relationship (QSAR/QSPR) techniques have been used to achieve this objective, proving to be powerful tools in many fields of materials and compound design [8-11]. QSAR/QSPRs are indispensable in current drug discovery (and other computational chemistry applications), since their predictive capability can greatly facilitate the virtual design of compound libraries, including combinatorial libraries with appropriate
absorption, distribution, metabolism and excretion properties. Altogether, QSAR/QSPR technology considerably saves time and money during the drug development process.
The predictive accuracy of QSPR analyses is typically affected by two aspects: the selection of descriptors that sufficiently represent the structural information of the molecules, and the choice of a specific predictive model. Several regression methods have been used in the field of QSPR, among which artificial neural networks (ANN) are particularly popular. ANNs are computer-based systems derived from a simplified concept of the human brain; the building unit of a neural network is a simplified model of the functional behavior of an organic neuron. Detailed explanations of ANN theory and its application to chemical problems can be found in previous studies [12-16]. The Bayesian regularized ANN (BRANN) is a multilayer feed-forward neural network trained using a Bayesian algorithm. In contrast to the traditional ANN, where the network weights are assumed to be fixed quantities, the Bayesian approach considers a probability distribution of these weights and infers the posterior distribution over them. BRANN has been shown to attain more reliable and accurate predictions than the traditional ANN in many applications [17-19]. In this work, BRANN is implemented within the Levenberg-Marquardt algorithm; the combination of the two methods can accelerate convergence and determine the optimum weights for the network [20, 21], as briefly described below. A Bayesian structure applied directly to neural networks was proposed by MacKay [22] to overcome the problem of interpolating noisy data. In back-propagation learning, the mean-square error (MSE) is the function to be minimized, and the adoption of this performance
measuring index may lead to overfitting problems because of the unbounded values of the network weights. The performance function in the Bayesian-regularized (BR) method is therefore modified by adding a term consisting of the sum of squares of the network weights and biases:

F = βE_D + αE_W

where F is the network performance function, E_D is the sum of squared errors, E_W is the sum of squares of the network weights and biases, and α and β are objective function parameters that dictate the emphasis placed on obtaining a smoother network response [23]. This modification of the performance function aims to improve the ANN model's generalization capability. In this context, it is assumed that the weights and biases of the ANN are random variables following Gaussian distributions, and the parameters α and β are related to the unknown variances associated with these distributions. Based on Bayes' rule, the density function for the weights can be updated after the data are taken:

P(w | D, α, β, M) = P(D | w, β, M) P(w | α, M) / P(D | α, β, M)

where D represents the data set, M is the particular neural network, and w is the vector of network weights. P(w | D, α, β, M) is the plausibility of the weight distribution given the data, P(D | w, β, M) is the likelihood function, P(w | α, M) is the prior density, and P(D | α, β, M) is a normalization factor which guarantees that the total probability is one. Assuming that the noise in the training data, as well as the prior distribution of the weights, is Gaussian, then

P(w | D, α, β, M) = (1/Z_F) exp(-F)

where Z_F depends on the objective function parameters; under this structure, minimizing F is equivalent to finding the (locally) most probable parameters [24].
In the present work, multiple linear regression (MLR), ANN and BRANN were used as linear and nonlinear techniques to predict the heat of fusion of environmental pollutants. The aim of this work was to build QSPR models that can predict the heat of fusion of these compounds from their molecular structure alone, and also to test the performance of the above methods and evaluate their applicability as chemometric tools for predicting thermodynamic properties.
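As a numerical illustration of the regularized performance function F, the short sketch below evaluates it for given α and β; in the actual BRANN procedure these parameters are inferred within the Bayesian framework rather than supplied by the user, so the function name and fixed parameter values here are purely illustrative.

```python
import numpy as np

def br_objective(y, y_hat, weights, alpha, beta):
    """Bayesian-regularized performance function F = beta*E_D + alpha*E_W,
    where E_D is the sum of squared errors and E_W the sum of squared
    network weights (the weight-decay term)."""
    e_d = np.sum((np.asarray(y, float) - np.asarray(y_hat, float)) ** 2)  # data misfit
    e_w = np.sum(np.asarray(weights, float) ** 2)                         # weight penalty
    return float(beta * e_d + alpha * e_w)

# With alpha = 0 the objective reduces to the plain sum of squared errors.
```

Larger α relative to β penalizes large weights more strongly, favoring a smoother network response, which is the mechanism behind the improved generalization discussed above.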
2. Computational methods
The experimental values were taken from the literature [25, 26]; repeated compounds in Table 1 reflect different experimental values reported in the literature. The 2D structures of the molecules were drawn using the HyperChem 7 software [27], and the final optimized geometries were obtained with the semi-empirical AM1 method in HyperChem. All calculations were carried out at the restricted Hartree-Fock level with no configuration interaction. The molecular structures were optimized using the Polak-Ribiere algorithm until the root mean square gradient reached 0.001 kcal mol-1 [28]. The resulting geometries were transferred into the Dragon program package [29] to compute descriptors of the Constitutional, Topological, Geometrical, Charge, GETAWAY (GEometry, Topology and Atoms-Weighted AssemblY), WHIM (Weighted Holistic Invariant Molecular), 3D-MoRSE (3D-Molecular Representation of Structure based on Electron diffraction), Molecular Walk Count, BCUT, 2D-Autocorrelation, Aromaticity Index, Randic Molecular Profile, Radial Distribution Function, Functional Group and Atom-Centred Fragment classes. The
calculated descriptors were first analyzed for constant or near-constant variables; the descriptors so detected, 800 in total, were removed since they do not provide sufficient information on the molecular structures. In addition, to decrease the redundancy in the descriptor data matrix, the correlations of the descriptors with each other and with the property of the molecules were examined, and collinear descriptors (r > 0.9) were detected. Among each group of collinear descriptors, the one presenting the highest correlation with the property was retained and the others were removed from the data matrix. Finally, 237 descriptors were used for the next step. The Bayesian regularized ANN was implemented in Matlab using the NNet toolbox [30]. TABLE 1
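The descriptor preprocessing described above (dropping near-constant columns, then resolving collinear pairs in favor of the descriptor better correlated with the property) can be sketched as follows. This is our minimal Python illustration, not the original Matlab code; the function name, variance tolerance and tie-breaking order are assumptions, while the 0.9 collinearity cutoff follows the text.

```python
import numpy as np

def filter_descriptors(X, y, var_tol=1e-8, r_cut=0.9):
    """Drop near-constant columns of X, then resolve collinear pairs
    (|r| > r_cut) by keeping the descriptor better correlated with the
    property y. Returns the sorted indices of the retained columns."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    # 1) remove (near-)constant descriptors
    keep = [j for j in range(X.shape[1]) if X[:, j].std() > var_tol]
    # 2) correlation of each surviving descriptor with the property
    prop_r = {j: abs(np.corrcoef(X[:, j], y)[0, 1]) for j in keep}
    retained = []
    for j in sorted(keep, key=lambda j: -prop_r[j]):  # best-correlated first
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= r_cut
               for k in retained):
            retained.append(j)
    return sorted(retained)
```

For example, a constant column is dropped outright, and of two descriptors correlated above 0.9 with each other, only the one with the higher correlation to y survives.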
3. Results and discussion
QSAR/QSPR analyses are particularly important for the design of new compounds, because their predictive capability can reduce the time and cost involved in purely experimental studies. Researchers have therefore paid increasing attention to such studies on many compounds, but many of these works focused mainly on the fitting ability of the QSAR/QSPR models and paid little attention to model validation and the applicability domain, which are essential for assessing reliability. Among the techniques available for validating multivariate models, one is based on cross-validation and another on the use of an external set; we performed both. A further point is that the descriptors in the model should represent the maximum information on structural variation, while collinearity among them must be kept to a minimum. It should be noted that we used forward selection, a common and simple technique for feature selection.
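Forward selection of this kind can be sketched in a few lines: at each step the candidate descriptor that most reduces the residual standard deviation S of an ordinary least-squares fit is added, and the loop stops when no candidate improves S. This Python sketch is illustrative only (the original selection was not performed with this code), and the improvement tolerance is our assumption.

```python
import numpy as np

def residual_std(X, y):
    """Standard deviation of residuals from an OLS fit (with intercept)
    of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.std(y - A @ coef))

def forward_select(X, y, n_max=5, tol=1e-6):
    """Greedily add the descriptor that most reduces the residual standard
    deviation S; stop when no candidate improves S (or n_max is reached)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    selected, best_s = [], float(np.std(y))
    while len(selected) < n_max:
        trials = {j: residual_std(X[:, selected + [j]], y)
                  for j in range(X.shape[1]) if j not in selected}
        j_best = min(trials, key=trials.get)
        if best_s - trials[j_best] < tol:
            break  # no meaningful improvement left
        selected.append(j_best)
        best_s = trials[j_best]
    return selected
```

When one descriptor already explains the property exactly, the procedure selects it and stops, mirroring the stopping criterion described below.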
The forward stepwise (FS) procedure is an effective and efficient approach for the selection of informative descriptors in QSPR. It consists of the stepwise addition of the best molecular descriptors to the model so as to minimize the standard deviation (S), until no variable outside the model satisfies the selection criterion. Five descriptors were selected using the FS method; their values are shown in Table 1. The correlation matrix of these five descriptors is given in Table 2, which shows that there is no significant correlation between the selected descriptors. Problems would arise in the model if the correlation coefficients among descriptors were high (collinearity) or if correlations with the property occurred by chance; Table 2 shows that these problems are not present in our models. TABLE 2
In order to develop and validate the models, the data set of 74 compounds was divided into a training set of 56 compounds and a test set of 18 compounds. The data were split based on the range of heat of fusion values: 12.12 to 56.60 for the training set (56 compounds) and 16.99 to 37.44 for the test set (18 compounds). This avoids extrapolation, since the training set range covers that of the test set. With the five selected descriptors, the following linear model was built on the training set data:
Hfus = -11.762 + (53.969 × RDF010m) + (101.46 × R3e+) + (6.8494 × BEHm7) – (9.6472 × Mor20e) + (14.975 × Gs)
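Applied directly, the equation above reproduces the MLR predictions listed in Table 3. As a sketch (our own encoding of the published coefficients, not the original fitting code), the prediction for compound 1, 1,2,3-trichlorobenzene, using its descriptor values from Table 1 is:

```python
def hfus_mlr(rdf010m, r3e_plus, behm7, mor20e, gs):
    """MLR model for the heat of fusion built on the five FS-selected
    descriptors, with the fitted coefficients from the text."""
    return (-11.762 + 53.969 * rdf010m + 101.46 * r3e_plus
            + 6.8494 * behm7 - 9.6472 * mor20e + 14.975 * gs)

# Compound 1 (1,2,3-trichlorobenzene), descriptor values from Table 1:
print(round(hfus_mlr(0.093, 0.159, 0.755, 0.618, 0.621), 2))  # → 17.9 (Table 3: 17.90)
```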
Thus, the built model was used to carry out validation and predict the test set data. The prediction results are given in Table 3. We constructed linear models with
different numbers of descriptors. Figure 1 shows that the model with all five descriptors is more powerful than the others, since the squared correlation coefficient of experimental versus fitted/predicted heats of fusion (r2) increases and the standard error (SE) decreases as more descriptors are added to the model. TABLE 3 FIGURE 1
To gain insight into the nonlinearity of the model, a three-layer back-propagation ANN model was constructed and trained with the Levenberg-Marquardt (LM) algorithm. The input values were autoscaled before training the network, the initial weights were selected randomly between -0.3 and 0.3, and the number of neurons in the hidden layer, the weight and bias learning rates, and the momentum values were then optimized. The proper number of neurons in the hidden layer was 5; it was determined by training the network with different numbers of hidden neurons and comparing the outputs with the target values in terms of the root mean square error (RMSE). After optimization of all ANN parameters, the network was trained to adjust the weight and bias values. Additionally, a BRANN model was built to verify the enhancement obtained by using Bayesian regularization compared to the ANN alone for the five-parameter model; a brief overview of the BRANN modeling is presented here. First, a prior distribution is assigned over the network weights. After the data are collected, the posterior distribution of the network weights is determined by Bayesian inference. If a Gaussian prior distribution that penalizes large network weights is applied, and the data are assumed to come from a smooth function with additive Gaussian noise, maximizing the posterior distribution is equivalent to minimizing the standard sum-of-squares error together with a weight-decay regularizer [31]. In the prediction stage, both the
mean μ and variance σ² of the predictive distribution can be calculated to provide a confidence bound on the predicted values. In this work, the Bayesian regularization was implemented within the Levenberg-Marquardt algorithm (LMBR); the combination of the two methods can accelerate convergence and determine the optimum weights for the network [32, 33]. Before training the networks, the input values were normalized between -1 and 1. The initial weights were selected randomly between -0.3 and 0.3, and the number of nodes (neurons) in the hidden layer and the weight and bias learning rates were then optimized. We used one hidden layer, and the proper number of nodes was determined by training the network with different numbers of hidden neurons; the root mean square error (RMSE) measures how good the outputs are in comparison with the target values. In the BRANN process we used 8 compounds as a validation set to detect overfitting: training of the network must stop when the RMSE of the validation set begins to increase while the RMSE of the training set continues to decrease, so training was stopped when overtraining began. In general, training stops when one of several criteria is met: the maximum number of epochs (iterations) is reached, the maximum training time is exceeded, or the performance goal is attained; however, the best way to prevent overtraining is to stop training based on the validation set. After all parameters were optimized and the model trained, the external test set, which did not contribute to any model development step, was used to evaluate the model. To evaluate the MLR-, ANN- and BRANN-based models, we used several statistical parameters, such as the F statistic, t-test, squared correlation coefficient (r2), root mean square error of prediction (RMSEP), relative standard error of prediction
(RSEP) and mean absolute error (MAE) values [34], in addition to other validation parameters reported elsewhere [35]. The statistical results for all three models (Table 4) show that all performed reasonably well, but FS-BRANN was the most reliable in predicting the heat of fusion values. Figure 2 shows the experimental values versus the heats of fusion obtained by FS-BRANN, whilst the residuals of the FS-BRANN predicted values are plotted against the experimental values in Figure 3. The distribution of residuals on both sides of the zero line indicates that no systematic error exists in the development of the FS-BRANN model. TABLE 4 FIGURE 2 FIGURE 3
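The evaluation statistics can be computed with the short functions below. This is an illustrative sketch: RMSEP and MAE are standard, but the exact RSEP formula used in the study follows ref. [34], so the definition shown here (a common one, normalizing by the sum of squared observations) is an assumption.

```python
import numpy as np

def rmsep(y, y_hat):
    """Root mean square error of prediction."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def rsep(y, y_hat):
    """Relative standard error of prediction, in percent (one common
    definition; the exact formula in the paper follows ref. [34])."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(100 * np.sqrt(np.sum((y - y_hat) ** 2) / np.sum(y ** 2)))

def mae(y, y_hat):
    """Mean absolute error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))
```

Applying such functions to the experimental and predicted columns of Table 3 yields figures of the kind summarized in Table 4.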
4. Conclusion
In this study, predictive QSPR models for the heat of fusion of some environmental organic pollutants were developed using Bayesian regularized artificial neural networks (BRANN), artificial neural networks (ANN) alone and multiple linear regression (MLR), with five descriptors computed by the Dragon software. The descriptors were selected by a common feature selection technique from a large pool; they were RDF010m, R3e+, BEHm7, Mor20e and Gs, which captured enough information on the molecular structures. A single descriptor can capture only one aspect of the property of interest or of some underlying process, which is in many cases far from satisfactory. On the other hand, we cannot use the
whole set of descriptors because of the overfitting problems of statistical modeling; the use of multivariate regression instead is a great improvement for correlating physical properties with molecular parameters. We constructed an MLR-based model as a simple and fast linear approach, and a BRANN-based model as a powerful nonlinear approach, on the five descriptors selected by forward selection; based on the prediction results obtained, the QSPR models developed here appear quite useful for predicting the heat of fusion of environmental organic pollutants.
Acknowledgements CNPq is gratefully acknowledged for the fellowship (to M.P.F.), as is FAPEMIG for the financial support.
References
[1] R.V. Galiulin, V.N. Bashkin, R.A. Galiulina, Water Air Soil Pollut. 137 (2002) 179-191.
[2] J.P. Giesy, K. Kannan, Crit. Rev. Toxicol. 28 (1998) 511-569.
[3] A. Katsoyiannis, A. Zouboulis, C. Samara, Chemosphere 65 (2006) 1634-1641.
[4] F. Flores-Céspedes, M. Fernández-Pérez, M. Villafranca-Sànchez, E. González-Pradas, Environ. Pollut. 142 (2006) 449-456.
[5] C. Plaza, B. Xing, J.M. Fernández, N. Senesi, A. Polo, Environ. Pollut. 157 (2009) 257-263.
[6] A.M. Carmo, L.S. Hundal, M.L. Thompson, Environ. Sci. Technol. 34 (2000) 4363-4369.
[7] R.C. Reid, J.M. Prausnitz, B.E. Poling, The Properties of Gases and Liquids, 4th ed., McGraw-Hill, New York, 1987.
[8] P. Simamora, S.H. Yalkowsky, Ind. Eng. Chem. Res. 33 (1994) 1405-1409.
[9] J.F. Krzyzaniak, P.B. Myrdal, P. Simamora, S.H. Yalkowsky, Ind. Eng. Chem. Res. 34 (1995) 2530-2535.
[10] M. Goodarzi, M.P. Freitas, QSAR Comb. Sci. 27 (2008) 1092-1098.
[11] M. Goodarzi, M.P. Freitas, J. Phys. Chem. A 112 (2008) 11263-11265.
[12] M. Goodarzi, M.P. Freitas, Chemom. Intell. Lab. Syst. 96 (2009) 59-62.
[13] M.P. Freitas, E.F.F. da Cunha, T.C. Ramalho, M. Goodarzi, Curr. Comput.-Aid. Drug Des. 4 (2008) 273-282.
[14] G. Kateman, Chemom. Intell. Lab. Syst. 19 (1993) 135-142.
[15] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, VCH, Weinheim, 1999.
[16] S.P. Niculescu, J. Mol. Struct. (Theochem) 622 (2003) 71-83.
[17] F.R. Burden, D.A. Winkler, Chem. Res. Toxicol. 13 (2000) 436-440.
[18] F.R. Burden, D.A. Winkler, J. Med. Chem. 42 (1999) 3183-3187.
[19] M.J. Polley, D.A. Winkler, F.R. Burden, J. Med. Chem. 47 (2004) 6230-6238.
[20] Y.H. Wang, Y. Li, Y.H. Li, S.L. Yang, L. Yang, Bioorg. Med. Chem. Lett. 15 (2005) 4076-4084.
[21] J. Caballero, M. Garriga, M. Fernandez, Bioorg. Med. Chem. 14 (2006) 3330-3340.
[22] D.J.C. MacKay, Neural Comput. 4 (1992) 448-472.
[23] M. Fernandez, J. Caballero, L. Fernandez, J.I. Abreu, M. Garriga, J. Mol. Graph. Model. 26 (2007) 748-759.
[24] D.J.C. MacKay, Neural Comput. 4 (1992) 415-447.
[25] M.H. Keshavarz, J. Hazard. Mater. 150 (2008) 387-393.
[26] C. Chiou, D.W. Schmedding, M. Manes, Environ. Sci. Technol. 39 (2005) 8840-8846.
[27] HyperChem version 7.0, Hypercube, Inc., Gainesville, 2007.
[28] D.C. Young, Computational Chemistry: A Practical Guide for Applying Techniques to Real-World Problems, John Wiley & Sons, New York, 2001.
[29] R. Todeschini, V. Consonni, M. Pavan, Dragon software, Milano, 2002.
[30] Matlab Version 7.6, MathWorks Inc., Natick, MA, 2007.
[31] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[32] D.J.C. MacKay, Neural Comput. 4 (1992) 448-472.
[33] F.D. Foresee, M.T. Hagan, Gauss-Newton Approximation to Bayesian Learning, in: Proceedings of the 1997 IEEE International Conference on Neural Networks, Houston, 1997, pp. 1930-1935.
[34] M. Goodarzi, T. Goodarzi, N. Ghasemi, Ann. Chim. 97 (2007) 303-312.
[35] M. Goodarzi, M.P. Freitas, N. Ghasemi, Eur. J. Med. Chem. 45 (2010) 3911-3915.
Figure Captions
Figure 1. Correlation coefficient and standard error of linear models with different numbers of descriptors.
Figure 2. Plot of the heats of fusion calculated by the BRANN-based model against the experimental values.
Figure 3. Plot of the residuals of the BRANN-predicted values versus the experimental heats of fusion.
Table 1. Descriptor values used for model construction.

No.  Compound                                         RDF010m  R3e+   BEHm7  Mor20e  Gs
1    1,2,3-Trichlorobenzene                           0.093    0.159  0.755  0.618   0.621
2    1,2,3,4-Tetrachlorobenzene                       0.062    0.156  0.783  0.657   0.614
3    1,2,4,5-Tetrachlorobenzene                       0.062    0.149  1.549  0.552   1
4    Biphenyl                                         0.309    0.053  0.935  1.15    1
5    Naphthalene                                      0.248    0.064  0.802  0.951   1
6    2,6-Dimethylnaphthalene                          0.305    0.052  1.897  1.137   0.758
7    Phenanthrene                                     0.309    0.052  2.291  1.304   0.593
8    2,4,5-Pcb                                        0.217    0.092  2.482  1.015   0.346
9    2,2',5-Pcb                                       0.217    0.068  2.51   1.356   0.346
10   2,2',4,5,5'-Pcb                                  0.155    0.088  2.821  1.304   0.338
11   2,2',3,3',4,4'-Pcb                               0.124    0.088  3.076  1.201   1
12   Chlorpyrifos                                     0.213    0.122  2.764  0.789   0.193
13   Lindane                                          0.106    0.139  2.174  0.307   0.362
14   P,P'-DDT                                         0.259    0.085  2.951  1.157   0.191
15   1,2,4,5-Tetramethylbenzene                       0.301    0.031  1.586  0.755   1
16   Hexachlorobenzene                                0        0.143  2.146  0.856   1
17   Pyrene                                           0.309    0.044  2.29   1.324   1
18   2,4,6-Pcb                                        0.217    0.068  2.532  1.098   0.588
19   2,2',3,3',4,5,5',6,6'-Pcb                        0.031    0.086  3.242  0.747   0.57
20   2,8-Dichlorodibenzofuran                         0.186    0.071  2.381  1.007   0.588
21   Dieldrin                                         0.193    0.103  2.959  0.494   0.191
22   Leptophos                                        0.27     0.085  2.79   0.934   0.188
23   Aldicarb                                         0.379    0.071  1.861  0.856   0.218
24   Carbaryl                                         0.378    0.079  2.407  1.181   0.204
25   Alachlor                                         0.427    0.049  2.532  1.778   0.193
26   Linuron                                          0.313    0.098  2.36   0.922   0.204
27   Nitrobenzene                                     0.212    0.077  0.789  0.558   0.524
28   2-Nitrophenol                                    0.273    0.074  0.848  0.516   0.377
29   3-Nitroaniline                                   0.432    0.051  3.25   0.909   0.168
30   1-Nitronaphthalene                               0.246    0.094  1.331  0.204   0.213
31   3-Nitrophenol                                    0.283    0.087  0.863  0.526   0.377
32   1-Nitroaniline                                   0.359    0.071  0.793  0.656   0.377
33   2-Nitrobenzoic Acid                              0.28     0.107  1.331  0.374   0.362
34   3-Nitrophthalic Anhydride                        0.171    0.065  1.331  0.314   0.351
35   4-Nitrophthalic Anhydride                        0.165    0.061  1.331  0.303   0.351
36   1-Methyl-2,4-Dinitrobenzene                      0.26     0.061  1.331  0.593   0.356
37   2-Methyl-1,3-Dinitrobenzene                      0.26     0.06   1.331  0.423   0.356
38   1,4-Dinitrobenzene                               0.214    0.065  1.331  0.399   1
39   1,2-Dinitrobenzene                               0.234    0.069  1.331  0.371   0.424
40   2,4-Dinitrophenol                                0.288    0.083  1.331  0.442   0.356
41   2-Methyl-4,6-Dinitrophenol                       0.326    0.056  1.514  0.394   0.351
42   2,6-Dinitrophenol                                0.288    0.064  1.331  0.512   0.372
43   4-Methyl-1,2-Dinitrobenzene                      0.263    0.053  1.331  0.452   0.356
44   2,5-Dinitrophenol                                0.288    0.076  1.331  0.331   0.356
45   1-Methyl-2,3-Dinitrobenzene                      0.26     0.065  1.331  0.391   0.356
46   3,4-Dinitrophenol                                0.305    0.081  1.331  0.233   0.356
47   2,3-Dinitrophenol                                0.311    0.078  1.331  0.256   0.356
48   1,8-Dinitronaphthalene                           0.288    0.071  2.225  0.464   0.31
49   1,5-Dinitronaphthalene                           0.288    0.051  2.058  0.866   1
50   1,3,5-Trinitrobenzene                            0.246    0.044  1.331  0.526   0.346
51                                                    0.246    0.044  1.331  0.526   0.346
52   2,4,6-Trinitroresorcinol                         0.406    0.056  1.672  0.356   0.338
53                                                    0.406    0.056  1.672  0.356   0.338
54   1-Methyl-2,4,6-Trinitrobenzene                   0.286    0.043  1.854  0.342   0.342
55                                                    0.286    0.043  1.854  0.342   0.342
56                                                    0.286    0.043  1.854  0.342   0.342
57                                                    0.286    0.043  1.854  0.342   0.342
58   1-Methoxy-2,4,6-Trinitrobenzene                  0.276    0.042  2.242  0.252   0.338
59   1-Methyl-3-Hydroxy-2,4,6-Trinitrobenzene         0.359    0.053  1.887  0.241   0.338
60                                                    0.359    0.053  1.887  0.241   0.338
61   1-Amino-2,4,6-Trinitrobenzene                    0.417    0.043  1.755  0.511   0.342
62   1,3-Diamino-2,4,6-Trinitrobenzene                0.545    0.04   1.847  -0.036  0.49
63   1,3,5-Triamino-2,4,6-Trinitrobenzene             0.702    0.036  1.986  -0.356  0.334
64   2,4,6-Trinitrobenzoic Acid                       0.333    0.077  2.123  0.084   0.334
65   1,4,5-Trinitronaphthalene                        0.316    0.049  2.437  0.709   0.331
66   1-(Methylnitramino)-2,4,6-Trinitrobenzene        0.351    0.073  2.343  0.449   0.198
67                                                    0.351    0.073  2.343  0.449   0.198
68   2,2',4,4',6,6'-Hexanitrobiphenyl                 0.442    0.047  2.963  1.04    0.819
69   2,2',4,4',6,6'-Hexanitrobibenzyl                 0.504    0.044  2.963  1.04    0.819
70   2,2',4,4',6,6'-Hexanitrodiphenylamine            0.646    0.044  3.01   1.481   0.168
71   2,2',4,4',6,6'-Hexanitrostilbene                 0.492    0.046  3.164  0.947   0.193
72   2,2',4,4',6,6'-Hexanitrodiphenylsulfide          0.432    0.05   3.25   0.933   0.168
73   2,2',4,4',6,6'-Hexanitrodiphenylsulfone          0.469    0.045  3.397  0.787   0.186
74   3,3'-Dimethyl-2,2',4,4',6,6'-Hexanitrobiphenyl   0.511    0.045  2.996  0.748   0.178
Table 2. Correlation matrix for the five selected descriptors.a

          RDF010m  R3e+    BEHm7   Mor20e  Gs
RDF010m   1
R3e+      0.4448   1
BEHm7     0.061    0.0102  1
Mor20e    0.0033   0.0004  0.2023  1
Gs        0.1594   0.0193  0.064   0.0419  1

a Radial distribution function - 1.0 / weighted by atomic masses (RDF010m); R maximal autocorrelation of lag 3 / weighted by atomic Sanderson electronegativities (R3e+); highest eigenvalue n. 7 of Burden matrix / weighted by atomic masses (BEHm7); 3D-MoRSE signal 20 / weighted by atomic Sanderson electronegativities (Mor20e); and G total symmetry index / weighted by atomic electrotopological states (Gs).
Table 3. Experimental and calculated heats of fusion (ΔHfus) by the BRANN, ANN and MLR models.a

No.   Compound                                         Exp.   BRANN  ANN    MLR
1     1,2,3-Trichlorobenzene                           17.36  15.13  17.24  17.90
2*    1,2,3,4-Tetrachlorobenzene                       16.99  17.08  17.41  15.63
3**   1,2,4,5-Tetrachlorobenzene                       24.10  26.68  25.27  26.96
4*    Biphenyl                                         17.49  18.54  17.78  20.58
5     Naphthalene                                      18.99  17.36  17.99  19.41
6     2,6-Dimethylnaphthalene                          24.27  21.72  23.21  23.35
7     Phenanthrene                                     18.62  21.38  20.78  22.18
8*    2,4,5-Pcb                                        22.80  22.82  24.31  21.67
9     2,2',5-Pcb                                       17.91  16.07  18.59  16.14
10**  2,2',4,5,5'-Pcb                                  18.78  18.78  19.46  17.34
11    2,2',3,3',4,4'-Pcb                               29.20  29.51  29.13  28.32
12    Chlorpyrifos                                     25.94  24.31  23.91  26.32
13*   Lindane                                          23.59  22.47  22.52  25.41
14    P,P'-DDT                                         26.36  23.31  22.10  22.75
15*   1,2,4,5-Tetramethylbenzene                       21.00  24.60  24.39  26.18
16    Hexachlorobenzene                                28.74  24.80  27.02  24.16
17    Pyrene                                           23.51  24.34  25.53  27.27
18    2,4,6-Pcb                                        16.48  16.91  23.12  16.40
19*   2,2',3,3',4,5,5',6,6'-Pcb                        22.63  21.29  22.18  22.17
20    2,8-Dichlorodibenzofuran                         25.19  22.48  22.71  20.88
21    Dieldrin                                         20.08  23.14  23.69  27.47
22    Leptophos                                        19.49  24.07  23.75  24.35
23*   Aldicarb                                         25.94  24.76  24.17  23.65
24    Carbaryl                                         24.27  25.49  25.11  24.80
25**  Alachlor                                         17.74  17.60  18.62  19.34
26    Linuron                                          28.66  25.96  26.66  25.40
27    Nitrobenzene                                     12.12  14.24  15.41  15.36
28    2-Nitrophenol                                    17.45  16.45  16.02  16.96
29**  3-Nitroaniline                                   23.69  22.46  31.85  32.73
30*   1-Nitronaphthalene                               18.43  20.82  20.94  21.39
31    3-Nitrophenol                                    19.20  18.05  19.58  18.82
32    1-Nitroaniline                                   16.11  17.38  17.03  19.57
33*   2-Nitrobenzoic Acid                              27.99  25.64  24.88  25.14
34    3-Nitrophthalic Anhydride                        18.40  16.84  16.98  15.41
35    4-Nitrophthalic Anhydride                        17.14  16.35  16.51  14.78
36*   1-Methyl-2,4-Dinitrobenzene                      20.12  18.23  18.29  17.19
37*   2-Methyl-1,3-Dinitrobenzene                      19.28  20.19  19.39  18.72
38*   1,4-Dinitrobenzene                               28.12  27.00  28.78  26.62
39**  1,2-Dinitrobenzene                               28.12  25.15  22.02  19.75
40    2,4-Dinitrophenol                                24.17  23.46  25.14  22.39
41    2-Methyl-4,6-Dinitrophenol                       19.41  24.52  23.43  23.34
42    2,6-Dinitrophenol                                19.58  21.36  21.08  20.02
43    4-Methyl-1,2-Dinitrobenzene                      18.83  19.48  18.13  17.90
44    2,5-Dinitrophenol                                23.73  23.67  24.84  22.75
45    1-Methyl-2,3-Dinitrobenzene                      17.57  20.87  20.67  19.54
46    3,4-Dinitrophenol                                25.38  25.39  26.30  25.12
47*   2,3-Dinitrophenol                                26.24  25.37  26.36  24.91
48    1,8-Dinitronaphthalene                           35.20  32.38  26.79  26.39
49    1,5-Dinitronaphthalene                           33.03  28.94  29.66  29.67
50    1,3,5-Trinitrobenzene                            15.69  16.82  16.47  15.20
51**                                                   16.74  16.82  16.47  15.20
52    2,4,6-Trinitroresorcinol                         33.50  29.21  28.70  28.91
53*                                                    28.80  29.21  28.70  28.91
54    1-Methyl-2,4,6-Trinitrobenzene                   21.86  22.70  20.86  22.56
55*                                                    19.58  22.70  20.86  22.56
56**                                                   21.94  22.70  20.86  22.56
57*                                                    23.43  22.70  20.86  22.56
58    1-Methoxy-2,4,6-Trinitrobenzene                  19.64  23.35  22.63  25.38
59    1-Methyl-3-Hydroxy-2,4,6-Trinitrobenzene         26.74  27.61  28.09  28.65
60**                                                   26.01  27.61  28.09  28.65
61    1-Amino-2,4,6-Trinitrobenzene                    28.15  28.59  25.06  27.32
62    1,3-Diamino-2,4,6-Trinitrobenzene                35.25  38.11  37.23  42.05
63    1,3,5-Triamino-2,4,6-Trinitrobenzene             56.60  52.36  54.11  51.82
64    2,4,6-Trinitrobenzoic Acid                       31.60  29.02  29.46  32.76
65*   1,4,5-Trinitronaphthalene                        27.49  25.58  22.42  25.07
66    1-(Methylnitramino)-2,4,6-Trinitrobenzene        25.85  27.37  28.23  29.27
67                                                     25.85  27.37  28.23  29.27
68*   2,2',4,4',6,6'-Hexanitrobiphenyl                 37.44  36.81  36.39  39.39
69    2,2',4,4',6,6'-Hexanitrobibenzyl                 43.85  43.54  45.63  40.30
70    2,2',4,4',6,6'-Hexanitrodiphenylamine            37.38  36.66  38.51  36.41
71    2,2',4,4',6,6'-Hexanitrostilbene                 40.21  35.78  35.36  34.88
72    2,2',4,4',6,6'-Hexanitrodiphenylsulfide          38.00  32.17  31.09  32.40
73    2,2',4,4',6,6'-Hexanitrodiphenylsulfone          40.36  38.07  39.67  36.58
74    3,3'-Dimethyl-2,2',4,4',6,6'-Hexanitrobiphenyl   33.69  38.03  37.92  36.35

a Compounds marked with * pertain to the test set; compounds marked with ** were used as a monitoring set during construction of the ANN models.
Table 4. Comparison of the statistical parameters obtained using the FS-BRANN, FS-ANN and FS-MLR models.

Parameter     Set           FS-BRANN  FS-ANN    FS-MLR
RMSEP         Training set  2.4942    3.0756    3.6511
              Test set      1.6781    2.0177    2.3258
RSEP (%)      Training set  9.3819    11.5692   13.734
              Test set      6.9166    8.3165    9.5866
MAE (%)       Training set  18.996    20.5577   22.592
              Test set      27.636    29.0280   33.229
r2            Training set  0.9142    0.8641    0.8082
              Test set      0.8901    0.8423    0.7956
F statistic   Training set  575.54    343.3514  227.52
              Test set      129.54    85.46904  62.286
t-test        Training set  23.990    18.52974  15.083
              Test set      11.382    9.244947  7.8921
R02           Training set  0.8961    0.8384    0.7711
              Test set      0.8608    0.8164    0.7852
R'02          Training set  0.9123    0.8638    0.8081
              Test set      0.8877    0.8423    0.7837
R0m2          Training set  0.7912    0.7256    0.6525
              Test set      0.7377    0.7067    0.7144
R'0m2         Training set  0.8743    0.8491    0.8001
              Test set      0.8465    0.8423    0.7088
Ra2           Training set  0.9056    0.8505    0.7890
              Test set      0.8443    0.7766    0.7104
Figure 1
Figure 2
Figure 3