ENSEMBLE METHODS FOR ENVIRONMENTAL DATA MODELLING WITH SUPPORT VECTOR REGRESSION

15th European Colloquium on Theoretical and Quantitative Geography, September 7-11, 2007, Montreux, Switzerland

Frédéric RATLE, Devis TUIA
Institute of Geomatics and Analysis of Risk, University of Lausanne

ABSTRACT

This paper investigates the use of ensembles of predictors to improve the performance of spatial prediction methods. Support vector regression (SVR), a popular method from the field of statistical machine learning, is used. Several instances of SVR are combined using different data sampling schemes (bagging and boosting). Bagging shows good performance and proves to be more computationally efficient than training a single SVR model, while also reducing the error. Boosting, however, does not improve results on this specific problem.

KEYWORDS

Support vector regression, ensemble methods, bagging, boosting, risk assessment, spatial prediction.

INTRODUCTION

Choosing a good statistical model for environmental data modelling is usually very difficult, since the data can be very noisy (i.e., corrupted with information not relevant to the phenomenon under study) and incomplete, and prediction error is thus very large. Ensemble methods, i.e., the aggregation of several predictors (or classifiers) in order to obtain better prediction performance, are becoming increasingly popular for such problems. In this paper, the performance of a single predictor on environmental data describing cesium-137 activity is compared with the performance obtained by aggregating several instances of the same predictor using bagging and boosting, two popular ensemble methods.

DATA

A dataset concerning cesium-137 activity mapping has been used in this investigation. Cs-137 is a byproduct of nuclear fission. It is used in small amounts for the calibration of radiation detection equipment, and in larger amounts in medical radiation therapy devices. The cesium-137 found in nature comes essentially from nuclear weapons testing and from industrial and medical waste. External exposure to Cs-137 can cause burns, acute radiation sickness and even death, and it significantly increases cancer risk because of the exposure to high-energy gamma radiation. The dataset contains 684 samples coming from a soil survey carried out in Russia after the Chernobyl accident. Every sample reports the spatial coordinates (in Lambert projection) and the concentration of Cs-137 (in kBq/m³) in a soil sample. This dataset is particularly well suited to our purpose, because it has already been used to show the relevance of SVR methodologies for environmental data mapping [1].
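As a minimal sketch of the dataset layout just described (one row per sample: Lambert x, Lambert y, Cs-137 activity) and of the kind of 50/50 split used later for the individual ensemble members, the following Python fragment can be used. The file name, column order, and the synthetic stand-in values are assumptions for illustration; the actual survey data is described in [1].

```python
import numpy as np

# Hypothetical layout of the survey file: one row per sample, with columns
# (Lambert x, Lambert y, Cs-137 activity).  The file name is an assumption:
# data = np.loadtxt("cs137_survey.txt")   # expected shape: (684, 3)

# Synthetic stand-in with the same shape, for illustration only.
rng = np.random.default_rng(42)
data = np.column_stack([rng.uniform(0.0, 1e5, 684),   # Lambert x
                        rng.uniform(0.0, 1e5, 684),   # Lambert y
                        rng.lognormal(3.0, 1.0, 684)])  # activity (placeholder)

coords, activity = data[:, :2], data[:, 2]

# A 50/50 random split of the kind used for the individual ensemble members.
idx = rng.permutation(684)
train, test = idx[:342], idx[342:]
print(coords[train].shape, activity[test].shape)
```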

SUPPORT VECTOR REGRESSION

Support vector regression (SVR) is a recent regression method developed in the field of statistical machine learning. A good introduction to the SVR methodology can be found in [2]. Two main features of the SVR model are worth mentioning. First, SVR is able to deal with nonlinear datasets by implicitly mapping the data into a space where a linear fit can be obtained. Indeed, rather than working with the raw data, SVR works with a "kernel matrix" of the data, i.e., a positive semi-definite and symmetric matrix induced by some dissimilarity measure. Second, rather than using a classical discrepancy measure L to fit the model, such as the mean square error or the absolute error, SVR uses the ε-insensitive loss function, which can be expressed as

    L(y, f(x)) = |y − f(x)| − ε   if |y − f(x)| > ε
                 0                otherwise

where ε is a fixed threshold. This loss function makes the algorithm more robust, since it does not penalize errors smaller than ε. Finally, three parameters that influence the behaviour of the SVR algorithm must be tuned a priori: C, a regularization constant; σ, the scale of the Gaussian kernel (the dissimilarity measure); and ε, the width of the insensitive zone in the SVR loss function. These parameters have been tuned using a grid search over the space of possible parameters, and the values giving the smallest validation error have been retained. The following values have been used: C = 2500, σ = 4, ε = 40.

ENSEMBLE METHODS

The ultimate goal of any model or parameter selection procedure is to determine a classification or regression function that is closest to the one from which the data is "sampled". However, in real-world situations, chances are that we will fail at this task, especially if the data are very noisy. From this way of considering model selection emerges the idea of combining predictors: averaging a set of predictions increases the chances that the final predictor is close to the "real" function. Furthermore, it can be shown (cf. [3]) that the variance of a set of predictors is at most equal to the average variance of the individual predictors.

Two methods are used here: bagging and boosting. Bagging is a classical method in statistics, which works by simply averaging over a set of predictors. Each predictor is built using a bootstrap sample (i.e., a draw with replacement) of size M of the original data of size N ≥ M. Boosting, originally presented in [4] for classification, has been extended to regression by numerous authors. Here, we follow the Adaboost.R scheme proposed by Drucker [5]. The main idea of boosting is to build a final prediction by incrementally building a set of predictors. At each step, the points for which the current predictor performs badly are given a higher sampling probability in the next step. Boosting has been successfully applied to many types of tasks in computer vision and remote sensing using statistical models such as neural networks and decision trees. An important point is the choice of the loss function (a relation measuring the discrepancy of a given model) used to compare the results and to estimate the performance of each model at each boosting step. In this study, two loss functions are used: the absolute error and the SVR ε-insensitive loss function.

RELATED WORK

A boosting methodology has been applied to SVR prediction of the activity of chemical compounds in [6]. However, only one loss function was considered (the absolute error), and no other ensemble method was tested. In [7], a thorough comparison of ensemble methods for regression is performed on artificial datasets. The authors found that for simple cases, boosting sometimes worsens the results obtained with the unaggregated predictor. Indeed, they show that boosting algorithms tend to overfit the data when too many predictors are combined. They also show, however, that bagging at worst performs like a single predictor.
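The SVR setup described above (Gaussian kernel, ε-insensitive loss, tuned C = 2500, σ = 4, ε = 40) can be sketched as follows. This is a Python/scikit-learn illustration, not the Matlab toolbox used in the paper; the synthetic data is a stand-in, and the σ-to-gamma conversion assumes the kernel convention k(x, x') = exp(−‖x − x'‖² / (2σ²)).

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive_loss(y, f, eps):
    """SVR epsilon-insensitive loss: errors smaller than eps cost nothing."""
    r = np.abs(y - f)
    return np.where(r > eps, r - eps, 0.0)

# Synthetic 1-D stand-in for the spatial data (the real Cs-137 set is not bundled).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 50.0 * np.sin(X[:, 0]) + rng.normal(0.0, 5.0, size=200)

# Paper's tuned values: C = 2500, sigma = 4, eps = 40.  scikit-learn writes the
# Gaussian kernel as exp(-gamma * ||x - x'||^2), so gamma = 1 / (2 * sigma**2)
# under the convention assumed in the lead-in.
model = SVR(kernel="rbf", C=2500, gamma=1.0 / (2 * 4**2), epsilon=40)
model.fit(X, y)
pred = model.predict(X)

mean_loss = eps_insensitive_loss(y, pred, eps=40).mean()
print(mean_loss)
```

In the paper the three parameters were chosen by grid search on a validation set; the same can be done here by looping over candidate (C, σ, ε) triples and keeping the one with the smallest validation error.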
RESULTS AND DISCUSSION

The implementation used for the SVR algorithm is the one from Canu et al. [8]. The bagging and boosting algorithms have been implemented in Matlab. Figures 1 and 2 show the results obtained with bagging and boosting. The mean relative error is given for an increasing number of predictors (averaged over 10 runs), along with its corresponding standard deviation. In every case, 50% of the data is used to train the individual models. The average difference in error between one SVR trained with 100% of the data and one trained with 50% of the data has been computed first, and is approximately 5%.

[Figures 1 and 2: relative error vs. number of predictors (0 to 16), with one curve per cost function (absolute error cost and ε-insensitive cost).]

Fig. 1 and 2. Relative error vs. number of predictors for bagging and boosting, respectively. The error values are normalized by the prediction error made by a single predictor with the absolute error cost.

In both cases, a gap is observed between the two curves, which is explained by the ε-insensitive loss function. Indeed, the latter yields a lower cost for the same predictor and dataset, so a constant factor is expected between the two cost functions. These results show that for bagging, combining models using 50% of the training data reduces the error by up to 12-13%, which means approximately 7-8% compared to an SVR trained with the whole dataset. This is a significant improvement. Furthermore, SVR being usually of complexity O(n³) with respect to the size n of the training dataset, the complexity of the whole procedure can be reduced if fewer than 8 bagged predictors are used, since the complexity of training one bagged predictor using 50% of the data is O((n/2)³) = O(n³/8). Here, the use of 6 predictors provides a decrease of 12% in the error (about a 7% decrease compared to an SVR trained with the whole dataset).
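A bagged ensemble of the kind described above (6 SVR models, each trained on a bootstrap sample holding 50% of the data, predictions averaged) can be sketched in Python/scikit-learn as follows. This is an illustrative re-implementation, not the paper's Matlab code; the synthetic data and the gamma value (from σ = 4 as before) are assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor

# Synthetic stand-in data, as in the earlier sketch.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(400, 1))
y = 50.0 * np.sin(X[:, 0]) + rng.normal(0.0, 5.0, size=400)

base = SVR(kernel="rbf", C=2500, gamma=1.0 / 32, epsilon=40)

# 6 SVRs, each fitted on a bootstrap draw of 50% of the data; the ensemble
# prediction is the average of the 6 member predictions.
bag = BaggingRegressor(base, n_estimators=6, max_samples=0.5,
                       bootstrap=True, random_state=0)
bag.fit(X, y)
pred = bag.predict(X)

# Complexity argument from the text: with SVR training cost ~ n^3, six models
# on n/2 points cost 6 * (n/2)^3 = 0.75 * n^3, i.e. less than one full model.
print(6 * 0.5**3)  # 0.75
```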

Results for boosting are less encouraging. Indeed, no significant reduction of the error is observed in Figure 2. This result, however disappointing, confirms the observations made in [7]. Boosting may be interesting when the specificity of the dataset makes a single predictor prone to overfitting. A possible explanation of the behaviour of boosting is the presence of hotspots (regions of unusually high or low activity) in the data, as shown in [1]. Boosting might tend to smooth out the hotspots, which obviously leads to poor prediction performance.

CONCLUSIONS

In this paper, an ensemble methodology has been tested on the prediction of Cs-137 activity by means of SVR. It has been found that bagging significantly improves the performance of a single SVR model by combining several models trained with bootstrap samples of the original data. Given the cubic complexity of SVR, using 5 or 6 models constructed with half of the data is more efficient than training one model with the whole dataset, especially for very large databases. Boosting, however, did not improve on the results obtained with one predictor. This confirms results obtained by other authors, in the sense that boosting may be advantageous for very difficult or specific problems, where using a single predictor may lead to overfitting, but bagging performs more consistently across all problems. The use of bagging is therefore recommended, as it is easy to implement and is more likely to perform well on a wide range of problems.

A point that should be considered in further studies is the sensitivity of the SVR parameters (C, σ, ε) to the dataset. Here, as in most papers written on ensemble methods for regression, the parameters have been kept fixed throughout the bagging and boosting process, i.e., the individual predictors differ only by the data sampling used for training. This modifies only the model parameters, e.g., regression coefficients, support vectors, etc. Re-tuning the SVR parameters for every bagging (or boosting) predictor is a computationally costly process, but would be worth experimenting with.

ACKNOWLEDGEMENTS

This work is supported by the Swiss National Science Foundation (grants N° 105211-107862 and 113506). FR would like to thank Dr. Alexei Pozdnoukhov for useful comments.

REFERENCES

1. Pozdnoukhov, A. and Kanevski, M., Multi-scale support vector regression for hotspot detection and modeling, Technical report no 06-007, University of Lausanne, 2007.

2. Smola, A. and Schölkopf, B., A tutorial on support vector regression, Statistics and Computing 14: 199-222, 2004.
3. Naftaly, U. et al., Optimal ensemble averaging of neural networks, Network: Comput. Neural Syst. 8: 283-299, 1997.
4. Freund, Y. and Schapire, R., A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55: 119-139, 1997.
5. Drucker, H., Improving regressors using boosting techniques, in Proc. of the 14th Int. Conf. on Machine Learning (ICML), 1997.
6. Zhou, Y. et al., Boosting support vector regression in QSAR studies of bioactivities of chemical compounds, European Journal of Pharmaceutical Sciences 28: 344-353, 2006.
7. Barutçuoglu, Z. and Alpaydin, E., A comparison of model aggregation methods for regression, in Proc. of the 13th Int. Conf. on Artificial Neural Networks (ICANN), Springer, 2003.
8. Canu, S. et al., SVM and Kernel Methods Matlab Toolbox, Perception Systèmes et Information, INSA de Rouen, 2005.

AUTHORS INFORMATION

Frédéric RATLE, [email protected]
Machine learning, kernel methods.
Institute of Geomatics and Analysis of Risk, University of Lausanne

Devis TUIA, [email protected]
Machine learning, quantitative geography.
Institute of Geomatics and Analysis of Risk, University of Lausanne
