Natural Resources Research (© 2006) DOI: 10.1007/s11053-006-9013-6

Prediction of Arsenic in Bedrock Derived Stream Sediments at a Gold Mine Site Under Conditions of Sparse Data

Navin K. C. Twarakavi,1 Debasmita Misra,1,2 and Sukumar Bandopadhyay1

Arsenic is often present in gold mining areas. Because arsenic is highly sensitive to biogeochemical conditions, it may contaminate resources such as ground water, with catastrophic consequences. It is therefore critical to understand the spatial occurrence of arsenic across a given site. Previous studies using traditional pattern recognition techniques such as neural networks and kriging have not been entirely successful in predicting arsenic concentrations across a gold mining area. The methods used in this paper are support vector machines (SVM) and robust least-square support vector machines (robust LS-SVM). The two techniques were used to predict arsenic concentrations in the sediments of Circle City, Alaska, from the gold concentration distribution within the sediments. The results show that SVM and robust LS-SVM have better predictive capabilities than the neural network and kriging techniques. The robust LS-SVM performed better than the SVM, whose performance was affected by outliers. Removing the outliers from the data set before applying the SVM improved its results.

KEY WORDS: Sediments, SVM, arsenic, gold, dispersion.

1 Department of Mining and Geological Engineering, College of Engineering and Mines, University of Alaska Fairbanks, P.O. Box 755800, Fairbanks, AK 99775, USA.
2 To whom correspondence should be addressed; e-mail: [email protected].

© 2006 International Association for Mathematical Geology 1520-7439/06

INTRODUCTION

Arsenic is known to be toxic to humans. Drinking water contaminated with unsafe levels of arsenic may cause cancer of the skin, bladder, lungs, and possibly other internal organs, as well as non-cancer effects, including manifestations that are indicative of chronic arsenic poisoning (Andreae, 1980; Azcue and Nriagu, 1994; Focazio and others, 1999; Nimick, 1994; Nimick and others, 1998; NRC, 1999, 2001). A maximum contaminant level goal (MCLG) of zero and a low maximum contaminant level (MCL) of 10 µg/L indicate the immense toxicity of arsenic to human health. Arsenic contamination is especially prevalent in gold mining areas (Moore, 1994). In fact, mobilization of arsenic from mining areas is responsible for one of the largest Superfund sites in the United States, located at Milltown Dam, Montana (Axtmann and Luoma, 1991).

Arsenic is also a common accessory to many mineral deposits, especially gold deposits. Arsenic is mobilized by the oxidation of arsenopyrites that are generally associated with gold ores. Considering the extensive gold mining activities in Alaska, higher levels of arsenic concentration in gold mining areas are expected. A strong correlation between the presence of gold and arsenic has been pointed out by many authors. Understanding the arsenic contamination of natural resources such as soil and water is therefore significant, especially in areas with an extensive mining history. However, the complex relationship between the structural geology, mineral chemistry, and mobilization characteristics of arsenic makes it difficult to develop a process-based model for understanding arsenic mobilization. As the spatial and temporal extents involved in a model increase, the issue of equifinality (Savenije, 2001) starts gaining alarming proportions when modeling heavy metal contaminations. Under these circumstances, process-based models make way for intelligent learning systems and statistical techniques.

Modeling arsenic occurrence in natural resources such as groundwater and sediments at larger spatial scales has been attempted in the past using statistical methods and intelligent systems (Misra and others, 2005; Welch and others, 2000; Focazio and others, 1999). Statistical methods and other intelligent systems such as artificial neural networks (ANNs) are often a prime choice for modeling environmental processes at larger spatial scales. These methods are, however, heavily data driven. Instead of incorporating the physical process into the model structure, they generally follow a black-box approach.

Neural networks have long been used as a modeling tool in many fields of research. Neural networks are black-box models that learn from a training data set, mimicking the human learning ability. They are robust to noisy data and can approximate any multivariate nonlinear relation among variables. However, neural networks suffer from a number of disadvantages. They are extremely 'data-greedy': they need a large amount of data during the training stage. They need long learning times and are susceptible to local minima during optimization. In the presence of a number of local minima, neural networks may fail to find the global minimum. Furthermore, the efficiency of a neural network is reduced if the training data set contains a number of outliers. Perhaps the biggest drawback of neural networks is that the structure of the network has to be defined prior to model training. This may lead to over-fitting of the model, a large number of weights, and poor performance with previously unseen data. Therefore, in situations where data are sparse and errors in the data are inevitable, neural network models could be very inefficient.
In the past decade, a number of methods based on statistical learning theory have been developed and widely used, as they show a definite improvement over neural networks. One of the better statistical learning theory-based methods is the Support Vector Machine (SVM). SVMs represent a machine-learning model in which the prediction error and the model complexity are simultaneously minimized (Vapnik, 1995, 1998). The SVM may be considered a generalized version of neural networks. In fact, SVMs become neural networks under certain specific conditions. Mukherjee, Osuna, and Girosi (1997) showed that the SVM algorithm has a remarkable predictive capability and that it performs better than polynomial and rational approximations, local polynomial techniques, radial basis functions (RBFs), and neural networks when applied to a database of chaotic time series. Other methods such as the least-square support vector machine (LS-SVM) have also been developed (Suykens and others, 2002; De Brabanter, 2004) based on statistical learning theory. The mathematical philosophy on which these methods are based is the same as described in Vapnik (1995, 1998).

Misra and others (2005) used traditional methods such as neural networks and kriging techniques to understand the relationship between gold and arsenic in bedrock derived stream sediments. The results showed a definite relationship between arsenic and gold in the sediments. In this paper, the performance of statistical learning theory-based methods (SVM and robust LS-SVM) is compared with that of traditionally used methods (neural network and kriging techniques) in predicting arsenic in the sediments of Circle City, Alaska, using the gold concentration distribution within the sediments. The goal is to understand whether it is possible to predict arsenic concentrations in sediments in a study area if the associated gold concentrations are known.

STUDY AREA AND CONTAMINATION DATA

The area selected for the study is located in the Circle mining district, Alaska (Fig. 1). Gold mining has been a significant part of this district since the late 1890s with over 1 million of gold produced. The district, however, is currently devoid of any mining activity. The geology of the Circle District consists of metamorphosed Paleozoic sediments that are bounded on the north by the major Tintina fault zone and intruded by several granite phases. These rocks are part of the Yukon Tanana upland terrain that hosts several major gold deposits, including the Fort Knox and Pogo gold deposits. The intrusive rocks found in the district are Cretaceous and Tertiary granite and granodiorite that occur as stocks and dikes. These intrusive rocks are elongated in an east–west and north-northwesterly direction and are centered on a regional anticline structure. The gold deposits in the Circle mining district are typically associated with arsenopyrites.

Figure 1. Map showing the location of the study area in Alaska.

Data for various contaminants in the sediments of a mining site in the Circle mining district were obtained for this study. Figure 1 shows the location of the sampling sites used to collect the data. Over 1088 samples of bedrock derived stream sediments were collected and analyzed for concentrations of gold, arsenic, and other heavy metals in the early 1980s as part of a project to estimate the mineral potential of the Circle mining district (Wiltse, 1987; Metz, 1991). Misra and others (2005) used various neural networks and kriging techniques to relate the occurrence of gold and arsenic in the study area. Although the possibility of a definite relation between the occurrence of gold and arsenic in the sediments was established, the confidence level was less than desired.

Before the statistical learning theory-based methods were applied to the study area, a cursory analysis of the relationship between gold and arsenic in the sediments was performed. Misra and others (2005) performed simple statistical analyses (quantile-quantile plot, semi-variogram analysis, histogram analysis) of gold and arsenic concentrations and showed that a definite relationship exists. The contours of arsenic and gold concentration shown in Fig. 2 were developed using the inverse distance weighting interpolation scheme. The plot shows a clear overlap of 'hotspots' of arsenic and gold concentrations. The arsenic concentration contours, however, are spread farther as a result of the
higher mobility of arsenic compared to that of gold. Another interesting site-specific observation is that the mobility of arsenic is higher along the x-direction than along the y-direction.

Figure 2. Contours of arsenic and gold concentrations in sediments. Note that the 'high concentration' contours for arsenic and gold nearly overlap each other, indicating a common source for the contaminated sediments.

To further understand the characteristics of the spatial distributions of gold and arsenic in the sediments, a variogram analysis was performed using VARIOWIN software (Pannatier, 1996). Using the data, variogram surfaces were created indicating the spatial continuity of gold and arsenic concentrations detected in the sediments (Fig. 3a and b). Arsenic seems to exhibit higher spatial continuity than gold. This may be attributed to the higher mobilization characteristics of arsenic. A cross-variogram grid was also estimated, indicating the spatial relation between the gold concentration at any location and the arsenic concentration at any distance from that location (Fig. 3c). Figure 3c seems to support the observation that a definite relation exists between gold and arsenic concentrations in sediments. The effective correlation length between gold and arsenic concentrations is estimated to be 0.2° (~5.6 miles) along the latitudinal direction and 0.05° (~1.4 miles) along the longitudinal direction.

Figure 3. Standardized variogram plots for different lags along the latitude and longitude directions.

Preliminary statistical analysis of the data suggests a strong relation between gold and arsenic concentrations. However, Misra and others (2005) could not successfully quantify this relation using neural networks and kriging techniques. In this paper, statistical learning theory-based methods (SVM, robust LS-SVM) will be used to quantify the relationship between gold and arsenic concentrations.
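The cross-variogram analysis described above can be illustrated with a minimal sketch. It assumes the usual empirical estimator, in which products of paired differences are averaged over all sample pairs whose separation falls in a lag bin. The function name, the omnidirectional lag binning, and the toy values are assumptions made for illustration; the paper's variograms were standardized, computed along latitude and longitude separately, and produced with VARIOWIN.

```python
import math


def cross_variogram(coords, a, b, lag, tol):
    """Empirical cross-semivariogram of a and b at separation lag +/- tol.

    gamma_ab(h) = sum((a_i - a_j) * (b_i - b_j)) / (2 * N(h)), taken over the
    N(h) pairs whose separation distance falls inside the lag bin.
    Returns None when the bin holds no pairs.
    """
    total = 0.0
    pairs = 0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            if abs(d - lag) <= tol:
                total += (a[i] - a[j]) * (b[i] - b[j])
                pairs += 1
    return total / (2 * pairs) if pairs else None


# Hypothetical samples on a line (not the Circle district data).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gold = [1.0, 2.0, 4.0]
arsenic = [2.0, 4.0, 8.0]

gamma_cross = cross_variogram(coords, gold, arsenic, lag=1.0, tol=0.1)
gamma_gold = cross_variogram(coords, gold, gold, lag=1.0, tol=0.1)
```

Passing the same variable twice gives the ordinary semivariogram, so one routine covers both the single-variable and the gold-arsenic case.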

METHODOLOGY

This section briefly explains the two methods based on statistical learning theory: SVM and robust LS-SVM.

Support Vector Machines

The support vector methodology is based on statistical learning theory (Vapnik, 1995, 1998). A brief explanation of the SVM for regression is provided here; readers are referred to Vapnik (1995, 1998) for details. Suppose we have training data {(x1, y1), . . . , (xm, ym)}, where the vector x represents the independent input variables (in this case, p, δ, and ) and y represents the output (here, C). In ε-SVM regression (referred to here as SVM), the goal is to find a function f(x) that deviates by at most ε from the actually observed value, y (Smola and Schölkopf, 1998). In other words, the learning machine accepts an error of up to ε: errors are not penalized as long as the model predicts with a deviation no larger than ε. This is perhaps the most novel and important feature of SVM regression (Fig. 4). Mathematically, SVM regression adopts the linear form (Smola and Schölkopf, 1998)

\[ f(x) = \langle w, \phi(x) \rangle + b \tag{1} \]

where φ is the mapping, defined through a kernel function, that transforms the independent variables into a feature space such that the relation between y and φ(x) is linear, and ⟨·, ·⟩ represents the dot product between two vectors. A number of kernels are available for use in SVM regression. The solution to the formulation is obtained by seeking the maximum possible flatness, that is, a small w (Smola and Schölkopf, 1998). Equation (1) then reduces to the convex optimization problem

\[ \text{minimize} \quad \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad \begin{cases} y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon \\ y_i - \langle w, \phi(x_i) \rangle - b \ge -\varepsilon \end{cases} \tag{2} \]

The intrinsic assumption in Eq. (2) is that a function f exists that approximates all pairs (xi, yi) with a precision of ε (Smola and Schölkopf, 1998). Most of the time this may not be the case, as errors are unavoidable in a groundwater quality database. Cortes and Vapnik (1993) introduced the slack variables ξi∗ and ξi to cope with the errors and formulated the SVM regression model as the following convex optimization problem

\[ \text{minimize} \quad \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{m} (\xi_i^* + \xi_i) \quad \text{subject to} \quad \begin{cases} y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i \\ y_i - \langle w, \phi(x_i) \rangle - b \ge -\varepsilon - \xi_i^* \\ \xi_i^*, \xi_i \ge 0 \end{cases} \tag{3} \]

Figure 4. Pre-specified accuracy ε and slack variable ξ in SVM regression (loss plotted against y − f(x)).

The constant C > 0 determines the tradeoff between the flatness of f(x) and the amount up to which deviations larger than ε are tolerated (Smola and Schölkopf, 1998). The constant C, which controls the tradeoff between the approximation error and the weight vector norm, ‖w‖, is a design parameter chosen by the user. An increase in C penalizes large errors and consequently leads to a decrease in approximation error. This is achieved by increasing the weight vector norm, ‖w‖, which does not necessarily guarantee a good generalization performance of the model.

As with any constrained optimization problem, the SVM regression problem in Eq. (3) can be solved by finding the saddle point of the Lagrange function (Vapnik, 1995, 1998). The Lagrangian is formed by multiplying the inequality constraints (written in the form g(x) ≥ 0) by non-negative Lagrange multipliers (α, β, α∗, β∗ ≥ 0) and subtracting them from the objective function. For the linear case, the constrained optimization problem in Eq. (3) leads to the primal variables Lagrangian

\[ \begin{aligned} L(w, b, \xi, \xi^*, \alpha, \alpha^*, \beta, \beta^*) ={}& \frac{1}{2} w^T w + C \sum_{i=1}^{m} (\xi_i^* + \xi_i) \\ & - \sum_{i=1}^{m} \alpha_i^* \left[ y_i - w^T x_i - b + \varepsilon + \xi_i^* \right] \\ & - \sum_{i=1}^{m} \alpha_i \left[ -y_i + w^T x_i + b + \varepsilon + \xi_i \right] \\ & - \sum_{i=1}^{m} (\beta_i^* \xi_i^* + \beta_i \xi_i) \end{aligned} \tag{4} \]

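where the primal variables Lagrangian, L, has to be minimized with respect to the primal variables w, b, ξi∗, and ξi and maximized with respect to the non-negative Lagrange multipliers α, β, α∗, and β∗. Therefore, this problem can be solved either in a primal space or in a dual space. A solution in dual space can be formulated by using the Karush–Kuhn–Tucker (KKT) conditions for regression: setting the derivatives of L with respect to the primal variables to zero gives w = Σi (αi − αi∗) xi together with the constraints listed below, and substituting these back eliminates the primal variables. Therefore, Eq. (4) may be solved by maximizing a dual variables Lagrangian, Ld(α, α∗) (Vapnik, 1995, 1998):

\[ \begin{aligned} L_d(\alpha, \alpha^*) ={}& -\varepsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^*) \\ & - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, x_i^T x_j \end{aligned} \]

\[ \text{subject to} \quad \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i, \alpha_i^* \le C \tag{5} \]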
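To make the role of ε, C, and the slack variables concrete, the following minimal sketch evaluates the soft-margin objective of Eq. (3) for a given linear model. It is illustrative only and is not the code used in the paper: the function names and the toy data are hypothetical, and the values ε = 0.05 and C = 30 are simply borrowed from the optimal parameters reported later in the paper.

```python
def epsilon_slacks(residual, eps):
    """Return (xi, xi_star) for one residual r = y - f(x).

    xi      is the excess of r above +eps (constraint y - f <= eps + xi).
    xi_star is the excess of r below -eps (constraint y - f >= -eps - xi_star).
    Residuals inside the eps-tube give zero slack and are not penalized.
    """
    xi = max(0.0, residual - eps)
    xi_star = max(0.0, -residual - eps)
    return xi, xi_star


def primal_objective(w, b, xs, ys, eps, C):
    """Objective of Eq. (3) for a linear model f(x) = <w, x> + b."""
    flatness = 0.5 * sum(wk * wk for wk in w)
    penalty = 0.0
    for x, y in zip(xs, ys):
        f = sum(wk * xk for wk, xk in zip(w, x)) + b
        xi, xi_star = epsilon_slacks(y - f, eps)
        penalty += xi + xi_star
    return flatness + C * penalty


# Toy one-dimensional data in standardized units (hypothetical).
xs = [(0.0,), (1.0,), (2.0,)]
ys = [0.0, 1.0, 3.0]

# Only the third point lies outside the tube (residual 1.0, slack 0.95).
value = primal_objective(w=(1.0,), b=0.0, xs=xs, ys=ys, eps=0.05, C=30.0)
```

A larger C makes the single out-of-tube point dominate the objective, which is the mechanism by which outliers can pull the fitted function toward themselves.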

The solution to the SVM regression explained above refers to the linear case. For a nonlinear scenario, the input vectors are mapped into a high-dimensional space using a kernel function, and the SVM regression for the linear case is applied to the mapped input vectors. A number of kernels are available for this purpose, such as the RBF kernel and the polynomial kernel (Vapnik, 1995, 1998).

Robust LS-SVM

SVM solutions are attracting increasing attention, mostly because they automatically answer certain crucial questions involved in the construction of a neural network. They derive an 'optimal' network structure and answer the most important question related to the 'quality' of the resulting network. The main drawback of the standard SVM is its high computational complexity. For this reason, the LS-SVM technique has recently been introduced. LS-SVMs are reformulations of the standard SVMs (Suykens and others, 2002; De Brabanter, 2004). LS-SVMs are, however, algorithmically more efficient, because the solution can be obtained by solving a set of linear equations instead of a computation-intensive quadratic-programming problem. Robust LS-SVM is based on the LS-SVM methodology. SVM methods may not work well on data with non-Gaussian noise (De Brabanter, 2004). The robust LS-SVM method was introduced to cope with non-Gaussian noise. A detailed explanation of the LS-SVM method can be found in De Brabanter (2004).

Modeling Framework

The modeling framework for using SVM and robust LS-SVM consists of the following steps:

1. Preparation of training and testing data sets.
2. Training the model using the training data set.
3. Cross-validating the model using the training data set or subsets of the training data set.
4. Testing the trained model using the testing data set.

Step 1 involves the development of the training data set and the testing data set. It is important to prepare the training and testing data sets with similar statistical characteristics. Some of the important statistical characteristics are the maximum and minimum values, the size of the data sets, the median, the trimmed mean, and the range. Genetic algorithms were used to develop statistically similar training and testing data sets for the present study. This methodology has been used before, for example by Misra and others (2005) and Samanta, Bandopadhyay, and Ganguli (2002).

In step 2, the model is trained using the training data set to obtain an optimal parameter estimate. The models considered in this application are the SVM and the robust LS-SVM. After training, the model is cross-validated in step 3 using the training data set or a subset of it. The subset of the training data set is prepared following the procedure described in step 1. Different subsets of the training data set may be used for cross-validation to ensure the stability of the model parameter estimates. In step 4, the model is tested using the testing data set. Step 4 is vital for ensuring confidence in the model, as the testing data set has not been 'seen' by the model before.

The SVM was implemented in MATLAB using the code described in Schölkopf and others (2000) and Fan, Chen, and Lin (2005). The robust LS-SVM was implemented using the LS-SVM software (Suykens and others, 2002), which is also written in MATLAB.
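The remark that an LS-SVM is obtained from a set of linear equations rather than a quadratic program can be sketched as follows. This is a minimal illustration of the standard, unweighted LS-SVM regression system, written in Python rather than the MATLAB toolbox used in the paper. The names gamma (regularization constant) and sigma (RBF width), the kernel parameterization, and the toy data are assumptions. The robust variant, which additionally reweights the errors to reduce the influence of outliers, is not shown.

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel exp(-||a - b||^2 / (2 sigma^2)) (assumed form)."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


def solve_linear(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def lssvm_fit(xs, ys, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    m = len(xs)
    A = [[0.0] + [1.0] * m]
    for i in range(m):
        row = [1.0]
        for j in range(m):
            k = rbf_kernel(xs[i], xs[j], sigma)
            row.append(k + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    z = solve_linear(A, [0.0] + list(ys))
    return z[0], z[1:]  # bias b, multipliers alpha


def lssvm_predict(x, xs, alpha, b, sigma):
    """f(x) = sum_i alpha_i K(x, x_i) + b."""
    return sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alpha, xs)) + b


# Toy one-dimensional data (hypothetical, standardized units).
xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys = [0.0, 0.8, 0.9, 0.1]
b, alpha = lssvm_fit(xs, ys, gamma=10.0, sigma=1.0)
```

In this formulation every training point receives a nonzero multiplier proportional to its residual, which is consistent with the later observation that LS-SVM gives up some of the sparseness of the standard SVM.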

RESULTS AND DISCUSSION

Using the methodology described earlier, the training and the testing data sets were created from the original database. The resulting data sets show similar statistical characteristics and are spatially scattered throughout the study area. Some key statistical properties of the two data sets are listed in Table 1. Prior to model development and validation, the data were standardized. Standardization was performed assuming a uniform distribution. Attempts to fit the data with a probability distribution showed that the data do not follow traditional distributions such as normal, lognormal, or uniform. The uniform distribution was selected for standardizing the data only for mathematical convenience. Also, SVMs are robust to data that are standardized using any distribution (Vapnik, 1995). The independent variables considered here are the gold concentrations and the spatial coordinates, and the dependent variable is the arsenic concentration.

Table 1. Descriptive Statistics for the Training and Testing Data Sets

                              Mean     Standard deviation
Training set (544 data)
  Longitude                −145.32     0.34
  Latitude                   65.44     0.09
  Gold (ppb)                 12.50    45.31
  Arsenic (ppm)              31.19    54.01
Testing set (544 data)
  Longitude                −145.32     0.34
  Latitude                   65.44     0.09
  Gold (ppb)                 12.49    44.52
  Arsenic (ppm)              31.18    48.01

The standard SVM regression described in Vapnik (1995, 1998) was used. For the SVM, a standard RBF kernel was used to map the input data into the feature space. The parameters for the SVM model (ε, C, and the support vectors) were estimated through a rigorous grid-search approach. For the optimal SVM model, the parameter estimates were C = 30 and ε = 0.05, and 350 of the 544 data points in the training set were selected as support vectors. The selection of only 350 of the possible 544 data points as support vectors indicates the ability of the SVM model to work under sparse conditions. Figure 5 shows the arsenic concentrations predicted by the SVM model against the observed concentrations. While the SVM fits the training data set almost perfectly (Fig. 5a; R2 = 0.99), it is less efficient in fitting the testing data set (Fig. 5b; R2 = 0.50). The poorer prediction of the testing data set is attributed to over-fitting to the outliers and erroneous data during the validation stage. It is interesting to note, however, that even one of the simplest statistical learning theory-based methods, the SVM, seems to outperform traditional methods such as neural networks. This is evident on comparison with the results obtained by Misra and others (2005) with neural network and kriging techniques on the same data set.

To further understand the reasons for the relatively poorer fit for the testing data set, the error in prediction was analyzed. Figure 6 shows the histogram of the error in prediction for the testing data set. The SVM seems to predict with high accuracy for most of the testing data set, but its predictive capability seems to fail at higher arsenic concentrations. One possible reason for the poorer predictions at higher arsenic concentrations is the tendency of the SVM to fit any outliers. The applicability of SVMs may also fail under conditions of non-Gaussian outliers.
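The goodness-of-fit figures quoted above (R2 and the fitted lines such as y = 0.90x in Fig. 5) can be reproduced in spirit with the short sketch below. The paper does not state how these quantities were computed, so the definitions used here (the usual coefficient of determination and a least-squares line through the origin) are assumptions, as are the function names and the sample values.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def slope_through_origin(observed, predicted):
    """Least-squares slope a of the line predicted = a * observed."""
    sxy = sum(o * p for o, p in zip(observed, predicted))
    sxx = sum(o * o for o in observed)
    return sxy / sxx


# Hypothetical As concentrations in ppm (not the study data).
observed = [10.0, 20.0, 40.0, 80.0]
predicted = [12.0, 18.0, 35.0, 70.0]

r2 = r_squared(observed, predicted)
slope = slope_through_origin(observed, predicted)
```

A slope below one together with a moderate R2, as in Fig. 5b, would indicate that the model tends to underpredict the larger observed concentrations.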

Figure 5. Predicted and observed As concentrations (ppm) for the (a) training data set (y = 0.99x, R2 = 0.99) and (b) testing data set (y = 0.90x, R2 = 0.50) for SVM.

In order to examine the effects of the outliers, the SVM was applied after removing the outliers from the data set. The SVM methodology performed better on the trimmed data set, predicting the testing data set with an R2 of 0.65 (Fig. 7). However, it is important to note that removing outliers from the data set to suit the SVM methodology results in underachievement of the research objectives.

Figure 6. Descriptive statistics for error in prediction of arsenic concentrations (in ppm) from testing data set for SVM.
Anderson-Darling normality test: A-squared = 56.77, P-value < 0.005.
Mean = 1.959, StDev = 37.529, Variance = 1408.402, Skewness = −6.1324, Kurtosis = 90.0848, N = 523.
Minimum = −540.343, 1st quartile = −4.149, Median = 0.606, 3rd quartile = 6.633, Maximum = 184.550.
95% confidence intervals: mean, −1.265 to 5.182; median, 0.061 to 1.484; StDev, 35.384 to 39.952.

Figure 7. Predicted and observed As concentrations (ppm) for the (a) training data set (y = 0.94x, R2 = 0.95) and (b) testing data set (y = 0.81x, R2 = 0.65) for SVM with outliers removed.

In order to improve the prediction efficiency under conditions of noisy data, the method of
robust LS-SVM was applied to the same data set. Figure 8 shows the predicted arsenic concentrations plotted against the observed concentrations for the training and testing data sets. The application of robust LS-SVM shows an overall improvement compared to the SVM. While robust LS-SVM fits the training data set less closely than the SVM (Fig. 8a; R2 = 0.95), it shows a considerable improvement in predicting the testing data set (Fig. 8b; R2 = 0.70). This is a significant observation, as it clearly points to the improved efficiency of the robust LS-SVM under conditions of noisy data and outliers.

Figure 8. Predicted and observed As concentrations (ppm) for the (a) training data set (y = 0.94x, R2 = 0.95) and (b) testing data set (y = 0.84x, R2 = 0.70) for robust LS-SVM.

For further analysis, a simple statistical analysis of the error in prediction by the robust LS-SVM was performed. Figure 9 gives a graphical summary of the prediction error statistics for the robust LS-SVM. Figure 9 clearly shows an improved performance of the robust LS-SVM compared to the SVM.

Figure 9. Descriptive statistics for error in prediction of arsenic concentrations (in ppm) from testing data set for robust LS-SVM.
Anderson-Darling normality test: A-squared = 60.72, P-value < 0.005.
Mean = −0.345, StDev = 27.430, Variance = 752.392, Skewness = −2.4435, Kurtosis = 37.5538, N = 544.
Minimum = −266.710, 1st quartile = −1.786, Median = 2.560, 3rd quartile = 7.022, Maximum = 237.895.
95% confidence intervals: mean, −2.655 to 1.966; median, 1.982 to 3.164; StDev, 25.891 to 29.164.

It is also
of interest to understand the prediction errors for different ranges of observed arsenic concentrations. Figure 10 shows a marginal plot of the percentage absolute error against the observed arsenic concentration. Almost 90% of the testing data set has a percentage absolute error of less than 5%. The error in prediction seems to increase as the observed arsenic concentration increases.

Figure 10. Marginal plot of percentage absolute error and observed As concentration (ppm).

Because of an efficient model structure, robust LS-SVM seems to perform better than the other learning methods (ANN, SVM) in predicting the arsenic in streambed sediments in the study area. The applicability of robust LS-SVM to predict the spatial distribution of arsenic was also checked (Fig. 11). The trained robust LS-SVM model was applied throughout the study area to predict the arsenic concentrations. The arsenic concentrations predicted by robust LS-SVM appear to correlate strongly with the observed arsenic concentrations.

Figure 11. Arsenic concentrations in streambed sediments (a) observed and (b) predicted by robust LS-SVM.
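The error summary behind Fig. 10 can be sketched in a few lines. The paper does not define the percentage absolute error explicitly, so the sketch assumes it is the absolute prediction error divided by the observed concentration, and it skips samples with a zero observed value. The function names and sample values are hypothetical.

```python
def percent_absolute_errors(observed, predicted):
    """|predicted - observed| / observed * 100, for samples with observed > 0."""
    return [abs(p - o) / o * 100.0 for o, p in zip(observed, predicted) if o > 0]


def fraction_below(errors, threshold):
    """Share (0 to 1) of errors strictly below the threshold."""
    return sum(1 for e in errors if e < threshold) / len(errors)


# Hypothetical As concentrations in ppm (not the study data).
observed = [10.0, 20.0, 40.0, 200.0]
predicted = [10.2, 19.5, 41.0, 150.0]

errors = percent_absolute_errors(observed, predicted)
share_under_5_percent = fraction_below(errors, 5.0)
```

Applied to the testing data set, the second function would give the kind of statement made above, namely the share of samples whose percentage absolute error is below 5%.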

SUMMARY AND CONCLUSION

The application of statistical learning theory-based methods to predict arsenic concentrations in the study area has shown encouraging results. While Misra and others (2005) reported a poor performance by ANN (R2 ∼ 0.35), the application of robust LS-SVM and SVM resulted in a substantial improvement, with R2 of 0.70 and 0.50, respectively. The effectiveness of SVM and robust LS-SVM may be attributed to the robustness they attain through structural risk minimization (Vapnik, 1995).

It is fair to conclude that the robust LS-SVM shows great potential for the prediction of arsenic in streambed sediments. The robust LS-SVM seems to outperform the SVM in the prediction of arsenic concentrations. The presence of outliers seems to have a significant impact on the performance of the SVM: removing the outliers from the data set improved its testing R2 from 0.50 to 0.65. The comparison of robust LS-SVM with the SVM, with and without outliers in the data set, shows that robust LS-SVM is much more robust than the SVM. Its better performance in both comparisons indicates a better characterization of the phenomenon by robust LS-SVM. However, the weakness of robust LS-SVM is that the ability to perform under conditions of sparse data is lost to some extent.

There are, however, certain limitations that are unavoidable in statistical methods, since they are 'data-driven.' The accuracy of the results from the SVM and robust LS-SVM depends entirely on the errors in the data set and on how well the study area is represented in the training phase. Unlike process-based methods, the confidence in the results from these statistical methods is based on the relative errors in the data. But where the data are too sparse or insufficient to represent the processes, one may say that statistical learning methods
Twarakavi, Misra, and Bandopadhyay hold promise and show a new direction in prediction of natural processes. REFERENCES Andreae, M. O., 1980, Arsenic in rain and the atmospheric mass balance of arsenic: J. Geophys. Res., v. 85, no. 8, p. 4512– 4518. Axtmann, E. V., and Luoma, S. N., 1991, Large-scale distribution of metal contamination in the fine-grained sediments of the Clark Fork River, Montana, U.S.A: Appl.Geochem., v. 6, p. 75–88. Azcue, J. M., and Nriagu, J. O., 1994, Arsenic: Historical perspectives, in Nriagu J. O., ed., Advances in Environmental Science and Technology: Arsenic in the Environment. Part 1: Cycling and Characterization, Wiley, NY, p. 1–16. De Brabanter, J., 2004, LS-SVM Regression Modelling and its Applications (2004): PhD dissertation, Katholieke universiteit, Kasteelpark Arenberg 10, 3001 Leuven (Heverlee), 245 p. Fan, R. E., Chen, P.-H., and Lin, C.-J., 2005, Working set selection using the second order information for training SVM. Journal of Machine Learning Research, v. 6, no. 12, p. 1889–1918. Focazio, M. J., Welch, A. H., Watkins, S. A., Helsel, D. R., and Horn, M. A., 1999, A retrospective analysis on the occurrence of arsenic in ground-water resources of the United States and limitations in drinking water supply characterizations: U.S. Geological Survey, Water Resources Investigation Report 99-4279, 21 p. Metz, P. A., 1991, Metallogeny of the Fairbanks Mining District, Alaska and Adjacent Areas: MIRL Report No. 90, 370 p. Misra, D., Samanta, B., Dutta, S., and Bandopadhyay, S., 2005, Evaluation of artificial neural network and kriging for the prediction of arsenic in bedrock derived stream sediments using gold concentration data: Int. J. Surface Min. Reclam. Environ., accepted. Moore, J. N., 1994, Contaminant mobilization resulting from redox pumping in a metal-contaminated river-reservoir system, in Environmental Chemistry of Lakes and Reservoirs, Washington, DC: American Chemical Society, p. 451– 471. 
Mukherjee, S., Osuna, E., and Girosi, F., 1997, Nonlinear prediction of chaotic time series using a support vector machine, in Principe, J., Gile, L., Morgan, N., and Wilson, E., eds., Proceedings of the VII 1997 IEEE Workshop on Neural Networks for Signal Processing: IEEE, p. 5–11.
Nimick, D. A., 1994, Arsenic transport in surface and ground water in the Madison and Upper Missouri River Valleys: Montana: EOS, v. 75, no. 1, p. 247.
Nimick, D. A., Moore, J. N., Dalby, C. E., and Savka, M. W., 1998, The fate of arsenic in the Madison and Missouri Rivers, Montana and Wyoming: Water Resources Res., v. 34, no. 11, p. 3051–3067.
NRC, 1999, Arsenic in Drinking Water: National Academy Press, Washington, DC, 330 p.
NRC, 2001, Arsenic in Drinking Water: Update 2001: National Academy Press, Washington, DC, 244 p.
Pannatier, Y., 1996, VARIOWIN: Software for Spatial Data Analysis in 2D: Springer-Verlag, New York, NY, 25 p.
Savenije, H. H. G., 2001, Equifinality, a blessing in disguise?: Hydrol. Process., v. 15, no. 14, p. 2835–2838.
Samanta, B., Bandopadhyay, S., and Ganguli, R., 2002, Data segmentation and genetic algorithms for sparse data division in Nome Placer gold grade estimation using neural network and geostatistics: Explor. Min. Geol., v. 11, no. 1–4, p. 69–76, DOI: 10.2113/11.1-4.69.
Schölkopf, B., Smola, A., Williamson, R., and Bartlett, P. L., 2000, New support vector algorithms: Neural Comput., v. 12, p. 1207–1245.
Suykens, J. A. K., Gestel, T. V., De Brabanter, J., De Moor, B., and Vandewalle, J., 2002, Least Squares Support Vector Machines: World Scientific, Singapore, 308 p.
Vapnik, V., 1995, The Nature of Statistical Learning Theory: Springer, New York, 314 p.
Vapnik, V., 1998, Statistical Learning Theory: Wiley, New York, 736 p.
Welch, A. H., Westjohn, D. B., Helsel, D. R., and Wanty, R. B., 2000, Arsenic in ground water of the United States: Occurrence and geochemistry: Ground Water, v. 38, no. 4, p. 589–604.
Wiltse, M. A., 1987, Geochemistry of the Lime Peak-Mt. Prindle area west-central Circle Quadrangle, Alaska, in Smith, T. E., Pessel, G. H., and Wiltse, M. A., eds., Mineral Assessment of the Lime Peak-Mt. Prindle area, Alaska: Alaska Division of Geological and Geophysical Surveys Miscellaneous Publication 29E, p. 6.1–6.37.
