Using multivariate adaptive regression splines to estimate pollution in soil

Article history: Received 16 September 2016 Received in revised form 17 November 2016 Accepted 10 December 2016 Heavy metal pollution is one of the main factors of the traffic pollution. The public authorities have been monitoring the concentration of heavy metal by means of sampling stations. This paper describes the response surface models and an intelligent regression algorithm, multivariate adaptive regression splines (MARS) models to data collected from soil at the stations where there were high density of buildings, roads, traffic and tramways. The model variables included the number of car and tramways and the concentration levels of Cadmium (Cd), Zinc (Zn) and Lead (Pb), at depth of 0100mm. The objective of this study was to apply MARS to analyze the model output when there are a few numbers of design points. Several MARS models developed to simulate the concentration of each heavy metal. The performance of MARS was compared to that of response surface methodology (RSM) using 1st and 2nd order response surface models with respect to the accuracy metrics; root mean square error and adjusted R2. The results showed that MARS models were successful in goodness of fit, suitable and also reliable as compared to the RSM models. Additionally, use of MARS in selection of the variables indicating great contribution on the response was effective.


Introduction
*With the continuous increase in industrial development, heavy metal contamination has posed a serious threat for the environment. Elevated concentrations in air, water, and soil may occur close to industrial emission sources, particularly nonferrous mining and metal refining industries. Human activities, such as the atmospheric deposition of industrial soot, dust and aerosols, and coal burning exhausts, the application of fertilizers, livestock manures, and agrochemicals, and the disposal of anthropic wastes were the main sources of Cd, Pb and Zn. Increasing deposition of heavy metals on land and air has given a considerable concern about its impact on human health by the society to provide a sustainable environment. In recent years, various ways of approaching the distribution of pollution were analyzed by the scientists. Silva et al. (2001) studied on the main factors of air pollution in Santiago, Chile. They modelled and predicted the atmospheric pollution using the meteorological variables by means of MARS and non-parametric discriminant analysis. Gruszczynski (2005) examined the soil spatial distribution pollution of Chromium (Cr) where pollution by this element was high using interpolation algorithms and artificial neural networks. Covelo et al. (2008) examined the tree fitted regression models on the data containing six heavy metal. The sorption and retention of mixtures of heavy metals was reproduced by binary decisiontree regression models using classification and regression trees (CART) algorithm by an accompanying paper of Vega et al. (2009). Cheng et al. (2009) presented a case study in assessment of the distribution of soil Zinc (Zn) in an area severely polluted. Nieto et al. (2012) improved his work about cyanotoxins, a kind of poisonous substances produced by cyanobacteria, prediction from some experimental cyanobacteria concentrations in the Trasona reservoir (Asturias, Northern Spain) using (MARS). Piedade et al. (2014) applied a new approach of visualization based on tridimensional images of lead (Pb) concentrations in soil of a mining and metallurgy area to determine the spatial distribution of this pollutant and to estimate the most contaminated volumes. Lee and Toscas (2015) estimated the spatial distribution of the lead concentration levels that may affect exposed humans by using penalized regression and tensor product smooths.
In keeping with the above, the contaminated soil was also investigated by different statistical methods as well. Long et al. (2013) used response surface methodology (RSM) based on Box-Behnken experimental design for the analysis of variables of surfactant flushing treatment to optimize toluene removal efficiency from contaminated soil. Martínez-Fernández et al. (2014) studied response surface methodology to develop predictive models from central composite designs. Zhu et al. (2015) studied confirmed the use of Box-Behnken experimental design for analyzing the variables of ultrasoundassisted surfactant extraction treatment.
Different from conventional models, Govaerts and Noel (2005) discussed the analysis of a designed experiment when the response was a curve using three different approaches: two-step nonlinear modeling, pointwise functional regression and smoothed functional regression.
Developed in 1990 by Friedman, MARS is an intelligent, flexible, fast and accurate in prediction for various types of variables. Many applications have shown the successful prediction of MARS;Chun et al. (2003) showed the performance of MARS for simulating the pesticide transport in soils and confirmed that it can simulate complex phenomena in a simple and straightforward way rather than artificial neural networks. Woods and Lewis (2006) gave a method for constructing all-biased designs for polynomial spline regression models. Crino and Brown (2007) combined MARS with a response surface methodology. In his study, MARS showed low computational cost and better interpretability when compared to neural networks and generalized additive models.
Since 1990's, the successful applications of MARS were appeared in several field of studies (Crino and Brown, 2007). The primary objective of this study was to apply MARS to simulate the concentrations of three heavy metal at soil depth of 0-100mm. Several MARS models with interaction terms were developed and compared to the results obtained by 1 st and 2 nd order response surface models. Importance sequence of input variables to each heavy metal was determined respectively. The performance of MARS models was compared to the response surface models with respect to root mean squared error (RMSE), number of variables, number of observations, and the adjusted determination coefficient ( 2 ).

Model description
RSM comprises a group of statistical techniques for empirical model building and model exploitation (Box and Draper, 2007). In this study, it is assumed that some true physical relationships between the expectation of the response y and two factors ( 1 and 2 ) and via physical constants exist as follows (Eq. 1): The nature of the expectation function in E(y) is unknown and it is replaced by an approximating function as either of the following (Eqs. 1 and 2): (2) where, = ( 1 , 2 ) is a linear coding of a factor , i=1, 2 (Box and Draper, 2007). The response is y and 1 , , 2 , . . . , are k known explanatory variables. 2 is a higher-order term and is the interaction term for i=1, ..., k-1, j=1, ..., k. 0 , 1 , … , , 11 , … , are unknown parameters and is a random error (Khuri and Cornell, 1987). The response is modeled by a linear function or the model is upgraded by adding higher-order terms if there is a curvature in the system.
In order to build a MARS model, a response and a set of exploratory variables are required. MARS splits the data into several splines and approximates the regression model using basis functions (BFs) as follows (Eq. 4): where, ℎ ( )is BF that represents the data in each subgroup. 0 , … , are unknown parameters, i=1, …, M, that are estimated by ordinary least squares method once BFs are investigated. BFs can be represented by (Eq. 5): where, "+" means the argument is a truncated power function, is the number of variables (interaction order) in the m th basis expansion. ( , ) is the v th variable, 1 ≤ ( , ) ≤ . , is a knot on each of the corresponding variable where the two BFs in two adjacent domains of data intersect? Therefore, MARS creates knots which can be located among different exploratory variables. The BF represents the relationship between the knots using the reflected pairs of hockey stick function as follows (Eq. 6): Where, ( ) is a new variable with values 0 for all values X up to some threshold t whereas ( ) is equal to x for all values of x larger than the threshold value. The second pair of hockey stick function generates a reflected effect of the first one and illustrates the variation in BFs for changes of t values for variable X. Thus, t denotes the knot where the behavior of the function changes. Each BF is unique between any two knots and replaced by another BF at each knot (Abraham and Steinberg, 2001;Fridedman, 1991). Thus, a knot is located at the beginning of a region and the end of another. Finding the best knot is a search process in MARS. In each spline, the data is splitted in regions. The MARS model fits a regression line from region to region using splines whereas model's response is continuous.
MARS procedure adaptively selects the BF set by two iterative approaches: forward and backward selection. It uses the residual squared error in iterations to compare the partition points. Determination of knot locations is adaptive to data characteristics (Abraham and Steinberg, 2001;Fridedman, 1991). The criterion used to set the final model is a modified generalized cross validation (GCV) of the first proposed one by Craven and Wahba (1978) (Eq. 7).
MARS generates and compares the models in terms of the importance of the inputs in the models using ANOVA. Analysis of variance decomposition of the MARS model is given by the following expression (Fridedman, 1991) (Eq. 8): is overall BFs involve only a single variable, ∑ ( , ) =2 is overall BFs represent the contribution from two variables, is overall BFs represent the contribution from three variables, and so on. The best (or final) MARS model is the one with the smallest GCV value and the largest 2 value (Fridedman, 1991;Hastie et al. 2001).

Data description
The data used to develop 1 st and 2 nd order response models and MARS included the number of cars, the number of tramways, and the concentration measurements of three heavy metals that cause traffic pollution: Cd, Zn, Pb. Urban soils collected at locations where there were high density of buildings, roads and tramways in Eskisehir, Turkey. A systematic sampling was adopted to prove a sampling strategy over the entire workspace. Sample points within topsoil layers 0-10 cm were located roads alongside. Portions of the soil samples which were hold approximately 25gr were grounded in a mechanical agate grinder until fine particles (<200µm) were obtained. The coordinates of the sample locations were recorded with a GPS. All soil samples were dried for 3 h at 105 o C (to a constant weight), milled and passed through a nylon sieve (0.5 mm). 0.5g samples were weighed and transferred into reaction vessels. The descriptive statistics of the heavy metals were presented in Table 1.

Method description
In this study, the software SAS (version 9.0 for Windows) was applied to build the response surface models. A response surface statistical experimental design was used to optimize the concentration of Cd, Zn, and Pb separately. This design is based on a 3 2 factorial design, five replicates of the experiment, leading to 45 observations at nine different sample stations investigated. To properly represent the heavy metal concentration, 1 st and 2 nd order response models were investigated.
Thus six different models were developed in this part of the study. There were two inputs to the response models: the number of cars and the number of tramways. The levels of each input were chosen on the basis of the minimum and the maximum number of vehicles passing at chosen stations. Each level of the two factors was run in all combinations for Cd, Zn, and Pb. The levels of two factors coded as (-1), (0), and (+1) due to a computational ease are listed in Table 2. In the second part of the study, the same input variables in Table 2 were used to develop MARS models. As one MARS model has only one response variable, the total number of MARS models in this study is three for the three outputs: Cd, Zn, and Pb. The MARS software, version 2.0 was used in the analysis (Salford Systems, 2010).

Motivation
ANOVA decomposition of the MARS model captures the main idea of this study. The input variables 1 and 2 are considered as the main factors (variables) whereas 1 2 and 2 2 represent the squared effects in the model. Thus, a second order MARS model has been used so that the BFs of the final MARS model consist both linear and secondorder splines. Besides the interaction terms were also included in the model. The maximum number of interactions was set to 2. Higher order of interaction was not allowed as the number of experimental points was not enough in this study. Hence the interactions were controlled before and only interactions of ( 1 1 ), ( 1 2 2 ), ( 2 1 2 ) and ( 1 2 2 2 ) were taken into account. The same data collected from road were used to develop the MARS models. As there are three outputs of interest, three MARS models were simulated. The coded values of factors were used to provide orthogonality. Each model was assumed to describe the effect of the factors over the interest region (linear coding: -1, 0,+1, quadratic coding: +1,-2,+1).

The analysis of variance
The quality of the fitted models was checked by F-test at 0.01 significance level. To make detailed information regarding to the structure of the variation in main and interaction effects, those were divided into linear and quadratic terms. The statistical insignificance of linear, quadratic and interaction terms were determined by using p-value>0.01. Thus a p-value with '*, **, ***' codes indicates the terms significant at the corresponding level. The results in Table 3 are satisfactory for a good prediction for the experiments carried out.
In Table 3, the results suggested that the 2 nd order model was considered to approximate the surface curvature's nature better than 1 st order response model with respect to RMSE for three outputs.
The results from the canonical analysis of the 2 nd order response model for each output were listed in Table 4.  The obtained results suggested that the response surface predictions were in good agreement with the experimental results. The exception in here was when Cd was the output, the inputs were not satisfying to explain the response surface, but the predictive value was still in the interval of guideline. Thus, the experimental designs were reliable and effective in determining the optimum conditions. Fig. 1 indicates the three dimensional surfaces of the response for Cd, Zn, and Pb contamination. In Fig.  1, the maximum predicted Cd contamination was located at the corresponding levels of ( 1 , 2 ) = (±1, ±1). The minimum of the response was located at the levels of ( 1 , 2 ) = (0, 0). The minimum predicted value of Zn contamination was located at the corresponding levels of ( 1 , 2 ) = (−1, +1) and the maximum value of the response was located at the levels of ( 1 , 2 ) = (0, 0). The response surface showed to be a mount shaped. The maximum predicted Pb is located at the point ( 1 , 2 ) = (+1, 0). The plot shows that there is a saddle point in the surface.

MARS modeling results
The evaluated final MARS model and its estimated coefficients were given for each heavy metal below. The models were the linear combinations of five, six, and three BFs, respectively which used two original variables (number of tramways, number of cars). In Cd concentration MARS model, five BFs were found to be statistically important. The effects 1 and 2 appeared both individually and interactively (i.e. their product with each other). For instance, the BFs max (0, 2 -0) and max (0, 1 +1) represented the single effects on Cd concentration. The single effects whose impact was positive (with coefficients of 0.529, 0.753, and 0.299) on the response were sensitive to the knot values 0 and (-1). The two BFs, max (0, 1 +1)×max (0,0-2 ) and max(0, 1 +1)×max(0, 2 -0) denoted that the impact of 1 was emerged through the interaction with 2 . It was also worth to mention that the impact of the contribution from the two effects was negative (with coefficients of -0.541 and -0.301). Hence it can be said that the effect 1 played a role in the Cd concentration when it was larger than the knot (-1) while 2 <0 and 2 >0.
Having the similar interpretation for Zn and Pb concentration MARS models, it appeared the single effects and the contributions from two effects were significant. Hence the models were composed of six and three BFs, respectively. To further access the capability of MARS and RSM, some of the important statistics were given in Table 5. MARS uses 2 and GCV criteria to assess the goodness of a fit. For the three concentration models, the corresponding 2 obtained by MARS are 82%, 99%, 84%, respectively. By considering 45 observations, overall 2 scores indicate good fit than the scores 2 of 26%, 83%, and 71% obtained by RSM models, respectively. MARS also computed GCV scores of the corresponding models as 0.051, 41.573, and 211.805, respectively. Table 5 compares also the accuracy of MARS and RMS models in estimating the concentration of Cd, Zn, and Pb. The MARS models indicate better than RSM models in terms of RMS accuracy. It was apparent that RMSE scores were decreased by MARS models.
MARS allows displaying the predicted response as a function of the others (Salford Systems, 2010).
In Fig. 2, the interactions of the input variables were presented for each concentration model. It is clear that given the 1 and 2 levels of (-1), the largest contribution to Cd concentration have been obtained. That is, the number of tramways and the number of cars both affected the contamination in soil in terms of Cd concentration although there were caused less traffic at the time. Secondly, it can be concluded that the 1 of level (0) and the 2 levels of (-1, 0) increased the contamination of the Zn concentration in soil. In other words, being moderate of the number of tramways and being increased from medium to high of the number of cars caused the largest contribution to Zn concentration. Finally, the largest contribution to Pb contamination was obtained when the number of tramways at highest level (+1) whereas the number of cars was at medium level (0). This result might be considered that the reason of the contamination in soil in terms of Pb was mostly from the reason of the number of tramways.

Relative importance
It can be seen from Table 6 that MARS possesses more information regarding to the important variables which RSM could not easily produce.
MARS explicitly indicates the important variables and ignores the unneeded ones. The most important variables are the ones that have the largest impact on the goodness-of-fit or GCV score (Steinberg and Colla, 1999). The relative importance of the variables for each concentration model was summarized in the Table 6. As can be seen, both variables car and tramway have great contribution to Cd, Zn, and Pb MARS models.

Conclusions
This paper used a nonparametric method MARS and RSM based on a polynomial experimental design to approximate the concentration of some heavy metals in soil. Our objective was to develop response surface models to characterize the concentration of Cd, Zn, and Pb in soil and compare them with MARS. This method was used to explore the relationship by splitting the variables over its region and transform the original ones to new variables. The MARS algorithm was run to illustrate the concentration models combined with a group of spline base inputs.
The RSM based on a 3 2 factorial design was used first and then MARS was applied to the same datasets. The results of the analysis of variance on metal contaminations were given in tables for Cd, Zn, and Pb, respectively. The results of MARS showed that it extracted more information as RSM stood improper when there were a few numbers of runs in the experiment. MARS proposed extra information using the distribution of the design points. The MARS contamination models showed a moderate improvement in goodness of fit than the second order response surface models in terms of RMSE and the adjusted 2 .
It was also noticeable that although the MARS models evaluated seemed more complex than the second order response surface models, the number of the covariates was still reasonable as compared to the number of observations. The successful applications of MARS in these three problems indicate that MARS is computationally efficient and easy to interpret. It can also estimate the contributions of the input variables and enable the scientists have an insight and understanding of the significant variables occur in the data.

Discussion
Our conclusions provide a better understanding of the response surfaces that are obtained by MARS. Based on the early studies of Yazici (2009a, 2009b) on modeling the response surfaces by MARS, our approximation is still adaptable to the first order and second order response surface models so that the conclusion has not been only referenced to one very specialized data set given in the frame of this paper but others as well. Since in low dimensions, the set of the design points were not proper for modeling, the weakness of RSM can be removed by using MARS.