Comparison of PLSR and PCR techniques in terms of dimension reduction: an application on internal migration data in Turkey

Introduction
Multiple regression analysis is a technique for assessing the functional relationship between a dependent variable and two or more independent variables. In multiple linear regression models, the method of Least Squares (LS) is widely used to estimate the regression coefficients. The least squares method produces estimators with desirable properties under certain assumptions (Chatterjee and Hadi, 2015). One of these assumptions is that the independent variables are uncorrelated with each other (Gujarati, 2003). A linear or near-linear relationship between the independent variables leads to the multicollinearity problem. The presence of multicollinearity can affect the signs of some regression coefficients in the model. It also results in large variances and covariances for the least squares estimators of the regression coefficients, so that the confidence intervals of the coefficients tend to be wide and the t statistics small (Montgomery et al., 2001). As a result, the coefficient of determination may be high and the model significant even though all or most of the individual regression coefficients are insignificant.
Under these assumptions, the least squares estimators of the regression coefficients are the best linear unbiased estimators. Here "best" means having the lowest variance among all linear unbiased estimators. However, this property no longer holds in the presence of multicollinearity. In this case, the multicollinearity should be removed. Many methods have been proposed to deal with multicollinearity; one of them is to use biased estimation techniques (Rawlings et al., 1998).
One of the biased estimation techniques that eliminates multicollinearity by reducing dimension is Partial Least Squares Regression (PLSR); another is Principal Component Regression (PCR). As these two techniques are applied to problems involving high collinearity, in which the variance tends to dominate the bias, both techniques produce similar results in many cases, and when used appropriately they give better estimates than the LS technique (Fekedulegn et al., 2002; Frank and Friedman, 1993; Ziegel, 2004). In addition, these two techniques can be applied not only to data in which the number of observations is higher than the number of independent variables but also to data in which the number of observations is lower (Helland, 1988; Vigneau et al., 1997).
In the literature, there are several studies comparing PLSR with PCR in terms of different criteria. Naes and Martens (1985) compared these two techniques in terms of Mean Squared Error (MSE) and found that PLSR used fewer latent variables than PCR. Luinge et al. (1993) showed that the two techniques gave similar prediction errors on a real data set. Diaz et al. (1997) applied PLSR, PCR and LS to a real data set and used the Root Mean Square Deviation (RMSD) and the square of the correlation coefficient (r²) to evaluate the different techniques. Ni and Gong (1997) compared the LS, PCR and PLSR techniques according to the Relative Error of Prediction (REP) and did not observe significant differences in prediction precision between PLSR and PCR. Guiteras et al. (1998) applied LS, PCR and PLSR to multivariate data and compared the predictions of the models in terms of the Relative Root Mean Squared Difference (RRMSD). To predict the per capita gross domestic product of Turkey, Yeniay and Goktas (2002) used LS, Ridge Regression (RR), PLSR and PCR on a data set gathered from 80 cities in Turkey and showed that PLSR and PCR had the best predictive ability, respectively. Wentzell and Montoto (2003) carried out a simulation study on complex chemical mixtures containing a large number of components and reported no important difference in prediction ability between PLSR and PCR. Li (2010) compared the prediction performances of the PLSR, PCR and RR techniques according to the Mean Square Error of Prediction (MSEP) and obtained similar results. Yaroshchyk et al. (2012) compared PCR, PLSR, Multi-Block Partial Least Squares Regression (MB-PLSR) and Serial Partial Least Squares Regression (S-PLSR); they emphasized that the PLSR and PCR models produced similar prediction accuracy, although the PLSR model used notably fewer latent variables. Khajehsharifi et al. (2014) compared the prediction ability of the PLSR and PCR techniques with respect to the Root Mean Square Error of Prediction (RMSEP) on a real data set and found that PLSR had better predictive ability. Mahesh et al. (2015) compared the protein contents and hardness values predicted by PLSR and PCR models for bulk samples of Canadian wheat, and assessed the prediction performance of the regression models by calculating MSEP, the Standard Error of Cross Validation (SECV) and the correlation coefficient (r).
The purpose of this study is to compare PLSR and PCR techniques with respect to Root Mean Square Error of Cross Validation (RMSECV) criterion in terms of dimension reduction on internal migration data in Turkey.
The study is organized as follows: the PLSR and PCR techniques are briefly explained in Section 2; the RMSECV criterion is described in Section 3; these techniques are applied to the internal migration data of Turkey in Section 4; finally, Section 5 presents the conclusion.

PLSR and PCR techniques
In both the PLSR and PCR techniques, the matrix X of independent variables is first standardized. Next, new orthogonal variables are obtained as linear combinations of these standardized variables. Finally, to estimate the regression coefficients, the LS method is applied to these new variables (Vigneau et al., 1996).
The basic difference between the PLSR and PCR techniques is that PLSR uses information on both the dependent and independent variables, whereas PCR uses only information on the independent variables when the components or latent variables are obtained (Naes and Martens, 1985; Garthwaite, 1994).
The PLSR and PCR techniques are explained in the following subsections.

PLSR technique
The aim of PLSR is to describe the structure between the X and Y blocks and to predict the Y block via the X block (D'Ambra and Sarnacchiaro, 2010). PLSR models the relationship between these two blocks via score vectors. PLSR decomposes the zero-mean matrices X and Y as follows:

X = TP' + E  (1)
Y = UQ' + F  (2)

where T and U are matrices of score vectors (components, latent vectors), P and Q are loading matrices, and E and F are residual matrices (Rosipal and Krämer, 2006). This decomposition is done so as to maximize the covariance between T and U.
While forming the PLSR model, a smaller number of components is used instead of all the independent variables by constructing new variables. These new variables are called scores and are collected in the score matrix T, which is formed as linear combinations of the original X matrix with the weight matrix W*:

T = XW*  (3)

In PLSR, the weights are determined by maximizing the covariance between the latent variables and the dependent variables (Zeng et al., 2007). In addition, the corresponding scores T are good predictors of Y:

Y = TC' + F  (4)

where C is the Y-weight matrix and F is the Y-residual matrix, which shows the deviation between the observed and modelled responses.
Finally, the matrix B of PLS regression coefficients is obtained from the following equation (Wold et al., 2001):

B = W*C'  (5)

The PLSR technique can be used whether there is a single dependent variable or more than one (Garthwaite, 1994). However, when there is a single dependent variable Y and X'X is diagonal, PLSR arrives at the LS solution in one component, and the PLSR and LS regression coefficients are equal (Wold et al., 2001).
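As an illustration of the steps above, the following minimal Python sketch fits a PLSR model and extracts the score, weight and coefficient matrices. It uses scikit-learn's PLSRegression on synthetic placeholder data; the library, the data and the choice of three components are assumptions made for illustration, not the MATLAB PLS_Toolbox workflow used later in the paper.

```python
# Minimal PLSR sketch on synthetic placeholder data (illustrative only).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, p = 81, 13                     # shaped like the study's data set (81 provinces, 13 predictors)
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.2, size=n)

# scale=True standardizes X and y internally, in line with Section 2.
pls = PLSRegression(n_components=3, scale=True)
pls.fit(X, y)

T = pls.x_scores_    # score matrix T (Eq. 3)
W = pls.x_weights_   # weight vectors chosen to maximize covariance with y
B = pls.coef_        # regression coefficients B (Eq. 5), expressed on the scaled variables
print(T.shape, W.shape, B.shape)
```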

PCR technique
PCR is a technique that deals with the multicollinearity problem by removing the unstable structure of the model and by decreasing the variances of the regression coefficients (Massy, 1965).
In the PCR technique, the components are first obtained with Principal Component Analysis (PCA). Then, regression analysis is conducted using the principal component scores as the independent variables (D'Ambra and Sarnacchiaro, 2010).
To demonstrate the PCR analysis, the singular value decomposition (SVD) of X is written as follows:

X = USV'  (6)

where the matrix U represents a linear transformation of X and S is a diagonal matrix whose elements are the singular values. These are linked to the principal component score matrix T by T = US. The regression equation based on the scores can be given as follows (Naes and Mevik, 2001):

y = α₀ + Tγ + ε  (7)

Score vectors corresponding to small eigenvalues can be left out in order to prevent collinearity problems from influencing the solution (Geladi and Kowalski, 1986).
Let the matrix U_A be defined by the columns of U corresponding to the A largest eigenvalues of X'X. PCR is then defined as the regression of y onto U_A:

y = α₀ + U_A α + f  (8)

where f is generally different from the error term ε above. The estimates of the α's in (8) are found by LS. The PCR predictor is obtained as

ŷ = α̂₀ + u_A'α̂  (9)

The value of u_A for a new sample is found by projecting x onto the first A principal components and dividing the score/projection, t, by the square root of the corresponding eigenvalue. Note that for A = p, the PCR predictor becomes identical to the LS predictor. In practice, the best choice of A is usually determined by cross-validation (Naes and Mevik, 2001).
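The following short Python sketch mirrors this procedure: PCA is applied to the standardized X matrix and y is regressed onto the first A scores by LS. Again, scikit-learn, the synthetic data and the choice A = 3 are assumptions made purely for illustration.

```python
# Minimal PCR sketch: PCA on standardized X, then LS regression of y on the first A scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p, A = 81, 13, 3               # A = number of retained components (illustrative choice)
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.2, size=n)

Xs = StandardScaler().fit_transform(X)   # standardize X, as in Section 2
pca = PCA(n_components=A)
T = pca.fit_transform(Xs)                # principal component scores (T = US)
pcr = LinearRegression().fit(T, y)       # LS regression of y onto the scores (Eq. 8)

y_hat = pcr.predict(T)                   # PCR fitted values (Eq. 9)
beta = pca.components_.T @ pcr.coef_     # coefficients mapped back to the standardized X variables
print(y_hat[:3], beta.shape)
```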

RMSECV
In this study, the RMSECV criterion is used to compare the PLSR and PCR techniques in terms of their dimension reduction ability on a real data set. Therefore, certain concepts associated with RMSECV and its calculation are described below.
Cross validation is a generally applicable way to predict the performance of a model. In this study, the Leave-One-Out Cross Validation (LOOCV) technique is used. Leave-One-Out is the most classical cross validation procedure: each data point is successively left out of the sample and used for validation (Arlot and Celisse, 2010). In other words, LOOCV uses a single observation from the whole sample as the validation data and the remaining observations as the training data. This process is repeated until each observation in the entire sample has been used once as the validation data (He et al., 2010).
The Predicted Residual Sum of Squares (PRESS) statistic is calculated using the LOOCV method. The sum of the squared differences between yᵢ and ŷ(i) is called PRESS (Allen, 1974):

PRESS = ∑ᵢ e(i)² = ∑ᵢ (yᵢ − ŷ(i))²  (10)

where ŷ(i) is the fitted value of the ith response based on all the observations except the ith one. PRESS is generally regarded as a measure of how well a regression model will perform in predicting new data, and a model with a small value of PRESS is desired (Montgomery et al., 2001). RMSECV is based on the PRESS statistic and is calculated as a function of the number of principal components or latent variables:

RMSECV = √(PRESS / n)  (11)

When different models are compared, the one with the smallest RMSECV is considered the best predictive model (Bodzioch et al., 2009).
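As a sketch of how the RMSECV in Eqs. (10) and (11) can be computed for both techniques, the Python fragment below runs LOOCV for PLSR and PCR over a range of component counts. scikit-learn, the synthetic data and the range of components are illustrative assumptions; the paper's own computations use the MATLAB PLS_Toolbox.

```python
# Sketch of the RMSECV computation (Eqs. 10-11) via LOOCV for PLSR and PCR
# with varying numbers of components, on synthetic placeholder data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 81, 13
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.2, size=n)

loo = LeaveOneOut()
for A in range(1, 6):
    models = {
        "PLSR": make_pipeline(StandardScaler(), PLSRegression(n_components=A)),
        "PCR": make_pipeline(StandardScaler(), PCA(n_components=A), LinearRegression()),
    }
    for name, model in models.items():
        y_cv = cross_val_predict(model, X, y, cv=loo)   # y_hat(i): prediction with the ith point left out
        press = np.sum((y - np.ravel(y_cv)) ** 2)       # PRESS, Eq. (10)
        rmsecv = np.sqrt(press / n)                     # RMSECV, Eq. (11)
        print(f"{name}, A={A}: RMSECV = {rmsecv:.3f}")
```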

The application of PLSR and PCR techniques on internal migration data in Turkey
In this section, internal migration data of Turkey from 2011, compiled by the Turkish Statistical Institute (TSI), are used to compare the PLSR and PCR techniques in terms of dimension reduction ability (TSI, 2011). The analysis is conducted with the Matlab PLS_Toolbox. The data set contains 81 observations and 13 independent variables. The dependent and independent variables are shown in Table 1.

Table 1: The dependent and independent variables
Y: In-migration
X1: Population
X2: Unemployment ratio (%)
X3: Number of beds in hospitals (per 100,000 persons)
X4: Number of doctors
X5: Number of nurses
X6: Electricity consumption per capita
X7: Percentage of housing with piped water system
X8: Infant mortality rate
X9: Total fertility rate
X10: Number of motor vehicles
X11: Number of students per teacher
X12: Number of students per school
X13: Annual income per capita ($)

The data were standardized after a logarithmic transformation was applied, and then multiple regression analysis was conducted. The regression results revealed that about 92% of the variance in the dependent variable is explained by the independent variables and that the model is significant at the 5% level. The regression coefficients, tolerance values (TV), variance inflation factors (VIF) and t statistics can be seen in Table 2.
As can be seen in the table, the VIF values of the X1, X4, X5 and X10 variables are greater than 10, and hence the TV values of these variables are less than 0.1. Although the model is significant, all the regression coefficients except that of X1 are insignificant. Also, the maximum eigenvalue (λmax) is 13.776, the minimum eigenvalue (λmin) is 0.000029, and the condition number is about 6475034. In addition, the sum of the inverses of the eigenvalues also points to a collinearity problem. Consequently, there is severe multicollinearity in the data. For this reason, the biased estimation techniques PLSR and PCR are applied; the RMSECV values of PLSR and PCR and the variance in the dependent and independent variables explained by these two techniques are examined, and the optimal number of components is determined.
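For readers who want to reproduce this kind of diagnostic screening, the sketch below computes VIF, tolerance, the eigenvalues of the correlation matrix and the condition number (taken here as λmax/λmin, one common convention) on placeholder data; the data, the library and that convention are assumptions for illustration, not the paper's exact computation.

```python
# Sketch of the collinearity diagnostics referred to above: VIF, tolerance,
# eigenvalues of the correlation matrix and the condition number, on placeholder data.
import numpy as np

rng = np.random.default_rng(3)
n, p = 81, 13
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=n)   # induce near-collinearity between two columns

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the remaining
# predictors; equivalently, VIF_j is the jth diagonal element of the inverse correlation matrix.
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
tolerance = 1.0 / vif

eigenvalues = np.linalg.eigvalsh(R)                  # eigenvalues of the correlation matrix
condition_number = eigenvalues.max() / eigenvalues.min()

print("max VIF:", vif.max(), "min tolerance:", tolerance.min())
print("lambda_max:", eigenvalues.max(), "lambda_min:", eigenvalues.min())
print("condition number:", condition_number)
```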