Use of nonparametric regression in mode B type measurement model: A simulation study approach

Article history: Received 18 August 2019; received in revised form 11 November 2019; accepted 12 November 2019.

In conventional PLS-path modeling, the relationships among latent variables (LVs) are estimated by fitting simple or multiple linear regressions. This requires assuming that the endogenous LV is a linear function of the exogenous LVs, an assumption that is rarely met in real data analysis. Nonlinear model fitting can relax the linearity assumption, but it requires a specific functional form, such as a quadratic, cubic, or higher-degree polynomial, to be chosen in advance. Hence, when the linearity assumption is violated, the only appropriate choice is to use nonparametric regression approaches. This study focuses on estimating the latent variable model by incorporating three nonparametric smoothing procedures: the kernel regression estimate, the local polynomial estimate, and the smoothing spline estimate. An algorithm for LV models based on nonparametric regression is proposed for the mode B type measurement model (i.e., the formative model). The simulation studies indicate that the nonparametric LV modeling approaches perform better than the standard PLS-path modeling procedure for large samples (i.e., sample sizes of 100 and above). For small samples (fewer than 100 observations), however, the standard PLS-path modeling procedure gives better results.


Introduction
The popularity of LV models is increasing day by day, not only in the social and behavioral sciences but also in economics, medicine, and the management sciences, where they are used to study the relationships among LVs and their associated manifest variables (MVs). In conventional PLS-path modeling, the relationships among LVs are estimated by multiple linear regression. For this purpose, the researcher has to assume that the endogenous LV, denoted by $\eta$, is a linear function of the exogenous LVs, denoted by $\xi_1, \xi_2, \ldots, \xi_k$.
These models have the advantage that the coefficients are easy to interpret in terms of each predictor's contribution, but they are quite restrictive. In particular, the two assumptions of linearity and additivity can make them impractical. Nonlinear model fitting can relax the linearity assumption, but practitioners must then specify a functional form in advance, such as a quadratic, cubic, or higher-degree polynomial. Hence, the only choice in the case of nonlinearity is to use nonparametric regression approaches.
The estimation of a regression function by nonparametric methods has been studied for a long time. The most popular estimates of a nonparametric regression function are the kernel regression estimate, the local polynomial regression estimate, and the smoothing spline estimate. According to Kelava et al. (2017), the use of nonparametric regression in the context of LVs is a newly emerging research area. They estimated the LV model without specifying the underlying distributions, using a two-step procedure: in the first step, the measurement model is estimated with a common factor model; in the second step, nonparametric regression with smoothing splines is used to analyze the relationships among LVs. They did not study other nonparametric estimates, such as the local polynomial or kernel regression estimates. Hence, there is room for research on using these other estimation procedures to fit a latent variable model. This study focuses on the estimation of the PLS-path model (with the mode B type measurement model) by incorporating the above-mentioned nonparametric smoothing procedures. Before the proposed procedure is presented, PLS-PM is briefly reviewed in the next section, followed by a review of nonparametric regression techniques. The final sections of this article present the results of the simulation studies and an application to real-world data.

PLS-path modeling
PLS-Path Modeling (PLS-PM) is a statistical modeling approach in which several blocks of variables are linked together to measure linear dependence relationships among them. The history of PLS-Path Modeling starts with the advent of NILES (Non-linear Iterative Least Squares) (Wold, 1966). It was later renamed NIPALS (Nonlinear Iterative PArtial Least Squares) by Wold (1973) and subsequently extended to PLS-Path Modeling (Wold, 1982).
PLS-PM comprises two parts: the first is called "the inner model (or structural model)" and the second is known as "the outer model (or the measurement model)" (Lohmöller, 1989). The "inner model" specifies the relationships among latent variables, while the "outer model" specifies the relationships between latent variables and their associated MVs. A simple PLS-PM is depicted in Fig. 1.

Fig. 1: PLS-path model
The PLS-PM algorithm suggested by Lohmöller (1989) comprises the following steps:
Step I: Initialization: To initialize the algorithm, arbitrary numbers are chosen as weights to approximate the LV scores $\hat{\xi}$ or $\hat{\eta}$ as linear combinations of the associated MVs. In other words, each LV is constructed as a weighted sum of its associated MVs, and the weights are generally all set equal to one (Monecke and Leisch, 2012). In the second and subsequent iterations, the outer weights obtained in the previous iteration are used instead.
Step II: Inner approximation: In this step, each LV is estimated as a weighted sum of the other LVs linked to it. The weights depend on which of three weighting schemes is used: (i) the centroid weighting scheme (Wold, 1982), which uses the sign of the correlations between LVs (i.e., -1 or +1); (ii) the factor weighting scheme (Lohmöller, 1989), which uses the correlation values instead of their signs; (iii) the path weighting scheme (Lohmöller, 1989), also known as the structural scheme, in which regression coefficients are taken as weights instead of correlation coefficients.
Step III: Outer approximation: In step I, all the weights were taken as "one" or any arbitrary number, but in this step, these weights are recalculated on the basis of estimated values of LVs obtained in step II according to the type of measurement model.
Step IV: Estimation of LV scores: The outer weights computed in step III are now used to estimate the LV scores by taking the weighted sum of their associated MVs.
Step V: Repeating the steps until convergence occurs: The inner and outer approximations are repeated (i.e., the loop of Steps II to IV) until the relative change in all the outer weights between two consecutive iterations becomes smaller than a prefixed threshold or tolerance value (usually taken as $10^{-5}$).
Step VI: Computing the path coefficients, loading coefficients, and total effects: Once the LV scores are finalized (after convergence of outer weights values), the path coefficients can be estimated by fitting multiple linear regression for each endogenous LV involved in the inner model.
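The six steps above can be sketched in code for the smallest possible case. The following is a minimal illustration only, not the implementation used in this study: it assumes one exogenous and one endogenous block, Mode B outer estimation, and the centroid scheme, and it checks convergence on the absolute (rather than relative) change in weights for simplicity. All function names are hypothetical.

```python
import math

def standardize(v):
    """Center a list and scale it to unit (population) variance."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / sd for a in v]

def corr(a, b):
    """Pearson correlation of two already standardized lists."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def mode_b_weights(block, z):
    """Mode B: regress the inner proxy z on all MVs of the block."""
    p = len(block)
    xtx = [[sum(a * b for a, b in zip(block[i], block[j])) for j in range(p)]
           for i in range(p)]
    xtz = [sum(a * b for a, b in zip(block[i], z)) for i in range(p)]
    return solve(xtx, xtz)

def lv_scores(block, w):
    """LV scores: standardized weighted sum of the block's MVs."""
    n = len(block[0])
    return standardize([sum(w[j] * block[j][i] for j in range(len(w)))
                        for i in range(n)])

def pls_two_blocks(x_block, y_block, tol=1e-5, max_iter=100):
    """Each block is a list of MV columns; returns (path coefficient, wx, wy)."""
    x_block = [standardize(c) for c in x_block]
    y_block = [standardize(c) for c in y_block]
    wx, wy = [1.0] * len(x_block), [1.0] * len(y_block)    # Step I: unit weights
    sx, sy = lv_scores(x_block, wx), lv_scores(y_block, wy)
    for _ in range(max_iter):
        sign = 1.0 if corr(sx, sy) >= 0 else -1.0          # Step II: centroid scheme
        zx = [sign * v for v in sy]
        zy = [sign * v for v in sx]
        new_wx = mode_b_weights(x_block, zx)               # Step III: outer weights
        new_wy = mode_b_weights(y_block, zy)
        sx = lv_scores(x_block, new_wx)                    # Step IV: LV scores
        sy = lv_scores(y_block, new_wy)
        change = max(abs(a - b) for a, b in zip(wx + wy, new_wx + new_wy))
        wx, wy = new_wx, new_wy
        if change < tol:                                   # Step V: convergence
            break
    # Step VI: with one standardized predictor, the OLS slope is the correlation.
    return corr(sx, sy), wx, wy
```

With a single predictor the final regression reduces to a correlation; a model with several exogenous LVs would instead need a multiple regression in Step VI.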

Nonparametric regression
A major drawback of the classical parametric approach is that the observed data may fail to follow a specific parametric model, and an incorrect modeling assumption may lead to seriously flawed statistical conclusions. The idea of nonparametric regression is to use models of the form:
$$Y = m(X) + \varepsilon,$$
where $m(\cdot)$ belongs to some class of regression functions, and $\varepsilon$ is an independent and identically distributed random variable with zero mean and unit variance. Nonparametric regression does not impose any functional form and estimates the relationship by a smooth curve. The most commonly used nonparametric regression techniques are the kernel regression estimate, the local polynomial regression estimate, and the smoothing spline estimate. Detailed discussions of these estimation methods are available in several excellent books, such as Härdle (1990), Wand and Jones (1995), Fan and Gijbels (1996), Györfi et al. (2002), and Härdle et al. (2004). A brief review of each is presented here.

Kernel regression estimate
Consider the simple case of one predictor and one response variable, and let the neighborhood of $X_0$ be the interval $X_0 \pm h$, where "h" is called the bandwidth and is always a positive real number. Then the nonparametric estimator of m(X) at $X_0$ is given by:
$$\hat{m}(X_0) = \frac{\sum_{i=1}^{n} K\left(\frac{X_i - X_0}{h}\right) Y_i}{\sum_{i=1}^{n} K\left(\frac{X_i - X_0}{h}\right)},$$
which is known as the "local constant" or "Nadaraya-Watson" (NW) estimator. The smoothing parameter "h" (the bandwidth) controls the degree of smoothness, and "K(.)" is the kernel function.
Various kernel functions are available in the literature, and the choice among them generally has little effect on the estimated regression function or density. For example, the uniform kernel function may be expressed as:
$$K(u) = \tfrac{1}{2}\, I(|u| \le 1).$$
The value of "h" is usually chosen by trial and error or by cross-validation. The level of smoothness depends on "h": the smaller the value of "h", the wigglier (wavier) the curve, while a larger value of "h" produces a smoother curve.
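As a small illustration of the local constant estimator, the sketch below averages the responses whose predictor values fall within the bandwidth of the target point, using a uniform kernel. The function names are hypothetical, and the error raised for an empty neighborhood is a design choice of this sketch rather than part of the estimator.

```python
def uniform_kernel(u):
    """Uniform kernel: weight 1/2 on [-1, 1], zero elsewhere."""
    return 0.5 if abs(u) <= 1 else 0.0

def nadaraya_watson(x0, xs, ys, h, kernel=uniform_kernel):
    """Kernel-weighted average of ys around x0 with bandwidth h > 0."""
    weights = [kernel((x - x0) / h) for x in xs]
    total = sum(weights)
    if total == 0:
        raise ValueError("no observations within the bandwidth of x0")
    return sum(w * y for w, y in zip(weights, ys)) / total

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
narrow = nadaraya_watson(2, xs, ys, h=1)    # averages the points at x = 1, 2, 3
wide = nadaraya_watson(2, xs, ys, h=10)     # averages every point
```

With the small bandwidth the estimate follows the local level of the data, whereas the large bandwidth returns the overall mean of the responses, which is the smoothest possible fit.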

Local linear estimate
The NW estimator is a local constant approximation, where the local constant is obtained by averaging the Y values of all observations whose X values lie in the interval $X_0 \pm h$. If, instead, a linear regression line is fitted locally (i.e., through the points lying in the same neighborhood), the result is a nonparametric technique known as the Local Linear (LL) estimator. It is worth mentioning that as the smoothing is increased, i.e., as "h" approaches infinity, the LL estimator coincides with the parametric OLS line, which is appropriate only when the relationship is in fact linear.
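The local linear idea can be sketched as a kernel-weighted least squares fit of a straight line centered at the target point, whose intercept is the estimate. This is a minimal sketch with a uniform kernel and hypothetical function names; it also shows numerically that a very large bandwidth reproduces the global OLS prediction.

```python
def uniform_kernel(u):
    """Uniform kernel: weight 1/2 on [-1, 1], zero elsewhere."""
    return 0.5 if abs(u) <= 1 else 0.0

def local_linear(x0, xs, ys, h, kernel=uniform_kernel):
    """Fit y = b0 + b1 * (x - x0) by kernel-weighted least squares; return b0."""
    w = [kernel((x - x0) / h) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1
    if det == 0:
        raise ValueError("need at least two distinct x values within the bandwidth")
    # Solve the 2x2 weighted normal equations for the intercept b0.
    return (s2 * t0 - s1 * t1) / det

xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]
# With a bandwidth covering all points, the fit equals the OLS line
# y = 0.6 + 0.8 x evaluated at x0 = 2.
global_like = local_linear(2, xs, ys, h=100)
```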

Local polynomial estimate
To further improve the estimation, a local quadratic, cubic, or higher-order polynomial can be fitted instead of a local linear regression line. If "p" denotes the order of the local polynomial, then the local polynomial with p=0 is equivalent to the NW estimator, while p=1 and p=2 give the Local Linear (LL) and local quadratic estimators, respectively.
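For reference, the local polynomial estimate of order p at a point $X_0$ can be written as the solution of a kernel-weighted least squares problem. The notation below ($\beta_j$ for the local coefficients, $K$ for the kernel, $h$ for the bandwidth) is the standard textbook form and is an assumption of this note rather than a formula taken from the study.

```latex
\min_{\beta_0,\ldots,\beta_p}\ \sum_{i=1}^{n}
  \Bigl[\,Y_i-\sum_{j=0}^{p}\beta_j\,(X_i-X_0)^{j}\Bigr]^{2}
  K\!\left(\frac{X_i-X_0}{h}\right),
\qquad \hat{m}(X_0)=\hat{\beta}_0 .
```

Setting p=0 leaves only $\beta_0$, and the minimizer is the kernel-weighted average of the $Y_i$, i.e., the NW estimator; p=1 gives the local linear fit.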

Splines smoothing regression
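A spline is defined as a piecewise polynomial whose pieces are connected at a sequence of knots $\kappa_1 < \kappa_2 < \ldots < \kappa_K$ such that the pieces join smoothly at these knots. The spline may be linear or of any degree. A spline of degree "d" is generally expressed as:
$$s(x) = \sum_{j=0}^{d} \beta_j x^j + \sum_{k=1}^{K} \gamma_k (x - \kappa_k)_+^{d},$$
which is a truncated power series, and where $(x - \kappa_k)_+ = \max(0,\, x - \kappa_k)$. Hence, if d=1, the linear spline takes the form:
$$s(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} \gamma_k (x - \kappa_k)_+.$$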
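To make the piecewise structure concrete, the sketch below evaluates a spline written in the truncated power basis. The coefficient names (`beta` for the polynomial part, `gamma` for the knot terms) are assumptions of this sketch, and the coefficients are supplied by hand here; in practice they would be estimated from data.

```python
def truncated_power_spline(x, beta, gamma, knots, degree=1):
    """Evaluate a degree-d spline (d >= 1) in the truncated power basis.

    beta  : degree + 1 polynomial coefficients (constant term first)
    gamma : one coefficient per knot
    knots : increasing knot locations
    """
    if len(beta) != degree + 1 or len(gamma) != len(knots):
        raise ValueError("coefficient lengths do not match degree/knots")
    poly = sum(b * x ** j for j, b in enumerate(beta))
    trunc = sum(g * max(0.0, x - k) ** degree for g, k in zip(gamma, knots))
    return poly + trunc

# Linear spline with one knot at x = 2: slope 2 to the left, slope -1 to the right.
left = truncated_power_spline(1.0, beta=[1.0, 2.0], gamma=[-3.0], knots=[2.0])
at_knot = truncated_power_spline(2.0, beta=[1.0, 2.0], gamma=[-3.0], knots=[2.0])
right = truncated_power_spline(3.0, beta=[1.0, 2.0], gamma=[-3.0], knots=[2.0])
```

The two linear pieces meet at the knot, so the function is continuous there while its slope changes by the knot coefficient.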

The proposed procedure for using nonparametric regression in PLS-Path modeling
The existing PLS-path modeling procedure consists of the six steps described in the PLS-path modeling section above. To fit the LV model using the PLS-path modeling approach, a linear pattern among the LVs is assumed, which may not hold in every situation (as discussed in the Introduction). In this section, a fully nonparametric algorithm for LV models is proposed by modifying the existing PLS-path modeling methodology. The modification is made in two places:
1. A nonparametric weighting scheme based on the LOESS (Sen, 1968) approach is proposed. It is similar to the path weighting scheme (Lohmöller, 1989), except that the median of the slopes of the local linear lines is taken as the weight.
2. After the LV scores are finalized, a nonparametric regression smoother (kernel smoothing, local polynomial regression, or spline smoothing regression) is used to estimate the relationships among LVs instead of fitting a simple or multiple linear regression.
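The text does not give a formula for the proposed weight, so the following is only one possible reading of "the median of slopes for local linear lines": fit an ordinary least squares line within a window around each observation and take the median of the resulting slopes. The window rule, the bandwidth `h`, and the function names are all assumptions of this sketch.

```python
import statistics

def local_slope(x0, xs, ys, h):
    """OLS slope using only the points with |x - x0| <= h (None if undefined)."""
    pts = [(x, y) for x, y in zip(xs, ys) if abs(x - x0) <= h]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:
        return None  # fewer than two distinct x values in the window
    return sum((x - mx) * (y - my) for x, y in pts) / sxx

def median_local_slope(xs, ys, h):
    """Median of the local slopes computed at every observed x (assumed weight)."""
    slopes = [s for s in (local_slope(x0, xs, ys, h) for x0 in xs) if s is not None]
    return statistics.median(slopes)

xs = [0, 1, 2, 3, 4, 5]
ys = [2 * x + 1 for x in xs]
weight = median_local_slope(xs, ys, h=1.5)  # data are exactly linear with slope 2
```

For exactly linear data every local slope equals the global slope, so the weight coincides with the path weighting scheme; the two differ only when the relationship bends.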

Simulation study
In the literature, Monte Carlo simulation is used extensively to assess the empirical performance of statistical procedures under given conditions, such as model size and sample size. In the LV modeling literature, most studies are designed following the guidelines of Paxton et al. (2001). In this section, three simulation studies are designed, in line with those guidelines, to investigate the performance of the proposed nonparametric LV modeling algorithm for a formative model (Mode B). The proposed algorithm is coded in the R programming language, with some modifications to the "plspm" package, version 0.4.9 (Sanchez et al., 2015).
The three simulation studies range from simple to complex models and cover small to large sample sizes (seven sample sizes: 20, 30, 50, 100, 200, 300, and 500). The number of replications is fixed at 500. The following models are fitted to each data set: conventional PLS-path modeling, and the proposed NP-based LV modeling with three smoothers, i.e., kernel smoothing, local polynomial smoothing (degree=0, degree=1, and degree=2), and spline smoothing. The convergence tolerance is fixed at 0.00001. The performance of a model is judged by how close the predicted values are to the observed values. Two criteria, MAE (Mean Absolute Error) and RMSE (Root Mean Square Error), are used to compare the performance of nonparametric-based path modeling with that of the existing PLS-PM approach.
The predicted values of the LV scores are obtained by 10-fold cross-validation for each sample size at each replication. The simulation results (MAE and RMSE values) are presented in tabular form in the discussion of simulation results.
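The two criteria follow their usual definitions, which the short sketch below spells out; the function names are illustrative only.

```python
import math

def mae(observed, predicted):
    """Mean Absolute Error: average of |observed - predicted|."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 6.0]
example_mae = mae(observed, predicted)
example_rmse = rmse(observed, predicted)
```

Because RMSE squares the differences before averaging, a single large error raises it more than it raises MAE, which is why the two criteria are reported side by side.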
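The cross-validated predictions can be sketched generically as follows. This is an illustration under assumptions, not the code used in the study: the random fold assignment, the `seed` argument, and the `fit_predict` callback (any function that takes training data and a target point and returns a prediction) are all hypothetical.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_predictions(xs, ys, fit_predict, k=10, seed=0):
    """Predict each observation from a model fitted without its own fold."""
    n = len(xs)
    preds = [None] * n
    for fold in k_fold_indices(n, k, seed):
        held_out = set(fold)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        for i in fold:
            preds[i] = fit_predict(train_x, train_y, xs[i])
    return preds

def mean_predictor(train_x, train_y, x0):
    """Trivial baseline: predict the training mean regardless of x0."""
    return sum(train_y) / len(train_y)

xs = [float(i) for i in range(20)]
ys = [5.0] * 20
preds = cross_validated_predictions(xs, ys, mean_predictor, k=10)
```

Any smoother can be passed in place of the baseline predictor, and the resulting predictions can then be compared with the observed values through MAE and RMSE.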

Simulation study 1
The simplest model is considered in the first simulation study: one endogenous LV and one exogenous LV, each with two associated MVs. The path coefficient and loading values are fixed at 0.7, as used by many researchers for assessing the performance of PLS-path modeling. The specified model is depicted in Fig. 2. Using this specification, 3500 datasets are generated across seven sample sizes (20, 30, 50, 100, 200, 300, and 500), i.e., 500 replications for each sample size give 500 × 7 = 3500 datasets. The unit value (i.e., 1) is used as the initial approximation for the weights. Further, different skewness values (-3, -4) and kurtosis values (5, 6) are applied to generate non-normal data for each MV. The path weighting scheme is applied in conventional PLS-path modeling, while the LOESS approach is used for the nonparametric-based LV modeling approaches. The MAE and RMSE results for each standardized parameter estimate are presented in Table 1.

Simulation study 2
A more complex model than that of simulation study 1 is considered here: one endogenous LV and two exogenous LVs, each with three associated MVs. The path coefficient and loading values are fixed at 0.6, as used by many researchers for assessing the performance of PLS-path modeling (Paxton et al., 2001). The specified model is depicted in Fig. 3.
Using these specifications, 3500 datasets are generated across seven sample sizes (20, 30, 50, 100, 200, 300, and 500), i.e., 500 replications for each sample size give 500 × 7 = 3500 datasets. The unit value (i.e., 1) is used as the initial approximation for the weights. Further, different skewness values (-3, -4, -5) and kurtosis values (5, 6, 7) are applied to generate non-normal data for each associated MV. The path weighting scheme is applied in conventional PLS-path modeling, while the LOESS approach is used for the nonparametric-based LV modeling approaches. The MAE and RMSE results for each standardized parameter estimate are presented in Table 2.

Simulation study 3
A still more complex model is considered in this simulation study: two endogenous LVs and three exogenous LVs, each with three associated MVs. To add complexity, the loading coefficients and structural path coefficients are not held at a single value but are varied, so that the model is more representative of real-world models. The loading coefficients are taken as 0.7, 0.6 and 0.5, while the structural path coefficients are fixed at 0.5 and 0.6 for both endogenous LVs. The specified model is depicted in Fig. 4.
Using these specifications, 3500 datasets are generated across seven sample sizes (20, 30, 50, 100, 200, 300, and 500), i.e., 500 replications for each sample size give 500 × 7 = 3500 datasets. The unit value (i.e., 1) is used as the initial approximation for the weights. Further, different skewness values (-3, -4, -5) and kurtosis values (5, 6, 7) are applied to generate non-normal data for each associated MV. The path weighting scheme is applied in conventional PLS-path modeling, while the LOESS approach is used for the nonparametric LV modeling approaches. Because the model involves two endogenous LVs, the prediction performance for each of them is tabulated in Table 3 and Table 4, while the overall prediction performance in terms of MAE and RMSE is presented in Table 5.

Discussion of simulation results
The results for the simplest model, involving two LVs (one endogenous and one exogenous), are presented in Table 1 and show that MAE and RMSE decrease as the sample size increases for all approaches. Comparing the results row-wise, at sample sizes of 20 and 30 the conventional PLS-PM approach gives the best prediction performance (MAE= 0.7750, 0.7580 and RMSE= 1.0030, 0.9997), i.e., the smallest values compared with the kernel, local polynomial, and spline-based approaches. As the sample size increases, however, the local polynomial at degree=0 (i.e., the local constant approach) and the spline smoother give better results. At sample sizes of 100 and above, the spline smoother gives more stable and better results than the other approaches. This finding applies only to this case, in which the model consists of two LVs with a total of four indicators.
The results in Table 2, for a model involving three LVs (one endogenous and two exogenous), also show that MAE and RMSE decrease as the sample size increases for all approaches. Comparing the results row-wise, at sample sizes of 20, 30, and 50 the conventional PLS-PM approach gives the best prediction performance (MAE= 0.6666, 0.66657, 0.6638 and RMSE= 0.8985, 0.8947, 0.8930), i.e., the smallest values compared with the kernel, local polynomial, and spline-based approaches. As the sample size increases, the local polynomial at degree=0 (i.e., the local constant approach) and the spline smoother give better results. At sample sizes of 100 and above, the spline smoother gives more stable and better results than the other approaches.
The results in Tables 3, 4, and 5, for a complex model involving five LVs (two endogenous and three exogenous), likewise show that MAE and RMSE decrease as the sample size increases for all approaches. Comparing the results row-wise, at sample sizes up to 100 the conventional PLS-PM approach gives the best prediction performance (MAE= 0.7332, 0.7295, 0.7204, 0.7172 and RMSE= 0.9545, 0.9528, 0.9499, 0.9466), i.e., the smallest values compared with the kernel, local polynomial, and spline-based approaches. As the sample size increases, the local polynomial at degree=0 (i.e., the local constant approach) and the spline smoother give better results. At sample sizes of 100 and above, the spline smoother gives more stable and better results than the other approaches. Hence, across all these simulation results, spline smoothing performs best for large samples.

Application of proposed procedure on real data set: Offense model
In this section, the proposed nonparametric-based path modeling is applied to a real data set, "Offense". The data set contains the offense statistics of the American National Football League (NFL) for the 2010-11 season. It is freely available in the "plspm" package in R, or it can be downloaded from www.teamrankings.com. The data set contains 32 observations on 17 manifest variables. These 17 MVs are associated with five latent variables (Special, Rushing, Passing, Scoring, and Offense). For further details on each MV, see Sanchez and Trinchera (2012). The full structural and measurement model for the offense model is sketched in Fig. 5.
Three exogenous LVs are involved in this model (Special, Rushing, and Passing), while the Scoring and Offense LVs depend on one or more other LVs. Suppose this model is fitted by the PLS-path modeling technique and the factor scores are computed. To study the pattern of the relationships among these LVs, scatterplots of each endogenous LV against each exogenous LV are depicted in Fig. 6.
These scatterplots show that one of the plots, Scoring vs. Special, does not exhibit a linear pattern between the endogenous and exogenous LVs. This clearly indicates a violation of the linearity assumption. Hence, the only choice in the case of nonlinearity is to use nonparametric regression approaches. The proposed procedure is applied to the "offense" data set, and the factor scores for Scoring are predicted using the PLS-PM, local polynomial (degree=0), and spline approaches. The performance of the NP-LV models is assessed via MAE and RMSE, computed through leave-one-out cross-validation. The MAE and RMSE values, as well as the predicted factor scores for the first twenty observations, are tabulated in Table 6.
The predicted factor scores in Table 6, obtained with the conventional PLS-PM approach and the two nonparametric approaches (local polynomial with degree=0 and spline smoothing), show that the methods do not give the same predictions. For example, the predicted factor scores for the fifth observation are -0.5149, -0.3932 and -0.7007. The reason is that the conventional PLS-PM approach fits simple or multiple linear regression lines globally, whereas the nonparametric approaches fit local lines or curves. Further, the prediction performance measures (MAE and RMSE) indicate that spline smoothing and the local polynomial give better performance in predicting the factor scores of Scoring.
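The contrast between a global line and a local fit under leave-one-out cross-validation can be illustrated on toy data. The sketch below uses a synthetic quadratic relationship, not the "offense" data, and hypothetical function names; the local predictor is a local constant (degree=0) fit with a uniform window of half-width `h`.

```python
def ols_predict(train_x, train_y, x0):
    """Global fit: simple linear regression line evaluated at x0."""
    n = len(train_x)
    mx = sum(train_x) / n
    my = sum(train_y) / n
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sxx
    return my + slope * (x0 - mx)

def local_constant_predict(train_x, train_y, x0, h=1.0):
    """Local fit: mean response of the training points within h of x0.

    Raises ZeroDivisionError if no training point lies within the window.
    """
    near = [y for x, y in zip(train_x, train_y) if abs(x - x0) <= h]
    return sum(near) / len(near)

def loocv_mae(xs, ys, predict):
    """Leave-one-out CV: predict each point from all the others, return the MAE."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        errors.append(abs(ys[i] - predict(train_x, train_y, xs[i])))
    return sum(errors) / len(errors)

# Synthetic nonlinear relationship (assumption for illustration): y = x^2.
xs = [float(v) for v in range(-3, 4)]
ys = [x * x for x in xs]
mae_global = loocv_mae(xs, ys, ols_predict)
mae_local = loocv_mae(xs, ys, local_constant_predict)
```

On this curved toy relationship the global line cannot follow the bend, so its leave-one-out error is larger than that of the local fit; on a truly linear relationship the ordering would typically reverse.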

Conclusion
In this study, an algorithm based on nonparametric regression is proposed for LV path modeling with measurement models of the formative type (Mode B). Three approaches (kernel regression, local polynomial regression, and spline smoothers) are implemented to estimate the relationships among LVs and, finally, to obtain the predicted factor scores of the endogenous LVs. The performance of the proposed procedure is assessed through a variety of simulation designs (simple to complex), with results evaluated by MAE and RMSE. The simulation results give a clear indication that the conventional PLS-PM approach performs well at small sample sizes, while the proposed nonparametric-based procedure performs better for large samples (i.e., sample sizes of 100 and above). The literature also recommends that nonparametric regression be used for large sample sizes. However, when the linearity assumption is violated, the only choice is to use nonparametric regression; otherwise, the predictions will be over- or under-estimated. In future work, the current research can be extended by introducing interaction effects into the model.