Smoothing parameter selection in semiparametric regression models with censored data

© 2017 The Authors. Published by IASE. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).


Introduction
Consider the semiparametric regression model

y_i = x_i′β + f(z_i) + ε_i,  i = 1, …, n,   (1)

where the y_i's are response observations, x_i = (x_{i1}, …, x_{ip})′ is a p-dimensional covariate vector, the z_i's are observations of an extra univariate variable, β = (β_1, …, β_p)′ is an unknown p-dimensional parameter vector to be estimated, f(·) is an unknown smooth function, and the ε_i's are random error terms with zero mean and variance σ². There are many approaches to estimating β and f in model (1) with uncensored data. One of the primary approaches is the penalized spline method discussed by Eilers and Marx (2010) and Ruppert et al. (2003). Furthermore, Hall and Opsomer (2005) and Liang (2006) gave some theoretical results on penalized splines.
In our study, we are interested in estimating the parameter vector β and the unknown function f in model (1) when the y_i's are observed incompletely and right censored by a random variable c_i, i = 1, …, n, while the x_i and z_i are observed completely. Therefore, the incomplete observations {y_i, x_i, z_i} are transformed into complete observations {t_i, δ_i, x_i, z_i, 1 ≤ i ≤ n} as follows:

t_i = min(y_i, c_i) and δ_i = I(y_i ≤ c_i),  i = 1, …, n,   (2)

where the t_i's are the adjusted response observations with unknown distribution M, the c_i's are the values of the censoring variable, and the δ_i's are the values of the censoring indicator. As can be seen from Eq. 2, the censoring indicator carries the censoring information: if the ith observation is complete, δ_i takes the value 1, otherwise 0. We also assume that Y and C are independent random variables with unknown distributions F and G, respectively.
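As a small illustration of the transformation (2), the following Python sketch generates simulated lifetimes and censoring times and forms (t_i, δ_i); the exponential distributions here are our choice for illustration, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.exponential(scale=2.0, size=n)   # latent responses y_i
c = rng.exponential(scale=4.0, size=n)   # censoring times c_i

t = np.minimum(y, c)                     # observed t_i = min(y_i, c_i)
delta = (y <= c).astype(int)             # censoring indicator δ_i = I(y_i <= c_i)
```

Only (t, delta) would be available to the analyst; y itself is never fully observed.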
In this paper, we focus on model (1) when the response variable is incompletely observed (censored data). In the literature, for censored linear and nonlinear regression models, appropriate estimators are defined by replacing the incomplete observations with synthetic data (Buckley and James, 1979; Koul et al., 1981; Lai and Ying, 1992). For simplicity, we consider the transformed versions of the censored observations, called synthetic data, proposed by Koul et al. (1981). We apply a penalized regression spline estimator to the synthetic data in order to determine a proper smoothing parameter selection criterion. This estimator is a generalization of the well-known penalized spline estimator for model (1).
The main difference in our study is that we consider a randomly right-censored semiparametric regression model estimated under several smoothing parameter selection criteria. The basic idea is to find a selection criterion that provides a good approximation to model (1); we also aim to compare the performances of these different selection criteria.
In the case of censored data, a number of authors have studied the semiparametric regression model (1); examples include Chen and Khan (2001), Qin and Jing (2000), Wang and Li (2002), Orbe et al. (2003), Lu and Cheng (2007), and Zhou and Liang (2009). Note also that an estimation procedure for right-censored response observations was proposed by Kaplan and Meier (1958); Miller (1976) then proposed Kaplan-Meier (K-M) weights, based on the K-M estimator, for linear regression estimation. Stute (1999) studied the nonlinear censored regression model with K-M weights and inspected its theoretical and asymptotic properties. In addition, Koul et al. (1981) suggested a synthetic data transformation using the K-M estimator as an alternative to K-M weights.
The rest of the paper is organized as follows. Section 2 gives the preliminaries and methodology. The features of the model and the proposed estimator are given in Section 3. Section 4 discusses the smoothing parameter selection criteria. Section 5 provides a real data example and a Monte Carlo simulation study. Finally, the conclusions are presented in Section 6.

Preliminaries and methodology
Let F, G and M be the cumulative distribution functions of Y, C and T, respectively, where Y and C are assumed to be positive random variables. Under the assumption that Y and C are independent, the survival function of T can be written as

1 − M(t) = (1 − F(t))(1 − G(t)).   (3)

To assess the statistical accuracy of a model computed from a censored data set, we also assume that the upper support point of M is finite, that F and G have no common jumps, and that G is continuous. It follows from (3) that there is a probability distribution for T at each possible value of X and Z. For a given point s = (x, z), the mean of this distribution can be defined as

m(s) = E(T | X = x, Z = z).

Because of the nature of censored data, the semiparametric model cannot be estimated directly by the traditional methods. To overcome this problem, Koul et al. (1981) proposed a synthetic data transformation. Normally, the right-censored response variable (Y) and the observed response variable (T) have different expectations because of censoring, but the synthetic data transformation provides a correction for this problem (see Appendix A). That is, the censored response observations t_i are converted to synthetic response values according to the rule

y_{iG} = δ_i t_i / (1 − G(t_i)),   (7)

where G(·) is the distribution of the censoring variable, as mentioned in the introduction to this section. In censored data applications, however, the censoring distribution G is usually unknown, so the ordinary methods cannot be used directly to estimate the components of model (1). To overcome this problem, Koul et al. (1981) proposed replacing G with its Kaplan-Meier estimator, given by

1 − Ĝ(t) = ∏_{i: t_(i) ≤ t} ((n − i)/(n − i + 1))^(1 − δ_(i)),

where t_(1) ≤ t_(2) ≤ ⋯ ≤ t_(n) are the ordered values of the t_i and the δ_(i) are the censoring indicators associated with the t_(i). Substituting Ḡ(t) = 1 − Ĝ(t) for 1 − G(·) in (7), we obtain the following synthetic data:

t̂_i = δ_i t_i / (1 − Ĝ(t_i)).

We will now see how to estimate the smooth function f(z) and the regression coefficient vector β based on these synthetic observations.
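A minimal Python sketch of the Kaplan-Meier step and the resulting synthetic data follows; the function names are ours, and we evaluate the K-M survival at the left limit of each observed time so that every denominator stays strictly positive (an implementation choice, not stated in the paper).

```python
import numpy as np

def km_censoring_survival(t, delta):
    """Kaplan-Meier estimate of the censoring survival function 1 - G,
    evaluated at the left limit of each observed time t[i] (so every
    returned value is strictly positive).  For the censoring variable C,
    the 'events' are the observations with delta == 0."""
    t, delta = np.asarray(t, float), np.asarray(delta, int)
    n = len(t)
    order = np.argsort(t, kind="stable")
    surv = np.empty(n)
    s = 1.0
    for rank, idx in enumerate(order):
        surv[idx] = s                        # value just before t[idx]
        if delta[idx] == 0:                  # censoring event: survival drops
            s *= (n - rank - 1) / (n - rank)
    return surv

def synthetic_response(t, delta, surv_g):
    """Koul-Susarla-Van Ryzin synthetic data: t_hat_i = delta_i * t_i / (1 - G_hat(t_i))."""
    t = np.asarray(t, float)
    return np.where(np.asarray(delta) == 1, t / surv_g, 0.0)
```

For example, with observed times (1, 2, 3, 4) and indicators (1, 0, 1, 1), the single censoring event at time 2 drops the estimated censoring survival to 2/3, inflating the later complete observations accordingly.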
For these purposes, several estimation methods, such as smoothing splines, kernel smoothing and regression splines, have been developed in the literature. For convenience, we use the regression spline method to estimate the unknown regression function and the parametric coefficients in model (1). This estimation procedure is explained in the following section.

Derivation of the estimator
Now we apply the ideas described above to extend the penalized spline method to the case of censored data. In this method, f(z) is approximated by an rth-degree penalized spline with a truncated polynomial basis (Ruppert et al., 2003):

f(z) ≈ α_0 + α_1 z + ⋯ + α_r z^r + Σ_{j=1}^{K} b_j (z − κ_j)_+^r,   (9)

where r ≥ 1 is an integer that indicates the degree of the penalized spline and (z − κ)_+^r = (z − κ)^r if z > κ and 0 otherwise. Also, κ_1 < ⋯ < κ_K are the specifically selected knots, with min(z_i) ≤ κ_1 < ⋯ < κ_K ≤ max(z_i).
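The truncated polynomial basis can be sketched in Python as follows (the function name is ours):

```python
import numpy as np

def truncated_power_basis(z, knots, degree):
    """Design columns for an r-th degree truncated-polynomial spline:
    [1, z, ..., z^r, (z - k_1)_+^r, ..., (z - k_K)_+^r]."""
    z = np.asarray(z, float)
    poly = np.vander(z, degree + 1, increasing=True)           # 1, z, ..., z^r
    trunc = np.maximum(z[:, None] - np.asarray(knots), 0.0) ** degree
    return np.hstack([poly, trunc])
```

Each knot contributes one extra column that is zero to the left of the knot, which is what allows the penalty to control the wiggliness of the fit.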
There are several studies on the selection of the number of knots. In this study, we select the knots according to the full search algorithm (FSA) described in Ruppert et al. (2003).
Using the above truncated polynomial (9), it follows that the censored regression model (1) can be modified into a mixed model, given by

t̂_i = x_i′β + α_0 + α_1 z_i + ⋯ + α_r z_i^r + Σ_{j=1}^{K} b_j (z_i − κ_j)_+^r + ε̂_i,   (11)

where i = 1, …, n, the t̂_i's are the synthetic response observations, and the ε̂_i's are random error terms; for given G and n → ∞, E(ε̂_i) ≅ 0 (see Appendix B). Thus, we can fit model (1) by penalized splines through the mixed model (11). In matrix and vector form, model (11) can be rewritten as (Brumback et al., 1999; Ruppert et al., 2003)

T̂ = XB + Ab + ε̂,   (12)

where X and A are design matrices whose ith rows are

X_i = (x_i′, 1, z_i, …, z_i^r) and A_i = ((z_i − κ_1)_+^r, …, (z_i − κ_K)_+^r),

and B collects β together with the polynomial coefficients α_0, …, α_r. The estimators of B and b are obtained by minimizing the penalized least-squares criterion

‖T̂ − XB − Ab‖² + λ θ′Dθ,  θ = (B′, b′)′,   (13)

where D is a diagonal penalty matrix whose elements corresponding to the unpenalized coefficients B are 0 and whose remaining K elements, corresponding to the spline coefficients b, are 1 (so the penalty reduces to λ b′b).
The penalty term in (13) penalizes curvature in the function f, thus yielding a smoother fit. The amount of the penalty is controlled by the smoothing parameter λ > 0. In general, large values of λ produce smoother estimates, while smaller values produce more wiggly estimates. The parameter λ therefore plays a key role in estimating model (11), and one of our tasks is to select an optimal value of λ. This problem is discussed in Section 4.
Minimization of the criterion (13) leads to a system of normal equations. Solving first for the spline coefficients gives the regression spline estimator of b in model (12),

b̂ = (A′A + λI_K)⁻¹ A′(T̂ − XB̂).   (14a)

Then, substituting this estimator of b into (13), we obtain

B̂ = (X′U⁻¹X)⁻¹ X′U⁻¹ T̂,   (14b)

where U⁻¹ = I − A(A′A + λI_K)⁻¹A′, and thus the vector of fitted values is given by

T̃ = XB̂ + Ab̂ = H_λ T̂,   (14c)

where H_λ denotes the resulting smoother (hat) matrix.
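A sketch of the penalized least-squares fit in Python follows; we solve the joint ridge system for θ = (B′, b′)′, which is algebraically equivalent to profiling out b as in (14a)-(14b), and the names are ours.

```python
import numpy as np

def penalized_fit(X, A, t_hat, lam):
    """Minimize ||t_hat - X B - A b||^2 + lam * b'b by solving the
    joint penalized normal equations for theta = (B', b')'."""
    C = np.hstack([X, A])
    D = np.diag(np.r_[np.zeros(X.shape[1]), np.ones(A.shape[1])])
    core = np.linalg.solve(C.T @ C + lam * D, C.T)
    theta = core @ t_hat
    H = C @ core                      # smoother (hat) matrix H_lambda
    return theta, H
```

With λ = 0 this reduces to ordinary least squares on the combined design; as λ grows, the spline coefficients b are shrunk toward zero while B is left unpenalized.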

Estimation of variance
In practice, the smoothing parameter depends on the variance of the error terms ε̂ = T̂ − XB − Ab. As in linear regression, we can develop an estimator of σ² from the residual sum of squares (RSS),

RSS(λ) = ‖(I − H_λ)T̂‖²,   (15)

where H_λ is the smoother matrix defined in (14c). Thus, an estimator of σ² is obtained as

σ̂² = MSE = RSS(λ) / tr{(I − H_λ)′(I − H_λ)},   (16)

where tr{(I − H_λ)′(I − H_λ)} represents the degrees of freedom (DF) for the residuals. The quantity MSE is called the mean squared error. As in linear regression, the DF can be used in the estimation of σ². Since the MSE has a negligible bias term, Eq. 16 is an approximately unbiased estimate of σ² (Ruppert et al., 2003).
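The residual degrees of freedom and the variance estimate (16) can be sketched as follows (Python; function names are ours):

```python
import numpy as np

def residual_df(H):
    """tr{(I - H)'(I - H)} = n - 2 tr(H) + tr(H H'), the residual
    degrees of freedom used to scale the RSS."""
    n = H.shape[0]
    return n - 2.0 * np.trace(H) + np.trace(H @ H.T)

def sigma2_hat(t_hat, H):
    """Variance estimate (16): RSS divided by the residual df."""
    resid = t_hat - H @ t_hat
    return float(resid @ resid) / residual_df(H)
```

When H is the hat matrix of an unpenalized linear model, residual_df reduces to the familiar n minus the number of fitted parameters.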
Furthermore, it is easy to see that the expected value of the RSS is

E{RSS(λ)} = σ² tr{(I − H_λ)′(I − H_λ)} + ‖(I − H_λ)E(T̂)‖²,

where the first term measures the variance, while the second term measures the bias. To see these measures for the parametric estimator, we expand B̂ in (14b) using (12) to find

E(B̂) = B + (X′U⁻¹X)⁻¹X′U⁻¹Ab.

Hence, the bias and variance-covariance matrix of this estimator (taking Cov(T̂) = σ²I) are, respectively,

Bias(B̂) = (X′U⁻¹X)⁻¹X′U⁻¹Ab,
Cov(B̂) = σ² (X′U⁻¹X)⁻¹X′U⁻¹U⁻¹X(X′U⁻¹X)⁻¹,

where σ² is estimated as in (16).

Choice of the smoothing parameter
In the literature there are many classical methods used to select the amount of smoothing. Here we consider and compare the three most widely used selection criteria: AICc, GCV and Mallows' Cp. The positive parameter λ that minimizes a selection criterion is chosen as the optimal smoothing parameter.

AICc criterion
Hurvich et al. (1998) proposed an improved version of the classical Akaike information criterion (AIC) to handle smoothing parameter selection problems. This criterion, denoted AICc, is defined as

AICc(λ) = log{RSS(λ)/n} + 1 + 2{tr(H_λ) + 1} / {n − tr(H_λ) − 2},

where H_λ is the smoother matrix in (14c).
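A sketch of the AICc computation from the synthetic responses and the smoother matrix (Python; the function name is ours):

```python
import numpy as np

def aicc(t_hat, H):
    """Improved AIC of Hurvich et al. (1998):
    log(RSS/n) + 1 + 2(tr(H) + 1) / (n - tr(H) - 2)."""
    n = len(t_hat)
    resid = t_hat - H @ t_hat
    tr_h = np.trace(H)
    return float(np.log(resid @ resid / n) + 1.0
                 + 2.0 * (tr_h + 1.0) / (n - tr_h - 2.0))
```

In practice one evaluates this over a grid of λ values and keeps the minimizer.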

GCV criterion
The GCV is the most widely used criterion for selecting the smoothing parameter; it provides a computationally efficient approximation to leave-one-out cross-validation (CV). This criterion, described by Craven and Wahba (1978), is given by

GCV(λ) = {RSS(λ)/n} / {1 − tr(H_λ)/n}².
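A corresponding sketch of the GCV score (Python; the function name is ours):

```python
import numpy as np

def gcv(t_hat, H):
    """Generalized cross-validation score: (RSS/n) / (1 - tr(H)/n)^2."""
    n = len(t_hat)
    resid = t_hat - H @ t_hat
    return float((resid @ resid / n) / (1.0 - np.trace(H) / n) ** 2)
```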

Mallows' Cp criterion
The Cp criterion, proposed by Mallows (1973), provides an estimate of the MSE scaled by σ². It is given as

Cp(λ) = RSS(λ)/σ̂² − n + 2 tr(H_λ).

In practice, if σ² is unknown, it can be estimated as in (16) using a pre-chosen (low-bias) pilot value of the smoothing parameter obtained by any selection method.
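And a sketch of the Cp score (Python; the pilot variance is passed in, e.g. from (16) evaluated at a small pre-chosen λ; names are ours):

```python
import numpy as np

def mallows_cp(t_hat, H, sigma2_pilot):
    """Mallows' Cp: RSS/sigma^2 - n + 2 tr(H), where sigma2_pilot is a
    variance estimate from a pre-chosen (low-bias) pilot fit."""
    n = len(t_hat)
    resid = t_hat - H @ t_hat
    return float(resid @ resid / sigma2_pilot - n + 2.0 * np.trace(H))
```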

Numerical studies
In this section, we consider the different estimators of the components of the censored semiparametric regression model (12) and discuss their finite-sample properties using two data sets. The first is a real data set from cancer patients, while the second is simulated under different censoring levels and sample sizes.

Real data example
We use right-censored real data from bowel cancer patients in Izmir, Turkey. To analyze the survival times of the patients, we write a semiparametric regression model of the form (11) for these data (model 17). To visualize the function f(·), we use a scatter diagram, plotting the smooth curve with 95% confidence intervals (see the second panel of Fig. 1); a scatter diagram of the survival times against Alb is shown in the first panel of Fig. 1. The second panel of Fig. 1 shows that the relationship between the survival time and Alb may be nonlinear; accordingly, the variable Alb is used as the nonparametric part of model (17). The main idea is to obtain a useful approximation f̂(z) to the real function f(z). For this purpose, we fit a 3rd-degree piecewise polynomial spline, as specified in (9), to the nonlinear f(z) using the penalized spline method, with the smoothing parameter λ selected by the AICc, GCV, and Cp criteria, respectively. The three estimated spline fits are illustrated in Fig. 2. To assess the quality of these fits we used the mean squared error (MSE) values, i.e., the averaged squared differences between the fitted values and the data. It is easy to see that Cp outperforms the remaining two criteria, AICc and GCV, in estimating the nonparametric component of model (17) with right-censored data.

Simulation study
In this section, we carry out a Monte Carlo simulation study to show the practical performance of the smoothing parameter selection criteria. To see the performance of each criterion in small, medium and large samples, we consider three censoring levels (CLs), 10%, 30%, and 50%, for each of the sample sizes n = 50, 100, and 200. The censoring variable is generated as in Wang et al. (2004); under these conditions we obtain the three censoring levels 10%, 30% and 50%, respectively. The data are generated from the censored semiparametric regression model

y_i = 2x_{i1} + x_{i2} − 1.5x_{i3} + f(z_i) + ε_i,  i = 1, …, n,   (19)

where the x_{ij}'s have uniform U(0,1) distributions (j = 1, 2, 3), f(z_i) = exp{sin(z_i)cos(3z_i) + √z_i} with z_i = 4(i − 0.5)/n, and ε_i ~ N(0,1). Because of the censoring, however, we use the transformed (synthetic) response observations t̂_i instead of y_i, as described in Section 3. For each censoring level and sample size we conducted 1000 Monte Carlo replications, obtaining 1000 estimates of the vector β = (β_1, β_2, β_3)′ = (2, 1, −1.5)′ forming the parametric part, and calculating 1000 smooth curves for the nonparametric part of the censored semiparametric model (19).
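The data-generating design of model (19) can be sketched as follows (Python). The exponential censoring mechanism and its scale below are our assumptions for illustration, not necessarily the scheme of Wang et al. (2004); in practice the scale would be tuned to hit the target censoring level.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
beta = np.array([2.0, 1.0, -1.5])

x = rng.uniform(0.0, 1.0, size=(n, 3))        # x_ij ~ U(0, 1)
z = 4.0 * (np.arange(1, n + 1) - 0.5) / n     # z_i = 4(i - 0.5)/n
f = np.exp(np.sin(z) * np.cos(3.0 * z) + np.sqrt(z))
y = x @ beta + f + rng.standard_normal(n)     # eps_i ~ N(0, 1)

c = rng.exponential(scale=20.0, size=n)       # assumed censoring variable
t = np.minimum(y, c)                          # observed responses
delta = (y <= c).astype(int)                  # censoring indicators
```

The pair (t, delta) would then be passed through the synthetic data transformation before fitting the penalized spline.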
Since nine different configurations were examined in this simulation study, it is not possible to display all of them here; therefore, only four configurations are shown in the figures. Outcomes for CL = 30% with small samples are similar to Fig. 3 and are not reported; findings for the other sample sizes (n = 100 and 200) at CL = 10% and CL = 30% are likewise similar to Fig. 3 and are not shown. Accordingly, it can be said that the criteria are not superior to one another under low and medium censoring levels. On the other hand, the effect of heavy censoring is shown in Figs. 3-5 for small, medium, and large sample sizes, respectively.
These figures show that the MSE values of Cp are smaller than those of the other criteria; that is, Cp performs best in terms of MSE for the nonparametric component under heavy censoring. Also, as can be seen in Fig. 5, the two remaining profile curves (corresponding to GCV and AICc) lie very close to each other.
In Table 2, "Bias" denotes the bias of the estimated coefficients from the simulation mean, "Std" the simulation standard deviation, and "MSE" the simulation mean squared error of the estimated parameters β̂_1, β̂_2 and β̂_3, respectively. Generally, increasing the censoring level tends to increase the variances of the estimated regression coefficients: precision declines as the censoring level increases and improves as the sample size increases. To examine this issue, we computed the averages and standard errors of the regression coefficient estimates obtained from model (19), based on penalized splines, for each criterion, sample size, and censoring level. The findings are given in Table 2, which displays the biases and standard deviations of β̂ = (β̂_1, β̂_2, β̂_3)′ over the 1000 simulations; to assess the quality of β̂, we used the corresponding MSE values. From Table 2 we can see that the biases of β̂ are very small, so the MSE estimates of the regression coefficients are quite satisfactory for all three sample sizes and censoring levels. In particular, the coefficient estimates and their standard deviations under the Cp criterion are smaller than those of the other criteria for all censoring levels and sample sizes; it appears that the Cp method generally outperforms the other methods.
In addition, since the outcomes for medium-sized samples (n = 100) are similar to those for large samples (n = 200), they are not reported here. In particular, when n is large (n = 200), as illustrated in Fig. 6, the AICc method has small biases for all censoring levels; that is, AICc gives nearly unbiased estimates for large samples. However, when n is large, the estimates of the parametric coefficients, the MSE, and the standard deviation values are more stable for the Cp method under all censoring levels. Generally, the estimates obtained by the three criteria are satisfactory at all censoring levels. Furthermore, the MSE values of the selection methods are very close to each other, especially for n = 200, which means that the remaining two estimators also work well. In Table 2, the criterion giving the smallest MSE is indicated in bold. As expected, the MSE values improve as the sample size increases.

There is not much difference among the three smoothing parameter selection criteria. However, the MSE values for large samples are more stable than those for small samples at all censoring levels.

Conclusion
In this paper, we examined the performance of the semiparametric regression model when the response variable is randomly right-censored. As is well known, smoothing parameter selection plays an important role in semiparametric models, and the smoothing parameter must be carefully chosen to obtain accurate estimates. Based on three selection criteria, we fitted three different models to right-censored data, using the penalized regression spline method to estimate the model parameters.
The main goal was to determine the selection criterion that gives the best model fits. The MSE values of the fitted semiparametric models were obtained by averaging over the 1000 simulated data sets. In this paper, we focused on measuring the performance of the selection methods and the quality of the estimation method. For these purposes, we carried out both a simulation study and a real data example using survival data. Examining the results of the simulation and real data experiments, we observed the expected patterns: large MSE values at high censoring levels for all selection methods, and better estimates at low censoring levels or large sample sizes. Overall, the Cp method shows the best performance, while AICc and GCV give similar results.
According to the results summarized in Table 3, we can offer the following recommendations and conclusions:
 Especially for large sample sizes and high censoring levels, Cp performs better than AICc and GCV; in this case, the use of Cp would be more beneficial.
 For small and medium-sized samples, AICc performs better than GCV and Cp under the 30% censoring level; especially for medium censoring levels, the use of AICc would seem more appropriate.

Appendix A. Synthetic data transformation
As noted, the regression function considered in this paper is a conditional expectation when the response variable is censored from the right. Without censoring, the expectation E(Y) is estimated by the standard sample mean n⁻¹ Σ_{i=1}^{n} y_i. Because of censoring, instead of y_i we observe t_i = min(y_i, c_i), as in Eq. 2. If the censoring distribution G is known, an unbiased estimate of E(Y) is the mean of the synthetic observations,

n⁻¹ Σ_{i=1}^{n} δ_i t_i / (1 − G(t_i)).

In general, the censoring distribution G is unknown; in this case E(Y) is estimated with the help of the Kaplan-Meier estimator, i.e., with Ĝ replacing G. A proof is given by Susarla et al. (1984), from which it follows that the synthetic-data mean converges to E(Y).

Appendix B. Synthetic response observation
When n → ∞, it can be said that E(ε̂_i) ≅ 0 for each i, which also means that, in vector form, E(ε̂) ≅ 0. Consider the model T̂ = XB + Ab + ε̂; it follows that E(T̂) = XB + Ab.