Flexible parametric survival models: An application to gastric cancer data

Article history: Received 21 December 2016 Received in revised form 12 February 2017 Accepted 18 February 2017 Flexible parametric survival models using cubic splines become popular in survival data analysis. The property of allowing converging hazard functions leads them to be the alternatives to Cox proportional hazards model and parametric survival models. In this study, flexible parametric survival models are applied to the data set of 106 gastric cancer patients. According to this data set, metastasis and muscle contraction are found as important risk factors on survival.


Introduction
proportional hazards (PH) model is widely used in the analysis of time-to-event data with censoring and covariates. However, it has some disadvantages such as the estimation of baseline hazard function and the assumption of PH. The baseline hazard function is treated as a highdimensional nuisance parameter and consequently parametric survival models may be appropriate to estimate it. When the PH assumption is violated, parametric survival models give more precise estimates and lead to some benefits. Another, important but less widely used model in survival analysis is proportional odds (PO) model. This model becomes a candidate model under non-proportional hazards. In this study, flexible parametric survival models suggested by Royston and Parmar (2002) based initially on the assumption of either PH or PO scaling of covariate effects is examined. The class of such models is based on transformation of the survival function by a link function proposed by Younes and Lachin (1997) using the parameterized link function of Aranda-Ordaz (1981). Royston and Parmar (2002) used restricted cubic splines to model baseline hazard directly. The restricted cubic splines offer greater flexibility in shape of the hazard function when compared with standard parametric models (Nelson et al., 2007). The purpose of this study is to present the flexible parametric survival models of Royston and Parmar (2002) and apply the models to the real data set in cancer. The comparison of the models is also discussed.

Survival models
Cox PH regression model which was discussed by Cox (1972) is used for analyzing censored survival data. The general Cox PH model is defined through the hazard function h(t) as where z is a covariate vector and h0(t) is the baseline hazard function. The model may be written in integrated form as where, H(t) is the cumulative hazard function. The model has an assumption of proportional hazards, however it has no distributional assumption. The use of Cox PH regression model may have two difficulties such as: Baseline hazard function: In the Cox PH model, the baseline hazard function, which may be estimated by the method of Kalbfleisch and Prentice (1980) among others, is treated as a highdimensional nuisance parameter and is highly erratic. However, the behavior of the hazard function is of potential medical interest because it is directly related to the time-course of an illness. To estimate it informatively (that is, smoothly), some type of parametric model may be appropriate (Royston, 2001).
Non-proportional hazards: A second issue is how to deal with non-proportional hazards which may occur, for example, when modelling prognostic factors in studies with medium or long follow-up times. Although the Cox PH model may be extended to allow for non-proportional hazards, for example by incorporating time varying regression coefficients there is no 'natural', widely accepted approach, and obtaining a satisfactory model can be complicated.
There are further concerns about the complexity involved in the practical interpretation of the coefficients and in the robustness of such models (Royston, 2001).
An alternative approach to modelling survival data is the PO model which was first described in a semi-parametric framework by Bennett (1983) and was further developed by several authors, including importantly by Dinse andLagakos (1983), Crowder et al. (1991), Collett (1994), Rossini and Tsiatis (1996), Yang and Prentice (1999), and Kirmani and Gupta (2001). The model assumes that the effect of the covariates is to increase or decrease the odds of dying by a given duration by a proportionate amount: where, z is a covariate vector, S0(t) is the baseline survival function that is taken from a suitable distribution, and exp ( ′ )is a multiplier reflecting the proportionate increase in the odds associated with covariate values z (Rodriguez, 2007). Here, the covariates act multiplicatively on the odds of survival beyond t. The model is a linear model for the log-odds ratio. As for the PH model, a nonparametric estimate of the baseline hazard function can be obtained. The model is then fitted by estimating β-parameters in the linear component of the model and the baseline survival function from the data (Collett, 1994). In PO model there is an assumption that the odds ratio is constant over time. The PO model with its property of convergent hazard functions is of considerable value in modelling survival data with non-proportional hazards.
To allow non-proportional hazards, Royston (2001), Royston and Parmar (2002), and Lambert and Rosyton (2009) developed flexible parametric models based initially on the assumption of either PH or PO scaling of covariate effects. Generically, the class of such models is based on transformation of the survival function by a link function g(.) where, S0(t)=S(t;0) is the baseline survival function and β is a vector of parameters to be estimated for covariates z. Within this framework, Younes and Lachin (1997) used the parameterized link function of Aranda-Ordaz (1981) where θ = 1 corresponds to the proportional odds model and θ→ 0 to the proportional hazards model (Royston and Parmar, 2002). For estimation, Younes andLachin (1997) used B-splines, Shen (1998) used sieve maximum likelihood and monotone splines and Royston and Parmar (2002) used natural cubic splines to model g[S0(t)] within the Aranda-Ordaz family of link function.

Parametric survival models with splines
Splines are flexible mathematical functions defined by piecewise polynomials and used in regression type models for non-linear effects. The points at which the polynomials join are called knots. Several techniques have been developed for exploring the functional forms. The most common splines used in practice are cubic splines. However, splines can be of any degree, n. Also, function is forced to have continuous 0th, 1st and 2nd derivatives (Lambert and Royston, 2009). In survival analysis, the splines are used in different aspects in several studies such as Hastie and Tibshirani (1990a,b), Gamerman and West (1987), O'Sullivan (1988), Durrelman and Simon (1989), Sleeper and Harrington (1990), Zucker and Karr (1990), Kooperberg and Stone (1992), Gray (1992Gray ( , 1994, Rosenberg (1995) The advantages of parametric survival models can be given as: the survival and hazard functions can be derived and manipulated, they do not have an assumption of proportional hazards, they allow prediction at any time point for any set of covariates and they can use of restricted cubic splines for hazard function. Besides, the key point of parametric survival models is the assumption of the survival time distribution. This creates the need of more flexible models. One of the alternatives is modelling the baseline cumulative odds or hazard function as natural cubic spline function of log time (Royston and Parmar, 2002).
The one of the most common distribution of survival time is Weibull distribution. Suppose that T is random variable having a Weibull distribution with characteristic life µ and shape parameter p. Then the log cumulative hazard function is given by, which is linear in x=lnt. If the distribution of T departs from Weibull then lnH(t) will be related to x by a non-linear function ≡ ( , ) where survival function S(t) is exp(-exps) (Royston and Parmar, 2002). The distribution of the survival time may be different than most common ones, and then more flexible parametric models can be used. The approach taken by Royston and Parmar (2002) is to model the logarithm of the baseline cumulative odds or hazard function as a natural cubic spline function of log time, so the general function ( , ) is approximated by a spline. The PH spline model with fixed covariate vector z may be written (Royston and Parmar, 2002)  The restricted cubic splines are defined as cubic splines constrained to be linear beyond boundary knots first knot and final knot. Restricted cubic splines with K knots can be fit by creating K-1 derived variables. For knots k1, …, kK, a restricted cubic spline function can be written as ( ) = 0 + 1 1 + ⋯ + −1 −1 .
A restricted cubic spline function of lnt, with knots k0, can be written as {ln( ) , 0 }. This is then used for baseline log cumulative hazard in PH model. The estimation of the models is done by using full likelihood which is discussed by Royston and Parmar (2002) and Lambert and Rosyton (2009) in detail.

Application: Gastric cancer
Gastric cancer is the third most common cause of cancer-related death in the world and it remains difficult to cure. In this study, we consider patients who were diagnosed with gastric cancer. The data used here is taken from Eroglu et al. (1997), but some of changes were made in the original data set, covariates and categories of the covariates to apply flexible parametric models. In the following analysis death is the endpoint of interest. The survival time (min=1, max=67) is measured in months and the mean survival time is obtained as 42.33±2.94. Patients who were still alive at the end of the followup period were treated as censored observations. The complete data set consists of 106 patients, of which 36.8% are censored. In the concept of analysis, prognostic factors which affect the survival time of gastric cancer patients are tried to be determined by flexible parametric survival models.
The mean age is obtained as 56.68 ± 1.2. The information about the covariates used in the following study is given in Table 1. In the non-parametric approach to survival analysis we provide the estimates of Kaplan-Meier (KM) survival function and the log-rank test. The logrank test allows for testing the equality of survival functions different groups. We observe that the equality of the survival functions of metastasis (p=0.000), family history (p=0.032) and muscle contraction (p=0.000) are rejected wheras it is not rejected for the others at a 95% confidence level.
Firstly, Cox PH model is applied to data set and AIC is obtained as 315.672. According to this model, metastasis and also muscle contraction are found significant at 95% confidence level. Patients with metastasis have 2.95 times the hazard for the patients without metastasis. Patients with muscle contraction have 8.358 times the hazard for the patients without muscle contraction.
The PH assumption of Cox PH model is shown by a correlation analysis of Schoenfeld (1982) residuals with time. All the covariates except metastasis (p=0.0001) hold the PH assumption. Therefore PH assumption does not satisfy for this data set and the results of Cox PH model becomes suspicious. Then flexible parametric survival models are applied to the data set as an alternative survival model. Here the Royston (2001), Royston and Parmar (2002) and Lambert and Rosyton (2009) approach is used. They use cubic splines and suggest selecting the df for the spline part of the model by minimizing the Akaike Information Criterion (AIC). The AIC may also be used to select the scale for the model. Table 2 shows the AICs for the gastric cancer data for PH and PO models with between 1 and 6 df (0 and 5 knots).  (PO model) 191.3755 192.1880(PO model) 191.3755 192. 189.8785 190.4991 190.2563 The model minimizing the AIC is the PH model with 2 knots for which AIC = 187.7287. The PH model with 2 knots is significant (p=0.0008) and the results are given by Table 3. According to the PH model with splines (k=2), metastasis and muscle contraction are found as important risk factors which effect the death risk of the patient. Patients with metastasis have 3.01 times the hazard for the patients without metastasis. Patients with muscle contraction have 11.57 times the hazard for the patients without muscle contraction. The significance of the covariates is same with Cox PH model; however the difference in hazard ratios and model fit is distinctly obvious. This difference is so important that it cannot be ignored in survival data sets.

Conclusion
Survival data is generally analyzed by semiparametric and parametric survival models. The semi-parametric model named as Cox PH model allows the distribution of the survival time to be unknown. The baseline hazard is not necessary to estimate hazard ratio. The model may become invalid in case of non-proportional hazards and it is less consistent with theoretical survival function. Parametric survival models are well known models, since it completely specify hazard and survival functions and may possibly predict time-quantile. However, there is an assumption on underlying distribution. The disadvantage of these models brings about the suggestion of flexible parametric survival models. These models use becomes popular especially in flexibility by increasing the degrees of freedom of the spline functions in estimating hazard function. In this study, Cox PH model and also flexible survival models with splines (PH with spline and PO with spline) are used in analyzing gastric cancer data.
The models are tried to determine the possible effects of age, chemotherapy, pathological stage, sex, metastasis, smoking, ulcer treatment, family history, alcohol, tumor stage, muscle contraction and radiotherapy on survival. The best model is obtained as PH model with splines within the models taken into consideration in this study. Metastasis and muscle contraction are found as important risk factors in this cancer data.