Estimating restricted mean treatment effects with stacked survival models
Abstract
The difference in restricted mean survival times between two groups is a clinically relevant summary measure. With observational data, there may be imbalances in confounding variables between the two groups. One approach to account for such imbalances is to estimate a covariateadjusted restricted mean difference by modeling the covariateadjusted survival distribution and then marginalizing over the covariate distribution. We demonstrate that the meansquared error of the restricted mean difference is bounded by the meansquared error of the covariateadjusted survival distribution estimators. This implies that a better estimator of the covariateadjusted survival distributions is associated with a better estimator of the restricted mean difference. Thus, this paper proposes estimating restricted mean differences with stacked survival models. Stacked survival models estimate a weighted average of several survival models by minimizing predicted error. By including a range of parametric and semiparametric models, stacked survival models can effectively estimate a covariateadjusted survival distribution and, therefore, the restricted mean treatment effect in a wide range of scenarios. We demonstrate through a simulation study that the new estimator can perform nearly as well as Cox regression when the proportional hazards assumption is satisfied and significantly better when proportional hazards is violated. The proposed estimator is also illustrated with data from the United Network for Organ Sharing to evaluate postlung transplant survival between large and smallvolume centers.
Keywords: Biasvariance tradeoff; proportional hazards assumption; restricted mean difference; stacked survival models; survival analysis.
1 Introduction
Patients with endstage lung disease (e.g., advanced cystic fibrosis) may be eligible for lung transplantation after other treatment options fail. Unfortunately, posttransplant survival is poor after lung transplantation, especially in comparison to other solid organ transplants, with one and threeyear graft survival of and , respectively. Given the poor prognosis, understanding the factors related to posttransplant survival remains an important but controversial task. For example, previous research has suggested that transplant center volume (which we operationalize as the number of lung transplants performed at a center over the preceding two years) is associated with posttransplant survival with higher volume centers associated with lower mortality (Weiss et al., 2009; Thabut et al., 2010). However, the effect of center volume after properly adjusting for confounders remains unclear.
In the absence of censoring, the difference in posttransplant survival time between high and low volume centers would be traditionally summarized by the difference in mean survival time. However, the mean for a nonnegative random variable (e.g., survival time) is defined as , where is the survival function of the random variable . The estimate of the mean is, therefore, not defined when for all observed , a situation regularly experienced with even light censoring. Furthermore, there exists no consistent estimator of when , where is the maximum followup time. Since substantial censoring is experienced in lung transplantation, a different summary measure is required.
The restricted mean is an alternative summary measure that is always estimable in the presence of rightcensored survival times (Royston and Parmar, 2013). The restricted mean truncates observations at some time point [i.e., ]. By choosing a value for within the observed followup time, the restricted mean is an estimable summary measure with a direct interpretation that is closely related to the mean. For example, the average difference in posttransplant survival over one year between highvolume and lowvolume centers is the difference in oneyear restricted means between the two groups. However, estimating the difference of the restricted mean survival time with data from observational studies, such as the data available in the lung transplantation example, is difficult due to potential confounding. In particular, the difference in the area under the KaplanMeier survival curve up to time is not necessarily a consistent estimator of the causal restricted mean difference between the two treatment groups.
To account for imbalances in confounders between the two treatments, several researchers have proposed estimating the covariateadjusted restricted mean difference by modeling the covariateadjusted survival distribution (i.e., estimating the conditional survival function), and then marginalizing over the covariate distribution to estimate restricted mean difference (referred to as the “regression” approach). For such an approach, the model of the covariateadjusted survival distribution (or the estimator of the conditional survival function) defines the estimator of the restricted mean difference.
The covariateadjusted survival distribution is often assumed to follow a proportional hazards model. For example, Karrison (1987) proposed modeling the survival time distribution as a proportional hazards model with a piecewise constant baseline hazard function. As a natural alternative, Zucker (1998) proposed a proportional hazards model with an unspecified baseline hazard, i.e., a Cox proportional hazards model (Cox, 1972). Both Karrison and Zucker assume that the proportional effect of the covariates on the hazard is the same for each treatment. Chen and Tsiatis (2001) relax this assumption by estimating separate baseline hazard functions and covariate effects for each treatment.
Unfortunately, the proportional hazards assumption may not hold in many applications. For example, centers with greater lung transplant volume are more likely to perform bilateral lung transplants (as opposed to single lung transplants), and the type of lung transplant is wellknown to violate the proportional hazards assumption (Thabut et al., 2009). As such, current approaches, which rely on the proportional hazards assumption, may produce biased and inefficient estimates of the restricted mean difference between high volume and low volume centers. As an alternative, we could estimate the survival distribution with a different model (e.g., an accelerated failure time model), but the estimator would then be biased if the accelerated failure time assumption is violated. Rather than rely on a particular parametric or semiparametric model, we pursue a flexible estimator of the conditional survival function and, therefore, of the restricted mean difference that performs well across a wide range of situations.
In particular, we investigate ‘stacked survival models’ for estimating the conditional survival function. Stacking finds a weighted average of several conditional survival function estimators by minimizing predicted error (Wey et al., 2013). Since the minimization is based on predicted error, stacking can include parametric models, semiparametric models and nonparametric models. This allows more weight to be given to the model that most accurately estimates the underlying survival function for a given situation and sample size. Wey et al. (2013) demonstrated that the stacked survival model is competitive with parametric models and the Cox proportional hazards model for estimating the conditional survival function when assumptions are satisfied, but performs better when assumptions are violated. We show that the meansquared error of the restricted mean difference estimator is bounded by the meansquared error of estimator the conditional survival function. As such, better estimators of the conditional survival function will generally lead to a better estimator of the restricted mean difference. Therefore, our goal is the improvement of restricted mean treatment effect estimation in a wide range of situations by estimating the conditional survival function with stacked survival models.
Section 2 introduces the estimator of the restricted mean treatment effect and bounds the meansquared error (MSE) of the restricted mean treatment effect by the meansquared error of the conditional survival function. Section 3 outlines how stacked survival models can be applied to restricted mean treatment effect estimation. A simulation study evaluates the finite sample performance of the proposed estimator in Section 4. The proposed estimator is then applied to a observational registry of postlung transplantation survival from the United Network for Organ Sharing in Section 5. Concluding remarks are presented in Section 6.
2 Proposed Estimator
Throughout the paper, random variables and observed variables are distinguished by capital and lower case letters, respectively. The factor (i.e., the treatment or condition), which the restricted mean survival is compared across, is denoted by , where denotes the subject, and follows the Bernoulli random variable (i.e., ). Additional covariates, denoted by vector , are measured at the beginning of the study and follow the distribution of the random variable . For this paper, we define the nonnegative survival time random variable as , where and are the (possibly unobservable) survival time random variables had a patient received treatment and , respectively. We assume that there are no unmeasured confounders; that is, the set of potential outcomes, , is conditionally independent of given (i.e., , where denotes statistical independence). The censoring time is , which follows the distribution of the continuous nonnegative random variable and is assumed to be conditionally independent of (i.e., ). Hence a sample of right censored survival data for patients is , , where and .
Now let and be, respectively, the treatmentspecific conditional survival functions for the covariateadjusted survival and censoring distributions for treatment . For brevity, we write throughout that and .
2.1 Restricted Mean Treatment Effects
Following the outline of Chen and Tsiatis (2001), we estimate the restricted mean for each treatment group with the “regression” approach, which involves modeling the covariateadjusted survival time distribution. In particular, the restricted mean for treatment is defined as
where (2.1) holds due to the assumption that (i.e., the assumption of no unmeasured confounders). It is important to note that the outer expectation in (2.1) is taken with respect to the marginal, rather than conditional, covariate distribution.
Once we estimate the treatmentspecific conditional survival functions, , then we estimate the expectation over the covariate space with the empirical covariate distribution. Thus, the estimator for the restricted mean of the potential outcome is
(2) 
where is the estimate of . In practice, a closed form solution of equation (2) may not exist, and we, therefore, approximate equation (2) by
(3) 
where are the ordered event times, i.e., for all , and is one more than the number of event times less than . If no event time equals (i.e., for all ), then and is the largest event time less than . For two treatments, the estimated difference in restricted mean survival time is
(4) 
which also corresponds to the difference in the area under the estimated survival curves for the two potential outcomes up to time .
2.2 Influence of TreatmentSpecific Conditional Survival Functions
There is a clear connection between the restricted mean treatment effect and the treatmentspecific conditional survival functions. We further formalize this connection by placing an upper bound on the MSE of the restricted mean treatment effect estimator that depends on the MSE of the estimators for the treatmentspecific conditional survival functions.
Theorem 1.
Define the meansquared error of a restricted mean treatment effect estimator as , then
where is the meansquared error of an estimator of the treatmentspecific conditional survival function for treatment . Note that the expectation is over the random variable for the covariate space and the sampling distribution of the conditional survival function estimator.
The inequality (see the Supporting Information for the proof) illustrates that the MSE of the restricted mean treatment effect should be associated with the MSE of the treatmentspecific conditional survival functions. Of course, meansquared error decomposes into a squared bias term and a variance term, which implies the following corollary
Corollary 2.
Define and as, respectively, the bias and variance of the restricted mean treatment effect, then
since .
Thus, the performance of the treatmentspecific conditional survival functions places an upper bound on both traditional measures of performance (i.e., bias and MSE) for the restricted mean treatment effect. The bound on the squared bias, or absolute bias, is less tight than the bound on the MSE due to a positive, and potentially large, variance term. Therefore, we would expect a relatively strong association between the MSE of the restricted mean treatment effect and the MSE of the treatmentspecific conditional survival functions, but a relatively weak association between the bias of the restricted mean treatment effect and the MSE of the treatmentspecific conditional survival functions. The simulation study in Section 4 presents an empirical demonstration of this relationship.
3 Stacked Survival Models
Since the performance of the restricted mean treatment effect estimator is closely related to the performance of the treatmentspecific conditional survival function estimator, we propose estimating restricted mean treatment effects with stacked survival models. Stacked survival models estimate a weighted combination of survival models that can span parametric, semiparametric, and nonparametric models by minimizing prediction error. For estimating conditional survival functions, nonparametric estimators can be preferred to parametric and semiparametric estimators due to relaxed assumptions. Yet, even when misspecified, parametric and semiparametric estimators can possess better operating characteristics in small sample sizes due to smaller variance than nonparametric estimators. Fundamentally, this is a biasvariance tradeoff situation and, by minimizing predicted error, stacking estimates an optimal combination of survival models that balances the biasvariance tradeoff of each estimator at a given sample size. In particular, Wey et al. (2013) illustrate that stacked survival models effectively estimate a conditional survival function across a wide range of situations.
In uncensored settings, stacking estimates the optimal weighted average of several candidate models by minimizing predicted squared error (Breiman, 1996). However, predicted squared error is often poorly defined in censored settings due to potentially unobserved event times. Therefore, a different loss function that is tailored to censored data is required. Wey et al. (2013) evaluate the Brier Score (Graf et al., 1999) over a grid of time points for stacking survival models. In addition, they show that, under certain conditions, the stacked survival model using the Brier Score is uniformly consistent for the true conditional survival function. This section focuses on appropriately modifying stacked survival models for estimating restricted means.
The Brier Score (Graf et al., 1999) measures the predicted squared error of a conditional survival function at a particular time point. Following Lostritto et al. (2012), the estimated Brier Score for a given estimator of the conditional survival function for treatment at a single time point can be written as
(5) 
where , , is the estimated conditional survival function of the censoring distribution for subjects that received the treatment, and is the set of patients that received treatment . For a fixed time , censored observations with will contribute to the Brier Score, but when the censored observations will only contribute to the Brier Score indirectly through the estimation of . In this paper, is the (unconditional) treatmentspecific KaplanMeier survival estimator denoted here after as .
To estimate the conditional survival function for each treatment group, the stacking procedure considers the same set of candidate models, and each model has a corresponding conditional survival function estimate, say survival models. The set of candidate survival models included in the stack has an influence on the performance of the stacked estimator. In particular, Breiman (1996) and Wey et al. (2013) found that stacking performs well with a diverse set of models. We note, however, that other combinations of survival models could be included in the stack. Since the goal is estimating the entire conditional survival function up to time , stacked survival models minimize the sum of over a set of time points, say . The stacking weights for the models are estimated by a weighted least squares problem with the additional constraints that for and :
(6) 
where is the survival estimate from the model while leaving the observation out during the fitting process. This ensures that stacking does not reward models that overfit the data. This is traditionally done by leaving only the observation out in the fitting process. However, the computational requirements induced by bootstrapping for confidence interval estimation (which is discussed subsequently) prevent fitting the set of candidate survival models separate times. The data are instead randomly split into five equally sized sets and is obtained for observations in a given set by fitting the survival models to the observations in the other four sets. As such, five sets of survival models, rather than sets of survival models, are fit for each of the candidate survival models.
After minimizing equation (6), the stacked estimate of the conditional survival function for treatment is
(7) 
where is the survival model estimated with all observations on treatment . The treatmentspecific restricted means and the restricted mean treatment effects are then estimated with equations (3) and (4), respectively.
We estimate confidence intervals with the nonparametric bootstrap (Efron and Tibshirani, 1993). In particular, we randomly sample with replacement of the observed ; this is called the bootstrap data set. The bootstrapped estimate of the treatmentspecific conditional survival function is defined as . Since the stacking weights are reestimated for each bootstrap, the final estimate of the conditional survival function for treatment with data from the bootstrap is . The bootstrap estimates for the treatmentspecific restricted means and restricted mean treatment effect are calculated as before [see equations (3) and (4)]. The confidence interval of the restricted mean treatment effect () can then be estimated by the and percentiles of the bootstrap distribution.
Remark 1. The Brier Score measures agreement at only one particular time. As such, the set of values (i.e., ) over which the Brier Score is minimized [see equation (6)] has implications for performance. In particular, care should be taken to avoid picking only very small, or very large values. Wey et al. (2013) recommend at least nine evenly spaced quantiles of the observed event distribution to ensure good estimation of the conditional survival function. The effect of the set of time points over which the Brier Score is minimized is investigated in the Supporting Information.
Remark 2. Wey et al. (2013) show that, given the stack contains a uniformly consistent estimator of the conditional survival function, the stacked estimator is uniformly consistent for the underlying conditional survival function. Therefore, when at least one model within the set of candidate survival models is correctly specified, is consistent for the true restricted mean treatment effect by the dominated convergence theorem (Ferguson, 1996). The proposed estimator is, therefore, consistent in a wider range of scenarios than previous methods that assume a proportional hazards model to estimate the conditional survival distribution.
4 Simulation Study
4.1 Setup
The simulation study evaluates the finite sample performance for estimating the causal restricted mean treatment effect with stacked survival models. We consider four different datagenerating scenarios, indexed by , for the covariateadjusted survival distribution of the potential outcomes. When then , where ; when then , where . The covariate effects for the control group (i.e., ) are
where is the cumulative distribution function of a standard normal distribution (i.e., the nonlinear effect is a ‘smooth step function’). The covariate effects for the treatment group (i.e., ) are
The censoring distributions are defined similarly with replaced by , and are designed to achieve a marginal censoring rate of approximately . The censoring distributions for the control group (i.e., ) are
while the censoring distributions for the treatment group (i.e., ) are
For brevity, we refer to scenarios 1 and 2 as, respectively, the linear and nonlinear exponential scenarios, and scenarios 3 and 4 as, respectively, the linear and nonlinear gamma scenarios.
The covariate distribution is a fourdimensional multivariate normal with mean zero, unit variances, and a positive correlation structure (). To mimic observational studies with confounding, the treatment assignment depends on the covariate distribution. In particular, the probability of receiving treatment, i.e., , is defined by .
Each simulation scenario evaluates performance when and with replications and a sample size of . The bootstrap distributions are estimated with bootstrap replicates. All simulations were run in R version 3.0.0 (R Development Core Team, 2013). The stacking weights are estimated by minimizing the Brier Score over nine equally spaced quantiles of the observed events with the constrained minimization problem solved by the alabama package (Varadhan, 2012).
We consider a mixture of parametric, semiparametric and nonparametric candidate survival models. The parametric models are the Weibull model and logNormal model with only linear main effects. Both models are special cases of an accelerated failure time model, while the Weibull is also a special case of a proportional hazards model. The semiparametric models are two versions of the Cox model. The first Cox model has only linear main effects, while the second Cox model uses penalized splines for main effects with the roughness penalty set to . The survival package estimates both the parametric and semiparametric models (Therneau, 2013). The nonparametric estimator in the set of candidate survival models is random survival forests (RSF), which is estimated with the randomSurvivalForest package (Ishwaran et al., 2008) with trees grown. We use the default tuning parameters of the randomSurvivalForest package for RSF.
We consider two different versions of stacked survival models. The first version excludes random survival forests (RSF) from the set of candidate survival models, while the second includes RSF. In particular,

The ‘Stacked’ estimator only includes the two parametric and two semiparametric survival models in the set of candidate survival models.

The ‘Stacked (with RSF)’ estimator includes the two parametric models, two semiparametric models, and one nonparametric model with default tuning parameters.
There are two significant motivations for investigating two different versions of the stacked estimator. The first motivation is the potential reduction of computational demands. In particular, random survival forests (RSF) are computationally expensive, which is significantly compounded by the nonparametric bootstrap used to estimate confidence intervals. We can save substantial computational time if the stacked estimator without RSF performs as well, or better, than the stacked estimators with RSF. The second motivation is that sample size of is potentially too small for a nonparametric estimator, such as RSF, to perform well enough to contribute to a better stacked estimator.
Each restricted mean estimator in this simulation study uses the “regression” approach described in Section 2.1. The different methods of estimating the restricted mean differ in their approach to estimating the treatmentspecific conditional survival functions in equation (3). The two versions of the stacked estimator are compared to a Cox proportional hazards model with linear main effects (referred to as the ‘Cox estimator’), and a Cox proportional hazards model with penalized splines (referred to as the ‘Splines estimator’). The Cox estimator was proposed by Chen and Tsiatis (2001), while the Splines estimator is a straightforward extension of the Cox estimator that should be more robust in a variety of situations. Note that each stacked estimator includes both the Cox and Splines estimators in the set of candidate survival models.
The methods are compared on the basis of percent relative bias, i.e., , and the ratio of mean squared error (MSE) to the Cox estimator, where . Confidence interval performance is assessed with two measures: the ratio of average confidence interval length (ACL) to the Cox estimator and coverage probability. For each method, the estimated confidence intervals use the and percentiles of the bootstrap distribution with bootstrap replicates. We also present the ratio of ‘integrated squared survival error’ (ISSE) to the Cox estimator for each method: , which corresponds to the average of the meansquared error of the treatmentspecific conditional survival functions presented in Section 2.2. All expectations are approximated by averaging across the replications.
In the exponential scenario with linear covariate effects, the correctly specified Cox estimator should perform well, while the Splines estimator  despite being correctly specified  should be slightly less efficient due to increased flexibility. In contrast, for the exponential scenario with nonlinear covariate effects, the Cox estimator should perform poorly due to model misspecification, while the Splines estimator remains correctly specified and should perform relatively well. Despite including both the Cox estimator and the Splines estimator, both stacked estimators will likely perform relatively worse in both exponential scenarios due to the increased flexibility of additional models in the stack.
Each parametric and semiparametric model is misspecified in both gamma scenarios. However, the parametric models in the stacked estimators closely approximate the truth in the linear scenario (e.g., same mean function). As such, in the linear gamma scenario, the stacked estimators should perform better than both the Cox and Splines estimator due to approximately correct parametric models. The nonlinear gamma scenario assesses the robustness of each estimator since none of the parametric and semiparametric models are correctly specified or closely approximate the underlying mean function. However, the stacked estimators should perform better than the Cox and Splines estimators due to the increased flexibility of additional models in the set of candidate models. This is also an opportunity for the stacked estimator with random survival forests (RSF) to perform well relative to the Stacked estimator without RSF.
4.2 Results
Across the different scenarios, the Stacked estimator (without RSF) possesses as good, or better, bias and MSE than the Stacked estimator with RSF. The Stacked estimator with RSF also, at times, does not achieve nominal coverage at this modest sample size. In addition, the Stacked estimator with RSF actually performs slightly worse than the Stacked estimator (without RSF) in the nonlinear gamma scenario; an advantageous scenario for random survival forests (RSF). Due to the similar, or better, performance, we only compare the Cox and Splines estimators to the Stacked estimator (without RSF) for the rest of this section.
For the exponential scenarios (Table 1), the Stacked estimator possesses similar, or slightly more, bias than the Cox estimator when the covariate effects are linear, and the Stacked estimator has similar, or more, bias than the Splines estimator for both linear and nonlinear covariate effect scenarios. These results are expected as the Cox and Splines estimators are correctly specified in, respectively, the linear and nonlinear exponential scenarios. In the exponential scenario with nonlinear covariate effects, the Stacked estimator has approximately lower relative bias than the misspecified Cox estimator. In the linear scenario, the Stacked estimator has similar, or slightly larger, MSE than the Cox estimator but, in the nonlinear scenario, the Stacked estimator has approximately lower MSE than the Cox estimator. In addition, the Stacked estimator possesses approximately lower MSE than the Splines estimator in both exponential scenarios. Finally, the Stacked estimator possesses approximately narrower confidence intervals than the Splines estimator in both scenarios including, surprisingly, the nonlinear scenario. The Stacked estimator is therefore competitive in the exponential scenarios with linear effects and more efficient than both the Cox and Splines estimators in the nonlinear exponential scenario.
For both gamma scenarios (Table 2), the Stacked estimator possesses less relative bias than the Cox estimator and less relative bias than the Splines estimator. In addition, the Stacked estimator possesses lower MSE than the Cox estimator and lower MSE than the Splines estimator. The Cox estimator surprisingly possesses lower MSE than the Splines estimator and as good, or better, ACL than the Stacked estimator in almost every gamma scenario. However, aside from ACL, the Stacked estimator performs better than both the Cox and Splines estimators under nonproportional hazards.
Wey et al. (2013) motivated stacking with the goal of balancing, for a given sample size, the low variance of (potentially misspecified) parametric models with robust (but potentially inefficient) semiparametric and nonparametric models. This effect is illustrated in the simulation study even when the random survival forests (i.e., the nonparametric estimator) are excluded from the set of candidate survival models. For example, in the linear exponential scenario, the Stacked estimator possesses as good, or better, MSE than the correctly specified Cox and Splines estimators. This is likely due to the inclusion of a correctly specified parametric Weibull model. Yet, even the Weibull model is misspecified in the nonlinear gamma scenario, the Stacked estimator still performs better than both the Cox and Splines estimators despite the lack of either a correctly specified parametric model or a nonparametric estimator. This ability to adaptively find a good balance between the low variance of (potentially misspecified) parametric models with the more robust, but still potentially misspecified, semiparametric models (e.g., a Cox model with penalized splines) is the most appealing aspect of stacked survival models.
Section 2.2 argues that performance of the treatmentspecific conditional survival functions will be more tightly associated with MSE, rather than bias, of the restricted mean treatment effect. Figure 1 provides an illustration of this point from the simulation study. In particular, the Pearson correlation of the ISSE with the MSE of the restricted mean treatment effect is , while the correlation of the ISSE with the bias of the restricted mean treatment effect is . As illustrated in the Supporting Information, the correlation at a larger sample size increases for the MSE, while correlation slightly decreases for the bias.
Remark 3. The Supporting Information contains several additional investigations into the simulation study: a larger sample size, the difference between the bootstrap standard error and Monte Carlo standard error, different approaches to confidence interval estimation, and the choice of time points over which the stacking weights are determined.
5 Effect of Center Volume in Lung Transplantation
We applied the proposed estimator to data from an observational registry of postlung transplant survival to estimate the effect of large center volume on graft survival (i.e., time to death or retransplantation). In particular, we want to estimate the difference in the restricted mean of posttransplant survival between high volume centers (defined as lung transplants over the past two years (Tsuang et al., 2013)) and low volume centers (defined as lung transplants over the past two years). However, previous research has demonstrated that the relationship between transplant type, which is an important confounder, and the posttransplant hazard is likely nonproportional (Thabut et al., 2009). Therefore, this example represents a setting for estimating restricted mean treatment effects with stacked survival models instead of the Cox proportional hazards model.
The United Network for Organ Sharing (UNOS) collects patient information, donor information and survival status of every solid organ transplant performed in the United States. This analysis only includes lung transplants performed between January 1, 2008 and December 31, 2011 in adult recipients receiving their first lungonly transplantation. We adjust for potential confounding from several patient related covariates including gender, age, lung allocation score, native disease grouping (obstructive, vascular, cystic and restrictive), distance walked in six minutes, ventilator use, level of oxygen use, and type of transplant (single versus bilateral lung transplant). We also adjust for several donor related covariates: age over 55 years, African American race, smoking history greater than pack years, and height difference between donor and recipient. The event of interest is time to death or retransplantation. A total of transplanted patients were included in this analysis. Approximately of the survival times were censored.
Similar to Stacked estimator (without random survival forests) in Section 4, the stacked survival model includes two versions of the Cox model, a Weibull model, and a logNormal model. The Weibull, logNormal, and the first Cox model fit linear main effects to all continuous covariates except the height difference between donor and recipient, which is fit with a quadratic main effect. The second Cox model fits penalized splines to each continuous covariate. The confidence intervals are estimated with and quantiles of the bootstrap distribution with bootstrap replications.
Figure 2 gives the estimated causal restricted mean treatment effect for high volume versus low volume centers from years to years based on the Stacked estimator and the Cox estimator. Both the Stacked and Cox estimators estimate a restricted mean difference between large and small volume centers greater than zero over the range of followup, which indicates better posttransplant survival for high volume centers. The Stacked estimate is approximately larger than the Cox estimate from years to years, while at years the Stacked estimate is approximately smaller than the Cox estimate. In addition, the confidence intervals for the Stacked estimator do not include zero up to about years, while the confidence intervals for the Cox estimator contain zero until about years and becomes nonsignificant only from to years. As illustrated in the simulations, the difference in significance between the two estimators may be a function of the consistently lower MSE of the Stacked estimate in nonproportional hazards scenarios.
Weiss et al. (2009) were specifically interested in the mortality risk for center volume at one year. Since earlier survival is indicative of perioperative and early postoperative mortality (which is relatively high for lung transplantation), the oneyear restricted mean survival is a clinically important outcome. The Stacked estimator suggests a larger difference than the Cox estimator in the causal restricted mean survival over one year. In addition, the confidence interval for the Cox estimate includes zero, while that the Stacked estimator does not.
6 Concluding Remarks
We frame the estimation of causal restricted mean treatment effects as a problem of estimating the conditional survival function. In most application areas, there is little information to suggest a priori an appropriate distributional assumption for the survival time or functional form for the covariates. This motivates flexibly estimating restricted mean treatment effects for observational studies with stacked survival models. In particular, we demonstrate that stacked survival models successfully estimates an optimal combination of candidate survival models that performs well at a given sample size.
Restricted mean survival is traditionally estimated by first estimating the conditional survival distribution with a Cox proportional hazards model, yet the simulation study illustrates that there is little cost to considering the more flexible stacked survival models given that they achieve similar mean squared error (MSE) under proportional hazards. When the proportional hazards assumption is violated, stacked survival models can substantially reduce the bias and MSE of the causal restricted mean treatment effect. In addition, the stacked estimator is consistent in a wider range of situations compared than current approaches that assume a proportional hazards model. We also demonstrated that the Stacked estimator identifies a clinically meaningful difference in lung transplantation between highvolume and lowvolume centers for the one year restricted mean, while the proportional hazards estimator fails detect the difference.
We considered two different approaches to stacked survival models: one estimator (the ‘Stacked’ estimator) excluded the nonparametric random survival forests (RSF) from the set of candidate models, and the other estimator included RSF in the set of candidate models. We found that the Stacked estimator (without RSF) performed as good, or better, and was computationally faster than the stacked estimator with RSF; these relationships held even when the sample size was increased to (see the Supporting Information). Thus, the Stacked estimator without RSF was a reasonable choice for estimating restricted means with stacked survival models in the simulation scenarios considered here. However, different scenarios may exhibit meaningful differences in performance when RSF or other nonparametric estimators are included in the set of candidate survival models. This point illustrates that the appropriate selection of candidate survival models remains a significant area of future research for stacked survival models.
There are two main approaches to estimating the causal restricted mean treatment effect: the “regression approach” pursued in this paper and an approach based on inverseprobability weighting (IPW) for treatment assignment and censoring (Schaubel and Wei, 2011; Zhang and Schaubel, 2012). The IPW approach requires forming models for the censoring and treatment distributions. The “regression” approach is a more efficient estimator of the restricted mean difference when the conditional survival model has been correctly specified. However, the IPW approach is sometimes preferred because standard methods that estimate the conditional survival distribution may be overly restrictive (e.g., the Cox proportional hazards model). The flexibility of stacked survival models may mitigate some of the concerns of the “regression” approach.
Statisticians regularly want to form linear contrasts of the restricted mean survival. A common approach to restricted mean regression uses pseudoobservations from the leaveoneout jackknife of the KaplanMeier restricted mean estimator as the outcome variable in a generalized linear model (Andersen et al., 2003, 2004). The potential benefit for estimating pseudoobservations with stacked survival models is not clear. However, an alternative to pseudovalues is the modelfree contrast approach proposed by Rudser et al. (2012), which would likely benefit from stacked survival models. The investigation into these areas is a future research interest.
There has been recent research on using model averaging to account for uncertainty in the confounders of the treatmentoutcome relationship (Wang et al., 2012; Cefalu et al., 2013). However, previous work has assumed that the structure of the relationship between the covariates and outcome was known (e.g., linear relationship between covariates and loghazard) although, in practice, there is usually little evidence to support a priori assumptions on the survival outcomes and functional form of the covariate. We demonstrate that principally averaging different model structures can lead to substantially better performance in the estimation of the causal restricted mean treatment effect. Thus, an interesting avenue for future research would consider the selection of covariates based on both the outcome and treatment models, while also considering different distributional assumptions and functional forms for the outcome model.
A conditional survival function is required by many methods besides restricted mean treatment effects: for example, censored quantile regression (Wey et al., 2014), timedependent ROC curves (Zheng and Heagerty, 2004), inverse probabilityofcensoring weighted estimators (Fine and Gray, 1999), modelfree contrast approaches (Rudser et al., 2012), and dynamic treatment regime methods (Zhao et al., 2011). Similar to restricted mean treatment effect estimation, all of these methods have traditionally used a Cox model or a nonparametric method to estimate the conditional survival function. The success of stacked survival models for restricted mean treatment effect estimation illustrates the potential improvement for a wide spectrum of survival analysis methods.
References
 Andersen et al. (2004) Andersen, P. K., Hansen, M. G., and Klein, J. P. (2004). Regression analysis of restricted mean survival time based on pseudoobservations. Lifetime Data Analysis 10, 335–350.
 Andersen et al. (2003) Andersen, P. K., Klein, J. P., and RosthÃ¸j, S. (2003). Generalised linear models for correlated pseudoobservations, with applications to multistate models. Biometrika 90, 15–27.
 Breiman (1996) Breiman, L. (1996). Stacked regressions. Machine Learning 24, 49–64.
 Cefalu et al. (2013) Cefalu, M., Dominici, F., and Parmigiani, G. (2013). Model averaged double robust estimation. Technical Report 149, Division of Biostatistics, Harvard University, Harvard University.
 Chen and Tsiatis (2001) Chen, P.Y. and Tsiatis, A. A. (2001). Causal inference on the difference of the restricted mean lifetime between two groups. Biometrics 57, 1030–1038.
 Cox (1972) Cox, D. R. (1972). Regression models and lifetables. Journal of the Royal Statistical Society, Series B 34, 187–220.
 Efron and Tibshirani (1993) Efron, B. and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman and Hall.
 Ferguson (1996) Ferguson, T. S. (1996). A course in large sample theory. Chapman and Hall/CRC.
 Fine and Gray (1999) Fine, J. P. and Gray, R. J. (1999). A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association 94, 496–509.
 Graf et al. (1999) Graf, E., Schmoor, C., Sauerbrei, W., and Schumacher, M. (1999). Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine 18, 2529–2545.
 Ishwaran et al. (2008) Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. (2008). Random survival forests. Annals of Applied Statistics 2, 841–860.
 Karrison (1987) Karrison, T. (1987). Restricted mean life with adjustment for covariates. Journal of the American Statistical Association 82, 1169–1176.
 Lostritto et al. (2012) Lostritto, K., Strawderman, R. L., and Molinaro, A. M. (2012). A partitioning deletion/substitution/addition algorithm for creating survival risk groups. Biometrics 68, 1146–1156.
 R Development Core Team (2013) R Development Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3900051070.
 Royston and Parmar (2013) Royston, P. and Parmar, M. K. (2013). Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a timetoevent outcome. BMC Medical Research Methodology 13, 1–15.
 Rudser et al. (2012) Rudser, K. D., LeBlanc, M. L., and Emerson, S. S. (2012). Distributionfree inference on contrasts of arbitrary summary measures of survival. Statistics in Medicine 31, 1722–1737.
 Schaubel and Wei (2011) Schaubel, D. E. and Wei, G. (2011). Double inverseweighted estimation of cumulative treatment effects under nonproportional hazards and dependent censoring. Biometrics 67, 29–38.
 Thabut et al. (2010) Thabut, G., Christie, J. D., Kremers, W. K., Fournier, M., and Halpern, S. D. (2010). Survival differences following lung transplantation among us transplant centers. Journal of the American Medical Association 304, 53–60.
 Thabut et al. (2009) Thabut, G., Christie, J. D., Ravaud, P., Castier, Y., Dauriat, G., Jebrak, G., Fournier, M., Leseche, G., Porcher, Raphael, and Mal, H. (2009). Survival after bilateral versus singlelung transplantation for idiopathic pulmonary fibrosis. Annals of Internal Medicine 151, 767–774.
 Therneau (2013) Therneau, T. (2013). Survival analysis, including penalized likelihood. R package version 2.374.
 Tsuang et al. (2013) Tsuang, W. M., Vock, D. M., Copeland, C. A. F., and Lederer, D. J. (2013). An acute change in lung allocation score and survival after lung transplantation: a cohort study. Annals of Internal Medicine 158, 650–657.
 Varadhan (2012) Varadhan, R. (2012). alabama: Constrained nonlinear optimization. R package version 2011.91.
 Wang et al. (2012) Wang, C., Parmigiani, G., and Dominici, F. (2012). Bayesian effect estimation accounting for adjustment uncertainty. Biometrics 68, 661–686.
 Weiss et al. (2009) Weiss, E. S., Allen, J. G., Meguid, R. A., Patel, N. D., Merlo, C. A., Orens, J. B., Baumgartner, W. A., Conte, J. V., and Shah, A. S. (2009). The impact of center volume on survival in lung transplantation: An analysis of more than 10,000 cases. The Annals of Thoracic Surgery 88, 1062–1070.
 Wey et al. (2013) Wey, A., Connett, J., and Rudser, K. (2013). Stacking survival models. arXiv:1309.7936v3.
 Wey et al. (2014) Wey, A., Wang, L., and Rudser, K. (2014). Censored quantile regression with recursive partitioning based weights. Biostatistics 15, 170–181.
 Zhang and Schaubel (2012) Zhang, M. and Schaubel, D. E. (2012). Doublerobust semiparametric estimator for differences in restricted mean lifetimes in observations studies. Biometrics 68, 999–1009.
 Zhao et al. (2011) Zhao, Y., Zeng, D., Socinski, M. A., and Kosorok, M. R. (2011). Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics 67, 1422–1433.
 Zheng and Heagerty (2004) Zheng, Y. and Heagerty, P. (2004). Semiparametric estimation of timedependent roc curves for longitudinal marker data. Biostatistics 5, 615–632.
 Zucker (1998) Zucker, D. M. (1998). Restricted mean life with covariates: Modification and extension of a useful survival analysis method. Journal of the American Statistical Association 83, 702–709.
Percent  
Estimator  Relative Bias  MSE Ratio  ACL Ratio  Cov.  ISSE Ratio  
Linear 
Cox  1  1.00  1.00  0.95  1.00 
Splines  2  1.17  1.12  0.96  4.03  
Stacked  1  1.06  1.05  0.96  1.19  
Stacked (with RSF)  1  1.04  0.96  0.93  1.32  
NonLinear 
Cox  10  1.00  1.00  0.94  1.00 
Splines  3  0.86  1.04  0.95  1.06  
Stacked  5  0.77  0.98  0.95  0.76  
Stacked (with RSF)  6  0.78  0.86  0.88  0.76  


Linear 
Cox  1  1.00  1.00  0.95  1.00 
Splines  1  1.16  1.10  0.95  3.60  
Stacked  2  0.99  1.02  0.95  1.11  
Stacked (with RSF)  3  1.01  0.90  0.84  1.20  
NonLinear 
Cox  10  1.00  1.00  0.92  1.00 
Splines  3  0.88  1.05  0.95  1.07  
Stacked  6  0.82  0.98  0.94  0.77  
Stacked (with RSF)  6  0.81  0.82  0.92  0.78  

Percent  
Estimator  Relative Bias  MSE Ratio  ACL Ratio  Cov.  ISSE Ratio  
Linear 
Cox  13  1.00  1.00  0.93  1.00 
Splines  7  1.19  1.19  0.95  4.14  
Stacked  1  0.82  1.05  0.95  1.11  
Stacked (with RSF)  4  0.87  1.01  0.89  1.21  
NonLinear 
Cox  15  1.00  1.00  0.92  1.00 
Splines  6  1.09  1.15  0.94  1.58  
Stacked  3  0.79  1.04  0.95  0.96  
Stacked (with RSF)  7  0.84  0.99  0.88  0.94  


Linear 
Cox  6  1.00  1.00  0.93  1.00 
Splines  5  1.23  1.13  0.94  3.64  
Stacked  1  0.92  1.04  0.94  1.08  
Stacked (with RSF)  2  0.94  0.96  0.87  1.16  
NonLinear 
Cox  13  1.00  1.00  0.91  1.00 
Splines  6  0.96  1.07  0.94  1.32  
Stacked  5  0.78  1.00  0.94  0.84  
Stacked (with RSF)  7  0.80  0.91  0.86  0.83  
