Estimating the treatment effect in a subgroup defined by an early postbaseline biomarker measurement in randomized clinical trials with time-to-event endpoint
Abstract
Biomarker measurements can be relatively easy and quick to obtain and they are useful to investigate whether a compound works as intended on a mechanistic, pharmacological level. In some situations, it is realistic to assume that patients whose postbaseline biomarker levels indicate that they do not sufficiently respond to the drug are also unlikely to respond on clinically relevant long-term outcomes (such as time-to-event). However, the determination of the treatment effect in the subgroup of patients that sufficiently respond to the drug according to their biomarker levels is not straightforward: it is unclear which patients on placebo would have responded had they been given the treatment, so that naive comparisons between treatment and placebo will not estimate the treatment effect of interest. The purpose of this paper is to investigate assumptions necessary to obtain causal conclusions in such a setting, utilizing the formalism of causal inference. Three approaches for estimation of subgroup effects will be developed and illustrated using simulations and a case study.
Keywords: Causal Inference, Estimand, Principal Stratification, Subgroup Analysis, Weighting
1 Introduction
Biomarker measurements can be relatively easy and quick to obtain, and are useful to investigate whether a compound works as intended on a mechanistic, pharmacological level. In some situations, it is plausible to assume that postbaseline biomarker responders are also more likely to respond on clinically relevant long-term outcomes, which are often of time-to-event type.
This research is motivated by the CANTOS outcome study in prevention of cardiovascular events (Ridker et al., 2017). Inflammation has been identified as playing a key role in atherosclerosis, for example in the formation and rupture of atherosclerotic plaques (Hansson, 2005). The CANTOS trial investigated canakinumab, an antiinflammatory agent, against placebo. The primary outcome was the time to a major adverse cardiovascular event (MACE), a composite endpoint consisting of cardiovascular death, nonfatal myocardial infarction and stroke, and the treatment effect on this outcome was statistically significant. In this specific case the biomarker of interest is the downstream inflammatory marker high-sensitivity C-reactive protein (hsCRP), where lower values indicate less inflammation. Interest focuses on determination of the treatment effect for patients that, 3 months after treatment start, were able to lower hsCRP below a specific target level.
The determination of the treatment effect in a subgroup of patients that is defined based on postbaseline biomarker levels in the treatment group (e.g. indicated by reaching biomarker levels smaller than some threshold) is not straightforward: the biomarker might have a prognostic effect on the outcome, independent of treatment. For example, hsCRP is a known prognostic risk factor for cardiovascular events. It is likely that patients who reach the biomarker target level on treatment also had a better (i.e. smaller) biomarker measurement at baseline compared to patients who do not reach the target. A naive comparison of the biomarker responder subgroup on treatment to the complete placebo group will thus likely overestimate the treatment effect. Similarly, a naive comparison of the biomarker responder subgroup patients on treatment against the biomarker responder subgroup patients on placebo will also likely be biased. Patients who reach biomarker levels below the target on placebo are likely more “healthy” than the biomarker responders on treatment, so such a comparison would likely underestimate the treatment effect.
The purpose of this paper is to investigate the assumptions necessary to draw causal conclusions, using the formal language of causal inference developed over the past decades, see for example Pearl (2009), Imbens & Rubin (2015) or Hernán & Robins (2018) for reviews. “Causal conclusions” here means that the treatment effect should be attributable to the difference in treatment and not to differences in the two compared populations.
In the causal inference literature several papers deal with related questions in more detail. Frangakis & Rubin (2002) introduced the term principal stratification. Ding & Lu (2017) review many of the recent approaches and develop new ones. Related references include Joffe et al. (2007); Schwartz et al. (2011); Zigler & Belin (2012); Jo & Stuart (2009, 2011); Stuart & Jo (2015); Kern et al. (2016).
The purpose of this article is to provide a review of the underlying problem and propose methods that could be utilized in this setting. In Section 2 we utilize causal inference techniques to express the estimands of interest in terms of quantities that are identifiable in a randomized clinical trial. In Section 3 we discuss how these quantities can be estimated. In Section 4 we perform a simulation study to illustrate the proposed methods in a specific situation. Section 5 illustrates the method for a specific simulated data set. Section 6 concludes.
2 Methods
2.1 Estimand of interest
Assume a randomized trial with a treatment and a placebo arm. Let denote the treatment indicator, denoting whether a patient was randomized to treatment () or placebo (). Further let denote the potential continuous biomarker outcome under treatment at a specified time after start of the study. Further let be the binary indicator, defined as whether for a target threshold . Let be the potential event time under treatment . Here we focus on composite endpoints (a mixture of nonfatal events and death). The corresponding observed values will be denoted by , and , where is the treatment actually obtained. Let denote a vector of baseline variables influencing the biomarker and the event time. We assume in what follows that is small, so that no events are observed before . We include a discussion in Appendix B on how to handle events before time .
As discussed in the introduction, the population of interest is the subgroup of patients, that, if given the treatment would be biomarker responders. In terms of the introduced notation these are the patients with .
Different summary measures exist to define a treatment effect in terms of timetoevent outcomes. We focus here on the difference in survival probabilities at a given timepoint
(1) 
and the difference in restricted mean survival times (Royston & Parmar, 2011),
(2) 
integrated up to a timepoint . Here and are the survival functions in the subgroup of interest.
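The two summary measures can be illustrated with a minimal sketch that evaluates a step-function survival curve at a timepoint and integrates it up to a truncation time. The function names, the variable `tau` and the two small curves are hypothetical and serve only as an illustration; the curves themselves would come from the estimators discussed in Section 3.

```python
from bisect import bisect_right

def surv_at(times, surv, t):
    """Value at t of a right-continuous step survival curve.

    times: sorted jump times; surv[k]: survival value from times[k] onward.
    """
    k = bisect_right(times, t)
    return 1.0 if k == 0 else surv[k - 1]

def rmst(times, surv, tau):
    """Restricted mean survival time: area under the step curve on [0, tau]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, surv):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Hypothetical survival curves in the subgroup of interest (illustration only).
trt_times, trt_surv = [1.0, 3.0], [0.9, 0.8]
pbo_times, pbo_surv = [1.0, 2.0, 4.0], [0.8, 0.7, 0.6]
tau = 5.0

surv_diff = surv_at(trt_times, trt_surv, tau) - surv_at(pbo_times, pbo_surv, tau)
rmst_diff = rmst(trt_times, trt_surv, tau) - rmst(pbo_times, pbo_surv, tau)
```

Both differences are computed as treatment minus placebo, so positive values favour treatment.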
Because biomarker response is a postbaseline event in a parallel groups trial, we do not know which patients on placebo would be biomarker responders had they received treatment. Identification of the estimand, and in particular of the placebo survival function in the subgroup of interest, hence requires assumptions. In terms of the draft ICH E9 addendum (ICH, 2017), this is an example of a principal stratification estimand, where the stratum of interest is the subgroup of the overall trial population that would be biomarker responders under treatment.
2.2 Identification of the Estimand
As treatment is randomized we know that and the variables are independent:
(3) 
In what follows we will concentrate on causal identification of and as and can be derived based on these.
Note that can be identified from the observed trial data because
The first equality holds due to randomization (3) and the second equality holds due to the assumption of consistency, which states that the distribution of is equal to the distribution observed in the trial. The quantity
(4) 
however cannot be estimated from the trial data without further causal assumptions, because in a parallel groups trial, one cannot obtain the survival time under placebo, for patients that were on treatment and had . In the following two sections, two different causal assumptions will be explored that allow identification of this quantity: One based on conditional independence assumptions using baseline covariates and the other is an analysis based on the monotonicity and equipercentile assumptions.
2.2.1 Utilization of Covariates
One approach to identify , is to utilize baseline covariates , so that conditional on knowing , provides no further information on (and vice versa). More formally the requirement is that and are independent conditional on covariates . In formulas
(5) 
Using this assumption we have,
The first equation follows by the law of total probability. The second equality follows by the assumption in (5). The third equation follows by randomization (3) and the last equation from the consistency assumption. This equation can be estimated from the data by estimating , and then averaging it over the observed distribution of for the population of patients with and (i.e. estimating by its empirical distribution). We will discuss specific methods for estimation in Section 3.1. This approach will be called “predicted placebo response” (PPR) in the following.
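The averaging step of the PPR idea can be sketched as follows. This is a simplified illustration and not the estimator of Section 3.1: instead of a Cox regression it assumes a single discrete baseline covariate and uses a Kaplan-Meier estimate within each covariate level on placebo, which is then averaged over the covariate distribution of the biomarker responders on treatment. All names and the toy data are hypothetical.

```python
from collections import Counter

def km_surv(times, events, t):
    """Kaplan-Meier estimate of P(T > t) from right-censored data."""
    s = 1.0
    event_times = sorted({ti for ti, e in zip(times, events) if e and ti <= t})
    for u in event_times:
        at_risk = sum(1 for ti in times if ti >= u)
        deaths = sum(1 for ti, e in zip(times, events) if e and ti == u)
        s *= 1.0 - deaths / at_risk
    return s

def ppr_placebo_surv(pbo, responder_x, t):
    """Placebo survival at t, standardized to the covariates of treated responders.

    pbo: (x, time, event) tuples for placebo patients, x a discrete covariate level.
    responder_x: covariate levels of the biomarker responders on treatment.
    """
    total = 0.0
    for x, count in Counter(responder_x).items():
        stratum = [(ti, e) for xi, ti, e in pbo if xi == x]
        if not stratum:
            raise ValueError(f"no placebo patients with covariate level {x!r}")
        times, events = zip(*stratum)
        total += count / len(responder_x) * km_surv(times, events, t)
    return total

# Toy data (illustration only): event = 1 means the event was observed.
pbo = [(0, 1.0, 1), (0, 2.0, 0), (0, 3.0, 1), (0, 4.0, 0),
       (1, 1.0, 1), (1, 2.0, 1), (1, 3.0, 0), (1, 5.0, 0)]
responder_x = [0, 0, 0, 1]
s0_subgroup = ppr_placebo_surv(pbo, responder_x, t=3.5)
```

With continuous or higher-dimensional covariates this stratified version is not practical, which is one reason a regression model is used for the conditional survival function.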
Using the same arguments it follows that the probability density of the event times in the subgroup is . Furthermore,
(6)  
The first two equations follow from the definition of conditional probability. The last equation follows from randomization (because ) and omitting all other terms that do not involve . The result in (6) is useful, because the observed data on with can be used to estimate (and ) by utilizing the weights
(7) 
The weights can be estimated based on the patients on treatment, using, for example, a logistic regression, or other classification approaches. We will discuss specific methods for estimation in Section 3.2. This approach will be called “weighted placebo patients” (WPP) in what follows.
2.2.2 Utilizing Monotonicity and Equipercentile Assumptions
Table 1 illustrates how the overall trial population can be stratified into four subgroups according to their potential biomarker outcomes under treatment and placebo. Here , , and denote the probabilities to fall into the relevant principal strata, with . Each patient falls in exactly one subgroup so that . Note that this classification is not known, because we only observe one of the two potential biomarker measurements for every patient in the trial. Further let and with a be the corresponding marginal probabilities of the table, subject to and . Note that and can be estimated from the observed data on placebo and active treatment, while the other probabilities are not identifiable. To proceed further, a plausible assumption (depending on the mechanistic understanding of the drug) is to assume that
(8) 
i.e., there are no patients that would be biomarker responders under placebo but not under treatment. This so-called monotonicity assumption allows us to identify that , and . This is useful, because (4) can be expressed as
(9)  
The term however remains unidentified in a parallel groups trial: Among the biomarker nonresponders on placebo it is not clear which would have responded under treatment.
Due to the monotonicity assumption, we can identify all proportions in Table 1 and hence also the proportion of patients among the placebo biomarker nonresponders that would have responded on treatment: .
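The bookkeeping implied by monotonicity can be sketched as follows. The stratum labels, function names and numbers are assumptions for illustration. The mixture in `placebo_surv_in_subgroup` follows from the law of total probability once the stratum of responders under placebo but not under treatment is empty; it is meant as an illustration of the idea behind (9), not as a transcription of it.

```python
def strata_under_monotonicity(p_resp_trt, p_resp_pbo):
    """Principal strata proportions from the two observed responder rates."""
    if not 0.0 <= p_resp_pbo <= p_resp_trt <= 1.0:
        raise ValueError("monotonicity requires 0 <= p_resp_pbo <= p_resp_trt <= 1")
    if p_resp_pbo == 1.0:
        raise ValueError("no placebo nonresponders")
    return {
        "responder_both": p_resp_pbo,                   # respond under either arm
        "responder_trt_only": p_resp_trt - p_resp_pbo,  # respond only under treatment
        "responder_neither": 1.0 - p_resp_trt,
        # share of placebo nonresponders that would respond under treatment
        "frac_among_pbo_nonresp": (p_resp_trt - p_resp_pbo) / (1.0 - p_resp_pbo),
    }

def placebo_surv_in_subgroup(strata, surv_pbo_resp, surv_pbo_trt_only):
    """Mixture of the two strata that make up the treatment responders."""
    p_trt = strata["responder_both"] + strata["responder_trt_only"]
    return (strata["responder_both"] * surv_pbo_resp
            + strata["responder_trt_only"] * surv_pbo_trt_only) / p_trt

# Hypothetical responder rates and survival probabilities at one timepoint.
strata = strata_under_monotonicity(p_resp_trt=0.6, p_resp_pbo=0.2)
s0 = placebo_surv_in_subgroup(strata, surv_pbo_resp=0.9, surv_pbo_trt_only=0.8)
```

The first survival input is estimable from placebo biomarker responders; the second is the unidentified term that the remainder of this section addresses.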
The task is hence to identify the patients that would be biomarker responders on treatment among the group of placebo biomarker nonresponders in order to estimate . One approach is to assume that the biomarker outcome on placebo contains information on the biomarker outcome under treatment ( and more specifically the event ). Unfortunately we do not have data relating and , as every patient received either placebo or the treatment, hence this relationship cannot be estimated from the observed data and will be derived using assumptions.
One simple approach is to rank patients according to their observed placebo biomarker outcome. Then one could just select the fraction of patients with the lowest value observed for , and identify those as the ones that would be biomarker responders on treatment. This type of assumption has in other contexts been called equipercentile equating (see Rubin (1991)). This approach can be criticized, because the ordering of potential biomarker outcomes could be different under treatment and placebo. The factors leading to a low biomarker value on placebo might be different from those on treatment. Selecting the patients with low biomarker outcome among the placebo biomarker nonresponders (and identifying those as part of the stratum ) will hence tend to overestimate the survival time under placebo and thus underestimate the treatment effect.
A different idea is to include all patients among the placebo biomarker nonresponders to estimate . This will also include patients with a worse health state, so one would expect that the event time under placebo would be underestimated and thus the treatment effect overestimated.
We propose an analysis that interpolates between these two extremes, using a parameter, where patients with a lower rank and thus lower observed get a higher weight than patients with higher . We propose to use the weight function for patient , where is the empirical quantile for of patient , i.e., if there are observations among the placebo biomarker nonresponders, the patient with the th ordered observation of will have an empirical quantile of . See Figure 1 for an illustration of this weight function for different values. If this approach is equivalent to using the equipercentile equating assumption (the patients with lowest will receive a weight of 1, while all others receive a weight of 0). Letting corresponds to weighting patients equally.
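A weight function with the limiting behaviour described above can be sketched as follows. The exact parametrization is an assumption: the sketch uses a decreasing logistic curve in the empirical quantile, centred at the proportion of placebo nonresponders assumed to respond under treatment, and takes the empirical quantile of the observation with rank i out of n to be (i - 0.5)/n. The names `center` and `sensitivity` are hypothetical.

```python
import math

def logistic_weight(q, center, sensitivity):
    """Decreasing logistic curve 1 / (1 + exp(sensitivity * (q - center)))."""
    z = sensitivity * (q - center)
    if z >= 0:  # numerically stable branch for large positive z
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def mea_weights(biomarker_pbo_nonresp, center, sensitivity):
    """Weights for placebo biomarker nonresponders, in the input order."""
    n = len(biomarker_pbo_nonresp)
    order = sorted(range(n), key=lambda i: biomarker_pbo_nonresp[i])
    weights = [0.0] * n
    for rank, i in enumerate(order, start=1):
        q = (rank - 0.5) / n  # assumed definition of the empirical quantile
        weights[i] = logistic_weight(q, center, sensitivity)
    return weights

values = [3.0, 1.0, 2.0, 4.0]  # hypothetical placebo biomarker values
equal = mea_weights(values, center=0.5, sensitivity=0.0)    # all weights equal
sharp = mea_weights(values, center=0.5, sensitivity=1000.0) # close to 0/1 selection
```

With a sensitivity of zero every patient receives the same weight, and with a very large sensitivity the weights approach the 0/1 selection of the equipercentile equating assumption.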
To summarize the assumptions underlying this approach, we use the monotonicity assumption (8) and the relaxed equipercentile assumption, based on a logistic weighting function. We introduced a parameter that interpolates between two extremes: For this analysis might tend to underestimate the treatment effect while for this analysis will often tend to overestimate the treatment effect. This specific approach will be called analysis based on monotonicity and equipercentile assumption (MEA) in what follows.
3 Estimation
Different statistical models can be utilized to estimate the quantities (1) and (2). Bayesian methods have the appeal of directly providing an uncertainty assessment; for non-Bayesian methods, bootstrap approaches can be utilized to perform inference. We focus here on semiparametric methods and utilize bootstrapping for inference. Nevertheless, other approaches could equally be used for estimation.
Estimation of is straightforward by estimating from the observed data, for example, using the Nelson-Aalen estimator.
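A minimal sketch of a (weighted) Nelson-Aalen estimator is given below. It assumes that the survival function is obtained as the exponential of the negative cumulative hazard, which is one common convention; the function name and the toy data are hypothetical. With unit weights it reduces to the standard estimator, and the optional weights are what the weighted analyses in Sections 3.2 and 3.3 would supply.

```python
import math

def nelson_aalen_surv(times, events, weights=None):
    """Weighted Nelson-Aalen cumulative hazard, returned as a survival curve.

    Returns (event_times, survival), where survival[k] = exp(-H(event_times[k])).
    """
    if weights is None:
        weights = [1.0] * len(times)
    event_times = sorted({t for t, e in zip(times, events) if e})
    cum_hazard, survival = 0.0, []
    for u in event_times:
        # weighted number of events at u and weighted number at risk just before u
        d = sum(w for t, e, w in zip(times, events, weights) if e and t == u)
        r = sum(w for t, w in zip(times, weights) if t >= u)
        cum_hazard += d / r
        survival.append(math.exp(-cum_hazard))
    return event_times, survival

# Toy right-censored data (illustration only): event = 1 means observed event.
na_times, na_surv = nelson_aalen_surv([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
```

The double loop is quadratic in the sample size, which is acceptable for a sketch but would be replaced by a sorted single pass for large trials.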
In Section 2 three methods were described to enable estimation of : the PPR and WPP approaches, both utilizing baseline covariates, and the MEA approach, utilizing the monotonicity and equipercentile assumptions. In the following sections, we summarize the statistical analyses that can be performed to estimate the quantities needed for each approach.
3.1 Predict placebo response (PPR)
The PPR approach requires estimation of (see Section 2.2.1), i.e. estimation of the survival in the placebo arm based on covariates . This can be done by fitting a Cox regression in the placebo arm. An estimate for can be derived by using the empirical distribution of covariates for the biomarker responders on the treatment arm. To derive an estimate for (4) the observed covariates for biomarker responders on treatment will hence be used to predict a survival curve for every patient. This can be done using the Breslow estimator of the baseline survival function. The estimate of (4) is then the average of these predicted survival curves. The main modelling assumption in this approach is that covariates enter multiplicatively on the hazard rate according to the Cox proportional hazards model.
3.2 Weight placebo patients (WPP)
The WPP approach requires a model to determine the weights from (7). For that purpose a logistic regression on the treatment arm is fitted, where the outcome is the binary indicator of the event (being a biomarker responder) and the covariates are . Then using this model, for each patient on the placebo arm the probability for is predicted. These probabilities are used as weights in the weighted estimation of the survival function using a weighted Nelson-Aalen estimator in the placebo group, which is thus the estimate of . The main modelling assumption here originates from the model assumed for determination of , i.e. the logistic regression.
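The weight model of the WPP approach can be sketched as follows, assuming a single baseline covariate so that the logistic regression can be fitted with a few Newton-Raphson steps using only the standard library. The function names and the toy data are hypothetical; in practice a statistics package and the full covariate vector would be used.

```python
import math

def predict_prob(a, b, x):
    """P(responder | x) under the fitted model logit(p) = a + b * x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson for a logistic regression with intercept and one covariate."""
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = predict_prob(a, b, xi)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            w = p * (1.0 - p)
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01   # determinant of the 2x2 information matrix
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Treatment arm (toy data): baseline covariate and biomarker responder indicator.
x_trt = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
s_trt = [1, 1, 0, 1, 0, 1, 0, 0]
a_hat, b_hat = fit_logistic(x_trt, s_trt)

# Placebo arm: predicted responder probabilities, used as weights.
x_pbo = [-1.5, 0.0, 1.5]
wpp_weights = [predict_prob(a_hat, b_hat, xi) for xi in x_pbo]
```

The resulting weights would then be passed to a weighted survival estimator applied to the placebo arm.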
3.3 Monotonicity and Equipercentile Assumption (MEA)
The MEA approach requires estimation of and (see Section 2.2.2). This will be done directly from the observed proportions based on the formulas in Section 2.2.2 that follow from the monotonicity assumption (8). For the quantity we have from the monotonicity assumption that where the last equation follows by consistency. For estimation of this quantity the Nelson-Aalen estimator will be used. The quantity will be estimated in the group of biomarker nonresponders on placebo, weighted according to the approach outlined in Section 2.2.2, where the weights are derived based on the biomarker value on placebo and the logistic function with center and sensitivity parameter that is chosen independently of the observed data. A weighted Nelson-Aalen estimator will be used.
4 Simulations
The purpose of this section is to evaluate the theory developed in Section 2 for a particular true data-generating model.
We will generate data for a parallel groups, event-driven randomized trial to compare active treatment against placebo. The simulation is loosely motivated by the CANTOS trial mentioned in Section 1, even though a few details used here are different. It is assumed that around 20% of the patients will have an event by year 5. The analysis will be performed once 850 events have been observed in total. Patients that did not have an event by this calendar time will be censored. The number of 850 events is chosen as this is approximately the number needed to detect a hazard ratio of 0.8 based on the log-rank test with significance level and power . Recruitment will be simulated according to a homogeneous Poisson process, that is, enrollment is assumed to be linearly increasing over time (uniformly distributed entry times). The yearly recruitment rate is set to be 1500 patients. For every patient we utilize two baseline covariates, and (in a real situation would of course be higher-dimensional). Here, is assumed to be the biomarker value at baseline and a general covariate that has a strong prognostic effect on the event time, and small effect on the postbaseline biomarker. For and we assume a bivariate normal distribution with means 0, standard deviations 1 and correlation of 0.25. The postbaseline biomarker level for patient is then simulated from the following linear model
(10) 
where and are the observed baseline covariates and the treatment indicator for patient . In the simulations the parameters will be chosen to represent a typical situation where there is a strong effect of treatment and baseline biomarker value on the postbaseline biomarker, but only a small effect of the covariate . The specific values assumed are , , and .
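The event target of 850 used in this simulation can be checked against the standard Schoenfeld approximation for the log-rank test with 1:1 allocation. The sketch below treats 0.8 as a hazard ratio; the two-sided significance level of 5% and the power of 90% are assumptions chosen for illustration and are not values taken from the text.

```python
from math import log
from statistics import NormalDist

def schoenfeld_events(hazard_ratio, alpha_two_sided, power):
    """Approximate number of events for a log-rank test with 1:1 allocation."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha_two_sided / 2.0)
    z_beta = std_normal.inv_cdf(power)
    return 4.0 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2

# Assumed design values (illustration only): two-sided 5% level, 90% power.
n_events = schoenfeld_events(hazard_ratio=0.8, alpha_two_sided=0.05, power=0.90)
# n_events is roughly 844, in line with "approximately 850"
```

Under these assumed design values the approximation is consistent with the stated target; other combinations of level and power give different event numbers.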
The event time for patient will be generated based on the postbaseline biomarker value for every patient as
(11) 
The parameters are chosen as follows: is chosen so that at 5 years an event rate of 20% is achieved (if all other covariates were 0). The parameters and represent the prognostic effect of the baseline variables: for a value of will be chosen, indicating a small effect of the baseline biomarker level, and for indicating a strong effect of this prognostic covariate.
The parameters and describe the treatment effect: denotes the effect of treatment independent of the postbaseline biomarker level, is the effect of the postbaseline biomarker level , while determines how much the treatment effect is modified by the postbaseline biomarker level. Different scenarios will be evaluated for and , corresponding to different assumptions on how the treatment effect is generated: (i) for , the postbaseline biomarker value has no effect on the outcome, and the treatment works only by other mechanisms, (ii) for , all the treatment effect is achieved through the modification of the biomarker, and (iii) and corresponds to situation (ii), but the treatment effect is modified by the postbaseline biomarker level. In each case the parameters are chosen such that the difference in the average log hazard rates between treatment and placebo group is equal to , to ensure that the overall “treatment effect” is similar across the scenarios. The exact parameter values are given in Table 3 in Appendix A.1. A detailed algorithm of how the event-driven trials are generated is given in Appendix A.2.
For the simulation we will set the biomarker threshold at , i.e., we focus on the subgroup of patients that achieve a postbaseline biomarker value less than on treatment. From model (10) for the postbaseline biomarker and with parameters as in Appendix A.1, the probability to achieve on treatment is around and around on placebo.
We estimate the difference in restricted survival time with . For estimation of the difference in survival we utilize .
We compare six approaches: PPR, WPP and MEA (with and ) and two further analyses which are simple and seemingly intuitive to do, but do not target the estimand of interest. The first additional method estimates the placebo survival curve in the subgroup of interest by utilizing the complete placebo group (called NAIVE_FULLPBO in what follows). The second approach only utilizes placebo patients that have on placebo (NAIVE_THRES). As discussed in the introduction, the NAIVE_FULLPBO approach is expected to overestimate the treatment effect (as the baseline population on the combined placebo group is “less healthy”), while for the NAIVE_THRES approach one would expect that it underestimates the treatment effect (as the baseline population of patients on the placebo arm that reach the threshold is “healthier”).
To evaluate how well the different approaches estimate the estimands of interest (difference in survival curve and restricted mean survival time), we need to derive the “true” survival curves on treatment and placebo in the subgroup of interest. In Appendix A.3 this is explained in detail.
In each scenario 5000 simulations were performed and Figure 2 displays boxplots for the estimation error of the difference in the restricted mean survival time up to year 5 (). As expected the naive approaches systematically over- (NAIVE_FULLPBO) or underestimate (NAIVE_THRES) the difference in restricted mean survival times, with the NAIVE_THRES approach having a much larger variability (due to the small sample size on placebo).
For the MEA approach one can see that, for our simulation settings, the method with  yields estimates that tend to overestimate the treatment effect, as expected. The performance for is adequate in scenario (i), where the outcome is independent of the postbaseline biomarker. For scenarios (ii) and (iii) the treatment effect is underestimated on average. This is because the equipercentile assumption is violated in these scenarios (see also Appendix A.3 on how the true survival differences were calculated). The WPP and PPR approaches perform well across all scenarios. This is expected as these approaches utilize the information on the true covariates (and in this sense also the true simulation model).
Overall similar results can also be observed for estimation of the survival differences and , see Figures 5 and 6 in the Appendix.
Note that the considered simulation setting is simple, but the purpose here was to investigate our semiparametric procedures in a case where naive approaches already fail.
5 Data application
For illustration we analyse a data set simulated under scenario (iii) above. In the simulated data set 850 events were reached after 5.9 years. The mean follow-up time for patients was around 3.9 years at that time.
Figure 3 shows four observed cumulative incidence rates, i.e. 1 − (estimated survival function), estimated using the Nelson-Aalen estimator. Two curves correspond to the overall estimates in the placebo and treatment groups. The other two curves correspond to patients on placebo and treatment with . About of the patients on placebo reached this value and about of the patients on treatment. Figure 3 shows that there were fewer events on the active treatment and also that a low postbaseline biomarker value leads to a smaller event rate.
Figure 4 shows six approaches to estimate the difference in the cumulative incidence rate. The first two are the PPR and WPP approaches discussed in Section 2. In addition we show four principal stratification approaches with the different values shown in Figure 1. Note that all six approaches only differ in the way that the placebo survival curve is estimated. The estimation of the survival curve in the subgroup of interest under treatment is the same for all approaches. The results are quite consistent with the simulation results in the previous section in the sense that the PPR and WPP approaches lead to quite similar results and the MEA approaches estimate an increased treatment effect with increasing as expected.
Table 2 presents numerical results for the PPR, WPP and MEA approaches for estimation of the survival differences , and . Confidence intervals have been calculated using sampling with replacement stratified by treatment group (nonparametric bootstrap).
Method  

WPP  0.026 (0.01,0.04)  0.064 (0.039,0.09)  0.183 (0.114,0.248) 
PPR  0.027 (0.012,0.042)  0.066 (0.043,0.093)  0.191 (0.123,0.256) 
MEA,  0.026 (0.01,0.042)  0.064 (0.037,0.09)  0.178 (0.103,0.245) 
MEA,  0.027 (0.011,0.042)  0.065 (0.039,0.09)  0.182 (0.111,0.249) 
MEA,  0.028 (0.013,0.044)  0.069 (0.045,0.093)  0.196 (0.128,0.265) 
MEA,  0.031 (0.016,0.046)  0.073 (0.048,0.096)  0.211 (0.145,0.28) 
One can see that in all cases and for all estimands of interest the confidence interval excludes 0, so a treatment effect is concluded in this setting, which is the right decision based on the simulation scenario upon which the data were generated.
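The resampling scheme behind the confidence intervals can be sketched as follows. Only the stratified sampling with replacement is taken from the text; the use of percentile intervals, the confidence level and the toy statistic are assumptions for illustration, and all names are hypothetical. In an actual analysis the statistic would rerun the complete estimation procedure (for example PPR, WPP or MEA) on each resampled data set.

```python
import random

def stratified_bootstrap_ci(trt, pbo, statistic, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval, resampling within each treatment group."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        trt_star = rng.choices(trt, k=len(trt))  # sample with replacement
        pbo_star = rng.choices(pbo, k=len(pbo))
        replicates.append(statistic(trt_star, pbo_star))
    replicates.sort()
    lower = replicates[int((1.0 - level) / 2.0 * n_boot)]
    upper = replicates[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper

def mean_difference(trt, pbo):
    """Toy statistic standing in for a survival or RMST difference."""
    return sum(trt) / len(trt) - sum(pbo) / len(pbo)

trt_values = [float(v) for v in range(10, 30)]  # hypothetical per-patient values
pbo_values = [float(v) for v in range(0, 20)]
ci_lower, ci_upper = stratified_bootstrap_ci(trt_values, pbo_values, mean_difference)
```

Resampling within each arm keeps the group sizes fixed in every bootstrap replicate, mirroring the randomized design.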
6 Conclusions
In this paper we considered estimation of the treatment effect in a subgroup defined by reaching a postbaseline biomarker measurement on treatment. This is challenging, because in a parallel groups trial we do not observe the event time that these patients would have had, had they been randomized to the placebo group. Three approaches are proposed based on different causal identifying assumptions. Two approaches (PPR and WPP) are based on a conditional independence assumption utilizing baseline covariates, stating that given a set of baseline covariates the potential biomarker outcome under treatment and the potential event time under placebo are independent. Utilizing this, one can use placebo information to estimate the placebo response for the patients in the subgroup of interest. The third approach (MEA) is based on principal stratification and utilizes the monotonicity and a relaxed equipercentile assumption with a logistic weighting function based on a sensitivity parameter . The approaches have been evaluated in a simulation study, where the proposed approaches show good performance compared to more naive approaches. Finally the methodology has been illustrated on a single simulated data set.
In this paper we focused on estimation of the survival difference at a specified time and of the difference in restricted mean survival time; both quantities are rather easy to interpret. In many clinical applications the hazard ratio is the measure to compare time-to-event endpoints, with the underlying assumption that the hazards are proportional. In our setting we do not make the assumption of proportional hazards and estimate the survival curves separately for placebo and treatment. Hence it is not immediately obvious how to summarize the ratio of hazard functions into a single hazard ratio. A graphical approach would be to plot the ratio of the cumulative hazard functions, which is easily obtainable from the already determined survival functions, to see whether values fluctuate around a particular value. In cases like the simulated example above (see Section 5), where there is a rather low event rate, and the cumulative hazard functions are approximately linear (like in Figure 3), one approach is to fit a simple exponential () to the survival curves and then derive an approximate hazard ratio based on the ratio of the fitted rates.
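The approximate hazard ratio described above can be sketched as follows. The fitting criterion is an assumption: the sketch fits each exponential rate by least squares through the origin on the cumulative hazard scale, i.e. on minus the logarithm of the survival curve evaluated on a time grid. Function names, the grid and the two curves are hypothetical.

```python
import math

def fit_exponential_rate(times, surv):
    """Least-squares slope through the origin of -log(S(t)) against t."""
    numerator = sum(t * -math.log(s) for t, s in zip(times, surv))
    denominator = sum(t * t for t in times)
    return numerator / denominator

# Hypothetical survival curves on a yearly grid (illustration only).
grid = [1.0, 2.0, 3.0, 4.0, 5.0]
surv_trt = [math.exp(-0.03 * t) for t in grid]
surv_pbo = [math.exp(-0.05 * t) for t in grid]

approx_hazard_ratio = (fit_exponential_rate(grid, surv_trt)
                       / fit_exponential_rate(grid, surv_pbo))
```

When the cumulative hazards are far from linear, a single fitted rate per arm is a poor summary and the plot of the cumulative hazard ratio is more informative.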
Acknowledgments
The authors would like to thank Daniel Scharfstein for sharing ideas
and discussions on the topic, as well as feedback on early versions of
the methods described in this paper. We would also like to thank
Heinz Schmidli, Simon Wandel and Nathalie Ezzet, who provided comments
on the methods and earlier versions of the manuscript.
References
 Ding & Lu (2017) Ding, P. & Lu, J. (2017), ‘Principal stratification using principal scores’, J.R. Statist. Soc. B 79(Part 3), 757–777.
 Frangakis & Rubin (2002) Frangakis, C. E. & Rubin, D. B. (2002), ‘Principal stratification in causal inference’, Biometrics 58(1), 21–29.
 Hansson (2005) Hansson, G. K. (2005), ‘Inflammation, atherosclerosis and coronary artery disease’, New England Journal of Medicine 352, 1685–1695.
 Hernán & Robins (2018) Hernán, M. A. & Robins, J. M. (2018), Causal Inference, Chapman and Hall/CRC, Boca Raton.
 ICH (2017) ICH (2017), ‘E9(R1) statistical principles for clinical trials: Estimands and sensitivity analysis in clinical trials (draft)’. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM582738.pdf.
 Imbens & Rubin (2015) Imbens, G. W. & Rubin, D. B. (2015), Causal inference in statistics, social, and biomedical sciences, Cambridge University Press.
 Jo & Stuart (2009) Jo, B. & Stuart, E. A. (2009), ‘On the use of propensity scores in principal causal effect estimation’, Statistics in Medicine 28, 2857–2875.
 Jo & Stuart (2011) Jo, B. & Stuart, E. A. (2011), ‘The use of propensity scores in mediation analysis’, Multivariate Behav. Res. 46(3), 425–452.
 Joffe et al. (2007) Joffe, M. M., Small, D. & Hsu, C.-Y. (2007), ‘Defining and estimating treatment effects for groups that will develop an auxiliary outcome’, Statistical Science 22(1), 74–97.
 Kern et al. (2016) Kern, H. L., Stuart, E. A., Hill, J. & Green, D. P. (2016), ‘Assessing methods of generalizing experimental impact estimates to target populations’, J. Res. Educational Effectiveness 9(1), 103–127.
 Pearl (2009) Pearl, J. (2009), ‘Causal inference in statistics: An overview’, Statist. Surv. 3, 96–146. http://dx.doi.org/10.1214/09-SS057
 Ridker et al. (2017) Ridker, P. M., Everett, B. M., Thuren, T., MacFadyen, J. G., Chang, W., Ballantyne, C., Fonseca, F., Nicolau, J., Koenig, W., Anker, S. D., Kastelein, J., Cornel, J. et al. for the CANTOS Trial Group (2017), ‘Antiinflammatory therapy with canakinumab for atherosclerotic disease’, New England Journal of Medicine 377(12), 1119–1131.
 Royston & Parmar (2011) Royston, P. & Parmar, M. G. (2011), ‘The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt’, Statistics in Medicine 30, 2409–2421.
 Rubin (1991) Rubin, D. (1991), ‘Comment on dose response estimands, by B. Efron and D. Feldman’, Journal of the American Statistical Association 86, 22–24.
 Schwartz et al. (2011) Schwartz, S. F., Li, F. & Mealli, F. (2011), ‘A Bayesian semiparametric approach to intermediate variables in causal inference’, Journal of the American Statistical Association 106(496), 1331–1344.
 Stuart & Jo (2015) Stuart, E. A. & Jo, B. (2015), ‘Assessing the sensitivity of methods for estimating principal causal effects’, Statist. Meth. Med.Res. 24(6), 657–674.
 Zigler & Belin (2012) Zigler, C. M. & Belin, T. R. (2012), ‘A Bayesian apprach to improved estimation of causal effect predictiveness for a principal surrogate endpoint’, Biometrics 68, 922–932.
Appendix A Simulations
A.1 True data-generating parameters
For the parameter the scenarios outlined in Table 3 will be utilized.
Scenario  Values 

(i) ,  
(ii) ,  
(iii) , 
A.2 Data generation for event-driven trial
The following steps explain how the survival data were generated in the simulation study.

Input
: enrollment rate per year
and : A fraction of of the subjects on placebo will have an event by year
number of events
: parameters for the models of the postbaseline biomarker and the event time
Calculate , the approximate number needed to recruit to achieve events by year .

For each patient simulate the following random variables (i.e. in total)

Generate i.i.d. exponential random variates with hazard rate and sort them in increasing order, to obtain the recruitment times for every patient

Generate baseline variables with mean vector equal to 0, marginal variances 1 and correlation 0.25.

treatment indicator

postbaseline biomarker value according to the linear model (10) with parameters and covariate values as generated in the previous steps.

the event time from the exponential model (11) with parameters and covariate values as generated in the previous steps.


Calculate the calendar times for the event times (i.e. for every patient add recruitment and event time).

Determine the calendar time by which 850 patients had the event. Remove all patients who were recruited after this calendar time. Event times that fall after this calendar time are right-censored, with this calendar time as the censoring time. For analysis, translate all times back to the study time.
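The steps above can be sketched in code as follows. This is only an illustrative sketch, not the simulation code used for the study: the exact forms of models (10) and (11) are not reproduced in this appendix, so the linear biomarker model, the log-linear exponential hazard, the use of two baseline covariates, 1:1 randomization and all parameter names (`recruit_hazard`, `beta_biomarker`, `beta_hazard`, `base_hazard`, `sigma`) are assumptions made for illustration.

```python
import numpy as np

def simulate_event_driven_trial(n_patients, recruit_hazard, n_events,
                                beta_biomarker, beta_hazard, base_hazard,
                                sigma=1.0, seed=0):
    """Sketch of the event-driven data generation (all model forms are assumed)."""
    rng = np.random.default_rng(seed)
    n = n_patients

    # Recruitment times: sorted i.i.d. exponential variates.
    recruit = np.sort(rng.exponential(scale=1.0 / recruit_hazard, size=n))

    # Baseline covariates: mean 0, variance 1, correlation 0.25
    # (two covariates are an assumption).
    cov = np.array([[1.0, 0.25], [0.25, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=n)

    # Treatment indicator (1:1 randomization is an assumption).
    z = rng.integers(0, 2, size=n)

    # Postbaseline biomarker from an assumed linear model:
    # intercept + treatment + covariates + normal error.
    design = np.column_stack([np.ones(n), z, x])
    s = design @ beta_biomarker + rng.normal(0.0, sigma, size=n)

    # Event time from an assumed exponential model with log-linear hazard
    # in treatment, covariates and biomarker.
    log_hazard = np.log(base_hazard) + np.column_stack([z, x, s]) @ beta_hazard
    t_event = rng.exponential(scale=np.exp(-log_hazard))

    # Calendar time of each event = recruitment time + event time.
    cal_event = recruit + t_event

    # Calendar time at which the n_events-th event occurs.
    cutoff = np.sort(cal_event)[n_events - 1]

    # Drop patients recruited after the cutoff; censor later events at the cutoff.
    keep = recruit < cutoff
    event = cal_event[keep] <= cutoff
    time = np.where(event, t_event[keep], cutoff - recruit[keep])  # study time

    return {"time": time, "event": event, "treatment": z[keep],
            "covariates": x[keep], "biomarker": s[keep]}
```

A call such as `simulate_event_driven_trial(3000, 1.0, 850, np.array([0.0, -1.0, 0.3, 0.3]), np.array([-0.2, 0.1, 0.1, 0.3]), 0.1)` returns a data set with exactly 850 observed events, provided at least 850 patients are simulated; the parameter values in this call are arbitrary placeholders, not those of Table 3.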
A.3 Calculation of true survival curves in subgroup of interest
In the data-generating model used in our simulations (Equations (10), (11)) the subgroup of patients of interest (i.e. those that achieve on treatment) can explicitly be characterized in terms of the baseline covariates and : The joint distribution of and conditional on and can be simulated by simulating the joint distribution given , and then removing the observations where .
Having obtained this joint distribution for , and in the subgroup of interest (patients with ), these values can be plugged into the true exponential survival curve (see (11)) and the resulting survival curves averaged to obtain the population survival curve under treatment. To obtain the population survival curve under placebo, a biomarker outcome under placebo is first simulated for each patient based on their covariates , (according to (10)), and then , and the simulated biomarker value under placebo are plugged into the true exponential survival curve. The resulting survival curves are then averaged to obtain the population survival curve under placebo.
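The Monte Carlo calculation described in this section could be sketched as below. As before, the model forms are assumptions for illustration only: a linear biomarker model with normal error, an exponential event-time model with log-linear hazard, two baseline covariates, and a responder definition of "biomarker on treatment below `threshold`". All function and parameter names are hypothetical.

```python
import numpy as np

def true_subgroup_survival(times, threshold, beta_biomarker, beta_hazard,
                           base_hazard, sigma=1.0, n_sim=100_000, seed=0):
    """Monte Carlo sketch of the true survival curves in the responder subgroup."""
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)

    # Baseline covariates as in the data-generating model.
    cov = np.array([[1.0, 0.25], [0.25, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=n_sim)

    # Biomarker under treatment; keep only patients who respond on treatment.
    ones = np.ones(n_sim)
    s1 = np.column_stack([ones, ones, x]) @ beta_biomarker \
        + rng.normal(0.0, sigma, size=n_sim)
    responder = s1 < threshold  # assumed direction: lower values = response
    x, s1 = x[responder], s1[responder]
    m = int(responder.sum())

    # Biomarker under placebo, simulated for the same patients.
    s0 = np.column_stack([np.ones(m), np.zeros(m), x]) @ beta_biomarker \
        + rng.normal(0.0, sigma, size=m)

    def mean_survival(z, s):
        # Patient-specific exponential survival curves, averaged over patients.
        lin = np.column_stack([np.full(m, z), x, s]) @ beta_hazard
        hazard = base_hazard * np.exp(lin)
        return np.exp(-np.outer(times, hazard)).mean(axis=1)

    return mean_survival(1.0, s1), mean_survival(0.0, s0)
```

The function returns the averaged survival curve under treatment and under placebo, evaluated at `times`, for the same simulated set of patients who respond on treatment.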
A.4 Additional Simulation Results
Appendix B Events before and missing postbaseline biomarker measurements
The timepoint at which the postbaseline biomarker is measured should be relatively shortly after initiation of treatment. A subgroup defined by a biomarker measurement taken long after baseline will not be of clinical interest. Thus, in general, few events are expected before the measurement time, and these are unlikely to influence the overall analysis much. Nevertheless, in many practical situations there will be some such events, and in this section we discuss how to handle them formally. A related issue that we also discuss is missing postbaseline biomarker values (i.e. situations where a biomarker value could have been measured at the measurement time but was not).
An important consideration in handling events before the measurement time in the case of composite events is to distinguish between death and nonfatal events. Let the time to death be . In general the subgroup of interest should then be defined by , i.e., the subgroup of patients who would, on treatment, be alive and biomarker responders. An assumption that is plausible in some situations is to assume that implies and vice versa. That is, patients not dying under treatment before the measurement time would also not have died under placebo (and vice versa). This assumption would justify an analysis in which patients dying before the measurement time are excluded, and the approaches suggested in Section 2 could be performed as described. If it is assumed that the population of patients dying before the measurement time differs between treatment and placebo, the analyses would need to be modified to identify or weight patients on the placebo arm according to whether they would have died on the treatment arm by the measurement time.
In principle, nonfatal events could be handled in exactly the same way. One could define the subgroup by , i.e., the subgroup of patients who would, on treatment, be event-free and biomarker responders. Whether or not this is more relevant than the subgroup is a clinical and pharmacological question: Are nonfatal events likely due to the fact that the drug “does not work” (i.e. supporting the decision of stopping the treatment after the event, which would mean it is more relevant to focus on the subgroup ), or is it too early to say that, because the drug effect did not fully materialize by the measurement time? If it is considered relevant to continue treating patients even in case of an event before the measurement time, the more relevant subgroup would be given by . In this situation the approach would be to include the events before the measurement time in the analysis.
A different issue is that there will always be patients for whom the postbaseline biomarker measurement is missing although it was in principle measurable. It is not appropriate to remove these patients from the analysis, as they might differ systematically from the patients for whom the biomarker measurement is available. One way of approaching this is to impute the missing postbaseline biomarker values, using for example the baseline biomarker level as well as other covariates to impute the postbaseline biomarker value on treatment. For each completed, multiply imputed data set, the rest of the analysis would be conducted as described in Sections 2 and 3, and the results appropriately combined at the end.
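The text does not specify how the results from the imputed data sets are combined. One standard choice, assumed here purely for illustration, is Rubin's rules, applied on a scale where the estimate is approximately normal (for example a log hazard ratio or a difference in restricted mean survival time). The function name `pool_rubin` is hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and variances using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)   # point estimate per imputed data set
    u = np.asarray(variances, dtype=float)   # squared standard error per data set
    m = len(q)

    pooled = q.mean()                  # pooled point estimate
    within = u.mean()                  # average within-imputation variance
    between = q.var(ddof=1)            # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total
```

For example, `pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` gives a pooled estimate of 2.0 and a total variance of 0.5 + (4/3) · 1.0 ≈ 1.83.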