Combining Survival Trials Using Aggregate Data Based on Misspecified Models

Combining Survival Trials Using Aggregate Data Based on Misspecified Models

Tinghui Yu 111FDA, Center for Devices and Radiological Health., Yabing Mai 222Merck Research Laboratories., Sherry Liu , Xiaofei Hu
July 9, 2019


The treatment effects of the same therapy observed from multiple clinical trials can be very different. Yet the patient characteristics accounting for the differences may not be identifiable in real practice so that it is necessary to estimate and report the overall treatment effect for the general popoulation during the development and validation of a new therapy. The non-linear structure of the maximum partial likelihood estimates for the (log) hazard ratio defined with a Cox proportional hazard model leads to challenges in the statistical analyses for combining such clinical trials. In this paper, we formulated the expected overall treatment effects using various modeling assumptions. Then we proceeded to propose efficient estimates together with a version of Wald test for the combined hazard ratio using only aggregate data. Interpretation of the methods are provided in the framework of robust data analyses involving misspecified models.

Keyword: combining survival trials, misspecified models, harmonic average.

1 Introduction

Multiple clinical trials may be performed to validate a newly developed therapy to account for the variability of the targeted patient population. The data collected from each individual clinical trial may be used to generate the efficacy (or safety) estimates for each specific trial as well as the overall efficacy estimate for the overall population to support the effectiveness of the therapy. In this paper, we will focus on the analyses of time-to-event data () using Cox proportional hazard models. Let be the covariates available for each enrolled patient. Given , the proportional hazard model assumes that the hazard function for a patient from the -th () trial can be writtens as

where is the baseline hazard function of the -th trial with unknown formulation, is the vector of log hazard ratios defined specifically for the patients enrolled in the -th trial.

Efficient estimates of the trial specific based on maximal partial likehood (MPLE) were well developed since the days of Cox (1972, 1975). In this paper, we are more interested in the methods for statistical inference based on the information contained in the pooled data from all trials. Meta analyses were invented to address such needs. Yuan (2009) provided a comprehensive review of the statistical methods used to generate an overall (log) hazard ratio estimate based on multiple clinical trials. The convex combination of all the log hazard ratio estimates from each individual trial is a popular choice (Wei, Lin and Weissfeld 1989):


Here () denotes the MPLEs of derived from the patient data collected from the -th trial. The weights can be arbitrary constants sum up to unity, among which the inverse variance scheme may be the most popular option due to the obvious advantage of minimized variance among the family of all convex combinations of ’s. Though less common, some researchers also recommend using the linear combination of the hazard ratio estimates for any given covariate value :


To account for the differences in each due to randomness inherited from the observed data, DerSimonian and Laird (1986) provided an overall treatment effect estimate based on random effect models. Yuan continued to propose a meta-ANOVA model and a meta-polynomial model to address the differences in ’s due to mis-match of baseline information from different trials. These methods require a large number of trials () to support the regression algorithms. Here we are only interested in the cases where very limited number () of trials are deliverable.

As far as we know from the research literature, all the statistical models developed for meta-analyses assume that there exists a set of “true” baseline hazard function as well as a unique “true” log hazard ratio (vector) through out all the trials designed for the same therapy. The baseline may vary by , while has to be consistent for all the trials. The differences in the estimated log hazard ratios are either attributed to the randomness of the observed outcomes or the incompleteness of the covariate data. However, in some cases, such assumptions may be far from the truth. The treatment effect on different populations may be essentially different due to discrepancies in some latent patient characteristics. The problem is even more acute if the therapy under investigation targets specific gene expressions. The human genome is overwhelmingly complex such that even the most well devised targeted treatment can be subject to unexpected impacts from genes outside the targeted region. Hence the true values of ’s may vary with because the patient populations enrolled for different trials are in fact heterogeneous with regard to their responses to the therapy. Even though the inclusion/exclusion criteria and the design of these trials may appear to be perfectly aligned, it cannot help to suppress the differences in the patients recruited for different trials. It is impossible to adjust the hazard ratio estimates for these latent controlling factors because they are usually unknown to the researchers or not able to be detected by any currently available technology/assay. Instead, the researcher has to compromise with an overall treatment effect estimate derived from the pooled data, or from the summary statistics reported for each trial if patient line data are not available.

Mehrotra et. al. (2012) investigated the effect of deviation from the constant hazard ratio assumption underlying the standard stratified survival analyses. Our method proposed in this paper can be used to establish an efficient overall treatment effect estimate covering multiple stratums of high variability. Denne et. al. (2014) and Li (2014) provided another good example illuminating the necessity of considering different treatment effects () for different trials. One version of an assay may be used to enroll patients by a biomarker for a validation trial. Another version of the assay targeting the same biomarker may be needed when the treatment is ready for marketing due to advances of technology or solely to reduce the cost and time for patient screening. The two versions of the assay may not perfectly match with each other. The discrepancies in the test results may reflect some differences in the patient’s biological profile which are not intended to be captured by either assay. Such differences may affect the patient responses to the therapy. Hence the intent-to-treat patient population is divided into multiple subgroups with different assay result combinations. It is reasonable to assume that the patient responses to the same targeted treatment are in fact different (or even opposite as was observed in some real life examples) across these subgroups defined by both assays (Sargent et. al. 2005). However, only the market-ready version of the assays will be available to the patients/client laboratories so that it is the overall treatment effect covering all subgroups, rather than the treatment effect on patients with specific test result combinations, that concerns the developer.

The linear estimate defined by (1) or by (2) was used as the estimator for the overall treatment effect in these mentioned publications. The linear estimates has its own merit. It is easy to be implemented for calculation and the derivation of its variance is straightforward. It is guaranteed to be close to the correct answer if the true values across different trials only differ by a small amount of no practical significance. However, if the ’s are very different, despite the choice of the weights (), the usage of a linear estimate is not mathematically justifiable because each component of the linear combination converges to a completely different . It is not very likely that the limit of the estimator () is an exact measurement of the overall treatment effect because the (log) hazard ratio defined by the Cox proportional hazard model is obviously nonlinear with regard to the observed survival times. As a matter of fact, we will show that in most cases (and ) defined by the inverse variances coefficients or by the proportion of trials sizes leads to over estimate of the true overall hazard with the treatment.

The major concerns and challenges for a pharmaceutical researcher facing data collected from multiple trials can be summarized by two questions. First, what is the proper definition of the overall treatment effect? Hazard ratio is very different from the other commonly used endpoints such as the mean/median survival time and response rate. The later resorts to a natural measurement (in most cases, counting) of observed events or time, while hazard ratio is an artificial concept specifically invented for the Cox proportional hazard model. The definition of a “hazard ratio for the overall population” is ambiguous now that the Cox model is no more applicable to a mixed population. In the ideal case one may have all the exact knowledge about the true underlying performance of the therapy including but not limited to the shapes of the baseline hazards and the values of the log hazard ratios . That being said, the overall treatment effect is not a readily defined value as a unique functional of all such baseline functions and parameters. For the first time in the literature, we point out that the problem has to be addressed in the misspecified model framework. Typically the target to be estimated with a misspecified model is the limit of a chosen estimator. Its definition has to depend on the modelling assumptions chosen by the researchers. We will demonstrate two kinds of such definitions in this paper and provide explaination for their very delicate differences. The second question naturally comes after the first one, i.e., how to generate a statistically efficient estimate for the overall drug efficacy after a proper definition? It is even more challenging if patient line data are not available but only aggregate statistics can be accessed for some of the trials.

2 Combined hazard ratio as a limit of the MPLE from pooled data

It is sufficient to consider only two independent trials. Generalization of the results to cover more trials is straightforward. Assume that patients were enrolled for the first clinical trial. The survival times and baseline demographics of the patients are denoted . Here is the right-censored survival time of the -th patient, indicates that is censored or an observed event time, is a -dimensional covariate with probability density function . In many cases the distribution may vary by the trials. For simplicity, we assume the same distribution for all the trials in this paper. Extension of our methods to accommodate different distributions for the covariates is obvious. Assuming independent censoring, the patients’ survival times are i.i.d.’s with proportional hazard

By definition, the pdf of an uncensored (i.e., ) conditioned on is

where is an arbitrary baseline hazard function of the first trial and is the corresponding culmulative hazard. Similar notations can be defined for the second trial. Let , be the data collected from the second trial. Assuming proportional hazard and independent censoring, the patients from the second trial has hazards

The trial-specific treatment effect estimates ( and ) and their asymptotic variance-covariance matrices can be easily estimated using the MPLEs calculated from the and data respectively. We are particularly interested in the setup where the underlying true values of . Without loss of generality, let the first component of the covariates () be the arm indicator for the -th patient. The patient is in the treatment arm if or he is in the control arm if . The value of and , the first component of the regression parameter and respectively, are of most concern as a measurement of the treatment effect. It is convenient to assume for all the discussions presented in this paper.

It is natural to consider the pooled patient line data

Most researchers consider the MPLE calculated from all (denoted by ) pooled patient line (PL) records not only as an appropriate estimate for the overall treatment effects but also the best (especially when compared to the linear estimates or ) answer to our first question. The only obvious drawback of the MPLE is that it requires the knowledge of all patient line data:


where is the set of labels for those patients (originally from or ) who are at risk at time .

The overall log hazard ratio is hence defined as the limit of when both and . It is worth to point out that one needs to first determine what is an appropriate statistic for the overall treatment effect based on both statstical and clinical thinking. Then the parameter to be estimated follows as the limit of the statistic, not vice versa. A different version of the overall treatment effect can be as valid based on other assumptions about the estimating procedure. We will extend the discussion to provide such an example in the Section 4.

It is equivalent to imposing a misspecified Cox model (working model) on the pooled data such that the combined hazard can be written as


Assume that the two trials have the same baseline hazard for all because the control arms are usually subject to standard of care. Such treatments are well established for the general population. The true pdf of the pooled data is a mixture of two proportional hazard models

where as is a fixed ratio of the sample sizes controlled by the researcher. The true model for does not satisfy the proportional hazard assumption:

The formulation of the limit of can be studied using the techniques developed for misspecified models in Struthers and Kalbfleisch (1986) and Lin and Wei (1989). Let be the true hazard function of the -th patient from the pooled dataset and be the at-risk process at arbitrary time , . It is convenient to define the notations following the convention of Andersen and Gill (1989), Struthers and Kalbfleisch (1986) and Lin and Wei (1989):

Here the expected values are defined with respect to the true distribution of and .

Proposition 1. (based on Theorem 2.1 of Lin and Wei (1989)) Let be the MPLE of the log hazard ratios for the overall treatment effect as defined in (3). When , converges in probability to the unique solution of the following equation:


Without censoring it follows the definition of and the hazard that

The right hand side of the above formula is simply the pdf of by the definition of the hazard . It can be written as or respectively, in the form of

depending on whether the -th patient is from the first or the second trial. Hence and can be simplified by the following calculation

Similarly and can be simplified using in the form of

Expand the shorthand notations defined for Proposition 1, we have

All the above notations turn out to depend on no random variables other than the covariate ’s. The subscripts for are suppressed with the assumption that the ’s are independent and identically distributed. Under mild smoothness conditions, the first term of equation (5) can be simplified by switching the order of the integrals:

Plug in (5) with the definitions of the pdf’s and survival functions. For sufficiently large and , substitute by , i.e., the fixed ratio of the study sizes. Eliminate by letting and hence . We derived an equation for the definition of the overall treatment effect.

Corollary 1.1 Assume no censoring in the data and . The true treatment effect (log hazard ratios) from either trials are known and denoted by and respectively. The overall log hazard ratio defined as the limit of with is the unique solution to the following equation:


The distribution of the covariates can be well approximated using data from the general intent-to-treat population. Once the distribution of is known, equation (6) can be solved numerically.

Usually one needs all patient line data to define the MPLE as well as its sandwich-type robust variance estimate. Nonetheless, if only aggregate data such as the MPLEs and from individual trials are available for some reason, one may solve equation (6) with and substituted by their estimates and . The solution to such an equation, which is henceforth denoted by (the subscript stands for “misspecified model”), does not rely on the knowledge of the baseline hazard . It is a semiparametric, asymptotically efficient estimate to because the MPLEs and are based on maximum likelihoods. Such procedures are well known for being invariant with regard to functional transformations.

The condition of no censoring in Corollary 1.1 is natural for the definition of because we are only interested in the performance of the therapy. Censoring is considered as noise imposed on the observed survival times and should be excluded from the estimating procedure if at all possible. The MPLEs and reported for the individual trials are (asymtotically) unbiased for the underlying log hazard ratios and even if the data and are censored. Therefore always remains an unbiased estimate for the overall treatment effect no matter the data are censored or not.

Example 1. Usually the effects of the covariates are assumed to be sorted out by proper randomization. One of the most important analysis used in practice use the treatment arm indicator as the only covariate with . Denote the hazard ratio of the first trial by , the hazard ratio of the second trial by . By corollary 1.1, the MPLE calculated from the uncensored pooled line data of patient records converges to the solution of the following equation about :


Equation (7) can be simplified after a series of basic algebraic transformations:


It is easy to see that the right-hand-side of (8) is a strictly decreasing, convex function of . It implies that the overall hazard ratio defined by the solution to equation (8) must take a value on the open interval . In Appendix 5.2, we will also show that in most cases is superior to (i.e., smaller than) the linear alternative and .

Now take a closer look at . It is conventionally considered as the best estimate to the overall log hazard ratio of the combined trials. However, is actually defined to be the limit of the uncensored . The estimator is subject to the impact of censoring and can be biased for . By (6), the value of only relies on , and the distributions of .

Corollary 1.2 Assume independent right censoring for both and such that the survival function of the censoring times are denoted and respectively. The limit of with is the unique solution to the following equation with respect to with known values of and :


With a mixed population, censoring contains information about the source of the data, which is correlated with the length of the subject’s expected survival time beyond censoring. The limit of must contain all such information as the censoring mechanism , and the baseline hazard .

Example 2. Assume the same setup as in Example 1. The survival times observed from either trials follow Cox models defined with a treatment arm indicator . Both clinical trials will be terminated at given time . Hence the censoring time has a point mass of unity at . That is, if and 0 otherwise. Expand the definitions of the distribution functions and use the change-of-variable technique by letting in equation (9), it becomes


The limit of the MPLE calculated from the censored line data is the solution to the above equation. To study the bias in , we need to compare against , which is the solution to equation (8). For simplicity, denote the integrand in equation (10) by for fixed and . Equation (8) asserts that is a well defined pdf because its integral on equals unity. Consider the fact that (Appendix 5.2). It is easy to prove that the fraction term

if and it is less than 1 otherwise. Hence the pdf only intersects a standard Exponential pdf at one point, which in turn implies that the distribution defined by is stochastically smaller than a standard Exponential distribution (Fill and Machida 2001). Therefore, the cdf corresponding to is always bigger than that of the standard Exponential distribution for any given :

Note that is monotonically decreasing with respect to . To make an equality as in equation (10), its solution must be greater than . When the survival time data are censored at a maximum allowable trial length , is always associated with a positive bias. The bias decreases with .

Corollary 1.2 indicates that the most accepted MPLE is not robust against variability in the treatment effects observed from multiple trials. Hence we recommend reporting the overall log hazard ratio for multiple trials using rather than . The former is not only robust (unbiased despite of censoring) but also provides better chance for the researchers because it only requires aggregate statistics from each sub-population of concern. The following example illustrates the impact of censoring on using simulated data.

Example 3. We performed 1000 rounds of independent simulations to mimic the following scenario: the survival times of 200 patients treated in the first trial follow an distribution. With 1:1 randomization (i.e., ), another 200 patients in the control group has survival times sampled from a standard exponential distribution . Let , the second trial enrolls 85 patients for the treated group and 85 patients for the control group. To mimic a lower drug efficacy, the survival times of the treated patients from the second trial are sampled from an distribution, while the survival times of the control group patients are sampled from .

Without censoring, we can calculate for each of the 1000 simulated data set and summarize the distribution of using its empirical distribution. In this example, it was reported that (equivalent to ) with a 95% confidence interval of . According to Corollary 1.1 this is an (asymtotically) unbiased estimate for the true overall log hazard ratio as there is no censoring.

We continued to censor the 1000 simulated data sets using various censoring time and tried to calculate the estimates and respectively for each specific . In Figure 1(i) ’s are reported as the solution to equation (8) with and substibuted by and , which are calculated from the censored trial 1 and trial 2 data respectively. On the other hand, in Figure 1(ii) the MPLE ’s are calculated using (3) with all 570 censored line data from either trials pooled together.

Figure 1: (i) Mean and the 95% CI for . (ii) Mean and the 95% CI for .

The gray lines in Figure 1 outline the true overall hazard ratio (-0.926) and its 95% confidence interval . The bias of can be easily observed in Figure 1(ii). When the trials are censored at , about 51% of the collected data are censored. The MPLE reported for this scenario has a mean of -0.854 and a 95% confidence intervals of . It accounts for about 7.8% of positive bias in the log hazard ratio estimates. Such a result can be of serious concern for those clinical trials expected to be associated with low event rates, e.g., a trial for breast cancer treatments. For the same censored dataset, displayed in Figure 1(i) appears to be unbiased for . The confidence interval of tends to be wider with more data being censored because the variance of the log hazard ratio estimate is proportional to the inverse of the number of observed events (Kalbfleisch and Prentice (2002)).

3 Combined hazard ratio via harmonic means

We noted that and are semi-parametric estimators without requiring any knowlege about the baseline. However, the baseline hazard of the misspecified model (4) is different from the baseline hazard defined for each individual clinical trial. It is due to the fact that the MPLE procedure is the result of simultaneous maximization of the unknown parameter and the discretized baseline hazard in the form of a Breslow hazard estimate (Breslow (1972), Johansen (1983)). The shape of the baseline hazard is twisted to fit the mixed event times and in return leads to a slightly under-estimated therapy effect, or equivalently an over-estimated log hazard ratio in many cases. We will propose another method to avoid such undesirable effect.

Example 4. Let and be the i.i.d. event times observed from the treatment and control groups of the first trial, and be the event times recorded from the treatment and control groups of the second trial. Without loss of generality, assume . No censoring is allowed for simplicity. Provided the complete patient line data, it is easy to estimate the treatment effects of the therapy in either clinical trial using the MLEs and . Assuming a misspecified model for the pooled data, it is natural to estimate the overall treatment effect using

Again, the overall hazard ratio can be defined as the limit of the chosen estimate . It turns out to be the harmonic mean of the individual trial effects and weighted by the ratio of the study sizes . As a matter of fact, the harmonic mean effect is smaller than that defined by the semiparametric MPLE method in most cases (Appendix 5.2):

where is the solution of equation (8) as well as the limit of the MPLE calculated using without assuming an underlying exponential hazard. The difference between the two versions of the overall treatment effect estimate can be attributed to a twisted baseline hazard approximate due to the MPLE procedure. Consider the two trials respectively, the Breslow hazard estimates

calculated from either sets of patient line data are both asymptotically unbiased for the true underlying linear culmulative hazard function . Here the only covariate if the patient is in the control group and 1 if the patient is in the treatment group. The numerator of the fraction is always equal to one because there are no censoring. However, the Breslow estimate calculated from the pooled data ’s does not lead to a constant hazard estimate over time. At arbitrary time , the non-parametric baseline hazard as a limit of the Breslow estimate is


The derivation of the above formula is provided in Appendix 5.1. Note that

Unless , the shape of is tilted to the right by the fact that and (Appendix 5.2). For sufficiently large , is monotonically decreasing and happens at only one time point . The trend in is consistent with the monotone changes in the mixture proportion of the complete population. At the very beginning, the estimated baseline hazard is bigger than one. The proportion of treated patients from the first trial increases with time because the patients from the second trial have shorter expected life () and finally the baseline hazard estimate is dominated by , the effect of the first trial, and scaled by such that the average baseline hazard is close to one. Hence a smaller combined effect (i.e., bigger hazard ratio ) is given by the MPLE procedure compared to the harmonic mean effect based on the parametric Exponential model. In a sense, the MPLE procedure is an overfit to the data if the researcher is confident about the fact that the control groups from either trial are not essentially different.

It is curious to see that the harmonic mean type of definition for the combined trial effect can be extended to address statistical models assuming much more general conditions where neither the underlying exponential distribution nor the univariate covariate structure is needed. Let denote the observed survival times of patients recruited for a clinical trial. Their uncensored survival times follow a proportional hazard model with hazard . Similarly, the survival times () of another patients recruited for a second trial also follow a proportional hazard model with the same baseline hazard and a log hazard ratio . Here is a -variate vector of the same structure as but of different values. The formulation of the baseline hazard is unknown. The MPLE is not appropriate for the estimation of the log hazard ratio () for the combined population if one needs to avoid a twisted baseline hazard. Instead, we set out to define the MLE for . It leads to a different version of a semi-parametric estimate (, where the subscript stands for “harmonic mean”) for the overall log hazard ratio. It is based on aggregate statistics only. Patient line data are not required for the realization of .

Proposition 2. Let be the maximum likelihood estimate (MLE) for the overall log hazard ratio when a proportional hazard model with the following pdf is fitted to the pooled surival time records:


When , converges in probability to a constant , which is the unique solution to the following equation with respect to :


Proof. Again, we assume no censoring when trying to define the “true” values of the overall treatment effect because no information other than the measurements of the treatment effect should be of concern. Such conditions can be loosen when we get to the discussions about the estimating procedures for the treatment effects established here.

Assuming the misspecified proportional hazard model (12), the joint pdf of the observed survival times is

The derivative (w.r.t. ) of the logarithm of the misspecified joint pdf is

By the law of large number,

where stands for the expectation defined in terms of the true mixed model with pdf :

This is a typical setup of an estimating equation. Under mild regularity conditions, it can be proved that the MLE defined by the solution to the equation converges in probability to a , which is the solution to


Detail discussions about the asymptotics of the MLE derived from misspecified models are available in White (1982). Note that (14) is equivalent to


To simplify the right hand side of (15), we calculate

Note that . The third line of the above equation was due to the following definition of notations: , and . Equation (13) follows (15) with the definition of plugged in.

Following the discussions about misspecified models in White (1982) and Akaike (1973), it can be seen that equation (14) actually defines as the maximizer of the following expectation:

Hence, has an obvious geometric interpretation. It minimizes the Kullback-Leibler distance between the candidate working models and the true model :

Now we have another version of the definition for the overall log hazard ratio . It can be slightly smaller than in most cases (Appendix 5.2). Both of these two numbers are valid measurements of the treatment effect on the combined population, though they are based on different modelling assumptions. We consider a number closely related to the harmonic mean. Such an idea can be illustated by the following example.

Example 5. Assume the same setup as in Example 1. The treatment group indicator is the only covariate collected for the enrolled patients. The observed survival times (with or without censoring) follow proportional hazard models with hazard ratios and respectively for trial 1 and 2. By equation (13), the overall hazard ratio is the solution to the following equation:


Note that exponential survival times are not required in Example 5. The harmonic type of calculation (16) is applicable to any general proportional hazard modelling setup.