Equivalence of regression curves sharing common parameters

# Equivalence of regression curves sharing common parameters

Kathrin Möllenhoff, Frank Bretz , Holger Dette
Department of Mathematics, Ruhr-Universität Bochum, Germany,
Novartis Pharma AG, CH-4002 Basel, Switzerland
###### Abstract

In clinical trials the comparison of two different populations is a frequently addressed problem. Non-linear (parametric) regression models are commonly used to describe the relationship between covariates as the dose and a response variable in the two groups. In some situations it is reasonable to assume some model parameters to be the same, for instance the placebo effect or the maximum treatment effect. In this paper we develop a (parametric) bootstrap test to establish the similarity of two regression curves sharing some common parameters. We show by theoretical arguments and by means of a simulation study that the new test controls its level and achieves a reasonable power. Moreover, it is demonstrated that under the assumption of common parameters a considerable more powerful test can be constructed compared to the test which does not use this assumption. Finally, we illustrate potential applications of the new methodology by a clinical trial example.

Keywords and Phrases: Similarity of regression curves, equivalence testing, parametric bootstrap, nonlinear regression, dose finding studies

## 1 Introduction

Regression models are commonly used to describe the relationship between multiple covariates and a response variable. In certain applications, more than one regression model is available, such as when assessing the relationship between the covariates and the response variables in more than one population (e.g. in males and females). It is then often of interest to demonstrate the equivalence of the regression curves: If equivalence can be claimed, conclusions can be drawn from the pooled sample and a single regression model is sufficient to describe the data. This can be achieved by testing a suitable null hypothesis that the distance between the regression curves (measured in an appropriate sense) is smaller than a pre-specified equivalence margin at a controlled Type I error rate. Note that the problem of equivalence testing, as considered in this paper, is conceptually different from the more frequent problem of testing for equality of curves and is much less studied in the literature due to methodological difficulties.

The problem of testing for equality of regression models has been intensively discussed in the nonparametric context and we refer to the recent work of Feng et al. (2015), whichcontains a rather comprehensive list of references. In applied regression analysis, however, parametric models are usually preferred to a purely nonparametric approach as they admit a direct interpretation of the observed effects in terms of the model parameters. In addition, the available information of the observations is increased by applying more efficient estimation or test procedures, provided that the assumed model is valid. Despite its importance, the problem of establishing equivalence of two parametric regression models while controlling the Type I error rate has only recently found attention in the literature. Using the intersection-union test device from Berger (1982), Liu et al. (2009) investigated the assessment of non-superiority, non-inferiority and equivalence when comparing two regression models over a restricted covariate region. Building upon this work, Gsteiger et al. (2011) derived equivalence tests based on simultaneous confidence bands for nonlinear regression models, with application to population pharmacokinetic analyses. Likewise, Bretz et al. (2016) assessed the similarity of dose response curves in two non-overlapping subgroups of patients. Alternatively, Dette et al. (2018) suggested directly estimating the distance between the regression curves and using a non-standard bootstrap test to decide for equivalence of the two curves if the estimate is less than a certain threshold. Expanding this approach, Moellenhoff et al. (2018) assessed the comparability of drug dissolution profiles via maximum deviation, whereas Hoffelder (2018) demonstrated the equivalence of dissolution profiles using the Mahalanobis distance; see also Collignon et al. (2018).

In these papers, the authors assumed that the regression models have different parameters and can therefore be evaluated separately. In some applications, however, this assumption cannot be justified and it is more reasonable to assume that the regression models may have some common parameters. The total number of parameters to estimate is then reduced to the common and remaining parameters of each model, affecting the asymptotic behavior of the estimators. Consider, for example, the Phase II dose finding trial for a weight loss drug described in Bretz et al. (2016). This trial aimed at comparing the dose response relationship for two regimens administered to patients suffering from overweight or obesity: Three doses each for once daily (o.d.) and twice daily (b.i.d.) use of the medication, and placebo. It is reasonable to assume that the placebo response is the same under both the o.d. and the b.i.d. regimen. Since the regression models typically used for dose response modeling contain a parameter for the placebo response (Pinheiro et al. (2006)), they will thus share this common parameter for both the o.d. and the b.i.d. regimen. In some instances, it might even be reasonable to assume that the maximum efficacy for high doses is similar in both groups. Moreover, clinical trial sponsors may even decide to use the same placebo group for logistical reasons. The response of each patient on placebo is then used twice in the estimation of the o.d. and b.i.d. dose response models, further complicating the statistical problem.

In this paper, we investigate the equivalence of two parametric regression curves that share common parameters. In Section 2 we first introduce the regression models to be estimated under the assumption of common parameters. We then develop a non-standard bootstrap test which performs the resampling under the constraints of the interval hypotheses implied by the equivalence test problem. The new tests improves the procedure proposed in Dette et al. (2018) using the additional information of common parameter in both groups. We also discuss testing the equivalence of model parameters to assess whether the assumption of common parameters is plausible. In Section 3 we investigate the finite sample properties of the proposed bootstrap test proposed in terms of power and size. In Section 4 we illustrate the methods using a multi-regional clinical trial example where it is conceivable that the placebo and maximum treatment responses are the same across geographic regions but the onset of treatment differs due to intrinsic and extrinsic factors (Malinowski et al. (2008); ICH (2017)). Technical details and proofs are deferred to an appendix.

## 2 Methodology

### 2.1 Models with common parameters

Let

 Yℓ,i,j=mℓ(dℓ,i,βℓ)+ηℓ,i,j , j=1,…,nℓ,i, i=1,…,kℓ, (2.1)

denote the observed response of the th subject at the th dose level under the th dose response model , where denotes the index of the two groups under consideration. We assume that the (non-linear) regression model is parametrized through a -dimensional vector , . Note that the regression models and may be different. Likewise, the parameters and may be different even if . We further assume that the error terms are independent and identically distributed with expectation and variance . The dose levels may be different in both groups but they are attained on the same (restricted) covariate region . In this paper is assumed to be the dose range, although the results can be generalized to include other covariates. Further, denotes the sample size in group where we assume observations in the th dose level (. The sample sizes can be unequal and the total number of observations is denoted by .

In this paper we consider the situation, where the regression models have some common parameters. More precisely, we assume without loss of generality that these parameters are given by the first model parameters of the parameter in model (2.1) that is

 βℓ=(β0,~βℓ)∈Rpℓ,  ℓ=1,2 , (2.2)

where denotes the vector of common parameters in both regression models and and denote the remaining parameters in the models and , respectively, which do not necessarily coincide. The case where the models and do not share any common parameters is included and corresponds to for (that is ). As a consequence the -dimensional vector of all parameters of the regression functions in model (2.1) under the assumption (2.2) is given by . Throughout this paper we assume that where is a compact set.

These parameters are now estimated by least squares using the combined sample , that is

 ^β=(^β0,^~β1,^~β2)=argmin(b,,~b1,,~b2)∈B2∑ℓ=1kℓ∑i=1nℓ,i∑j=1(Yℓ,i,j−mℓ(dℓ,i,(b0,~bℓ)))2. (2.3)

### 2.2 Testing equivalence of regression curves

Following Liu et al. (2009) and Gsteiger et al. (2011) we consider the regression curves and to be equivalent if the maximum distance between the two curves is smaller than a given pre-specified constant, say , that is,

 d∞(β1,β2)=maxd∈D|m1(d,β1)−m2(d,β2)|<ε.

In clinical trial practice is often referred to as a relevance threshold in the sense that if the difference between the two curves is believed not to be clinically relevant. In order to establish equivalence of the two curves and at a controlled type I error, we will develop a test for the hypotheses

 H0:d∞(β1,β2)≥εversusH1:d∞(β1,β2)<ε. (2.4)

In the following we extend the bootstrap approach from Dette et al. (2018) to test the hypotheses (2.4) in the situation of common parameters. Note that the test procedure proposed below could also be applied to alternative measures of equivalence, such as the integrated deviation .

###### Algorithm 2.1.

(parametric bootstrap for testing equivalence under the assumption of common parameters)

• Calculate the ordinary least-square (OLS) parameter estimate (2.3) assuming a common parameter . The corresponding variance estimates are given by

 ^σ2ℓ=1nℓkℓ∑i=1nℓ,i∑j=1(Yℓ,i,j−mℓ(dℓ,i,^βℓ))2,ℓ=1,2, (2.5)

where . Calculate the estimate

 ^d∞=d∞(^β1,^β2)=maxd∈D|m1(d,^β1)−m2(d,^β2)|

for the maximal deviation between the two regression curves.

• Define the constrained estimates

 ^^βℓ={^βℓ%if^d∞≥ε¯βℓif^d∞<ε,ℓ=1,2, (2.6)

where minimize the objective function in (2.3) under the additional restriction

 d∞(β1,β2)=maxd∈D|m1(d,β1)−m2(d,β2)|=ε. (2.7)

Define and note that .

The next two steps describe the (parametric) bootstrap procedure.

• Generate data

 Y∗ℓ,i,j=mℓ(dℓ,i,(^^β0,^^~βℓ))+η∗ℓ,i,j ,i=1,…,nℓ,i, ℓ=1,2, (2.8)

with independent and normally distributed errors .

• Calculate the OLS estimate as in Step (1) and the test statistic

 ^d∗∞=maxd∈D|m1(d,^β∗1)−m2(d,^β∗2)|,

where . The quantile of the distribution of the distribution of the statistic is denoted by and the null hypotheses in (2.4) is rejected, whenever

 ^d∞<^q∗α. (2.9)

In practice the can be calculated repeating steps (3) and (4), say times, in order to obtain replicates of . An estimate of then is defined by , where denotes the corresponding order statistic, and this estimate is used in (2.9)

The following theorem states that this algorithm yields a valid test procedure. The proof is left to the Appendix A.

###### Theorem 2.1.

The test defined by (2.9) is a consistent, asymptotic -level test. That is

 limn1,n2→∞P(^d∞<^q∗α)=1, (2.10)

whenever , and

 limsupn1,n2→∞P(^d∞<^q∗α)≤α. (2.11)

if .

###### Remark 2.2.

The results presented in this section remain correct in trials with a common placebo group, where observations are taken at dose level (corresponding to placebo), which are modelled by the random variables . For the sake of a simple presentation we consider location-scale type models, such that the common effect at the placebo can easily be modelled, but we note that more general models can be considered as well introducing additional constraints for the parameter.

To be precise, we assume that the models in (2.1) are given by

 mℓ(d,βℓ)=β0,1+~βℓ,1⋅m0ℓ(d,~β0ℓ), ℓ=1,2, i=1,…,kℓ. (2.12)

where , such that the condition reflects the fact that there is only one placebo group (and as a consequence a common placebo parameter). Models of this type cover the most frequently used functional forms used in drug development and several examples can be found in Ting (2006). Beside the location parameter there may be also other shared parameters, which we do not reflect in our notations for a better readability. The -th model is completely characterized by its parameter , , and we obtain estimates of the model parameters by minimizing the sum of squares

 (2.13)

Theorem 2.1 remains valid in this situation and a proof can be found in the Appendix (see Section A.4).

### 2.3 Testing equivalence of model parameters

So far we assumed that the two regression models and share the common parameter . In practice it may be necessary to assess whether this assumption is plausible using an appropriate equivalence test for the shared model parameters. To be more precise, we recall the definition the parameters in model (2.1), i.e.

 βℓ=(βℓ,1,…,βℓ,p′,…,βℓ,pℓ) ,  ℓ=1,2.

and note that assumption (2.2) of common parameters in the models and can be represented as for . In order to investigate if this assumption holds at least approximately we construct a test for the hypotheses

 K0:maxi=1,…,p′|β1,i−β2,i|≥δ versus K1:maxi=1,…,p′|β1,i−β2,i|<δ, (2.14)

where denotes the equivalence margin. To be precise let denote the least squares estimates in model for the sample (), and assume that for large sample considerations the sample sizes and converge to infinity such that

 limnℓ→∞nℓ,inℓ = ζℓ,i>0 ,  i=1,…,kℓ , ℓ=1,2, (2.15) limn1,n2→∞nn1 = λ∈(1,∞). (2.16)

Under standard assumptions, which are listed in Section A it can be shown that the least squares estimate of the parameter in model is approximately normal distributed, that is

 √nℓ(^β(ℓ)−βℓ)\lx@stackrelD→N(0,Σ−1ℓ) , ℓ=1,2, (2.17)

where the symbol means convergence in distribution and the matrix is defined by

 Σℓ=1σ2ℓkℓ∑i=1ζℓ,i∂∂bℓmℓ(dℓ,i,,bℓ)∣∣bℓ=βℓ(∂∂bℓmℓ(dℓ,i,,bℓ)∣∣bℓ=βℓ)T ,  ℓ=1,2. (2.18)

Here and throughout this paper we assume that the matrices and are non-singular. Consequently the difference is also asymptotically normal distributed, and in particular it follows for the first components of the difference that

 √n((^β1,1,…,^β1,p′)−(^β2,1,…,^β2,p′)−((β1,1,…,β1,p′)−(β2,1,…,β2,p′)))\lx@stackrelD→N(0,Ω), (2.19)

where the matrix is defined by

 Ω:=λΛ−11+λλ−1Λ−12, (2.20)

denotes the upper-left -block of the matrix and is defined in (2.16). Therefore we obtain the approximation

 (^β1,1,…,^β1,p′)−(^β2,1,…,^β2,p′)\lx@stackrelD≈N((β1,1,…,β1,p′)−(β2,1,…,β2,p′),1nΩ) ,

where is defined in (2.20). We can now apply the test proposed in Wang et al. (1999) by rejecting the null hypothesis in (2.14), whenever

 (2.21)

where denotes the quantile of the -distribution with degrees of freedom and the th diagonal element of the matrix which is an estimate for the (unknown) covariance matrix (this is obtained by replacing the unknown parameters , and weights in (2.18) by their corresponding estimates and , respectively).

## 3 Finite sample properties

We now investigate the finite sample properties of the bootstrap test proposed in Section 2.2 in terms of power and size using numerical simulations. The data is generated as follows:

1. We choose the functional form of the models and specify their parameters (including a common parameter ), which determine the true underlying models. Further we choose variances and the actual dose levels , .

2. For each dose we calculate values for the response given by . By generating residual errors we obtain the final response data

 Yℓ,i,j=mℓ(dℓ,i,(β0,~βℓ))+ηℓ,i,j,j=1,…,nℓ,i, i=1,…kℓ, ℓ=1,2. (3.1)

The simulation results below were obtained using simulation runs, where bootstrap replications were used to calculate quantiles of the bootstrap test.

In the following, we report the simulations results for power and size under three different scenarios. We consider the four-parameter sigmoid Emax model

 m(d,β)=β1+β2dβ3ββ34+dβ3, (3.2)

which is frequently used in practice when modeling dose response relationships (see for example Gabrielsson and Weiner (2007) or Thomas et al. (2014)). In model (3.2) the parameter corresponds (in this order) to the placebo effect , the maximum effect , the Hill parameter determining the steepness of the dose-response curve and the dose producing half of the maximum effect (Macdougall (2006)). In what follows we add an index for a shared parameter or for the group under consideration.

Scenario 1: We assume the dose range with identical dose levels , for both regression models . For each configuration of we use (3.1) to simulate observations at each dose level , resulting in total sample sizes of , respectively. We first compare the two sigmoid Emax models

 m1(d,β1)=β0,1+β0,2dβ0,3~ββ0,31,4+dβ0,3andm2(d,β2)=β0,1+β0,2dβ0,3~ββ0,32,4+dβ0,3. (3.3)

assuming the shared parameters . The only difference between the two models is in the parameters and , which results in the need to estimate five parameters in total. We consider the reference sigmoid Emax model with the parameters and . This reference model is compared to various specifications of the second model determined by , , , , , and common shared parameters . The values for were chosen such that the maximum absolute distances are given by , , , , , respectively. For these are attained at the dose levels and ; see Figure 1. For , that is , the maximum distance is attained at every point in .

In Table 1 we summarize the simulated rejection probabilities of the bootstrap test (2.9) under the null hypothesis (2.4) with and . We conclude that the bootstrap test controls its level in all cases under consideration. At the margin of the null hypothesis (i.e. ) the approximation of the level is very precise, even for sample sizes as small as .

We also investigated the relative residual mean squared errors (RRMSE) of the parameters estimates. Table 2 summarizes the simulation results only for (i.e. at the margin of the null hypothesis), as the results are similar for other choices of . We conclude that the RRMSE for estimating the Hill parameter is (by far) the largest. This phenomenon has also been observed by Mielke (2016). We also observe that all estimation errors decrease with larger sample sizes and smaller variances. Table 2 also summarizes the RRMSE when fixing the Hill parameter at (see the numbers in brackets). In this case, four parameters need to be estimated in total and the estimation errors become slightly smaller. We also repeated the Type I error rate simulations when fixing the Hill parameter at . The the results are reported in Table 1 (numbers in brackets) and we conclude that the size is well controlled within the simulation error.

In Table 3 we summarize the power of the bootstrap when generating the data under the alternative and . As expected, the power increases with larger sample sizes and smaller variances and is reasonably high across all configurations. Fixing the Hill parameter significantly improves the power which can be explained by the difficulty of estimating this parameter precisely, as discussed above.

Scenario 2: We maintain the basic settings from Scenario . We consider again two sigmoid Emax models

 m1(d,β1)=β0,1+β0,2d~β1,3~β~β1,31,4+d~β1,3% andm2(d,β2)=β0,1+β0,2d~β2,3~β~β2,32,4+d~β2,3. (3.4)

but assume now that both, the placebo response and the maximum treatment effect , are the same. For the reference model we chose , and . We investigated the maximum distances , which resulted in the following parameter configurations for the second model:

 (~β2,3,~β2,4)=(0.81,0.86), (~β2,3,~β2,4)=(1.4,1.07), (~β2,3,~β2,4)=(2.15,1.18), (~β2,3,~β2,4)=(3.15,1.25), (~β2,3,~β2,4)=(3.75,1.28), (~β2,3,~β2,4)=(~β1,3,~β1,4)=(4.5,1.3). (3.5)

The maximum distances between the two curves are now attained at the dose levels and ; see Figure 1.

In Table 4 we summarize the simulated rejection probabilities under the null hypothesis (2.4) for and . We conclude again that the bootstrap test controls the designated significance level in all cases under consideration. Especially at the margin the simulated Type I error rates are close to the nominal level . These observations apply regardless of whether the Hill parameter is estimated or fixed at the true underlying values given in (3).

In Table 5 we summarize the simulated power of the bootstrap test under the alternative and . As expected, the power decreases for increasing values of and for higher variances or smaller sample sizes. One noticeable exception occurs at , where in some cases the power is smaller than for . This effect can be explained theoretically when considering the proofs for the bootstrap test. In case of the set containing all points where the maximum distance between the two curves is attained (see Appendix A.3) consists of the entire dose range . Therefore, the asymptotic distribution of the test statistic is not Gaussian but a maximum of Gaussian processes. This complex structure of the asymptotic distribution has an impact on the bootstrap procedure and explains the decrease in power for . This phenomenon can also be observed, although to a lesser degree, in Scenario . Finally, we observe higher power values when fixing the Hill parameter compared to the situation where it has to be estimated.

Scenario 3: We now investigate the operating characteristics of the bootstrap test assuming three, two, one and no shared parameters. We set , and compare again two sigmoid Emax models. The true placebo response is chosen as . The reference model is specified by . The second model is specified by , where is chosen such that the maximum distances with respect to lie between and ; see the resulting curves plotted in Figure 2 for . Consequently the parameters specifying the two models only differ in parameters and and the maximum distance is determined by the choice of . As all other parameters are the same for the two models, we can compare the bootstrap test assuming three, two, one and no shared parameter. Note that we do not consider the case of identical models (i.e. ) because of the discontinuity of power at described under Scenario 2. The dose range is given by with different dose levels , , , , . We create observations at each dose level for each group according to (3.1), which results in a total sample size of . Finally, we choose , .

In Figure 2 we plot the proportion of rejections in dependence of the true maximum absolute difference . Under the null hypothesis all four tests control their level, as the proportion of rejections is smaller than or equal to within simulation errors. Looking at the region , we observe that the test assuming three shared parameters has the highest power among all four tests, followed by the test assuming two shared parameters. The difference between the tests assuming one and no shared parameter is rather small. Concluding, the more parameters can be assumed to be common for the two regression curves the higher is the power of the test. Note, however, that strictly speaking the hypotheses (2.4) are different when assuming three, two, one and no shared parameters and that the perceived power gain when assuming more shared parameters comes at the cost of making additional assumptions that need to be verified in practice, as illustrated with the clinical trial example in Section 4.

## 4 Clinical trial example

We now illustrate the proposed method with a multi-regional clinical trial example. The objective of this trial is to evaluate the dose response relationships in Caucasian and Japanese patients and assess their similarity. Based on data from previous clinical trials investigating a drug with a similar mode of action, it is reasonable to assume a similar response to placebo and a common maximum treatment effect in both populations, with the main difference expected to be in a different onset of treatment effect. Using the sigmoid Emax model (3.2), these consideration thus lead to different and Hill parameters for the two dose response curves. Because the trial is still at its design stage, we simulate data based on the trial assumptions. To maintain confidentiality, we scale the actual doses to lie within the [0, 15] interval. These limitations do not change the utility of the calculations below.

We assume Japanese and Caucasian patients, resulting in patients overall. Patients from both populations are randomized to receive either placebo (dose level ) or one of three active dose levels, namely for the Japanese and and for the Caucasian patients. Assuming equal allocation of patients within each population, we thus have 75, 60, 15, 15, 60, and 75 patients randomized to the dose levels and , respectively. The response variable is assumed to be normally distributed and larger values indicate a better outcome. Pharmacological and clinical considerations suggest the use of the (three-parameter) Emax model with the Hill parameter fixed at 1. Later on we relax this assumption as part of a sensitivity analysis. The R code for this example and all other calculations in this paper is available from the authors upon request.

In Figure 3 we display the fitted dose response models and for the Japanese and Caucasian patients, respectively, together with the individual observations, where and the -axis is truncated to for better readability. The parameter estimates from the two separate model fits are given by and . The observed differences for the placebo response and the maximum treatment effect are given by and , respectively, and thus relatively small, as it also transpires from the plots in Figure 3. To corroborate this empirical observation, we formally test whether the assumption of shared parameters is plausible by applying the equivalence test described in Section 2.3 on the data set under consideration. We choose the threshold and therefore test the null hypothesis against the alternative Applying the test (2.21) for , we obtain and and therefore and , respectively. We can thus reject at the relatively stringent 5% level and conclude equivalence of the two parameters, which justifies using the bootstrap test (2.1) with shared parameters.

We now evaluate the similarity of the dose response curves for the Japanese and Caucasian patients, assuming the same placebo and maximum treatment effect. In order to compute the non-linear least squares estimates in model (2.1) with (2.2) we formulate the objective function of the minimization step as

 4∑i=160∑j=1(Y1,i,j−(β0,1+β0,2d1,i,j~β1,3+d1,i,j))2+4∑i=1240∑j=1(Y2,i,j−(β0,1+β0,2d2,i,j~β2,3+d2,i,j))2.

Here, denotes the (shared) placebo effect, the (shared) maximum treatment effect , and and the parameters of the two models. Using the auglag function from the alabama package Varadhan (2014) to solve the above optimization problem, we obtain the parameter estimates , , and . In brackets we report the associated standard errors, which have to be calculated manually based on (5.5). The estimates for the population variances are and . The observed maximum difference between both curves over the investigated dose range is , attained at dose . We apply the bootstrap test (2.9) using bootstrap replications. Setting for the equivalence margin in (2.4), we obtain the quantile for . Thus, we reject the null hypothesis (2.4) at the 5% significance level and conclude that the dose response curves for the Japanese and Caucasian populations are similar, under the shared parameter assumption. Alternatively, we can calculate the -value for the bootstrap test and obtain the same test decision at level . For illustration purposes we also apply the bootstrap test (2.9) but without shared parameters (yet under the assumption of a fixed Hill parameter). Accordingly, we obtain a considerably larger -value of , which supports our findings from Scenario in Section 3 about the loss in power when no shared parameters are assumed. In this case the observed maximum distance is , attained at dose , and the quantile of the bootstrap distribution is .

Finally, we perform a sensitivity analysis to investigate the assumption of the Hill parameter being equal to 1. As part of this analysis we repeat the model fit and the bootstrap test using the sigmoid Emax model (3.2) where the Hill parameter is now part of the estimation. The parameter estimates (standard errors in brackets) are