Equivalence of regression curves

Abstract

This paper investigates whether the difference between two parametric models, each describing the relation between a response variable and several covariates in one of two groups, is practically irrelevant, so that inference can be performed on the basis of the pooled sample. Statistical methodology is developed to test the hypotheses $H_0: d(m_1, m_2) \geq \varepsilon$ versus $H_1: d(m_1, m_2) < \varepsilon$ in order to demonstrate equivalence between the two regression curves $m_1$ and $m_2$ for a pre-specified threshold $\varepsilon$, where $d$ denotes a metric measuring the distance between $m_1$ and $m_2$. Our approach is based on the asymptotic properties of a suitable estimator of this distance. In order to improve the approximation of the nominal level for small sample sizes, a bootstrap test is developed, which addresses the specific form of the interval hypotheses. In particular, data have to be generated under the null hypothesis, which implicitly defines a manifold for the parameter vector. The results are illustrated by means of a simulation study and a data example. It is demonstrated that the new methods substantially improve on currently available approaches with respect to power and approximation of the nominal level.

Keywords and Phrases: dose response studies; nonlinear regression; equivalence of curves; constrained parameter estimation; parametric bootstrap

1 Introduction

Testing statistical hypotheses of equivalence has grown significantly in importance over the last decades, with applications covering areas as diverse as comparative bioequivalence trials, evaluating negligible trend in animal population growth, and model validation; see, for example, Cade (2011) and the references therein. Equivalence tests are based on a null hypothesis that a parameter of interest, such as the effect difference of two treatments, is outside an equivalence region defined through an appropriate choice of an equivalence threshold, denoted by $\varepsilon$ in this paper. If the null hypothesis is rejected, one can then claim at a pre-specified significance level that, in the previous example, the two treatments have an equivalent effect [see Wellek (2010)]. Equivalence testing is often used in regulatory settings because it reverses the burden of proof compared to a standard test of significance.

In this paper, we consider the problem of establishing equivalence of two regression models which describe the relation between a response variable and several covariates in two different groups. That is, the objective is to investigate whether the difference between the models for the two groups is practically irrelevant, so that a single model can be used for both groups based on the pooled sample. Such problems appear for example in population pharmacokinetics (PK), where the goal is to establish bioequivalence of the concentration profiles over time, say $m_1$, $m_2$, of two compounds. Traditionally, bioequivalence is established by demonstrating equivalence between real valued quantities such as the area under the curve (AUC) or the maximum concentration ($C_{\max}$) [see Chow and Liu (1992); Hauschke et al. (2007)]. However, such an approach may be misleading because the two profiles could be very different although they may have similar AUC or $C_{\max}$ values. Hence it might be more reasonable to work directly with the underlying PK profiles instead of the derived summary statistics.

Another application of comparing two dose response curves occurs when assessing the results from one patient population relative to another. For example, the international regulatory guidance document ICH E5 (1997) describes the concept of a bridging study based on, for example, the request of a new geographic region to determine whether data from another region are applicable to its population. If the bridging study shows that dose response, safety and efficacy in the new region are similar to another region, then the study is readily interpreted as capable of bridging the foreign data. As a result, the ability of extrapolating foreign data to a new region depends upon the similarity between the two regions. The ICH E5 guidance does not provide a precise definition of similarity and various concepts have been used in the literature. For example, Tsou et al. (2011) proposed a consistency approach for the assessment of similarity between a bridging study conducted in a new region and studies conducted in the original region. On the other hand, the ICH E5 guidance does require that the safety and efficacy profile in the new region is not substantially different from that in the original region, and similarity can therefore be interpreted as demonstrating “no substantial difference”, which results in an equivalence testing problem [see Liu et al. (2002)].

The problem of establishing equivalence of two regression models while controlling the Type I error rate has received considerable attention in the recent literature. For example, Liu et al. (2009) proposed tests for the hypothesis of equivalence of two regression functions, which are applicable in linear models. Gsteiger et al. (2011) considered non-linear models and suggested a bootstrap method which is based on a confidence band for the difference of the two regression models [see also Liu et al. (2007a)]. Both references use the intersection-union principle [see for example Berger (1982)] to construct an overall test for equivalence. We demonstrate in this paper that this approach leads to rather conservative test procedures with low power. Instead, we propose to directly estimate the distance, say $d$, between the regression curves $m_1$ and $m_2$ and to decide for the equivalence of the two curves if the estimator is smaller than a given threshold. The critical values of this test can be obtained by asymptotic theory, which describes the limit distribution of an appropriately standardized estimated distance. In order to improve the approximation of the nominal level for small sample sizes, a non-standard bootstrap approach is proposed.

In Section 2 we introduce the general problem of demonstrating the equivalence between two regression curves. While the concept of similarity of the two profiles is formulated for a general distance $d$, we concentrate in the subsequent discussion on two specific cases. Section 3 is devoted to the comparison of curves with respect to $L^2$-distances. We prove asymptotic normality of the corresponding test statistic and construct an asymptotic level-$\alpha$ test. Moreover, a non-standard bootstrap procedure is introduced, which addresses the particular difficulties arising in the problem of testing (parametric) interval hypotheses. In particular, resampling has to be performed under the null hypothesis $H_0: d(m_1, m_2) \geq \varepsilon$, which defines (implicitly) a manifold in the parameter space. We prove consistency of the bootstrap test and demonstrate by means of a simulation study that it yields an improvement of the approximation of the nominal level for small sample sizes. In Section 4 the maximal deviation between the two curves is considered as a measure of similarity, for which corresponding results are substantially harder to derive. For example, we prove weak convergence of a corresponding test statistic, but the limit distribution depends in a complicated way on the extremal points of the difference between the “true” curves. This problem is again solved by developing a bootstrap test. The finite sample properties of the new methodology are illustrated in Section 5, where we also provide a comparison with the method of Gsteiger et al. (2011). In particular, it is demonstrated that the methodology proposed in this paper is more powerful than the test proposed by these authors. The methods are illustrated with an example in Section 6. Technical details and proofs are deferred to an appendix in Section A.

2 Equivalence of regression curves

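We consider two, possibly different, regression models $m_1$, $m_2$ to describe the relationship between a response variable $Y$ and several covariates $d$ for two different groups $\ell = 1, 2$:

$$Y_{\ell,i,j} = m_\ell(d_{\ell,i}, \beta_\ell) + \eta_{\ell,i,j}, \qquad j = 1, \ldots, n_{\ell,i}, \quad i = 1, \ldots, k_\ell. \qquad (2.1)$$

Here, the covariate region is denoted by $\mathcal{D}$, $d_{\ell,i}$ denotes the $i$th dose level (in group $\ell$), $n_{\ell,i}$ the number of patients treated at dose level $d_{\ell,i}$ and $k_\ell$ the number of different dose levels in group $\ell$. Further, $n_\ell = \sum_{i=1}^{k_\ell} n_{\ell,i}$ denotes the sample size in group $\ell$, $n = n_1 + n_2$ the total sample size, and the functions $m_1$ and $m_2$ in (2.1) define the (non-linear) regression models with $p_1$- and $p_2$-dimensional parameters $\beta_1$ and $\beta_2$, respectively. The error terms $\eta_{\ell,i,j}$ are assumed to be independent and identically distributed with mean $0$ and variance $\sigma_\ell^2$ for group $\ell = 1, 2$. Let $(\mathcal{M}, d)$ denote a metric space of real valued functions of the form $g: \mathcal{D} \to \mathbb{R}$ with distance $d$. We assume that the regression functions satisfy $m_\ell(\cdot, \beta_\ell) \in \mathcal{M}$ for all $\beta_\ell$ in the parameter space ($\ell = 1, 2$), identify the models $m_\ell$ by their parameters $\beta_\ell$ and denote the distance between the two models by $d(\beta_1, \beta_2)$.
We consider the curves $m_1$ and $m_2$ as equivalent if the distance between the two curves is small, that is $d(\beta_1, \beta_2) < \varepsilon$, where $\varepsilon$ is a pre-specified positive constant. In clinical practice, $\varepsilon$ is often called the relevance threshold, in the sense that if $d(\beta_1, \beta_2) < \varepsilon$ the difference between the two curves is believed not to be clinically relevant. In order to establish equivalence of the two dose response curves, we formulate the hypotheses

$$H_0: d(\beta_1, \beta_2) \geq \varepsilon \quad \text{versus} \quad H_1: d(\beta_1, \beta_2) < \varepsilon, \qquad (2.2)$$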

which in the literature are called precise hypotheses, following Berger and Delampady (1987). The choice of $\varepsilon$ depends on the particular problem under consideration. For example, when testing for bioequivalence we can conclude that two treatments are not different from one another if the 90% confidence interval of the ratio of a log-transformed exposure measure (AUC and/or $C_{\max}$; see Section 1) falls completely within the range 80-125%, indicating that differences in systemic drug exposure within these limits are not clinically significant [see U.S. Food and Drug Administration (2003)]. For the comparison of dissolution profiles, which is a special case of the problem considered in this paper, we refer to Appendix I of EMA (2014), which gives some recommendations for the choice of the equivalence threshold on the basis of univariate measures [see for example Yuksel et al. (2000)].

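In the following we are particularly interested in the metric space of all continuous functions with distances

$$d_\infty(\beta_1, \beta_2) = \max_{d \in \mathcal{D}} |m_1(d, \beta_1) - m_2(d, \beta_2)|, \qquad (2.3)$$
$$d_2(\beta_1, \beta_2) = \int_{\mathcal{D}} |m_1(d, \beta_1) - m_2(d, \beta_2)|^2 \, \mathrm{d}d. \qquad (2.4)$$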
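As a rough numerical illustration of the two distances, the following sketch approximates the maximal deviation (2.3) and the integrated squared difference (2.4) on an equidistant grid. The EMAX-type curve `emax`, its parameter values, the dose interval from 0 to 4 and the grid size are assumptions made only for this example; they are not taken from the text above.

```python
def emax(dose, e0, e_max, ed50):
    """Hypothetical EMAX-type curve: e0 + e_max * dose / (ed50 + dose)."""
    return e0 + e_max * dose / (ed50 + dose)


def distances(m1, m2, lower, upper, grid_size=2001):
    """Approximate the maximal deviation and the integrated squared
    difference of two curves on an equidistant grid over [lower, upper]."""
    step = (upper - lower) / (grid_size - 1)
    diffs = [m1(lower + i * step) - m2(lower + i * step) for i in range(grid_size)]
    # Maximal absolute deviation, evaluated on the grid only.
    d_inf = max(abs(v) for v in diffs)
    # Trapezoidal rule for the integral of the squared difference.
    squares = [v * v for v in diffs]
    d_2 = step * (sum(squares) - 0.5 * (squares[0] + squares[-1]))
    return d_inf, d_2


# Two illustrative curves with identical shape, shifted by 0.5 (assumed values).
m1 = lambda d: emax(d, 1.0, 2.0, 1.0)
m2 = lambda d: emax(d, 1.5, 2.0, 1.0)
d_inf, d_2 = distances(m1, m2, 0.0, 4.0)
print(d_inf, d_2)
```

For a constant shift the maximal deviation equals the shift and the integral equals the squared shift times the length of the dose interval, which gives a simple sanity check for the grid approximation.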

The maximal deviation distance $d_\infty$ is of interest, for example, in drug stability studies, where one investigates whether the maximum difference in mean drug content between two batches is no larger than a pre-specified threshold; see, for example, Ruberg and Hsu (1992) and Liu et al. (2007b). The $L^2$-distance $d_2$ might be attractive for demonstrating similarity of, for example, two PK models because it integrates the squared difference between the two curves and is therefore related to the areas under the curves, which in turn are often of interest in bioequivalence studies, as mentioned above.
The maximal deviation distance (2.3) has also been considered in Liu et al. (2009) and Gsteiger et al. (2011), who constructed confidence bands for the difference of two regression curves and used the intersection-union principle to derive an overall test for the hypothesis that the two curves are equivalent. In linear models with normally distributed errors this test keeps the significance level not only asymptotically, but exactly at level $\alpha$ for any fixed sample size [see also Bhargava and Spurrier (2004) or Liu et al. (2008) for some exact confidence bounds when comparing two linear regression models]. However, the resulting test turns out to be conservative and has low power, as demonstrated in Section 5. This observation can be explained by the fact that the “classical” inversion of a confidence interval for a parameter, say $\mu$, provides a level-$\alpha$ test for the hypothesis $H: \mu = 0$ versus $K: \mu \neq 0$, but it usually yields a conservative test for the hypothesis $H: |\mu| \geq \theta$ versus $K: |\mu| < \theta$ with a threshold $\theta > 0$ [see Wellek (2010)]. The same phenomenon also appears in the present context of comparing curves. These properties may limit the use of the procedures proposed by Liu et al. (2009) and Gsteiger et al. (2011) in practice, as we would like to maximize the probability of rejecting the null hypothesis if the two regression curves are in fact equivalent as measured by the relevance threshold $\varepsilon$.
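In the following, we develop alternative approaches that are more powerful. Roughly speaking, we consider for $\ell = 1, 2$ the estimator $m_\ell(\cdot, \hat\beta_\ell)$ of the regression curve $m_\ell(\cdot, \beta_\ell)$ and reject the null hypothesis in (2.2) for small values of the statistic $\hat d = d(\hat\beta_1, \hat\beta_2)$. The critical values can be obtained by asymptotic theory deriving the limit distribution of $\sqrt{n}\,(\hat d - d)$ if $n_1, n_2 \to \infty$, as developed in the following sections. This approach leads to a satisfactory solution for the $L^2$-distance (2.4) based on the quantiles of the normal distribution (see Section 3). However, for the maximal deviation distance (2.3), the limit distribution depends in a complicated way on the extremal points

$$\mathcal{E} = \{ d \in \mathcal{D} : |\Delta(d, \beta_1, \beta_2)| = d_\infty(\beta_1, \beta_2) \}$$

of the true difference

$$\Delta(d, \beta_1, \beta_2) = m_1(d, \beta_1) - m_2(d, \beta_2). \qquad (2.5)$$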

Moreover, in small sample trials the approximation of the nominal level of a given test based on asymptotic theory may not be accurate. In order to obtain a more accurate approximation of the nominal level, we propose a non-standard bootstrap procedure and prove its consistency. This procedure has to be constructed in such a way that it addresses the particular features of the equivalence hypotheses (2.2). In particular, data have to be generated under the null hypothesis $H_0: d(\beta_1, \beta_2) \geq \varepsilon$, which implicitly defines a manifold for the vector of parameters of both models. The non-differentiability of the maximal deviation distance $d_\infty$ creates additional technical difficulties for such an approach, and for this reason we begin the discussion with the $L^2$-distance $d_2$.

3 Comparing curves by $L^2$-distances

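In this section we construct a test for the equivalence of the two regression curves with respect to the squared $L^2$-distance (2.4), i.e. we consider hypotheses of the form

$$H_0: d_2(\beta_1, \beta_2) \geq \varepsilon \quad \text{versus} \quad H_1: d_2(\beta_1, \beta_2) < \varepsilon. \qquad (3.1)$$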

Note that under certain regularity assumptions (see the Appendix for details) the ordinary least squares (OLS) estimators, say $\hat\beta_1$ and $\hat\beta_2$, of the parameters $\beta_1$ and $\beta_2$ can usually be linearized in the form

(3.2)

where the functions are given by

(3.3)

and the dimensional matrices are defined by

(3.4)

For these arguments we assume that the matrices are non-singular and that the sample sizes converge to infinity such that

(3.5)

and

(3.6)

It then follows by straightforward calculation that the OLS estimators are asymptotically normally distributed, i.e.

(3.7)

where the symbol means weak convergence (convergence in distribution for real valued random variables). The asymptotic variance in (3.7) can easily be estimated by replacing the parameters , and in (3.4) by their estimators , and . The resulting estimator will be denoted by throughout this paper. The null hypothesis in (3.1) is then rejected whenever

$$\hat d_2 := d_2(\hat\beta_1, \hat\beta_2) < c, \qquad (3.8)$$

where $c$ denotes a pre-specified constant defined through the level of the test. In order to determine this constant we will derive the asymptotic distribution of the statistic $\hat d_2$. The following result is proved in the Appendix.

Theorem 3.1.

If Assumptions A.1 - A.5 from the Appendix, (3.5) and (3.6) are satisfied, we have

$$\sqrt{n}\,(\hat d_2 - d_2) \xrightarrow{\mathcal{D}} \mathcal{N}(0, \sigma^2_{d_2}), \qquad (3.9)$$

where the asymptotic variance is given by

(3.10)

is defined in (2.5) and the kernel is given by

(3.11)

Theorem 3.1 provides a simple asymptotic level-$\alpha$ test for the hypothesis (3.1) of equivalence of two regression curves. More precisely, if $\hat\sigma^2_{d_2}$ denotes the (canonical) estimator of the asymptotic variance in (3.10), then the null hypothesis in (3.1) is rejected if the estimated distance $\hat d_2 = d_2(\hat\beta_1, \hat\beta_2)$ satisfies

$$\hat d_2 < \varepsilon + \frac{\hat\sigma_{d_2}}{\sqrt{n}}\, u_\alpha, \qquad (3.12)$$

where $u_\alpha$ denotes the $\alpha$-quantile of the standard normal distribution. Note that by the nature of the problem the critical value of this test depends on the threshold $\varepsilon$. The finite sample properties of this test will be investigated in Section 5.1.
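The decision rule of the asymptotic test can be sketched in a few lines. The function below is only an illustration: it assumes that the estimated distance, an estimate of the asymptotic standard deviation and the total sample size are already available, and the function name, argument names and numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def asymptotic_equivalence_test(d_hat, sigma_hat, n, eps, alpha=0.05):
    """Reject H0 (distance >= eps) if d_hat falls below the critical value
    eps + sigma_hat / sqrt(n) * u_alpha, where u_alpha is the alpha-quantile
    of the standard normal distribution (negative for alpha < 0.5)."""
    u_alpha = NormalDist().inv_cdf(alpha)
    critical_value = eps + sigma_hat / sqrt(n) * u_alpha
    return d_hat < critical_value, critical_value


# Hypothetical inputs, chosen only to show the call.
reject, crit = asymptotic_equivalence_test(d_hat=0.4, sigma_hat=1.2, n=100, eps=1.0)
print(reject, round(crit, 4))
```

Because the quantile is negative for the usual levels, the critical value lies below the threshold, so equivalence is only claimed when the estimated distance is smaller than the threshold by a margin that shrinks at rate $1/\sqrt{n}$.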


Remark 3.2.

It follows from Theorem 3.1 that the test (3.12) has asymptotic level $\alpha$ and is consistent if $n_1, n_2 \to \infty$. More precisely, if $\Phi$ denotes the cumulative distribution function of the standard normal distribution, we have for the probability of rejecting the null hypothesis in (3.1)

Under continuity assumptions it follows that and Theorem 3.1 yields . This gives

The test (3.12) can be recommended if the sample sizes are reasonably large. However, we will demonstrate in Section 5 that for very small sample sizes the critical values provided by this asymptotic theory may not provide an accurate approximation of the nominal level, and for this reason we will also investigate a parametric bootstrap procedure to generate critical values for the statistic $\hat d_2$.

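Algorithm 3.3.

(parametric bootstrap for testing precise hypotheses)

  (1) Calculate the OLS-estimators $\hat\beta_1$ and $\hat\beta_2$, the corresponding variance estimators

    $$\hat\sigma_\ell^2 = \frac{1}{n_\ell} \sum_{i=1}^{k_\ell} \sum_{j=1}^{n_{\ell,i}} \big( Y_{\ell,i,j} - m_\ell(d_{\ell,i}, \hat\beta_\ell) \big)^2, \qquad \ell = 1, 2,$$

    and the test statistic $\hat d_2 = d_2(\hat\beta_1, \hat\beta_2)$ defined by (3.8).

  (2) Define estimators of the parameters $\beta_1$ and $\beta_2$ by

    $$\hat{\hat\beta}_\ell = \begin{cases} \hat\beta_\ell & \text{if } \hat d_2 \geq \varepsilon, \\ \tilde\beta_\ell & \text{if } \hat d_2 < \varepsilon, \end{cases} \qquad \ell = 1, 2, \qquad (3.13)$$

    where $\tilde\beta_1, \tilde\beta_2$ denote the OLS-estimators of the parameters $\beta_1, \beta_2$ under the constraint

    $$d_2(\beta_1, \beta_2) = \varepsilon. \qquad (3.14)$$

    Finally, define $\hat{\hat d}_2 = d_2(\hat{\hat\beta}_1, \hat{\hat\beta}_2)$ and note that $\hat{\hat d}_2 \geq \varepsilon$.

  (3) Bootstrap test

    (i) Generate bootstrap data under the null hypothesis, that is

      $$Y^*_{\ell,i,j} = m_\ell(d_{\ell,i}, \hat{\hat\beta}_\ell) + \eta^*_{\ell,i,j}, \qquad j = 1, \ldots, n_{\ell,i}, \quad i = 1, \ldots, k_\ell, \qquad (3.15)$$

      where the errors $\eta^*_{\ell,i,j}$ are independent normally distributed such that $\eta^*_{\ell,i,j} \sim \mathcal{N}(0, \hat\sigma_\ell^2)$.

    (ii) Calculate the OLS estimators $\hat\beta_1^*$ and $\hat\beta_2^*$ and the test statistic

      $$\hat d_2^* = d_2(\hat\beta_1^*, \hat\beta_2^*)$$

      from the bootstrap data. Denote by $q_\alpha$ the $\alpha$-quantile of the distribution of the statistic $\hat d_2^*$, which depends on the data through the estimators $\hat{\hat\beta}_1$ and $\hat{\hat\beta}_2$.

    The steps (i) and (ii) are repeated $B$ times to generate replicates $\hat d_{2,1}^*, \ldots, \hat d_{2,B}^*$ of $\hat d_2^*$. If $\hat d_2^{*(1)} \leq \ldots \leq \hat d_2^{*(B)}$ denotes the corresponding order statistic, the estimator of the quantile of the distribution of $\hat d_2^*$ is defined by $\hat q_\alpha := \hat d_2^{*(\lfloor B\alpha \rfloor)}$, and the null hypothesis is rejected if

    $$\hat d_2 < \hat q_\alpha. \qquad (3.16)$$

    Note that the bootstrap quantile $\hat q_\alpha$ depends on the threshold $\varepsilon$ which is used in the hypothesis (3.1), but we do not reflect this dependence in our notation.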
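The structure of the algorithm can be sketched on a deliberately simplified toy problem. The code below is not the procedure for the non-linear models of this paper: it assumes two constant mean models (one level per group), for which the OLS estimators are the sample means, the distance reduces to the absolute difference of the two levels, and the constrained estimator has a closed form. All function names, sample sizes and numerical values are hypothetical.

```python
import random


def constrained_means(y1_bar, y2_bar, n1, n2, eps):
    """Least-squares estimates of two group means under |b1 - b2| = eps."""
    sign = 1.0 if y1_bar >= y2_bar else -1.0
    shift = sign * eps - (y1_bar - y2_bar)  # required change of the difference
    return y1_bar + shift * n2 / (n1 + n2), y2_bar - shift * n1 / (n1 + n2)


def bootstrap_equivalence_test(y1, y2, eps, alpha=0.05, n_boot=500, seed=0):
    rng = random.Random(seed)
    n1, n2 = len(y1), len(y2)
    # Step 1: unconstrained estimators, variance estimators, test statistic.
    b1, b2 = sum(y1) / n1, sum(y2) / n2
    s1 = (sum((y - b1) ** 2 for y in y1) / n1) ** 0.5
    s2 = (sum((y - b2) ** 2 for y in y2) / n2) ** 0.5
    d_hat = abs(b1 - b2)
    # Step 2: move to the boundary of the null hypothesis if d_hat < eps.
    if d_hat < eps:
        b1, b2 = constrained_means(b1, b2, n1, n2, eps)
    # Step 3: generate data under the null and recompute the statistic.
    replicates = []
    for _ in range(n_boot):
        m1 = sum(rng.gauss(b1, s1) for _ in range(n1)) / n1
        m2 = sum(rng.gauss(b2, s2) for _ in range(n2)) / n2
        replicates.append(abs(m1 - m2))
    replicates.sort()
    # Order statistic with (1-based) index floor(B * alpha).
    q_hat = replicates[max(int(n_boot * alpha) - 1, 0)]
    return d_hat < q_hat, d_hat, q_hat


# Simulated toy data: true levels 0.0 and 0.1, threshold eps = 1 (assumed values).
data_rng = random.Random(1)
y1 = [data_rng.gauss(0.0, 0.2) for _ in range(50)]
y2 = [data_rng.gauss(0.1, 0.2) for _ in range(50)]
reject, d_hat, q_hat = bootstrap_equivalence_test(y1, y2, eps=1.0)
print(reject, d_hat, q_hat)
```

The essential design choice carried over from the algorithm is step 2: when the unconstrained fit already lies inside the equivalence region, the bootstrap data are generated from the constrained fit on the boundary of the null hypothesis rather than from the unconstrained one.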

The following result shows that the bootstrap test (3.16) has asymptotic level $\alpha$ and is consistent if $n_1, n_2 \to \infty$. Its proof can be found in the Appendix.

Theorem 3.4.

Assume that the conditions of Theorem 3.1 are satisfied.

  (1) If the null hypothesis in (3.1) holds, then we have for any

    (3.17)
  (2) If the alternative in (3.1) holds, then we have for any

    (3.18)

4 Comparing curves by their maximal deviation

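In this section we construct a test for the equivalence of the two regression curves with respect to the maximal absolute deviation (2.3). The corresponding test statistic is given by the maximal deviation distance

$$\hat d_\infty = d_\infty(\hat\beta_1, \hat\beta_2) = \max_{d \in \mathcal{D}} |m_1(d, \hat\beta_1) - m_2(d, \hat\beta_2)| \qquad (4.1)$$

between the two estimated regression functions, where $\hat\beta_1, \hat\beta_2$ are the OLS-estimators from the two samples. In order to describe the asymptotic distribution of the statistic $\hat d_\infty$ we define the set of extremal points

$$\mathcal{E} = \{ d \in \mathcal{D} : |m_1(d, \beta_1) - m_2(d, \beta_2)| = d_\infty(\beta_1, \beta_2) \} \qquad (4.2)$$

and introduce the decomposition $\mathcal{E} = \mathcal{E}^+ \cup \mathcal{E}^-$, where

$$\mathcal{E}^{\mp} = \{ d \in \mathcal{D} : m_1(d, \beta_1) - m_2(d, \beta_2) = \mp\, d_\infty(\beta_1, \beta_2) \}. \qquad (4.3)$$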

The following result is proved in the Appendix.

Theorem 4.1.

If and the assumptions of Theorem 3.1 are satisfied, then

(4.4)

where denotes a Gaussian process defined by

(4.5)

and and are independent - and -dimensional standard normal distributed random variables, respectively, i.e. , .

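In principle, Theorem 4.1 provides an asymptotic level-$\alpha$ test for the hypotheses

$$H_0: d_\infty(\beta_1, \beta_2) \geq \varepsilon \quad \text{versus} \quad H_1: d_\infty(\beta_1, \beta_2) < \varepsilon \qquad (4.6)$$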

by rejecting the null hypotheses whenever , where denotes the -quantile of the distribution of the random variable defined in (4.4). However, this distribution has a very complicated structure. For example, if the distribution of is a centered normal distribution but with variance

(4.7)

which depends on the location of the (unique) extremal point . In general (more precisely in the case ) the distribution of is the distribution of a maximum of dependent Gaussian random variables, where the variances and the dependence structure depend on the location of the extremal points of the function . Because the estimation of these points is very difficult, we propose a bootstrap approach to obtain suitable quantiles. The bootstrap test is defined in the same way as described in Algorithm 3.3, where the distance is replaced by the maximal deviation . The corresponding quantile obtained in Step 3(ii) of Algorithm 3.3 is now denoted by , while the theoretical quantile of the bootstrap distribution is denoted by . The following result is proved in the Appendix and shows that the test, which rejects the null hypothesis in (4.6) whenever

(4.8)

has asymptotic level $\alpha$ and is consistent. Interestingly, the quality of the approximation of the nominal level of the test depends on the cardinality of the set $\mathcal{E}$ of extremal points defined in (4.2).

Theorem 4.2.

Suppose that the assumptions of Theorem 4.1 hold.

  (1) If the null hypothesis in (4.6) is satisfied and the set $\mathcal{E}$ defined in (4.2) consists of one point, then we have for any

    (4.9)
  (2) Let denote the distribution function of the random variable defined in (4.4) and its -quantile. Assume that is continuous at and . If the null hypothesis in (4.6) is satisfied we have

    (4.10)
  (3) If the alternative in (4.6) is satisfied, then we have for any

    (4.11)
Remark 4.3.

(a) The condition in part (2) of Theorem 4.2 is a non-trivial assumption. By results in Tsirel’son (1976), the distribution of has at most one jump at the left boundary of its support and is continuous to the right of that. The condition on to be continuous at is thus equivalent to requiring that the mass at the left endpoint of the support of is smaller than . In some cases it is possible to show that is continuous on , i.e. the mass at its left support point is zero. For example, this follows from Theorem 3 of Chernozhukov et al. (2015) provided that the condition

(4.12)

holds.
(b) The assumption (4.12) is always fulfilled if one of the models contains an additive (placebo) effect because in this case the first entry of the gradient equals . Furthermore, if the two models are of the form for with (for example Michaelis-Menten models), we have

Consequently, if (4.12) was not fulfilled, there would exist such that , and as it holds

This yields and does not correspond to the null hypothesis.
(c) Note that the asymptotic Type I error rate of the bootstrap test is precisely $\alpha$ at the boundary of the hypothesis (i.e. $d_\infty = \varepsilon$) if the cardinality of the set $\mathcal{E}$ defined in (4.2) is one. On the other hand, if the set $\mathcal{E}$ contains more than one point, part (2) of Theorem 4.2 indicates that the corresponding bootstrap test is usually conservative, even at the boundary of the hypothesis. These results are confirmed by a simulation study in Section 5.2.

5 Finite sample properties

In this section we investigate the finite sample properties of the asymptotic and bootstrap tests proposed in Sections 3 and 4 in terms of power and size. For the distance we also provide a comparison with the approach from Gsteiger et al. (2011). Their method follows from (3.5) - (3.7) and an application of the Delta method [see for example Van der Vaart (1998)] so that the prediction for the difference of the two regression models at the point is approximately normally distributed. That is,

where

(5.1)

and denotes the estimator of the variance in (3.4), which is obtained by replacing the parameters , and by their estimators , , and . Gsteiger et al. (2011) proposed a test based on the pointwise confidence bands derived by Liu et al. (2007a), that is

where denotes the -quantile of the standard normal distribution. A test for the hypotheses (4.6) is finally obtained by rejecting the null hypothesis and conclude for equivalence, if the maximum (minimum) of the upper (lower) confidence band is smaller (larger) than (). A particular advantage of this test is that it directly refers to the distance (2.3), which has a nice interpretation in many applications. Moreover, in linear models (with normally distributed errors) it is an exact level- test. However, the resulting test is conservative and has low power compared to the methods proposed in this paper as shown in Section 5.2.

All results in this and the following section are based on simulation runs and the quantiles of the bootstrap tests have been obtained by bootstrap replications. In all examples the dose range is given by the interval and an equal number of patients is allocated at the five dose levels , , , in both groups (that is ).

5.1 Tests based on the distance $d_2$

For the sake of brevity we restrict ourselves to a comparison of two shifted EMAX-models

(5.2)

where and . In Tables 1 and 2 we display the simulated Type I error rates of the bootstrap test (3.16) and the asymptotic test (3.12) for $\varepsilon = 1$ in (3.1) and various configurations of , , , and . In the interior of the null hypothesis (i.e. $d_2 > 1$) the Type I error rates of the tests (3.12) and (3.16) are smaller than the nominal level, as predicted by Remark 3.2. For both tests we observe a rather precise approximation of the nominal level (even for small sample sizes) at the boundary of the null hypothesis (i.e. $d_2 = 1$). In some cases the approximation of the nominal level by the bootstrap test (3.16) is slightly more accurate, and for this reason we recommend using the bootstrap test (3.16) to establish equivalence of two regression models with respect to the $L^2$-distance.

δ      d_2     α = 0.05                   α = 0.1
1      4       0.000   0.000   0.000      0.000   0.000   0.000
0.75   2.25    0.004   0.002   0.001      0.000   0.002   0.000
0.5    1       0.051   0.064   0.052      0.101   0.120   0.118

1      4       0.000   0.000   0.000      0.000   0.000   0.000
0.75   2.25    0.000   0.000   0.000      0.000   0.000   0.000
0.5    1       0.055   0.060   0.051      0.104   0.111   0.101

1      4       0.000   0.000   0.000      0.000   0.000   0.000
0.75   2.25    0.001   0.002   0.000      0.004   0.005   0.001
0.5    1       0.057   0.058   0.050      0.125   0.107   0.097

1      4       0.000   0.000   0.000      0.000   0.000   0.000
0.75   2.25    0.001   0.000   0.000      0.002   0.000   0.000
0.5    1       0.057   0.048   0.054      0.097   0.114   0.093

Table 1: Simulated Type I error rates of the bootstrap test (3.16) for the equivalence of two shifted EMAX models defined in (5.2). The threshold in (3.1) is chosen as $\varepsilon = 1$.
δ      d_2     α = 0.05                   α = 0.1
1      4       0.002   0.002   0.002      0.000   0.002   0.003
0.75   2.25    0.005   0.005   0.009      0.007   0.011   0.016
0.5    1       0.080   0.042   0.049      0.102   0.061   0.071

1      4       0.000   0.000   0.000      0.000   0.000   0.000
0.75   2.25    0.007   0.012   0.007      0.017   0.015   0.012
0.5    1       0.055   0.063   0.060      0.081   0.078   0.084

1      4       0.000   0.000   0.000      0.000   0.000   0.000
0.75   2.25    0.000   0.001   0.002      0.017   0.003   0.006
0.5    1       0.060   0.066   0.080      0.090   0.091   0.096

1      4       0.000   0.000   0.000      0.000   0.000   0.000
0.75   2.25    0.000   0.000   0.000      0.000   0.000   0.001
0.5    1       0.041   0.058   0.052      0.071   0.087   0.073

Table 2: Simulated Type I error rates of the asymptotic test (3.12) for the equivalence of two shifted EMAX models defined in (5.2). The threshold in (3.1) is chosen as $\varepsilon = 1$.

In Tables 3 and 4 we display the power of the two tests under various alternatives specified by the value in model (5.2). We observe a reasonable power of both tests in all cases under consideration. In those cases where the asymptotic test (3.12) keeps (or exceeds) its nominal level it is slightly more powerful than the bootstrap test (3.16). The opposite performance can be observed in those cases where the asymptotic test is conservative (e.g., if ). We also note that the power of both tests is a decreasing function of the distance , as predicted by the asymptotic theory.