Efficient determination of optimised multi-arm multi-stage experimental designs with control of generalised error-rates




Running Head: Generalised multi-arm multi-stage experimental designs.

Abstract: Primarily motivated by the drug development process, several publications have now presented methodology for the design of multi-arm multi-stage experiments with normally distributed outcome variables of known variance. Here, we extend these past considerations to allow the design of what we refer to as an $abcd$ multi-arm multi-stage experiment. We provide a proof of how strong control of the $a$-generalised type-I familywise error-rate can be ensured. We then describe how to attain the power to reject at least $b$ out of $c$ false hypotheses, which is related to controlling the $b$-generalised type-II familywise error-rate. Following this, we detail how a design can be optimised for a scenario in which rejection of any $d$ null hypotheses brings about termination of the experiment. We achieve this by proposing a highly computationally efficient approach for evaluating the performance of a candidate design. Finally, using a real clinical trial as a motivating example, we explore the effect of the design’s control parameters on the statistical operating characteristics.

Keywords: Clinical trial; Familywise error rate; Group sequential; Interim analysis; Multi-arm multi-stage; Treatment selection.

Address correspondence to M. J. Grayling, MRC Biostatistics Unit, Forvie Site, Robinson Way, Cambridge CB2 0SR, UK; Fax: +44-(0)1223-330365; E-mail: mjg211@cam.ac.uk.

1 Introduction

The statistical literature contains much research on the design and analysis of two-arm group sequential experiments, with the primary motivation often the desire to improve the efficiency of clinical research (see, for example, Jennison and Turnbull (2000)). Recently, several papers have sought to extend these methods to multi-arm multi-stage (MAMS) designs, which allow multiple arms to be compared to a single control arm. Assuming outcome data to be normally distributed with known variance, Magirr et al. (2012) established how the type-I familywise error-rate (FWER) could be strongly controlled in a parallel arm MAMS design. Allowing early stopping to both reject and accept null hypotheses, with any number of observations accrued in each arm during each stage, they also described how commonly utilised stopping boundaries for two-arm designs could be extended to the multi-arm setting.

Additional recommendations on how to design MAMS experiments have been presented (Wason et al., 2016), along with considerations on other outcome data types and the incorporation of covariates (Jaki and Magirr, 2013). An approach based on simulation was also advanced for the optimisation of a MAMS design (Wason and Jaki, 2012), whilst Ghosh et al. (2017) recently described an effective method for overcoming issues of computational intractability inherent in the approach of Magirr et al. (2012).

Thus, much methodology is now available for the design of MAMS experiments. However, each of these presentations sought to control the conventional type-I FWER, and the power to reject a particular null hypothesis under the so-called least favourable configuration (LFC). These operating characteristics are a logical starting point, given their typical requirement in late phase clinical trials. However, allowing for more flexible error control would be highly advantageous. Specifically, there has been increased interest in recent years in experimental designs that control the generalised type-I (Lehmann and Romano, 2005; Romano and Shaikh, 2006) and type-II FWERs (Delorme et al., 2016). These constraints are useful in scenarios where classical error control leads to infeasible or undesirable required sample sizes.

Moreover, each of the above papers assumed a so-called simultaneous stopping rule in which an experiment would be terminated as soon as a single null hypothesis was rejected. Urach and Posch (2016) recently provided a detailed assessment of the relative merits of simultaneous stopping in comparison to a separate stopping rule. In the latter approach, an experiment is continued until a decision is made for every null hypothesis. These two rules are extreme ones, particularly when the number of arms is large. In many instances we may wish to continue our investigations until some intermediate number of null hypotheses have been rejected. To date though, no methodology has been presented to facilitate this.

Within the context of clinical research, generalised error-rate control and more flexible trial stopping rules may be of most relevance in phase II trials. Explicitly, randomised designs are increasing in use (Ivanova et al., 2016) and popularity in phase II (Sharma et al., 2011; Jung, 2013), but their employment has been hindered because of associated sample size requirements (Pond and Abassi, 2011). Controlling generalised error-rates could assist in overcoming this problem. Furthermore, if many treatments are compared in a MAMS phase II trial, we may not wish to identify the single seemingly most efficacious treatment, but some larger number to carry forward for further exploration.

Therefore, here, we provide methodology for the design of MAMS experiments with control of generalised error criteria under highly adaptable stopping rules. We describe how strong control of the $a$-generalised type-I FWER can be achieved. Next, we describe how power to reject at least $b$ out of $c$ false hypotheses can be attained. Following this, we detail how a design can be optimised for a scenario in which rejection of any $d$ null hypotheses will bring about termination of the experiment. To facilitate obtaining such designs in practice, we also detail an efficient algorithm for evaluating the key operating characteristics of any candidate design. We conclude with an example based on a real trial, and with a short discussion.

2 Methods

2.1 Hypotheses, analysis, and stopping boundaries

Although our methodology is relevant to many types of experiment, from here we pose our problem within the context of designing a clinical trial. Thus, our goal will be to test whether several experimental treatments are superior to some common control regimen.

First, define the sets and , for any . Additionally, for any , , and , define to be the vector formed by repeating , times. For simplicity, . We will also assume the convention that , and take as the matrix formed by placing the elements of along the leading diagonal (i.e., , if , ). Finally, will signify the indicator function on event .

Now, we suppose the trial has experimental treatments present initially, and let , , be the mean response on the experimental treatments, and be the mean response on the control. Our hypotheses of interest are then , . As data are accrued over the course of the trial, this family of null hypotheses will be tested at a series of analyses indexed by , for some . We suppose that at any interim analysis , to test , responses will be available from patients on treatments . We enforce that , such that is the group size in the first stage for the control arm, and the are allocation ratios relative to . We denote the set of all , given , by . Note that by this definition of , for now we require only that . We return later to discuss how we can ensure .

The test statistics (, ) will be used, where . Here, we have initially explicitly stated the dependence upon , but will routinely omit it in what follows for brevity.

Similarly, we set and for . Moreover, take and , for and . Then, has a multivariate normal distribution (Jennison and Turnbull, 2000) with
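As a hedged illustration of the joint distribution referred to above, the following sketch builds the covariance matrix of the standardised test statistics from cumulative sample sizes and known variances. It relies on the standard result that two cumulative sample means from the same arm have covariance equal to the variance of the later one. The function name `wald_covariance`, its arguments, and the ordering of the statistics are assumptions made for this example, and every arm is assumed to remain in the trial at every analysis.

```python
import math


def wald_covariance(n, sigma2):
    """Covariance of standardised test statistics (illustrative sketch).

    n[j][k]   : cumulative sample size at analysis j+1 on arm k (k = 0 is control)
    sigma2[k] : known outcome variance on arm k
    Returns (index, cov), where index lists the (analysis, arm) label of each row.
    """
    J, K = len(n), len(sigma2) - 1

    def info(j, k):
        # Information for the comparison of arm k with control at analysis j.
        return 1.0 / (sigma2[0] / n[j - 1][0] + sigma2[k] / n[j - 1][k])

    index = [(j, k) for j in range(1, J + 1) for k in range(1, K + 1)]
    cov = [[0.0] * len(index) for _ in index]
    for r, (j1, k1) in enumerate(index):
        for c, (j2, k2) in enumerate(index):
            later = max(j1, j2)
            # The shared control arm always contributes to the covariance.
            shared = sigma2[0] / n[later - 1][0]
            if k1 == k2:
                # The same experimental arm contributes as well.
                shared += sigma2[k1] / n[later - 1][k1]
            cov[r][c] = shared * math.sqrt(info(j1, k1) * info(j2, k2))
    return index, cov


if __name__ == "__main__":
    # Two analyses, two experimental arms, equal allocation, unit variances.
    labels, matrix = wald_covariance([[10, 10, 10], [20, 20, 20]], [1.0, 1.0, 1.0])
    for label, row in zip(labels, matrix):
        print(label, [round(v, 3) for v in row])
```

With equal allocation and equal variances, statistics for different arms at the same analysis have correlation 1/2 through the shared control, and statistics for the same arm at two analyses have correlation equal to the square root of the ratio of their information levels.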

As discussed, we suppose that we would like our design to strongly control the $a$-generalised type-I FWER, the probability of incorrectly rejecting at least $a$ true null hypotheses, to some level . Furthermore, we extend the typically employed LFC considerations of Dunnett (1984) to the requirement of Dunnett and Tamhane (1992), and assume it is desired that the $b$ out of $c$ familywise power (FWP) be at least , for some and . That is, without loss of generality, the probability of rejecting at least $b$ of the null hypotheses , when and for and , must be at least . Here, is interpreted as an effect size at which it would be of interest to consider a treatment further, whilst if the treatment effect is below then this arm is not worth considering further. Note that this FWP requirement is directly related to the generalised type-II FWER defined in Delorme et al. (2016), who use the notation and , where we use and . In what follows, we set .

Design determination is also achieved supposing that the cumulative rejection of $d$ or more null hypotheses, by any analysis , will bring about the termination of the trial. With this, and correspond to the simultaneous and separate stopping rules respectively, whilst represents a new previously unconsidered stopping rule.

We refer to our designs from here as $abcd$-MAMS designs. The design considered in Magirr et al. (2012), for example, is then a 1111-MAMS design. Explicitly, in a 1111-MAMS design, the conventional FWER is strongly controlled, the trial is terminated as soon as any null hypothesis is rejected, and power is provided to reject a particular treatment’s null hypothesis when only it is worthy of further consideration. Jung (2008) explored, in the context of exact binomial tests, a 111-MAMS design. With this, power is instead provided to reject at least one null hypothesis amongst the entire family, when all treatments have an effectiveness considered interesting.

We define our futility (non-rejection) and efficacy (rejection) stopping boundaries as and respectively. Allowing for infinite stopping boundaries to preclude the possibility to make early decisions to reject or accept null hypotheses if so desired, we denote the set of all possible pairs by . Here, is enforced so that the trial terminates after at most stages as desired.

Finally, we define our trial's formal conduct using two vectors, and , as follows:

  1. Set and .

  2. Conduct stage of the trial, and compute the .

  3. For such that for (with the convention for )

    • If reject , setting and , and designating that for .

    • If do not reject , setting , and designating that for .

  4. If and , set and return to 2. Else stop the trial, and for each with , set , and designate that for .

Here, if fewer than $d$ null hypotheses have been rejected, and a decision has not been made for all null hypotheses, step 4 leads to a return to step 2.
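The conduct described in steps 1 to 4 can be sketched in code as follows. This is a minimal illustration rather than a definitive implementation: the function name `run_trial`, the layout of the test statistics, and the conventions that a hypothesis is rejected when its statistic strictly exceeds the efficacy boundary and is not rejected when it is at or below the futility boundary are all assumptions made for the example.

```python
def run_trial(z, f, e, d):
    """Apply the stopping rules to a complete table of test statistics.

    z[j][k] : test statistic at analysis j+1 for experimental arm k+1
    f, e    : futility and efficacy boundaries, one per analysis
    d       : the trial stops once d null hypotheses have been rejected
    Returns (psi, omega): rejection indicators and the analysis at which each
    arm's testing ended (through a decision or through trial termination).
    """
    if f[-1] != e[-1]:
        raise ValueError("final futility and efficacy boundaries must coincide")
    n_stages, n_arms = len(z), len(z[0])
    psi = [0] * n_arms
    omega = [None] * n_arms          # None marks an arm with no decision yet
    for j in range(n_stages):
        for k in range(n_arms):
            if omega[k] is not None:
                continue             # decision already made for this arm
            if z[j][k] > e[j]:       # efficacy: reject the null hypothesis
                psi[k], omega[k] = 1, j + 1
            elif z[j][k] <= f[j]:    # futility: do not reject
                omega[k] = j + 1
        if sum(psi) >= d or all(w is not None for w in omega):
            # Trial stops: undecided arms are recorded as ending at this analysis.
            omega = [w if w is not None else j + 1 for w in omega]
            break
    return psi, omega


if __name__ == "__main__":
    stats = [[2.7, 1.0, -0.3], [0.0, 2.4, 0.0]]
    print(run_trial(stats, f=[0.0, 2.0], e=[2.5, 2.0], d=2))
    print(run_trial(stats, f=[0.0, 2.0], e=[2.5, 2.0], d=1))
```

In the first call the trial continues to the second analysis because only one hypothesis has been rejected and one arm remains undecided. In the second call the single rejection at the first analysis terminates the trial, so the undecided arm is recorded as not rejected at that analysis.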

2.2 Strong control of the $a$-generalised type-I familywise error-rate

To demonstrate how strong control of the $a$-generalised type-I FWER can be achieved, we extend the arguments of Magirr et al. (2012). First, for , , and , define

Next, take . That is, is the set of all ordered subsets of with .

Then, if for , the event that at least of fail to be rejected is equivalent to

taking , where is the whole sample space.

Lemma 2.1.

For any


See the Appendix of Magirr et al. (2012). ∎

Theorem 2.2.

For any , .


Suppose without loss of generality that and for , and . Let . Using Lemma 2.1

Thus, by Theorem 2.2, the $a$-generalised type-I FWER can be controlled in the strong sense to level by ensuring .

2.3 Design operating characteristics

In this section, we describe an efficient approach for the evaluation of the key performance characteristics of any design.

Ghosh et al. (2017) discussed how the approach utilised in Magirr et al. (2012) is prohibitively computationally expensive even for moderate values of the number of stages or treatments. By working with the score statistics, and employing recently developed techniques for the efficient evaluation of multivariate normal integrals, they provided a method which was considerably more efficient. However, their approach makes no allowance for the fast evaluation of expected sample sizes, as it is geared towards assessing the probability one of the null hypotheses is rejected, and does not monitor the outcome for each arm. It is also specialised to the case . Though it could in theory be extended to designs, to allow the optimisation of the stopping boundaries and group sizes in Section 2.4, we here propose a different approach based on the distribution of , the standardised Wald test statistics.

We denote the possible outcomes of the trial by , where

  • , with if is rejected, and otherwise,

  • , with if is the interim analysis at which is rejected or accepted, or the whole trial is stopped and no decision is made on .

With this definition, from the presented trial conduct is then the realised value of . We denote the sample space for , for any , given , and , by . We present the exact definition of in the Appendix.

The probability of a particular outcome is then

where is the probability density function of a multivariate normal distribution with mean and covariance matrix , evaluated at vector . Furthermore

for and . Note that the above integral is presented for notational simplicity as being dimensional. However, using the marginal distribution properties of the multivariate normal distribution, it can be reduced immediately by removing integrals, and the corresponding elements of , for which the range of integration is . These correspond exactly to those designated to be by the trial’s formal conduct above.

Denoting the trial’s (random) required sample size by , the expected sample size (ESS) can then be computed as


The probabilities of committing an $a$-generalised type-I familywise error, and of rejecting at least $b$ of , respectively, are



Equations (2.1)-(2.3) together provide the operating characteristics of candidate designs that are typically required for the determination of stopping boundaries and group sizes. Therefore, we have technically described all that is required for the determination of $abcd$-MAMS designs. However, particularly when the number of stages or treatment arms is large, it is important to be as efficient as possible in evaluating these characteristics. For this reason, we now describe how one may further improve upon the above in certain routinely faced design scenarios.
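As a rough cross-check on operating characteristics of this kind, they can also be estimated by Monte Carlo simulation. The sketch below is not the exact multivariate normal integration described in this section. It swaps in simulation, and it assumes equal allocation with a common known variance. The function name `simulate_design` and all argument names are hypothetical.

```python
import math
import random


def simulate_design(tau, n, f, e, a, b, d, reps=2000, sigma=1.0, seed=1):
    """Monte Carlo estimates of the ESS and rejection probabilities.

    tau  : true treatment effects for the experimental arms
    n    : patients per arm per stage (equal allocation is assumed)
    f, e : futility and efficacy boundaries, one per analysis
    a    : number of false rejections counted as a generalised type-I error
    b    : number of correct rejections required for 'power'
    d    : the trial stops once d null hypotheses have been rejected
    """
    rng = random.Random(seed)
    n_arms, n_stages = len(tau), len(e)
    se = sigma / math.sqrt(n)            # standard error of one stage-wise mean
    total_n = errors = successes = 0
    for _ in range(reps):
        sums = [0.0] * (n_arms + 1)      # running sums of stage means; 0 = control
        active = list(range(1, n_arms + 1))
        rejected = []
        for j in range(1, n_stages + 1):
            total_n += n * (len(active) + 1)   # active arms plus control recruit
            sums[0] += rng.gauss(0.0, se)
            still_active = []
            for k in active:
                sums[k] += rng.gauss(tau[k - 1], se)
                diff = (sums[k] - sums[0]) / j
                z = diff * math.sqrt(j * n / (2 * sigma ** 2))
                if z > e[j - 1]:
                    rejected.append(k)         # efficacy stop for this arm
                elif z > f[j - 1]:
                    still_active.append(k)     # continue to the next stage
            active = still_active
            if len(rejected) >= d or not active:
                break
        false_rejections = sum(1 for k in rejected if tau[k - 1] <= 0)
        errors += false_rejections >= a
        successes += (len(rejected) - false_rejections) >= b
    return {"ess": total_n / reps, "type1": errors / reps, "power": successes / reps}


if __name__ == "__main__":
    print(simulate_design([0.5, 0.0, 0.0], n=20, f=[0.0, 2.1], e=[2.6, 2.1],
                          a=1, b=1, d=1))
```

Simulation error shrinks only with the square root of the number of replicates, which is one reason an exact evaluation is preferable inside an optimisation loop.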

Specifically, suppose for two treatment arms and , , , that , for , and . Then these arms are interchangeable in the sense that

for if , , , , and , for . This allows us to reduce the number of integrals that must be considered. Here, we explain how this is achieved when and for . Methodology for other scenarios should then be clear.

We place an order upon the outcome for the experimental treatment arms, setting

In the Appendix, we expand on the intuition behind this definition. With it however, we have



where is the degeneracy of scenario given , , , , , and . That is, is the number of possible scenarios in , which could be produced from , that would have equal probability because of the equality of treatment effects in the vector .

To evaluate the efficiency improvements made by accounting for any interchangeability amongst the experimental treatments, one can compare the relative sizes of the sets and . Unfortunately, the values of and do not in general have a simple form. However, in the Appendix we present formulae for their values, along with the , when and when , for the scenario in which early stopping to accept or reject null hypotheses is possible at each analysis (i.e., for ). Therefore as an example, we compare the relative sizes of these sets in this setting when in Table 1. It is clear that the value of becomes substantially smaller than the corresponding as the number of treatments or stages grows. This highlights the importance of utilising this refined approach whenever possible.

                Full outcome set         Reduced outcome set
                (number of stages)       (number of stages)
d    Arms       2       3       4        2       3       4
1    2         12      24      40        8      15      24
2    2         16      36      64       10      21      36
1    3         34      90     188       13      29      54
2    3         58     186     428       18      48     100
3    3         64     216     512       20      56     120
1    4         96     336     880       19      49     104
2    4        200     888    2608       28      90     220
3    4        248    1224    3808       33     116     300
4    4        256    1296    4096       35     126     330
Table 1: A comparison of the sizes of the sets and , for several values of , , and .
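The set sizes in Table 1 can be checked by brute-force enumeration. The sketch below assumes early stopping for futility or efficacy is possible at every analysis, and it records an arm that is still undecided when the trial terminates as not rejected at that analysis. The function names are hypothetical, and sorting the per-arm outcomes is used here only as a stand-in for the ordering defined in the text.

```python
from itertools import product


def outcome_set(n_arms, n_stages, d):
    """All distinct per-arm outcomes (rejected?, final analysis) of the trial."""
    outcomes = set()

    def conduct(stage, state):
        active = [k for k in range(n_arms) if state[k] is None]
        # R = reject, A = accept (do not reject), C = continue to the next stage.
        options = "RA" if stage == n_stages else "RAC"
        for decisions in product(options, repeat=len(active)):
            new_state = list(state)
            for k, decision in zip(active, decisions):
                if decision == "R":
                    new_state[k] = (1, stage)
                elif decision == "A":
                    new_state[k] = (0, stage)
            n_rejected = sum(s[0] for s in new_state if s is not None)
            if n_rejected >= d or None not in new_state:
                # Trial stops: undecided arms end here without rejection.
                outcomes.add(tuple(s if s is not None else (0, stage)
                                   for s in new_state))
            else:
                conduct(stage + 1, new_state)

    conduct(1, [None] * n_arms)
    return outcomes


def reduced_outcome_set(n_arms, n_stages, d):
    """Outcomes up to relabelling of interchangeable arms."""
    return {tuple(sorted(o)) for o in outcome_set(n_arms, n_stages, d)}


if __name__ == "__main__":
    for n_arms in (2, 3):
        for d in range(1, n_arms + 1):
            full = [len(outcome_set(n_arms, j, d)) for j in (2, 3)]
            reduced = [len(reduced_outcome_set(n_arms, j, d)) for j in (2, 3)]
            print(d, n_arms, full, reduced)
```

When the trial stops only once every hypothesis has a decision, each arm independently takes one of twice the number of stages outcomes, which gives the powers 4^2 = 16 and 6^3 = 216 seen in the corresponding rows of the table.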

2.4 Optimal generalised multi-arm multi-stage designs

Stopping boundaries and group sizes providing particular type-I and type-II FWERs can be determined for $abcd$-MAMS trial designs in a variety of ways. An adaptation of the original error spending approach to group sequential trial design (Lan and DeMets, 1983) could be employed, or a functional form could be assumed for the boundaries. In the Appendix, we describe how the triangular test of Whitehead and Stratton (1983) can be modified for $abcd$-MAMS designs. Here, our focus will be on determining optimised stopping boundaries, in order to attempt to maximise trial efficiency.

We suppose that and have been specified for and . Furthermore, we assume we desire to allow early stopping for either efficacy or futility at each analysis, and that values for , , , , , , , , , and , have all been designated. Adapting Wason et al. (2012), our goal will be, for , to find the solution to the following optimisation problem

subject to

for . Note that other optimality criteria could be treated similarly. Additionally, in general it is wise to enforce that , since the optimal design in the case would be the one converging toward a single-stage design.

The complexity of the above constraints prevents their direct incorporation into the search procedure. Therefore, we translate our problem to search for the solution to

We will refer to the function in the above, which we attempt to minimise, as the objective function. Here, , and the final factors therefore penalise designs not conforming to the desired operating characteristics. Following Wason and Jaki (2012), we set , the sample-size required by a corresponding single-stage design.
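To illustrate the idea of replacing hard constraints with penalties, the sketch below shows one possible penalised objective. It is not necessarily the exact expression used here: the additive form, the scaling of each violation by its target, and every name in the code are assumptions made for the example.

```python
def penalised_objective(ess, type1, power, alpha, beta, n_single):
    """Expected sample size plus penalties for violated error constraints.

    ess      : expected sample size of the candidate design
    type1    : achieved generalised type-I FWER
    power    : achieved familywise power
    alpha    : target type-I FWER
    beta     : target type-II error rate (required power is 1 - beta)
    n_single : sample size of a corresponding single-stage design, used as
               the penalty scale
    """
    # Each penalty is zero when its constraint is met and grows with the violation.
    type1_excess = max(0.0, type1 - alpha) / alpha
    power_shortfall = max(0.0, (1.0 - beta) - power) / beta
    return ess + n_single * (type1_excess + power_shortfall)


if __name__ == "__main__":
    # A design meeting both constraints is scored by its ESS alone.
    print(penalised_objective(100.0, 0.05, 0.80, alpha=0.05, beta=0.2, n_single=150))
    # A design that overshoots the type-I target is penalised.
    print(penalised_objective(100.0, 0.06, 0.80, alpha=0.05, beta=0.2, n_single=150))
```

Scaling the penalty by a single-stage sample size keeps it on the same scale as the expected sample size, so that even a modest constraint violation outweighs any plausible saving in sample size.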

Note that is treated in this search as a continuous parameter. This is for convenience given that the majority of available software for optimisation is applicable only to continuous search spaces. Denoting the solution to the above problem by , then, as discussed in Wason (2015), in the likely event that , a choice must be made. Explicitly, define and . The optimal integer could then either be directly designated as , increasing the likelihood that the FWP requirement is met. Alternatively, the optimisation routine could be re-run with constrained in turn to and (i.e., the optimal for should be identified), and the design with the smallest value of the objective function amongst these two solutions taken as the true optimal design. This latter procedure is of most importance when is relatively large, when it is likely that the performance of the constrained optimal designs will be substantially different.
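The second of these options can be sketched as follows. The helper `optimise_given_n` is hypothetical: it stands for a re-run of the search with the group size fixed at an integer value, returning the smallest objective function value found.

```python
import math


def best_integer_group_size(n_star, optimise_given_n):
    """Choose between the floor and ceiling of a continuous optimal group size.

    n_star           : continuous group size returned by the unconstrained search
    optimise_given_n : hypothetical callable that re-optimises the remaining
                       design parameters for a fixed integer group size and
                       returns the best objective value found
    """
    # A set avoids evaluating the same value twice when n_star is an integer.
    candidates = sorted({math.floor(n_star), math.ceil(n_star)})
    results = [(optimise_given_n(n), n) for n in candidates]
    best_value, best_n = min(results)
    return best_n, best_value


if __name__ == "__main__":
    # Toy objective with its continuous minimum at 10.3.
    print(best_integer_group_size(10.3, lambda n: (n - 10.3) ** 2))
```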

Finally, whilst the above specifies how a search can be performed, it does not enforce that a particular search algorithm be utilised. However, given the search space is likely to contain many local optima, a stochastic search routine is preferable (Wason and Jaki, 2012). Here, we utilise the GA package in R (Scrucca, 2013; Scrucca, 2016), which allows the parallel evaluation of several candidate designs. In combination with the use of the package mvtnorm (Genz et al., 2016) for evaluating the requisite multivariate normal integrals, this allows for the efficient determination of optimised designs. Of importance is that GA requires the designation of values for several control parameters. We discuss our choices for these in the Appendix. Code to replicate our results is available upon request.

3 Example: TAILoR

To explore the effect the parameters $a$, $b$, $c$, and $d$ can have upon a design's operating characteristics, we re-consider the TAILoR (TelmisArtan and InsuLin Resistance in HIV) trial (Magirr et al., 2012). In this trial, three experimental treatments, corresponding to different doses of Telmisartan, were compared to a shared control. Therefore, we set , and as in Magirr et al. (2012), we suppose that , , , , and for . To allow utilisation of Equations (2.4)-(2.6), we further suppose that for and (implying and ).

In Table 2 and Table 3, we summarise the operating characteristics of determined optimal designs for and , considering all possible combinations of , , and