Efficient determination of optimised multi-arm multi-stage experimental designs with control of generalised error-rates
Running Head: Generalised multi-arm multi-stage experimental designs.
Abstract: Primarily motivated by the drug development process, several publications have now presented methodology for the design of multi-arm multi-stage experiments with normally distributed outcome variables of known variance. Here, we extend these past considerations to allow the design of what we refer to as a generalised multi-arm multi-stage experiment. We provide a proof of how strong control of the generalised type-I familywise error-rate can be ensured. We then describe how to attain the power to reject at least a specified number out of a given number of false hypotheses, which is related to controlling the generalised type-II familywise error-rate. Following this, we detail how a design can be optimised for a scenario in which rejection of a pre-specified number of null hypotheses brings about termination of the experiment. We achieve this by proposing a highly computationally efficient approach for evaluating the performance of a candidate design. Finally, using a real clinical trial as a motivating example, we explore the effect of the design’s control parameters on the statistical operating characteristics.
Keywords: Clinical trial; Familywise error-rate; Group sequential; Interim analysis; Multi-arm multi-stage; Treatment selection.
Address correspondence to M. J. Grayling, MRC Biostatistics Unit, Forvie Site, Robinson Way, Cambridge CB2 0SR, UK; Fax: +44(0)1223330365; Email: mjg211@cam.ac.uk.
1 Introduction
The statistical literature contains much research on the design and analysis of two-arm group sequential experiments, with the primary motivation often being the desire to improve the efficiency of clinical research (see, for example, Jennison and Turnbull (2000)). Recently, several papers have sought to extend these methods to multi-arm multi-stage (MAMS) designs, which allow multiple experimental arms to be compared to a single control arm. Assuming outcome data to be normally distributed with known variance, Magirr et al. (2012) established how the type-I familywise error-rate (FWER) could be strongly controlled in a parallel arm MAMS design. Allowing early stopping to both reject and accept null hypotheses, with any number of observations accrued in each arm during each stage, they also described how commonly utilised stopping boundaries for two-arm designs could be extended to the multi-arm setting.
Additional recommendations on how to design MAMS experiments have been presented (Wason et al., 2016), along with considerations on other outcome data types and the incorporation of covariates (Jaki and Magirr, 2013). An approach based on simulation was also advanced for the optimisation of a MAMS design (Wason and Jaki, 2012), whilst Ghosh et al. (2017) recently described an effective method for overcoming issues of computational intractability inherent in the approach of Magirr et al. (2012).
Thus, much methodology is now available for the design of MAMS experiments. However, each of these presentations sought to control the conventional type-I FWER, and to provide power to reject a particular null hypothesis under the so-called least favourable configuration (LFC). These operating characteristics are a logical starting point, given their typical requirement in late phase clinical trials. Nonetheless, allowing for more flexible error control would be highly advantageous. Specifically, there has been increased interest in recent years in experimental designs that control the generalised type-I (Lehmann and Romano, 2005; Romano and Shaikh, 2006) and type-II FWERs (Delorme et al., 2016). These constraints are useful in scenarios where classical error control leads to infeasible or undesirable required sample sizes.
Moreover, each of the above papers assumed a so-called simultaneous stopping rule, in which an experiment would be terminated as soon as a single null hypothesis was rejected. Urach and Posch (2016) recently provided a detailed assessment of the relative merits of simultaneous stopping in comparison to a separate stopping rule. In the latter approach, an experiment is continued until a decision is made for every null hypothesis. These two rules are extreme ones, particularly when the number of arms is large. In many instances we may wish to continue our investigations until some intermediate number of null hypotheses have been rejected. To date though, no methodology has been presented to facilitate this.
Within the context of clinical research, generalised error-rate control and more flexible trial stopping rules may be of most relevance in phase II trials. Explicitly, randomised designs are increasing in use (Ivanova et al., 2016) and popularity in phase II (Sharma et al., 2011; Jung, 2013), but their employment has been hindered because of associated sample size requirements (Pond and Abassi, 2011). Controlling generalised error-rates could assist in overcoming this problem. Furthermore, if many treatments are compared in a MAMS phase II trial, we may not wish to identify the single seemingly most efficacious treatment, but some larger number to carry forward for further exploration.
Therefore, here, we provide methodology for the design of MAMS experiments with control of generalised error criteria under highly adaptable stopping rules. We describe how strong control of the generalised type-I FWER can be achieved. Next, we describe how power to reject at least a specified number out of a given number of false hypotheses can be attained. Following this, we detail how a design can be optimised for a scenario in which rejection of a pre-specified number of null hypotheses will bring about termination of the experiment. To facilitate obtaining such designs in practice, we also detail an efficient algorithm for evaluating the key operating characteristics of any candidate design. We conclude with an example based on a real trial, and with a short discussion.
2 Methods
2.1 Hypotheses, analysis, and stopping boundaries
Although our methodology is relevant to many types of experiment, from here we pose our problem within the context of designing a clinical trial. Thus, our goal will be to test whether several experimental treatments are superior to some common control regimen.
First, define the sets and , for any . Additionally, for any , , and , define to be the vector formed by repeating , times. For simplicity, . We will also assume the convention that , and take as the matrix formed by placing the elements of along the leading diagonal (i.e., , if , ). Finally, will signify the indicator function on event .
Now, we suppose the trial has experimental treatments present initially, and let , , be the mean response on the experimental treatments, and be the mean response on the control. Our hypotheses of interest are then , . As data are accrued over the course of the trial, this family of null hypotheses will be tested at a series of analyses indexed by , for some . We suppose that at any interim analysis , to test , responses will be available from patients on treatments . We enforce that , such that is the group size in the first stage for the control arm, and the are allocation ratios relative to . We denote the set of all , given , by . Note that by this definition of , for now we require only that . We return later to discuss how we can ensure .
The test statistics (, ) will be used, where . Here, we have initially explicitly stated the dependence upon , but will routinely omit it in what follows for brevity.
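As a minimal illustration of the kind of standardised test statistic used here, the Python sketch below computes a Wald statistic comparing one experimental arm with control when the outcome variances are known. The function name `wald_z` and its arguments are hypothetical, and the form shown is the standard known-variance two-sample statistic, assumed rather than taken from the definition above.

```python
from math import sqrt

def wald_z(mean_trt, mean_ctrl, n_trt, n_ctrl, sd_trt, sd_ctrl):
    """Standardised difference in sample means with known standard deviations.

    Assumed form: (mean_trt - mean_ctrl) / sqrt(sd_trt^2/n_trt + sd_ctrl^2/n_ctrl),
    where n_trt and n_ctrl are the cumulative sample sizes at the analysis.
    """
    standard_error = sqrt(sd_trt ** 2 / n_trt + sd_ctrl ** 2 / n_ctrl)
    return (mean_trt - mean_ctrl) / standard_error

# Example: 50 patients per arm, unit variance, observed mean difference of 1.
z = wald_z(1.0, 0.0, 50, 50, 1.0, 1.0)
```

At each interim analysis the same calculation would be repeated with the cumulative means and sample sizes available at that point.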
Similarly, we set and for . Moreover, take and , for and . Then, has a multivariate normal distribution (Jennison and Turnbull, 2000) with
As discussed, we suppose that we would like our design to strongly control the generalised type-I FWER, the probability of incorrectly rejecting at least true null hypotheses, to some level . Furthermore, we extend the typically employed LFC considerations of Dunnett (1984) to the requirement of Dunnett and Tamhane (1992), and assume it is desired that the out of familywise power (FWP) be at least , for some and . That is, without loss of generality, the probability of rejecting at least of the null hypotheses , when and for and , must be at least . Here, is interpreted as an effect size at which it would be of interest to consider a treatment further, whilst if the treatment effect is below then this arm is not worth considering further. Note that this FWP requirement is directly related to the generalised type-II FWER defined in Delorme et al. (2016), who use the notation and , where we use and . In what follows, we set .
Design determination is also achieved supposing that the cumulative rejection of or more null hypotheses, by any analysis , will bring about the termination of the trial. With this, and correspond to the simultaneous and separate stopping rules respectively, whilst represents a new previously unconsidered stopping rule.
We refer to our designs from here as MAMS designs. The design considered in Magirr et al. (2012), for example, is then a 1111MAMS design. Explicitly, in a 1111MAMS design, the conventional FWER is strongly controlled, the trial is terminated as soon as any null hypothesis is rejected, and power is provided to reject a particular treatment’s null hypothesis when only it is worthy of further consideration. Jung (2008) explored, in the context of exact binomial tests, a 111MAMS design. With this, power is instead provided to reject at least one null hypothesis amongst the entire family, when all treatments have an effectiveness considered interesting.
We define our futility (non-rejection) and efficacy (rejection) stopping boundaries as and respectively. Allowing for infinite stopping boundaries, which preclude early decisions to reject or accept null hypotheses if so desired, we denote the set of all possible pairs by . Here, is enforced so that the trial terminates after at most stages as desired.
Finally, we define our trial's formal conduct using two vectors, and , as follows:

1. Set and .

2. Conduct stage of the trial, and compute the .

3. For such that for (with the convention for ):

   (a) If reject , setting and , and designating that for .

   (b) If do not reject , setting , and designating that for .

4. If and , set and return to 2. Else stop the trial, and for each with , set , and designate that for .
Here, if fewer than null hypotheses have been rejected, and a decision has not been made for all null hypotheses, step 4 leads to a return to step 2.
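The conduct described above can be sketched in Python as follows. This is an illustrative sketch only: the function name `run_trial` and its arguments are hypothetical, and the decision rules (reject when a statistic exceeds the efficacy boundary, stop for futility when it is at or below the futility boundary) follow the usual group sequential convention, which is assumed here. The two returned lists play the roles of the rejection indicators and of the analyses at which each arm's outcome was fixed.

```python
def run_trial(z, f, e, d):
    """Apply the stopping rule to a J x K table of test statistics.

    z[j][k] is the statistic for arm k at analysis j (0-indexed), f and e are the
    futility and efficacy boundaries, and d is the cumulative number of rejections
    that terminates the trial. The boundary comparisons are assumptions.
    """
    if f[-1] != e[-1]:
        raise ValueError("final boundaries must coincide so every arm is decided")
    n_stages, n_arms = len(z), len(z[0])
    psi = [0] * n_arms     # 1 if the arm's null hypothesis is rejected
    omega = [0] * n_arms   # analysis (1-indexed) at which the arm's outcome is fixed
    active = set(range(n_arms))
    for j in range(n_stages):
        for k in sorted(active):
            if z[j][k] > e[j]:        # efficacy: reject and drop the arm
                psi[k], omega[k] = 1, j + 1
                active.discard(k)
            elif z[j][k] <= f[j]:     # futility: do not reject and drop the arm
                omega[k] = j + 1
                active.discard(k)
        if sum(psi) >= d or not active:
            for k in active:          # trial stops with no decision for these arms
                omega[k] = j + 1
            break
    return psi, omega

z = [[0.5, 3.0, -1.0], [2.5, 0.0, 0.0]]
print(run_trial(z, f=[0.0, 2.0], e=[2.8, 2.0], d=1))
print(run_trial(z, f=[0.0, 2.0], e=[2.8, 2.0], d=2))
```

With `d=1` the trial stops at the first analysis because one hypothesis is rejected there; with `d=2` the remaining arm continues to the second analysis.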
2.2 Strong control of the generalised type-I familywise error-rate
To demonstrate how strong control of the generalised type-I FWER can be achieved, we extend the arguments of Magirr et al. (2012). First, for , , and , define
Next, take . That is, is the set of all ordered subsets of with .
Then, if for , the event that at least of fail to be rejected is equivalent to
taking , where is the whole sample space.
Lemma 2.1.
For any
Proof.
See the Appendix of Magirr et al. (2012). ∎
Theorem 2.2.
For any , .
Proof.
Thus, by Theorem 2.2, the generalised type-I FWER can be controlled in the strong sense to level by ensuring .
2.3 Design operating characteristics
In this section, we describe an efficient approach for the evaluation of the key performance characteristics of any design.
Ghosh et al. (2017) discussed how the approach utilised in Magirr et al. (2012) is prohibitively computationally expensive even for moderate values of the number of stages or treatments. By working with the score statistics, and employing recently developed techniques for the efficient evaluation of multivariate normal integrals, they provided a method which was considerably more efficient. However, their approach makes no allowance for the fast evaluation of expected sample sizes, as it is geared towards assessing the probability one of the null hypotheses is rejected, and does not monitor the outcome for each arm. It is also specialised to the case . Though it could in theory be extended to designs, to allow the optimisation of the stopping boundaries and group sizes in Section 2.4, we here propose a different approach based on the distribution of , the standardised Wald test statistics.
We denote the possible outcomes of the trial by , where

, with if is rejected, and otherwise,

, with if is the interim analysis at which is rejected or accepted, or the whole trial is stopped and no decision is made on .
With this definition, from the presented trial conduct is then the realised value of . We denote the sample space for , for any , given , and , by . We present the exact definition of in the Appendix.
The probability of a particular outcome is then
where is the probability density function of a multivariate normal distribution with mean and covariance matrix , evaluated at vector . Furthermore
for and . Note that the above integral is presented for notational simplicity as being dimensional. However, using the marginal distribution properties of the multivariate normal distribution, it can be reduced immediately by removing integrals, and the corresponding elements of , for which the range of integration is . These correspond exactly to those designated to be by the trial’s formal conduct above.
Denoting the trial’s (random) required sample size by , the expected sample size (ESS) can then be computed as
(2.1) 
Whilst the probabilities of committing a generalised type-I familywise error, and of rejecting at least of , respectively, are
(2.2)  
(2.3) 
for
Equations (2.1)-(2.3) together provide the operating characteristics of candidate designs that are typically required for the determination of stopping boundaries and group sizes. Therefore, we have technically described all that is required for the determination of MAMS designs. However, particularly when the number of stages or treatment arms is large, it is important to be as efficient as possible in evaluating these characteristics. For this reason, we now describe how one may further improve upon the above in certain routinely faced design scenarios.
Specifically, suppose for two treatment arms and , , , that , for , and . Then these arms are interchangeable in the sense that
for if , , , , and , for . This allows us to reduce the number of integrals that must be considered. Here, we explain how this is achieved when and for . Methodology for other scenarios should then be clear.
We place an order upon the outcome for the experimental treatment arms, setting
In the Appendix, we expand on the intuition behind this definition. With it however, we have
(2.4)  
(2.5)  
(2.6) 
for
where is the degeneracy of scenario given , , , , , and . That is, is the number of possible scenarios in , which could be produced from , that would have equal probability because of the equality of treatment effects in the vector .
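To make the idea of degeneracy concrete, the Python sketch below counts the distinct arm-wise arrangements of an ordered outcome in the special case where every experimental arm is interchangeable with every other. In that case the count is a multinomial coefficient. The function name `degeneracy` and the representation of an outcome as a tuple of (rejected, analysis) pairs are hypothetical. When only some arms share a treatment effect, the count would instead have to be taken within each group of interchangeable arms, which is not shown.

```python
from collections import Counter
from math import factorial

def degeneracy(ordered_outcome):
    """Number of distinct arrangements of the per-arm outcomes across arms.

    Assumes all arms are interchangeable, so the count is
    K! / (m_1! * m_2! * ...), with m_i the multiplicity of each distinct
    per-arm outcome.
    """
    count = factorial(len(ordered_outcome))
    for multiplicity in Counter(ordered_outcome).values():
        count //= factorial(multiplicity)
    return count

# Two arms dropped for futility at analysis 1, one arm rejected at analysis 2:
print(degeneracy(((0, 1), (0, 1), (1, 2))))  # 3
```

Weighting the probability of each ordered outcome by this count is what allows a sum over the smaller ordered set to stand in for the sum over all outcomes.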
To evaluate the efficiency improvements made by accounting for any interchangeability amongst the experimental treatments, one can compare the relative sizes of the sets and . Unfortunately, the values of and do not in general have a simple form. However, in the Appendix we present formulae for their values, along with the , when and when , for the scenario in which early stopping to accept or reject null hypotheses is possible at each analysis (i.e., for ). Therefore as an example, we compare the relative sizes of these sets in this setting when in Table 1. It is clear that the value of becomes substantially smaller than the corresponding as the number of treatments or stages grows. This highlights the importance of utilising this refined approach whenever possible.
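Table 1. Number of possible trial outcomes in the full outcome set, and in the reduced set obtained by exploiting interchangeability of the experimental treatments, by the number of rejections that terminates the trial, the number of experimental treatments, and the number of stages.
Rejections to stop  Treatments  Full (2 stages)  Full (3 stages)  Full (4 stages)  Reduced (2 stages)  Reduced (3 stages)  Reduced (4 stages)
1  2  12  24  40  8  15  24
2  2  16  36  64  10  21  36
1  3  34  90  188  13  29  54
2  3  58  186  428  18  48  100
3  3  64  216  512  20  56  120
1  4  96  336  880  19  49  104
2  4  200  888  2608  28  90  220
3  4  248  1224  3808  33  116  300
4  4  256  1296  4096  35  126  330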
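The set sizes discussed above can be checked by brute force for small designs. The Python sketch below enumerates every reachable outcome when each arm may be rejected, accepted, or continued at each interim analysis and must be decided at the final one, and then collapses outcomes that differ only by a relabelling of the arms. It is a sketch built from the trial conduct as described in Section 2.1 and assumes all arms are interchangeable; it is not the closed-form approach given in the Appendix, and the function names are hypothetical.

```python
from itertools import product

def enumerate_outcomes(d, n_arms, n_stages):
    """Return the set of reachable outcomes, each a tuple of (psi_k, omega_k) pairs."""
    outcomes = set()

    def recurse(stage, decided, n_rejected):
        active = [k for k in range(n_arms) if k not in decided]
        # R = reject, A = accept, C = continue (not allowed at the final analysis)
        choices = "RA" if stage == n_stages else "RAC"
        for moves in product(choices, repeat=len(active)):
            new, rejected = dict(decided), n_rejected
            for k, move in zip(active, moves):
                if move == "R":
                    new[k] = (1, stage)
                    rejected += 1
                elif move == "A":
                    new[k] = (0, stage)
            if rejected >= d or len(new) == n_arms:
                # Trial stops: undecided arms are recorded as not rejected here.
                for k in active:
                    new.setdefault(k, (0, stage))
                outcomes.add(tuple(new[k] for k in range(n_arms)))
            else:
                recurse(stage + 1, new, rejected)

    recurse(1, {}, 0)
    return outcomes

def reduce_outcomes(outcomes):
    """Collapse outcomes that are permutations of one another (arms interchangeable)."""
    return {tuple(sorted(outcome)) for outcome in outcomes}

full = enumerate_outcomes(d=1, n_arms=2, n_stages=2)
print(len(full), len(reduce_outcomes(full)))  # 12 8
```

The printed values agree with the first row of Table 1. The enumeration grows quickly with the number of arms and stages, so it is intended only as a check on small cases.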
2.4 Optimal generalised multiarm multistage designs
Stopping boundaries and group sizes providing particular type-I and type-II FWERs can be determined for MAMS trial designs in a variety of ways. An adaptation of the original error spending approach to group sequential trial design (Lan and DeMets, 1983) could be employed, or a functional form could be assumed for the boundaries. In the Appendix, we describe how the triangular test of Whitehead and Stratton (1983) can be modified for MAMS designs. Here, our focus will be on determining optimised stopping boundaries, in order to attempt to maximise trial efficiency.
We suppose that and have been specified for and . Furthermore, we assume we desire to allow early stopping for either efficacy or futility at each analysis, and that values for , , , , , , , , , and , have all been designated. Adapting Wason et al. (2012), our goal will be, for , to find the solution to the following optimisation problem
subject to
for . Note that other optimality criteria could be treated similarly. Additionally, in general it is wise to enforce that , since the optimal design in the case would be the one converging toward a singlestage design.
The complexity of the above constraints prevents their direct incorporation into the search procedure. Therefore, we translate our problem to search for the solution to
We will refer to the function in the above, which we attempt to minimise, as the objective function. Here, , and the final factors therefore penalise designs not conforming to the desired operating characteristics. Following Wason and Jaki (2012), we set , the sample size required by a corresponding single-stage design.
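The general idea of replacing hard constraints with penalty terms can be sketched in Python as below. The exact penalty terms of the objective function above are not reproduced here; the form shown, in which the expected sample size is inflated in proportion to the relative violation of each constraint, is an assumption chosen for illustration, and the function and argument names are hypothetical.

```python
def penalised_objective(ess, fwer, power, alpha, beta, penalty):
    """Expected sample size plus penalties for violated error constraints.

    Assumed form (not the source's exact expression): a design is penalised
    only when its error-rate exceeds alpha or its power falls below 1 - beta,
    in proportion to the relative size of the violation.
    """
    excess_error = max(0.0, fwer - alpha) / alpha
    power_shortfall = max(0.0, (1.0 - beta) - power) / beta
    return ess + penalty * (excess_error + power_shortfall)

# A design meeting both constraints is scored by its expected sample size alone.
print(penalised_objective(ess=100.0, fwer=0.04, power=0.92,
                          alpha=0.05, beta=0.1, penalty=200.0))
```

Setting the penalty weight to the single-stage sample size, as described above, keeps the penalty on the same scale as the quantity being minimised.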
Note that is treated in this search as a continuous parameter. This is for convenience, given that the majority of available optimisation software is applicable only to continuous search spaces. Denoting the solution to the above problem by , then, as discussed in Wason (2015), in the likely event that , a choice must be made. Explicitly, define and . The optimal integer could then be directly designated as , increasing the likelihood that the FWP requirement is met. Alternatively, the optimisation routine could be rerun with constrained in turn to and (i.e., the optimal for should be identified), and the design with the smallest value of the objective function amongst these two solutions taken to be the true optimal design. This latter procedure is of most importance when is relatively large, when it is likely that the performance of the constrained optimal designs will be substantially different.
Finally, whilst the above specifies how a search can be performed, it does not enforce that a particular search algorithm be utilised. However, given the search space is likely to contain many local optima, a stochastic search routine is preferable (Wason and Jaki, 2012). Here, we utilise the GA package in R (Scrucca, 2013; Scrucca, 2016), which allows the parallel evaluation of several candidate designs. In combination with the use of the package mvtnorm (Genz et al., 2016) for evaluating the requisite multivariate normal integrals, this allows for the efficient determination of optimised designs. Of importance is that GA requires the designation of values for several control parameters. We discuss our choices for these in the Appendix. Code to replicate our results is available upon request.
3 Example: TAILoR
To explore the effect the parameters , , , and can have upon a design's operating characteristics, we reconsider the TAILoR (TelmisArtan and InsuLin Resistance in HIV) trial (Magirr et al., 2012). In this trial, three experimental treatments, corresponding to different doses of Telmisartan, were compared to a shared control. Therefore, we set , and as in Magirr et al. (2012), we suppose that , , , , and for . To allow utilisation of Equations (2.4)-(2.6), we further suppose that for and (implying and ).