# A Benchmark for Dose Finding Studies with Continuous Outcomes

###### Abstract

An important tool for evaluating the performance of a dose finding design is the optimal benchmark proposed by O’Quigley and others (2002, Biostatistics 3(1), 51-56), which provides an upper bound on the performance of a design under a given scenario. The original benchmark can be applied only to dose finding studies with a binary endpoint. Interest in dose finding studies involving continuous outcomes is growing, yet no benchmark for such studies has been developed. We show that the original benchmark and its extension by Cheung (2014, Biometrics 70(2), 389-397), when looked at from a different perspective, can be generalized to various settings with several discrete and continuous outcomes. We illustrate and compare the benchmark performance in the setting of a Phase I clinical trial with a continuous toxicity endpoint and in the setting of a Phase I/II clinical trial with a continuous efficacy outcome. We show that the proposed benchmark provides an accurate upper bound for model-based dose finding methods and serves as a powerful tool for evaluating designs.

Keywords: Continuous endpoint; Dose finding; Non-parametric optimal design; Phase I; Phase I/II

## 1 Introduction

A variety of dose finding methods for Phase I clinical trials aiming to find the maximum tolerated dose (MTD) have been proposed in the literature over the past three decades. The conventional way to assess the performance of a design is to conduct an extensive simulation study. One of the key characteristics of any dose finding method is its accuracy, which is usually computed as the proportion of times the correct dose is selected. The majority of novel proposals are studied in scenarios chosen by the investigators themselves. This clearly adds subjectivity to the assessment of the method’s operating characteristics, as one can always find scenarios in which the MTD identification is easier than in others. To solve this problem, O’Quigley and others (2002) proposed the non-parametric optimal benchmark, which provides an upper limit on accuracy (in terms of the proportion of correct selections) for dose finding methods based on a binary toxicity endpoint. The benchmark uses the concept of complete information, which assumes that the outcomes of each patient can be observed at all dose levels (in contrast to an actual trial, in which each patient can be assigned to one dose only). The benchmark shows how ‘difficult’ the MTD identification is in the chosen scenario and provides an objective context for evaluating the performance of the design under investigation. Since its proposal, the benchmark has proven useful for assessing newly proposed designs comprehensively (see e.g. Paoletti and Kramar, 2009; Yin and Yuan, 2009). Additionally, based on the benchmark, Cheung (2013) derived sample size formulae for the continual reassessment method (CRM) by O’Quigley and others (1990).

The benchmark was originally proposed for studies with a binary endpoint. Motivated by more complex studies, for instance, Phase I/II clinical trials evaluating binary toxicity and efficacy endpoints simultaneously (Thall and Russell, 1998) or Phase I trials with multiple grades of toxicities (Lee and others, 2011), Cheung (2014) generalized the benchmark to both of these cases. This has broadened the application of the benchmark significantly. However, a growing number of Phase I and Phase I/II clinical trials involve continuous endpoints, and no corresponding benchmark exists yet. For example, Bekele and Thall (2004); Yuan and others (2007); Ivanova and Kim (2009); Bekele and others (2010); Ezzalfani and others (2013); Wang and Ivanova (2015) considered a continuous toxicity endpoint, while Bekele and Shen (2005); Hirakawa (2012); Yeung and others (2015, 2017) studied Phase I/II trials with binary toxicity and continuous efficacy endpoints.

In this work, we propose a simple benchmark which can be applied to dose finding studies with continuous outcomes. The novel benchmark employs the same concept of complete information as the original method and is based on the well-known probability integral transform. This general method also makes it possible to construct a benchmark for designs with multiple correlated outcomes and several treatment cycles. We show that evaluating the novel benchmark requires no information beyond what is already provided in the simulation study of a design. We apply the novel benchmark to evaluate the performance of two recently proposed dose finding methods: a design for a Phase I trial with a continuous toxicity endpoint and a design for a Phase I/II trial with binary toxicity and continuous efficacy endpoints.

In Section 2, we review the original benchmark and propose its generalization. We compare design proposals for Phase I and Phase I/II trials to the benchmark in Section 3 and conclude with a discussion in Section 4.

## 2 Methods

### 2.1 Benchmark for Binary Endpoint

Consider a Phase I clinical trial with a binary toxicity outcome, dose-limiting toxicity (DLT) or no DLT, patients and a discrete set of dose levels . Let be a Bernoulli random variable taking value if patient has experienced no DLT at dose and otherwise. This random variable is characterised by probability such that , . The goal of the trial is to find the maximum tolerated dose (MTD), the dose corresponding to a prespecified risk of toxicity, .

The non-parametric optimal benchmark uses the concept of complete information. For a given patient the complete information consists of the vector of outcomes (DLT or no DLT) at all dose levels assuming that are known. In other words, for a given patient one knows the maximum toxicity probability that this patient can tolerate. Formally, the information about the DLT of patient at each dose level is summarised in a single value , which is drawn from a uniform distribution, . For instance, means that patient can tolerate doses with , but would experience a DLT if given dose with . It follows that is transformed to for doses with and to otherwise. The procedure is repeated for all patients, which results in the vector of responses for each dose level , . Let be a summary statistic for the dose level upon which the decision about the MTD selection is based. Conventionally, is chosen such that its minimum (or maximum) value corresponds to the estimated MTD. Therefore, for which is minimised (maximised) for all is declared as the MTD in a single trial. The procedure is repeated for simulated trials, and then the proportion of trials in which each dose is selected as the MTD is computed.
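The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the scenario, target, sample size and number of trials are hypothetical values, the summary statistic is assumed to be the absolute distance between the observed DLT rate and the target, and ties are broken in favour of the lowest dose.

```python
import random

def binary_benchmark(tox_probs, target, n_patients, n_trials, seed=0):
    """Proportion of simulated trials in which each dose is selected as the MTD."""
    rng = random.Random(seed)
    selections = [0] * len(tox_probs)
    for _ in range(n_trials):
        # One profile per patient, drawn from a uniform distribution on [0, 1).
        profiles = [rng.random() for _ in range(n_patients)]
        # Complete information: a patient experiences a DLT at every dose whose
        # toxicity probability is at least as large as the patient's profile.
        dlt_rates = [
            sum(u <= p for u in profiles) / n_patients for p in tox_probs
        ]
        # Assumed summary statistic: distance of the DLT rate from the target.
        distances = [abs(rate - target) for rate in dlt_rates]
        # index(min(...)) breaks ties in favour of the lowest dose (assumption).
        selections[distances.index(min(distances))] += 1
    return [count / n_trials for count in selections]

# Hypothetical scenario: dose 3 has toxicity probability equal to the target.
scenario = [0.05, 0.12, 0.25, 0.40, 0.55]
props = binary_benchmark(scenario, target=0.25, n_patients=30, n_trials=2000)
print([round(p, 3) for p in props])
```

The same profile is reused across all dose levels within a patient, which is what distinguishes the benchmark from a simulation of an actual trial.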

### 2.2 Benchmark for Continuous Endpoint

Consider now a Phase I clinical trial with continuous outcome at dose for patient having cumulative distribution function (CDF) . The goal of the trial is to find the target dose (TD) which minimises (or maximises, as defined by the investigator) some decision criterion . In simulations the CDF, , is chosen by the investigator and specifies the distribution of outcomes for a given dose , and the set of CDFs corresponding to doses defines a simulation scenario. This simple fact is central to our proposal. To illustrate the construction of the novel benchmark step by step, we use a setting studied by Wang and Ivanova (2015) throughout this section.

###### Example 1.

Wang and Ivanova (2015) considered a setting with doses and a biomarker for toxicity measured on a continuous scale. In one of the simulation scenarios presented, it is assumed that a toxicity outcome given dose level has a normal distribution , . Then, the CDF is the CDF of a normal random variable with the corresponding parameters . These CDFs will be used to obtain the benchmark in this scenario.

Let us denote the quantile transformation of a CDF $F$ as

$$F^{-1}(u) = \inf\{y : F(y) \geq u\}, \quad 0 < u < 1. \qquad (2.2)$$

Then,

###### Probability integral transform.

If $U$ is a uniform random variable on the unit interval, then $F$ is the cumulative distribution function of the random variable $F^{-1}(U)$.

This result is commonly used for inverse transform sampling (e.g. see Bekele and Shen, 2005, for an example in dose finding), which allows one to generate a random variable with any distribution .
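A one-line justification of the result may be helpful here. The symbols $F$ (a CDF), $F^{-1}$ (its quantile transformation, taken as the generalised inverse), $U$ (a uniform random variable on the unit interval) and $x$ (an arbitrary real number) are introduced for illustration. For any right-continuous CDF, $F^{-1}(u) \leq x$ holds if and only if $u \leq F(x)$, so that

$$P\left(F^{-1}(U) \leq x\right) = P\left(U \leq F(x)\right) = F(x).$$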

Assume that the whole information about a patient’s profile is summarised in a single value drawn from . For patient with profile , the quantile transformation is applied to obtain a continuous outcome that this patient would have at dose , . Different dose levels are modelled by applying the quantile transformation using corresponding CDFs. This results in a vector of responses , also called the complete information about patient . The same procedure is repeated for all patients which, again, results in the vector of responses for each dose level , .
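As a minimal sketch of this step, the complete information for one patient can be obtained by applying each dose's quantile function to the same profile. The normal scenario below (means increasing by 0.1 per dose, common standard deviation 0.2) is a hypothetical choice, and `complete_information` is an illustrative helper name, not part of any library.

```python
from statistics import NormalDist

def complete_information(profile, dose_distributions):
    """Responses the patient would have at every dose, given one profile in (0, 1)."""
    # inv_cdf is the quantile function; it requires 0 < profile < 1.
    return [dist.inv_cdf(profile) for dist in dose_distributions]

# Hypothetical scenario with six doses.
doses = [NormalDist(mu=0.1 * j, sigma=0.2) for j in range(1, 7)]

# A patient with profile 0.5 sits at the median of every dose's distribution.
responses = complete_information(0.5, doses)
print([round(y, 3) for y in responses])
```

Because a single profile drives all dose levels, a patient with a high profile has a high response at every dose, which is the sense in which the profile summarises the patient.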

###### Example 1 (Continued).

Following the setting by Wang and Ivanova (2015), assume that the first patient has a toxicity profile . The benchmark answers the question “how would patient respond to dose level with response having distribution ?”. Applying the corresponding quantile transformation, the response of patient given the dose level is equal to . Subsequently, the complete information about patient consists of the vector of responses at all dose levels

The complete information for patients with randomly generated profiles is given in Table 1.

| Patient’s profile | Response at dose 1 | Response at dose 2 | Response at dose 3 | Response at dose 4 | Response at dose 5 | Response at dose 6 |
|---|---|---|---|---|---|---|
| | 0.075 | 0.149 | 0.224 | 0.299 | 0.373 | 0.448 |
| | 0.033 | 0.065 | 0.098 | 0.130 | 0.163 | 0.195 |
| | 0.241 | 0.481 | 0.722 | 0.962 | 1.203 | 1.443 |
| | 0.144 | 0.288 | 0.432 | 0.576 | 0.720 | 0.864 |
| | 0.050 | 0.101 | 0.151 | 0.202 | 0.252 | 0.302 |
| Mean | 0.109 | 0.217 | 0.325 | 0.434 | 0.542 | 0.650 |
| Variance | 0.007 | 0.029 | 0.065 | 0.116 | 0.181 | 0.261 |

Recalling the decision criterion on which the TD selection is based, the dose level for which is minimised (or maximised) is declared as the TD in a single trial. For instance, if the goal of the trial is to find the dose having the average level of toxicity , the decision criterion (2.1) can be used. The benchmark can be constructed for various decision criteria and then be adapted to evaluate any design under investigation.

###### Example 1 (Continued).

The goal of the trial considered by Wang and Ivanova (2015) is to find the dose with the mean response closest to the target response . The criterion of choosing the dose which maximises the probability of the average level of toxicity to be in the neighbourhood of was considered. Let be a probability density function of given the data . Then, the decision criterion takes the form

(2.3) |

The TD is the dose for which the criterion is maximised. Following the original framework, and are chosen. Using the complete information generated in Table 1 and the density function of the normal distribution with the corresponding mean and variance parameters yields: and . The value of the criterion is maximised for dose level which is selected as the TD in this single trial. The procedure is repeated for simulated trials to obtain the proportion of correct selections. The evaluation of the method by Wang and Ivanova (2015) using the proposed benchmark is provided in Section 3.1.

Algorithm 1 provides step-by-step guidance on how the benchmark can be constructed based on simulated trials.

The proposed benchmark can be applied to a wide range of distributions as it requires the quantile information only, which is available for many distributions in various statistical software (for example, qbinom, qnorm, qexp, etc. in R (R Core Team, 2015)). Note that the probability integral transform can also be applied to discrete random variables, in which case the quantile transformation is given explicitly. It is easy to see that using the quantile transformation corresponding to a Bernoulli random variable in Algorithm 1 results in the original benchmark construction proposed by O’Quigley and others (2002).
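The construction described in this section can be sketched end to end as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the scenario, the target, the sample size and the number of trials are hypothetical, and the decision criterion is simplified to the absolute distance between the sample mean and the target rather than the criterion in (2.3).

```python
import random
from statistics import NormalDist, fmean

def continuous_benchmark(dose_distributions, criterion, n_patients, n_trials, seed=0):
    """Proportion of simulated trials in which each dose minimises the criterion."""
    rng = random.Random(seed)
    selections = [0] * len(dose_distributions)
    for _ in range(n_trials):
        # Step 1: one uniform profile per patient, kept strictly inside (0, 1)
        # because the normal quantile function is undefined at 0 and 1.
        profiles = [rng.uniform(1e-12, 1 - 1e-12) for _ in range(n_patients)]
        # Step 2: quantile transformation at every dose (complete information).
        responses = [
            [dist.inv_cdf(u) for u in profiles] for dist in dose_distributions
        ]
        # Step 3: evaluate the criterion per dose and select its minimiser.
        values = [criterion(y) for y in responses]
        selections[values.index(min(values))] += 1
    return [count / n_trials for count in selections]

TARGET = 0.3  # hypothetical target mean response

def distance_to_target(responses_at_dose):
    # Assumed criterion: how far the sample mean is from the target.
    return abs(fmean(responses_at_dose) - TARGET)

# Hypothetical scenario: dose 3 has mean response equal to the target.
scenario = [NormalDist(mu=0.1 * j, sigma=0.2) for j in range(1, 7)]
props = continuous_benchmark(scenario, distance_to_target, n_patients=20, n_trials=1000)
print([round(p, 3) for p in props])
```

Passing the criterion as a function keeps the sketch reusable: a different decision rule only requires a different `criterion` argument, while the generation of complete information stays unchanged.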

The novel benchmark can also be applied to clinical trials with multiple endpoints. This construction is provided below.

### 2.3 Benchmark for Multiple Endpoints

In the setting with several endpoints, the correlation between them is important. Below, we describe the algorithm generating correlated outcomes in the benchmark framework. In fact, the approach described below has been known for a long time (Tate, 1955; Molenberghs and others, 2001). We apply it to an arbitrary distribution of outcomes to generate the complete information. We start from the case of binary toxicity and continuous efficacy, which has attracted a lot of attention in the literature recently.

Consider a Phase I/II clinical trial with toxicity outcome and efficacy outcome with CDFs and , respectively, at dose level for patient . We will use the setting studied by Bekele and Shen (2005) to illustrate the construction of the benchmark for multiple endpoints throughout this section.

###### Example 2.

Bekele and Shen (2005) considered a setting with dose levels, an efficacy outcome at dose with a Gamma distribution where is the shape parameter, is the rate parameter (i.e., the mean equals ), and a DLT outcome having probability . In one of the simulation scenarios the following parameters are assumed and . Then, is the CDF of a Bernoulli random variable with parameter and is the CDF of a Gamma random variable with parameters , .

The toxicity/efficacy profile of patient is given by two characteristics: corresponding to toxicity and corresponding to efficacy. Firstly, we generate a bivariate normal vector with zero mean and covariance matrix

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad (2.4)$$

where $\rho$ is the correlation coefficient. In a simulation study, the correlation coefficient, $\rho$, is specified by the investigator as part of the simulation scenario. By applying the CDF of the standard normal random variable to each component, one can obtain two correlated random variables with uniform distributions. Then, the corresponding quantile transformations are applied to and marginally as described in Section 2.2 and values of response for patient at dose levels are obtained , . This results in the complete vector of toxicity and efficacy outcomes at all dose levels for the patient . The procedure is repeated for patients and pairs of vectors and are obtained for each dose level .
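The generation of correlated profiles can be sketched as below. Several choices here are assumptions made for illustration: the correlation of 0.2, the dose-level parameters, and the use of an exponential efficacy distribution (a Gamma distribution with shape 1), which is swapped in only because its quantile function has a closed form. The helper names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def clip(u, eps=1e-12):
    # Keep profiles strictly inside (0, 1) so that quantile functions are defined.
    return min(max(u, eps), 1.0 - eps)

def correlated_profiles(rho, rng):
    """Two uniform profiles obtained from a bivariate normal with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return clip(STD_NORMAL.cdf(z1)), clip(STD_NORMAL.cdf(z2))

def bernoulli_quantile(u, p):
    # Quantile function of a Bernoulli(p) variable: 0 up to 1 - p, then 1.
    return 0 if u <= 1.0 - p else 1

def exponential_quantile(u, mean):
    # Closed-form quantile of an exponential distribution with the given mean.
    return -mean * math.log(1.0 - u)

# Hypothetical scenario with four doses.
tox_probs = [0.01, 0.10, 0.25, 0.60]
eff_means = [25.0, 70.0, 115.0, 127.0]

rng = random.Random(1)
u_tox, u_eff = correlated_profiles(rho=0.2, rng=rng)
tox = [bernoulli_quantile(u_tox, p) for p in tox_probs]
eff = [exponential_quantile(u_eff, m) for m in eff_means]
print(tox, [round(y, 1) for y in eff])
```

Note that the correlation is imposed on the latent normal vector; the correlation between the resulting toxicity and efficacy outcomes is generally not equal to `rho`.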

###### Example 2 (Continued).

The correlation coefficient considered by Bekele and Shen (2005) is . The bivariate normal vector with mean and covariance matrix is initially generated: () = (). Then, the first patient has a toxicity profile and an efficacy profile which correspond to toxicity response (applying the quantile transformation of the Bernoulli distribution) and efficacy response (applying the quantile transformation of the Gamma distribution). Subsequently, the vector of the complete toxicity information is and the vector of the complete efficacy information is . The complete information for 5 patients with randomly generated profiles is given in Table 2.

| Patient’s profile | Outcome | Response at dose 1 | Response at dose 2 | Response at dose 3 | Response at dose 4 |
|---|---|---|---|---|---|
| | Toxicity | 0 | 0 | 1 | 1 |
| | Efficacy | 26.3 | 74.6 | 121.8 | 134.3 |
| | Toxicity | 0 | 0 | 0 | 1 |
| | Efficacy | 12.2 | 48.4 | 87.3 | 97.3 |
| | Toxicity | 0 | 0 | 0 | 0 |
| | Efficacy | 45.7 | 104.7 | 159.3 | 173.5 |
| | Toxicity | 0 | 0 | 0 | 1 |
| | Efficacy | 23.6 | 70.0 | 112.9 | 128.1 |
| | Toxicity | 0 | 0 | 0 | 0 |
| | Efficacy | 42.5 | 99.9 | 153.5 | 167.4 |
| Number of toxicities | | 0 | 0 | 1 | 3 |
| Mean (efficacy) | | 30.1 | 79.5 | 127.5 | 140.2 |
| Standard deviation (efficacy) | | 13.8 | 23.0 | 29.5 | 30.8 |

Similarly to the single-endpoint case, the TD selection is based on a pre-specified decision criterion, , which takes the minimum (maximum) value for the most desirable dose level. This would, however, involve the information for all endpoints of interest and can have a more complicated structure. In the context of a Phase I/II clinical trial the decision criterion is also known as a trade-off function (see e.g. Thall and Cook, 2004).

###### Example 2 (Continued).

Bekele and Shen (2005) defined the target dose as the dose with the highest expected efficacy while being safe () and efficacious (). This translates into the criterion

(2.5) |

where and are probability density functions of an efficacy response and of a toxicity probability given the data , respectively, and , are controlling probabilities. This decision criterion is used to construct the benchmark in this setting. Applied to the benchmark, the integrals in (2.5) are computed using the density functions of the Beta distribution and the normal distribution for toxicity and efficacy outcomes, respectively. Using the summary statistics given in Table 2 and controlling probabilities , values of the criterion are . The criterion is maximised for dose level which is selected as the TD in this single trial. The procedure is repeated for simulated trials to obtain the proportion of correct selections. The evaluation of the method by Bekele and Shen (2005) using the proposed benchmark is provided in Section 3.2.

Similarly, the benchmark can be applied to an arbitrary number of endpoints. For instance, consider a Phase I/II trial in which toxicity and efficacy are evaluated in four cycles. Then, the profile of patient is given by each drawn from and the rest of the construction remains unchanged. The procedure to generate the benchmark for endpoints is given in Algorithm 2.
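For more than two endpoints, one possible way to draw the correlated profile is to multiply independent standard normal draws by a Cholesky factor of the chosen correlation matrix and then apply the standard normal CDF componentwise. The sketch below assumes this approach; the exchangeable correlation of 0.3 between four endpoints is a hypothetical choice, and the function names are illustrative.

```python
import math
import random
from statistics import NormalDist

def cholesky(matrix):
    """Lower-triangular factor L with L times its transpose equal to matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def correlated_uniform_profile(corr, rng):
    """One patient's profile: a vector of correlated uniform values, one per endpoint."""
    lower = cholesky(corr)
    z = [rng.gauss(0.0, 1.0) for _ in corr]
    # Correlated normal vector obtained from independent standard normal draws.
    x = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(len(corr))]
    phi = NormalDist().cdf
    return [phi(value) for value in x]

# Hypothetical correlation matrix for four endpoints.
corr = [[1.0 if i == j else 0.3 for j in range(4)] for i in range(4)]
profile = correlated_uniform_profile(corr, random.Random(2))
print([round(u, 3) for u in profile])
```

Each component of the profile is then passed through the quantile function of the corresponding endpoint at every dose level, exactly as in the single-endpoint case.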

## 3 Application

### 3.1 Continuous Toxicity in Phase I Trials

The dichotomization of the toxicity endpoint (DLT/no DLT) in Phase I clinical trials restricts the available information about the drug’s toxicity. In fact, a continuous toxicity endpoint can provide better insight into the drug’s profile (Wang and others, 2000; Bekele and Thall, 2004; Wang and Ivanova, 2015).

Recently, Wang and Ivanova (2015) proposed the Bayesian Design for Continuous Outcomes (BDCO), which can be applied to clinical trials with a continuous toxicity endpoint. In short, BDCO assumes that outcome at dose for patient has a normal distribution where is considered as a random variable itself. Based on the posterior distributions of , BDCO is driven by the probability that is within of the target, :

(3.1) |

The design targets the dose which maximizes the probability in (3.1). This is equivalent to maximising the decision criterion given in Equation (2.3). Below, we apply the proposed benchmark to the setting considered in the original paper using this decision criterion and compare its performance to that of BDCO.

Recalling the setting by Wang and Ivanova (2015), we consider six scenarios with six dose levels , a sample size of , parameter and two cases: (i) the case of equal variances in which outcome has normal distribution and (ii) the case of unequal variances corresponding to normal distributions . In each of six scenarios the target values are used, respectively. As a consequence, the target dose is dose in scenario , in scenario 2, and so on.

Table 3 shows the operating characteristics of the BDCO against the benchmark. The results of the BDCO are extracted from Table 2 in the original article, and the benchmark is evaluated using trial replications.

Proportion of trials selecting each dose level.

| Scenario in Wang and Ivanova (2015) | Design | Variance | Dose 1 | Dose 2 | Dose 3 | Dose 4 | Dose 5 | Dose 6 |
|---|---|---|---|---|---|---|---|---|
| 1 | BDCO | Equal | 0.91 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1 | Benchmark | Equal | 0.94 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1 | BDCO | Unequal | 0.97 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1 | Benchmark | Unequal | 0.98 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2 | BDCO | Equal | 0.07 | 0.86 | 0.08 | 0.00 | 0.00 | 0.00 |
| 2 | Benchmark | Equal | 0.07 | 0.87 | 0.07 | 0.00 | 0.00 | 0.00 |
| 2 | BDCO | Unequal | 0.04 | 0.84 | 0.11 | 0.01 | 0.00 | 0.00 |
| 2 | Benchmark | Unequal | 0.02 | 0.86 | 0.11 | 0.00 | 0.00 | 0.00 |
| 3 | BDCO | Equal | 0.00 | 0.07 | 0.83 | 0.09 | 0.00 | 0.00 |
| 3 | Benchmark | Equal | 0.00 | 0.07 | 0.87 | 0.07 | 0.00 | 0.00 |
| 3 | BDCO | Unequal | 0.00 | 0.16 | 0.65 | 0.16 | 0.02 | 0.00 |
| 3 | Benchmark | Unequal | 0.00 | 0.11 | 0.69 | 0.17 | 0.02 | 0.00 |
| 4 | BDCO | Equal | 0.00 | 0.00 | 0.08 | 0.81 | 0.11 | 0.00 |
| 4 | Benchmark | Equal | 0.00 | 0.00 | 0.07 | 0.87 | 0.07 | 0.00 |
| 4 | BDCO | Unequal | 0.00 | 0.00 | 0.27 | 0.50 | 0.18 | 0.04 |
| 4 | Benchmark | Unequal | 0.00 | 0.00 | 0.19 | 0.55 | 0.20 | 0.05 |
| 5 | BDCO | Equal | 0.00 | 0.00 | 0.00 | 0.09 | 0.80 | 0.11 |
| 5 | Benchmark | Equal | 0.00 | 0.00 | 0.00 | 0.07 | 0.87 | 0.07 |
| 5 | BDCO | Unequal | 0.00 | 0.00 | 0.02 | 0.34 | 0.45 | 0.20 |
| 5 | Benchmark | Unequal | 0.00 | 0.00 | 0.00 | 0.25 | 0.45 | 0.29 |
| 6 | BDCO | Equal | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | 0.90 |
| 6 | Benchmark | Equal | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.93 |
| 6 | BDCO | Unequal | 0.00 | 0.00 | 0.00 | 0.07 | 0.40 | 0.54 |
| 6 | Benchmark | Unequal | 0.00 | 0.00 | 0.00 | 0.02 | 0.27 | 0.71 |

Under Scenarios 2-5 with equal variances, the proportion of correct selections using the benchmark is 87%, which illustrates that they have the same level of “complexity”. Conversely, the benchmark shows that it is easier to find the MTD if it is either the first or the last dose. Under all scenarios with equal variances, the BDCO has accuracy close to the benchmark. The ratio of the probability of correct selection of the BDCO relative to the benchmark ranges between 92% and 98% in these cases.

Under scenarios with unequal variances, the benchmark demonstrates that it is harder to find the MTD if the corresponding variance is high. For example, the benchmark leads to 86% of correct selections under Scenario 2 and 45% under Scenario 5. Again, it appears that it is easier for any method to find the MTD when it is the first or the last dose. BDCO shows very high accuracy in Scenarios 1-5 with unequal variances. The correct probability ratios never go below 91% and even reach nearly 100% under Scenario 5. In that scenario, BDCO recommends the MTD in 45% of replications (as does the benchmark), but it recommends the highest dose systematically less often: 20% against 29% by the benchmark. This implies that BDCO tends towards more conservative decisions. Scenario 6 confirms this finding, with the correct probability ratio equal to which, however, is still high.

Overall, BDCO never selects the correct dose more often than the benchmark in any scenario (as expected), and the efficiency of the design is high. The minimum ratio of the probability of correct selection is which corresponds to highly variable outcomes. This indicates that the parameters of the BDCO are adequately calibrated and that the BDCO in the proposed form is able to find the MTD in various scenarios.
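The ratios discussed above can be recomputed directly from the proportions of correct selection in Table 3, where the target dose in each scenario is the dose with the same number as the scenario. The short sketch below does this arithmetic; because it starts from the rounded table entries, the results may differ in the last digit from figures based on unrounded simulation output.

```python
# Proportions of correct selection read from Table 3, Scenarios 1 to 6.
bdco = {
    "equal": [0.91, 0.86, 0.83, 0.81, 0.80, 0.90],
    "unequal": [0.97, 0.84, 0.65, 0.50, 0.45, 0.54],
}
benchmark = {
    "equal": [0.94, 0.87, 0.87, 0.87, 0.87, 0.93],
    "unequal": [0.98, 0.86, 0.69, 0.55, 0.45, 0.71],
}

# Ratio of the design's accuracy to the benchmark's accuracy, per scenario.
ratios = {
    case: [d / b for d, b in zip(bdco[case], benchmark[case])]
    for case in bdco
}

for case, values in ratios.items():
    print(case, [f"{100 * r:.0f}%" for r in values])
```

A ratio close to 100% indicates that the design loses little relative to the benchmark in that scenario, whereas a low ratio flags a scenario worth closer inspection.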

### 3.2 Continuous Efficacy and Binary Toxicity in Phase I/II Trials

Similarly to the continuous toxicity outcome, a continuous efficacy endpoint can provide better guidance on the target dose selection than a dichotomized one. One of the first designs proposed for Phase I/II clinical trials with a continuous efficacy outcome is that of Bekele and Shen (2005), who developed a Bayesian approach to model toxicity and a (continuous) biomarker of efficacy jointly. We denote this design by BS.

Bekele and Shen (2005) introduced a latent normal random variable which is related to the observed binary toxicity. A bivariate normal distribution allows for different strengths of the dependence between toxicity and efficacy. Dose escalation/de-escalation decision rules are based on the posterior distribution of both toxicity and efficacy. The design was shown to have good operating characteristics in many scenarios. Therefore, the majority of subsequently proposed designs (e.g. see Hirakawa (2012) and Yeung and others (2015)) were compared to it. Below, we provide the comparison of the design by Bekele and Shen (2005) against the respective benchmark.

Recalling the framework by Bekele and Shen (2005) we consider an efficacy outcome at dose having a Gamma distribution with rate parameter and a DLT outcome having probability . A total of six scenarios and four dose levels per scenario are explored using the total sample size . The parameters of and toxicity probability are given in Table 4. In each scenario a weak association, , between the toxicity and efficacy biomarker is used. The target dose is defined as given in the criterion (2.5) - the dose with the highest expected efficacy while being safe () and efficacious ().

Table 4 shows the operating characteristics of the BS design against the respective benchmark. The results for BS are extracted from Table 1 of the original work which uses replications, and the benchmark is evaluated using trial replications.

Proportion of trials selecting each dose level or no dose (None).

| Scenario in Bekele and Shen (2005) | Design | Dose 1 | Dose 2 | Dose 3 | Dose 4 | None |
|---|---|---|---|---|---|---|
| 1 | (efficacy parameter, toxicity probability) | (25,0.01) | (70,0.10) | (115,0.25) | (127,0.60) | |
| 1 | BS | 0.00 | 0.03 | 0.95 | 0.02 | |
| 1 | Benchmark | 0.00 | 0.08 | 0.92 | 0.00 | 0.00 |
| 2 | (efficacy parameter, toxicity probability) | (5,0.50) | (70,0.70) | (90,0.80) | (135,0.85) | |
| 2 | BS | 0.06 | 0.00 | 0.00 | 0.00 | 0.94 |
| 2 | Benchmark | 0.02 | 0.00 | 0.00 | 0.00 | 0.98 |
| 3 | (efficacy parameter, toxicity probability) | (25,0.03) | (46,0.05) | (90,0.10) | (135,0.15) | |
| 3 | BS | 0.00 | 0.00 | 0.02 | 0.98 | 0.00 |
| 3 | Benchmark | 0.00 | 0.00 | 0.01 | 0.99 | 0.00 |
| 4 | (efficacy parameter, toxicity probability) | (20,0.05) | (75,0.05) | (75,0.35) | (75,0.65) | |
| 4 | BS | 0.00 | 0.83 | 0.17 | 0.00 | 0.00 |
| 4 | Benchmark | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 |
| 5 | (efficacy parameter, toxicity probability) | (60,0.05) | (65,0.50) | (80,0.70) | (95,0.85) | |
| 5 | BS | 0.94 | 0.06 | 0.00 | 0.00 | 0.00 |
| 5 | Benchmark | 0.97 | 0.03 | 0.00 | 0.00 | 0.00 |
| 6 | (efficacy parameter, toxicity probability) | (2,0.03) | (2,0.03) | (2,0.03) | (2,0.03) | |
| 6 | BS | 0.01 | 0.00 | 0.00 | 0.00 | 0.99 |
| 6 | Benchmark | 0.01 | 0.00 | 0.00 | 0.00 | 0.99 |

Under Scenarios 1, 3 and 5 with an increasing dose-efficacy relationship, the BS design performs with high accuracy and the proportion of correct selections is close to the benchmark. Interestingly, the BS design recommends the target dose 3% more often than the benchmark under Scenario 1. Given the number of replications for the BS and the benchmark, a 3% difference is significant. This can be an indication that the prior distribution used by BS is in favour of . It would also explain the relatively lower performance under Scenario 4, in which the BS recommends the target dose in 83% of trials against 100% by the benchmark. The BS recommends the dose with the same efficacy but noticeably greater toxicity in 17% of trials. An alternative explanation of the difference in the proportion of selections under Scenario 4 is the plateau in the dose-efficacy relation, which is not modelled by the BS. Nevertheless, the ratio of correct probabilities is demonstrating good operating characteristics of the BS design.

Under the unsafe Scenario 2 and the inefficacious Scenario 6, the BS design comes to the correct conclusion in nearly the same proportion of trials as the benchmark. This shows the ability of the BS design to avoid unethical selections due to either high toxicity or low activity.

Overall, the benchmark confirms that the BS design is flexible and can recommend the target dose under many different scenarios. It also gives a possible explanation for the super-efficient performance under Scenario 1 and points to potential challenges that the BS design can face in plateau dose-efficacy scenarios.

## 4 Discussion

In this work, a novel benchmark for dose finding studies is formulated. In essence, the novel benchmark is similar to the original proposal by O’Quigley and others (2002), as the whole information about a patient is summarised in a single value , but it can also be applied to studies with continuous outcomes. In an era of increasing complexity of clinical trials, a procedure for evaluating the adequacy of novel dose finding methods is crucial. As shown above, the proposed benchmark provides an accurate upper limit on the performance of model-based dose finding designs. It is also able to reveal some inadequacy in the model/parameter/prior specifications or, alternatively, confirm the robustness of the design. The benchmark assesses the complexity of scenarios and can serve to standardize scenarios of varying difficulty. Therefore, we recommend it for the complete analysis of a dose finding design, as it helps to evaluate such designs in a more comprehensive way.

The applicability of the benchmark to several endpoints makes it possible to investigate the influence of correlated outcomes on a design’s characteristics, which is an important aspect of Phase I/II dose finding studies. Moreover, it is worth investigating what correlation structure on the endpoints of interest is implied by the method used to generate correlated outcomes. Clearly, the outcomes of interest may no longer have the same correlation .

Finally, it is important to mention that while the benchmark is a useful tool for assessing the performance of any given dose finding method, it does not capture all aspects of the evaluation. For instance, it does not provide information on the distribution of dose allocation, the average number of DLTs or stopping rules. Developments in this direction would be of great value for the complete assessment of a design.

## Acknowledgments

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 633567. Xavier Paoletti is partially funded by the Institut National du Cancer (French NCI) grant SHS-2015 Optidose immuno project. This report is independent research arising in part from Prof Jaki’s Senior Research Fellowship (NIHR-SRF-2015-08-001) supported by the National Institute for Health Research. The views expressed in this publication are those of the authors and not necessarily those of the NHS, the National Institute for Health Research or the Department of Health.

## References

- Bekele and Shen (2005) Bekele, B and Shen, Yu. (2005). A Bayesian approach to jointly modeling toxicity and biomarker expression in a Phase I/II dose-finding trial. Biometrics 61(2), 343–354.
- Bekele and others (2010) Bekele, B Nebiyou, Li, Yisheng and Ji, Yuan. (2010). Risk-group-specific dose finding based on an average toxicity score. Biometrics 66(2), 541–548.
- Bekele and Thall (2004) Bekele, B Nebiyou and Thall, Peter F. (2004). Dose-finding based on multiple toxicities in a soft tissue sarcoma trial. Journal of the American Statistical Association 99(465), 26–35.
- Cheung (2013) Cheung, Kuen. (2013). Sample size formulae for the Bayesian continual reassessment method. Clinical Trials 10(6), 852–861.
- Cheung (2014) Cheung, Kuen. (2014). Simple benchmark for complex dose finding studies. Biometrics 70(2), 389–397.
- Ezzalfani and others (2013) Ezzalfani, Monia, Zohar, Sarah, Qin, Rui, Mandrekar, Sumithra J and Deley, Marie-Cécile Le. (2013). Dose-finding designs using a novel quasi-continuous endpoint for multiple toxicities. Statistics in Medicine 32(16), 2728–2746.
- Hirakawa (2012) Hirakawa, Akihiro. (2012). An adaptive dose-finding approach for correlated bivariate binary and continuous outcomes in Phase I oncology trials. Statistics in Medicine 31(6), 516–532.
- Ivanova and Kim (2009) Ivanova, Anastasia and Kim, Se Hee. (2009). Dose finding for continuous and ordinal outcomes with a monotone objective function: a unified approach. Biometrics 65(1), 307–315.
- Lee and others (2011) Lee, SM, Hershman, DL, Martin, P, Leonard, JP and Cheung, YK. (2011). Toxicity burden score: a novel approach to summarize multiple toxic effects. Annals of Oncology 23(2), 537–541.
- Molenberghs and others (2001) Molenberghs, Geert, Geys, Helena and Buyse, Marc. (2001). Evaluation of surrogate endpoints in randomized experiments with mixed discrete and continuous outcomes. Statistics in Medicine 20(20), 3023–3038.
- O’Quigley and others (2002) O’Quigley, John, Paoletti, Xavier and Maccario, Jean. (2002). Non-parametric optimal design in dose finding studies. Biostatistics 3(1), 51–56.
- O’Quigley and others (1990) O’Quigley, John, Pepe, Margaret and Fisher, Lloyd. (1990). Continual reassessment method: a practical design for Phase 1 clinical trials in cancer. Biometrics, 33–48.
- Paoletti and Kramar (2009) Paoletti, X and Kramar, A. (2009). A comparison of model choices for the continual reassessment method in Phase I cancer trials. Statistics in Medicine 28(24), 3012–3028.
- R Core Team (2015) R Core Team. (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Tate (1955) Tate, Robert F. (1955). The theory of correlation between two continuous variables when one is dichotomized. Biometrika 42(1/2), 205–216.
- Thall and Cook (2004) Thall, Peter F and Cook, John D. (2004). Dose-finding based on efficacy–toxicity trade-offs. Biometrics 60(3), 684–693.
- Thall and Russell (1998) Thall, Peter F and Russell, Kathy E. (1998). A strategy for dose-finding and safety monitoring based on efficacy and adverse outcomes in Phase I/II clinical trials. Biometrics, 251–264.
- Wages and Varhegyi (2017) Wages, Nolan A and Varhegyi, Nikole. (2017). A web application for evaluating Phase I methods using a non-parametric optimal benchmark. Clinical Trials 14(5), 553–557.
- Wang and others (2000) Wang, Chinying, Chen, T Timothy and Tyan, Irvin. (2000). Designs for Phase I cancer clinical trials with differentiation of graded toxicity. Communications in Statistics-Theory and Methods 29(5-6), 975–987.
- Wang and Ivanova (2015) Wang, Yunfei and Ivanova, Anastasia. (2015). Dose finding with continuous outcome in Phase I oncology trials. Pharmaceutical Statistics 14(2), 102–107.
- Yeung and others (2017) Yeung, Wai Yin, Reigner, Bruno, Beyer, Ulrich, Diack, Cheikh, Palermo, Giuseppe, Jaki, Thomas and others. (2017). Bayesian adaptive dose-escalation designs for simultaneously estimating the optimal and maximum safe dose based on safety and efficacy. Pharmaceutical Statistics.
- Yeung and others (2015) Yeung, Wai Yin, Whitehead, John, Reigner, Bruno, Beyer, Ulrich, Diack, Cheikh and Jaki, Thomas. (2015). Bayesian adaptive dose-escalation procedures for binary and continuous responses utilizing a gain function. Pharmaceutical Statistics 14(6), 479–487.
- Yin and Yuan (2009) Yin, Guosheng and Yuan, Ying. (2009). Bayesian dose finding in oncology for drug combinations by copula regression. Journal of the Royal Statistical Society: Series C (Applied Statistics) 58(2), 211–224.
- Yuan and others (2007) Yuan, Z, Chappell, R and Bailey, H. (2007). The continual reassessment method for multiple toxicity grades: A Bayesian quasi-likelihood approach. Biometrics 63(1), 173–179.