Statistical Inference of Covariate-Adjusted Randomized Experiments


Wei Ma, Yichen Qin, Yang Li, and Feifang Hu

Wei Ma is Assistant Professor at Renmin University of China. Yichen Qin is Assistant Professor at University of Cincinnati. Yang Li is Associate Professor at Renmin University of China. Feifang Hu is Professor at George Washington University. This work is partially supported by grant DMS-1612970 from the National Science Foundation and by grants 11371366, 11731011, and 71771211 from the National Natural Science Foundation of China.
July 16, 2019

Covariate-adjusted randomization procedures are frequently used in comparative studies to increase the covariate balance across treatment groups. However, because the randomization inevitably uses the covariate information when forming balanced treatment groups, the validity of classical statistical methods following such randomization is often unclear. In this article, we derive the theoretical properties of statistical methods based on general covariate-adjusted randomization under the linear model framework. More importantly, we explicitly unveil the relationship between covariate-adjusted randomization and inference properties by deriving the asymptotic representations of the corresponding estimators. We apply the proposed general theory to various randomization procedures, such as complete randomization (CR), rerandomization (RR), pairwise sequential randomization (PSR), and Atkinson’s $D_A$-biased coin design ($D_A$-BCD), and compare their performance analytically. Based on the theoretical results, we then propose a new approach to obtain valid and more powerful tests. These results open a door to understanding and analyzing experiments based on covariate-adjusted randomization. Simulation studies provide further evidence of the advantages of the proposed framework and theoretical results.

Keywords: balancing covariates, conservative tests, covariate-adjusted randomization, Mahalanobis distance, power, rerandomization.

1 Introduction

Randomization is considered the “gold standard” for evaluating treatment effects, as it mitigates selection bias and provides a foundation for statistical inference. Among all the randomization methods, covariate-adjusted randomization (CAR) procedures are frequently used because they utilize the covariate information to form more balanced treatment groups. However, because of this feature, the validity of classical statistical inference following such randomization is usually unclear. In this article, we establish a general theory by which properties of statistical inference can be obtained for covariate-adjusted randomization under mild conditions.

There have been extensive studies on CAR procedures. For categorical covariates, Pocock and Simon’s minimization method and its extensions can be used to reduce covariate imbalance at different levels (Taves, 1974; Pocock and Simon, 1975; Hu and Hu, 2012); these methods can also handle continuous covariates through discretization. To avoid information loss due to discretization, many randomization methods that directly utilize continuous covariates have also been proposed in the literature (Frane, 1998; Lin and Su, 2012; Ma and Hu, 2013). Atkinson’s $D_A$-biased coin design ($D_A$-BCD) represents a large class of methods that take covariates into account in allocation rules based on certain optimality criteria (Atkinson, 1982; Smith, 1984a, b; Antognini and Zagoraiou, 2011). When all units’ covariates are available before the experiment starts, we can adopt rerandomization (RR), which repeats the traditional randomization process until a satisfactory configuration is achieved (Morgan and Rubin, 2012, 2015). In addition, pairwise sequential randomization (PSR), recently proposed by Qin et al. (2017), is another alternative, which achieves the optimal covariate balance and is computationally more efficient. Details of these methods are given in Section 4. For an overview, see Hu et al. (2014) and Rosenberger and Lachin (2015).

Since the aforementioned randomizations inevitably use the covariate information in forming more balanced treatment groups, the subsequent statistical inference is usually affected and demonstrates undesirable properties, such as reduced type I errors (Shao et al., 2010; Morgan and Rubin, 2012; Ma et al., 2015). This phenomenon of conservativeness is particularly common for a working model that includes only a subset of the covariates used in randomization, such as the two-sample t-test. As all the covariates are used in the randomization to generate more balanced assignments, a valid statistical procedure should incorporate all the covariates. Therefore, excluding some covariates from the working model distorts the sampling distribution of the test statistics, which consequently causes invalid statistical inference.

In the context of clinical trials, regulatory guidelines recommend that the covariates used in randomization be included in the subsequent analysis (ICH E9, 1998; EMA, 2015). However, unadjusted tests still dominate in practice (Sverdlov, 2015). For example, to avoid too many parameters, investigation sites are usually omitted from the analysis model for a multi-center clinical trial. Other reasons for not incorporating all the covariates in practice include simplicity of the test procedure and robustness to model misspecification (Shao et al., 2010; Shao and Yu, 2013; Ma et al., 2015). Therefore, many working models may suffer from the issue of invalid statistical inference. As covariates are commonly used in comparative studies such as biomarker analysis, personalized medicine (Hu and Hu, 2012), and crowdsourced-internet experimentation (Horton et al., 2011; Chandler and Kapelner, 2013), understanding the impact of covariate-adjusted randomization on statistical inference is an increasingly pressing problem.

The validity of statistical inference after balancing covariates was investigated mainly through simulations in the early literature, such as Birkett (1985) and Forsythe (1987). More recently, theoretical progress has been made on the inference properties of some specific covariate-adjusted randomization methods. Shao et al. (2010) prove that the two-sample t-test is conservative under a special stratified randomization. Ma et al. (2015) study hypothesis testing under a linear model for discrete covariate-adaptive randomization, assuming that the overall and marginal imbalances across covariates are bounded in probability. However, their results are limited, as many covariate-adjusted procedures deal with continuous covariates directly and do not necessarily satisfy the strong balancing assumptions. In fact, the inference properties of many methods, as we will show for RR, PSR, and $D_A$-BCD, are different from those studied by Shao et al. (2010) and Ma et al. (2015). In this article, we study inference properties under a general framework and demonstrate the impact of covariate-adjusted randomization on inference.

The main contributions of this article are as follows. First, we derive the statistical properties of inference following general covariate-adjusted randomization methods under the linear model framework. More importantly, we explicitly display the relationship between covariate balance and inference by deriving the asymptotic representations of the estimators. This result explains why inference behaves differently for various randomization methods. Second, we show that the results have broad applications by applying them to several randomization procedures, including CR, RR, PSR, and $D_A$-BCD. In addition, the theory provides a formal approach to evaluate inference properties and compare the pros and cons of different randomization methods. Third, we propose a method to obtain valid and powerful tests based on our theoretical results. The study lays a foundation for understanding the impact of covariate balance on post-randomization statistical inference and sheds light on future study in this area.

This article is organized as follows. After introducing the framework and notation in Section 2, we present our main theoretical results for statistical inference under covariate-adjusted randomization in Section 3. Using the proposed theory, we study four specific randomization methods in terms of their conservativeness in hypothesis testing in Section 4, and further propose a method to correct the conservative type I errors in Section 5. In Section 6, numerical studies are presented to illustrate the effectiveness of the proposed theory. Section 7 concludes with some remarks and future research topics. The main theoretical proofs are in the Appendix.

2 General Framework

Suppose that $n$ units are to be assigned to two treatment groups using a covariate-adjusted randomization. Let $T_i$ be the assignment of the $i$-th unit, i.e., $T_i = 1$ for treatment 1 and $T_i = 0$ for treatment 2. Let $x_i = (x_{i,1}, \ldots, x_{i,p+q})^T$ represent $p+q$ covariates observed for the $i$-th unit, where $x_{i,j}$ are independent and identically distributed as $X_j$ for each unit $i$. A linear regression model is assumed for the outcome $Y_i$ of the $i$-th unit,

$$Y_i = \mu_1 T_i + \mu_2 (1 - T_i) + \sum_{j=1}^{p+q} \beta_j x_{i,j} + \epsilon_i, \qquad (1)$$

where $\mu_1$ and $\mu_2$ are the main effects of treatment 1 and 2, respectively, and $\mu_1 - \mu_2$ is the treatment effect. Furthermore, $\beta = (\beta_1, \ldots, \beta_{p+q})^T$ represents the covariate effects, and $\epsilon_i$ are independent and identically distributed random errors with mean zero and constant variance $\sigma_\epsilon^2$, independent of the covariates. For simplicity, all the covariates are assumed to be independent of each other and to have expectations of zero, i.e., $E X_j = 0$ for $j = 1, \ldots, p+q$.

After allocating the units to treatment groups via covariate-adjusted randomization, a working model is used to estimate and test the treatment effect. In such a working model, it is common in practice to include only a subset of the covariates used in randomization, or sometimes even no covariates at all (Shao et al., 2010; Ma et al., 2015; Sverdlov, 2015). Therefore, without loss of generality, suppose that the first $p$ covariates are included in the working model,

$$E[Y_i] = \mu_1 T_i + \mu_2 (1 - T_i) + \sum_{j=1}^{p} \beta_j x_{i,j}. \qquad (2)$$

Note that $q = 0$ when all the covariates are included in the working model, and $p = 0$ when no covariates are included.

Let , , , where

Further let , , so that . Then the working model can also be written as,

where is the design matrix, is the vector of parameters of interest, and is the -dimensional vector of ones. Therefore, the ordinary least squares (OLS) estimate of , is,

Under the covariate-adjusted randomization, the treatment assignments depend on both and . The distribution of is often difficult to obtain. However, testing the treatment effect is often the primary goal when performing a comparative study (e.g., randomized clinical trial). To detect if a treatment effect exists, we have the following hypothesis testing problem,


with the test statistic

where is a vector of length , and is the model-based estimate of the error variance . The traditional testing procedure is to reject the null hypothesis at the significance level if , where is -th quantile of a standard normal distribution.
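As a concrete special case, when the working model contains no covariates, the OLS estimates of the two main effects are the group means, and the statistic in (3) reduces to the pooled two-sample statistic. The sketch below illustrates only this special case under that reading; the function names and the 0/1 coding of assignments (1 for treatment 1, 0 for treatment 2) are illustrative assumptions, not notation from the paper.

```python
import math

def treatment_z_statistic(y, t):
    """Model-based test statistic for the treatment effect with no covariates.

    y: list of outcomes; t: list of 0/1 assignments (assumed coding:
    1 = treatment 1, 0 = treatment 2).
    """
    y1 = [yi for yi, ti in zip(y, t) if ti == 1]
    y2 = [yi for yi, ti in zip(y, t) if ti == 0]
    n1, n2 = len(y1), len(y2)
    mean1, mean2 = sum(y1) / n1, sum(y2) / n2
    # Residual sum of squares of the two-mean model, with n - 2 degrees of freedom
    rss = sum((v - mean1) ** 2 for v in y1) + sum((v - mean2) ** 2 for v in y2)
    sigma2_hat = rss / (n1 + n2 - 2)
    # For a design matrix with only the two group indicators,
    # the variance factor of the contrast is 1/n1 + 1/n2
    se = math.sqrt(sigma2_hat * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se

def reject_null(stat, z_crit=1.959964):
    """Two-sided test at the 5% level using the standard normal quantile."""
    return abs(stat) > z_crit
```

With covariates in the working model, the same statistic would be computed from the full OLS fit rather than from group means.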

In addition to testing the treatment effect, it is often of interest to test whether there exist covariate effects. A general form of hypothesis testing can be used for any linear combinations of the covariate effects. Let be an matrix of rank with entries in the first two columns all equal to zero (no treatment effect to test). Consider the following hypotheses,


and the test statistic is,

The traditional test rejects the null hypothesis if , where is -th percentile of a distribution with degrees of freedom. Note that we can let to test the significance of a single covariate effect of , and similarly for other covariate effects.

3 General Properties

Based on the framework introduced above, we study the statistical properties of estimation and hypothesis testing, i.e., (3) and (4), under covariate-adjusted randomization. Before presenting our main results, we first introduce two widely satisfied assumptions.

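Assumption 1.

Global balance: $n^{-1} \sum_{i=1}^{n} (2T_i - 1) \to 0$ in probability.

Assumption 2.

Covariate balance: $n^{-1/2} \sum_{i=1}^{n} (2T_i - 1) x_i \to \xi$ in distribution, where $\xi$ is a $(p+q)$-dimensional random vector with $E[\xi] = 0$.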

Assumption 1 requires that the proportion of units in each treatment group converges to $1/2$, which is usually the desired target proportion, as balanced treatment assignments are more likely to provide efficient estimation and powerful tests. Assumption 2, on the other hand, specifies the asymptotic properties of the imbalance vector of covariates. That is, the sums of covariates in each treatment group tend to be equal as the sample size increases. Together with Assumption 1, this implies that the averages of each covariate are similar between the two treatment groups. The two assumptions ensure that a covariate-adjusted randomization procedure achieves good balancing properties, both globally and across covariates. It is worth pointing out that Assumption 2 is satisfied with a limit of zero under the assumptions of Ma et al. (2015) for discrete covariates, since imbalances that are bounded in probability vanish after the scaling in Assumption 2.
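The two balance quantities can be computed directly for any realized assignment. The following is a minimal sketch, assuming assignments are coded 0/1 so that $2T_i - 1$ takes values $\pm 1$, and assuming the imbalance vector is scaled by $n^{-1/2}$; the function names are hypothetical.

```python
import math

def global_imbalance(t):
    """Average of (2*T_i - 1): the difference between the two group proportions."""
    return sum(2 * ti - 1 for ti in t) / len(t)

def covariate_imbalance(t, x):
    """Scaled imbalance vector n^{-1/2} * sum_i (2*T_i - 1) * x_i.

    t: list of 0/1 assignments; x: list of covariate rows (one list per unit).
    """
    n = len(t)
    p = len(x[0])
    return [
        sum((2 * ti - 1) * xi[j] for ti, xi in zip(t, x)) / math.sqrt(n)
        for j in range(p)
    ]
```

A procedure that balances well yields a global imbalance near zero and an imbalance vector that is tightly concentrated around the origin across repeated randomizations.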

The properties of classical statistical methods are usually well known and well studied under the full model (1) in literature. However, in practice, the final statistical inference is often based on the working model (2). Now we present our main theoretical results based on the working model (2).

Theorem 3.1.

Under Assumptions 1 and 2, the estimates based on the working model (2) are consistent. That is, .


where .

The representation provides a convenient way to derive the asymptotic distribution of and its linear combinations. In particular, for the estimated treatment effect it holds

based on which the asymptotic distribution can be obtained. We partition so that represents the first dimensions of , and the last dimensions. Let be a standard normal random variable that is independent of . We have the following corollary.

Corollary 3.2.

Under the working model (2), if Assumptions 1 and 2 are satisfied, then

The corollary describes the asymptotic behavior of under the working model (2). If the model parameters and are known, statistical inference, such as Wald-type hypothesis test, can be constructed based on the asymptotic distribution. In practice, these parameters are unknown and the model-based test procedure defined in (3) is used instead. It assumes the normal approximation for the asymptotic distribution, and the asymptotic variance is estimated by , which is shown in Appendix to equal . Further let and . The asymptotic properties of the test (3) under both the null and alternative hypotheses are presented in the following theorem.

Theorem 3.3.

Under the working model (2), if Assumptions 1 and 2 are satisfied, then

  1. When is true, then

  2. When is true, we consider a sequence of local alternatives with for a fixed , then

The asymptotic distribution of test statistic under consists of two independent components, and . The first component is due to the random error in the underlying model (1), and remains invariant under different covariate-adjusted randomization as the randomization procedure utilizes only covariate information and does not depend on the observed responses (Hu and Rosenberger, 2006). In addition, note that in the second component is the last dimensions of . By Assumption 2, is the asymptotic distribution of the imbalance vector of covariates and illustrates how well covariates are balanced under a specific covariate-adjusted randomization method. The better it performs in terms of covariate balance, the more concentrated is distributed around . Therefore, the second component of represents the impact of a covariate-adjusted randomization on the test statistic through the level of covariate balance. Depending on to what extent covariates are balanced, the test may behave differently in terms of size and power.

When the asymptotic distribution of is no longer a standard normal distribution, the traditional test may fail to maintain the pre-specified type I error. Let be -th quantile of the asymptotic distribution of under the null hypothesis. If , the test is conservative in the sense that the actual type I error is smaller than the pre-specified level . In fact, such conservativeness is often the case for covariate-adjusted randomization and can be demonstrated by comparison of between complete randomization and covariate-adjusted randomization. Under complete randomization, follows a normal distribution that makes follow a standard normal distribution asymptotically (Section 4.1), in which case the test has valid type I error. However, covariate-adjusted randomization is used with the purpose of reducing the imbalance of covariates between treatment groups, and hence is more concentrated around as opposed to complete randomization, leading to conservative tests. Three special cases of covariate-adjusted randomization (RR, PSR, and $D_A$-BCD) will be discussed in detail in Sections 4.2, 4.3, and 4.4, respectively. The correction of conservative tests is discussed in Section 5.

Besides type I error, the explicit form of power can also be derived based on Theorem 3.3. Under the local alternatives, for a fixed , the power is

where is the cumulative distribution function of the asymptotic distribution of . In Section 6 power is evaluated numerically for several covariate-adjusted randomization methods.

As with the treatment effect, inference for the covariate effects can also be studied using the representation given in Theorem 3.1. The next corollary establishes the asymptotic normality of under covariate-adjusted randomization.

Corollary 3.4.

Under the working model (2), if Assumptions 1 and 2 are satisfied, then

where .

Based on the asymptotic normality of , tests for these parameters can be constructed with the asymptotic variances replaced by their consistent estimates. The next theorem shows that the standard test under the working model defined in (4) is valid for linear combinations of .

Theorem 3.5.

Under the working model (2), if Assumptions 1 and 2 are satisfied, then

  1. When is true, then

  2. When is true, we consider a sequence of local alternatives with for a fixed , then

    where is the non-central parameter.

Theorem 3.5 states that the type I error is maintained when testing the covariate effects under covariate-adjusted randomization. The power, however, is reduced if not all covariate information is incorporated in the working model. Since the inference on covariate effects is valid under covariate-adjusted randomization, we will mainly focus on testing treatment effect in the next section.

4 Properties of Several CAR Procedures

In the last section, we derived the theoretical properties of general CAR procedures. We now apply our results to several important CAR procedures proposed in the literature. These applications help us understand the relationship between balancing and inference for a given CAR procedure.

4.1 Complete Randomization

Complete randomization (CR) assigns units to each treatment group with equal probability $1/2$. Since the treatment assignments are independent and do not depend on the covariates, it follows from the central limit theorem that

where .

Therefore, under CR, defined in Assumption 2 is a normal distribution, and furthermore, . By Theorem 3.3, it is easy to show that the asymptotic distribution of the test statistic under the null hypothesis follows a standard normal distribution, i.e., . The traditional hypothesis testing under CR is valid with correct type I error and no adjustment is needed.
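The central limit behavior under CR is easy to check numerically. The sketch below is a toy check, not the paper's simulation: it assumes a single standard normal covariate and the $n^{-1/2}$-scaled imbalance with $\pm 1$ coding, and estimates the variance of the imbalance across repeated complete randomizations, which should be close to the covariate variance of 1.

```python
import math
import random

def complete_randomization(n, rng):
    """Assign each unit to treatment 1 (coded 1) or 2 (coded 0) with probability 1/2."""
    return [1 if rng.random() < 0.5 else 0 for _ in range(n)]

def simulate_cr_imbalance_variance(n=100, reps=2000, seed=0):
    """Sample variance of n^{-1/2} * sum_i (2*T_i - 1) * x_i over repeated CR draws."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one N(0, 1) covariate (assumed)
        t = complete_randomization(n, rng)
        draws.append(sum((2 * ti - 1) * xi for ti, xi in zip(t, x)) / math.sqrt(n))
    mean = sum(draws) / reps
    return sum((d - mean) ** 2 for d in draws) / (reps - 1)
```

Replacing `complete_randomization` with a covariate-adjusted procedure in the same loop would show the corresponding reduction in the variance of the imbalance.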

4.2 Rerandomization

To balance the covariates across treatment groups, Morgan and Rubin (2012) have proposed rerandomization (RR), for which the procedure can be summarized as follows:

  1. Collect covariate data.

  2. Specify a balance criterion to determine when a randomization is acceptable. For example, the criterion could be defined as a threshold of on some user-defined imbalance measure, denoted as .

  3. Randomize the units into treatment groups using traditional randomization methods, such as CR.

  4. Check the balance criterion . If the criterion is satisfied, go to Step (5); otherwise, return to Step (3).

  5. Perform the experiment using the final randomization obtained in Step (4).

Morgan and Rubin (2012) have chosen the imbalance measure in Step (2) to be the Mahalanobis distance between the sample means across two treatment groups, which is defined by

where and are the sample means of covariates in the two treatment groups. There are several advantages for adopting such an imbalance measure. Mahalanobis distance is an affinely invariant imbalance measure, which is appealing especially for multivariate data. It is an overall imbalance measure which standardizes and aggregates each covariate imbalance information. A smaller value of Mahalanobis distance indicates a better covariate balance. A low Mahalanobis distance guarantees low imbalance levels in all covariates. Other desirable properties such as the reduction in variance of the estimated treatment effect can be found in Morgan and Rubin (2012).
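The rerandomization steps above can be sketched as follows. This is a minimal illustration under stated assumptions: covariates are treated as independent, so the Mahalanobis distance uses a diagonal covariance built from sample variances; each candidate allocation is a random equal split rather than independent coin flips; and the threshold value and function names are hypothetical.

```python
import random

def sample_variances(x):
    """Per-covariate sample variances (diagonal covariance, independence assumed)."""
    n, p = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(p)]
    return [sum((row[j] - means[j]) ** 2 for row in x) / (n - 1) for j in range(p)]

def mahalanobis_imbalance(t, x, variances):
    """Mahalanobis distance between group means, scaled by n1 * n2 / n."""
    n, p = len(x), len(x[0])
    n1 = sum(t)
    n2 = n - n1
    m = 0.0
    for j in range(p):
        mean1 = sum(row[j] for row, ti in zip(x, t) if ti == 1) / n1
        mean2 = sum(row[j] for row, ti in zip(x, t) if ti == 0) / n2
        m += (mean1 - mean2) ** 2 / variances[j]
    return m * n1 * n2 / n

def rerandomize(x, threshold, rng, max_tries=100000):
    """Redraw equal-split allocations until the imbalance is at most `threshold`."""
    n = len(x)
    variances = sample_variances(x)
    t = [1] * (n // 2) + [0] * (n - n // 2)
    for _ in range(max_tries):
        rng.shuffle(t)  # Step 3: a fresh random allocation
        if mahalanobis_imbalance(t, x, variances) <= threshold:  # Step 4
            return list(t)
    raise RuntimeError("no acceptable randomization found within max_tries")
```

A smaller threshold gives better balance but, on average, requires more redraws before an allocation is accepted.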

Under the assumption of independent covariates, the Mahalanobis distance can be expressed as

By the balance criterion under RR, we have

where is the square root of , and is the -dimensional identity matrix.

Theorem 4.1.

Under rerandomization, we have

  1. Under , then

    where is the last dimensions of .

  2. Under , where for a fixed ,

Furthermore, the asymptotic variance of is

where is defined in Morgan and Rubin (2012) as

and is the incomplete gamma function .

The asymptotic distribution of under RR is no longer a normal distribution, and it is more concentrated around 0 compared with the standard normal distribution, which indicates that the traditional testing procedure is conservative. The extent of conservativeness is impacted by the value of , which is a monotonically increasing function of . By selecting a relatively smaller value of , the covariates are more balanced due to the stricter balance criterion, resulting in a lower asymptotic variance of . However, a smaller means that on average it takes more attempts to meet the balance criterion. More discussion on the choice of can be found in Morgan and Rubin (2012).
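The variance reduction factor of Morgan and Rubin (2012) can be evaluated numerically. The sketch below assumes their expression for $k$ covariates and acceptance threshold $a$, namely the ratio of chi-square probabilities $P(\chi^2_{k+2} \le a) / P(\chi^2_k \le a)$, which is equivalent to their incomplete gamma form. The chi-square CDF is computed with a plain power series for the regularized lower incomplete gamma function; the function names are illustrative.

```python
import math

def regularized_lower_gamma(s, x, terms=500):
    """Regularized lower incomplete gamma P(s, x) via its power series.

    Adequate for moderate x (roughly x < 100 with the default number of terms).
    """
    if x <= 0.0:
        return 0.0
    log_x = math.log(x)
    total = 0.0
    for m in range(terms):
        # Term m of the series: exp(-x) * x^(s+m) / Gamma(s+m+1), in log space
        total += math.exp(-x + (s + m) * log_x - math.lgamma(s + m + 1.0))
    return min(total, 1.0)

def chi2_cdf(a, df):
    """CDF of a chi-square distribution with df degrees of freedom."""
    return regularized_lower_gamma(df / 2.0, a / 2.0)

def variance_reduction_factor(a, k):
    """Ratio P(chi2_{k+2} <= a) / P(chi2_k <= a) for k covariates, threshold a."""
    return chi2_cdf(a, k + 2) / chi2_cdf(a, k)
```

The factor lies between 0 and 1 and increases toward 1 as the threshold grows, matching the trade-off described above between balance and the expected number of redraws.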

4.3 Pairwise Sequential Randomization

Although RR can significantly reduce the covariate imbalance, it does not scale well to a large number of covariates or a large number of units, a setting that is almost ubiquitous in the era of big data. Pairwise sequential randomization (PSR), recently proposed by Qin et al. (2017), solves this problem by sequentially and adaptively assigning units to different treatment groups, and is proven to have superior performance in terms of covariate balance and variance of the estimated treatment effect. PSR involves the following steps:

  1. Collect covariate data.

  2. Choose the covariate imbalance measure for units, denoted as .

  3. Randomly arrange all units in a sequence , … , .

  4. Separately assign the first two units to treatment 1 and treatment 2.

  5. Suppose that units have been assigned to treatment groups (), for the -th and -th units:

    1. If the -th unit is assigned to treatment 1 and the -th unit is assigned to treatment 2 (i.e., and ), then we can calculate the “potential” imbalance measure, , between the updated treatment groups with units.

    2. Similarly, if the -th unit is assigned to treatment 2 and the -th unit is assigned to treatment 1 (i.e., and ), then we can calculate the “potential” imbalance measure, , between the updated treatment groups with units.

  6. Assign the -th and -th units to treatment groups according to the following probabilities:

    where , and assign to maintain the equal proportions.

  7. Repeat Steps (5) and (6) until all units are assigned.

Similar to RR, PSR again chooses the Mahalanobis distance as the covariate imbalance measure in Step (2) because of its affine invariance and the other desirable properties explained in the previous section. Once the Mahalanobis distance is calculated, the value of is set to 0.75. For a further discussion of , please see Hu and Hu (2012). In this algorithm, is assumed to be even. If is odd, then the last (-th) unit is randomly assigned to either treatment 1 or 2 with probability $1/2$.
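The PSR steps above can be sketched in code. This is a minimal sketch under stated assumptions, not the reference implementation of Qin et al. (2017): the allocation rule in Step (6) is read as choosing the pair assignment with the smaller potential imbalance with probability 0.75 (and 0.5 in case of a tie), covariates are treated as independent so the Mahalanobis distance uses sample variances only, and the name `psr_assign` is hypothetical.

```python
import random

def psr_assign(x, q=0.75, rng=None):
    """Pairwise sequential randomization; returns 0/1 assignments (1 = treatment 1)."""
    rng = rng or random.Random()
    n, p = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(p)]
    variances = [sum((row[j] - means[j]) ** 2 for row in x) / (n - 1) for j in range(p)]
    order = list(range(n))
    rng.shuffle(order)               # Step 3: random sequence of units
    t = [None] * n
    diff = [0.0] * p                 # running sum of (2*T_i - 1) * x_i

    def imbalance(d):
        # Proportional to the Mahalanobis distance for equal group sizes
        return sum(d[j] ** 2 / variances[j] for j in range(p))

    for k in range(0, n - 1, 2):
        u, v = order[k], order[k + 1]
        d1 = [diff[j] + x[u][j] - x[v][j] for j in range(p)]  # u -> trt 1, v -> trt 2
        d2 = [diff[j] - x[u][j] + x[v][j] for j in range(p)]  # u -> trt 2, v -> trt 1
        m1, m2 = imbalance(d1), imbalance(d2)
        # For the first pair m1 == m2, so the split is a fair coin flip (Step 4)
        prob1 = q if m1 < m2 else (1.0 - q if m1 > m2 else 0.5)
        if rng.random() < prob1:
            t[u], t[v], diff = 1, 0, d1
        else:
            t[u], t[v], diff = 0, 1, d2
    if n % 2 == 1:                   # odd n: last unit assigned with probability 1/2
        t[order[-1]] = 1 if rng.random() < 0.5 else 0
    return t
```

Because each pair is split between the two groups, the group sizes are equal for even sample sizes, and each step costs only a pass over the covariates of the two units being assigned.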

Note that the units are not necessarily observed sequentially; however, Qin et al. (2017) propose to allocate them sequentially (in pairs) to minimize the occurrence of covariate imbalance. The sequence in which the units are allocated is not unique. Rather, there are different possible sequences, but their performances are similar, especially when is large.

Comparing PSR with RR, we see that both methods use covariate information. PSR uses the covariate information to decide the unit allocation in each iteration, while RR uses the covariate information to decide if a randomly generated allocation is satisfactory or not. Note that neither PSR nor RR is restricted to Mahalanobis distance. Both methods can be easily adapted to different measures of imbalance. However, Mahalanobis distance does lead to desirable properties of the subsequent analysis. For example, PSR results in the minimum asymptotic variance of the estimated treatment effect.

Under PSR, it is shown in Qin et al. (2017) that


Then we have the following theorem.

Theorem 4.2.

Under PSR, we have

  1. Under , then

  2. Under , where for a fixed ,

The variance from the covariates is completely eliminated in the numerator of the asymptotic distribution of , resulting in a distribution more concentrated around 0 than the standard normal distribution. In fact, the conclusion under PSR can be extended to a large class of covariate-adjusted randomization procedures if Assumption 2 is replaced by (5). This can be considered a natural extension of the conditions proposed in Ma et al. (2015) that lead to conservative tests for covariate-adaptive designs balancing discrete covariates. Note that condition (5) is quite strong and is not a necessary condition for a conservative test. For example, the condition is not satisfied under RR, while the test is also conservative, as shown in Section 4.2.

4.4 Atkinson’s $D_A$-Biased Coin Design

Atkinson’s $D_A$-biased coin design ($D_A$-BCD) is proposed to balance allocations across covariates in order to minimize the variance of estimated treatment effects when a classical linear model between response and covariates is assumed (Atkinson, 1982; Smith, 1984a, b). Unlike RR and PSR, it is used in the setting where covariate information is collected sequentially, such as in clinical trials. More discussion of the method and its properties can be found in Atkinson (2002), Antognini and Zagoraiou (2011), and Atkinson (2014).

$D_A$-BCD sequentially assigns units to treatment groups with an adaptive allocation probability: suppose units have been assigned to treatment groups, $D_A$-BCD assigns the -th unit to treatment 1 with probability

where and .

Under $D_A$-BCD, by applying result (10.5) of Smith (1984b), it holds that

It is clear that under $D_A$-BCD the variance of the imbalance vector of covariates is reduced to of that under complete randomization, indicating that covariates are more balanced compared to complete randomization. The next theorem states the asymptotic distributions of under $D_A$-BCD.

Theorem 4.3.

Under $D_A$-BCD, we have

  1. Under , then

  2. Under , where for a fixed ,

This theorem shows that the test statistic has an asymptotic variance smaller than 1, so the test is conservative, with a reduced type I error for testing the treatment effect.

Based on the above four CAR procedures, we find the following. (i) Under CR, the distribution of is still asymptotically standard normal, so the traditional test has the correct type I error. This is because CR does not use covariate information at the assignment stage. (ii) Under the other three procedures (RR, $D_A$-BCD, and PSR), the asymptotic distributions of are not standard normal. Therefore, their type I errors (based on ) are no longer correct. Based on their distributions of , we can compare their type I errors as well as their powers. In the next section, we discuss the correction of type I errors for CAR procedures, after which their adjusted powers can be compared. The general theorems in Section 3 may also be applied to other CAR procedures. Numerical comparisons of these randomization methods are given in Section 6.

5 Correction for Conservativeness

As we can see, most of the covariate-adjusted randomizations lead to conservative type I error for testing the treatment effect, because traditional tests use standard normal distribution as the null distribution. Therefore, we propose the following approach to correct the conservative type I errors and to obtain higher powers.

Since we have derived the asymptotic distribution of in Theorem 3.3, we can obtain the correct asymptotic critical values. However, since the asymptotic distribution depends on unknown parameters, we need to estimate them using the observed sample to obtain the approximated null distribution and adjust the corresponding critical values and p-values.
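One simple way to turn an estimated null distribution into adjusted critical values and p-values is by Monte Carlo. The sketch below is only illustrative: it assumes that draws from the estimated null distribution of the test statistic are already available (how they are generated from the plug-in parameter estimates is left abstract), and that the null distribution is symmetric so a two-sided test can be based on the absolute value. The function names are hypothetical.

```python
def corrected_critical_value(null_draws, alpha=0.05):
    """Empirical (1 - alpha) quantile of |T| over draws from the estimated null."""
    abs_draws = sorted(abs(d) for d in null_draws)
    index = min(len(abs_draws) - 1, int((1.0 - alpha) * len(abs_draws)))
    return abs_draws[index]

def corrected_p_value(stat, null_draws):
    """Two-sided Monte Carlo p-value of an observed statistic."""
    exceed = sum(1 for d in null_draws if abs(d) >= abs(stat))
    # Add-one adjustment keeps the p-value strictly positive
    return (exceed + 1) / (len(null_draws) + 1)
```

For instance, if the estimated null distribution is normal with standard deviation 0.5, the corrected 5% critical value is close to 0.98 rather than 1.96, so the corrected test rejects more often under the alternative.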

After adjusting the critical values and p-values, we are able to obtain more powerful hypothesis testing results. The more conservative the traditional tests are, the more powerful their corrected versions become. Finally, we compare the covariate-adjusted randomization procedures mentioned above in terms of covariate balance, conservativeness of the traditional tests, and powers of the corrected test, and summarize their advantages and disadvantages in Table 1. The conclusions in the table are further verified through simulation in the next section.

| Randomization | Covariate balance | Type I error of traditional test | Power of corrected test |
| --- | --- | --- | --- |
| CR | least balanced | valid | least powerful |
| RR | moderately balanced | moderately conservative | moderately powerful |
| $D_A$-BCD | moderately balanced | moderately conservative | moderately powerful |
| PSR | most balanced | most conservative | most powerful |
Table 1: Comparison of different covariate-adjusted randomization procedures in terms of covariate balance, traditional tests’ conservativeness, and corrected tests’ powers.

6 Numerical Studies

In this section, we perform simulation studies to verify the theoretical results and demonstrate their effectiveness in obtaining high powers for hypothesis testing. We have tested various randomization procedures, including CR, RR, PSR, and $D_A$-BCD.

6.1 Verification of Theoretical Results

We first verify the theoretical asymptotic distribution of under CR, RR, PSR, and $D_A$-BCD. Assume the underlying model is where , for . for and is independent of each other. The random error is independent of all . We simulate the data according to the underlying model with sample size and use the working model which includes only two covariates out of four, , to obtain the test statistic . In Figure 1, we plot the simulated distributions of along with the theoretical distributions of given by Theorem 3.3. As the figure shows, the theoretical distributions are very close to the simulated distributions for all randomization procedures, which verifies our theoretical results. For comparison, we plot the standard normal distribution in bold gray. As we move from the left panel to the right panel (i.e., from CR to PSR), we can see that the distribution of becomes narrower. Therefore, using the critical values or the p-values obtained from the standard normal distribution will result in conservative tests with reduced type I errors for RR, PSR, and $D_A$-BCD. The correction for such conservativeness is further illustrated in the following sections.

Figure 1: Comparison of theoretical distributions and simulated distributions of . From left panel to right panel: (1) CR, (2) RR, (3) $D_A$-BCD, and (4) PSR. In each panel, the red solid curve represents the simulated distribution, the blue dashed curve represents the theoretical distribution, and the gray bold curve is the standard normal density.

6.2 Conservative Hypothesis Testing for Treatment Effect

The previous sections show that the traditional test for the treatment effect is conservative under most covariate-adjusted randomization procedures. In this section, we verify this phenomenon numerically. Suppose the underlying model is


where for . and are independent of each other. The random error is independent of all . We use the following four working models to test the treatment effect, i.e., and .

W1: .

W2: .

W3: .

W4: .

Note that the first model is equivalent to the two-sample t-test and the last working model is the same as the underlying model. We simulate data according to (6) with and sample size and obtain the type I errors of the traditional tests. The results are shown in Table 2. Under CR, all working models provide correct type I errors. However, under RR, -BCD, and PSR, W1, W2, and W3 generate conservative type I errors below 5%, with PSR being the most conservative. This shows that covariate-adjusted randomization leads to conservative results for traditional tests for the treatment effect. The better a randomization procedure balances the covariates, the more conservative the tests become. We further simulate data using and , obtain the power of the traditional tests, and present the results in Table 3. Because the type I errors are conservative, the powers under RR, -BCD, and PSR are also reduced.

Since we know the true data generating process, we can obtain the true critical values for each scenario so that the type I errors are exactly 5%. Using the true critical values, we can obtain the true power of the tests in the same setting. We present these powers in Table 4. The powers are much higher than the ones reported in Table 3. The powers under PSR are the highest among all randomization procedures because PSR balances the covariates best. These results are consistent with Table 1.

The simulation above shows that PSR, -BCD, and RR generate more balanced covariates than CR and are therefore able to provide more powerful tests, provided that the correct critical values or p-values are available. Note that, in practice, we may not know the true data generating process and need to estimate the critical values and p-values.
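The idea of calibrating critical values from a known null distribution can be sketched as follows. This is an illustration only and not the procedure used to produce Table 4: the test statistic is replaced by a normal stand-in whose standard deviation (`SD`) and mean under the alternative (`SHIFT`) are hypothetical values chosen to mimic a statistic that is narrower than the standard normal. The helper names are likewise hypothetical.

```python
import math
import random


def empirical_critical_value(null_stats, alpha=0.05):
    """Two-sided critical value: the (1 - alpha) empirical quantile of |T|."""
    abs_sorted = sorted(abs(s) for s in null_stats)
    k = math.ceil((1 - alpha) * len(abs_sorted)) - 1
    return abs_sorted[k]


def rejection_rate(stats, critical_value):
    """Fraction of statistics whose absolute value exceeds the critical value."""
    return sum(abs(s) > critical_value for s in stats) / len(stats)


rng = random.Random(1)
SD = 0.6     # hypothetical SD of the statistic under a well-balanced design
SHIFT = 1.5  # hypothetical mean of the statistic under the alternative

# Separate draws for calibration and evaluation to avoid optimistic bias.
null_calibration = [rng.gauss(0, SD) for _ in range(20000)]
null_evaluation = [rng.gauss(0, SD) for _ in range(20000)]
alt_evaluation = [rng.gauss(SHIFT, SD) for _ in range(20000)]

c_normal = 1.96
c_calibrated = empirical_critical_value(null_calibration)

size_normal = rejection_rate(null_evaluation, c_normal)
size_calibrated = rejection_rate(null_evaluation, c_calibrated)
power_normal = rejection_rate(alt_evaluation, c_normal)
power_calibrated = rejection_rate(alt_evaluation, c_calibrated)

print(f"critical value: normal {c_normal:.3f}, calibrated {c_calibrated:.3f}")
print(f"type I error:   normal {size_normal:.4f}, calibrated {size_calibrated:.4f}")
print(f"power:          normal {power_normal:.4f}, calibrated {power_calibrated:.4f}")
```

With a narrower null distribution, the standard normal critical value rejects far less often than 5% under the null, while the calibrated value restores the nominal level and raises the power, in line with the qualitative pattern of Tables 2 to 4.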

Randomization W1 W2 W3 W4
CR 0.0529 0.0512 0.0538 0.0513
RR 0.0114 0.0166 0.0259 0.0502
-BCD 0.0071 0.0118 0.0249 0.0532
PSR 0.0018 0.0058 0.0178 0.0519
Table 2: Type I error of traditional tests for treatment effect using different working models and different randomization procedures.
Randomization W1 W2 W3 W4
CR 0.1791 0.2149 0.2733 0.3872
RR 0.1260 0.1711 0.2477 0.3841
-BCD 0.1116 0.1550 0.2443 0.3861
PSR 0.0867 0.1400 0.2352 0.3864
Table 3: Power of the traditional tests for treatment effect using different working models and different randomization procedures when and , i.e., under .
Randomization W1 W2 W3 W4
CR 0.1801 0.2106 0.2644 0.3825
RR 0.2691 0.2971 0.3344 0.3908
-BCD 0.3130 0.3360 0.3695 0.3931
PSR 0.3684 0.3770 0.3812 0.3900
Table 4: Power of the hypothesis testing for treatment effect using true critical values under different working models and different randomization procedures when and .

6.3 Corrected Hypothesis Testing for Treatment Effect

In practice, to correct the type I error and obtain higher powers, we can estimate the critical values and p-values based on the estimated asymptotic distribution (Section 5). Using this approach, we repeat the same simulation as in the previous section, and present the type I errors in Table 5 and the powers in Figure 2.

As Table 5 shows, all type I errors are controlled at approximately 5%, which means the proposed approach works well. In Figure 2, we can see that as increases away from 0, the powers generally increase. However, under CR, different working models provide different powers: the more covariates included in the working model, the higher the power. This is because CR cannot balance the covariates well, and the covariates not included in the working model affect the test for the treatment effect. In contrast, under PSR, all working models provide similar powers because PSR balances all covariates well. Since RR and -BCD also balance the covariates, but not as well as PSR does, their powers are slightly better than those under CR but much worse than those under PSR.

Randomization W1 W2 W3 W4
CR 0.0477 0.0495 0.0459 0.0451
RR 0.0514 0.0498 0.0515 0.0510
-BCD 0.0508 0.0518 0.0525 0.0511
PSR 0.0597 0.0584 0.0504 0.0477
Table 5: Type I error of hypothesis testing for treatment effect using estimated asymptotic distribution’s critical values under different working models and different randomization procedures.
Figure 2: Power against using estimated asymptotic distribution’s critical values and p-values. Sample size . From left panel to right panel: 1) CR, 2) RR with , 3) -BCD, 4) PSR. Note that we plot the power of W4 under CR in bold gray curves in all the panels for a better comparison among different randomizations.

6.4 Hypothesis Testing for Covariate Effect

Lastly, we compare the performance of the traditional test for the third covariate effect, i.e., and . We adopt the same setting as in the previous section and choose a range of values from 0 to 1 for to calculate the power under different working models. Note that in this case, only W3 and W4 contain the third covariate. The type I errors are shown in Table 6 and the powers in Figure 3. The type I errors are all controlled at 5%, which is consistent with our theoretical results shown in Theorem 3.5. In other words, no correction is needed for testing the covariate effect. On the other hand, Figure 3 shows that, if the working model does not include all covariates, the powers are reduced. This is again consistent with the results in Theorem 3.5. It is worth noting that the performance of the hypothesis test for the covariate effect does not depend on the choice of randomization procedure, as all panels of Figure 3 are almost identical.

Randomization W3 W4
CR 0.0527 0.0532
RR 0.0520 0.0509
-BCD 0.0485 0.0428
PSR 0.0512 0.0496
Table 6: Type I error of hypothesis testing for covariate effect using unadjusted critical values under different working models and different randomization procedures.
Figure 3: Power against . Sample size . From left panel to right panel: 1) CR, 2) RR with , 3) -BCD, 4) PSR. Note that we plot the power of W4 under CR in bold gray curves in all the panels for a better comparison among different randomizations.

7 Conclusion

In this article, the impact of covariate-adjusted randomization on inference properties is studied. The theoretical properties of post-randomization inference are established in Section 3 for general covariate-adjusted randomization. These results provide a theoretical foundation for analyzing experiments based on covariate-adjusted randomization. Based on these results, one can then compare different covariate-adjusted randomization procedures theoretically. We then apply the theoretical properties to several popular covariate-adjusted randomization methods in the literature. Finite-sample properties are also studied via extensive simulations.

As shown in Theorem 3.3, inference properties under covariate-adjusted randomization are closely related to how well covariates are balanced, which is measured by asymptotically. If is known, the inference properties can be derived according to Theorem 3.3. However, the exact form of may be unknown for some randomization procedures. In that case, we can evaluate the imbalance vector of covariates numerically to estimate its asymptotic distribution, for example, using the bootstrap (Shao et al., 2010). Therefore, different covariate-adjusted randomizations can be compared in terms of both balancing and inference properties.
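A numerical evaluation of the imbalance can be sketched with plain Monte Carlo, which is a simpler substitute for the bootstrap of Shao et al. (2010) and assumes the covariate distribution is known. The sketch below takes the scaled imbalance to be the sum of (2T_i - 1)x_i divided by the square root of n for a single standard normal covariate; this form is an assumption made for illustration. The balanced design used for comparison (`sorted_pair_design`) is a hypothetical stand-in that pairs neighbouring units after sorting, and is not one of the procedures studied in the paper.

```python
import math
import random
import statistics


def coin_flip_design(x, rng):
    """Complete randomization: independent fair coin flips."""
    return [rng.randint(0, 1) for _ in x]


def sorted_pair_design(x, rng):
    """Hypothetical balanced design: pair neighbours in sorted order and
    assign the two units of each pair to different arms at random.
    Assumes an even number of units."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    t = [0] * len(x)
    for k in range(0, len(order) - 1, 2):
        first = rng.randint(0, 1)
        t[order[k]] = first
        t[order[k + 1]] = 1 - first
    return t


def scaled_imbalance(x, t):
    """Sum of (2 T_i - 1) x_i divided by sqrt(n), for one covariate."""
    return sum((2 * ti - 1) * xi for xi, ti in zip(x, t)) / math.sqrt(len(x))


def imbalance_variance(design, reps=2000, n=100, seed=0):
    """Monte Carlo estimate of the variance of the scaled imbalance."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        draws.append(scaled_imbalance(x, design(x, rng)))
    return statistics.variance(draws)


var_cr = imbalance_variance(coin_flip_design)
var_pair = imbalance_variance(sorted_pair_design)
print(f"imbalance variance under CR: {var_cr:.3f}")
print(f"imbalance variance under sorted pairs: {var_pair:.3f}")
```

Any design can be passed in as a function of the covariates and a random generator, so the same loop can be reused to compare procedures whose limiting imbalance distribution has no closed form.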

As shown in Section 4.3, the estimated treatment effect under PSR achieves the minimum asymptotic variance, and hence the subsequent hypothesis tests are most powerful. PSR is proposed for the scenario in which all covariate data are available before treatment allocation; this property is also desirable in the setting of sequential allocation, such as in clinical trials (Qin et al., 2017). For categorical covariates, stratification, minimization, and other procedures that satisfy certain conditions have this property (Ma et al., 2015), but randomization procedures with the same property for continuous covariates are still needed.

A linear model framework is assumed for the underlying model and the working model in our proposed framework. However, it is common to see other types of outcomes in practice. In Feinstein and Landis (1976) and Green and Byar (1978), the properties of unadjusted tests are studied for binary responses, in which case the type I error is decreased for stratified randomization compared to unstratified randomization. More recently, inference properties have been studied based on generalized linear models or survival analyses for certain randomization procedures, such as stratification and minimization (Shao and Yu, 2013; Luo et al., 2016; Xu et al., 2016), but the properties for non-continuous outcomes under general covariate-adjusted methods remain unknown. It is desirable to extend the framework and results obtained in this article to the non-linear model framework.

The proposed framework and results can be generalized in several directions. First, the current work assumes equal allocation to each treatment; however, other target proportions may be preferred in some cases. Second, the framework in this article is based on experiments with two treatment groups, which can be extended to multiple treatments (Tymofyeyev et al., 2007). Last, other outcome types, such as time-to-event or categorical responses, can be studied with modifications to the proposed framework. These topics are left for future research.

8 Appendix: Proof of Main Theorems

To prove the main theorems, we first prove the following lemma.

Lemma 8.1.

Under Assumption 2, we have


for any .

Proof of Lemma 8.1.

First, it is easy to see that for any ,

Assumption 2 implies that converges to the -th dimension of in distribution, and thus it is bounded in probability. So we have

Also, it follows from the weak law of large numbers that

Therefore, for any ,

Similarly, it can also be shown that,

Proof of Theorem 3.1.

The OLS estimate can be written as

Note that the covariates are assumed to be independent of each other and for any ; then, by Assumption 1 and the weak law of large numbers,

and, together with Lemma 8.1,

In addition, the independence of and implies that

Therefore, we have

that is,

To prove the second part of the theorem, since we have shown , it suffices to show that

is bounded in probability. Note that Assumption 2 implies

for any . Also, using the independence of and , we have

Similar arguments give

which, together with the central limit theorem, yields

Proof of Corollary 3.2.

By the representation given in Theorem 3.1, we have

then Assumption 2 can be applied so that