Statistical Inference of Covariate-Adjusted Randomized Experiments
Abstract
Covariate-adjusted randomization procedures are frequently used in comparative studies to increase covariate balance across treatment groups. However, as the randomization inevitably uses the covariate information when forming balanced treatment groups, the validity of classical statistical methods following such randomization is often unclear. In this article, we derive the theoretical properties of statistical methods based on general covariate-adjusted randomization under the linear model framework. More importantly, we explicitly unveil the relationship between covariate balance and inference properties by deriving the asymptotic representations of the corresponding estimators. We apply the proposed general theory to various randomization procedures, such as complete randomization (CR), rerandomization (RR), pairwise sequential randomization (PSR), and Atkinson’s biased coin design (BCD), and compare their performance analytically. Based on the theoretical results, we then propose a new approach to obtain valid and more powerful tests. These results open the door to understanding and analyzing experiments based on covariate-adjusted randomization. Simulation studies provide further evidence of the advantages of the proposed framework and theoretical results.
Keywords: balancing covariates, conservative tests, covariate-adjusted randomization, Mahalanobis distance, power, rerandomization.
1 Introduction
Randomization is considered the “gold standard” for evaluating treatment effects, as it mitigates selection bias and provides a foundation for statistical inference. Among all the randomization methods, covariate-adjusted randomization (CAR) procedures are frequently used because they utilize covariate information to form more balanced treatment groups. However, because of this very feature, the validity of classical statistical inference following such randomization is usually unclear. In this article, we establish a general theory by which properties of statistical inference can be obtained for covariate-adjusted randomization under mild conditions.
There have been extensive studies on CAR procedures. When facing categorical covariates, Pocock and Simon’s minimization method and its extensions can be used to reduce covariate imbalance at different levels (Taves, 1974; Pocock and Simon, 1975; Hu and Hu, 2012); these methods can also handle continuous covariates through discretization. To avoid the information loss due to discretization, many randomization methods that directly utilize continuous covariates have also been proposed in the literature (Frane, 1998; Lin and Su, 2012; Ma and Hu, 2013). Atkinson’s biased coin design (BCD) represents a large class of methods that take covariates into account in allocation rules based on certain optimality criteria (Atkinson, 1982; Smith, 1984a, b; Antognini and Zagoraiou, 2011). When all units’ covariates are available before the experiment starts, we can adopt rerandomization (RR), which repeats the traditional randomization process until a satisfactory configuration is achieved (Morgan and Rubin, 2012, 2015). In addition, pairwise sequential randomization (PSR), recently proposed by Qin et al. (2017), is another alternative, which achieves optimal covariate balance and is computationally more efficient. Details of these methods are given in Section 4. For an overview, see Hu et al. (2014) and Rosenberger and Lachin (2015).
Since the aforementioned randomizations inevitably use covariate information in forming more balanced treatment groups, the subsequent statistical inference is usually affected and demonstrates undesirable properties, such as reduced type I errors (Shao et al., 2010; Morgan and Rubin, 2012; Ma et al., 2015). This phenomenon of conservativeness is particularly common for a working model that includes only a subset of the covariates used in randomization, such as the two-sample t-test. As all the covariates are used in the randomization to generate more balanced assignments, a valid statistical procedure should incorporate all the covariates. Therefore, excluding some covariates from the working model distorts the sampling distribution of the test statistics, which consequently causes invalid statistical inference.
Ideally, the covariates used in randomization should be included in the subsequent analysis, as recommended by regulatory guidelines in the context of clinical trials (ICH E9, 1998; EMA, 2015). However, unadjusted tests still dominate in practice (Sverdlov, 2015). For example, to avoid too many parameters, investigation sites are usually omitted from the analysis model in a multicenter clinical trial. Other reasons not to incorporate all the covariates in practice include the simplicity of the test procedure, robustness to model misspecification, and so on (Shao et al., 2010; Shao and Yu, 2013; Ma et al., 2015). Therefore, many working models may suffer from the issue of invalid statistical inference. As covariates are commonly used in comparative studies such as biomarker analysis, personalized medicine (Hu and Hu, 2012), and crowdsourced internet experimentation (Horton et al., 2011; Chandler and Kapelner, 2013), understanding the impact of covariate-adjusted randomization on statistical inference is an increasingly pressing problem.
The validity of statistical inference after balancing covariates was investigated mainly through simulations in the early literature, such as Birkett (1985) and Forsythe (1987). More recently, theoretical progress has been made on the inference properties of some specific covariate-adjusted randomization methods. Shao et al. (2010) prove that the two-sample t-test is conservative under a special stratified randomization. Ma et al. (2015) study hypothesis testing under a linear model for discrete covariate-adaptive randomization, assuming that the overall and marginal imbalances across covariates are bounded in probability. However, their results are limited, as many covariate-adjusted procedures deal with continuous covariates directly and do not necessarily satisfy the strong balancing assumptions. In fact, the inference properties of many methods, as we will show for RR, PSR, and BCD, are different from those studied by Shao et al. (2010) and Ma et al. (2015). In this article, we study inference properties under a general framework and demonstrate the impact of covariate-adjusted randomization on inference.
The main contributions of this article are as follows. First, we derive the statistical properties of inference following general covariate-adjusted randomization methods under the linear model framework. Most importantly, we explicitly display the relationship between covariate balance and inference by deriving their asymptotic representations. This result explains why inference behaves differently under various randomization methods. Second, we show that the results have broad applications, which is illustrated by applying them to several randomization procedures, including CR, RR, PSR, and BCD. In addition, our theory provides an approach to formally evaluate inference properties and compare the pros and cons of different randomization methods. Third, we propose a method to obtain valid and powerful tests based on our theoretical results. This study lays a foundation for understanding the impact of covariate balance on post-randomization statistical inference and sheds light on future studies in this area.
This article is organized as follows. After introducing the framework and notation in Section 2, we present our main theoretical results for statistical inference under covariate-adjusted randomization in Section 3. Using the proposed theory, we study four specific randomization methods in terms of their conservativeness in hypothesis testing in Section 4, and further propose a method to correct the conservative type I errors in Section 5. In Section 6, numerical studies are presented to illustrate the effectiveness of the proposed theory. Section 7 concludes with some remarks and future research topics. The main theoretical proofs are in the Appendix.
2 General Framework
Suppose that $n$ units are to be assigned to two treatment groups using a covariate-adjusted randomization. Let $T_i$ be the assignment of the $i$th unit, i.e., $T_i = 1$ for treatment 1 and $T_i = 0$ for treatment 2. Let $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{i,p+q})^{\mathrm{T}}$ represent the $p+q$ covariates observed for the $i$th unit, where the $\boldsymbol{x}_i$ are independent and identically distributed for each unit $i = 1, \ldots, n$. A linear regression model is assumed for the outcome $Y_i$ of the $i$th unit,

$$Y_i = \mu_1 T_i + \mu_2 (1 - T_i) + \boldsymbol{x}_i^{\mathrm{T}} \boldsymbol{\beta} + \varepsilon_i, \qquad (1)$$

where $\mu_1$ and $\mu_2$ are the main effects of treatments 1 and 2, respectively, and $\mu_1 - \mu_2$ is the treatment effect. Furthermore, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_{p+q})^{\mathrm{T}}$ represents the covariate effects, and the $\varepsilon_i$ are independent and identically distributed random errors with mean zero and constant variance, independent of the covariates. For simplicity, all the covariates are assumed to be independent of each other and to have expectation zero, i.e., $E[x_{ij}] = 0$ for $j = 1, \ldots, p+q$.

After allocating the units to treatment groups via covariate-adjusted randomization, a working model is used to estimate and test the treatment effect. In such a working model, it is common in practice to include only a subset of the covariates used in randomization, or sometimes even no covariates at all (Shao et al., 2010; Ma et al., 2015; Sverdlov, 2015). Therefore, without loss of generality, suppose that the first $p$ covariates are included in the working model,

$$Y_i = \mu_1 T_i + \mu_2 (1 - T_i) + \sum_{j=1}^{p} \beta_j x_{ij} + e_i. \qquad (2)$$

Note that $q = 0$ when all the covariates are included in the working model, and $p = 0$ when no covariates are included.
Let $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^{\mathrm{T}}$, and let $\boldsymbol{X}$ denote the $n \times (p+2)$ design matrix whose $i$th row is $(T_i, 1 - T_i, x_{i1}, \ldots, x_{ip})$. Then the working model can also be written as

$$\boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\theta} + \boldsymbol{e},$$

where $\boldsymbol{\theta} = (\mu_1, \mu_2, \beta_1, \ldots, \beta_p)^{\mathrm{T}}$ is the vector of parameters of interest. Therefore, the ordinary least squares (OLS) estimate of $\boldsymbol{\theta}$ is

$$\hat{\boldsymbol{\theta}} = (\boldsymbol{X}^{\mathrm{T}} \boldsymbol{X})^{-1} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{Y}.$$
Under covariate-adjusted randomization, the treatment assignments depend on the covariates, so the distribution of $\hat{\boldsymbol{\theta}}$ is often difficult to obtain. However, testing the treatment effect is often the primary goal when performing a comparative study (e.g., a randomized clinical trial). To detect whether a treatment effect exists, we have the following hypothesis testing problem,

$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_A: \mu_1 \neq \mu_2, \qquad (3)$$

with the test statistic

$$T = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\hat{\sigma}^2 \, \boldsymbol{L}^{\mathrm{T}} (\boldsymbol{X}^{\mathrm{T}} \boldsymbol{X})^{-1} \boldsymbol{L}}},$$

where $\boldsymbol{L} = (1, -1, 0, \ldots, 0)^{\mathrm{T}}$ is a vector of length $p+2$, and $\hat{\sigma}^2$ is the model-based estimate of the error variance. The traditional testing procedure is to reject the null hypothesis at significance level $\alpha$ if $|T| > z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of a standard normal distribution.
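As a concrete illustration, the working-model fit and the resulting t-type statistic can be sketched as follows. This is a minimal sketch, not the paper's code: the $0/1$ treatment coding, the two-group design matrix, and the contrast vector are conventions assumed for this example.

```python
import numpy as np

def working_model_test(Y, T, X_work):
    # Design matrix of the working model: the two group indicators act as
    # treatment-specific intercepts, followed by the included covariates.
    n = len(Y)
    X = np.column_stack([T, 1.0 - T, X_work])
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ theta_hat
    sigma2_hat = resid @ resid / (n - X.shape[1])   # model-based error variance
    L = np.zeros(X.shape[1])
    L[0], L[1] = 1.0, -1.0                          # contrast for mu1 - mu2
    effect = L @ theta_hat
    t_stat = effect / np.sqrt(sigma2_hat * L @ XtX_inv @ L)
    return effect, t_stat

# Toy data with no true treatment effect; the working model omits one covariate.
rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal((n, 2))
T = rng.integers(0, 2, size=n).astype(float)         # complete randomization
Y = 1.0 + 0.8 * x[:, 0] + 0.8 * x[:, 1] + rng.standard_normal(n)
effect, t_stat = working_model_test(Y, T, x[:, :1])  # includes only x1
```

Under complete randomization this statistic can be compared against standard normal quantiles; the sections below examine what happens when the randomization itself balances the covariates.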
In addition to testing the treatment effect, it is often of interest to test whether covariate effects exist. A general form of hypothesis testing can be used for any linear combination of the covariate effects. Let $\boldsymbol{C}$ be an $s \times (p+2)$ matrix of rank $s$ with the entries in its first two columns all equal to zero (no treatment effect to test). Consider the following hypotheses,

$$H_0: \boldsymbol{C} \boldsymbol{\theta} = \boldsymbol{0} \quad \text{versus} \quad H_A: \boldsymbol{C} \boldsymbol{\theta} \neq \boldsymbol{0}, \qquad (4)$$

and the test statistic is

$$F = \frac{(\boldsymbol{C}\hat{\boldsymbol{\theta}})^{\mathrm{T}} \left[ \boldsymbol{C} (\boldsymbol{X}^{\mathrm{T}} \boldsymbol{X})^{-1} \boldsymbol{C}^{\mathrm{T}} \right]^{-1} (\boldsymbol{C}\hat{\boldsymbol{\theta}})}{s \, \hat{\sigma}^2}.$$

The traditional test rejects the null hypothesis if $F$ exceeds the $(1-\alpha)$th percentile of an $F$ distribution with $(s, n-p-2)$ degrees of freedom. Note that we can let $\boldsymbol{C} = (0, 0, 1, 0, \ldots, 0)$ to test the significance of the single covariate effect $\beta_1$, and similarly for other covariate effects.
3 General Properties
Based on the framework introduced above, we study the statistical properties of estimation and hypothesis testing, i.e., (3) and (4), under covariate-adjusted randomization. Before presenting our main results, we first introduce two widely satisfied assumptions.
Assumption 1.
Global balance: the proportion of units assigned to each treatment group converges to $1/2$ almost surely as $n \to \infty$.
Assumption 2.
Covariate balance: the imbalance vector of covariates, scaled by $n^{-1/2}$, converges in distribution to a $(p+q)$-dimensional random vector with finite second moments.
Assumption 1 requires that the proportions of units in each treatment group converge to $1/2$, which is usually the desired target proportion, as balanced treatment assignments are more likely to provide efficient estimation and powerful tests. On the other hand, Assumption 2 specifies the asymptotic behavior of the imbalance vector of covariates. That is, the sums of covariates in the two treatment groups tend to be equal as the sample size increases. Together with Assumption 1, this implies the similarity of the averages of each covariate between the two treatment groups. The two assumptions ensure that a covariate-adjusted randomization procedure achieves good balancing properties, both globally and across covariates. It is worth pointing out that Assumption 2 is satisfied with a degenerate limit at zero under the assumptions of Ma et al. (2015) for discrete covariates.
The properties of classical statistical methods are well known and well studied under the full model (1) in the literature. In practice, however, the final statistical inference is often based on the working model (2). We now present our main theoretical results based on the working model (2).
Theorem 3.1.
Furthermore,
where .
The representation provides a convenient way to derive the asymptotic distribution of and its linear combinations. In particular, for the estimated treatment effect it holds
based on which the asymptotic distribution can be obtained. We partition so that represents the first dimensions of , and the last dimensions. Let be a standard normal random variable that is independent of . We have the following corollary.
The corollary describes the asymptotic behavior of the estimator under the working model (2). If the model parameters were known, statistical inference, such as a Wald-type hypothesis test, could be constructed based on the asymptotic distribution. In practice, these parameters are unknown and the model-based test procedure defined in (3) is used instead. It assumes the normal approximation for the asymptotic distribution, and the asymptotic variance is estimated by the model-based variance estimate, whose limit is derived in the Appendix. The asymptotic properties of the test (3) under both the null and alternative hypotheses are presented in the following theorem.
Theorem 3.3.
The asymptotic distribution of the test statistic under the null hypothesis consists of two independent components. The first component is due to the random error in the underlying model (1), and remains invariant under different covariate-adjusted randomizations, as the randomization procedure utilizes only covariate information and does not depend on the observed responses (Hu and Rosenberger, 2006). The second component involves the last $q$ dimensions of the limit of the imbalance vector. By Assumption 2, this limit is the asymptotic distribution of the imbalance vector of covariates and illustrates how well covariates are balanced under a specific covariate-adjusted randomization method. The better the method performs in terms of covariate balance, the more concentrated this limit is around zero. Therefore, the second component represents the impact of a covariate-adjusted randomization on the test statistic through the level of covariate balance. Depending on the extent to which covariates are balanced, the test may behave differently in terms of size and power.
When the asymptotic distribution of the test statistic is no longer standard normal, the traditional test may fail to maintain the prespecified type I error. If the $(1-\alpha/2)$th quantile of the asymptotic null distribution is smaller than $z_{1-\alpha/2}$, the test is conservative in the sense that the actual type I error is smaller than the prespecified level $\alpha$. In fact, such conservativeness is often the case for covariate-adjusted randomization, as can be seen by comparing complete randomization with covariate-adjusted randomization. Under complete randomization, the imbalance vector follows a normal distribution, which makes the test statistic asymptotically standard normal (Section 4.1), so the test has a valid type I error. However, covariate-adjusted randomization is used with the purpose of reducing the imbalance of covariates between treatment groups, and hence the imbalance vector is more concentrated around zero than under complete randomization, leading to conservative tests. Three special cases of covariate-adjusted randomization (RR, PSR, and BCD) are discussed in detail in Sections 4.2, 4.3, and 4.4, respectively. The correction of conservative tests is discussed in Section 5.
Besides the type I error, an explicit form of the power under local alternatives can also be derived from Theorem 3.3, using the cumulative distribution function of the asymptotic distribution of the test statistic. In Section 6, the power is evaluated numerically for several covariate-adjusted randomization methods.
Similarly to the treatment effect, the inference for the covariates can also be studied following the representation given in Theorem 3.1. The next corollary illustrates the asymptotic normality of the estimated covariate effects under covariate-adjusted randomization.
Corollary 3.4.
Based on this asymptotic normality, tests for these parameters can be constructed with the asymptotic variances replaced by their consistent estimates. The next theorem shows that the standard test defined in (4) under the working model is valid for linear combinations of the covariate effects.
Theorem 3.5.
Theorem 3.5 states that the type I error is maintained when testing the covariate effects under covariate-adjusted randomization. The power, however, is reduced if not all covariate information is incorporated in the working model. Since the inference on covariate effects is valid under covariate-adjusted randomization, we mainly focus on testing the treatment effect in the next section.
4 Properties of Several CAR Procedures
In the last section, we derived the theoretical properties of general CAR procedures. We now apply our results to several important CAR procedures proposed in the literature. These applications help us understand the relationship between balancing and inference for a given CAR procedure.
4.1 Complete Randomization
Complete randomization (CR) assigns each unit independently to each treatment group with equal probability 1/2. Since the treatment assignments are independent and do not depend on the covariates, it follows from the central limit theorem that the imbalance vector of covariates, scaled by $n^{-1/2}$, converges in distribution to a multivariate normal distribution with mean zero and covariance equal to that of the covariates.

Therefore, under CR, the limit defined in Assumption 2 is normal. By Theorem 3.3, it is then easy to show that the asymptotic distribution of the test statistic under the null hypothesis is standard normal. The traditional hypothesis test under CR is thus valid, with correct type I error, and no adjustment is needed.
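The asymptotic normality of the scaled imbalance vector under CR can be checked with a small Monte Carlo sketch. This is illustrative only: the covariate distribution and the $\pm 1$ treatment coding of the imbalance vector are assumptions made for the example.

```python
import numpy as np

# Monte Carlo check: under complete randomization, the covariate-imbalance
# vector scaled by n^{-1/2} is approximately normal with the covariance of
# the covariates (the identity matrix for these simulated covariates).
rng = np.random.default_rng(1)
n, p, reps = 1000, 3, 2000
imbalances = np.empty((reps, p))
for r in range(reps):
    X = rng.standard_normal((n, p))          # iid covariates, identity covariance
    signs = rng.choice([-1.0, 1.0], size=n)  # CR: a fair coin flip per unit
    imbalances[r] = signs @ X / np.sqrt(n)   # scaled covariate-imbalance vector
mean_hat = imbalances.mean(axis=0)
std_hat = imbalances.std(axis=0)
```

Each coordinate of the simulated imbalance has mean near 0 and standard deviation near 1, matching the stated limit.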
4.2 Rerandomization
To balance the covariates across treatment groups, Morgan and Rubin (2012) proposed rerandomization (RR), whose procedure can be summarized as follows:

(1) Collect covariate data.

(2) Specify a balance criterion to determine when a randomization is acceptable. For example, the criterion could be defined as a threshold $a$ on some user-defined imbalance measure, denoted as $M$.

(3) Randomize the units into treatment groups using traditional randomization methods, such as CR.

(4) Check the balance criterion. If the criterion is satisfied, go to Step (5); otherwise, return to Step (3).

(5) Perform the experiment using the final randomization obtained in Step (4).
Morgan and Rubin (2012) chose the imbalance measure in Step (2) to be the Mahalanobis distance between the sample means of the two treatment groups, defined by

$$M = \frac{n}{4} \, (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^{\mathrm{T}} \, [\operatorname{cov}(\boldsymbol{x})]^{-1} \, (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2),$$

where $\bar{\boldsymbol{x}}_1$ and $\bar{\boldsymbol{x}}_2$ are the sample means of the covariates in the two treatment groups. There are several advantages to adopting such an imbalance measure. The Mahalanobis distance is an affinely invariant imbalance measure, which is appealing especially for multivariate data. It is an overall imbalance measure that standardizes and aggregates the imbalance information of each covariate. A smaller value of the Mahalanobis distance indicates better covariate balance, and a low Mahalanobis distance guarantees low imbalance levels in all covariates. Other desirable properties, such as the reduction in the variance of the estimated treatment effect, can be found in Morgan and Rubin (2012).
Under the assumption of independent covariates, the Mahalanobis distance can be expressed as
By the balance criterion under RR, we have
where is the square root of , and is the dimensional identity matrix.
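A minimal sketch of the Mahalanobis imbalance measure follows. The scaling by the variance of the group-mean difference is one common convention assumed here; under a covariate-blind randomization, the measure is then approximately chi-squared with degrees of freedom equal to the number of covariates.

```python
import numpy as np

def mahalanobis_imbalance(X, T):
    # Mahalanobis distance between the covariate sample means of the groups
    # T == 1 and T == 0, scaled by the variance of the mean difference so the
    # measure is approximately chi-squared(p) under covariate-blind assignment.
    g1, g2 = X[T == 1], X[T == 0]
    diff = g1.mean(axis=0) - g2.mean(axis=0)
    S = np.cov(X, rowvar=False)
    return diff @ np.linalg.solve(S, diff) / (1.0 / len(g1) + 1.0 / len(g2))

# Under complete randomization with p = 3 covariates, M averages about 3.
rng = np.random.default_rng(2)
n, p, reps = 200, 3, 500
ms = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))
    T = np.zeros(n)
    T[rng.permutation(n)[: n // 2]] = 1.0    # equal group sizes
    ms[r] = mahalanobis_imbalance(X, T)
m_mean = float(ms.mean())
```

The chi-squared reference distribution is what makes a threshold on $M$ interpretable as an acceptance probability for rerandomization.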
Theorem 4.1.
Under rerandomization, we have

Under , then
where is the last dimensions of .

Under , where for a fixed ,
Furthermore, the asymptotic variance of is
where is defined in Morgan and Rubin (2012) as
and is the incomplete gamma function .
The asymptotic distribution of the test statistic under RR is no longer normal; it is more concentrated around 0 than the standard normal distribution, which indicates that the traditional testing procedure is conservative. The extent of conservativeness is driven by a quantity that is a monotonically increasing function of the threshold $a$. By selecting a relatively smaller threshold $a$, the covariates are more balanced due to the stricter balance criterion, resulting in a lower asymptotic variance of the test statistic. However, a smaller $a$ means that on average it takes more attempts to meet the balance criterion. More discussion on the choice of $a$ can be found in Morgan and Rubin (2012).
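The RR procedure can be sketched as follows, assuming equal group sizes and an illustrative threshold $a$; the helper `mahalanobis_imbalance` and the `max_tries` cap are choices made for this example, not part of the original proposal.

```python
import numpy as np

def mahalanobis_imbalance(X, T):
    # Mahalanobis distance between the covariate means of groups T == 1 and T == 0.
    diff = X[T == 1].mean(axis=0) - X[T == 0].mean(axis=0)
    S = np.cov(X, rowvar=False)
    n1, n0 = (T == 1).sum(), (T == 0).sum()
    return diff @ np.linalg.solve(S, diff) / (1.0 / n1 + 1.0 / n0)

def rerandomize(X, a, rng, max_tries=100_000):
    # RR: repeat a balanced complete randomization until the imbalance M <= a.
    n = len(X)
    base = np.zeros(n)
    base[: n // 2] = 1.0                      # equal group sizes
    for _ in range(max_tries):
        T = rng.permutation(base)
        if mahalanobis_imbalance(X, T) <= a:
            return T
    raise RuntimeError("no acceptable randomization within max_tries")

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 2))
a = 1.0            # illustrative threshold; smaller a enforces stricter balance
T = rerandomize(X, a, rng)
```

A smaller threshold lowers the acceptance probability of each candidate randomization, which is the computational trade-off discussed above.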
4.3 Pairwise Sequential Randomization
Although RR can significantly reduce covariate imbalance, it does not scale to a large number of covariates or a large number of units, a setting that is almost ubiquitous in the era of big data. Pairwise sequential randomization (PSR), recently proposed by Qin et al. (2017), solves this problem by sequentially and adaptively assigning units to treatment groups, and is proven to have superior performance in terms of covariate balance and the variance of the estimated treatment effect. PSR involves the following steps:

(1) Collect covariate data.

(2) Choose the covariate imbalance measure, denoted as $M$.

(3) Randomly arrange all $n$ units in a sequence.

(4) Separately assign the first two units: the first to treatment 1 and the second to treatment 2.

(5) Suppose that $2k$ units have been assigned to treatment groups ($1 \le k < n/2$); for the $(2k+1)$th and $(2k+2)$th units:

(5a) If the $(2k+1)$th unit is assigned to treatment 1 and the $(2k+2)$th unit to treatment 2, calculate the “potential” imbalance measure, $M^{(1)}$, between the updated treatment groups with $2k+2$ units.

(5b) Similarly, if the $(2k+1)$th unit is assigned to treatment 2 and the $(2k+2)$th unit to treatment 1, calculate the “potential” imbalance measure, $M^{(2)}$, between the updated treatment groups with $2k+2$ units.

(6) Assign the $(2k+1)$th and $(2k+2)$th units so that the option with the smaller potential imbalance is chosen with probability $p > 0.5$ and the other option with probability $1 - p$; in either case one unit joins each group, maintaining equal proportions.

(7) Repeat Steps (5) and (6) until all units are assigned.
Similar to RR, PSR chooses the Mahalanobis distance as the covariate imbalance measure in Step (2) because of its affine invariance and the other desirable properties explained in the previous section. The value of $p$ is set to 0.75; for further discussion of this choice, see Hu and Hu (2012). In this algorithm, the number of units $n$ is assumed to be even. If $n$ is odd, the last unit is randomly assigned to either treatment 1 or 2 with probability 1/2.
Note that the units are not necessarily observed sequentially; however, Qin et al. (2017) propose to allocate them sequentially (in pairs) to minimize the occurrence of covariate imbalance. The sequence in which the units are allocated is not unique; there are many possible sequences, but their performances are similar, especially when the sample size is large.
Comparing PSR with RR, we see that both methods use covariate information: PSR uses it to decide the unit allocation in each iteration, while RR uses it to decide whether a randomly generated allocation is satisfactory. Note that neither PSR nor RR is restricted to the Mahalanobis distance; both methods can easily be adapted to different measures of imbalance. However, the Mahalanobis distance does lead to desirable properties of the subsequent analysis. For example, PSR results in the minimum asymptotic variance of the estimated treatment effect.
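The PSR steps above can be sketched as follows; the function names, the use of the full-sample covariance in the Mahalanobis comparison, and the closing Monte Carlo comparison with CR are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def psr(X, rng, p_best=0.75):
    # Pairwise sequential randomization: take units in a random order in
    # pairs; the pair assignment yielding the smaller Mahalanobis imbalance
    # is chosen with probability p_best (0.75 in the text).
    n = X.shape[0]
    assert n % 2 == 0               # odd n: assign the last unit by a fair coin
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    order = list(rng.permutation(n))
    g1, g2 = [order[0]], [order[1]]  # first pair: one unit to each arm

    def imbalance(grp1, grp2):
        diff = X[grp1].mean(axis=0) - X[grp2].mean(axis=0)
        return diff @ Sinv @ diff

    for k in range(1, n // 2):
        i, j = order[2 * k], order[2 * k + 1]
        m_ij = imbalance(g1 + [i], g2 + [j])  # i -> treatment 1, j -> treatment 2
        m_ji = imbalance(g1 + [j], g2 + [i])  # the reversed assignment
        if m_ij < m_ji:
            choose_ij = rng.random() < p_best
        else:
            choose_ij = rng.random() >= p_best
        if choose_ij:
            g1.append(i); g2.append(j)
        else:
            g1.append(j); g2.append(i)
    T = np.zeros(n)
    T[g1] = 1.0
    return T

# PSR should yield far smaller Mahalanobis imbalance than complete randomization.
def mdist(X, T):
    diff = X[T == 1].mean(axis=0) - X[T == 0].mean(axis=0)
    return len(T) / 4.0 * diff @ np.linalg.solve(np.cov(X, rowvar=False), diff)

rng = np.random.default_rng(4)
n, p = 100, 2
psr_m, cr_m = [], []
for _ in range(100):
    X = rng.standard_normal((n, p))
    psr_m.append(mdist(X, psr(X, rng)))
    T_cr = np.zeros(n)
    T_cr[rng.permutation(n)[: n // 2]] = 1.0
    cr_m.append(mdist(X, T_cr))
```

Averaged over replications, the PSR imbalance is far below the CR imbalance, consistent with the balancing advantage described above.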
Under PSR, it is shown in Qin et al. (2017) that
(5) 
Then we have the following theorem.
Theorem 4.2.
Under PSR, we have

Under , then

Under , where for a fixed ,
The variance contributed by the covariates is completely eliminated from the numerator of the asymptotic distribution of the test statistic, resulting in a distribution more concentrated around 0 than the standard normal distribution. In fact, the conclusion under PSR can be extended to a large class of covariate-adjusted randomizations if Assumption 2 is replaced by (5). This can be considered a natural extension of the conditions proposed in Ma et al. (2015) that lead to conservative tests for covariate-adaptive designs balancing discrete covariates. Note that condition (5) is quite strong and is not necessary for a conservative test. For example, the condition is not satisfied under RR, yet the test is still conservative, as shown in Section 4.2.
4.4 Atkinson’s Biased Coin Design
Atkinson’s biased coin design (BCD) was proposed to balance allocations across covariates in order to minimize the variance of the estimated treatment effect when a classical linear model between response and covariates is assumed (Atkinson, 1982; Smith, 1984a, b). Unlike RR and PSR, it is used in settings where covariate information is collected sequentially, such as in clinical trials. More discussion of the method and its properties can be found in Atkinson (2002), Antognini and Zagoraiou (2011), and Atkinson (2014).
BCD sequentially assigns units to treatment groups with an adaptive allocation probability: supposing that the first $i-1$ units have been assigned, BCD assigns the $i$th unit to treatment 1 with a probability, derived from an optimality criterion, that depends on the covariates and assignments of the previous units.
Under BCD, by applying result (10.5) of Smith (1984b), the asymptotic covariance of the scaled imbalance vector of covariates can be derived, and it is a constant fraction of the corresponding covariance under complete randomization. Thus, under BCD the variance of the imbalance vector of covariates is reduced relative to complete randomization, indicating that the covariates are more balanced. The next theorem states the asymptotic distributions of the test statistic under BCD.
Theorem 4.3.
Under BCD, we have

Under , then

Under , where for a fixed ,
This theorem shows that the test statistic has an asymptotic variance smaller than 1, so the test is conservative, with a reduced type I error for testing the treatment effect.
Based on the above four CAR procedures, we find the following. (i) Under CR (complete randomization), the distribution of the test statistic is still asymptotically standard normal, so it provides the correct type I error. This is because CR does not use covariate information at the assignment stage. (ii) Under the other three procedures (RR, BCD, and PSR), the asymptotic distributions of the test statistic are not standard normal, so their type I errors, based on standard normal quantiles, are no longer correct. Based on these asymptotic distributions, we can compare their type I errors as well as their powers. In the next section, we discuss the correction of the type I error of CAR procedures, after which we may compare their adjusted powers. The general theorems in Section 3 may also be applied to other CAR procedures. Numerical comparisons of these randomization methods are given in Section 6.
5 Correction for Conservativeness
As shown above, most covariate-adjusted randomizations lead to conservative type I errors for testing the treatment effect, because traditional tests use the standard normal distribution as the null distribution. Therefore, we propose the following approach to correct the conservative type I errors and obtain higher power.
Since we have derived the asymptotic distribution of the test statistic in Theorem 3.3, we can obtain the correct asymptotic critical values. However, since the asymptotic distribution depends on unknown parameters, we need to estimate them from the observed sample to obtain the approximate null distribution and adjust the corresponding critical values and p-values.
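Once draws from the estimated asymptotic null distribution are available, the corrected critical value and p-value are simply an empirical quantile and a tail frequency. A sketch follows, using a hypothetical null law that is more concentrated than the standard normal (the concentration factor 0.8 is an assumption made purely for illustration):

```python
import numpy as np

def corrected_cutoff_and_pvalue(t_obs, null_draws, alpha=0.05):
    # Two-sided critical value and p-value from Monte Carlo draws of the
    # (estimated) asymptotic null distribution of the test statistic.
    draws = np.abs(np.asarray(null_draws))
    cutoff = float(np.quantile(draws, 1.0 - alpha))
    p_value = float(np.mean(draws >= abs(t_obs)))
    return cutoff, p_value

# Illustration with a hypothetical null law more concentrated than N(0, 1),
# the situation that arises after covariate-balancing randomization.
rng = np.random.default_rng(5)
null_draws = 0.8 * rng.standard_normal(200_000)
cutoff, p_value = corrected_cutoff_and_pvalue(1.8, null_draws)
# the corrected cutoff sits below the N(0, 1) cutoff 1.96, recovering power
```

Because the corrected cutoff is smaller than the standard normal cutoff, the corrected test rejects more often under alternatives, which is the source of the power gains reported below.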
After adjusting the critical values and p-values, we are able to obtain more powerful hypothesis tests. The more conservative the traditional tests are, the more their corrected versions gain in power. Finally, we compare the covariate-adjusted randomization procedures mentioned above in terms of covariate balance, conservativeness of the traditional tests, and power of the corrected tests, and summarize their advantages and disadvantages in Table 1. The conclusions in the table are further verified through simulation in the next section.
Table 1: Comparison of randomization procedures.

Randomization | Covariate balance | Type I error of traditional test | Power of corrected test
CR | least balanced | valid | least powerful
RR | moderately balanced | moderately conservative | moderately powerful
BCD | moderately balanced | moderately conservative | moderately powerful
PSR | most balanced | most conservative | most powerful
6 Numerical Studies
In this section, we perform simulation studies to verify the theoretical results and demonstrate their effectiveness in obtaining high power in hypothesis testing. We consider various randomization procedures, including CR, RR, PSR, and BCD.
6.1 Verification of Theoretical Results
We first verify the theoretical asymptotic distribution of the test statistic under CR, RR, PSR, and BCD. We assume an underlying model with four covariates that are independent of each other, and a random error independent of all covariates. We simulate data according to the underlying model and use a working model that includes only two of the four covariates to obtain the test statistic. In Figure 1, we plot the simulated distributions of the test statistic along with the theoretical distributions given by Theorem 3.3. As the figure shows, the theoretical distributions are very close to the simulated distributions for all randomization procedures, which verifies our theoretical results. For comparison, we plot the standard normal distribution in bold gray. As we move from the left panel to the right panel (i.e., from CR to PSR), the distribution of the test statistic becomes narrower. Therefore, using the critical values or p-values obtained from the standard normal distribution will result in conservative tests with reduced type I errors for RR, PSR, and BCD. The correction of such conservativeness is further illustrated in the following sections.
6.2 Conservative Hypothesis Testing for Treatment Effect
From the previous sections, we understand that the traditional test for the treatment effect under most covariate-adjusted randomization procedures is conservative. In this section, we verify this phenomenon. Suppose the underlying model is
(6) 
where the covariates are independent of each other and the random error is independent of all covariates. We use the following four working models to test the treatment effect.
W1: .
W2: .
W3: .
W4: .
Note that the first working model is equivalent to the two-sample t-test, and the last working model is the same as the underlying model. We simulate data according to (6) and obtain the type I errors of the traditional tests. The results are shown in Table 2. As we can see, under CR, all working models provide correct type I errors. However, under RR, BCD, and PSR, W1, W2, and W3 generate conservative type I errors below 5%, with PSR being the most conservative. This shows that covariate-adjusted randomization leads to conservative traditional tests for the treatment effect: the more balanced the covariates a randomization procedure provides, the more conservative the tests become. We further simulate data under the alternative hypothesis and present the powers of the traditional tests in Table 3. Since the type I errors are conservative, the powers under RR, BCD, and PSR are also reduced.
Since we know the true data-generating process, we can obtain the true critical values for each scenario to ensure that the type I errors are exactly 5%. Using the true critical values, we obtain the true powers of the tests in the same setting, presented in Table 4. These powers are much higher than those reported in Table 3. The powers under PSR are the highest among all randomization procedures because it balances the covariates best. These results are consistent with Table 1.
From the simulations above, we see that PSR, BCD, and RR generate more balanced covariates than CR and are therefore able to provide more powerful tests, subject to the availability of the correct critical values or p-values. Note that, in practice, we may not know the true data-generating process and need to estimate the critical values and p-values.
Table 2: Type I errors of the traditional tests under different randomization procedures and working models.

Randomization   W1      W2      W3      W4
CR              0.0529  0.0512  0.0538  0.0513
RR              0.0114  0.0166  0.0259  0.0502
BCD             0.0071  0.0118  0.0249  0.0532
PSR             0.0018  0.0058  0.0178  0.0519
Table 3: Powers of the traditional tests under different randomization procedures and working models.

Randomization   W1      W2      W3      W4
CR              0.1791  0.2149  0.2733  0.3872
RR              0.1260  0.1711  0.2477  0.3841
BCD             0.1116  0.1550  0.2443  0.3861
PSR             0.0867  0.1400  0.2352  0.3864
Table 4: Powers of the tests using the true critical values under different randomization procedures and working models.

Randomization   W1      W2      W3      W4
CR              0.1801  0.2106  0.2644  0.3825
RR              0.2691  0.2971  0.3344  0.3908
BCD             0.3130  0.3360  0.3695  0.3931
PSR             0.3684  0.3770  0.3812  0.3900
6.3 Corrected Hypothesis Testing for Treatment Effect
In practice, to correct the type I errors and obtain higher power, we can estimate the critical values and p-values based on the estimated asymptotic distribution (Section 5). Using this approach, we repeat the same simulation as in the previous section and present the type I errors in Table 5 and the powers in Figure 2.
As we can see in Table 5, all type I errors are successfully controlled at 5%, which means the proposed approach works well. In Figure 2, we can see that as the treatment effect increases away from 0, the powers generally increase. However, under CR, different working models provide different powers: the more covariates included in the working model, the higher the power. This is because CR cannot balance the covariates well, and the covariates not included in the working model affect the test for the treatment effect. On the contrary, under PSR, all working models provide similar powers because PSR balances all covariates well. RR and BCD can also balance the covariates, though not as well as PSR does, so their powers are slightly better than those under CR but much worse than those under PSR.
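Why the correction matters can be seen with a few lines of arithmetic. In the sketch below, all numbers (sample size, error variance, coefficients, and the factor by which a balancing procedure shrinks the imbalance variance) are illustrative assumptions; in practice one would plug in the estimates from Section 5. Under W1, the naive standard error reflects the total outcome variance, while the actual variance of the estimated treatment effect is much smaller under a balancing procedure.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# illustrative assumptions, not the paper's settings
n = 100                          # total sample size, equal allocation
sigma2 = 1.0                     # error variance
betas = [1.0, 1.0, 1.0]          # covariate effects
var_imb_cr = 4.0 / n             # variance of a covariate mean difference (CR)
var_imb_bal = 0.08 * var_imb_cr  # assumed shrinkage under a balancing design

b2 = sum(b * b for b in betas)
naive_var = 4.0 / n * sigma2 + b2 * var_imb_cr   # what the W1 t-test estimates
true_var = 4.0 / n * sigma2 + b2 * var_imb_bal   # actual variance under balance

# actual size of the naive nominal-5% test under the balancing design
actual_size = 2.0 * (1.0 - normal_cdf(1.96 * math.sqrt(naive_var / true_var)))
# corrected critical value for the difference in means, restoring the 5% level
corrected_crit = 1.96 * math.sqrt(true_var)
print(actual_size, corrected_crit)
```

With these assumed numbers the naive test's actual size is far below 5%, and the corrected critical value is considerably smaller than the naive one, which is exactly the source of the power gains in Figure 2.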
Table 5: Type I errors of the corrected tests under different randomization procedures and working models.

Randomization   W1      W2      W3      W4
CR              0.0477  0.0495  0.0459  0.0451
RR              0.0514  0.0498  0.0515  0.0510
BCD             0.0508  0.0518  0.0525  0.0511
PSR             0.0597  0.0584  0.0504  0.0477
6.4 Hypothesis Testing for Covariate Effect
Lastly, we compare the performance of the traditional test for the third covariate effect, i.e., $H_0: \beta_3 = 0$ versus $H_1: \beta_3 \neq 0$. We adopt the same setting as in the previous section and choose a range of values from 0 to 1 for $\beta_3$ to calculate the power under different working models. Note that in this case, only W3 and W4 contain the third covariate. The type I errors are shown in Table 6 and the powers in Figure 3. As we can see, the type I errors are all controlled at 5%, which is consistent with our theoretical results in Theorem 3.5. In other words, no correction is needed for testing the covariate effect. On the other hand, Figure 3 shows that, if the working model does not include all covariates, the power is reduced. This is again consistent with the results in Theorem 3.5. It is worthwhile to note that the performance of the hypothesis test for the covariate effect does not depend on the choice of randomization procedure, as all panels of Figure 3 are almost identical.
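The robustness of the covariate-effect test can also be checked with a short simulation. As before, the numerical settings and the simplified rerandomization step are assumptions made for illustration, not the paper's exact design. Under the null $\beta_3 = 0$, we test the coefficient of the third covariate after partialling out the treatment indicator; the rejection rate stays near 5% even though the assignment was balanced on all covariates.

```python
import math
import random

random.seed(1)
n, reps = 100, 1500  # assumed settings

def slope_t(y, x):
    """t statistic for the slope in a simple linear regression."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((v - mx) ** 2 for v in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return b / math.sqrt(rss / (len(x) - 2) / sxx)

def demean_by_group(v, t):
    """Partial out the treatment indicator by centering within groups."""
    n1 = sum(t)
    m1 = sum(vi for vi, ti in zip(v, t) if ti == 1) / n1
    m0 = sum(vi for vi, ti in zip(v, t) if ti == 0) / (len(t) - n1)
    return [vi - (m1 if ti == 1 else m0) for vi, ti in zip(v, t)]

base = [1] * (n // 2) + [0] * (n // 2)
rejections = 0
for _ in range(reps):
    x = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    while True:  # crude rerandomization balancing all three covariates
        t = base[:]
        random.shuffle(t)
        diffs = [abs(sum((2 * t[i] - 1) * x[i][j] for i in range(n)))
                 / (n // 2) for j in range(3)]
        if max(diffs) < 0.1:
            break
    # null: the third covariate does not enter the outcome
    y = [0.5 * t[i] + x[i][0] + x[i][1] + random.gauss(0, 1)
         for i in range(n)]
    yr = demean_by_group(y, t)
    xr = demean_by_group([xi[2] for xi in x], t)
    if abs(slope_t(yr, xr)) > 1.96:
        rejections += 1
rate = rejections / reps
print(rate)
```

Unlike the treatment-effect test, the balancing step does not distort this test, consistent with Theorem 3.5: the covariate being tested is independent of the error, and partialling out the treatment removes the part of the design the randomization manipulated.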
Table 6: Type I errors of the traditional tests for the covariate effect under different randomization procedures.

Randomization   W3      W4
CR              0.0527  0.0532
RR              0.0520  0.0509
BCD             0.0485  0.0428
PSR             0.0512  0.0496
7 Conclusion
In this article, we study the impact of covariate-adjusted randomization on inference properties. The theoretical properties of post-randomization inference are established in Section 3 for general covariate-adjusted randomization. These results provide a theoretical foundation to analyze experiments based on covariate-adjusted randomization and allow different covariate-adjusted randomization procedures to be compared theoretically. We then apply the theoretical properties to several popular covariate-adjusted randomization methods in the literature. Finite-sample properties are also studied via extensive simulations.
As shown in Theorem 3.3, the inference properties under covariate-adjusted randomization are closely related to how well the covariates are balanced, which is measured asymptotically by the distribution of the covariate imbalance vector. If this asymptotic distribution is known, the inference properties can be derived according to Theorem 3.3. However, its exact form may be unknown for some randomization procedures; in that case, we can evaluate the imbalance vector of covariates numerically to estimate its asymptotic distribution, for example, using the bootstrap (Shao et al., 2010). Therefore, different covariate-adjusted randomizations can be compared in terms of both balancing and inference properties.
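Besides the bootstrap, one simple numerical route is to hold the observed covariates fixed and repeatedly redraw treatment assignments from the randomization procedure, recording the imbalance vector each time. The sketch below is a generic illustration (the function names and settings are assumptions), shown for complete randomization, where the variance of each covariate mean difference should be near $4/n$:

```python
import random

def imbalance_draws(x, assign, ndraws=4000):
    """Redraw assignments on fixed covariates; return imbalance vectors
    (differences in covariate means between the two groups)."""
    n, p = len(x), len(x[0])
    out = []
    for _ in range(ndraws):
        t = assign(n)
        n1 = sum(t)
        out.append([sum(x[i][j] for i in range(n) if t[i] == 1) / n1
                    - sum(x[i][j] for i in range(n) if t[i] == 0) / (n - n1)
                    for j in range(p)])
    return out

def marginal_variances(draws):
    """Sample variance of each coordinate of the imbalance vector."""
    p, m = len(draws[0]), len(draws)
    means = [sum(d[j] for d in draws) / m for j in range(p)]
    return [sum((d[j] - means[j]) ** 2 for d in draws) / (m - 1)
            for j in range(p)]

def complete_randomization(n):
    t = [1] * (n // 2) + [0] * (n // 2)
    random.shuffle(t)
    return t

random.seed(2)
x = [[random.gauss(0, 1) for _ in range(2)] for _ in range(200)]
variances = marginal_variances(imbalance_draws(x, complete_randomization))
print(variances)  # each entry should be near 4/n = 0.02
```

Swapping in any other assignment rule (rerandomization, BCD, PSR) for `complete_randomization` yields a numerical estimate of that procedure's imbalance distribution, which can then be compared across procedures.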
As shown in Section 4.3, the estimated treatment effect under PSR achieves the minimum asymptotic variance, and hence the subsequent hypothesis tests are the most powerful. Since PSR is proposed in the scenario where all covariate data are available before treatment allocation, this property is also desirable in the setting of sequential allocation, such as in clinical trials (Qin et al., 2017). For categorical covariates, stratification, minimization, and other procedures that satisfy certain conditions have this property (Ma et al., 2015), but randomization procedures with the same property for continuous covariates are still needed.
A linear model framework is assumed for both the underlying model and the working model in our proposed framework. However, other types of outcomes are common in practice. Feinstein and Landis (1976) and Green and Byar (1978) studied the properties of unadjusted tests for binary responses, in which case the type I error is decreased under stratified randomization compared with unstratified randomization. More recently, inference properties have been studied based on generalized linear models or survival analysis for certain randomization procedures, such as stratification and minimization (Shao and Yu, 2013; Luo et al., 2016; Xu et al., 2016), but the properties for non-continuous outcomes under general covariate-adjusted methods remain unknown. It is desirable to extend the framework and results obtained in this article to nonlinear models.
The proposed framework and results can be generalized in several directions. First, the current work assumes equal allocation to each treatment; however, other target proportions may be preferred in some cases. Second, the framework in this article is based on experiments with two treatment groups and can be extended to multiple treatments (Tymofyeyev et al., 2007). Last, other outcome types, such as time-to-event or categorical responses, can be studied with modifications to the proposed framework. These topics are left for future research.
8 Appendix: Proof of Main Theorems
To prove the main theorems, we first prove the following lemma.
Lemma 8.1.
Proof of Lemma 8.1.
First, it is easy to see that for any ,
Assumption 2 implies that converges to the th dimension of in distribution, and thus it is bounded in probability. So we have
Also, it follows from the weak law of large numbers that
Therefore, for any ,
Similarly, it can also be shown that,
∎
Proof of Theorem 3.1.
The OLS estimate can be written as
Note it is assumed that the covariates are independent of each other and for any , then by Assumption 1 and the weak law of large numbers,
and, together with Lemma 8.1,
In addition, the independence of and implies that
Therefore, we have
that is,
To prove the second part of the theorem, since we have shown , it suffices to show that
is bounded in probability. Note that Assumption 2 implies
for any . Also, using the independence of and , we have
Similar arguments give
which, together with the central limit theorem, yields
∎