An Optimal Method for Covariate Balancing and Its Properties
By Yichen Qin, Yang Li, and Feifang Hu. Yichen Qin is Assistant Professor, Department of Operations, Business Analytics and Information Systems, University of Cincinnati, Cincinnati, OH 45221. Yang Li is Associate Professor, School of Statistics and Center for Applied Statistics, Renmin University of China, Beijing, P.R. China, 100872. Feifang Hu is Professor, Department of Statistics, George Washington University, Washington, DC 20052. The research was partially supported by NSF Awards (DMS-1442192 and DMS-1612970) and the National Natural Science Foundation of China (No. 11371366 and No. 71301162).
SUMMARY: This article introduces a new randomization procedure to improve the covariate balance across treatment groups. Covariate balance is one of the most important concerns for successful comparative studies, such as causal inference and clinical trials, because it reduces bias and improves the accuracy of inference. However, chance imbalance may still exist in traditional randomized experiments, in which units are randomly allocated without using their covariate information. To address this issue, the proposed method allocates the units sequentially and adaptively, using information on the current level of imbalance and the incoming unit’s covariate. With a large number of covariates or a large number of units, the proposed method shows substantial advantages over the traditional methods in terms of the covariate balance and computational time, making it an ideal technique in the era of big data. Furthermore, the proposed method attains the optimal covariate balance, in the sense that the estimated average treatment effect under the proposed method attains its minimum variance asymptotically. Numerical studies and real data analysis provide further evidence of the advantages of the proposed method.
KEYWORDS: Causal inference, covariate-adaptive randomization, clinical trial, Mahalanobis distance, minimum asymptotic variance, treatment effect.
Randomization is an essential tool for the accurate evaluation of the treatment effect, because it mitigates selection bias and provides a basis for statistical inference. However, traditional randomization methods, such as complete randomization (CR), often generate unsatisfactory randomization configurations with unbalanced prognostic (or baseline) covariates; this problem has been recognized and extensively discussed ever since Fisher (1926) noted:
“Most experimenters on carrying out a random assignment of plots will be shocked to find out how far from equally the plots distribute themselves.”
Covariate balance is critical for successful statistical inference in comparative studies. Therefore, these chance imbalances in the covariates obtained from traditional randomization methods can significantly undermine the validity of subsequent analysis. According to Hu et al. (2014) and McEntegart (2003), the advantages of balanced covariates are at least threefold. First, covariate balance improves the accuracy and efficiency of statistical inference across treatment groups, removes the bias in estimation, and increases the true power of hypothesis testing. Second, it increases the interpretability of the estimated treatment effect by making the units in the treatment groups more comparable, thereby enhancing the credibility of the analysis. Third, it makes the analysis more robust against model misspecification, because less modeling is needed for balanced treatment groups. In addition, outliers are more likely to cancel each other out, because they tend to be more evenly distributed across treatment groups.
In the absence of covariate balance, various problems must be addressed before a valid conclusion can be drawn. For instance, in causal inference, if a significant imbalance exists, it is very difficult to make a direct comparison across treatment groups. In this case, any inferences regarding the treatment effect will be biased, and any claims about the treatment effect will need to rely on unverifiable assumptions (Morgan, 2011). In practice, researchers must assess the balance in the covariate distribution even before applying any methods for estimating the causal effect. Although some ex-post adjustments, such as regression (Freedman, 2008) and subsample selection using matching or trimming based on propensity scores (Imbens and Rubin, 2015), can be performed to cope with such an imbalance, they are much less efficient than achieving an ex-ante balance from the start (Bruhn and McKenzie, 2008). In addition, when applying these adjustments, researchers often need to have at least a nearly correct model, which can be difficult to test (Cochran, 1965; Cochran and Rubin, 1973). Rubin (2008) explained why the greatest possible efforts should be made during the design phase of an experiment rather than during the analysis stage, at which point the researcher has the potential to bias the results (consciously or unconsciously) (Morgan, 2011). This issue of covariate imbalance has also led to discussions regarding whether randomization is preferable to a purposefully designed balanced allocation (Gosset, 1938; Greenburg, 1951; Arnold, 1986; Senn, 2004).
Similarly, in clinical trials, if unbalanced covariates are strongly correlated with the outcomes, their presence may make it difficult to interpret the results of statistical tests with regard to the treatment effect. The credibility of the study also comes into question. Although regression adjustments are available for researchers to address the imbalance issue, regression relies on assumptions that are often difficult to verify, such as, linear versus nonlinear models or homoscedasticity versus heteroscedasticity. In some cases, regression estimates are not even unbiased for the treatment effect (Imbens and Rubin, 2015).
More recently, covariate balance has attracted growing interest in the field of crowdsourced-internet experimentation (Horton et al., 2011; Chandler and Kapelner, 2013; Kapelner and Krieger, 2014). Researchers increasingly recruit workers from online labor markets into their experiments, such as by asking them to label tumor cells in images or classify articles. Because of the nature of the recruiting process, a large number of workers with many covariates (e.g., 2500 workers in Chandler and Kapelner (2013)) typically are enrolled in such studies, which consequently poses challenges for traditional randomization methods. Therefore, ensuring balanced randomization is a pressing task for any statistical inference.
Furthermore, the phenomenon of covariate imbalance is exacerbated as the number of covariates and the sample size increase, which is nearly ubiquitous in the era of big data. For example, suppose that the probability of one particular covariate being unbalanced is $p$. For a study with 10 independent covariates, the chance of at least one covariate exhibiting imbalance is $1-(1-p)^{10}$; even a modest $p = 0.05$ yields a probability of roughly 40%. To address the covariate imbalance issue in the framework of causal inference, Morgan and Rubin (2012) have proposed rerandomization (RR), for which the procedure can be summarized as follows:
(1) Collect covariate data.
(2) Specify a balance criterion to determine when a randomization is acceptable. For example, the criterion could be defined as a threshold $a$ on the Mahalanobis distance between the sample means across different treatment groups, $M \le a$, where $M = \frac{n}{4}(\bar{x}_1 - \bar{x}_2)^T \Sigma^{-1} (\bar{x}_1 - \bar{x}_2)$, and $\bar{x}_1$ and $\bar{x}_2$ are the sample means for the two treatment groups.
(3) Randomize the units into treatment groups using traditional randomization methods, such as complete randomization (CR).
(4) Check the balance criterion $M \le a$. If the criterion is satisfied, go to Step (5); otherwise, return to Step (3).
(5) Perform the experiment using the final randomization obtained in Step (4).
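The rerandomization loop above can be sketched in a few lines. The following is an illustrative implementation only (the function names and the fixed threshold are our own, not from the original study):

```python
import numpy as np

def mahalanobis_balance(X, assign):
    """Mahalanobis distance M between the covariate means of the two groups.

    X: (n, k) covariate matrix; assign: length-n array of +1 / -1 labels
    with equal group sizes.
    """
    n = len(assign)
    diff = X[assign == 1].mean(axis=0) - X[assign == -1].mean(axis=0)
    cov = np.cov(X, rowvar=False)          # sample covariance from all units
    return (n / 4.0) * diff @ np.linalg.solve(cov, diff)

def rerandomize(X, threshold, rng, max_tries=100_000):
    """Steps (3)-(4): redraw complete randomizations until M <= threshold."""
    n = X.shape[0]
    for _ in range(max_tries):
        assign = rng.permutation(np.repeat([1, -1], n // 2))
        if mahalanobis_balance(X, assign) <= threshold:
            return assign
    raise RuntimeError("no acceptable randomization found")
```

Because $M$ is approximately chi-square with $k$ degrees of freedom under complete randomization, the expected number of redraws is $1/P(\chi^2_k \le a)$, which is the source of the scaling problem discussed next.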
They have further demonstrated various desirable properties of the causal inference performed under rerandomization, such as the reduction in variance of the estimated treatment effect. Although rerandomization works well in the case of a few covariates, it is incapable of scaling up to address massive amounts of data. For example, as the number of covariates increases, the probability of acceptance, $p_a = P(M \le a)$, of each randomization in Step (4) decreases drastically, causing the rerandomization procedure to remain in Steps (3) and (4) for a long time. In addition, as the sample size increases, the computational time required for each iteration of Step (4) grows linearly. Therefore, rerandomization becomes infeasible in these cases.
In this article, we propose a covariate-adaptive randomization method based on the Mahalanobis distance (CAM) to generate a more balanced treatment allocation and thus to improve the quality of the subsequent causal inference. Unlike rerandomization or complete randomization, in which all units are allocated independently, we allocate units adaptively and sequentially. We first initialize our algorithm by randomly allocating a few units. For the remaining units, we allocate one randomly chosen pair of units at a time. For each pair of units, using their covariate information and the existing level of imbalance of the previously allocated units, we adjust the probability with which the pair is allocated to treatment groups to avoid incidental covariate imbalance. In this way, we are able to produce a much more balanced allocation of units. As demonstrated through the numerical studies, for cases with a large number of covariates or a large number of units, the proposed method exhibits superior performance, with a more balanced randomization and much less computational time. In addition, the proposed method is proven to be optimal, in the sense that the estimated treatment effect under the proposed method achieves the minimum asymptotic variance.
The proposed method follows a spirit similar to that of the minimization methods used in clinical trials (Taves, 1974; Pocock and Simon, 1975; Hu and Hu, 2012). However, the context of these minimization methods differs from ours. In their settings, patients are observed and allocated as they enter a clinical trial. In our case, we initially possess complete information about all units, but we choose to allocate them sequentially and adaptively. In addition, in minimization methods, patients are typically allocated individually (one at a time), whereas in the proposed framework, we randomly choose pairs of units to allocate, which is infeasible in the minimization setting. Another significant difference is that the minimization methods are suitable only for discrete covariates, minimizing the margin and stratum imbalance. The proposed method, in contrast, is suitable for allocating units with both discrete and continuous covariates.
This article is organized as follows. We introduce the proposed covariate-adaptive randomization method via the Mahalanobis distance (CAM) in Section 2 and investigate its theoretical properties in Section 3. We demonstrate its advantages in the treatment effect estimation and present its corresponding theoretical properties in Section 4. We further present an example using real data to demonstrate the superior performance of our method in Section 5. Finally, we conclude with a discussion in Section 6 and relegate proof outlines to Section 7.
2 Covariate-Adaptive Randomization via Mahalanobis Distance
Suppose that $n$ units (samples) are to be assigned to two treatment groups and that $n$ is even. Let $T_i$ be the assignment of the $i$-th unit, i.e., $T_i = 1$ for treatment 1 and $T_i = -1$ for treatment 2. Consider $k$ continuous covariates for each unit. Let $x_i = (x_{i1}, \dots, x_{ik})^T$ represent the covariates of the $i$-th unit, where $i = 1, \dots, n$. The proposed procedure, covariate-adaptive randomization via Mahalanobis distance (CAM), involves the following steps:
(1) Choose the covariate balance criterion. We define the Mahalanobis distance, $M$, between the covariate sample means for the two treatment groups as
$$M = \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)^T \Sigma^{-1} (\bar{x}_1 - \bar{x}_2),$$
where $n_1$ and $n_2$ are the two group sizes, and we fix $n_1 = n_2 = n/2$ to ensure that we have equal sample sizes across the treatment groups, so that $M = \frac{n}{4}(\bar{x}_1 - \bar{x}_2)^T \Sigma^{-1} (\bar{x}_1 - \bar{x}_2)$. $\bar{x}_1$ and $\bar{x}_2$ are the covariate sample means corresponding to these treatment groups under the proposed method, and $\Sigma$ is the covariance matrix of the covariates. In practice, the covariance matrix is usually replaced with the sample covariance matrix, which can be computed using all units (before the randomization starts). This Mahalanobis distance functions as a measure of the covariate balance throughout this article. A smaller value of $M$ indicates a better covariate balance.
(2) Randomly arrange all units in a sequence $x_1, x_2, \dots, x_n$.
(3) Separately assign the first two units to treatment 1 and treatment 2.
(4) Suppose that $2i$ units have been assigned to treatment groups ($1 \le i \le n/2 - 1$).
(5) For the $(2i+1)$-th and $(2i+2)$-th units:
(5a) If the $(2i+1)$-th unit is assigned to treatment 1 and the $(2i+2)$-th unit is assigned to treatment 2 (i.e., $T_{2i+1} = 1$ and $T_{2i+2} = -1$), then we can calculate the “potential” Mahalanobis distance, $M_1$, between the updated treatment groups with $2i+2$ units.
(5b) Similarly, if the $(2i+1)$-th unit is assigned to treatment 2 and the $(2i+2)$-th unit is assigned to treatment 1 (i.e., $T_{2i+1} = -1$ and $T_{2i+2} = 1$), then we can calculate the “potential” Mahalanobis distance, $M_2$, between the updated treatment groups with $2i+2$ units.
(6) Assign the $(2i+1)$-th and $(2i+2)$-th units to treatment groups according to the following probabilities:
$$P(T_{2i+1} = 1, T_{2i+2} = -1) = \begin{cases} q, & \text{if } M_1 < M_2, \\ 0.5, & \text{if } M_1 = M_2, \\ 1 - q, & \text{if } M_1 > M_2, \end{cases}$$
and $P(T_{2i+1} = -1, T_{2i+2} = 1) = 1 - P(T_{2i+1} = 1, T_{2i+2} = -1)$.
(7) Repeat Steps (4) through (6) until all units are assigned.
The value of $q$ is set to 0.75 throughout this article. Different values of $q$ ($0.5 < q < 1$) will not affect the theoretical results presented in this article. For a further discussion of $q$, please see Hu and Hu (2012). In this algorithm, $n$ is assumed to be even. If $n$ is odd, then the last ($n$-th) unit is randomly assigned to either treatment 1 or 2 with a probability of 0.5.
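The steps above can be sketched as follows. This is a minimal illustration in Python (names are ours), with the inverse covariance matrix estimated once from all units before allocation begins:

```python
import numpy as np

def cam_allocate(X, q=0.75, rng=None):
    """Covariate-adaptive randomization via Mahalanobis distance (CAM), a sketch.

    X: (n, k) covariates with n even. Returns a length-n array of +1 / -1.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, _ = X.shape
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))  # Step (1): balance criterion
    order = rng.permutation(n)                        # Step (2): random sequence
    assign = np.zeros(n, dtype=int)

    def balance(m):
        """Mahalanobis distance of the first m allocated units in the sequence."""
        idx = order[:m]
        d = X[idx[assign[idx] == 1]].mean(0) - X[idx[assign[idx] == -1]].mean(0)
        return (m / 4.0) * d @ cov_inv @ d

    assign[order[0]], assign[order[1]] = 1, -1        # Step (3): seed both groups
    for i in range(2, n, 2):                          # Steps (4)-(7): pair by pair
        a, b = order[i], order[i + 1]
        assign[a], assign[b] = 1, -1
        m1 = balance(i + 2)                           # potential imbalance, option 1
        assign[a], assign[b] = -1, 1
        m2 = balance(i + 2)                           # potential imbalance, option 2
        p1 = q if m1 < m2 else (0.5 if m1 == m2 else 1 - q)
        if rng.random() < p1:                         # biased coin favors less imbalance
            assign[a], assign[b] = 1, -1
    return assign
```

Note that the biased coin keeps the allocation random: even the less balanced option is chosen with probability $1-q$, which preserves a basis for randomization-based inference.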
Note that the units are not necessarily observed sequentially; however, we allocate them sequentially (in pairs) to minimize the occurrence of covariate imbalance. The sequence in which the units are allocated is not unique. Rather, there are up to $n!$ different possible sequences, but their performances are similar, especially when $n$ is large (see Figure 1 for details).
3 Theoretical Properties of Covariate-Adaptive Randomization via Mahalanobis Distance
We study the asymptotic properties of the Mahalanobis distance, $M$, obtained using the proposed method. $M$ represents the level of imbalance of the covariates, such that at smaller values of $M$, the covariates are more balanced.
Theorem 3.1. Under the proposed method, suppose that the covariates $x_i$, $i = 1, \dots, n$, are independent and identically distributed as a multivariate normal distribution with zero mean; then we have $E(M) = O(n^{-1})$, i.e., $M$ converges to zero at the rate of $1/n$.
Note that the Mahalanobis distance that is obtained through complete randomization (CR), $M_{CR}$, has a stationary distribution of a Chi-square distribution with $k$ degrees of freedom (regardless of $n$), i.e., $M_{CR} \sim \chi^2_k$. Similarly, the Mahalanobis distance that is obtained through rerandomization (RR), $M_{RR}$, has a Chi-square distribution with $k$ degrees of freedom conditional on acceptance, i.e., $M_{RR} \sim \chi^2_k \mid \chi^2_k \le a$. Hence, as the sample size increases, the proposed method reveals a greater advantage over both rerandomization and complete randomization, because $M_{CAM}$ converges to 0 at the rate of $1/n$. That is, under the proposed method, as more units are included in the study, the quality (balance level) of the randomization becomes significantly better.
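The chi-square benchmark for complete randomization is easy to verify by simulation. The sketch below is our own (with arbitrary choices of $n$, $k$, and the replication count); it draws complete randomizations and computes $M$ each time:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, reps = 200, 5, 2000
ms = []
for _ in range(reps):
    X = rng.normal(size=(n, k))                       # covariates: standard MVN
    t = rng.permutation(np.repeat([1, -1], n // 2))   # complete randomization
    d = X[t == 1].mean(0) - X[t == -1].mean(0)
    cov = np.cov(X, rowvar=False)
    ms.append((n / 4.0) * d @ np.linalg.solve(cov, d))

# Under CR, M is approximately chi-square with k d.o.f., so its mean is near k.
mean_m = float(np.mean(ms))
```

The simulated mean of $M_{CR}$ stays near $k$ no matter how large $n$ is, in contrast to the $1/n$ decay of $M_{CAM}$.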
Moreover, under complete randomization, as the number of covariates $k$ increases, the stationary distribution of $M_{CR}$ becomes flatter, which implies poorer allocation in terms of covariate balance (i.e., larger $M_{CR}$). As a consequence, rerandomization has a lower probability of acceptance, $p_a = P(\chi^2_k \le a)$. Therefore, the advantage of the proposed method also becomes more significant as $k$ increases, because the $M_{CAM}$ obtained using the proposed method converges to 0 regardless of the magnitude of $k$. In addition, even for a fixed $n$, the effect of $k$ on $M$ under the proposed method is less severe than that under rerandomization or complete randomization (as illustrated later in Figure 1).
To demonstrate the advantage of the proposed method, we have conducted a simulation to compare the proposed method with rerandomization (with a fixed acceptance probability $p_a$) using multivariate normal covariates with zero mean and identity covariance matrix; the results are presented in Figure 1. For different sample sizes $n$ and different numbers of covariates $k$, we plot the histograms of $M_{CAM}$ and $M_{RR}$ obtained using the simulated data. As the figure shows, as $n$ increases with fixed $k$, the distribution of the Mahalanobis distance obtained through rerandomization remains unchanged, whereas the distribution of the $M_{CAM}$ obtained using the proposed method rapidly converges to 0. Moreover, as $k$ increases with fixed $n$, the distributions obtained through rerandomization and the proposed method both become wider, but the inflation of the distribution is much less severe for the proposed method (i.e., the overlap between the two distributions becomes smaller as $k$ increases).
In addition, we compared the proposed method with rerandomization in terms of computational time and feasibility. Note that the proposed method requires only one pass (i.e., it processes all units once), whereas rerandomization requires multiple iterations of complete randomization to achieve an acceptable balance level. Therefore, we compared the number of iterations required for rerandomization to achieve the same performance (same Mahalanobis distance) as the proposed method, along with the corresponding computational times. The results are shown in Figure 2. As seen in Figures 2a and 2b, when the sample size and number of covariates are small, the computational advantage of the proposed method is not obvious. As $n$ and $k$ increase, however, the proposed method gradually shows a significant advantage over rerandomization, because more iterations and more time are required for rerandomization to achieve the same level of performance as the proposed method. As the number of covariates continues to increase, rerandomization eventually becomes infeasible: it is nearly impossible for rerandomization to achieve the same performance as the proposed method. Note that the computational time of the proposed method grows only linearly with $n$ and remains essentially the same for different $k$, whereas the computational time of rerandomization grows exponentially as either $n$ or $k$ increases.
Finally, we verified the rate of convergence stated in Theorem 3.1 by plotting the expected Mahalanobis distance $E(M_{CAM})$ against the reciprocal of the sample size ($1/n$) using simulated data, as shown in Figure 3. It is clear that for different numbers of covariates $k$, the expected Mahalanobis distance converges to 0 at the same rate of $1/n$, as evidenced by the straight lines in the figure.
4 Causal Inference Under Covariate-Adaptive Randomization and Properties
4.1 Estimation of the Treatment Effect
We use a framework similar to that of Morgan and Rubin (2012). After allocating the units to treatment groups, we are interested in estimating the treatment effect from the outcomes $Y_i$, $i = 1, \dots, n$, obtained in the treatment groups. Suppose that $Y_i(1)$ and $Y_i(-1)$ represent the potential outcomes of the $i$-th unit under treatments 1 and 2 (Rubin's causal model, Rubin (1974)). The actual observed outcome can be expressed as
$$Y_i = Y_i(1)1\{T_i = 1\} + Y_i(-1)1\{T_i = -1\}.$$
The average treatment effect is
$$\tau = E[Y_i(1) - Y_i(-1)].$$
However, the fundamental problem in causal inference is that for each unit we observe only $Y_i(T_i)$, the potential outcome under the one treatment actually assigned. Therefore, $\tau$ cannot be calculated directly. If we want to estimate $\tau$ using the observed $Y_i$, a natural choice is
$$\hat{\tau} = \bar{Y}_1 - \bar{Y}_2, \qquad (1)$$
which is simply the difference between the sample means of the outcome variable, $\bar{Y}_1$ and $\bar{Y}_2$, for the different treatment groups.
One problem with $\hat{\tau}$ is that if there is an imbalance in the covariates, it will affect the accuracy of $\hat{\tau}$. For example, if we estimate the effect of a drug when the treatment 1 group contains mostly males and the treatment 2 group contains mostly females, then the estimated treatment effect will not be able to exclude the effect of gender. In other words, the difference in the covariates will make $\hat{\tau}$ less accurate.
To adjust for such an imbalance, we can use linear regression to estimate the treatment effect. That is, conditional on the treatment assignment $T_i$, each outcome variable is assumed to follow the model below:
$$Y_i = \mu_1 1\{T_i = 1\} + \mu_2 1\{T_i = -1\} + x_i^T \beta + \varepsilon_i, \qquad (2)$$
where $\mu_1$ and $\mu_2$ are the main effects of treatments 1 and 2, respectively, and $\tau = \mu_1 - \mu_2$ is the treatment effect. Furthermore, $\beta$ represents the covariate effect, and $\varepsilon_i$ is an independent and identically distributed random error with zero mean and constant variance $\sigma_\varepsilon^2$, and is independent of $x_i$ and $T_i$. All covariates $x_i$, $i = 1, \dots, n$, are independent and identically distributed.
Let us define
$$Y = (Y_1, \dots, Y_n)^T, \qquad Z_i = (1\{T_i = 1\}, 1\{T_i = -1\}, x_i^T)^T, \qquad Z = (Z_1, \dots, Z_n)^T,$$
and $\theta = (\mu_1, \mu_2, \beta^T)^T$. Then, we can obtain the ordinary least squares estimate of $\theta$:
$$\hat{\theta} = (\hat{\mu}_1, \hat{\mu}_2, \hat{\beta}^T)^T = (Z^T Z)^{-1} Z^T Y.$$
Let us consider $L = (1, -1, 0, \dots, 0)^T$, a $(k+2)$-dimensional vector. We define
$$\hat{\tau}_{\hat{\beta}} = L^T \hat{\theta} = \hat{\mu}_1 - \hat{\mu}_2,$$
which is another estimate of the treatment effect that is adjusted for the imbalance in the covariates. Note that if $Z_i$ does not include any covariates, i.e., $Z_i = (1\{T_i = 1\}, 1\{T_i = -1\})^T$, then the regression model is $Y_i = \mu_1 1\{T_i = 1\} + \mu_2 1\{T_i = -1\} + \varepsilon_i$, and $\hat{\tau}_{\hat{\beta}}$ becomes $\hat{\tau}$ in Equation (1), which is the estimated treatment effect without adjusting for the imbalance in the covariates.
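Both estimators can be computed compactly. The sketch below is illustrative (our own names), with the two group means reparameterized as an intercept plus a contrast on $T_i/2$, so that $\tau$ appears as a single regression coefficient:

```python
import numpy as np

def tau_hat(Y, assign):
    """Unadjusted estimate: difference between the two group outcome means."""
    return Y[assign == 1].mean() - Y[assign == -1].mean()

def tau_hat_adj(Y, assign, X):
    """Regression-adjusted estimate: tau is the coefficient on T/2 after
    rewriting mu1, mu2 as (mu1 + mu2)/2 + tau * T/2."""
    n = len(Y)
    D = np.column_stack([np.ones(n), assign / 2.0, X])
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return coef[1]
```

On data simulated from a linear model with a known $\tau$, both estimators recover $\tau$, but the adjusted estimator has smaller variance under complete randomization whenever $\beta \ne 0$.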
In the next section, we study the properties of $\hat{\tau}$ and $\hat{\tau}_{\hat{\beta}}$ under our proposed method (i.e., $\hat{\tau}_{CAM}$ and $\hat{\tau}_{\hat{\beta},CAM}$), under complete randomization (i.e., $\hat{\tau}_{CR}$ and $\hat{\tau}_{\hat{\beta},CR}$), and under rerandomization (i.e., $\hat{\tau}_{RR}$ and $\hat{\tau}_{\hat{\beta},RR}$).
4.2 Theoretical Properties of the Estimated Treatment Effect
Morgan and Rubin (2012) proved that under complete randomization and rerandomization, $\hat{\tau}$ and $\hat{\tau}_{\hat{\beta}}$ are unbiased. For the proposed method, we can similarly show unbiasedness:
Theorem 4.1 (Unbiasedness). Under the proposed method, we have $E(\hat{\tau}_{CAM}) = E(\hat{\tau}_{\hat{\beta},CAM}) = \tau$.
In addition to unbiasedness, we can also show the following:
Theorem 4.2. Under the proposed method, suppose that the covariates $x_i$, $i = 1, \dots, n$, are independent and identically distributed as a multivariate normal distribution with zero mean; then we have
$$n \, \mathrm{Var}(\bar{x}_{1,j} - \bar{x}_{2,j}) \to 0 \quad \text{as } n \to \infty, \qquad j = 1, \dots, k,$$
where $\bar{x}_{1,j}$ and $\bar{x}_{2,j}$ are the $j$-th elements of $\bar{x}_1$ and $\bar{x}_2$.
In randomized experiments, the emphasis typically is placed on the percent reduction in variance (PRIV), which was originally defined by Morgan and Rubin (2012). This quantity represents the percentage by which the randomization method reduces the variance of the differences in the means calculated for the different treatment groups. Therefore, a higher value of the PRIV indicates that the means are closer to each other. Consider the PRIV for the $j$-th covariate,
$$\mathrm{PRIV}_j = 100 \times \left(1 - \frac{\mathrm{Var}(\bar{x}_{1,j} - \bar{x}_{2,j})}{\mathrm{Var}_{CR}(\bar{x}_{1,j} - \bar{x}_{2,j})}\right)\%,$$
where $\bar{x}_{1,j}$ and $\bar{x}_{2,j}$ are the $j$-th elements of $\bar{x}_1$ and $\bar{x}_2$. According to Theorem 4.2, the PRIV of each covariate converges to 100% under the proposed method. We recall that for rerandomization, the PRIV of each covariate is $100(1 - v_a)\%$, which is a constant and independent of the sample size. This is because $v_a$ is a function of $a$ and $k$ only. In contrast, for the proposed method, $\mathrm{PRIV}_j \to 100\%$ as $n \to \infty$.
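The constant PRIV of rerandomization can be checked with a short Monte Carlo sketch (our own construction; a single covariate, so $k = 1$, and an arbitrary threshold $a$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, a, reps = 100, 0.5, 2000          # sample size, RR threshold, accepted draws
x = rng.normal(size=n)               # one fixed covariate

def mean_diff(t):
    return x[t == 1].mean() - x[t == -1].mean()

cr_diffs, rr_diffs = [], []
while len(rr_diffs) < reps:
    t = rng.permutation(np.repeat([1, -1], n // 2))
    d = mean_diff(t)
    cr_diffs.append(d)                                # every draw is a CR draw
    if (n / 4.0) * d * d / x.var(ddof=1) <= a:        # M <= a: RR accepts
        rr_diffs.append(d)

# Estimated PRIV of the covariate under RR, i.e., roughly 100 * (1 - v_a).
priv = 100.0 * (1.0 - np.var(rr_diffs) / np.var(cr_diffs))
```

Repeating this with a larger $n$ leaves the estimated PRIV essentially unchanged, which illustrates why rerandomization's PRIV is constant in $n$ while the proposed method's approaches 100%.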
Theorem 4.2 implies that the quality of the “matching” is much improved using the proposed method. As the sample size increases, the imbalance in the covariates reaches the minimum level. This is particularly useful when the covariates and outcome are correlated, because in this case, the proposed method will in turn improve the precision of the estimation of the treatment effect, as detailed in the following theorem.
Theorem 4.3. Under the proposed randomization method, suppose that the outcome variable and the covariates are normally distributed and that the treatment effect is additive; then, the percent reduction in variance (PRIV) of $\hat{\tau}_{CAM}$ converges to $100R^2\%$ as $n \to \infty$, where $R^2$ ($0 \le R^2 \le 1$) is the squared multiple correlation between $Y$ and $x$ within the treatment groups.
Notably, Theorem 4.3 does not assume a linear regression for the outcome model.
Let us compare the PRIV of $\hat{\tau}_{CAM}$ under the proposed method with the PRIV of $\hat{\tau}_{RR}$ under rerandomization. Recall that the PRIV of $\hat{\tau}_{RR}$ is $100(1 - v_a)R^2\%$ (Morgan and Rubin, 2012), which is again a constant and does not depend on the sample size. In contrast, the PRIV of $\hat{\tau}_{CAM}$ converges to $100R^2\%$ as the sample size $n \to \infty$. In fact, the PRIV of $\hat{\tau}$ is simply the PRIV of the covariates scaled by $R^2$. To illustrate the advantage of the proposed method, we plot the PRIVs of $\hat{\tau}_{CAM}$ and of $\hat{\tau}_{RR}$ (with a fixed acceptance probability $p_a$) in Figure 4. Note that the parameter values in both figures are chosen only for illustrative purposes (as in Morgan and Rubin (2012)). It is evident that as $n$ increases, at each value of $R^2$, the PRIV of $\hat{\tau}_{CAM}$ increases toward $100R^2\%$, whereas the PRIV of $\hat{\tau}_{RR}$ at a given $R^2$ does not vary with $n$. The advantage of the proposed method over rerandomization is clear, especially for large $n$ and large $R^2$.
Meanwhile, the percent reduction in variance due to the adjustment via linear regression under complete randomization also converges to $100R^2\%$ as $n \to \infty$ (Cox, 1982). Therefore, we conclude that the proposed method reduces the asymptotic variance to this minimum level.
In addition, if we further assume that the outcome variable truly follows a linear regression model, we can show that $\hat{\tau}_{CAM}$ achieves the optimal precision even without adjusting for the imbalance in the covariates using linear regression. That is,
Theorem 4.4 (Optimal precision).
Suppose that the outcome variable truly follows the linear regression model in Equation (2) and that we estimate the treatment effect under the proposed method and under complete randomization; then, we have
$$\lim_{n \to \infty} n \, \mathrm{Var}(\hat{\tau}_{CAM}) = \lim_{n \to \infty} n \, \mathrm{Var}(\hat{\tau}_{\hat{\beta},CAM}) = \lim_{n \to \infty} n \, \mathrm{Var}(\hat{\tau}_{\hat{\beta},CR}) = 4\sigma_\varepsilon^2.$$
This theorem first implies that under the proposed method, the precision of the estimate of the treatment effect obtained using a simple sample mean difference, $\hat{\tau}_{CAM}$, is the same as the precision of the estimate obtained through a linear regression which adjusts for the covariate imbalance, $\hat{\tau}_{\hat{\beta},CAM}$. This suggests that the regression adjustment is not necessary under the proposed method, because the covariates already would have been balanced sufficiently well, and the linear regression does not provide any additional benefit. Furthermore, the theorem also implies that the precision of $\hat{\tau}_{CAM}$ is the same as the precision of the estimated treatment effect obtained from a linear regression under complete randomization, $\hat{\tau}_{\hat{\beta},CR}$, which is considered optimal. In other words, the proposed method can balance the covariates so well that, asymptotically, the simple sample mean difference is just as good as the linear-regression-adjusted estimate, and the adjustment provided by linear regression is generally not needed.
Moreover, under complete randomization, if the outcome variable follows the linear regression model, then the asymptotic variance of the estimated treatment effect, $\hat{\tau}_{\hat{\beta},CR}$, reaches its minimum level. Therefore, we conclude that the estimated treatment effect under the proposed method, $\hat{\tau}_{CAM}$, also attains optimal precision. We also note that the Mahalanobis distance that we have selected as the covariate imbalance criterion is, in this sense, the best criterion to use in a randomization procedure for linear regression models. Although $\hat{\tau}_{CAM}$ and $\hat{\tau}_{\hat{\beta},CR}$ have the same precision, it is worth noting that to calculate $\hat{\tau}_{\hat{\beta}}$, it is necessary to estimate all regression coefficients $\beta$, whereas $\hat{\tau}$ is simply the sample mean difference and does not require the estimation of any additional coefficients. This is especially advantageous when the number of covariates is large, as in the case of high-dimensional data.
In Table 1, we summarize the relationships between the asymptotic variances of the different estimates presented in Theorem 4.4. Note that “Asym. Var.” in the table stands for the asymptotic variance of each estimate blown up by a factor of $n$.
| Randomization method | Working model without covariates ($\hat{\tau}$) | Working model with covariates ($\hat{\tau}_{\hat{\beta}}$) |
| Complete randomization | Asym. Var. $= 4(\sigma_\varepsilon^2 + \beta^T \Sigma \beta)$ | Asym. Var. $= 4\sigma_\varepsilon^2$ |
| CAM | Asym. Var. $= 4\sigma_\varepsilon^2$ | Asym. Var. $= 4\sigma_\varepsilon^2$ |
In addition, we performed a simple simulation to verify Table 1 and present the results in Table 2. In the simulation, we used four continuous covariates ($k = 4$) and simulated the outcome according to the linear model in Equation (2).
| Randomization method | Working model for estimating $\tau$ |
4.3 Computational Time
The previous section clearly demonstrates the advantages of the proposed method with regard to causal inference. A natural question is whether we can also let the threshold $a \to 0$ in the rerandomization method to improve its performance to match that of the proposed method (because rerandomization allows researchers to increase the power of the analysis at the expense of computational time (Morgan and Rubin, 2012)). However, this option is infeasible in many cases, as illustrated below.
Suppose that the time required to allocate one additional unit using the proposed method is $t_1$, where $t_1$ depends on the number of covariates $k$, and suppose that the time required to allocate one additional unit through complete randomization is $t_0$. Then, we have the following theorem.
Theorem 4.5. To achieve the same level of performance, the ratio of the average computational time of the proposed method to the average computational time of the rerandomization method is proportional to $F_k(a_n)$, where $F_k$ is the cumulative distribution function of a Chi-square distribution with $k$ degrees of freedom, and $a_n$ is the root of an equation involving a constant $c$ and the incomplete gamma function $\gamma(\cdot, \cdot)$.
To illustrate the advantage of the proposed method, we plot the ratio of the computational times between the proposed method and rerandomization as a function of $n$ and $k$ in Figure 5. We fix the constants $t_1$, $t_0$, and $c$ for illustrative purposes. In Figure 5, we can see that as either $n$ or $k$ increases, the ratio of the computational times rapidly approaches 0. For large $n$ and $k$, the ratio is practically zero, which means that it is (almost) impossible for rerandomization to achieve the same level of performance as the proposed method. This demonstration illustrates the advantages of the proposed method in the case of a large sample size or high-dimensional data.
Furthermore, we report the ratios of the computational times for several scenarios as quantitative values in Table 3, with the same fixed constants. As we can see, for a small sample size and low-dimensional covariates, the computational times of the proposed method and rerandomization are comparable. However, as either $n$ or $k$ increases, the ratio approaches 0 very fast.
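The acceptance probability that drives these ratios, $p_a = P(\chi^2_k \le a)$, can be computed without external libraries via the power series for the lower incomplete gamma function. The helper below is our own sketch (not the paper's code); it also gives the expected number of complete randomizations that rerandomization must draw per accepted allocation:

```python
from math import exp, gamma

def chi2_cdf(x, k, terms=200):
    """CDF of the chi-square distribution with k d.o.f., via the power series
    of the regularized lower incomplete gamma function P(k/2, x/2)."""
    if x <= 0:
        return 0.0
    s, term = 0.0, 1.0 / gamma(k / 2 + 1)
    for j in range(terms):
        s += term
        term *= (x / 2) / (k / 2 + 1 + j)
    return (x / 2) ** (k / 2) * exp(-x / 2) * s

def expected_draws(a, k):
    """Expected number of complete randomizations before RR accepts (1 / p_a)."""
    return 1.0 / chi2_cdf(a, k)
```

For a fixed threshold $a$, `expected_draws` grows very rapidly with $k$, which is the blow-up behind the vanishing time ratios in Table 3.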
5 Real Data Example
In this section, we illustrate the advantages of our proposed method using a real data set obtained in a clinical study of a Ceragem massage (CGM) thermal therapy bed, a medical device for the treatment of lumbar disc disease. The study was conducted by the medical device company that produces the CGM thermal therapy bed. In total, 186 patients were enrolled in the study. For each patient, there are 50 covariates, i.e., $k = 50$. Among the covariates, there are 30 numerical covariates, such as age and several baseline measurements of the patient's current conditions, including lower back pain, leg pain, leg numbness, body examination scores, and magnitudes of pain in other body parts (shoulders, neck, chest, hip and so on), all measured on 0-10 scales. In the original study, these patients were randomly assigned to the treatment group (to which the thermal therapy was administered using the medical device) or the control group. At the end of the study, the outcome variable $Y$, representing the numerical measurement of lower back pain after the treatment, was recorded to study the treatment effect of the CGM thermal therapy bed.
In the original study, because the patients were allocated completely at random, the Mahalanobis distance of the patient allocation was 57.67, which is relatively high and indicates a moderate covariate imbalance across the treatment groups.
Using the same patients' covariates, we reassigned these patients to treatment and control groups using the proposed method, complete randomization, and rerandomization. We repeated the group allocation for these 186 patients many times, and the Mahalanobis distances obtained using all methods are plotted in Figure 6. We also replicated the data four times to obtain a sample size of $n = 744$, and then allocated these 744 patients many times using all the methods. We report the corresponding Mahalanobis distances in the same figure.
As seen from the figure, the Mahalanobis distances obtained by applying the proposed method to the original data ($n = 186$) are consistently lower than those obtained through complete randomization. Hypothetically, if we had $n = 744$ patients, the Mahalanobis distance could be decreased further toward 0 (i.e., the green histogram). In that case, few of the allocations obtained through complete randomization could achieve the same level of balance as the proposed method. Note that regardless of the sample size ($n = 186$ or $n = 744$), the Mahalanobis distance obtained through complete randomization always follows the Chi-square distribution $\chi^2_{50}$. If we set the criterion for rerandomization to $M \le a$, then rerandomization will produce the Mahalanobis distances to the left of the vertical line at $a$, which are still not comparable with the results of the proposed method. We could further reduce the threshold $a$; however, a lower threshold results in a lower acceptance probability. Moreover, please note that the proposed method requires only a single run.
Under each patient allocation scheme, we further simulated the outcome variable using a linear regression model fitted to the original data. To closely mimic the original data, we fitted the linear regression to the original data with the original patient allocation, and stored the residuals and the coefficient estimates of the fitted model. For each simulated patient allocation, we then simulated the outcome variable from the fitted model, with the error term uniformly sampled from the stored residuals.
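This outcome-simulation step can be sketched as follows, assuming a working model of the form Y = intercept + treatment effect + linear covariate terms + error; the function name, variable names, and synthetic demo data are ours.

```python
import numpy as np

def fit_and_simulate(X, T_orig, Y_orig, T_new, rng=None):
    """Fit Y ~ intercept + treatment + covariates on the original data,
    then simulate outcomes for a new allocation T_new by combining the
    fitted coefficients with residuals resampled uniformly (with
    replacement), as in the residual-bootstrap design described above."""
    rng = rng or np.random.default_rng()
    Z = np.column_stack([np.ones(len(Y_orig)), T_orig, X])
    beta, *_ = np.linalg.lstsq(Z, Y_orig, rcond=None)
    resid = Y_orig - Z @ beta
    Z_new = np.column_stack([np.ones(len(T_new)), T_new, X])
    eps = rng.choice(resid, size=len(T_new), replace=True)
    return Z_new @ beta + eps

# Toy illustration with synthetic data (3 covariates, true effect 0.5).
rng = np.random.default_rng(0)
X = rng.normal(size=(186, 3))
T_orig = rng.permutation([1] * 93 + [0] * 93)
Y_orig = 1.0 + 0.5 * T_orig + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=186)
T_new = rng.permutation(T_orig)
Y_new = fit_and_simulate(X, T_orig, Y_orig, T_new, rng=rng)
```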
Using the simulated outcome variable, we estimated the average treatment effect under the proposed method, rerandomization, and complete randomization. The resulting performance comparison is summarized in Table 4. As seen from this table, in terms of estimating the treatment effect, the proposed method exhibits the best performance compared with rerandomization and complete randomization: it yields the largest percent reduction in variance (PRIV) and the lowest variance. For rerandomization, a smaller threshold results in better performance; however, this comes at the cost of a longer computational time and a lower acceptance probability. Note that the R² of the regression fitted to the original data is only modest, and because of the finite sample size, the optimal PRIV cannot be achieved. If we increase the sample size to 744, the PRIV of the proposed method improves greatly and is very close to optimal, whereas that of rerandomization does not improve at all. The gain from the proposed method is thus quite substantial. This simulation clearly illustrates the potential of the proposed method in the context of big data, in which both the dimension and the sample size are relatively large.
[Table 4: performance comparison. Columns: Sample Size | Method | PRIV | MSE (or Var) | …]
In this article, we have introduced a new randomization procedure for balancing covariates to improve the accuracy of causal inference. Compared with traditional methods, the proposed method can cope with a large number of covariates, which is especially advantageous in the era of big data, in which high-dimensional data are frequently encountered. The proposed method also shows superior performance in terms of computational time. In addition, it achieves optimality under the linear regression framework, in the sense that, asymptotically, the proposed method balances the covariates so well that the imbalance adjustment provided by linear regression is not needed.
Although the proposed method differs from the minimization methods and other randomization methods used in clinical trials (Wei, 1978; Begg and Iglewicz, 1980; Atkinson, 1982; Smith, 1984a, b), it can be extended to such settings. Instead of selecting a pair of units, we can select only one unit to allocate. However, the behavior of the Mahalanobis distance in such a scenario becomes more complicated, because the proportion of units in the treatment group then becomes a random variable. We suspect that the allocation procedure should be slightly modified so that both the Mahalanobis distance and the proportion are controlled. In such a scenario, we anticipate that the rate of convergence of the Mahalanobis distance can be further improved. We leave this possibility as a topic for future investigation.
Many other potential directions for further research remain as well. For example, we have shown the optimality of the estimated treatment effect. An extension to hypothesis testing is also of interest (Ma et al., 2015). The optimality of the estimator hints at the most powerful test for the treatment effect. In addition, as the number of covariates increases, it is more efficient to balance only the most important covariates (Morgan and Rubin, 2015); therefore, the selection of the important covariates to balance in our proposed framework is an interesting topic. The proposed method may also be applied to balance important covariates in the field of crowdsourced-internet experimentation.
We provide outlines of the key proofs in the Appendix. The supplementary materials contain detailed proofs of all theorems.
Proof of Theorem 3.1.
We first convert the covariates to canonical form (Rubin and Thomas, 1992). Let and where is the Cholesky square root of . Suppose that is even. By the assumption of normality, , and
We further define
We can see that is a Markov process and . Define the test function . By denoting , we have
where is a positive constant. For the first term on the right, we have
where is the angle between and . Note that and are two positive constants. Since , there exist constants and such that when , . Therefore, for . Similarly, we have for . By the “drift conditions” (Meyn and Tweedie, 2009), we know that has a stationary distribution. Therefore, has a stationary distribution and . ∎
Proof of Theorem 4.1.
Since and are exchangeable, for any , . Since , we have . Suppose is even.
Proof of Theorem 4.2.
Please see supplementary materials. ∎
Proof of Theorem 4.3.
Please see supplementary materials. ∎
Proof of Theorem 4.4.
We first convert the covariates to canonical form (Rubin and Thomas, 1992). Let and where is the Cholesky square root of . Suppose is even. By the assumption of normality, .
and , , and .
Then the true model, equation (2), can be rewritten as
Suppose , then can be obtained by running the regression, , even though the true model is . In particular, we can write as
We know, as ,