Randomization-based Inference for Bernoulli-Trial Experiments and Implications for Observational Studies
We present a randomization-based inferential framework for experiments characterized by a strongly ignorable assignment mechanism where units have independent probabilities of receiving treatment. Previous works on randomization tests often assume these probabilities are equal within blocks of units. We consider the general case where they differ across units and show how to perform randomization tests and obtain point estimates and confidence intervals. Furthermore, we develop rejection-sampling and importance-sampling approaches for conducting randomization-based inference conditional on any statistic of interest, such as the number of treated units or forms of covariate balance. We establish that our randomization tests are valid tests, and through simulation we demonstrate how the rejection-sampling and importance-sampling approaches can yield powerful randomization tests and thus precise inference. Our work also has implications for observational studies, which commonly assume a strongly ignorable assignment mechanism. Most methodologies for observational studies make additional modeling or asymptotic assumptions, while our framework only assumes the strongly ignorable assignment mechanism, and thus can be considered a minimal-assumption approach.
1 Introduction
Randomization-based inference centers around the idea that the treatment assignment mechanism is the only stochastic element in a randomized experiment and thus acts as the basis for conducting statistical inference.^{1} In general, a central tenet of randomization-based inference is that the analysis of any given experiment should reflect its design: The inference for completely randomized experiments, blocked randomized experiments, and other designs should reflect the actual assignment mechanism that was used during the experiment. The idea that the assignment mechanism is the only stochastic element of an experiment is also commonly employed in the potential outcomes framework,^{2} which is now regularly used when estimating causal effects in randomized experiments and observational studies.^{3}, ^{4} While randomization-based inference focuses on estimating causal effects for only the finite sample at hand, it can flexibly incorporate any kind of assignment mechanism without model specifications. Rosenbaum^{5} provides a comprehensive review of randomization-based inference.
An essential step to estimating causal effects within the randomization-based inference framework as well as the potential outcomes framework is to state the probability distribution of the assignment mechanism. For simplicity, we focus on treatment-versus-control experiments, but our discussion can be extended to experiments with multiple treatments. Let the vector $\mathbf{W} = (W_1, \dots, W_N)$ denote the treatment assignment for $N$ units in an experiment or observational study. It is commonly assumed that the probability distribution of $\mathbf{W}$ can be written as a product of independent Bernoulli trials that may depend on background covariates:^{6}, ^{7}, ^{8}
$$P(\mathbf{W} = \mathbf{w} \mid \mathbf{X}) = \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i} \tag{1}$$
Here, $\mathbf{X}$ is a covariate matrix with rows $\mathbf{x}_i$, and $e(\mathbf{x}_i)$ denotes the probability that the $i$th unit receives treatment conditional on pretreatment covariates $\mathbf{x}_i$; i.e., $e(\mathbf{x}_i) = P(W_i = 1 \mid \mathbf{x}_i)$. The probabilities $e(\mathbf{x}_i)$ are commonly known as propensity scores.^{9} An assignment mechanism that can be written as (1) is known as an unconfounded, strongly ignorable assignment mechanism.^{8} The assumption of an unconfounded, strongly ignorable assignment mechanism is essential to propensity score analyses and other methodologies (e.g., regression-based methods) for analyzing observational studies.^{10}, ^{11}, ^{12}, ^{13}
In randomized experiments, the propensity scores are defined by the designer(s) of the experiment and are thus known; this knowledge is all that is needed to construct unbiased estimates for average treatment effects.^{8} The propensity score is not necessarily a function of all or any of the covariates: For example, in completely randomized experiments, the propensity score is the same for all units; and for blocked-randomized and paired experiments, the propensity scores are equal for all units within the same block or pair.
In observational studies, the propensity scores are not known, and instead must be estimated. The $e(\mathbf{x}_i)$ in (1) are often estimated using logistic regression, but any model that estimates conditional probabilities for a binary treatment can be used. These estimates, $\hat{e}(\mathbf{x}_i)$, are commonly employed to “reconstruct” a hypothetical experiment that yielded the observed data.^{8} For example, matching methodologies are used to obtain subsets of treatment and control that are balanced in terms of pretreatment covariates; then, these subsets of treatment and control are analyzed as if they came from a completely randomized experiment.^{14}, ^{8}, ^{12} Others have suggested regression-based adjustments combined with the propensity score^{15, 16} as well as Bayesian modeling.^{17, 4, 18} Notably, all of these methodologies implicitly assume the Bernoulli trial assignment mechanism shown in (1), but the subsequent analyses reflect a completely randomized, blocked-randomized, or paired assignment mechanism instead. One methodology commonly employed in observational studies that more closely reflects a Bernoulli trial assignment mechanism is inverse propensity score weighting;^{19}, ^{20}, ^{21}, ^{22} however, the variance of such estimators is unstable, especially when estimated propensity scores are particularly close to 0 or 1, which is an ongoing concern in the literature.^{23}, ^{24} Furthermore, the validity of such point estimates and uncertainty intervals relies on asymptotic arguments and an infinite-population interpretation.
More importantly, all of the above methodologies (matching, frequentist or Bayesian modeling, inverse propensity score weighting, or any combination of them) assume the strongly ignorable assignment mechanism shown in (1), but they also intrinsically make additional modeling or asymptotic assumptions. On the other hand, although randomization-based inference methodologies also make the common assumption of the strongly ignorable assignment mechanism, they do not require any additional model specifications or asymptotic arguments.
However, while there is a wide literature on randomization tests, most have focused on assignment mechanisms where the propensity scores are assumed to be the same across units (i.e., completely randomized experiments) or groups of units (i.e., blocked or paired experiments), instead of the more general case where they may differ across all units, as in (1). Imbens and Rubin^{25} briefly mention Bernoulli trial experiments, but only discuss inference for purely randomized and block randomized designs. Another example is Basu,^{26} who thoroughly discusses Fisherian randomization tests and briefly considers Bernoulli trial experiments, but does not provide a randomization-test framework for such experiments. This trend continues for observational studies: Most randomization tests for observational studies utilize permutations of the treatment indicator within covariate strata, and thus reflect a block-randomized assignment mechanism instead of the assumed Bernoulli trial assignment mechanism.^{27}, ^{28}, ^{6} While these tests are valid under certain assumptions, they are not immediately applicable to cases where covariates are not easily stratified (e.g., continuous covariates) or where there is not at least one treated unit and one control unit in each stratum.^{5} None of these randomization tests are applicable to cases where the propensity scores (known or unknown) differ across all units.
Most randomization tests that incorporate varying propensity scores focus on the biased-coin design popularized by Efron^{29}, where propensity scores are dependent on the order units enter the experiment and possibly pretreatment covariates as well. Wei^{30} and Soares and Wu^{31} developed extensions for this experimental design, while Smythe and Wei^{32}, Wei^{33}, and Mehta et al.^{34} developed significance tests for such designs. Good^{35} (Section 4.5) provides further discussion on this literature. The biased-coin design is related to covariate-adaptive randomization schemes in the clinical trial literature, starting with the work of Pocock and Simon.^{36} Covariate-adaptive randomization schemes sequentially randomize units such that the treatment and control groups are balanced in terms of pretreatment covariates,^{37, 38, 39} and recent works in the statistics literature have explored valid randomization tests for covariate-adaptive randomization schemes.^{40, 41} Importantly, the randomization test literature for biased-coin and covariate-adaptive designs differs from the randomization test presented here: All of these works focus on sequential designs, and thus depend on the sequential dependence among units inherent in the randomization scheme. In contrast, we assume that all units are simultaneously assigned to treatment according to the strongly ignorable assignment mechanism (1).
To the best of our knowledge, there is not an explicit randomization-based inference framework for analyzing Bernoulli trial experiments, let alone observational studies. Here we develop such a framework for randomized experiments characterized by Bernoulli trials, with the implication that this framework can be extended to the observational study literature as well. In particular, we develop rejection-sampling and importance-sampling approaches for conducting conditional randomization-based inference for Bernoulli trial experiments, which has not been previously discussed in the literature. These approaches allow one to conduct randomization tests conditional on statistics of interest for more precise inference.
In Section 2, we review randomization-based inference in general, including randomization tests and how these tests can be inverted to yield point estimates and confidence intervals. In Section 3, we develop a randomization-based inference framework for Bernoulli trial experiments, first reviewing the case where propensity scores are equal across units, and then extending this framework to the general case where propensity scores differ across units. Furthermore, we establish that randomization tests under this framework are valid tests, both unconditionally and conditional on statistics of interest. In Section 4, we demonstrate our framework with a simple example and provide simulation evidence for how our rejection-sampling and importance-sampling approaches can yield statistically powerful conditional randomization tests. In Section 5, we discuss extensions and implications of this work, particularly for observational studies.
2 Review of Randomization-Based Inference
Randomization-based inference focuses on randomization tests for treatment effects, which can be inverted to obtain both point estimates and confidence intervals. Randomization tests were first proposed by Fisher,^{1} and foundational theory for these tests was later developed by Pitman^{42} and Kempthorne.^{43} We follow the notation of Imbens and Rubin^{25} in our discussion of randomization tests for treatment-versus-control experiments.
2.1 Notation
Randomization tests utilize the potential outcomes framework, where the only stochastic element of an experiment is the treatment assignment. Let
$$W_i = \begin{cases} 1 & \text{if the } i\text{th unit receives treatment} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
denote the treatment assignment, and let $Y_i(W_i)$ denote the $i$th unit’s potential outcome, which only depends on the treatment assignment $W_i$. Only $Y_i(1)$ or $Y_i(0)$ is ultimately observed at the end of an experiment, never both. Let
$$y_i^{obs} = Y_i(1) W_i + Y_i(0) (1 - W_i) \tag{3}$$
denote the observed outcomes. Finally, let $\mathbb{W} = \{0, 1\}^N$ denote the set of all possible treatment assignments, and let $\mathbb{W}^+ \subseteq \mathbb{W}$ denote the subset of $\mathbb{W}$ with positive probability, i.e., $\mathbb{W}^+ = \{\mathbf{w} \in \mathbb{W} : P(\mathbf{W} = \mathbf{w}) > 0\}$.
Importantly, the probability distribution of treatment assignments, $P(\mathbf{W})$, fully characterizes the assignment mechanism: Because treatment assignment is the only stochastic element in a randomized experiment, the distribution $P(\mathbf{W})$ specifies the randomness in a randomized experiment. Consequently, inference within the randomization-based framework is determined by $P(\mathbf{W})$.
We first review how $P(\mathbf{W})$ is used to perform randomization tests. We then discuss how to invert these tests to obtain point estimates and confidence intervals for the average treatment effect.
2.2 Testing the Sharp Null Hypothesis via Randomization Tests
The most common use of randomization tests is to test the Sharp Null Hypothesis, which is
$$H_0: Y_i(1) = Y_i(0) \quad \text{for all } i = 1, \dots, N \tag{4}$$
i.e., the hypothesis that there is no treatment effect. Under the Sharp Null Hypothesis, the outcomes for any randomization from the set of all possible randomizations $\mathbb{W}$ are known: Regardless of a unit’s treatment assignment, its outcome will always be equal to the observed response $y_i^{obs}$ under the Sharp Null Hypothesis. This knowledge allows one to test the Sharp Null Hypothesis.
To test this hypothesis, one first chooses a suitable test statistic
$$t(\mathbf{W}, \mathbf{Y}(\mathbf{W})) \tag{5}$$
and determines whether the observed test statistic $t^{obs}$ is unlikely to occur according to the randomization distribution of the test statistic (5) under the Sharp Null Hypothesis. For example, one common choice of test statistic is the difference in mean response between treatment and control units, defined as
$$t(\mathbf{W}, \mathbf{Y}(\mathbf{W})) = \frac{\sum_{i=1}^{N} W_i Y_i(W_i)}{\sum_{i=1}^{N} W_i} - \frac{\sum_{i=1}^{N} (1 - W_i) Y_i(W_i)}{\sum_{i=1}^{N} (1 - W_i)} \tag{6}$$
Such a test statistic will be powerful in detecting a difference in means between the distributions of $\mathbf{Y}(1)$ and $\mathbf{Y}(0)$. In general, one should choose a test statistic according to possible differences in the distributions of $\mathbf{Y}(1)$ and $\mathbf{Y}(0)$ that one is most interested in. Please see Rosenbaum^{5} (Chapter 2) for a discussion on the choice of test statistics for randomization tests.
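After a test statistic is chosen, a randomization-test $p$-value can be computed by comparing the observed test statistic $t^{obs}$ to the set of $t(\mathbf{w}, \mathbf{Y}(\mathbf{w}))$ that are possible given the set of possible treatment assignments $\mathbb{W}^+$, assuming the Sharp Null Hypothesis is true. The two-sided randomization-test $p$-value is
$$p = \sum_{\mathbf{w} \in \mathbb{W}^+} \mathbb{I}\left( |t(\mathbf{w}, \mathbf{Y}(\mathbf{w}))| \geq |t^{obs}| \right) P(\mathbf{W} = \mathbf{w}) \tag{7}$$
where $\mathbb{I}(A) = 1$ if event $A$ occurs and zero otherwise. Importantly, the randomization-test $p$-value (7) depends on the set of possible treatment assignments $\mathbb{W}^+$, the probability distribution $P(\mathbf{W})$, and the choice of test statistic $t(\mathbf{W}, \mathbf{Y}(\mathbf{W}))$.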
Thus, testing the Sharp Null Hypothesis is a three-step procedure:
1. Specify the distribution $P(\mathbf{W})$ (and, consequently, the set of possible treatment assignments $\mathbb{W}^+$).
2. Choose a test statistic $t(\mathbf{W}, \mathbf{Y}(\mathbf{W}))$.
3. Compute or approximate the $p$-value (7).
All randomization tests discussed in this paper follow this three-step procedure, with the only difference among them being the choice of $P(\mathbf{W})$, i.e., the first step. The third step notes that exactly computing the randomization-test $p$-value is often computationally intensive because it requires enumerating all possible $\mathbf{w} \in \mathbb{W}^+$; instead, it can be approximated. A typical approximation is to generate a random sample $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$ from $P(\mathbf{W})$, and then approximate the $p$-value (7) by
$$p \approx \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}\left( |t(\mathbf{w}^{(m)}, \mathbf{Y}(\mathbf{w}^{(m)}))| \geq |t^{obs}| \right) \tag{8}$$
Importantly, the approximation (8) still depends on the probability distribution of the assignment mechanism, $P(\mathbf{W})$, because the random samples $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$ are generated using $P(\mathbf{W})$. This distinction will be important in our discussion of Bernoulli trial experiments, where the probability of receiving treatment (i.e., the propensity scores) may be equal or non-equal across units. In both cases, the set $\mathbb{W}^+$ is the same, but the probability distribution $P(\mathbf{W})$ is different.
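The three-step procedure with the Monte Carlo approximation (8) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`diff_in_means`, `monte_carlo_p_value`, `draw_permutation`) and the six-unit data set are hypothetical, and the example uses a completely randomized design so that the difference in means is always defined.

```python
import random
from typing import Callable, List, Sequence


def diff_in_means(w: Sequence[int], y: Sequence[float]) -> float:
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def monte_carlo_p_value(
    w_obs: Sequence[int],
    y_obs: Sequence[float],
    draw: Callable[[], List[int]],
    n_draws: int,
) -> float:
    """Approximate the two-sided p-value under the Sharp Null Hypothesis.

    Under the sharp null the outcomes do not change with the assignment,
    so every hypothetical assignment is paired with the same y_obs.
    """
    t_obs = abs(diff_in_means(w_obs, y_obs))
    hits = 0
    for _ in range(n_draws):
        w = draw()  # Step 1: the caller's draw() encodes P(W)
        # Small tolerance so that exact ties count as "at least as extreme".
        if abs(diff_in_means(w, y_obs)) >= t_obs - 1e-12:
            hits += 1
    return hits / n_draws


# Hypothetical data: six units, three treated, completely randomized design.
rng = random.Random(0)
w_obs = [1, 1, 1, 0, 0, 0]
y_obs = [10.0, 11.0, 12.0, 0.0, 1.0, 2.0]


def draw_permutation() -> List[int]:
    """One draw from P(W): a random permutation of the observed assignment."""
    return rng.sample(w_obs, len(w_obs))


p_hat = monte_carlo_p_value(w_obs, y_obs, draw_permutation, n_draws=20000)
```

Passing the sampler as an argument keeps the test itself unchanged when the assignment mechanism changes; only `draw` needs to be swapped. For these hypothetical data, only the observed assignment and its mirror image are as extreme as the observed statistic, so the exact value of (7) is 2/20 = 0.1 and `p_hat` should be close to it.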
Testing the Sharp Null Hypothesis will provide information about the presence of any treatment effect amongst all units in the study. Furthermore, this test can be inverted to obtain point estimates and confidence intervals for the treatment effect.
2.3 Randomization-based Point Estimates and Confidence Intervals for the Treatment Effect
A confidence interval can be constructed by inverting a variation of the Sharp Null Hypothesis that assumes an additive treatment effect. A randomization-based confidence interval for the average treatment effect is the set of $\tau$ such that one fails to reject the hypothesis
$$H_0^{\tau}: Y_i(1) = Y_i(0) + \tau \quad \text{for all } i = 1, \dots, N \tag{9}$$
The above hypothesis is a sharp hypothesis in the sense that, under $H_0^{\tau}$, every unit’s outcome for any treatment assignment is known: Under $H_0^{\tau}$, the missing potential outcome of any treated unit would be $y_i^{obs} - \tau$; likewise, the missing potential outcome of any control unit would be $y_i^{obs} + \tau$. Thus, for any hypothetical treatment assignment $\mathbf{w}$, one can calculate the corresponding potential outcomes $\mathbf{Y}(\mathbf{w})$ under $H_0^{\tau}$ in terms of the observed outcomes $\mathbf{y}^{obs}$ and observed treatment assignment $\mathbf{W}^{obs}$:
$$Y_i(w_i) = y_i^{obs} + \tau \left( w_i - W_i^{obs} \right) \tag{10}$$
Therefore, one can obtain a $p$-value for the hypothesis $H_0^{\tau}$ by drawing many hypothetical randomizations $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$ from $P(\mathbf{W})$, computing each $\mathbf{Y}(\mathbf{w}^{(m)})$ using (10), and then using (8) to approximate the $p$-value for any given test statistic $t(\mathbf{W}, \mathbf{Y}(\mathbf{W}))$.
To construct a 95% confidence interval, one considers many $\tau$ (e.g., via a line search), tests the hypothesis $H_0^{\tau}$ for each $\tau$, and defines the confidence interval as the set of $\tau$ with corresponding $p$-values above 0.05.^{5}, ^{25} Importantly, the confidence interval will depend on the probability distribution $P(\mathbf{W})$ through the draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$ used to compute each $p$-value; thus, the confidence interval will reflect a prespecified assignment mechanism. As we discuss in Section 3.3, this also allows one to flexibly construct confidence intervals that condition on particular statistics of interest.
Testing the hypothesis $H_0^{\tau}$ also yields a natural point estimate: Define the point estimate $\hat{\tau}$ as the $\tau$ such that the $p$-value for testing the hypothesis $H_0^{\tau}$ is maximized. For example, given a 95% confidence interval containing $\tau$ with corresponding $p$-values above 0.05, $\hat{\tau}$ is defined as the $\tau$ with the highest $p$-value. The interpretation of such a $\hat{\tau}$ is that this is the “most probable” $\tau$ under the assumption of an additive treatment effect. This point estimate is a variant of the Hodges-Lehmann randomization-based point estimate, which equates the test statistic under the hypothesis $H_0^{\tau}$ to its expectation under the randomization distribution.^{44}, ^{5}
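The inversion described above can be sketched in a few lines. The sketch below is illustrative only and rests on two assumptions that are not taken from the text: the two-sided comparison is centered at the hypothesized effect, i.e., it compares $|t - \tau|$ with $|t^{obs} - \tau|$ for the difference in means (a common convention, equivalent to testing the sharp null on outcomes with $\tau$ removed from treated units), and the grid of $\tau$ values, the data, and all function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def diff_in_means(w: Sequence[int], y: Sequence[float]) -> float:
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def p_value_additive(
    tau: float,
    w_obs: Sequence[int],
    y_obs: Sequence[float],
    draws: Sequence[Sequence[int]],
) -> float:
    """Approximate p-value for the additive-effect hypothesis with effect tau."""
    t_obs = diff_in_means(w_obs, y_obs)
    hits = 0
    for w in draws:
        # Impute outcomes under the hypothetical assignment w: a unit that
        # switches arm gains or loses tau, otherwise y_obs is kept.
        y_w = [y + tau * (wi - wo) for y, wi, wo in zip(y_obs, w, w_obs)]
        # Assumption: center the statistic at tau before comparing.
        if abs(diff_in_means(w, y_w) - tau) >= abs(t_obs - tau) - 1e-12:
            hits += 1
    return hits / len(draws)


def invert_test(
    w_obs: Sequence[int],
    y_obs: Sequence[float],
    draws: Sequence[Sequence[int]],
    taus: Sequence[float],
    alpha: float = 0.05,
) -> Tuple[Optional[Tuple[float, float]], float]:
    """Return (interval, tau_hat) from a grid search over taus."""
    p_values = [p_value_additive(tau, w_obs, y_obs, draws) for tau in taus]
    accepted = [tau for tau, p in zip(taus, p_values) if p > alpha]
    interval = (min(accepted), max(accepted)) if accepted else None
    tau_hat = taus[p_values.index(max(p_values))]  # first maximizer on the grid
    return interval, tau_hat


# Hypothetical data: ten units with a true additive effect of 2.
rng = random.Random(1)
w_obs = [1, 0] * 5
y_obs = [float(i + 1) + 2.0 * w for i, w in enumerate(w_obs)]
# The same draws are reused for every tau so that p-values are comparable.
draws: List[List[int]] = [rng.sample(w_obs, len(w_obs)) for _ in range(2000)]
taus = [0.5 * k for k in range(-20, 21)]
interval, tau_hat = invert_test(w_obs, y_obs, draws, taus)
```

Reusing one set of draws across the grid is a design choice that keeps the estimated $p$-value curve from jumping between neighbouring $\tau$ values purely because of Monte Carlo noise. With the centered comparison, the $p$-value equals one when $\tau$ equals the observed difference in means, which is 1.0 for these data.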
Some have criticized randomization-based confidence intervals constructed by inverting hypotheses such as (9) because this approach assumes a homogeneous treatment effect, which may be an inappropriate assumption. However, in general, confidence intervals can be constructed using any Sharp Null Hypothesis that fully specifies unit-level treatment effects, including sharp null hypotheses that specify heterogeneous treatment effects.^{45} Thus, while we focus on homogeneous treatment effects as assumed in (9), the randomization test framework that we present below can be extended to point estimates and confidence intervals that account for treatment effect heterogeneity to the extent that one can specify sharp null hypotheses that incorporate heterogeneous treatment effects.
3 Randomization-based Inference for Bernoulli Trial Experiments
Here we consider experimental designs that are characterized by Bernoulli trials and develop randomization tests for these designs. First, we review randomization tests for experimental designs where the probability of receiving treatment is the same for all units; this will motivate our development of randomization tests for experimental designs where the probability of receiving treatment differs across units, which is our main contribution. For both cases (first when the propensity scores are equal across units, and then when the propensity scores differ), we will discuss several assignment mechanisms $P(\mathbf{W})$ and sets of possible treatment assignments $\mathbb{W}^+$, which correspond to different randomization tests. Once $P(\mathbf{W})$ and $\mathbb{W}^+$ are specified, the Sharp Null Hypothesis can be tested by following the three-step procedure in Section 2.2; furthermore, these tests can be inverted to yield point estimates and confidence intervals, as discussed in Section 2.3. For each test, we will state an explicit form for $P(\mathbf{W} = \mathbf{w})$ for any $\mathbf{w} \in \mathbb{W}^+$ to compute the randomization test $p$-value (7) exactly, and we will also state how random samples $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$ can be generated to approximate this $p$-value using (8). In Section 3.3, we introduce rejection-sampling and importance-sampling approaches to perform randomization tests conditional on various statistics of interest, which has not been previously considered for randomization-based inference for Bernoulli trial experiments.
3.1 Case 1: Propensity Scores are Equal Across Units
Let $e(\mathbf{x}_i) \equiv P(W_i = 1 \mid \mathbf{x}_i)$ denote the propensity score, i.e., the probability that the $i$th unit receives treatment, given a vector of pretreatment covariates $\mathbf{x}_i$. In this section we assume without loss of generality that $e(\mathbf{x}_i) = 0.5$ for all $i = 1, \dots, N$; i.e., $P(W_i = 1) = 0.5$ for all units. We consider several sets of possible treatment assignments $\mathbb{W}^+$ and note the corresponding $P(\mathbf{W} = \mathbf{w})$ for each $\mathbf{w} \in \mathbb{W}^+$, which can be used to compute the $p$-value (7) for testing the Sharp Null Hypothesis.
First consider the set $\mathbb{W}^+ = \{0, 1\}^N$, i.e., experiments that are characterized by independent, unbiased coin flips, where any number of units can receive treatment or control. In this case, $P(\mathbf{W} = \mathbf{w}) = 0.5^N$ for all $\mathbf{w} \in \mathbb{W}^+$. To generate random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$, one simply flips $N$ unbiased coins to generate an $N$-dimensional vector of 0s and 1s.
However, Imbens and Rubin^{25} note that when $\mathbb{W}^+ = \{0, 1\}^N$, there is a nonzero probability of $\mathbf{W} = \mathbf{1}_N$ or $\mathbf{W} = \mathbf{0}_N$. In these cases, most test statistics are undefined, and so they do not consider this case further. This concern can be addressed by either defining test statistics for these cases (a common choice being zero) or instead considering the set $\mathbb{W}^+ = \{0, 1\}^N \setminus \{\mathbf{1}_N, \mathbf{0}_N\}$ of possible treatment assignments. In this case, $P(\mathbf{W} = \mathbf{w}) = (2^N - 2)^{-1}$ for all $\mathbf{w} \in \mathbb{W}^+$. To generate random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$, one simply flips $N$ unbiased coins and only accepts a random draw if it is not $\mathbf{1}_N$ or $\mathbf{0}_N$. This follows the argument of Imbens and Rubin^{25} that preventing “unhelpful treatment allocations” will yield more precise inferences for treatment effects.
Indeed, we can even further restrict $\mathbb{W}^+$. It is common to condition on statistics such as the number of units that receive treatment, $N_T = \sum_{i=1}^{N} W_i$. When $\mathbb{W}^+ = \{\mathbf{w} \in \{0, 1\}^N : \sum_{i=1}^{N} w_i = N_T\}$ for some prespecified $N_T$, $P(\mathbf{W} = \mathbf{w}) = \binom{N}{N_T}^{-1}$ for all $\mathbf{w} \in \mathbb{W}^+$. To generate random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$, one simply flips $N$ unbiased coins and only accepts a random draw if $\sum_{i=1}^{N} w_i = N_T$; equivalently, one can obtain such random draws by randomly permuting the observed treatment assignment $\mathbf{W}^{obs}$. A randomization test that uses such a $\mathbb{W}^+$ and $P(\mathbf{W})$ is the most common randomization test in the literature and corresponds to what is typically referred to as a “completely randomized” experimental design.^{25} Because of the equivalence to random permutations of $\mathbf{W}^{obs}$, this randomization test is also often called a permutation test.
3.2 Case 2: Propensity Scores Differ Across Units
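Now consider the case where $e(\mathbf{x}_i) \neq e(\mathbf{x}_j)$ for some $i \neq j$, i.e., where the propensity scores differ across units. This may be due to differences in the covariate vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ or some other experimental design prespecification. Again we consider several sets of possible treatment assignments $\mathbb{W}^+$, note the corresponding $P(\mathbf{W} = \mathbf{w})$ for each $\mathbf{w} \in \mathbb{W}^+$, and state how to generate random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$, which can be used to compute or approximate the $p$-value for testing the Sharp Null Hypothesis.
First consider the set $\mathbb{W}^+ = \{0, 1\}^N$. In this case,
$$P(\mathbf{W} = \mathbf{w}) = \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i} \tag{11}$$
which is identical to the assignment mechanism (1) typically assumed in observational studies. To generate random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$, one simply flips $N$ biased coins with probabilities corresponding to the $e(\mathbf{x}_i)$ to generate an $N$-dimensional vector of 0s and 1s.
However, there is still a chance, though small, that a random draw from $P(\mathbf{W})$ will be equal to $\mathbf{1}_N$ or $\mathbf{0}_N$, and in this case test statistics will be undefined. Now consider the restricted set $\mathbb{W}^+ = \{0, 1\}^N \setminus \{\mathbf{1}_N, \mathbf{0}_N\}$. In this case,
$$P(\mathbf{W} = \mathbf{w}) = \frac{\prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i}}{1 - \prod_{i=1}^{N} e(\mathbf{x}_i) - \prod_{i=1}^{N} \left[1 - e(\mathbf{x}_i)\right]} \tag{12}$$
To arrive at this result, note that when $\mathbb{W}^+ = \{0, 1\}^N \setminus \{\mathbf{1}_N, \mathbf{0}_N\}$,
$$\sum_{\mathbf{w} \in \mathbb{W}^+} \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i} = 1 - \prod_{i=1}^{N} e(\mathbf{x}_i) - \prod_{i=1}^{N} \left[1 - e(\mathbf{x}_i)\right] \tag{13}$$
Thus, the probabilities (12) sum to one. To generate random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$, one simply flips $N$ biased coins and only accepts a random draw if it is not $\mathbf{1}_N$ or $\mathbf{0}_N$.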
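The unrestricted and restricted Bernoulli-trial probabilities can be checked numerically for a small number of units. The sketch below is illustrative: the propensity scores are hypothetical values, the function names are made up for this example, and full enumeration is only feasible for small $N$.

```python
import random
from itertools import product
from math import prod
from typing import List, Sequence


def bernoulli_prob(w: Sequence[int], e: Sequence[float]) -> float:
    """P(W = w) for independent Bernoulli trials with propensity scores e."""
    return prod(ei if wi == 1 else 1.0 - ei for wi, ei in zip(w, e))


def restricted_prob(w: Sequence[int], e: Sequence[float]) -> float:
    """P(W = w) once the all-treated and all-control assignments are excluded."""
    if sum(w) in (0, len(e)):
        return 0.0
    excluded = prod(e) + prod(1.0 - ei for ei in e)
    return bernoulli_prob(w, e) / (1.0 - excluded)


def draw_restricted(e: Sequence[float], rng: random.Random) -> List[int]:
    """Flip biased coins until the draw has at least one treated and one control unit."""
    n = len(e)
    while True:
        w = [1 if rng.random() < ei else 0 for ei in e]
        if 0 < sum(w) < n:
            return w


e = [0.2, 0.5, 0.7, 0.9]  # hypothetical propensity scores
total = sum(restricted_prob(w, e) for w in product([0, 1], repeat=len(e)))
```

Here `total` sums the restricted probabilities over all sixteen assignments and should equal one up to floating-point error, which mirrors the normalization argument above.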
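Again, we can further restrict $\mathbb{W}^+$ to incorporate certain statistics of interest, such as the number of units assigned to treatment. Consider the set $\mathbb{W}^+ = \{\mathbf{w} \in \{0, 1\}^N : \sum_{i=1}^{N} w_i = N_T\}$ for some prespecified $N_T$. In this case,
$$P(\mathbf{W} = \mathbf{w}) = \frac{\prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i}}{\sum_{\mathbf{w}' \in \mathbb{W}^+} \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i'} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i'}} \tag{14}$$
The denominator of (14) is seemingly difficult to compute, due to the large number, $\binom{N}{N_T}$, of possible treatment assignments $\mathbf{w} \in \mathbb{W}^+$. Chen and Liu^{46} provide an algorithm to compute it exactly. Alternatively, the denominator can be estimated, and there are many ways to estimate this quantity. One option is to randomly sample $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$ uniformly from $\mathbb{W}^+$ and use the unbiased estimator
$$\sum_{\mathbf{w} \in \mathbb{W}^+} \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i} \approx \frac{\binom{N}{N_T}}{M} \sum_{m=1}^{M} \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i^{(m)}} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i^{(m)}} \tag{15}$$
which is the typical estimator for a population total seen in the survey sampling literature (e.g., Lohr, page 55).^{47}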
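Both routes to the denominator of (14) can be sketched briefly. The exact routine below is a generic dynamic-programming recursion over units, which may or may not coincide with the algorithm of Chen and Liu; it relies on the fact that the denominator equals the probability that exactly $N_T$ of the independent Bernoulli trials succeed. The Monte Carlo routine scales the average product over uniform draws by the number of assignments. Function names and propensity scores are hypothetical.

```python
import random
from math import comb, prod
from typing import Sequence


def exact_normalizer(e: Sequence[float], n_t: int) -> float:
    """Sum of Bernoulli-trial probabilities over all assignments with n_t treated."""
    dist = [1.0]  # dist[k] = P(k treated among the units processed so far)
    for ei in e:
        new = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            new[k] += p * (1.0 - ei)  # current unit assigned to control
            new[k + 1] += p * ei      # current unit assigned to treatment
        dist = new
    return dist[n_t]


def mc_normalizer(
    e: Sequence[float], n_t: int, n_draws: int, rng: random.Random
) -> float:
    """Monte Carlo estimate: C(N, n_t) times the mean product over uniform draws."""
    n = len(e)
    total = 0.0
    for _ in range(n_draws):
        treated = set(rng.sample(range(n), n_t))  # uniform over size-n_t subsets
        total += prod(e[i] if i in treated else 1.0 - e[i] for i in range(n))
    return comb(n, n_t) * total / n_draws


e = [0.1, 0.3, 0.5, 0.6, 0.8, 0.9]  # hypothetical propensity scores
exact = exact_normalizer(e, 3)
estimate = mc_normalizer(e, 3, 20000, random.Random(0))
```

The recursion costs on the order of $N^2$ operations, so for moderate $N$ the exact value is cheap and the Monte Carlo estimate mainly serves as a cross-check.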
However, computing the denominator of (14) is only required when one wants to compute the randomization-test $p$-value exactly using (7). Instead, one can still approximate this $p$-value using (8) by generating random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)}$, which is done by flipping $N$ biased coins and only accepting a random draw if $\sum_{i=1}^{N} w_i = N_T$.
This introduces straightforward rejection-sampling and importance-sampling procedures for conducting conditional randomization-based inference for Bernoulli trial experiments.
3.3 Rejection-Sampling and Importance-Sampling Procedures for Conditional Randomization Tests
As discussed in Section 2, researchers do not typically compute the randomization test $p$-value (7) exactly, but instead generate random draws from the probability distribution $P(\mathbf{W})$ and then approximate the randomization test $p$-value using (8). To conduct conditional randomization-based inference, one generates random draws from conditional probability distributions such as $P(\mathbf{W} \mid \sum_{i=1}^{N} W_i = N_T)$ instead of $P(\mathbf{W})$. This is straightforward when the propensity scores are the same across units: For example, as discussed in Section 3.1, samples from $P(\mathbf{W} \mid \sum_{i=1}^{N} W_i = N_T)$ correspond to random permutations of the observed treatment assignment $\mathbf{W}^{obs}$ when the propensity scores are equal across units. However, sampling from such conditional distributions when the propensity scores differ across units is less trivial. To the best of our knowledge, a strategy for how to sample from such distributions has not been described in the literature.
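Conducting conditional randomization-based inference involves focusing only on “acceptable” treatment assignments $\mathbf{w}$; e.g., $\mathbf{w}$ that are not $\mathbf{1}_N$ or $\mathbf{0}_N$, or $\mathbf{w}$ such that $\sum_{i=1}^{N} w_i = N_T$ for some prespecified $N_T$. To formalize this idea, define an acceptance criterion that is a function of the treatment assignment and pretreatment covariates:
$$\phi(\mathbf{W}, \mathbf{X}) = \begin{cases} 1 & \text{if } \mathbf{W} \text{ is an acceptable treatment assignment} \\ 0 & \text{otherwise} \end{cases} \tag{16}$$
The criterion $\phi(\mathbf{W}, \mathbf{X})$ can encapsulate any statistic of interest, such as the number of treated units or forms of covariate balance. The criterion should be defined by statistics that are believed to be related to the outcome, such as the number of treated units with a certain covariate value or the covariate means in the treatment and control groups. See Hennessy et al.^{48} for further discussion about the types of statistics that should be conditioned on for conditional randomization-based inference.
Once $\phi(\mathbf{W}, \mathbf{X})$ is defined, one conducts conditional randomization-based inference by performing a randomization test only within the set of randomizations such that the acceptance criterion is satisfied. For example, Sections 3.1 and 3.2 discuss conducting randomization-based inference for the case when $\phi(\mathbf{W}, \mathbf{X}) = 1$ if $\sum_{i=1}^{N} W_i = N_T$ and 0 otherwise. In general, the true conditional randomization test $p$-value is
$$p_{\phi} = \sum_{\mathbf{w} \in \mathbb{W}_{\phi}^+} \mathbb{I}\left( |t(\mathbf{w}, \mathbf{Y}(\mathbf{w}))| \geq |t^{obs}| \right) P(\mathbf{W} = \mathbf{w} \mid \phi(\mathbf{W}, \mathbf{X}) = 1) \tag{17}$$
where $\mathbb{W}_{\phi}^+ \equiv \{\mathbf{w} \in \mathbb{W}^+ : \phi(\mathbf{w}, \mathbf{X}) = 1\}$ is the set of acceptable randomizations. The $p$-value $p_{\phi}$ is nearly identical to the $p$-value (7), but using only the set of acceptable randomizations instead of the set of all randomizations. The set of acceptable randomizations is typically large, and thus the $p$-value $p_{\phi}$ cannot always be computed exactly. Instead, it can be unbiasedly estimated using
$$\hat{p}_{RS} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}\left( |t(\mathbf{w}^{(m)}, \mathbf{Y}(\mathbf{w}^{(m)}))| \geq |t^{obs}| \right), \quad \mathbf{w}^{(m)} \sim P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1) \tag{18}$$
i.e., the approximation presented in (8). We propose a rejection-sampling procedure for generating random samples $\mathbf{w}^{(m)} \sim P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$: Randomly generate draws $\mathbf{w}^{(m)}$ from $P(\mathbf{W})$, and only accept a draw if $\phi(\mathbf{w}^{(m)}, \mathbf{X}) = 1$. For Bernoulli trials, this involves flipping $N$ coins (biased or unbiased, depending on the experimental design), and only accepting a particular assignment if $\phi(\mathbf{w}^{(m)}, \mathbf{X}) = 1$.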
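A minimal sketch of the rejection-sampling procedure is given below, assuming the Sharp Null Hypothesis, the difference in means as test statistic, and an acceptance criterion that fixes the number of treated units at its observed value. The data, propensity scores, and function names are hypothetical and chosen so that the exact conditional $p$-value can be worked out by hand.

```python
import random
from typing import Callable, List, Sequence


def diff_in_means(w: Sequence[int], y: Sequence[float]) -> float:
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def rejection_sample(
    e: Sequence[float],
    accept: Callable[[Sequence[int]], bool],
    n_draws: int,
    rng: random.Random,
) -> List[List[int]]:
    """Flip biased coins and keep only assignments that satisfy the criterion."""
    draws: List[List[int]] = []
    while len(draws) < n_draws:
        w = [1 if rng.random() < ei else 0 for ei in e]
        if accept(w):
            draws.append(w)
    return draws


def estimate_p_rs(
    w_obs: Sequence[int], y_obs: Sequence[float], draws: Sequence[Sequence[int]]
) -> float:
    """Simple average of the indicator over the accepted draws."""
    t_obs = abs(diff_in_means(w_obs, y_obs))
    hits = sum(abs(diff_in_means(w, y_obs)) >= t_obs - 1e-12 for w in draws)
    return hits / len(draws)


rng = random.Random(0)
e = [0.6, 0.6, 0.6, 0.4, 0.4, 0.4]  # hypothetical propensity scores
w_obs = [1, 1, 1, 0, 0, 0]
y_obs = [10.0, 11.0, 12.0, 0.0, 1.0, 2.0]
n_treated = sum(w_obs)


def accept(w: Sequence[int]) -> bool:
    """Acceptance criterion: same number of treated units as observed."""
    return sum(w) == n_treated


draws = rejection_sample(e, accept, 20000, rng)
p_rs = estimate_p_rs(w_obs, y_obs, draws)
```

For these hypothetical inputs, only the observed assignment and its mirror image are as extreme as the observed statistic, and their conditional probabilities sum to about 0.158, so `p_rs` should land near that value. Note that under equal propensity scores the same data would give 2/20 = 0.1, which illustrates how the assignment probabilities enter the test.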
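While the rejection-sampling estimator $\hat{p}_{RS}$ is unbiased for $p_{\phi}$, it may be computationally intensive to generate random samples $\mathbf{w}^{(m)} \sim P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ if $\phi(\mathbf{W}, \mathbf{X})$ is particularly stringent. As an alternative, one can take an importance-sampling approach to obtain a biased estimate of $p_{\phi}$ at a much lower computational cost.^{49, 50, 51} First, define a proposal distribution $q(\mathbf{W})$ whose support includes the support of $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ and that is less computationally burdensome to sample from than $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$. Then, the importance-sampling estimator for $p_{\phi}$ is
$$\hat{p}_{IS} = \frac{\sum_{m=1}^{M} \frac{P(\mathbf{W} = \mathbf{w}^{(m)} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)}{q(\mathbf{w}^{(m)})} \mathbb{I}\left( |t(\mathbf{w}^{(m)}, \mathbf{Y}(\mathbf{w}^{(m)}))| \geq |t^{obs}| \right)}{\sum_{m=1}^{M} \frac{P(\mathbf{W} = \mathbf{w}^{(m)} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)}{q(\mathbf{w}^{(m)})}}, \quad \mathbf{w}^{(m)} \sim q(\mathbf{W}) \tag{19}$$
In other words, the rejection-sampling estimator is a simple average based on the random draws $\mathbf{w}^{(m)} \sim P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$, whereas the importance-sampling estimator is a weighted average based on the random draws $\mathbf{w}^{(m)} \sim q(\mathbf{W})$. Thus, $\hat{p}_{IS}$ will be easier to compute than $\hat{p}_{RS}$ if it is less computationally intensive to sample from the proposal distribution than from the target distribution $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$.
The importance-sampling estimator can be reduced to a simple form by first noting that, under the assumption of a strongly ignorable assignment mechanism (1), for any $\mathbf{w} \in \mathbb{W}_{\phi}^+$,
$$P(\mathbf{W} = \mathbf{w} \mid \phi(\mathbf{W}, \mathbf{X}) = 1) = \frac{P(\mathbf{W} = \mathbf{w}, \phi(\mathbf{W}, \mathbf{X}) = 1)}{P(\phi(\mathbf{W}, \mathbf{X}) = 1)} \tag{20}$$
$$= \frac{P(\mathbf{W} = \mathbf{w})}{\sum_{\mathbf{w}' \in \mathbb{W}_{\phi}^+} P(\mathbf{W} = \mathbf{w}')} \tag{21}$$
$$= \frac{\prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i}}{\sum_{\mathbf{w}' \in \mathbb{W}_{\phi}^+} \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i'} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i'}} \tag{22}$$
where $\mathbb{W}_{\phi}^+$ is the set of acceptable assignments according to the acceptance criterion. Then, if the proposal distribution is uniform across all acceptable assignments, i.e., if $q(\mathbf{w}) = 1 / |\mathbb{W}_{\phi}^+|$ for all $\mathbf{w} \in \mathbb{W}_{\phi}^+$, then the importance-sampling $p$-value approximation reduces to
$$\hat{p}_{IS} = \frac{\sum_{m=1}^{M} \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i^{(m)}} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i^{(m)}} \mathbb{I}\left( |t(\mathbf{w}^{(m)}, \mathbf{Y}(\mathbf{w}^{(m)}))| \geq |t^{obs}| \right)}{\sum_{m=1}^{M} \prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i^{(m)}} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i^{(m)}}} \tag{23}$$
where the quantity $\prod_{i=1}^{N} e(\mathbf{x}_i)^{w_i^{(m)}} \left[1 - e(\mathbf{x}_i)\right]^{1 - w_i^{(m)}}$ is easy to compute because the propensity scores are known.
For example, sampling from the distribution $P(\mathbf{W} \mid \sum_{i=1}^{N} W_i = N_T)$ via rejection-sampling may be computationally intensive if the propensity scores differ across units and $N$ is large. One proposal distribution that is uniform across assignments is random permutations of $\mathbf{W}^{obs}$, whose support is equal to the support of $P(\mathbf{W} \mid \sum_{i=1}^{N} W_i = N_T)$ but which is less computationally intensive to sample from. Thus, one can still utilize random permutations of $\mathbf{W}^{obs}$ to estimate the conditional randomization test $p$-value, as in Case 1 in Section 3.1, using the importance-sampling estimator $\hat{p}_{IS}$.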
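The permutation-based importance-sampling estimator can be sketched as a weighted average in which each permuted assignment is weighted by its unnormalized Bernoulli-trial probability, so that the normalizing constant cancels in the ratio. The sketch assumes the Sharp Null Hypothesis and the difference in means as test statistic; the data, propensity scores, and function names are hypothetical.

```python
import random
from math import prod
from typing import Sequence


def diff_in_means(w: Sequence[int], y: Sequence[float]) -> float:
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def estimate_p_is(
    w_obs: Sequence[int],
    y_obs: Sequence[float],
    e: Sequence[float],
    n_draws: int,
    rng: random.Random,
) -> float:
    """Self-normalized importance-sampling estimate with a permutation proposal."""
    t_obs = abs(diff_in_means(w_obs, y_obs))
    w = list(w_obs)
    numerator = 0.0
    denominator = 0.0
    for _ in range(n_draws):
        rng.shuffle(w)  # uniform over assignments with the observed number treated
        # Unnormalized target probability of this assignment.
        weight = prod(ei if wi == 1 else 1.0 - ei for wi, ei in zip(w, e))
        denominator += weight
        if abs(diff_in_means(w, y_obs)) >= t_obs - 1e-12:
            numerator += weight
    return numerator / denominator


e = [0.6, 0.6, 0.6, 0.4, 0.4, 0.4]  # hypothetical propensity scores
w_obs = [1, 1, 1, 0, 0, 0]
y_obs = [10.0, 11.0, 12.0, 0.0, 1.0, 2.0]
p_is = estimate_p_is(w_obs, y_obs, e, 20000, random.Random(0))
```

No draw is ever discarded here, which is where the computational saving comes from; the price is the ratio form of the estimator, which is only approximately unbiased. For these hypothetical inputs the exact conditional $p$-value is about 0.158.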
However, as noted earlier, unlike the estimator $\hat{p}_{RS}$, the estimator $\hat{p}_{IS}$ is biased of order $M^{-1}$,^{49} which, as we show in Section 4, may break the validity of the conditional randomization test. Thus, we recommend using the rejection-sampling estimator $\hat{p}_{RS}$ to ensure valid inferences from our conditional randomization test if it is not computationally intensive to do so. However, if it is computationally intensive to generate draws $\mathbf{w}^{(m)} \sim P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ but easy to generate draws $\mathbf{w}^{(m)} \sim q(\mathbf{W})$ for some proposal distribution, then we recommend using the importance-sampling estimator $\hat{p}_{IS}$ while ensuring that the number of random samples $M$ is large such that the bias of $\hat{p}_{IS}$ is minimal. For an in-depth discussion of rejection-sampling versus importance-sampling, see Robert and Casella (Chapter 3).^{50}
The above procedure is closely related to the rerandomization framework developed by Morgan and Rubin,^{52} who define an assignment criterion $\phi(\mathbf{W}, \mathbf{X})$ in order to ensure a certain level of covariate balance as part of an experimental design. Recent works on rerandomization have shown how $\phi(\mathbf{W}, \mathbf{X})$ can be flexibly defined: Morgan and Rubin^{53} defined $\phi(\mathbf{W}, \mathbf{X})$ such that it incorporates tiers of importance for covariates, and Branson et al.^{54} defined $\phi(\mathbf{W}, \mathbf{X})$ such that it incorporates tiers of importance for both covariates and multiple treatment effects of interest.
However, the purpose of the introduction of $\phi(\mathbf{W}, \mathbf{X})$ here is to conduct a conditional randomization test, rather than yield a desirable experimental design. It is similar to the conditional randomization test of Hennessy et al.,^{48} who define $\phi(\mathbf{W}, \mathbf{X})$ in terms of categorical covariate balance. However, because Hennessy et al.^{48} and other conditional randomization tests (e.g., Rosenbaum^{27}) have focused on cases where propensity scores are equal across units or strata, they could sample from $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ directly via random permutations of $\mathbf{W}^{obs}$. Indeed, both the rerandomization and conditional randomization test literature have focused on cases where the propensity scores are equal across units, whereas our approach addresses the more general case where propensity scores differ across units. Furthermore, if our rejection-sampling approach is computationally intensive, our importance-sampling approach allows one to still utilize random permutations of $\mathbf{W}^{obs}$ to quickly estimate the conditional randomization test $p$-value at the cost of incurring a small bias.
Now we establish that the unconditional and conditional randomization tests (i.e., the randomization test using $p$ in (7) and the randomization test using $p_{\phi}$ in (17), respectively) are valid tests for Bernoulli trial experiments. While these are results for the randomization tests that use the exact $p$-values $p$ and $p_{\phi}$, this also suggests that our rejection-sampling approach for unbiasedly estimating $p_{\phi}$ yields valid statistical inferences. In Section 4, we empirically confirm the validity of these randomization tests, and we discuss to what extent our importance-sampling approach also yields valid statistical inferences.
3.4 Validity of Unconditional and Conditional Randomization Tests for Bernoulli Trial Experiments
For both theorems presented below, we assume that the treatment is assigned according to the strongly ignorable assignment mechanism (1). First, we establish that the randomization test that uses this assignment mechanism is valid, i.e., that the probability of this level randomization test falsely rejecting the Sharp Null Hypothesis is no greater than . This result is unsurprising given wellknown results about the validity of randomization tests. Then, we establish that the conditional randomization test—i.e., the randomization test that uses the assignment mechanism for some prespecified criterion instead of the assignment mechanism (1)—is also valid. This result is slightly surprising in the sense that the validity of the randomization test holds even if the test uses an assignment mechanism other than the one used to conduct the randomized experiment.
Theorem 3.1 (Validity of Unconditional Randomization Test)
Assume that a randomized experiment is conducted using the strongly ignorable assignment mechanism (1). Define the two-sided randomization-test $p$-value as
(24) 
for some test statistic , where . Then the randomization test that rejects the Sharp Null Hypothesis when is a valid test in the sense that
(25) 
where is the Sharp Null Hypothesis defined in (4).
Theorem 3.2 (Validity of Conditional Randomization Test)
Assume that a randomized experiment is conducted using the strongly ignorable assignment mechanism (1). Define the two-sided conditional randomization-test $p$-value as
(26) 
for some test statistic , where is the set of acceptable randomizations according to some prespecified criterion . Then the randomization test that rejects the Sharp Null Hypothesis when is a valid test in the sense that
(27) 
where is the Sharp Null Hypothesis defined in (4).
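The claim of Theorem 3.1 can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the outcomes `y` and propensity scores `e` are hypothetical, the test statistic is the mean difference, and the assignment probabilities are Bernoulli-trial probabilities renormalized after excluding the all-treated and all-control assignments (our reading of the assignment mechanism, not a formula taken from the text). For every assignment that could have been observed, it computes the exact two-sided $p$-value and then sums the probabilities of the assignments that lead to rejection.

```python
from itertools import product
from math import prod

def mean_diff(w, y):
    """Mean-difference statistic: treated mean minus control mean."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def assignment_probs(e):
    """Bernoulli-trial probability of each assignment, renormalized after
    dropping the all-treated and all-control assignments (assumption)."""
    n = len(e)
    raw = {}
    for w in product((0, 1), repeat=n):
        if 0 < sum(w) < n:
            raw[w] = prod(ei if wi else 1 - ei for wi, ei in zip(w, e))
    total = sum(raw.values())
    return {w: p / total for w, p in raw.items()}

def exact_size(y, e, alpha, tol=1e-12):
    """Exact probability of rejecting the sharp null when it is true."""
    probs = assignment_probs(e)
    # Under the sharp null the outcomes y do not change with the assignment.
    stats = {w: abs(mean_diff(w, y)) for w in probs}
    size = 0.0
    for w_obs, p_obs in probs.items():
        # tol guards against floating-point ties between equal statistics.
        p_value = sum(p for w, p in probs.items()
                      if stats[w] >= stats[w_obs] - tol)
        if p_value <= alpha:
            size += p_obs
    return size

# Hypothetical outcomes and propensity scores for six units.
y = [0.3, -1.2, 0.8, 2.1, -0.4, 1.5]
e = [0.2, 0.4, 0.5, 0.6, 0.7, 0.9]
sizes = {alpha: exact_size(y, e, alpha) for alpha in (0.01, 0.05, 0.10, 0.25)}
for alpha, size in sizes.items():
    print(f"alpha = {alpha:.2f}: exact rejection probability = {size:.4f}")
```

Because the randomization distribution is discrete, the exact rejection probability is typically below the nominal level rather than equal to it.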
Now we illustrate our randomization test procedure using a simple example where the randomization test $p$-value is computed exactly. Then we conduct a simulation study where the randomization test $p$-value is estimated, and we compare the rejection-sampling and importance-sampling approaches for estimating the $p$-value. Furthermore, we empirically confirm the validity of our randomization tests as established by Theorems 3.1 and 3.2 above, and we demonstrate how conditioning on various statistics of interest can be used to construct statistically powerful randomization tests for Bernoulli trial experiments.
4 Simulation Study of Unconditional and Conditional Randomization Tests
4.1 Illustrative Example: Computing the Exact $p$-value
As discussed in Section 2.2, the randomization-test $p$-value is typically approximated using (8) by drawing many possible treatment assignments. However, for small samples, the $p$-value can be computed exactly using (7) by examining every assignment in the set of possible treatment assignments. Here we explore a small-sample example to illustrate how to conduct randomization tests and construct confidence intervals when propensity scores vary across units. We also discuss how this procedure differs from the typical case where propensity scores are the same across units.
Consider a randomized experiment with $N = 10$ units. The potential outcomes for these units are shown in Table 1, where the true treatment effect is 0.5. Suppose that a randomized experiment characterized by Bernoulli trials has occurred; the corresponding propensity scores, treatment assignment, and observed outcomes are also shown in Table 1. For now, assume that the task at hand is to conduct randomization-based inference for the average treatment effect given the treatment assignment, observed outcomes, and propensity scores in Table 1.
Unit  Y(0)  Y(1)  Treatment  Observed outcome  Propensity score
1  -0.56  -0.06  0  -0.56  0.1
2  -0.23  0.27  1  0.26  0.2
3  1.56  2.06  1  2.06  0.3
4  0.07  0.57  0  0.07  0.4
5  0.13  0.63  0  0.13  0.5
6  1.72  2.22  1  2.22  0.5
7  0.46  0.96  1  0.96  0.6
8  -1.27  -0.77  1  -0.77  0.7
9  -0.69  -0.19  0  -0.69  0.8
10  -0.45  0.05  1  0.05  0.9
With $N = 10$ units, only $2^{10} = 1024$ possible treatment assignments can be considered. Excluding the two assignments in which every unit is treated or every unit is in control leaves 1022 possible assignments. Under the Sharp Null Hypothesis, the observed outcomes will be the same as those in Table 1 for all 1022 of these assignments. We test this hypothesis following the three-step procedure in Section 2.2: First choose the set of possible treatment assignments and its probability distribution, then choose a test statistic, and finally compute the randomization test $p$-value.
We first consider the set of treatment assignments that was used during randomization, where
(28) 
for each treatment assignment in this set, as previously shown in (12). We choose the mean-difference estimator, given in (6), as the test statistic. We then iterate through each of the 1022 treatment assignments and compute the test statistic assuming the Sharp Null Hypothesis is true. Once this is done, the randomization test $p$-value can be computed exactly using
(29) 
as previously shown in (7). From Table 1, one can calculate the observed test statistic, which is equal to 1.06.
Figure 1(a) shows the distribution of the absolute value of the test statistic across all 1022 possible treatment assignments, assuming the Sharp Null Hypothesis is true. The portion of this distribution that corresponds to test statistics larger than the observed one is colored in gray. The randomization test $p$-value is then the probability of any gray treatment assignment occurring, which we find to be 0.12. If the propensity scores were equal across units, which is typically the case in the randomization test literature, then the randomization test $p$-value would simply be the number of gray treatment assignments divided by the total number of treatment assignments, which was, in this case, . Thus, importantly, the $p$-value reflects the design of the randomized experiment: It incorporates the propensity scores that were used to randomize the units during the experiment.
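The exact computation described above can be sketched in a few lines. This is a minimal illustration rather than the code used for the paper, and it rests on two assumptions that should be read loudly: (i) the minus signs on some Table 1 outcomes are restored so that every unit has a treatment effect of 0.5 and the observed mean difference rounds to the reported 1.06, and (ii) each assignment's probability is the product of Bernoulli probabilities, renormalized over the 1022 assignments that have at least one treated and one control unit.

```python
from itertools import product
from math import prod

# Table 1 data; the negative signs are an assumption (see the text above).
w_obs = (0, 1, 1, 0, 0, 1, 1, 1, 0, 1)
y_obs = (-0.56, 0.26, 2.06, 0.07, 0.13, 2.22, 0.96, -0.77, -0.69, 0.05)
e = (0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9)

def mean_diff(w, y):
    """Mean-difference statistic: treated mean minus control mean."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# All assignments with at least one treated and one control unit.
n = len(e)
assignments = [w for w in product((0, 1), repeat=n) if 0 < sum(w) < n]

# Bernoulli-trial probabilities, renormalized over the retained assignments.
raw = [prod(ei if wi else 1 - ei for wi, ei in zip(w, e)) for w in assignments]
total = sum(raw)
probs = [r / total for r in raw]

# Under the sharp null, y_obs is unchanged for every hypothetical assignment.
t_obs = mean_diff(w_obs, y_obs)
extreme = [abs(mean_diff(w, y_obs)) >= abs(t_obs) - 1e-12 for w in assignments]

# Probability-weighted p-value (propensity scores differ across units).
p_value = sum(p for p, is_extreme in zip(probs, extreme) if is_extreme)
# Equal-probability analogue: a simple proportion of extreme assignments.
p_equal = sum(extreme) / len(assignments)

print(f"observed statistic: {t_obs:.2f}")
print(f"weighted p-value: {p_value:.3f}, equal-probability p-value: {p_equal:.3f}")
```

The only difference between the two $p$-values is the weight given to each extreme assignment, which is how the design enters the analysis.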
Furthermore, we can obtain a confidence interval for the average treatment effect by inverting this randomization test using the procedure outlined in Section 2.3. We conducted a line search over hypothesized treatment effects and defined our 95% confidence interval as the set of hypothesized effects for which we obtained $p$-values greater than 0.05 when testing the hypothesis (9). We found the confidence interval to be . Again, this confidence interval reflects the design of the randomized experiment, because the $p$-value corresponding to each hypothesized effect depends on the propensity scores that were used during randomization.
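A test inversion of this kind can be sketched as follows. The details are assumptions rather than a reproduction of Section 2.3: we take the hypothesis to be a constant additive effect `tau0`, subtract `tau0` from the treated units' outcomes so that the adjusted outcomes are fixed across assignments under that hypothesis, and then apply the exact test to the adjusted outcomes. The search grid is hypothetical, and the Table 1 signs are restored as an assumption so that the observed statistic rounds to 1.06.

```python
from itertools import product
from math import prod

# Table 1 data; the negative signs are an assumption.
w_obs = (0, 1, 1, 0, 0, 1, 1, 1, 0, 1)
y_obs = (-0.56, 0.26, 2.06, 0.07, 0.13, 2.22, 0.96, -0.77, -0.69, 0.05)
e = (0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9)

def mean_diff(w, y):
    """Mean-difference statistic: treated mean minus control mean."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

n = len(e)
assignments = [w for w in product((0, 1), repeat=n) if 0 < sum(w) < n]
raw = [prod(ei if wi else 1 - ei for wi, ei in zip(w, e)) for w in assignments]
total = sum(raw)
probs = [r / total for r in raw]

def p_value(tau0):
    """Exact p-value for the hypothesis of a constant additive effect tau0."""
    # If the hypothesis holds, these adjusted outcomes equal the control
    # potential outcomes and therefore do not depend on the assignment.
    adjusted = [yi - tau0 * wi for yi, wi in zip(y_obs, w_obs)]
    t_obs = abs(mean_diff(w_obs, adjusted))
    return sum(p for w, p in zip(assignments, probs)
               if abs(mean_diff(w, adjusted)) >= t_obs - 1e-12)

# Hypothetical line search from -2.00 to 4.00 in steps of 0.05.
grid = [k / 20 for k in range(-40, 81)]
accepted = [tau0 for tau0 in grid if p_value(tau0) > 0.05]
# The accepted set need not be contiguous; we report its range.
ci = (min(accepted), max(accepted))
t_hat = mean_diff(w_obs, y_obs)
print(f"point estimate {t_hat:.2f}, 95% interval from {ci[0]:.2f} to {ci[1]:.2f}")
```

The resolution of the interval endpoints is limited by the grid spacing, so a finer grid near the endpoints would be a natural refinement.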
Note that Figure 1(a) displays every possible treatment assignment, including assignments where only one unit is assigned to treatment and the rest to control (and vice versa). However, researchers may want the statistical analysis to consider only treatment assignments similar to the observed one. For example, consider the more restrictive set of treatment assignments whose number of treated units equals the observed number, which in this example is 6, as seen in Table 1. Figure 1(b) shows the distribution of the test statistic for each assignment in this set, assuming the Sharp Null Hypothesis is true. Note that there are only $\binom{10}{6} = 210$ such treatment assignments, which form a subset of the 1022 assignments displayed in Figure 1(a). Again, the randomization test $p$-value is the probability of any gray treatment assignment occurring, but now the probability of any such assignment is
(30) 
as previously shown in (14). Because there are only 210 treatment assignments with 6 treated units, we can compute the denominator exactly and thus compute the randomization test $p$-value exactly as well, which we find to be equal to 0.17. Furthermore, using the same procedure as above, we found the 95% confidence interval to be . Thus, in addition to reflecting the experimental design, randomization-based inference can also reflect particular experiments of interest, such as ones similar to the observed one.
Now we conduct a simulation study with $N = 100$ units. In this case, it is computationally intensive to compute randomization test $p$-values exactly, and we instead approximate them. Furthermore, because the propensity scores vary across units, it will be difficult to sample directly from conditional distributions of the treatment assignment, and thus we will need the rejection-sampling procedure from Section 3.3 to conduct conditional inference.
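The conditional computation can be sketched in the same way as the unconditional one. The sketch assumes that the conditional probability of an assignment is its Bernoulli-trial probability divided by the total Bernoulli-trial probability of all assignments with the observed number of treated units, and it again restores the minus signs in Table 1 as an assumption.

```python
from itertools import combinations
from math import prod

# Table 1 data; the negative signs are an assumption.
w_obs = (0, 1, 1, 0, 0, 1, 1, 1, 0, 1)
y_obs = (-0.56, 0.26, 2.06, 0.07, 0.13, 2.22, 0.96, -0.77, -0.69, 0.05)
e = (0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9)

def mean_diff(w, y):
    """Mean-difference statistic: treated mean minus control mean."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

n, n_treated = len(e), sum(w_obs)

# Every assignment with the same number of treated units as observed.
assignments = []
for treated_units in combinations(range(n), n_treated):
    chosen = set(treated_units)
    assignments.append(tuple(1 if i in chosen else 0 for i in range(n)))

# Conditional probabilities: Bernoulli-trial probabilities renormalized
# over this restricted set (the denominator is computed exactly).
raw = [prod(ei if wi else 1 - ei for wi, ei in zip(w, e)) for w in assignments]
denominator = sum(raw)
cond_probs = [r / denominator for r in raw]

t_obs = mean_diff(w_obs, y_obs)
p_conditional = sum(p for w, p in zip(assignments, cond_probs)
                    if abs(mean_diff(w, y_obs)) >= abs(t_obs) - 1e-12)
print(f"{len(assignments)} assignments, conditional p-value = {p_conditional:.3f}")
```

Enumerating the restricted set directly is feasible here only because it is small; for larger samples the same conditional distribution has to be sampled instead.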
4.2 Simulation Setup
Hennessy et al.^{48} conducted a simulation study to show that their randomization test that conditioned on categorical covariate balance was more powerful than unconditional randomization tests when covariates were associated with the outcome. Hennessy et al.^{48} considered the case where the propensity scores are the same across units. We modify their simulation study such that units’ propensity scores differ. This simulation study serves two purposes:

Demonstrate how the rejection-sampling and importance-sampling procedures presented in Section 3.3 can be used to construct statistically powerful conditional randomization tests.
Consider $N = 100$ units with a single binary covariate, where 50 units take one covariate value and the other 50 units take the other. Each unit has two potential outcomes, one under treatment and one under control, which are generated once from the following:
(31)  
The parameter determines the strength of the association between and the potential outcomes, while is the treatment effect. Similar to Hennessy et al.,^{48} we consider the values and in our simulation. The previous example from Table 1 was generated using and .
The probability of each unit receiving treatment (i.e., its propensity score) was generated once from the following:
(32) 
This generating mechanism resulted in propensity scores being centered but spread around 0.5. In our simulation, propensity scores ranged from 0.22 to 0.87 with a mean of 0.49.
After the potential outcomes and propensity scores were generated, we randomly assigned units to treatment and control according to the probability distribution defined by the propensity scores. We prevented any treatment assignment from placing all units in treatment or all units in control; in other words, we considered the set of possible treatment assignments that excludes these two cases during randomization. In this case, there will always be 50 units with each covariate value, but the number of units assigned to treatment and control can vary from randomization to randomization. Any randomization of the 100 units to treatment and control can be summarized by Table 2, which includes the number of units assigned to treatment and control and the number of units with each covariate value.
1  0  

1  
2  
Before conducting the full simulation, let’s first consider one possible treatment assignment that we may observe during this simulation. We will present four randomization tests one could use to test the Sharp Null Hypothesis.
4.3 Example of One Treatment Assignment
Consider the case where the covariate is strongly associated with the outcome and the treatment effect is moderate. The potential outcomes were generated using (31), the propensity scores were generated using (32), and then units were randomized by flipping biased coins corresponding to these propensity scores. Table 3 shows the resulting randomization. Given this randomization and the corresponding dataset, how should we test the Sharp Null Hypothesis?
1  0  

1  
2  
Any randomization test should involve generating treatment assignments via biased coins corresponding to the prespecified propensity scores, because this is how the randomization observed in Table 3 was generated. However, which set of possible treatment assignments should one consider during the test? We consider four different sets and their associated randomization tests:

An unconditional randomization test (as presented in Section 2.2), with .

A randomization test conditional on the number of units assigned to treatment, with .

A randomization test conditional on the number of units with assigned to treatment, with .

A randomization test conditional on and , with .
Arguably, the first randomization test is the most natural choice, because it corresponds to the set of assignments that was actually used to generate the randomization observed in Table 3; however, because conditional randomization tests can be more powerful than unconditional randomization tests, researchers might consider the other three tests as well.
The above tests are ordered in terms of the restrictiveness of the set of treatment assignments they consider: The first two randomization tests involve flipping biased coins to generate treatment assignments, where all four cell counts in Table 2 can vary across assignments; in the third randomization test, only the cell counts for the other covariate value can vary; and in the fourth randomization test, none of these values can vary. Because iterating through every possible treatment assignment in each set is computationally intensive (for the example in Table 3, for the first test, and for the fourth test), we instead generate 1,000 treatment assignments using our rejection-sampling procedure discussed in Section 3.3 to approximate the randomization distribution for each test.
The approximate randomization distribution of the mean-difference test statistic under the Sharp Null Hypothesis for each of these four tests is shown in Figure 2. The conditional randomization distributions for the third and fourth tests are shifted to the left of the unconditional randomization distribution. This is no coincidence: In Table 3, there are more units with one covariate value in the treatment group and more units with the other covariate value in the control group; as a result, the treatment group will have units with systematically lower potential outcomes, due to the potential outcomes model (31). This is reflected in the conditional randomization distributions but not the unconditional one. Consequently, the conditional and unconditional randomization tests will give different results: One-sided $p$-values for the four tests are 0.58, 0.57, 0.08, and 0.00, respectively. This suggests that some of these randomization tests may be more powerful at detecting a treatment effect than others, which we further explore below.
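The rejection-sampling step can be sketched as follows. This is an illustration under assumptions, not the procedure of Section 3.3 verbatim: proposals are drawn by flipping independent biased coins, a proposal is kept only if it matches the observed number of treated units within each covariate stratum (the most restrictive of the four tests), and the $p$-value is estimated by the plain proportion of accepted draws with a statistic at least as extreme as the observed one. The data, the helper names, and the sample size of 20 are hypothetical.

```python
import random

def mean_diff(w, y):
    """Mean-difference statistic: treated mean minus control mean."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def stratum_counts(w, x):
    """Number of treated units within each covariate stratum."""
    return tuple(sum(wi for wi, xi in zip(w, x) if xi == level)
                 for level in sorted(set(x)))

def rejection_sample(e, accept, n_draws, rng):
    """Draw Bernoulli-trial assignments, keeping only acceptable ones."""
    draws, n_proposed = [], 0
    while len(draws) < n_draws:
        w = [1 if rng.random() < ei else 0 for ei in e]
        n_proposed += 1
        if accept(w):
            draws.append(w)
    return draws, n_proposed

# Hypothetical data: 20 units, two strata of 10, varying propensity scores.
rng = random.Random(0)
x = [1] * 10 + [2] * 10
e = [0.3, 0.4, 0.5, 0.6, 0.7] * 4
y = [rng.gauss(xi, 1.0) for xi in x]   # outcomes, fixed under the sharp null
w_obs = [1, 0] * 10                    # five treated units in each stratum

target = stratum_counts(w_obs, x)
draws, n_proposed = rejection_sample(
    e, lambda w: stratum_counts(w, x) == target, n_draws=500, rng=rng)

t_obs = abs(mean_diff(w_obs, y))
p_hat = sum(abs(mean_diff(w, y)) >= t_obs for w in draws) / len(draws)
print(f"accepted {len(draws)} of {n_proposed} proposals; estimated p-value {p_hat:.3f}")
```

The ratio of accepted to proposed draws shows directly why more restrictive criteria are more expensive: the acceptance probability shrinks as the set of acceptable assignments shrinks.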
4.4 Full Simulation Study
Now we compare the four randomization tests discussed in Section 4.3 in terms of their power. For each combination of the association parameter and the treatment effect, the potential outcomes were generated using (31), the propensity scores were generated using (32), and then units were randomized 1,000 times by flipping biased coins corresponding to these propensity scores.
For each of the 1,000 randomizations, we performed the four randomization tests discussed in Section 4.3 using the rejection-sampling approach to unbiasedly estimate each $p$-value using the estimator given in (18). For each test, we rejected the Sharp Null Hypothesis if the estimated $p$-value was at or below the significance level. Figure 3 displays the average rejection rate of the Sharp Null Hypothesis (i.e., the power) for each randomization test. When the treatment effect is zero, the Sharp Null Hypothesis is true, and all of the randomization tests reject the null 5% of the time. This confirms the validity of our unconditional and conditional randomization tests, as established by Theorems 3.1 and 3.2. When the covariate is not associated with the outcome, all of the randomization tests are essentially equivalent. As the covariate becomes more associated with the outcome, the third and fourth conditional randomization tests become more powerful than the unconditional test, while the randomization test that conditions only on the number of treated units remains equivalent to the unconditional randomization test. This is due to the fact that the quantity combined with may be confounded with the treatment effect if there is covariate imbalance between the treatment and control groups, as in the example presented in Table 3 and Figure 2.
However, our rejection-sampling approach can be computationally expensive. Generating 1,000 samples for each of the four randomization tests listed in Section 4.3 took on average 0.25, 1.22, 2.14, and 34.75 seconds, respectively. As an alternative to the rejection-sampling approach for computing the randomization test $p$-value of the fourth test, we can take our importance-sampling approach discussed in Section 3.3. Instead of sampling directly from the conditional assignment distribution via rejection-sampling, we generate proposals uniformly from the set of acceptable randomizations; this corresponds to random permutations of the observed treatment assignment within the two covariate strata. Then, we compute the importance-sampling estimate of the $p$-value given in (23) and reject the Sharp Null Hypothesis if this estimate is at or below the significance level.
Figure 4 compares the rejection-sampling approach with the importance-sampling approach for different numbers of importance-sampling proposals. The importance-sampling approach is computationally less intensive than the rejection-sampling approach: The importance-sampling approach using the three numbers of proposals considered took on average 0.68, 3.30, and 16.31 seconds, respectively. Note that even the largest case required less than half the time of the rejection-sampling approach. However, as noted in Section 3.3, the importance-sampling estimator has a bias that decreases with the number of proposals, and thus the $p$-value for the importance-sampling approach may be notably biased when few proposals are used. This can be seen in Figure 4: For the smallest number of proposals, the importance-sampling approach falsely rejects the Sharp Null Hypothesis when the treatment effect is zero at a substantially higher rate than 0.05; this suggests that the importance-sampling approach has a negative bias in this case. However, as the number of proposals increases, this bias is less substantial, and results using importance sampling approach those using rejection sampling. Thus, the bias of importance-sampling can break the validity of our randomization test, but this can be alleviated by increasing the
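The importance-sampling alternative can be sketched as follows. We assume a self-normalized estimator, which may differ in detail from (23): proposals are uniform permutations of the observed assignment within each covariate stratum, each proposal is weighted by its unnormalized Bernoulli-trial probability (the uniform proposal density is a constant and cancels), and the $p$-value estimate is the weighted share of proposals with a statistic at least as extreme as the observed one. All data and helper names are hypothetical.

```python
import random
from math import prod

def mean_diff(w, y):
    """Mean-difference statistic: treated mean minus control mean."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def permute_within_strata(w_obs, x, rng):
    """Uniform proposal: shuffle the observed assignment inside each stratum."""
    w = list(w_obs)
    for level in sorted(set(x)):
        idx = [i for i, xi in enumerate(x) if xi == level]
        values = [w[i] for i in idx]
        rng.shuffle(values)
        for i, v in zip(idx, values):
            w[i] = v
    return w

def bernoulli_weight(w, e):
    """Unnormalized target probability of an assignment under Bernoulli trials."""
    return prod(ei if wi else 1 - ei for wi, ei in zip(w, e))

def importance_sampling_p_value(w_obs, y, x, e, n_proposals, rng):
    """Self-normalized importance-sampling estimate of the conditional p-value."""
    t_obs = abs(mean_diff(w_obs, y))
    numerator = denominator = 0.0
    for _ in range(n_proposals):
        w = permute_within_strata(w_obs, x, rng)
        weight = bernoulli_weight(w, e)
        denominator += weight
        if abs(mean_diff(w, y)) >= t_obs - 1e-12:
            numerator += weight
    return numerator / denominator

# Hypothetical data: 20 units, two strata of 10, varying propensity scores.
rng = random.Random(1)
x = [1] * 10 + [2] * 10
e = [0.3, 0.4, 0.5, 0.6, 0.7] * 4
y = [rng.gauss(xi, 1.0) for xi in x]
w_obs = [1, 0] * 10

p_hat = importance_sampling_p_value(w_obs, y, x, e, n_proposals=2000, rng=rng)
print(f"importance-sampling estimate of the conditional p-value: {p_hat:.3f}")
```

Every proposal is acceptable by construction, so no draws are discarded; the price is that a ratio of two random sums is biased in finite samples, which matches the trade-off described above.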