On mitigating the analytical limitations of finely stratified experiments
Abstract
While attractive from a theoretical perspective, finely stratified experiments such as paired designs suffer from certain analytical limitations not present in block-randomized experiments with multiple treated and control individuals in each block. In short, when using an appropriately weighted difference-in-means to estimate the sample average treatment effect, the traditional variance estimator in a paired experiment is conservative unless the pairwise average treatment effects are constant across pairs; however, in more coarsely stratified experiments, the corresponding variance estimator is unbiased if treatment effects are constant within blocks, even if they vary across blocks. Using insights from classical least squares theory, we present an improved variance estimator appropriate in finely stratified experiments. The variance estimator is still conservative in expectation for the true variance of the difference-in-means estimator, but is asymptotically no larger than the classical variance estimator under mild conditions. The improvements stem from the exploitation of effect modification, and thus the magnitude of the improvement depends on the extent to which effect heterogeneity can be explained by observed covariates. Aided by these estimators, a new test for the null hypothesis of a constant treatment effect is proposed. These findings extend to some, but not all, superpopulation models, depending on whether or not the covariates are viewed as fixed across samples in the superpopulation formulation under consideration.
1 Introduction
1.1 The analytical limitations of finely stratified experiments
When considering competing experimental designs, both theoretical and practical concerns must be taken into account. While the advice stemming from theoretical derivations is often in harmony with advice addressing issues of implementation, discordant recommendations can be encountered in the literature. As an illustration, consider the choice of granularity of stratification in a randomized experiment as it pertains to the variance of the resulting difference-in-means estimator of the average treatment effect. Imbens (2011) demonstrates that when considering, ex ante, whether one should use a completely randomized experiment or a block-randomized experiment, the classical difference-in-means estimator for the average treatment effect in a block-randomized experiment has a variance which cannot be higher than that of the estimator from a completely randomized experiment; see also Fisher (1935); Cochran and Cox (1957); Cox (1958) and Greevy et al. (2004) among many. By the same logic, a given block can be further broken into substrata while not increasing the estimator’s variance. This leads Imai et al. (2009) and Imbens (2011) to prefer paired experiments from a theoretical perspective. Kallus (2013) further notes that from a population perspective, if one believes the response functions under treatment and control are Lipschitz with respect to some distance metric, then optimal pair matching with respect to that metric minimizes the variance of the difference-in-means estimator.
Moving away from designs with a priori fixed block sizes, Higgins et al. (2016) present a new experimental design called “threshold blocking” which produces stratifications wherein each block contains at least some number, call it $k$, of individuals in each treatment arm. Taking $k = 1$ in a treatment-control experiment then yields a design that is more flexible than pairing. Higgins et al. (2016) present a near-optimal threshold blocking algorithm when one takes minimizing the maximal within-block covariate discrepancy between any two individuals in the same block as the objective. For the classical treatment-control experiment, the optimal stratification is a mix of pairs and triplets, as any feasible stratum with four or more individuals can be broken down into substrata of sizes two or three without increasing covariate discrepancy. Sävje (2015) illustrates that this additional flexibility from allowing for both pairs and triplets can result in lower estimator variance than a paired design, much in the same way that variable ratio matching tends to outperform fixed ratio matching in observational studies (Hansen, 2004).
We define a finely stratified design as one where within each block, there is either exactly one treated individual or exactly one control individual; both paired studies and optimal stratifications returned by threshold blocking satisfy this definition. We contrast these with coarsely stratified designs, wherein each block has at least two individuals in each treatment group. Of course, in principle this experimental taxonomy is not exhaustive, as a treatment-control experiment could have both fine and coarse strata; we ignore this possibility in what follows. The preceding discussion has illustrated the theoretical merits of fine stratifications relative to coarse stratifications; however, finely stratified designs face certain “analytical limitations” avoided by coarsely stratified designs (Klar and Donner, 1997; Imbens, 2011; Sävje, 2015). As is well known, the true variance of the difference-in-means estimator for the sample average treatment effect cannot be identified without further assumptions being made on the individual-level treatment effects. Following the tradition of Neyman (1923), conventional estimators for this variance exist which are conservative in expectation with respect to the experimental design’s randomization distribution; see Gadbury (2001) for an overview. It is when considering the magnitude of conservativeness for different experimental designs’ standard variance estimators that the practical issues faced by finely stratified designs come to light.
As will be presented explicitly in §3, the conventional variance estimator for a paired experiment is conservative in expectation unless the average treatment effect is constant across pairs, in which case it is unbiased; however, the typical variance estimator in a coarsely stratified experiment is unbiased so long as the treatment effect is constant within blocks, even if the effects are heterogeneous across blocks. The practitioner must conduct hypothesis tests and form confidence intervals for the sample average treatment effect using a variance estimator appropriate for the design at hand. Hence, if the practitioner believes that the blocks in her experiment were formed on the basis of effect-modifying covariates, any benefits in precision from employing a finely stratified design may be washed away by the increased conservativeness of the corresponding variance estimator. Klar and Donner (1997) write that “these limitations lead us…to favour stratified designs in which there are at least two [units] in each stratum” (Klar and Donner, 1997, p. 1753). Imbens (2011) similarly notes that “[These limitations are an] important reason to prefer experiments with at least two units of each treatment type in each stratum” (Imbens, 2011, p. 17).
1.2 An insight from classical least squares theory
The analytical limitations of finely stratified experiments thus present an unappealing gap between theory and practice. Practical limitations hinder the actualization of theoretical benefits, an issue which we now seek to mitigate. Recent work by Aronow and Middleton (2013); Lin (2013); Fogarty (2016); Bloniarz et al. (2016) and Lu (2016) among others has shown how regression adjustment can be utilized to provide improved estimators for the average treatment effect in various experimental designs. In this work, we will illustrate how regression adjustment can be utilized to yield improved variance estimators in finely stratified experiments while using the classical difference-in-means estimator for the average treatment effect, hence preserving the so-called “hands above the table” analysis (Freedman, 2008; Lin, 2013). The key takeaway from this work is that effect modification can be exploited in a finely stratified experiment to yield improved variance estimates even when the model is misspecified. As the potential impact of effect modification is the source of the discrepancy between the variance estimators in finely and coarsely stratified experiments, this serves to close the gap between variance estimators in these respective designs. See Abadie and Imbens (2008); Ding (2016); Abadie et al. (2017) for recent work on the role of effect modification in variance estimation in related contexts.
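Before proceeding, let us take a detour into classical least squares theory to provide insight into the improvements which will follow. Suppose we have responses $y = (y_1, \ldots, y_n)^T$, and an $n \times p$ centered matrix of covariates $\tilde{X} = (I - \mathbf{1}\mathbf{1}^T/n)X$, where $I$ is the $n \times n$ identity matrix and $\mathbf{1}$ is a vector containing $n$ ones. Consider running two regressions, the first a regression of $y$ on $\mathbf{1}$ and the second a regression of $y$ on $\mathbf{1}$ and $\tilde{X}$. By orthogonality, the coefficient on the intercept column, $\hat{\beta}_0$, will equal the sample mean $\bar{y}$ in both regressions. On the other hand, the variance estimators for $\hat{\beta}_0$ will differ between the two regressions. Write $H_{\mathbf{1}} = \mathbf{1}\mathbf{1}^T/n$ and $H_{\tilde{X}} = \tilde{X}(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T$ for the two hat matrices. For the regression on the intercept, the classical variance estimator for $\hat{\beta}_0$ is $\widehat{\mathrm{var}}_{\mathbf{1}}(\hat{\beta}_0) = y^T(I - H_{\mathbf{1}})y/\{n(n-1)\}$. For a regression of $y$ on $\mathbf{1}$ and $\tilde{X}$, the classical variance for $\hat{\beta}_0$ is $\widehat{\mathrm{var}}_{\tilde{X}}(\hat{\beta}_0) = y^T(I - H_{\mathbf{1}} - H_{\tilde{X}})y/\{n(n-p-1)\}$. As a result, $(n-p-1)\,\widehat{\mathrm{var}}_{\tilde{X}}(\hat{\beta}_0) \leq (n-1)\,\widehat{\mathrm{var}}_{\mathbf{1}}(\hat{\beta}_0)$, since $H_{\tilde{X}}$ is positive semidefinite. The use of this improved variance estimator, $\widehat{\mathrm{var}}_{\tilde{X}}(\hat{\beta}_0)$, is typically justified by an ancillarity argument: if the assumptions underpinning the regression model are satisfied, then the distribution of $X$ is ancillary for inference on any slope coefficient $\beta_k$. The conditionality principle would then support conditioning on $X$ in the inference that follows, hence restricting attention to the relevant subset of the sample space.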
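The comparison between the two regressions can be checked numerically. The following is a minimal sketch for a single covariate, using only the Python standard library; the function name `intercept_variances` and the toy data are illustrative assumptions, not part of the original development. It confirms that centering the covariate leaves the intercept estimate equal to the sample mean, and that the residual sum of squares cannot increase when the covariate is added.

```python
def intercept_variances(y, x):
    """Compare classical variance estimators for the intercept in two regressions:
    y on an intercept only, and y on an intercept plus one centered covariate."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    xc = [xi - x_bar for xi in x]  # centered covariate, orthogonal to the intercept

    # With a centered covariate the fitted intercept is the sample mean in both fits.
    slope = sum(a * b for a, b in zip(xc, y)) / sum(a * a for a in xc)

    rss_intercept = sum((yi - y_bar) ** 2 for yi in y)
    rss_full = sum((yi - y_bar - slope * a) ** 2 for yi, a in zip(y, xc))

    # Classical estimators: residual variance times 1/n, with p = 1 covariate.
    var_intercept_only = rss_intercept / (n * (n - 1))
    var_with_covariate = rss_full / (n * (n - 2))
    return y_bar, rss_intercept, rss_full, var_intercept_only, var_with_covariate


# Hypothetical data in which the covariate is strongly predictive of the response.
y = [1.0, 2.1, 2.9, 4.2, 5.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_bar, rss_1, rss_x, var_1, var_x = intercept_variances(y, x)
```

When the covariate explains little of the response, the degrees-of-freedom correction can make the second estimator slightly larger in small samples, which is why the guaranteed ordering is stated for the residual sums of squares.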
Buja et al. (2014) provide an illuminating discussion not only of the classical arguments for conditioning on $X$, but also of the breakdown of these arguments in the presence of model misspecification. The fundamental issue is that when $X$ is itself considered to be random, $X$ is ancillary for inference on $\beta$ if and only if the model is correctly specified. The framework considered therein is one of a practitioner jointly sampling responses and covariates from some target population, with the target of inference being the best linear approximation to the response function for this population. In the analysis of randomized experiments, a generative model of this nature is often implausible, as individuals within a given experiment need not constitute a representative sample. As such, inference is performed on local estimands such as the average treatment effect for the individuals in the experiment at hand, with the act of randomization itself providing the basis for inference for these estimands (Neyman, 1923; Fisher, 1935; Rubin, 1974; Imbens and Rubin, 2015). For these local estimands, conditioning on the covariates for the individuals in the experiment is justified without an ancillarity argument, as the estimands are themselves defined with respect to the sample at hand. As will be illustrated, variance estimators which utilize the covariates will furnish improvements in power while facilitating Neyman-style conservative inference for the sample average treatment effect.
2 The sample average treatment effect
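2.1 Notation for a block-randomized experiment
There are $B$ independent blocks. The $i$th of $B$ blocks contains $n_i$ individuals, of whom $n_{1i}$ receive the treatment and $n_{0i} = n_i - n_{1i}$ receive the control. There are $N = \sum_{i=1}^{B} n_i$ total individuals in the study. Let $Z_{ij}$ be an indicator of whether or not the $j$th individual in block $i$ receives the treatment, such that $\sum_{j=1}^{n_i} Z_{ij} = n_{1i}$ and $\sum_{j=1}^{n_i} (1 - Z_{ij}) = n_{0i}$. A finely stratified experiment is then characterized by $\min\{n_{1i}, n_{0i}\} = 1$ for all $i$, while in a coarsely stratified experiment $\min\{n_{1i}, n_{0i}\} \geq 2$ for all $i$. Individual $j$ in block $i$ has a $P$-dimensional vector of measured covariates $x_{ij}$. Each individual has a potential outcome under treatment, $r_{Tij}$, and under control, $r_{Cij}$ ($i = 1, \ldots, B$; $j = 1, \ldots, n_i$). The pair of potential outcomes $(r_{Tij}, r_{Cij})$ is not jointly observable for any individual. Instead, we observe the response $R_{ij} = Z_{ij} r_{Tij} + (1 - Z_{ij}) r_{Cij}$ for each individual. As a consequence, the individual-level treatment effect $\tau_{ij} = r_{Tij} - r_{Cij}$ is not observable for any individual, nor is the average of the treatment effects in any block $i$, $\tau_i = n_i^{-1} \sum_{j=1}^{n_i} \tau_{ij}$ (Neyman, 1923; Rubin, 1974).
Let $\Omega$ be the set of possible values of the assignment vector $Z = (Z_{11}, \ldots, Z_{Bn_B})^T$ under the block-randomized design. Each $z \in \Omega$ has probability $|\Omega|^{-1}$ of being selected, where the notation $|A|$ denotes the cardinality of the set $A$. Let $\mathcal{Z}$ denote the event $\{Z \in \Omega\}$. Quantities dependent on the assignment vector such as $Z$ and $R = (R_{11}, \ldots, R_{Bn_B})^T$ are random, whereas $\mathcal{F} = \{r_{Tij}, r_{Cij}, x_{ij} : i = 1, \ldots, B;\ j = 1, \ldots, n_i\}$ contains fixed quantities for the experiment at hand. In a block-randomized experiment, $|\Omega| = \prod_{i=1}^{B} \binom{n_i}{n_{1i}}$, and $\mathrm{pr}(Z_{ij} = 1 \mid \mathcal{F}, \mathcal{Z}) = n_{1i}/n_i$.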
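To make the randomization distribution concrete, the sketch below enumerates every assignment vector for a small hypothetical finely stratified design and checks the count against the product-of-binomials formula, along with the marginal treatment probability of one unit. The block sizes and the helper names `block_assignments` and `omega` are assumptions made purely for illustration.

```python
from itertools import combinations, product
from math import comb, prod


def block_assignments(n_i, n_1i):
    """All 0/1 assignment vectors for one block with n_1i treated out of n_i."""
    vectors = []
    for treated in combinations(range(n_i), n_1i):
        vectors.append(tuple(1 if j in treated else 0 for j in range(n_i)))
    return vectors


def omega(block_sizes, treated_counts):
    """Cartesian product of the per-block assignments: the full design set."""
    per_block = [block_assignments(n, m) for n, m in zip(block_sizes, treated_counts)]
    return list(product(*per_block))


# Hypothetical finely stratified design: one pair and two triplets.
block_sizes = [2, 3, 3]
treated_counts = [1, 1, 2]

Omega = omega(block_sizes, treated_counts)
size_formula = prod(comb(n, m) for n, m in zip(block_sizes, treated_counts))

# Marginal probability that the first unit of the second block is treated,
# computed by counting equally likely assignments.
prob_treated = sum(z[1][0] for z in Omega) / len(Omega)
```

Exhaustive enumeration is only feasible for very small designs, since the size of the design set grows multiplicatively in the number of blocks.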
2.2 The estimand and the estimator
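The sample average treatment effect, or SATE, is defined as
$$\bar{\tau} = \frac{1}{N} \sum_{i=1}^{B} \sum_{j=1}^{n_i} \tau_{ij} = \sum_{i=1}^{B} \frac{n_i}{N} \tau_i,$$
where $N = \sum_{i=1}^{B} n_i$. The conventional unbiased estimator for $\tau_i$, the average treatment effect for individuals in block $i$, is simply the observed difference-in-means between the treated and control individuals in block $i$,
$$\hat{\tau}_i = \frac{1}{n_{1i}} \sum_{j=1}^{n_i} Z_{ij} R_{ij} - \frac{1}{n_{0i}} \sum_{j=1}^{n_i} (1 - Z_{ij}) R_{ij}.$$
The classical unbiased estimator for the overall sample average treatment effect is
$$\hat{\tau} = \sum_{i=1}^{B} \frac{n_i}{N} \hat{\tau}_i, \tag{1}$$
i.e. a weighted average of the block-specific estimators with $n_i/N$ serving as weights (Rosenbaum, 2002, Chapter 2).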
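The weighted difference-in-means estimator in (1) is simple to compute directly. The following sketch assumes the data are stored as one list of treatment indicators and one list of observed responses per block; this layout, the function names, and the toy numbers are illustrative assumptions.

```python
def block_diff_in_means(Z, R):
    """Difference in mean observed response, treated minus control, in one block."""
    n1 = sum(Z)
    n0 = len(Z) - n1
    treated_mean = sum(r for z, r in zip(Z, R) if z == 1) / n1
    control_mean = sum(r for z, r in zip(Z, R) if z == 0) / n0
    return treated_mean - control_mean


def sate_estimate(Z_blocks, R_blocks):
    """Weighted average of block-specific estimates, with weights n_i / N."""
    N = sum(len(Z) for Z in Z_blocks)
    return sum(
        (len(Z) / N) * block_diff_in_means(Z, R)
        for Z, R in zip(Z_blocks, R_blocks)
    )


# Hypothetical finely stratified data: one pair and one triplet.
Z_blocks = [[1, 0], [1, 0, 0]]
R_blocks = [[5.0, 3.0], [6.0, 1.0, 3.0]]

tau_hat_blocks = [block_diff_in_means(Z, R) for Z, R in zip(Z_blocks, R_blocks)]
tau_hat = sate_estimate(Z_blocks, R_blocks)
```

Here the pair contributes an estimate of 2 with weight 2/5 and the triplet an estimate of 4 with weight 3/5, so the larger block pulls the overall estimate toward its own value.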
3 A comparison of standard variance estimators
3.1 Conventional variance estimation in coarsely stratified experiments
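For block $i$, define the block-specific averages of the potential outcomes under treatment and control as $\bar{r}_{Ti} = n_i^{-1} \sum_{j=1}^{n_i} r_{Tij}$ and $\bar{r}_{Ci} = n_i^{-1} \sum_{j=1}^{n_i} r_{Cij}$. Further, define $S^2_{Ti}$, $S^2_{Ci}$, and $S^2_{\tau i}$ by
$$S^2_{Ti} = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (r_{Tij} - \bar{r}_{Ti})^2, \quad S^2_{Ci} = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (r_{Cij} - \bar{r}_{Ci})^2, \quad S^2_{\tau i} = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (\tau_{ij} - \tau_i)^2.$$
The variance of the sample average treatment effect estimator in block $i$, $\hat{\tau}_i$, can be expressed as (Imbens and Rubin, 2015, Theorem 6.2)
$$\mathrm{var}(\hat{\tau}_i \mid \mathcal{F}, \mathcal{Z}) = \frac{S^2_{Ti}}{n_{1i}} + \frac{S^2_{Ci}}{n_{0i}} - \frac{S^2_{\tau i}}{n_i}.$$
This immediately yields the following expression for $\mathrm{var}(\hat{\tau} \mid \mathcal{F}, \mathcal{Z})$:
$$\mathrm{var}(\hat{\tau} \mid \mathcal{F}, \mathcal{Z}) = \sum_{i=1}^{B} \left(\frac{n_i}{N}\right)^2 \left( \frac{S^2_{Ti}}{n_{1i}} + \frac{S^2_{Ci}}{n_{0i}} - \frac{S^2_{\tau i}}{n_i} \right).$$
This variance is unknown in practice because it depends on the missing potential outcomes. In a coarsely stratified experiment where we have $\min\{n_{1i}, n_{0i}\} \geq 2$ for all $i$, the conventional estimator for $\mathrm{var}(\hat{\tau} \mid \mathcal{F}, \mathcal{Z})$ is based on an appropriately weighted sum of the sample variances of the treated and control responses in each block. Let $\bar{R}_{Ti} = n_{1i}^{-1} \sum_{j=1}^{n_i} Z_{ij} R_{ij}$ and $\bar{R}_{Ci} = n_{0i}^{-1} \sum_{j=1}^{n_i} (1 - Z_{ij}) R_{ij}$ be the observed averages of responses for the treated and control individuals in block $i$. Further, let $s^2_{Ti}$ and $s^2_{Ci}$ be the sample variances for the responses of the treated and control units in block $i$,
$$s^2_{Ti} = \frac{1}{n_{1i} - 1} \sum_{j=1}^{n_i} Z_{ij} (R_{ij} - \bar{R}_{Ti})^2, \quad s^2_{Ci} = \frac{1}{n_{0i} - 1} \sum_{j=1}^{n_i} (1 - Z_{ij}) (R_{ij} - \bar{R}_{Ci})^2.$$
The classical variance estimator in a coarsely stratified experiment takes on the following form:
$$\hat{V}_{\mathrm{coarse}} = \sum_{i=1}^{B} \left(\frac{n_i}{N}\right)^2 \left( \frac{s^2_{Ti}}{n_{1i}} + \frac{s^2_{Ci}}{n_{0i}} \right).$$
A well known fact dating back to Neyman (1923) is that this estimator yields conservative inference for the sample average treatment effect, since
$$E(\hat{V}_{\mathrm{coarse}} \mid \mathcal{F}, \mathcal{Z}) - \mathrm{var}(\hat{\tau} \mid \mathcal{F}, \mathcal{Z}) = \frac{1}{N^2} \sum_{i=1}^{B} n_i S^2_{\tau i} \geq 0. \tag{2}$$
Hence, the variance estimator $\hat{V}_{\mathrm{coarse}}$ is an upper bound on $\mathrm{var}(\hat{\tau} \mid \mathcal{F}, \mathcal{Z})$ in expectation, with equality only if the treatment effect is constant within each block (i.e. if for each block $i$, $\tau_{ij} = \tau_i$ for $j = 1, \ldots, n_i$). This thus enables Neyman-style conservative inference on $\bar{\tau}$ to proceed using $\hat{V}_{\mathrm{coarse}}$.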
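The coarsely stratified variance estimator described above, a weighted sum of within-block sample variances, might be computed as in the sketch below. The per-block list layout and the function name `coarse_variance_estimate` are assumptions for illustration; the function deliberately refuses blocks with fewer than two units in either arm, since the sample variances would then be undefined.

```python
from statistics import variance  # sample variance with an n - 1 denominator


def coarse_variance_estimate(Z_blocks, R_blocks):
    """Neyman-style variance estimate for the weighted difference-in-means."""
    N = sum(len(Z) for Z in Z_blocks)
    total = 0.0
    for Z, R in zip(Z_blocks, R_blocks):
        treated = [r for z, r in zip(Z, R) if z == 1]
        control = [r for z, r in zip(Z, R) if z == 0]
        if min(len(treated), len(control)) < 2:
            raise ValueError("each block needs at least two units in each arm")
        weight = (len(Z) / N) ** 2
        total += weight * (
            variance(treated) / len(treated) + variance(control) / len(control)
        )
    return total


# Hypothetical single block with two treated and two control units.
Z_blocks = [[1, 1, 0, 0]]
R_blocks = [[3.0, 5.0, 1.0, 2.0]]
v_coarse = coarse_variance_estimate(Z_blocks, R_blocks)
```

In this toy block the treated sample variance is 2 and the control sample variance is 0.5, giving an estimate of 2/2 + 0.5/2 = 1.25.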
3.2 Classical results on variance estimation in finely stratified experiments
In a finely stratified experiment, at least one of the within-block sample variances $s^2_{Ti}$ and $s^2_{Ci}$ of the treated and control responses will be undefined, as $\min\{n_{1i}, n_{0i}\} = 1$. As a result, the coarsely stratified estimator of §3.1 cannot be employed. To the best of our knowledge there does not exist a “classical” variance estimator for the general class of finely stratified experiments without making assumptions such as additivity of treatment effects or equal variance of potential outcomes (Rosenbaum, 2002; Hansen, 2004; Sävje, 2015). In the particular case of paired designs where $n_i = 2$ for all strata, the classical variance estimator is simply the sample variance of the observed paired differences divided by the number of pairs,
$$\hat{V}_{\mathrm{pair}} = \frac{1}{B(B-1)} \sum_{i=1}^{B} (\hat{\tau}_i - \hat{\tau})^2. \tag{3}$$
Imai (2008) discusses inference for the sample average treatment effect within a paired design. Proposition 1 of that work illustrates that $\hat{V}_{\mathrm{pair}}$ is also an upper bound in expectation for $\mathrm{var}(\hat{\tau} \mid \mathcal{F}, \mathcal{Z})$, and that the degree of the bias is given by
$$E(\hat{V}_{\mathrm{pair}} \mid \mathcal{F}, \mathcal{Z}) - \mathrm{var}(\hat{\tau} \mid \mathcal{F}, \mathcal{Z}) = \frac{1}{B(B-1)} \sum_{i=1}^{B} (\tau_i - \bar{\tau})^2. \tag{4}$$
A comparison of bias expressions (2) and (4) reveals the analytical limitations alluded to in §1.1. For a paired design, the estimator in (3) is biased upwards unless the average treatment effects are the same across pairs. In a coarsely stratified design, the conventional estimator of §3.1 is unbiased if there is additivity within blocks, even if there is effect heterogeneity across blocks. If the blocks were formed using covariates that are thought to be effect modifiers, it may be the case that the coarsely stratified design yields an unbiased estimator for the variance, while the paired design would yield a variance estimator that is substantially biased upwards. Were (3) the only variance estimator available to facilitate inference in a paired experiment, the practitioner in this case may well be justified in preferring the more coarsely stratified design as a means of shrinking confidence intervals and yielding more powerful hypothesis tests.
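The bias of the paired variance estimator can be verified exactly on a small example by enumerating every possible assignment. The sketch below uses hypothetical potential outcomes for three pairs (values chosen arbitrarily, with pair-level effects that differ across pairs) and compares the exact bias of the estimator, i.e. the sample variance of the paired differences divided by the number of pairs, with the average squared deviation of the pair-level effects from their mean, scaled by 1/{B(B-1)}.

```python
from itertools import product


def paired_variance_estimate(pair_diffs):
    """Sample variance of the paired differences divided by the number of pairs."""
    B = len(pair_diffs)
    mean = sum(pair_diffs) / B
    return sum((d - mean) ** 2 for d in pair_diffs) / (B * (B - 1))


# Hypothetical potential outcomes (r_T, r_C) for the two units in each pair.
pairs = [
    ((3.0, 1.0), (2.0, 1.0)),
    ((6.0, 2.0), (5.0, 3.0)),
    ((1.0, 1.0), (4.0, 2.0)),
]
B = len(pairs)

# Pair-level average treatment effects and their overall mean (the SATE).
tau = [((u[0] - u[1]) + (v[0] - v[1])) / 2 for u, v in pairs]
tau_bar = sum(tau) / B

estimates, var_hats = [], []
for z in product([0, 1], repeat=B):  # z[i] = 1 means the first unit of pair i is treated
    diffs = []
    for (u, v), zi in zip(pairs, z):
        # Treated-minus-control observed difference in this pair.
        diffs.append(u[0] - v[1] if zi == 1 else v[0] - u[1])
    estimates.append(sum(diffs) / B)
    var_hats.append(paired_variance_estimate(diffs))

# All 2^B assignments are equally likely, so plain averages are exact expectations.
n_assign = len(estimates)
mean_est = sum(estimates) / n_assign
true_var = sum((e - mean_est) ** 2 for e in estimates) / n_assign
bias = sum(var_hats) / n_assign - true_var
bias_formula = sum((t - tau_bar) ** 2 for t in tau) / (B * (B - 1))
```

Because the pair-level effects here are 1.5, 3 and 1, the bias is strictly positive; setting all pair-level effects equal would drive both `bias` and `bias_formula` to zero.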
4 Conservative variance estimators in finely stratified experiments
4.1 Two recipes with projection matrices
Let be an arbitrary matrix with , and let be the orthogonal projection of onto the column space of . Let be the element of . Define and . Let , and let the analogous definitions hold for , , and . Finally, let be a diagonal matrix whose entry equals
Let be a diagonal matrix whose diagonal element contains . We will now show that the matrix can be used to produce two variance estimators which are conservative in expectation for
Define the first of these estimators, , as
(5) 
Proposition 1.
If is constant across all elements of :
Proof.
Define as before, and let be the covariance matrix for , a diagonal matrix with . Noting that is symmetric,
Recalling that
where the last line stems from being a projection matrix, and hence positive semidefinite. ∎
Define the second estimator, , as
(6) 
Proposition 2.
If is constant across all elements of :
Proof.
Define as before, and let be the covariance matrix for , a diagonal matrix with . Noting that is symmetric,
The element of is given by
Recalling the form of and noting that is positive semidefinite completes the proof. ∎
Propositions 1 and 2 illustrate that for any constant matrix with , the corresponding projection matrix can be utilized for conservative variance estimation in a finely stratified experiment through the estimators and defined in (5) and (6). We will first illustrate that certain choices of recover the standard variance estimator in a paired experiment when using , and further suggest two conventional estimators for finely stratified experiments with varying block sizes. We will then show that the form of the bias expressions in Propositions 1 and 2 provides insight into choices for which will provide improvements in variance estimation.
4.2 Preliminary conservative variance estimators with equal and unequal block sizes
Initially, let be a matrix with a constant column along with a column corresponding to the centered weights (note that . Define , where denotes a matrix of dimension with ones on the diagonal and zeroes everywhere else; this removes the column when block sizes are equal to avoid rank deficiency. We will now consider the implications of choosing in (5) and (6) to define a conservative variance estimator.
When block sizes are equal , and hence the diagonal elements of the hat matrix associated with equal for each observation. The variance estimator then takes on the simplified form
In the case of matched pairs, this estimator is simply the sample variance of the observed paired differences divided by the number of pairs, hence recovering the classical variance estimator. Proposition 1 of Imai (2008) for matched pairs can be viewed as a special case of our Proposition 1 with . This also indicates that an additive treatment effect model implies unbiasedness of the estimator for in finely stratified experiments with equal block sizes, even if the design is not paired. With equal block sizes, we have that , meaning that the estimator should always be preferred in this case.
With unequal block sizes, the diagonal elements of the hat matrix associated with are . Since the diagonal elements of the hat matrix depend on , the estimator will be a strict upper bound in expectation for under an additive treatment effect model for finite samples for all . So long as for all as , the estimators and will both be asymptotically unbiased for under an additive treatment effect (this condition would hold under the assumption that the block sizes are bounded, for example). In the unequal block case there is no longer a consistent ordering between and , but the discrepancies tend to be minor: as will be demonstrated in Theorem 2, appropriately scaled versions of these two estimators converge in probability to the same limit under mild conditions.
4.3 Improved variance estimation through exploiting effect modification
For each block , let be the vector of length whose entry is the average of the covariate for the individuals in block , i.e. . Let be the matrix whose column contains for . Let be the weighted covariate means adjusted for . Let . While the mutual orthogonality of , , and within is not required at this point, it facilitates forthcoming illustrations and makes clearer certain connections to heteroskedasticity consistent standard errors. Let and be the variance estimators corresponding to setting in (5) and (6).
To understand the potential benefits of the variance estimator , note that from Proposition 1 the bias in is . Under mild regularity conditions described in §4.4, the diagonal elements of the hat matrix associated with tend to 0, implying that in sufficiently large samples. We can then think of as, approximately, the mean squared error from a regression of the weighted treatment effects, , on the weighted covariates, along with an intercept and a column for the block sizes. If the matrix contains covariates which are predictive of the treatment effects in different blocks, could yield a substantially less conservative estimator for than the estimator , which does not exploit potential effect modification.
For , there is an additional connection to commonly employed standard error estimators in linear regression. In fact, since was constructed such that is orthogonal to all other columns of , exactly corresponds to the square of the HC3 heteroskedasticity consistent standard error for the intercept column in a regression of on (MacKinnon and White, 1985; Long and Ervin, 2000). The bias term for is then approximately equal to times the HC3 variance for the intercept column of a regression of on , which is itself a close approximation to the mean squared error from a regression of the weighted treatment effects on .
Importantly, Propositions 1 and 2 make no assumption about the truth of the linear model generating the projection matrix . While the magnitude of the improvement from using instead of for depends on how well the weighted covariate means predict , any choice of Q in (5) or (6) will yield a variance estimator which is conservative in expectation for . As will now be shown, under mild conditions is asymptotically no worse than for regardless of the functional form describing the relationship between the observed covariates and the stratum-specific treatment effects. Further, both and converge in probability to zero.
4.4 Asymptotic performance of variance estimators
We now give sufficient conditions which enable asymptotically valid inference for to proceed using and for . In so doing, we will also quantify the potential improvements from exploiting effect modification through the variance estimator. The finite population asymptotics presented herein embed a given experiment with strata within an infinite sequence of experiments with increasingly many blocks. To reflect their changing values along this sequence, quantities such as , and should be subscripted by for precision of notation; we omit this, trading precision for readability. Let be the hat matrix associated with as defined in the previous section, and consider the following regularity conditions.
Condition 1.
(Bounded Block Sizes) There exists a such that for all and all as .
Condition 2.
(Bounded Fourth Moments). There exists a such that, for all ,
, , and
for .
Condition 3.
(Existence of Population Moments).

, , and converge to finite limits as .

converges to a finite limit for as . Let be the vector of length containing these limits, i.e. .

converges to a finite, invertible matrix as . Call this limit .
Let . The following theorems illustrate that and for can all be used to conduct asymptotically conservative inference for the sample average treatment effect, . After establishing asymptotic normality, we demonstrate that inference using will be no less powerful than that conducted using for .
Theorem 1.
Under Conditions 1-3 and conditional on and ,
Theorem 2.
Under Conditions 1-3 and conditional on and , then for ,
Corollary 1.
For ,
The proofs are deferred to the appendix. The above results, in concert with Propositions 1 and 2, justify multiple means by which inference can be conducted for the sample average treatment effect in finely stratified experiments. The results validate new standard error estimators for inference on the sample average treatment effect in finely stratified experiments while using the classical weighted difference-in-means estimator. Furthermore, these results highlight how effect modification can be leveraged to reduce the degree of conservativeness of the performed inference. As Corollary 1 demonstrates, standard errors derived by including suitably weighted average values for covariates within blocks are, asymptotically, never worse than those derived without including covariate information.
5 Consonant and dissonant superpopulation formulations
5.1 Population-level causal estimands
The preceding results make no assumptions about the manner by which individuals were selected for inclusion into the blockrandomized experiment in the first place; that is, they neither require nor postulate the existence of a larger population from which individuals were drawn. The target of estimation, the sample average treatment effect, attests merely to the treatment effect for individuals in the sample at hand, and the act of randomization provides a reasoned basis for making probabilistic statements (Fisher, 1935). That being said, it is sometimes desired to postulate that individuals in the study at hand were in fact draws from a superpopulation, and to perform inference on the average treatment effect within that superpopulation.
5.2 Conditional average treatment effect (CATE)
As an initial superpopulation extension, suppose we consider the covariates and the block sizes as fixed and consider the pairs of potential outcomes as having arisen through the following sampling mechanism.
where are drawn from an arbitrary distribution with mean and block-specific variance-covariance matrix . Let , and let . Let be the set containing the covariates for all individuals. Within this superpopulation abstraction, the conditional average treatment effect, or , in a finely stratified experiment is defined as
(7) 
Let , and let . Note that (7) reflects the view of the covariates as fixed, in much the same way that conventional least squares theory operates under the assumption of fixed covariates. The classical unbiased estimator for the overall conditional average treatment effect remains the weighted difference-in-means estimator given in (1). The true variance for this estimator is inflated, as unlike with the sample average treatment effect we no longer condition on the potential outcomes in each block. Nonetheless, we now demonstrate that the variance estimators given in (5) and given in (6) remain conservative estimators in expectation for .
Proposition 3.
If is constant across all elements of :
where is a vector of length with . Further,
The proof is analogous to that of Propositions 1 and 2. The insights from Theorem 2 similarly extend to variance estimation for the conditional average treatment effect: regression adjustment on the average level of the covariates in a given block results in less conservative variance estimators, with the degree of improvement now dependent on the extent to which the average of the weighted covariates in a given block is able to predict , the weighted conditional average treatment effect in a block given the covariate values.
In the case of equal block sizes, if the stratum-level treatment effects are homoskedastic (i.e. is constant across all blocks), then we are also entitled to an additional variance estimator connected to standard errors. Let be a diagonal matrix whose diagonal element is , and define as
(8) 
Proposition 4.
If is constant across all elements of , block sizes are equal (such that ), and is constant across blocks:
The proof is deferred to the appendix. In the general case with across-block heteroskedasticity, unequal block sizes, or when conducting inference on the sample average treatment effect, the estimator in (8) need not be conservative in expectation. It does, however, converge in probability to the same limiting value as and , indicating that the prospect of anticonservative inference through the estimator in (8) may only be a realistic concern in small samples.
These developments demonstrate that the modes of inference presented for the sample average treatment effect in §4 yield harmonious extensions to inference on the conditional average treatment effect. That is, hypothesis tests and confidence intervals for the sample average treatment effect can also be interpreted as hypothesis tests and confidence intervals for the conditional average treatment effect, should the practitioner deem the superpopulation formulation appropriate.
5.3 Population average treatment effect (PATE)
As an alternative superpopulation formulation, suppose we now consider the block sizes as fixed, but the covariates within a given block, as random. We now consider the pair of potential outcomes