Life After Bootstrap: Residual Randomization Inference in Regression Models
Abstract
We develop a randomizationbased method for inference in regression models. The basis of inference is an invariance assumption on the regression errors, such as invariance to permutations or random signs. To test significance, the randomization method repeatedly calculates a suitable test statistic over transformations of the regression residuals according to the invariant. Inversion of the test can produce confidence intervals. We prove general conditions for asymptotic validity of this residual randomization test and illustrate in many models, including clustered errors with oneway or twoway clustering structure. We also show that finitesample validity is possible under a suitable construction, and illustrate with an exact test for a case of the Behrens–Fisher problem. The proposed method offers four main advantages over the bootstrap: (1) it addresses the inference problem in a unified way, while bootstrap typically needs to be adapted to the task; (2) it can be more powerful by exploiting a richer and more flexible set of invariances than exchangeability; (3) it does not rely on asymptotic normality; and (4) it can be valid in finite samples. In extensive empirical evaluations, including highdimensional regression and autocorrelated errors, the proposed method performs favorably against many alternatives, including bootstrap variants and asymptotic robust error methods.
Keywords: bootstrap, randomization, invariance, inferential primitive, exact inference.
Contents
 1 Introduction
 2 The residual randomization method
 3 Simple error structures
 4 Clustered error structures
 5 Applications and experiments
 6 Extensions
 7 Discussion
 8 Conclusion
 A Proofs of Main Theorems
 B Proofs of Validity Theorems
 C Concrete Algorithms
 D Details on simulations
 E Additional experiments
1 Introduction
The bootstrap (Efron, 1979, 1981) is a widely popular method for nonparametric estimation of standard errors. In a regression model,
(1) 
the residual bootstrap (Freedman, 1981; Efron and Tibshirani, 1986) uses an estimate (e.g., from OLS) to bootstrap datasets , where , and is a
sample with replacement from the residuals, , or the centered residuals, .
The residual bootstrap is more flexible and generally adapts better to the problem than other
bootstrap alternatives
To be valid asymptotically, the bootstrap requires some form of exchangeability of the errors. When such exchangeability does not hold, the method may have to be altered substantially. For instance, in regression models with clustered errors, the “cluster wild bootstrap” method (Cameron et al., 2008) resamples the entire cluster residuals with replacement, and switches their signs at random. Moreover, although it is not necessary in theory, bootstrap methods in practice rely on asymptotic normality, which is generally not plausible with small samples. This requires additional complex alterations, such as acceleration or refinement (DiCiccio and Romano, 1988; DiCiccio and Efron, 1996). In short, a bootstrap approach needs to adapt the problem to the method, rather than the method to the problem.
As a robust alternative for inference we propose the residual randomization method. To test a linear hypothesis on parameters, , with fixed, the proposed method relies on the following two key ideas. First, we make an invariance assumption on the errors, such as , where and is an algebraic group of linear maps. For example, if is the set of all permutations, the assumption is equivalent to exchangeability of errors. Second, we choose a test statistic , such that under the null hypothesis, and has an asymptotic distribution law. For example, and satisfy this requirement, where is the standard OLS estimator. Then, inference on that is valid in finite samples is possible, in principle, by comparing the observed value of against the set of alternative values .
In practice, however, this method is infeasible because the errors are generally unknown, except when we are testing a joint hypothesis on all . The feasible method is then to compare with respect to the set , where are the residuals from OLS constrained under . The feasible method is no longer valid in finite samples, in general, but is asymptotically valid under certain conditions depending on the choice of . An overarching sufficient condition for validity, which we prove in this paper (see Theorem 1), is that is similar to , in the sense that , where , is scalar and is the identity matrix. Intuitively, this ensures that the choice of does not alter substantially the information structure of the problem, so that is sufficiently close to . Moreover, a construction of may be possible so that the aforementioned similarity condition holds exactly in the sample, leading to a randomization test of significance that achieves the nominal level for any finite .
In related work, the idea of residual randomization was first developed in two seminal papers by Freedman and Lane (1983b, a). Their key result was to show that the permutation distribution of a suitable test statistic converges to a standard distribution. Freedman and Lane (1983a) described their idea as the extension of Fisher’s randomization test (Fisher, 1935) to observational studies. Interestingly, Freedman did not carry out the extension to more complex , but instead produced an important series of papers on bootstrapbased variations of his permutation inference method, including the residual bootstrap (Bickel and Freedman, 1981; Freedman, 1981; Peters and Freedman, 1984; Freedman and Peters, 1984). This work stimulated an important sequence of papers on permutation tests in linear regression models, which highlighted the asymptotic properties and power of residual randomization (Anderson and Robinson, 2001; Kennedy, 1995; Manly, 2018; Ter Braak, 1992). All of these methods, including the original work of Freedman and Lane (1983b), also rely on residual transformations, but only consider exchangeability as the main invariance assumption, i.e., denotes only permutations. Furthermore, they do not analyze complex error structures, such clustered errors with oneway or twoway clustering, as we do in this paper.
Recently, randomization tests have received increased attention, especially in causal inference, due to their robustness and minimal assumptions (Imbens and Rubin, 2015; Rosenbaum, 2002; Lehmann and Romano, 2005; Gerber and Green, 2012; Athey et al., 2018; Basse et al., 2019; Ding et al., 2016; Ding, 2017; Cattaneo et al., 2015; Canay et al., 2017, 2018; Chernozhukov et al., 2017). In the context of clustered errors, Canay et al. (2018) have made the important connection between randomization testing and “wild bootstrap” (Wu, 1986). Specifically, they proved asymptotic validity of the wild bootstrap when clusters grow in size but are finite in number, under the homogeneity condition that within each cluster is similar to the population matrix; that is, Canay et al. (2018) consider as the set of diagonal matrices with clustered random signs on the diagonal. In Section 4 of this paper, we extend this result and show that validity is possible even without the homogeneity condition, provided that the errors have some suitable of symmetry, such as exchangeability within clusters.
Our paper is organized as follows. Section 2 presents the details of the residual randomization method, and the key technical result for validity. In Section 3, we discuss simple error structures, and in Section 4 we consider clustered structures, including twoway clustering. For each particular error structure we derive theoretical conditions of asymptotic validity. Section 5 presents extensive empirical evaluations that show the benefits of the proposed randomization method, including a construction of an exact test for an instance of the famous Behrens–Fisher problem in Section 5.2.2. Finally, in Section 6 we discuss residual randomization with highdimensional regression models and autocorrelated errors, showing notable empirical success in both cases.
2 The residual randomization method
We begin with the regression model of Equation (1), written compactly as
(2) 
where are dimensional column vectors, is the matrix of covariates, and is the dimensional vector of parameters; here, is fixed and . We focus on testing linear hypotheses on the regression parameters, , of the following form:
(3) 
where and are fixed and known. These hypotheses include the individual significance tests, , for . The inversion of these tests can produce confidence intervals.
Our proposed method relies on two main components:

An inferential primitive, , as an algebraic group of linear maps.

An invariant, , as a measurable function, such that
(4)
The idea is that if there exists a test statistic , such that under , then the randomization test that compares with the values is exact, i.e., valid for any finite . In the context of linear regression, a natural approach is to define the test statistic as
(5) 
where is the standard OLS estimate of , and the invariant as . Then, by standard OLS theory, under , as required. The corresponding pvalue,
is exact (Lehmann and Romano, 2005, Theorem 15.2.3).
In practice, this test is infeasible because the errors are unknown, and so we rely on estimated residuals. The approach is then to calculate the constrained OLS estimator , such that , and then obtain the restricted residuals, . The approximate pvalue,
(6) 
where , , with fixed, defines a feasible randomization test that is also valid asymptotically under certain conditions, which we formalize in Theorem 1.
To simplify, we choose as a group of realvalued matrices. We also satisfy Equation (4), for any , by the following stronger invariance assumption:
(7) 
These simplifications add useful conceptual clarity. For example, when is the set of all permutation matrices, Equation (7) is equivalent to exchangeability of errors; or when denotes signed diagonal matrices, the assumption is equivalent to error symmetry around zero, also known as “orthant symmetry” (Efron, 1969). Moreover, finitesample invariance in Equation (7) may be necessary even in asymptotic settings, notably to show asymptotic validity of cluster sign tests when the number of clusters is finite but their sizes grow (see Section 4). Finally, under a suitable construction, finitesample invariance can produce tests that are feasible and valid even in finite samples; see Section 2.3 for details.
2.1 Concrete procedure
For the sake of concreteness, we define here the basic procedure for the residual randomization test. All subsequent tests in this paper are derivatives of this procedure for various choices of .

Given calculate the restricted OLS estimate, , such that , and then calculate the corresponding constrained residuals, .

Define the test statistic, , and let be the observed value.

Repeat , with fixed:
Sample , and store , where . 
Calculate the onesided pvalue as in Equation (6), or the twosided pvalue:
(8)
At target level , the test decision may be described by the function
(9) 
There are two main differences between the residual randomization procedure defined above and residual bootstrap. First, in the randomization procedure we can use any inferential primitive beyond just permutations. This is especially useful with complex error structures, such as clustered errors, and can also be useful for sensitivity analysis (see Section 5.1). Second, the residual randomization procedure does not rely on any form of parametric assumptions to assess significance because it calculates the reference distribution empirically through random transformations of the residuals (Step 3). In contrast, bootstrap errors typically rely on asymptotic normality.
Remark 2.1. As an intuition for why this procedure is valid, consider the idealized residual randomization test, which essentially conditions on event with . Because is an algebraic group, is the orbit of in the error sample space. Orbits are special grouptheoretic objects with the key property that they define a partition of their domain set. Thus, Equation (7) implies that where . This guarantees validity of the idealized randomization test (Lehmann and Romano, 2005, Chapter 15), and suggests asymptotic validity of the feasible residual test. Such inference based on algebraic orbits is reminiscent of the “structure of inference” (Fraser, 1968).
Remark 2.2. A confidence interval for is possible through test inversion (Rosenbaum, 2002, Section 2.6.2). Specifically, define so that , and is zero otherwise. Then, for a sequence of values, calculate the pvalue of the test, say, . Given , we report as the confidence interval for .
Remark 2.3. In some tests, the values of in Step 3 may often repeat. Then, a correction is needed for the test to achieve the target test level. See Appendix D.1 for details.
2.2 Asymptotic validity
Here, we develop theory on the asymptotic validity of the residual randomization test in Section 2.1. We start with a general theorem, and then specialize in the following sections.
Assumption 1.
For the linear regression model of Equation (2) suppose that:

is nonstochastic; is bounded and invertible;

, where is positive definite; and

there exists random variable , such that
Assumptions 1(a) and (b) are common regularity conditions on to guarantee that the estimators, and , are wellbehaved. Assumption 1(c) requires sufficient regularity in so that the test statistic has an asymptotic distribution law. This is only an existence assumption, which is relatively weak because does not have to be normal necessarily, and may even be unknown.
Theorem 1.
The proof of Theorem 1 is straightforward. The idea is to show that of the approximate test is asymptotically close to of the idealized test. The difference between these two values is exactly equal to . Under the similarity condition of Equation (10), this quantity is proportional to , which is zero under the null hypothesis.
Remark 2.4. The Gram matrix captures the information structure of the problem, also known as Fisher information. With that in mind, the condition in Equation (10) requires, intuitively, that the choice of inferential primitive, , does not alter substantially the information structure of the problem. We clarify here that grows with , but this dependence is left implicit to simplify notation; e.g., it is implicit in the limit of Equation (10).
Remark 2.5. Theorem 1 does not require finitesample invariance, as defined in Equation (7), but actually a weaker assumption of asymptotic invariance:
(11) 
However, we use the finitesample invariance in Equation (7) as our main working assumption in this paper due to its conceptual clarity; see also the discussion around Equation (7) for more detailed arguments.
Remark 2.6. Theorem 1 describes only sufficient conditions for asymptotic validity. It is therefore possible that some conditions do not hold but the test is still valid asympotically—we discuss one case in Section 3.1. It can also happen that the test is valid but powerless. For example, the residual randomization test for the intercept model, , has zero power under only exchangeability of , since here the OLS estimator (i.e., the sample mean, ) is invariant to permutations.
2.3 Finite sample validity
Here, we discuss a construction of feasible and exact randomization tests as a complement to the asymptotic result of Theorem 1. The idea is to cluster some of the datapoints in a way such that the similarity condition in Equation (10) holds in the sample. This could mean an artificial clustering of the data done by the analyst, as opposed to some natural clustering, e.g., by state or school.
We need the following notation. A clustering, , denotes a partition of datapoints , and is indexed by . Let denote the set of datapoints in the th cluster. Let us also define cluster membership indicators, namely, and ; thus, is the dimensional column vector with elements that are one only for the members of , and are zero otherwise; is the corresponding diagonal matrix.
Theorem 2.
For fixed , consider a clustering with . Suppose that for every cluster there exists such that , where are the cluster covariates. For some inferential primitive , suppose that every member transform, , can be decomposed as , where . Then, the residual randomization test with as the primitive is exact in finite samples, i.e., .
Intuitively, Theorem 2 suggests the following construction of exact residual randomization tests. First, split the data in clusters, such that each matrix within each cluster is similar to the population matrix. Second, define transforms jointly on the cluster level; e.g., flipping the signs of cluster residuals. If these transforms form an algebraic group and reflect the true error invariance, then the resulting test is exact. We illustrate this idea in an instance of the BehrensFisher problem in Section 5.2.2, where we split a binary covariate in clusters, and then apply a cluster sign test. The resulting test is exact even though there are only 30 data points.
3 Simple error structures
In this section, we begin with simple error structures. For each structure we specify an appropriate inferential primitive , and then derive conditions for asymptotic validity of residual randomization.
3.1 Exchangeable errors
The simplest error structure is when are exchangeable, so that , for any permutation matrix . We may write , where is the binary dimensional column vector that is one only at the th element, and is a permutation of elements, i.e., a bijection of on itself. Let be the space of all such permutations. Then, the inferential primitive is
(12) 
Operationally, the residual randomization test of Section 2.1 using as the primitive repeatedly calculates the test statistic, , on permutations of the constrained residuals, .
Theorem 3.
Notably, the condition of Theorem 1 does not hold here because there is bias between and , the test statistics of the feasible and idealized test, respectively. Intuitively, this bias exists because a permutation matrix does not completely decorrelate the columns of . However, the theorem shows that this bias is negligible in the limit.
3.2 Heteroskedastic errors
We continue with independent and heteroskedastic errors, where . In this setting, the seminal papers of White (1980); Eicker (1963) have established asymptotically correct inference for through heteroskedasticityconsistent variance estimation. However, in small samples these methods can be severely biased (Chesher and Jewitt, 1987), especially when the design is highly leveraged; MacKinnon and White (1985); Davidson and Flachaire (2008); Godfrey and Orme (2001) provide additional theory and simulations illustrating the problem.
For residual randomization, we can assume that the errors are symmetric around zero, and thus their distribution to be invariant to random sign flips. Formally, , where and is a vector of signs. The inferential primitive is defined as
(13) 
Operationally, the residual randomization test using as the primitive repeatedly calculates the test statistic, , on constrained residuals with their signs randomly flipped.
Theorem 4.
The residual randomization test with as the inferential primitive is very similar to the “wild bootstrap” (Wu, 1986). This bootstrap variant resamples data sets as , where is a zeromean random variable, and is a transform of the residuals, usually the identity . Mammen (1993) proposed sampling as or , such that , and ; these moment conditions lead to fast convergence of the bootstrap according to Edgeworth expansions; see also (Liu, 1988).
Remark 3.1. Recent work has pointed out that the aforementioned analytical approximations of the wild bootstrap are untrustworthy, and suggest using random signs for , also known as the Rademacher distribution (Davidson and Flachaire, 2008; Flachaire, 2005; Horowitz, 2001). The residual randomization approach sides with the latter choice. Random signs express symmetry around zero, which is more natural than the symmetries expressed by other choices.
Remark 3.2. The proof of Theorem 4 does not require that the errors are symmetric around zero in finite samples, but only that the test statistic has such symmetry asymptotically in the sense of Equation (11). Theorem 4 can therefore be more general. We choose to present the narrower formulation here because the finitesample condition of sign symmetry is conceptually clearer than the asymptotic condition; see also Section 2 and Remark 2.6 for additional discussion.
4 Clustered error structures
Here, we consider problems with a natural clustering of the data, for example, a clustering by city or state, school or classroom. Moulton (1986) and Arellano (1987) provide classical results on clusterrobust variance estimation, showing that regular OLS may be significantly biased when the cluster structure is ignored. Asymptotically correct inference is possible when the number of clusters grows (White, 1984; Hansen, 2007; Carter et al., 2017), or when the number is fixed (Ibragimov and Müller, 2010, 2016; Bester et al., 2011; Conley and Taber, 2011). However, all these parametric methods rely on asymptotics that can be biased in practice, especially when the number of clusters is small (Donald and Lang, 2007; Imbens and Kolesar, 2016; Cameron et al., 2008; Cameron and Miller, 2015); see also Angrist and Pischke (2009) and (Abadie et al., 2017) on practical discussions about the need for cluster error adjustments.
Such small sample issues can be alleviated by “cluster wild bootstrap” (Cameron et al., 2008). However, the standard bootstrap asymptotics don’t work with a finite number of clusters. In current work, Canay et al. (2018) analyze the cluster wild bootstrap in a randomization framework, and prove its asymptotic validity when the number of clusters is finite but the individual cluster sizes grow. Their result requires that is homogeneous across clusters, and similar to the population . In this section, we extend this result in various ways, and show, for instance, that such homogeneity is not necessary when errors are exchangeable within clusters. We summarize all our results in the following section, and then proceed to individual cases.
4.1 Overview of results
To proceed we need some additional notation. Let denote a sequence of clusterings of , respectively. Let and let denote that datapoint is in th cluster. The clusterings may grow infinitely but they exhibit a nested structure:
(14) 
so that a datapoint stays within its assigned cluster. This assumption is without practical loss of generality, while it allows us to posit withincluster limits (e.g., means) as grows.
In this paper, we consider three types of cluster invariances: (1) exchangeability within clusters; (2) sign symmetry across clusters; and (3) double invariance, where both symmetries hold. In all cases we assume, as usual, that the errors are independent across clusters. Our technical results on validity of residual randomization for all such invariances are summarized in Table 1.
Cluster size,  

inferential primitive  
exchangeability within clusters ()  , or  , or 
sign symmetry across clusters ()  , or  
double invariance ()  , or  , or , or 
The assumption of cluster homogeneity, denoted by “HA” in Table 1, stands out. For example, when the number of clusters is fixed , the homogeneity assumption posits that, for every cluster ,
(15) 
where is the diagonal matrix of cluster membership. As mentioned before, Equation (15) implies a homogeneous covariate correlation structure across clusters in the limit, and was first identified by Canay et al. (2018) in the context of cluster wild bootstrap. Interestingly, this condition is the asymptotic equivalent of the finitesample homogeneity condition in Theorem 2, which can yield exact tests in finite samples. However, homogeneity is not necessary for validity when the errors are exchangeable within clusters (see first row of Table 1). In this case, permuting the residuals, after cluster covariates have been centered, also leads to valid inference, except for the intercept that is differenced out. On the other hand, when the additional condition can also lead to valid inference, beyond homogeneity or centered clusters (see second column of Table 1). This requires, intuitively, that no cluster dominates the others in size.
4.2 Exchangeability within clusters
We start with the cluster variant of exchangeability in Section 3.1. Formally, let be a permutation of elements within clusters of ; i.e., if , for some , then . Let denote the space of all such permutations. Then, finitesample invariance implies that , where , for some unique . The inferential primitive is
(16) 
Operationally, the residual randomization test using as the primitive repeatedly calculates the test statistic, , on random withincluster permutations of the residuals, .
Theorem 5.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and , for all . Suppose also that Assumptions 1(a)(b) hold within clusters. Then, the residual randomization test using as the primitive is asymptotically valid if the clusters are centered; or if the homogeneity condition of Equation (15) holds and .
Theorem 5 shows two conditions for asymptotic validity of a residual randomization test that permutes the residuals within clusters. First, when the covariates are centered the test is valid even when the number of clusters is fixed; however, this precludes inference on . Second, when the homogeneity condition holds and all clusters grow in size, the test can be analyzed as a permutation within each cluster separately, and then validity follows from Theorem 3.
4.3 Sign symmetry across clusters
When the errors are not exchangeable within clusters, we may assume that the entire cluster errors are independent and signsymmetric, similar to Section 3.2. Then, finitesample invariance implies that , where , and . The inferential primitive is
(17) 
Operationally, the residual randomization test using as the primitive repeatedly calculates the test statistic, , on cluster residuals with their signs randomly flipped.
Theorem 6.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and , for all . Suppose also that Assumptions 1(a)(b) hold within clusters. If is fixed and , then the residual randomization test using as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds. If , then the test is valid also when .
Theorem 6 shows two conditions for validity of a residual randomization test that randomly flips the signs of cluster residuals. The homogeneity condition is required when the number of clusters is finite because does not converge in this setting. However, when the number of clusters grows, then the condition also guarantees validity of the test. This condition roughly requires that no cluster dominates the others in the limit.
Remark 4.1. The residual randomization test here is similar to the “cluster wild bootstrap” method of Cameron and Miller (2015). The main difference is that cluster wild bootstrap resamples the cluster residuals with replacement. Canay et al. (2018) proved the asymptotic validity of cluster wild bootstrap with finite clusters under the cluster homogeneity condition of Equation (15). In the following section, we show that homogeneity is not required when errors are also exchangeable within clusters. When exchangeability does not hold, however, the proof of Theorem 6 shows that validity is still possible if the covariates are uncorrelated. This suggests running the randomization test on decorrelated covariates (e.g., through PCA).
Remark 4.2. Canay et al. (2018) also pointed out that asymptotic validity with a fixed number of clusters requires a finitesample invariance assumption, i.e., the symmetry of errors around zero in finite samples. This is not surprising since we cannot estimate cluster variance consistently if the number of clusters is fixed. See also Remarks 2.6 and 3.2 regarding our discussion thread on such contrast between finitesample and asymptotic invariance.
4.4 Double invariance
Here, we make both invariance assumptions from the two previous sections, namely exchangeability within clusters and sign symmetry across clusters. Formally, , where , where are cluster signs and . The inferential primitive is defined as follows:
(18) 
Operationally, the residual randomization test of Section 2.1 using as the primitive repeatedly calculates the test statistic, , on permuted cluster residuals with randomly flipped signs.
Theorem 7.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and , for all . Suppose also that Assumptions 1(a)(b) hold within clusters. If is fixed and , then the residual randomization test using as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds, or when the clusters are centered. If , then the test is valid also when .
The idea behind the double invariant, , is to combine the strengths of both cluster permutations and random signs. This procedure should be used when there is no concern of heteroskedasticity within clusters, because it will generally lead to more powerful tests and sharper inference compared to the the simple cluster sign test (or the wild bootstrap). We illustrate this point in the experiments of Section 5.2.
4.5 Twoway clustering
In many problems, the data could be clustered in two ways. For example, with stateschool data one cluster could be the state and another could be the school; or in panel data, the clusters could be state and time. Here, we discuss a residual randomization test for such error structure, and prove conditions for validity. These technical conditions are more subtle than in oneway clustering because the data size may grow in either one or both cluster dimensions.
In related work, Cameron et al. (2011) proposed estimation based on the inclusionexclusion principle, where standard robust error estimates are used for every cluster individually and in combinations. Aronow et al. (2015); Fafchamps and Gubert (2007); Conley (1999) study sandwich estimators. However, all these estimators have the typical smallsample problems, including a reliance on normal asymptotics. Additionally, they may be illdefined (e.g., yielding nonpositive definite estimates), and can be computationally cumbersome.
Our approach starts by visualizing an arrangement of the errors in a twodimensional array. Data fall in cells that correspond to a “row clustering” and “column clustering”, namely and , indexed by and , respectively. Without loss of generality we assume that the clusters have the same size, so that . For datapoint , the value denotes its assigned row cluster, and its column cluster. The error structure can be represented without loss of generality as follows. Let be the th observation in row cluster and column cluster , then we can write
(19) 
The common row and column terms induce the twoway correlation in the errors. We will assume, as usual, that the error is pure noise, so that it obeys any desirable invariance. Inference therefore depends on the assumptions we put on .
Suppose that are individually exchangeable and there is only one replication per cell. Then, the errors have the simple form , and are thus invariant to entire row and column permutations. Specifically, define as the following group of permutations:
That is, is the set of permutations for which if and are in the same row (or column), then both and are mapped to the same row (or column) under any . Moreover, is a subgroup of . The inferential primitive can thus be defined as follows:
(20) 
Operationally, the residual randomization test of Section 2.1 using as the primitive essentially permutes the residuals row and columnwise.
Theorem 8.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and , for all . Suppose also that Assumptions 1(a)(b) hold within clusters with respect to both and . When the number of clusters of both grows indefinitely, the residual randomization test using as the primitive is asymptotically valid. When only the row (or only the column) cluster grows indefinitely, the same randomization test is valid if the covariates have been centered columnwise (or rowwise).
Notably, the conditions for validity in Theorem 8 are similar to Theorem 3 on unclustered exchangeability, and are actually weaker than Theorem 5 on oneway clustered exchangeability. Intuitively, in oneway clustering the randomization bias depends on the cluster covariate means, which requires additional assumptions, such as cluster homogeneity. In twoway clustering the randomization bias depends on the global means, and so validity follows exactly as in Theorem 3.
Remark 4.3. The permutation test described above can be extended to multiway clustered errors. As in the twoway clustering case, this would require an assumption of additivity of cluster errors and exchangeability within clusters.
Remark 4.4. A related test known as “quadratic assignment procedure”(QAP) (Hubert, 1986; Hubert and Schultz, 1976; Krackhardt, 1988) is used in network analysis to test independence between network data (Christakis and Fowler, 2013). In a dyadic regression model, for instance, QAP tests permute the rows and vectors of the dependent variable, , and refit the model. As such, a QAP test acts as an Ftest, but sometimes is misused to assess significance of individual parameters. More appropriate for significance is the residual randomization test based on , which is essentially the residual version of the QAP test. Dekker et al. (2007) have made a similar point based on their adaptation of the residual permutation method of Freedman and Lane (1983b).
Remark 4.5. If there are replications within cells, and exchangeability still holds, then we could simply average data within cells, and then perform the randomization test as above. Such test operates on the errors , which are still invariant to row and column permutations. The averaging may preclude inference for parameters that are differenced out.
Remark 4.6. Inference with heteroskedasticity in twoway clustered errors is more challenging. It may be tempting to switch the signs of the entire row or column residuals, but this would be wrong because changing one row affects all columns, and changing one column affects all rows. One approach is to follow a decomposition similar to ANOVA, where we difference out row and column averages. Such test operates on the transformed errors . If the number of rows (and columns) grows, then the averages converge to zero. The errors could then be treated as approximately satisfying sign symmetry. Theorem 1 could be extended to include such approximate symmetries, but we leave this for future work.
5 Applications and experiments
In this section, we perform an extensive evaluation of the residual randomization tests introduced earlier. Our comparisons include both the bootstrap and its variants, and also standard asymptotic methods from the robust standard error literature. Exact algorithmic details on the methods used here can be found in Appendix D.
5.1 Hormone data
In the hormone data of Efron and Tibshirani (1986, Section 9.3), there are 27 medical devices, each producing a data point ; here, is number of hours the device is worn by a patient, and is remaining amount of hormone on the device. The model is simple linear regression, , where are zeromean, independent, and possibly not normal errors. Efron and Tibshirani (1986) illustrated bootstrap in estimating standard errors for . Here, we use the residual randomization method to test , and then invert the test to get a confidence interval. Thus, we set and in the null hypothesis of Equation (3).
We start by assuming only exchangeability of errors, and so we use the inferential primitive , defined in Equation (12). In this simple setting, the randomization procedure is equivalent to regressing permutations of the constrained residuals, , with , and then comparing the estimated slopes of these regressions with the standard OLS estimate . We calculate the pvalues based on iterations of the randomization procedure.
The null hypothesis, , is strongly rejected (pvalue ) with the residual randomization test. To produce a confidence interval we follow the process outlined in Remark 2.3. Specifically, we consider a sequence of null hypotheses, , for various values of . Using the output of regular OLS, we may focus the range to . The (twosided) pvalues for the entire sequence of tests are shown in Figure 1. From the plot, we can extract the values of that lead to a pvalue larger than , and determine the 95% confidence interval. In the figure, the interval is marked by the two vertical lines, and is roughly equal to .
We can extend this analysis to include heteroskedastic errors. For example, we can repeat the test inversion described above, this time using random signs as the inferential primitive; see Equation (13). Moreover, the data can be clustered according to the manufacturer of the device; in the data there are 3 manufacturers, each providing 9 data points. We can therefore do inference using withincluster permutations or random signs, as described in Section 4. The results of this analysis are shown in Table 2, juxtaposed with results from regular OLS and bootstrap (Efron and Tibshirani, 1986, Table 9.2). We see that the standard errors from all methods roughly agree. The exception is the results from cluster random signs, which cannot produce significance at the 5% level since there are only 3 clusters. This would be concerning if we were not willing to assume exchangeability of errors within clusters.
In summary, the numerical example of this section illustrates that the residual randomization method provides nonparametric estimation of the standard error in a straightforward manner like bootstrap. Additionally, the example shows that the randomization method can be useful for sensitivity analysis simply by doing inference under different invariance assumptions. In doing so, the randomization method is more versatile than bootstrap since one single procedure (presented in Section 2.1) underlies all of the aforementioned tests, whereas a bootstrap approach would require the use of an appropriate variant each time.
inference method  midpoint estimate  s.e.  95% interval  

OLS  .  
bootstrap  

permutations  
random signs  





5.2 Clustered errors
Here, we illustrate the theory and methods of Section 4 on clustered errors. There are two main takeaways. First, using richer inferential primitives can yield substantially more powerful tests (Section 5.2.1). Second, exact tests in finite samples are possible by appropriately adapting to the data structure. We illustrate the idea in the famous Behrens–Fisher problem (Section 5.2.2).
Simulated study
Here, we use a familiar simulation setup (Cameron et al., 2008; Hansen, 2018; Canay et al., 2018) with oneway clustering, and errors of the form
(21) 
The random effects term, , can induce withincluster correlation. We consider two cases, and . All errors are independent. The data are simulated as , with a scalar. Following Cameron et al. (2008), can also have a clustered structure:
(22) 
For the cluster term we consider two cases: (1) , and (2) , the lognormal distribution. The second choice is interesting because lognormal covariates produce a regression model with high leverage points, which makes the estimation problem harder. We consider clusters with ; each cluster has 30 datapoints, and thus there are datapoints in total. When we want to induce heteroskedasticity, we scale the errors in Equation (21) by . Throughout all simulations, we set , and for the homoskedastic case, and for the heteroskedastic.
The results for testing the true null hypothesis, , are shown in Table 3. The table shows rejection frequencies over 5000 replications at 5% significance level, and includes cluster robust error estimates (see Appendix D.3 for details). For each replication, every randomization test uses replications internally.
First, we focus on Panel (A) with homoskedastic errors. As expected, standard OLS achieves the nominal level when there is no cluster correlation , but fails considerably when the errors are clustered. The cluster robust errors are a definite improvement but are also far from the nominal level. This is likely due to the small sample, since rejection rates do improve as grows. The residual randomization tests work well across the board: rejection rates are around 5% even when the covariates are lognormal, as shown in Columns marked by (2).
We now turn our attention to the heteroskedastic setting in Panel (B) of Table 3. Standard OLS now fails across all settings, as expected. The robust error method works roughly as well as in Panel (A). Notably, the lognormal covariates seem to affect the robust method even more in this setting. For example, the rejection rates can be around 14% with lognormal covariates, almost the nominal level. On the other hand, the cluster sign randomization test shows remarkable robustness. It mostly achieves the nominal rate with normal covariates. And with lognormal covariates, its worst rate is around 8% when , but otherwise it remains close to the nominal level of 5%. All other randomization tests don’t work well. This is expected because these tests include permutations within clusters, which is invalid when there is heteroskedasticity.
Next, we discuss power results when testing the false null hypothesis, , shown in Table 4. We omit the results for the heteroskedastic panel in this case, since all tests except for the sign test don’t come close to the correct level. Overall, the randomization tests achieve good power compared to robust OLS. The double randomization test is more powerful than both individual tests in the left subpanel of Table 4, where . In the right subpanel, where the errors are homoskedastic and clustered, the double test remains more powerful than the sign test.
Notably, in that same subpanel, the permutation test now performs much better than the rest, even compared to the double test. For example, when the permutation test is roughly 6 more powerful than the other tests. This may seem to contradict our argument that more invariances generally lead to more power. The result can be explained, however. By construction of the simulation, with and both zeromean and independent normals, the randomized Gram matrix, , for the permutation test is similar to the population . The similarity condition in Theorem 1 holds with good precision, suggesting an almost exact test. To check this theory, we can try to set , where , and , which produces heterogeneous clusters by emphasizing the differences between clusters (terms ) and dampening their similarity (terms ). With this setup, and all else equal, the rejection rates are now , and over 5000 replications for the sign, permutation, and double test, respectively, compared to and of Table 4. This reversal cautions that testing power depends heavily on the problem structure.
(A) Homoskedastic  

cluster error term ()  
(nonclustered errors)  
number of clusters ()  
OLS  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2) 
standard  0.057  0.051  0.054  0.051  0.050  0.053  0.490  0.382  0.480  0.394  0.493  0.421 
HC2, robust  0.086  0.090  0.076  0.079  0.061  0.077  0.103  0.110  0.081  0.089  0.081  0.090 
Randomizations  
signs, across  0.059  0.047  0.054  0.049  0.047  0.054  0.053  0.055  0.056  0.048  0.055  0.050 
perms.,within  0.053  0.050  0.046  0.050  0.052  0.051  0.051  0.048  0.048  0.051  0.054  0.056 
double  0.061  0.054  0.056  0.052  0.051  0.056  0.055  0.052  0.054  0.046  0.051  0.050 
(B) Heteroskedastic  
OLS  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2) 
standard  0.228  0.249  0.244  0.264  0.239  0.286  0.278  0.274  0.301  0.303  0.301  0.309 
cluster robust  0.095  0.140  0.085  0.116  0.074  0.114  0.100  0.126  0.091  0.124  0.075  0.121 
Randomizations  
signs, across  0.055  0.084  0.055  0.072  0.052  0.072  0.049  0.065  0.059  0.071  0.056  0.072 
perms., within  0.152  0.162  0.155  0.148  0.145  0.153  0.154  0.150  0.166  0.149  0.154  0.144 
double  0.205  0.194  0.198  0.174  0.183  0.170  0.166  0.167  0.168  0.163  0.150  0.155 
(A) Homoskedastic  

cluster error term ()  
(nonclustered errors)  
number of clusters ()  
OLS  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2) 
standard  0.649  0.575  0.816  0.742  0.923  0.859  0.572  0.495  0.587  0.536  0.629  0.589 
cluster robust  0.646  0.582  0.804  0.730  0.905  0.849  0.166  0.207  0.162  0.209  0.177  0.237 
Randomizations  
signs, across  0.532  0.450  0.744  0.638  0.875  0.791  0.119  0.128  0.129  0.147 