A Permutation Test on Complex Sample Data
Daniell Toth
Bureau of Labor Statistics
Daniell Toth is Senior Mathematical Statistician, Office of Survey Methods Research, Bureau of Labor Statistics, Suite 3950, Washington, DC 20212 (email: toth.daniell@bls.gov)
Abstract
Permutation tests are a distribution-free way of performing hypothesis tests. These tests rely on the condition that the observed data are exchangeable among the groups being tested under the null hypothesis. This assumption is easily satisfied for data obtained from a simple random sample or a controlled study after simple adjustments to the data, but there is no general method for adjusting survey data collected using a complex sample design to allow for permutation tests. In this article, we propose a general method for performing a pseudo-permutation test that accounts for the complex sample design. The proposed method is not a true permutation test in that the new values do not, in general, come from the set of observed values, but from an expanded set of values satisfying a random-effects model on the clustered residuals. Tests using a simulated population, comparing the performance of the proposed method to permutation tests that ignore the sample design, demonstrate that it is necessary to account for certain design features in order to obtain reasonable p-value estimates.
Keywords: cluster sample; hypothesis test; survey data; p-value; nonparametric.
1 Introduction
The permutation test is a simple test for assessing the significance of an association between a random variable and group membership. It was proposed originally for data from designed experiments (Fisher, 1935) and then more generally for observed data (Pitman, 1938). Given a dataset containing observations of a variable and the corresponding group (or treatment) labels, the permutation test provides an estimate of the distribution of a test statistic, conditioned on the observed data, under the null hypothesis that the group labels are independent of the values. This estimated conditional distribution is constructed by calculating the test statistic for all possible permutations of the observed values under the null hypothesis. A p-value is then obtained by comparing the value of the test statistic on the original data to this distribution.
Despite being first proposed for designed experiments, which are strongly related to survey sample designs (Fienberg and Tanur, 1996), permutation tests have not been applied to survey data as generally as they have been to data from designed experiments (Good, 2005). Indeed, the key assumption of exchangeability (Kingman, 1978) is often violated for survey data. Unlike experimental designs, where simple adaptations of the test statistic have allowed these tests to be applied to the data, no adaptations have been proposed that allow permutation tests to be applied to data collected using a general complex sample design.
The purpose of this article is to propose a method, following the procedure of Welch (1990), for a randomization test on group effects using data obtained from a complex sample. In order to permute values within and across clusters, we adopt a model-based method like that of Scott and Holt (1982) for estimating cluster effects, leading to what we call a pseudo-permutation test. We show that estimating this model does not prevent the method from leading to a consistent permutation test under certain conditions. In Section 2 we describe a general permutation test on independent, identically distributed (iid) data and then provide a method for conducting the test on complex sample data. We demonstrate the method through simulations in Section 3. In Section 4 we apply this method to an analysis of consumer expenditure data. A discussion of the results is provided in Section 5.
2 Permutation Tests
Consider a data set consisting of observations of a continuous random variable along with corresponding group labels. Permutation tests are based on the idea that if the variable is independent of the group labels, then we are equally likely to have observed a dataset with the same observed values and labels but with the assignment between the values and the group labels permuted. A test statistic is computed on several permuted datasets and compared to the value of the test statistic under the observed order. If the value of the observed test statistic is considered too extreme relative to its values over the permutations, the null hypothesis of independence is rejected.
2.1 A Test on iid Data
Suppose is a continuous random variable satisfying
(1) 
for where are unknown constants and each is an independent and identically distributed (iid) random variable with mean 0 and finite variance from an unknown density Without loss of generality, we will only consider the case when there are two groups and we are testing the hypothesis
(2) 
or equivalently The conditional probability of observing given is
If then is independent of so the probability of given becomes
Given a vector let represent a random permutation of and if is the th value in , let denote the th value in the permuted vector . Then, since are iid, the distribution of given is the same as the distribution of the permuted values of under the null hypothesis (Cox and Hinkley, 1979, Chapter 6.2). That is
(3) 
Therefore, we can estimate the conditional distribution of any finite, deterministic function of by computing the values of using permutations of the values of These values provide an empirical distribution that is conditional on the observed values of under the assumption that and are independent.
For example, in order to test the hypothesis given in (2) we consider the test statistic
where and is an indicator function. Compute the statistic under the observed order of and compare this value to the empirical distribution of obtained by computing the test statistic for many random permutations of the values. If all possible permutations are used to compute the distribution of the statistic, this is called an exact test, whereas if a large number of randomly generated permutations is used to approximate this distribution, it is called a randomization test (Good, 2005).
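The randomization test described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: it assumes the test statistic is the difference between the two group means, and the function names, the default of 2,000 random permutations, and the fixed seed are all illustrative choices.

```python
import random
from statistics import mean


def mean_difference(y, g):
    """Difference between the group-1 and group-0 sample means (assumed statistic)."""
    ones = [yi for yi, gi in zip(y, g) if gi == 1]
    zeros = [yi for yi, gi in zip(y, g) if gi == 0]
    return mean(ones) - mean(zeros)


def randomization_test(y, g, n_perm=2000, seed=0):
    """Estimate a two-sided p-value by randomly permuting the values over the labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(y, g))
    values = list(y)  # copy so the caller's data are not reordered
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # one random reassignment of values to labels
        if abs(mean_difference(values, g)) >= observed:
            hits += 1
    return hits / n_perm


if __name__ == "__main__":
    y = [1, 2, 3, 4, 11, 12, 13, 14]
    g = [0, 0, 0, 0, 1, 1, 1, 1]
    print(randomization_test(y, g))
```

Enumerating every permutation instead of sampling them would turn this into the exact test; for more than a few dozen observations that is rarely feasible, which is why the randomization version is the usual choice.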
2.2 Data from a Complex Sample Design
Survey data are often collected under a sample design that invalidates the iid assumption for observed units. A design that causes the distribution of observed values of the variable of interest to differ from the distribution of the variable of interest in the population is called an informative design. Analysis that ignores the sample design can lead to invalid inference (Holt et al., 1980; Pfeffermann, 1993).
For informative sample designs, auxiliary data must be available for each observation before the sample is drawn for use in the sample design. These auxiliary data can be used to stratify the population and select observations from each stratum separately, identify clusters to select instead of individual observations, select certain observations with higher probability than others based on the values of an auxiliary variable, or a combination of these. Some of these design features are likely to provide observed data that violate the assumptions under model (1). For instance, data collected from a cluster sample are likely to have values that are more homogeneous within clusters than over the whole population, and values of the variable of interest are usually related to the variables used to stratify the population as well as to the sampling probabilities.
As in Scott and Holt (1982), we consider a sample of total observations drawn from clusters. The observations in the sample are indexed by where is the number of units in cluster and Here represents the variable of interest, is the vector of auxiliary random variables associated with each unit, and the corresponding group label. We assume the first variable in the auxiliary data is for each and and that the values of the auxiliary data are known for all units in the population. The values of and are collected from the sample units and so only values from the sampled units are observed. We are interested in testing the null hypothesis
(4) 
using the observed data.
Suppose we model the conditional expectation of by the linear equation
(5) 
for some unknown vector of coefficients Then the estimated vector of coefficients obtained using the design-consistent estimating equation (Binder, 1983)
is the solution to the equation
(6) 
for all where is the sample weight for the observation in cluster
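To make the idea of a weighted estimating equation concrete, the sketch below assumes the usual weighted normal equations for a linear model, in which the weighted residuals and the weighted products of residuals and auxiliary values each sum to zero. It handles only an intercept and a single auxiliary variable; the names `weighted_linear_fit` and `weighted_residuals` are hypothetical and the exact form used in the article may differ.

```python
def weighted_linear_fit(z, y, w):
    """Solve sum_k w_k * x_k * (y_k - x_k' beta) = 0 for x_k = (1, z_k).

    Assumed form of the estimating equation; returns (intercept, slope).
    """
    sw = sum(w)
    swz = sum(wk * zk for wk, zk in zip(w, z))
    swy = sum(wk * yk for wk, yk in zip(w, y))
    swzz = sum(wk * zk * zk for wk, zk in zip(w, z))
    swzy = sum(wk * zk * yk for wk, zk, yk in zip(w, z, y))
    det = sw * swzz - swz ** 2
    if det == 0:
        raise ValueError("auxiliary variable has no weighted variation")
    slope = (sw * swzy - swz * swy) / det
    intercept = (swy - slope * swz) / sw
    return intercept, slope


def weighted_residuals(z, y, w):
    """Residuals of the observed values from the weighted linear fit."""
    intercept, slope = weighted_linear_fit(z, y, w)
    return [yk - (intercept + slope * zk) for zk, yk in zip(z, y)]


if __name__ == "__main__":
    z = [0, 1, 2, 3]
    y = [1, 3, 2, 5]
    w = [1, 2, 1, 4]  # sample weights
    print(weighted_linear_fit(z, y, w))
    print(weighted_residuals(z, y, w))
```

Because the intercept is part of the model, the weighted residuals returned here sum to zero, which is the property the residual-based test relies on.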
Define the residual of the estimated conditional expectation for observation Then
for each under the null hypothesis (4), including where We will now derive a permutation test on these residuals as proposed by Gail et al. (1988).
Consider the sum of weighted residuals for only units with a particular group label, such as
(7) 
Under the null hypothesis given by (4), the test statistic defined by equation (7) has expected value
In order to test the null hypothesis, we need to compute over all permutations of observed values, but unlike under the model defined by equation (1), the values of are not necessarily exchangeable. Although we account for much of the design through the model (5) and the sample weights, values of from different clusters are not necessarily exchangeable. Therefore, we assume a model for like the one used by Scott and Smith (1969) for multistage surveys,
(8) 
where and are independent, continuous random variables with mean 0 with distribution functions and respectively. Under these assumptions and the null hypothesis
(9)  
(10)  
(11) 
This leads to a method for conducting permutation tests using data from a complex sample design by permuting the estimated values of the cluster effects and error terms according to equation (11). Next we describe a multistep procedure for obtaining a set of permuted pseudo-values of the set
2.3 Method and Conditions
By a random permutation of we really mean a vector of random permutations where is a random permutation of the set of indices and is a random permutation of for each If we denote a set of permuted values of by then The cluster effects are permuted and then the are permuted within each cluster. For a given set of observed let denote the set of all such permutations.
If we randomly select random permutations from , then for a constant the probability under the null hypothesis can be estimated by
(12) 
(Flury, 1997, Chapter 6.7). Note that even though and for are distribution functions of continuous random variables, the value of the test statistic could be equal for two different permutations and in This occurs if the permuted values of that have group label are the same under both permutations and for all where is the number of observations from cluster that have group label 1. The unique values of the test statistic applied to permuted form equivalence classes of permutations in and we let be the set of unique values. Therefore, equation (12) is estimating the proportion of permutations that are in equivalence classes with values of the test statistic that are greater than or equal to the value of the test statistic on the observed data.
Since the values of and are unknown for each the next step is to estimate these values in order to perform the permutations. The cluster mean, for each cluster is estimated by
(13) 
and The permuted pseudo-values are obtained by adding the estimated value of the permuted cluster effects to the permuted values of in cluster Since these new values of lead to values that are not in the original vector of values this is not a true permutation of but rather a set of pseudo-values. These pseudo-values are the permuted values of under the assumed model (8) for the true values of and
The following result states that the effect of replacing the true with these estimated values in equation (12) is small and vanishes asymptotically under certain conditions. In order to obtain asymptotic results, we consider samples of increasing size, from a clustered superpopulation model satisfying equations (5) and (8). We use the notation and to remind us that the number of clusters, unique values of the test statistic, and the set of all possible permutations on the data under the proposed method depend on the sample. Obviously the data depend on the sample and sample size, but we suppress the subscript to reduce the complexity of the notation. The conditions stated for the next result are assumed to occur with probability 1 with respect to this superpopulation model.
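The procedure of this section can be sketched as follows. The sketch rests on several labeled assumptions: the cluster effect is estimated by the weighted mean of the residuals in the cluster, the test statistic is the weighted sum of residuals over units with group label 1, and the p-value is the proportion of random permutations whose absolute statistic is at least the observed one. Function names, the nested-list data layout, and the small numerical tolerance are illustrative choices.

```python
import random


def cluster_effects(resid, weights):
    """Split residuals into estimated cluster effects and unit-level errors.

    Assumption: the cluster effect is the weighted mean residual in the cluster.
    """
    alpha, eps = [], []
    for r_i, w_i in zip(resid, weights):
        a = sum(w * r for w, r in zip(w_i, r_i)) / sum(w_i)
        alpha.append(a)
        eps.append([r - a for r in r_i])
    return alpha, eps


def weighted_group_sum(values, weights, labels):
    """Weighted sum of the values over units whose group label is 1."""
    return sum(
        w * v
        for v_i, w_i, g_i in zip(values, weights, labels)
        for v, w, g in zip(v_i, w_i, g_i)
        if g == 1
    )


def pseudo_permutation_pvalue(resid, weights, labels, n_perm=2000, seed=0):
    """Permute cluster effects across clusters and errors within clusters."""
    rng = random.Random(seed)
    alpha, eps = cluster_effects(resid, weights)
    observed = abs(weighted_group_sum(resid, weights, labels))
    hits = 0
    for _ in range(n_perm):
        perm_alpha = rng.sample(alpha, len(alpha))  # permuted cluster effects
        pseudo = []
        for a, e_i in zip(perm_alpha, eps):
            perm_eps = rng.sample(e_i, len(e_i))  # permuted within-cluster errors
            pseudo.append([a + e for e in perm_eps])  # pseudo-values for the cluster
        stat = abs(weighted_group_sum(pseudo, weights, labels))
        if stat >= observed - 1e-12:  # tolerance guards against rounding in ties
            hits += 1
    return hits / n_perm


if __name__ == "__main__":
    resid = [[4, 5, 5, 6]] * 3 + [[-6, -5, -5, -4]] * 3
    weights = [[1, 1, 1, 1]] * 6
    labels = [[1, 1, 1, 1]] * 3 + [[0, 0, 0, 0]] * 3
    print(pseudo_permutation_pvalue(resid, weights, labels))
```

In the toy data every unit in a cluster shares the same label, so only the six cluster effects carry information and the estimated p-value stays near 0.1, whereas a test that permuted all 24 values freely would report a far smaller value.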
Proposition 2.1
Suppose a sample of observations from clusters, is drawn from the superpopulation model. If the following conditions are satisfied:

for some and all

such that


then for
Condition 1 assumes that the residuals from the model, equation (5), have a finite variance, so that the Central Limit Theorem applies to the error term obtained from estimating the cluster effect values Condition 2 requires the difference of the absolute values of the test statistic between equivalence classes to be uniformly bounded away from 0. The next two conditions pertain to the sample design. Condition 3 requires that the number of observations within each cluster increases as increases, but allows for the number of clusters sampled to increase as the sample size increases. Condition 4 requires that the difference in the number of observations from a cluster that have group label 1 is bounded for all clusters.
Proof:
Let be the error in the estimation of from equation (13). Then the test statistic defined in (7), using these permuted pseudo-values, is
where
Define the inverse function of a permutation as the integer-valued function such that implies Then the difference between the value of the test statistic for the permuted pseudo-values and the true permuted values can be written as
By Condition 4 there exists a such that for all Since the random variable is the sum of iid random variables with zero mean and finite variance for each by Condition 1, is a mean-zero random variable with variance where Therefore, by Condition 3, as with probability 1 with respect to the superpopulation model.
Now, let be the absolute value of the test statistic on the original order of the data, and be a fixed permutation of the data using the above procedure. We now consider the value of . If is in the same equivalence class as the null permutation, then
(14) 
otherwise,
Since as for large enough by Condition 2. Therefore,
(15) 
3 Simulations
For testing the method, we generated a finite population consisting of 500 clusters with 20 observations each, for a total population size of 10,000 observations. Each observation where is observation of cluster contains values for 4 random variables. The continuous random variable represents the variable of interest and variables  the corresponding group labels.
All of the group labels were generated from Bernoulli random variables with equal probability () and have varying amounts of clustering. Label was generated from iid Bernoulli random variables with for all and so are independent of cluster label. Label was generated from independent Bernoulli random variables with where for each cluster was drawn from a U random variable, so each cluster has more or fewer observations labeled 1 than other clusters. Labels for all in cluster where was generated from iid Bernoulli random variables with for each Therefore, every observation has the same label within a cluster.
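The three labeling schemes can be sketched as below. The parameters are assumptions made for illustration only: "equal probability" is taken to mean a success probability of 0.5, and the cluster-level probability for the second label is drawn from a uniform distribution on (0, 1), since the range used in the article is not stated here. The function name and the tuple layout are hypothetical.

```python
import random


def generate_labels(n_clusters=500, cluster_size=20, seed=0):
    """Return (cluster, unit, g1, g2, g3) tuples with increasing label clustering."""
    rng = random.Random(seed)
    population = []
    for i in range(n_clusters):
        p_i = rng.uniform(0.0, 1.0)  # assumed range for the cluster-level probability
        g3 = int(rng.random() < 0.5)  # one draw per cluster, shared by all its units
        for j in range(cluster_size):
            g1 = int(rng.random() < 0.5)  # independent of cluster membership
            g2 = int(rng.random() < p_i)  # clustered through the shared p_i
            population.append((i, j, g1, g2, g3))
    return population


if __name__ == "__main__":
    pop = generate_labels()
    print(len(pop))  # 500 clusters of 20 units each
```

The first label carries no cluster information, the second is partially clustered, and the third is constant within each cluster, which gives a graded set of cases for comparing the two tests.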
The observations of the variable of interest were generated as iid random variables with distribution given by
(16) 
where and is a deterministic function of group label where is constant and The simulation results presented in this article were obtained using the values and or where is the standard deviation of the random variable defined by equation (8). Figure 1 shows the distribution of the simulated values for when
We compared the performance of a hypothesis test based on the proposed pseudo-permutation method to the regular permutation test. The test was done over several different sample designs of different sizes. Taking 2,000 independent samples from the finite population, using a given sample design, and computing the p-value obtained from the proposed test and the regular test for each sample, we obtain a vector of 2,000 estimated p-values for each test and the corresponding value of the test statistic given in equation (7).
When the null hypothesis is true, the empirical distribution of the set of test statistic values can be used to estimate the true p-value. This estimate is then compared to the estimated p-values from the permutation tests obtained for each test statistic value. When the null hypothesis is false, the empirical distribution of the set of estimated p-values from a permutation test can be used to estimate the power of the test.
For example, consider the test on the label variable The top graph of Figure 2 displays (thick light-grey line) the p-values estimated from the empirical distribution of test statistics observed over the 2,000 simple random samples (srs) of size 60 along with the estimated p-values from the pseudo-permutation test (orange solid line) and the regular permutation test (black dotted line) over the observed values of the test statistic. Under the srs design, the regular permutation test gives p-values that match the empirical distribution perfectly; the black dotted line overlaps the empirical distribution. The pseudo-permutation test gives higher estimated p-values for lower values of the test statistic than the empirical distribution and regular permutation test, which leads to it having less power than the regular test.
The bottom graph of Figure 2 displays the power (the proportion of times the test rejected the null) when of the regular permutation test (black dotted line) and the pseudo-permutation test (orange solid line) for p-values between 0 and 0.1. The pseudo-permutation test can be seen to have lower power than the regular test when testing at low (< .02) significance levels under an srs design.
Though the estimated p-values obtained from the pseudo-permutation test are a little higher (too high) than those from the regular test for most values of the test statistic, the regular test provided p-values that are slightly too low. Indeed, the probability of rejection when the null hypothesis is true is 0.053 for the pseudo-permutation test compared to for the regular test at the .05 level, and 0.012 compared to 0.01 at the .01 significance level. Overall, the p-values produced by both tests were acceptable in all of the srs designs (for all variable labels  and sample sizes ) we tested.
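The two summaries described in this paragraph reduce to simple proportions over the repeated samples. The helper names below are hypothetical, and the two-sided comparison on absolute values is an assumption about how the empirical p-value is formed.

```python
def empirical_p_value(stat, null_stats):
    """Share of test statistics from null samples at least as extreme as stat."""
    return sum(abs(t) >= abs(stat) for t in null_stats) / len(null_stats)


def rejection_rate(p_values, level):
    """Share of samples in which the test rejects at the given level.

    Under the null hypothesis this estimates the type I error rate;
    under an alternative it estimates the power.
    """
    return sum(p <= level for p in p_values) / len(p_values)


if __name__ == "__main__":
    print(empirical_p_value(2.0, [-3, 1, 2, 0.5]))
    print(rejection_rate([0.01, 0.2, 0.04, 0.5], 0.05))
```

A well-calibrated test should give a rejection rate close to the nominal level when the null hypothesis holds, which is the comparison made in the figures.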
The performance of both tests improved on data from stratified designs. For our tests, we stratified the population based on the quartile values of group label, or both. Both tests improved for stratified designs even when the units were sampled with unequal probability of selection, when the probabilities were related to the group label being tested. When the sample design included unequal probabilities of selection that were related to the values of only the pseudo-permutation test (adjusted for the sample design) performed reasonably.
Figure 3 shows the results of the tests on label when for stratified sample designs. The results in the top two graphs respectively are for a stratified equal probability of selection design and a stratified design, where units with group label were sampled at twice the rate of units with group label The estimated p-values of both tests follow the empirical p-values under both designs.
The third graph in Figure 3 contains results of the test for a stratified design where the probability of selection was higher for larger quartiles. In this case, the test using the weight-adjusted estimator produces p-values that closely follow the empirical p-values, while the unadjusted test failed to produce reasonable p-values. For example, the probability of mistakenly rejecting the null hypothesis at the 5% significance level was only 0.004 for the unadjusted test compared to 0.05 for the pseudo-permutation test adjusted using the sample weights.
The final graph in Figure 3 shows the results for a design with strata based on quartiles of and group label Units were selected so that units in the larger quartiles of and with label were selected with higher probability than units in lower quartiles of or with label In this case, the varying weights made the tests less efficient, but again only the pseudo-permutation test adjusted using the sample weights produced reasonable p-value estimates.
Figure 4 displays the results of the tests on 2,000 repeated samples of 20 randomly selected clusters, when over group labels and The top graph displays the results for the test of group label under the null hypothesis. In this case, the estimated p-values from the two tests and the empirical distribution are indistinguishable because the cluster ids and the labels are independent; thus both tests in this case do an excellent job of approximating the true p-value of the test statistic. The results displayed in the bottom two graphs of Figure 4, testing group labels and respectively, demonstrate that ignoring the cluster design when the group labels are more homogeneous within clusters leads to misleadingly low p-value estimates using the regular permutation test. Meanwhile, the proposed pseudo-permutation test produces p-values that closely match the empirical distribution under all three labels.
This robustness of the pseudo-permutation test under cluster sampling comes at a cost in power relative to the regular permutation test when there is no association between the group labels and cluster id. The three graphs of Figure 5 show the power of the pseudo-permutation test (orange solid line) and the regular permutation test (red dotted line) over different levels of significance when for the three group labels: and respectively, when the data come from a simple random sample of 20 clusters. We can see that the two tests have the same power testing group but the pseudo-permutation test has considerably less power than the regular test when testing groups and Since the regular permutation test produces p-values that are much too small under the null hypothesis for tests on labels and the power is meaningless for this test and is only provided as a reference against which to compare the power of the pseudo-permutation test.
Looking at the last graph of Figure 5, the pseudo-permutation test for label in particular appears to have considerably reduced power. However, because all units in each cluster have the same label value, the pseudo-permutation test reduces to a test of 20 observations. In fact, the two-sided t-test on 20 observations at the 5% significance level has power 0.562 compared to 0.54, the power of the pseudo-permutation test.
4 Consumer Expenditure Survey Data
To illustrate this method, we use data from the U.S. Bureau of Labor Statistics Consumer Expenditure (CE) survey to test differences in earnings and spending between families with a primary earner who has at least a bachelor’s degree and families with a primary earner who does not. We will refer to the groups as the "college educated" group () and the "not college educated" group () respectively. A subset of variables contained in the CE 2015 interview dataset is provided in the rpms package (Toth, 2017). This dataset includes variables on the sample design, the household, and the person listed as the household’s primary earner.
We test for differences between these groups on two quantitative variables (household income and family size) and two proportions (proportion of families that have expenditures on tobacco and proportion of families that have a vehicle). For our analysis we consider households with primary earners between the ages of 22 and 64 for which the education level of the primary earner, household income, and expenditure information are provided. This gives us a sample size of 50,762 from 115 sampled clusters. Table 1 shows the comparisons between the college educated group and the not college educated group as well as the estimated p-value from the permutation test ignoring clusters and the permutation test adjusted for cluster membership. In both tests, we first subtracted the estimated unconditional mean of each random variable using the sample weights and performed the test on the residuals.
Description                  College Educated      Estimated p-value
                             Yes       No          iid     cluster
Sample Size                  18,175    32,587
Mean Household Income        94,584    46,487      0       0
Mean Family Size             2.5795    2.7888      0       0.313
Proportion with a Vehicle    0.9329    0.8705      0       0.3905
Proportion Using Tobacco     0.0609    0.1868      0       0
The two continuous random variables, family size and household income before tax, are both available on the CE dataset. In order to estimate the proportion of families with vehicles, we made an indicator random variable equal to 1 if the sum of the reported number of vehicles owned and the vehicles leased was greater than 0. Similarly, for estimating the proportion of families using tobacco, we made an indicator random variable that is equal to 1 if the reported expenditure on tobacco was greater than 0.
It is interesting to note that when we treat all the observations as independent (ignoring clustering), the permutation test finds the difference between the groups significant for every variable considered. However, when we accounted for the clustering, the differences in family size and in the proportion of families with a vehicle were not found to be significant. These results seem reasonable, as we would expect household income for families where the primary earner has a college degree to be larger than for those with a primary earner without a degree. Likewise, it is probable that the more educated families would be less likely to use tobacco due to the many scientific reports linking tobacco use to a variety of health issues, but it is not clear that they would be more or less likely to have a car or to have a larger or smaller family size.
These comparisons are intended to illustrate the method, and the results are encouraging in that they seem to highlight the importance of adjusting for the sample design when using sample data. Because the simulation results show that this method leads to a loss of power compared to the iid test, we cannot be sure that there is no difference in family size or in the proportion of families with a car between these two groups, but the simulations also show that ignoring the sample design is likely to lead to completely unreliable estimated p-values. Since the variables tested are likely to be correlated within clusters, we would not trust any results that ignore the sample design.
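The data preparation steps described for this analysis, building the indicator variables and subtracting the weighted mean before testing, can be sketched as follows. The function names and the small example values are hypothetical and do not correspond to actual CE variable names or records.

```python
def indicator_any(*amounts):
    """1 if the reported amounts sum to more than 0, otherwise 0."""
    return 1 if sum(amounts) > 0 else 0


def weighted_mean(values, weights):
    """Estimate of the unconditional mean using the sample weights."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def center(values, weights):
    """Residuals after subtracting the weighted mean."""
    m = weighted_mean(values, weights)
    return [v - m for v in values]


if __name__ == "__main__":
    owned = [1, 0, 2, 0]   # hypothetical counts of vehicles owned
    leased = [0, 0, 0, 1]  # hypothetical counts of vehicles leased
    has_vehicle = [indicator_any(o, l) for o, l in zip(owned, leased)]
    weights = [2, 1, 1, 4]  # hypothetical sample weights
    print(has_vehicle)
    print(center(has_vehicle, weights))
```

The centered values are what would be passed to the permutation test; their weighted sum is zero by construction.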
5 Discussion
We have proposed a general method for performing a pseudo-permutation test that accounts for the complex sample design and have shown that the test will give design-consistent results under a set of conditions on the sample design and population structure. Tests using a simulated population, comparing the performance of the proposed method to permutation tests that ignore the sample design, demonstrate that it is important to account for the sample design in order to obtain reasonable p-value estimates. The results of these simulations and an application using publicly available consumer expenditure data especially highlight the importance of accounting for clustering in the sample.
Though accounting for the sample design protects against performing an invalid test when the design is informative, the presented permutation method also leads to a loss of power. This loss of power occurs whether the sample design is informative with respect to the variable of interest or not. Perhaps this could be mitigated by adjusting the proposed method using an estimate of the design effect in some way, which could be the subject of further research. However, the presented method represents a general method for performing a permutation test on data obtained through a complex sample that will provide valid inference at the cost of some power.
Acknowledgments
The authors would like to thank people (specifics to be added later).
References
 Binder (1983) David A Binder. On the variances of asymptotically normal estimators from complex surveys. International Statistical Review/Revue Internationale de Statistique, pages 279–292, 1983.
 Cox and Hinkley (1979) David Roxbee Cox and David Victor Hinkley. Theoretical statistics. CRC Press, 1979.
 Fienberg and Tanur (1996) Stephen E Fienberg and Judith M Tanur. Reconsidering the fundamental contributions of Fisher and Neyman on experimentation and sampling. International Statistical Review/Revue Internationale de Statistique, pages 237–253, 1996.
 Fisher (1935) Ronald A Fisher. The logic of inductive inference. Journal of the Royal Statistical Society, 98:39–82, 1935.
 Flury (1997) Bernard Flury. A First Course in Multivariate Statistics. Springer Science & Business Media, 1997.
 Gail et al. (1988) MH Gail, Wai-Yuan Tan, and Steven Piantadosi. Tests for no treatment effect in randomized clinical trials. Biometrika, 75(1):57–64, 1988.
 Good (2005) Phillip Good. Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer-Verlag: New York, 2005.
 Holt et al. (1980) D Holt, TMF Smith, and PD Winter. Regression analysis of data from complex surveys. Journal of the Royal Statistical Society. Series A (General), pages 474–487, 1980.
 Kingman (1978) John FC Kingman. Uses of exchangeability. The Annals of Probability, pages 183–197, 1978.
 Pfeffermann (1993) Danny Pfeffermann. The role of sampling weights when modeling survey data. International Statistical Review/Revue Internationale de Statistique, pages 317–337, 1993.
 Pitman (1938) Edwin James George Pitman. Significance tests which may be applied to samples from any populations: III. The analysis of variance test. Biometrika, 29:322–335, 1938.
 Scott and Smith (1969) Alastair Scott and Terence MF Smith. Estimation in multistage surveys. Journal of the American Statistical Association, 64(327):830–840, 1969.
 Scott and Holt (1982) Andrew J Scott and D Holt. The effect of two-stage sampling on ordinary least squares methods. Journal of the American Statistical Association, 77(380):848–854, 1982.
 Toth (2017) Daniell Toth. rpms: Recursive Partitioning for Modeling Survey Data, 2017. R package version 0.2.0.
 Welch (1990) William J Welch. Construction of permutation tests. Journal of the American Statistical Association, 85(411):693–698, 1990.