# The impact of a Hausman pretest on the coverage probability and expected length of confidence intervals

###### Abstract

In the analysis of clustered and longitudinal data that include a covariate varying both between and within clusters (e.g. a time-varying covariate in longitudinal data), a Hausman pretest is commonly used to decide whether subsequent inference is made using the linear random intercept model or the fixed effects model. We assess the effect of this pretest on the coverage probability and expected length of a confidence interval for the slope parameter. Our results show that, for the small levels of significance of the Hausman pretest commonly used in applications, the minimum coverage probability of this confidence interval can be far below nominal. Furthermore, the expected length of this confidence interval is, on average, larger than the expected length of a confidence interval for the slope parameter based on the fixed effects model with the same minimum coverage.

Keywords: Clustered data, Fixed effects model, Hausman specification test, Longitudinal data, Random intercept model.

## 1 Introduction

The linear random intercept model is commonly used in the analysis of clustered and longitudinal data. In clustered data, the response variable is measured once on each unit, and each unit is nested within a particular cluster of units. For example, one might analyze the reading test scores of children who are nested within classrooms (clusters). In longitudinal data, which can also be viewed as clustered data (see Rabe-Hesketh and Skrondal (2012, p. 227)), the response variable is measured at different time points for each subject, and the measurements across time are nested within each subject (cluster). For example, one might analyze the weights of individuals over time, where the weight measurements across time are nested within individuals (clusters). When the linear random intercept model includes a covariate that varies both between and within clusters (e.g. a time-varying covariate in longitudinal data), a preliminary Hausman (1978) test is commonly used to test the assumption of no correlation between the random intercept and the covariate. If the Hausman pretest rejects the null hypothesis of no correlation between the random intercept and the covariate then the fixed effects model is chosen for subsequent inference; otherwise the linear random intercept model is chosen. This preliminary model selection procedure has been widely used in econometrics (see e.g. Wooldridge, 2002 and Baltagi, 2005) and has also been adopted in other areas such as medical statistics (see e.g. Gardiner et al. (2009) and Mann et al. (2004)). The Hausman pretest has also been implemented in popular statistical computer programmes including SAS, Stata, eViews and R; see Ajmani (2009, Chapter 7.5.3), Rabe-Hesketh and Skrondal (2012, Chapter 3.7.6), Griffiths et al. (2012, Chapter 10.4) and Croissant and Millo (2008), respectively.

The two-stage procedure widely used in the analysis of clustered and longitudinal data is as follows. In the first stage, the Hausman pretest is used to decide whether subsequent inference is made using the random intercept model or the fixed effects model (see e.g. Ebbes et al., 2004 and Jackowicz et al., 2013). The second stage is that the inference of interest is carried out assuming that the model chosen in the first stage had been given to us a priori, as the true model. Guggenberger (2010) considers this two-stage procedure when the inference of interest is a hypothesis test about the slope parameter. He provides both a local asymptotic analysis of the size of this test and a finite sample analysis (via simulations) of the probability of Type I error.

We consider the case that the inference of interest is a confidence interval for the slope parameter. Kabaila et al. (2015) state three new finite sample results (Theorems 1, 2 and 7 of the present paper) on the effect of a Hausman pretest on the coverage probability of the confidence interval for this parameter. They also provide outline proofs of these results and a brief initial analysis of the coverage properties of this confidence interval. These coverage properties are estimated by simulation. In the present paper, we describe how the third of these results can be used to substantially improve the efficiency of these simulations through the use of variance reduction by control variates.

We compare the expected length of the confidence interval resulting from the two-stage procedure with the expected length of the confidence interval based on the fixed effects model, where the latter interval is adjusted to have the same minimum coverage as the former interval. The quantity used for this comparison is the scaled expected length, defined as the expected length of the interval resulting from the two-stage procedure divided by the expected length of the adjusted interval from the fixed effects model. The scaled expected length provides an insight into how useful the Hausman pretest is in this context. In the present paper, we provide 4 new finite sample theorems (Theorems 3, 4, 5 and 6) on the scaled expected length. The scaled expected length is estimated by simulation. We describe how to substantially improve the efficiency of these simulations through the use of variance reduction by control variates.

The results that we present make it easy to assess, for a wide variety of circumstances, the effect of the Hausman pretest on the coverage properties and scaled expected length of the confidence interval for the slope parameter. Our results show that when the usual small nominal level of significance for the Hausman pretest is used, the two-stage procedure can result in a confidence interval with minimum coverage probability far below nominal. However, if the nominal level of significance is increased to 50% then the minimum coverage probability is much closer to the nominal coverage (see Figures 1, 2 and 3). We also show that, in terms of expected length, the confidence interval resulting from the two-stage procedure is consistently outperformed by the confidence interval based on the fixed effects model, regardless of the nominal level of significance chosen for the Hausman pretest (see Table 1). The results presented in this paper were computed using programs written in the R programming language, which will be made available in a convenient R package.

In Section 2, we consider the practical situation that the random error and random effect variances are estimated from the data. We consider three estimators of these variances: the usual unbiased estimators, the maximum likelihood estimators of Hsiao (1986) and the estimators of Wooldridge (2002). The coverage probability and the scaled expected length of the confidence interval resulting from the two-stage procedure are determined by 4 known quantities and 5 unknown parameters. The known quantities are the number of individuals, the number of time points (for the longitudinal data case), the nominal significance level of the Hausman pretest and the nominal coverage probability of this confidence interval. The unknown parameters are the random error variance, the random effect variance, the variance of the covariate (or time-varying covariate in the case of longitudinal data), a scalar parameter that determines the correlation matrix of the covariates and a non-exogeneity parameter.

If, for given values of the 4 known quantities, we wish to assess the dependence of the coverage probability and the scaled expected length of the confidence interval resulting from the two-stage procedure on the 5 unknown parameters then we might consider, say, five values for each of these unknown parameters, leading to 3125 parameter combinations. Apart from the daunting task of summarizing so many results, it is possible that one might miss important values of the unknown parameters, such as values for which the coverage probability is particularly low or the scaled expected length is particularly large.

Theorems 1 and 5 state that, apart from the known quantities, the coverage probability and scaled expected length are actually determined by only 3 unknown parameters, including the non-exogeneity parameter. If we compute the minimum coverage probability with respect to the non-exogeneity parameter then we have only 2 unknown parameters and our assessment of the coverage properties of the confidence interval resulting from the two-stage procedure is greatly simplified. Theorems 2 and 6 state that the coverage probability and scaled expected length are even functions of the non-exogeneity parameter, so that computation time is halved. We also propose a scaling of the non-exogeneity parameter that takes account of the sample size. In effect, this scaling reduces the number of known quantities that determine this coverage probability from 4 to 3.
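The reduction in the size of the assessment can be made concrete with a small counting sketch. It assumes, purely for illustration, a full factorial grid with five values per unknown parameter, as in the example above; the function name `grid_size` is hypothetical.

```python
from itertools import product


def grid_size(values_per_parameter, num_parameters):
    """Count the combinations in a full factorial grid (illustrative helper)."""
    grid = product(range(values_per_parameter), repeat=num_parameters)
    return sum(1 for _ in grid)


# Naive assessment: five values for each of the 5 unknown parameters.
naive = grid_size(5, 5)
# Theorems 1 and 5: only 3 unknown parameters matter.
reduced = grid_size(5, 3)
# Minimizing over the non-exogeneity parameter leaves 2 unknown parameters.
minimized = grid_size(5, 2)

print(naive, reduced, minimized)
```

The evenness results (Theorems 2 and 6) halve the work further, since only non-negative values of the non-exogeneity parameter need to be simulated.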

In Section 3, we consider the coverage probability of the confidence interval resulting from the two-stage procedure when the random error and random effect variances are assumed to be known. Theorem 7 shows that the conditional coverage probability of the confidence interval resulting from the two-stage procedure can be found exactly by the evaluation of the bivariate normal cumulative distribution function. This theorem is important because it is used to reduce the variance of the simulation based estimators of the coverage probability and scaled expected length of the confidence interval resulting from the two-stage procedure (when random error and random effect variances are estimated). As we show in Section 4 this variance reduction is achieved by using control variates.

## 2 The model and the practical two-stage procedure (random error and random effect variances are estimated)

We focus on the case of longitudinal data for which denotes the individual and denotes the time period . By interpreting as the cluster index and as the unit of analysis, our results also apply to the analysis of clustered data. Part of the following description of the model and two-stage procedure is taken from Kabaila et al. (2015). Let and denote the response variable and the time-varying covariate, respectively, for the ’th individual at time . Suppose that

(1) |

where the ’s and the ’s are independent, the ’s are i.i.d. and the ’s are i.i.d. . We call the slope parameter, the error variance and the random effect variance. The ’s and the ’s are unobserved. Suppose that the parameter of interest is the slope parameter and that the inference of interest is a confidence interval for .

Also suppose that the ’s are i.i.d. multivariate normally distributed with zero mean and covariance matrix

(2) |

where is a vector of 1’s, is a matrix with ’s on the diagonal and on the off-diagonal (compound symmetry), and is a parameter that measures the dependence between and . We define the “non-exogeneity parameter” and note that it is a correlation, so . If then the ’s are exogenous variables. Let . In simulations, we generate as follows. Suppose that are i.i.d. . Also suppose that these random variables and the ’s are independent. Let for and . The distribution of conditional on , found from (2), is used to generate . Let .

Assume, for the moment, that and are known. When , a confidence interval for may be found as follows. Let . Condition on and use the GLS estimator of . Let , where denotes the cdf. Define the following confidence interval for

where denotes the variance of , conditional on when . The confidence interval has coverage probability when . Averaging (1) over for each we obtain

(3) |

where

This model is called the between effects model. When , an alternative estimator of is , the OLS estimator based on the model (3), when we condition on . This estimator does not require a knowledge of . Subtracting (3) from (1), we obtain

(4) |

This model is called the fixed effects model. We estimate by , the OLS estimator based on this model. Define the following confidence interval for

where denotes the variance of , conditional on . Irrespective of the value of , the confidence interval has coverage probability . In practice, we do not know whether or not . The usual procedure is to use a Hausman pretest to test against . We consider this pretest, based on the test statistic

(5) |

where denotes the variance of conditional on and assuming that . This test statistic has a distribution under . Suppose that we accept if ; otherwise we reject . Note that is the level of significance of this test. We now describe the two-stage procedure. If is accepted then we use the confidence interval ; otherwise we use the confidence interval . Let denote the confidence interval, with nominal coverage , that results from this two-stage procedure.
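As a minimal sketch of the two OLS estimators described above, the following computes the slope from the between effects model (OLS on the individual means) and from the fixed effects model (OLS on deviations from the individual means). The function names and the tiny noise-free data set are hypothetical, and the textbook OLS formulas are assumed rather than taken from the displayed equations.

```python
def group_means(rows):
    """Mean over time for each individual (one row per individual)."""
    return [sum(row) / len(row) for row in rows]


def within_slope(x, y):
    """Fixed effects (within) slope: OLS on deviations from individual means."""
    xbar, ybar = group_means(x), group_means(y)
    num = sum((xit - xb) * (yit - yb)
              for xi, yi, xb, yb in zip(x, y, xbar, ybar)
              for xit, yit in zip(xi, yi))
    den = sum((xit - xb) ** 2 for xi, xb in zip(x, xbar) for xit in xi)
    return num / den


def between_slope(x, y):
    """Between effects slope: OLS of individual means of y on those of x."""
    xbar, ybar = group_means(x), group_means(y)
    gx, gy = sum(xbar) / len(xbar), sum(ybar) / len(ybar)
    num = sum((xb - gx) * (yb - gy) for xb, yb in zip(xbar, ybar))
    den = sum((xb - gx) ** 2 for xb in xbar)
    return num / den


# Hypothetical noise-free data: y_it = 1 + 2 * x_it + mu_i, where the
# individual effects mu_i are chosen to be related to the mean of x.
x = [[1.0, 2.0, 3.0], [2.0, 4.0, 9.0]]
mu = [5.0, -1.0]
y = [[1.0 + 2.0 * xit + m for xit in xi] for xi, m in zip(x, mu)]

print(within_slope(x, y), between_slope(x, y))
```

In this constructed example the within estimator recovers the slope because the individual effects cancel when the means are subtracted, whereas the between estimator is distorted by the relationship between the effects and the covariate means. This is the kind of discrepancy that the Hausman pretest is designed to detect.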

Of course, in practice, and are not known and need to be estimated. So, in practice, the Hausman pretest is based on the test statistic and the two-stage procedure results in the confidence interval where and denote estimators of and , respectively.
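The decision rule of the two-stage procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the caller supplies the Hausman statistic, the two slope estimates and their standard errors; the statistic is compared with a chi-squared critical value with 1 degree of freedom (appropriate for a scalar slope); and both intervals use standard normal quantiles. All names are hypothetical.

```python
from statistics import NormalDist


def two_stage_interval(beta_re, se_re, beta_fe, se_fe, hausman_stat,
                       pretest_level=0.05, coverage=0.95):
    """Return the chosen model and its confidence interval (illustrative)."""
    std_normal = NormalDist()
    # A chi-squared(1) variable is a squared N(0, 1) variable, so its upper
    # quantile is the square of a standard normal quantile.
    critical = std_normal.inv_cdf(1 - pretest_level / 2) ** 2
    z = std_normal.inv_cdf(1 - (1 - coverage) / 2)
    if hausman_stat <= critical:
        # Pretest accepts exogeneity: use the random intercept model.
        model, estimate, se = "random intercept", beta_re, se_re
    else:
        # Pretest rejects exogeneity: use the fixed effects model.
        model, estimate, se = "fixed effects", beta_fe, se_fe
    return model, (estimate - z * se, estimate + z * se)


# Hypothetical inputs: the same data lead to different models at the
# 5% and 50% pretest levels.
print(two_stage_interval(1.0, 0.1, 1.2, 0.2, hausman_stat=1.0))
print(two_stage_interval(1.0, 0.1, 1.2, 0.2, hausman_stat=1.0,
                         pretest_level=0.5))
```

Raising the pretest level lowers the critical value, so the fixed effects interval is selected more often.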

### 2.1 The coverage probability of the confidence interval resulting from the two-stage procedure

The coverage probability of the confidence interval constructed from this two-stage procedure is . The following two theorems (stated by Kabaila et al. 2015) give properties of this coverage probability.

###### Theorem 1

For any of the pairs of estimators listed in Appendix B, the unconditional coverage probability is determined by (the number of individuals), (the number of time points), (the nominal significance level of the Hausman pretest), (the nominal coverage probability), (the ratio ), (the parameter that determines ) and (the non-exogeneity parameter). Given these quantities, the coverage probability does not depend on either (the variance of the random error) or (the variance of the random effect) or (the variance of the time-varying covariate ).

###### Theorem 2

Suppose that , , , , and are fixed. When and are replaced by any of the pairs of estimators listed in Appendix B, the unconditional coverage probability is an even function of .

Outline proofs of these results are provided by Kabaila et al. (2015). Detailed proofs are provided in the supplementary material for the present paper. A remarkable feature of the proofs of these theorems is that they are carried out without relying on a simple expression for the coverage probability . We use simulations to compute , employing variance reduction by control variates, as described in Section 4. Using Theorem 2, we only need to consider in the interval , which means that we have reduced the number of simulations needed to estimate the coverage probability function (or its minimum) by half.

We now examine the influence that the nominal level of significance of the Hausman pretest has on the coverage probability function . Suppose that and are the usual unbiased estimators of and , respectively (described in Appendix B). Consider , , , and the nominal coverage probability . In practice, it is common to use a small value of , such as 0.05 or 0.01. As noted by Guggenberger (2010), examples of practical applications that have used a small for the Hausman pretest are provided by Gaynor et al. (2005, p.245) and Bedard and Deschenes (2006, p.189). Figure 1 presents graphs of the coverage probability , considered as a function of . Each graph is computed using the variance reduction method and the common random numbers (to produce smoother graphs) described in Section 4. The number of simulation runs used to compute each graph is . The bottom (with circle points) graph is for nominal significance level of the Hausman pretest. This graph falls well below the nominal coverage for a wide interval of values of , with the minimum of the coverage probability approximately equal to 0.75. Suppose that we choose the significance level of the Hausman pretest to be quite large, say . Now the Hausman pretest is more likely to reject the null hypothesis that and therefore more likely to choose the fixed effects model for the construction of the confidence interval. The top (with triangle points) graph is for nominal significance level of the Hausman pretest. Although this graph is still below the nominal coverage, there has been a large improvement.

As noted in the Introduction and in Section 2, if we compute the minimum over of the coverage probability then we are left with only two unknown parameters, and . If we fix then the minimum coverage depends only on , where , as it is a correlation. Suppose that and are the usual unbiased estimators of and , respectively. Suppose that , , and the nominal coverage probability . Figure 2 presents graphs of the coverage probability , minimized over , considered as a function of (). It can be seen that the coverage probability, minimized over , is a decreasing function of . This is because as increases, becomes closer to 0 for (), causing to become a very inaccurate estimator of . This then causes the test statistic to have very little power to detect non-zero values of . Each estimate of the minimum coverage is found using the common random numbers and the variance reduction method described in Section 4. Similarly to Figure 1, we see a vast improvement in the minimum coverage by letting rather than choosing to be the commonly used, smaller value 0.05.

In practice, is not known and must be estimated from the data. However, it is commonly observed in practice that takes small to moderate values, and not values close to 1. This suggests that we fix and plot the graph of the coverage probability , minimized over , as a function of . Suppose that and are the usual unbiased estimators of and , respectively. Consider , , the nominal coverage probability and . Figure 3 presents graphs of the coverage probability , minimized over , as a function of . For nominal significance level of the Hausman pretest, this minimized coverage probability is far below the nominal coverage for approximately equal to 0.2. However, for nominal significance level of the Hausman pretest, we see (once more) a dramatic improvement in the minimum coverage probability.

### 2.2 Comparison of the two-stage interval with the adjusted interval based on the fixed effects model

The following notation is introduced to describe the expected length of the two-stage confidence interval. Let , (“sum of squares between”) and (“sum of squares within”). Let and . Now define and , where .

We use the notation

where is an arbitrary statement. Let be the statement . The expected length of is

Values of close to 1 are unlikely to be encountered in practice. We therefore restrict attention to for some given . Let denote the coverage probability of , minimized over , and .

Let

(6) |

The following new theorem allows us to easily compute for any given .

###### Theorem 3

does not depend on any unknown parameters.

The proof of Theorem 3 is provided in the supplementary material. Define to be the value of such that . Let be for . In other words, is the confidence interval based on the fixed effects model, when is estimated from the data, that is adjusted to have the same minimum coverage as . The expected length of is

The scaled expected length of is defined to be the expected length of divided by the expected length of . Let . A useful expression for this scaled expected length is given by the following result.

###### Theorem 4

The scaled expected length of is equal to

(7) |

A proof of this result is provided in the supplementary material. If (7) is less than 1 then is a shorter interval on average than . The following two new theorems describe important properties of the scaled expected length. Outline proofs of these theorems are given in Appendix C. Detailed proofs are provided in the supplementary material.

###### Theorem 5

For any of the pairs of estimators listed in Appendix B, the unconditional scaled expected length given in (7) is determined by (the number of individuals), (the number of time points), (the nominal significance level of the Hausman pretest), (the nominal coverage probability), (the ratio ), (the parameter that determines ) and (the non-exogeneity parameter). Given these quantities, the scaled expected length does not depend on either (the variance of the random error) or (the variance of the random effect) or (the variance of the time-varying covariate ).

###### Theorem 6

Suppose that , , , , and are fixed. When and are replaced by any of the pairs of estimators listed in Appendix B, the unconditional scaled expected length given in (7) is an even function of .

Suppose that and are the usual unbiased estimators of and , respectively. Consider the case that , , and . We find by minimizing the coverage probability over , and , since it is reasonable to expect that in practice. For given and , we define min SEL and max SEL to be the scaled expected length minimized and maximized, respectively, with respect to and . For each and , we have computed min SEL and max SEL. This information is shown in Table 1. From Table 1 it is clear that, for this example, is the shorter interval on average. The minimum scaled expected length is a decreasing function of because as increases, decreases.

| Significance level of the Hausman pretest | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 0 | 0.2 | 0.4 | 0.6 | 0.8 |
| min SEL | 4.06 | 3.67 | 3.23 | 2.70 | 1.99 | 1.65 | 1.57 | 1.49 | 1.38 | 1.24 |
| max SEL | 4.88 | 4.88 | 4.87 | 4.89 | 4.89 | 1.81 | 1.81 | 1.81 | 1.81 | 1.81 |

## 3 The coverage probability when random error and random effect variances are assumed known

When and are known, the confidence interval resulting from the two-stage procedure is denoted by . In this section we describe how the coverage probability of , conditional on , can be computed exactly using the bivariate normal distribution. The results of this section are used in Section 4 to find control variates (used for variance reduction) for the estimation by simulation of and the scaled expected length (given by (7)).

Let denote the coverage probability of , conditional on . Observe that is equal to

(8) |

where , and . By the law of total probability, (8) is equal to the sum of and

(9) |

The first and second terms in this expression are determined by the conditional distributions of the random vectors and , respectively. Theorem 7 gives these distributions, which requires the introduction of , where is given in Appendix A. The following theorem is stated by Kabaila et al. (2015), together with an outline proof. A detailed proof is provided in the supplementary material for the present paper.

###### Theorem 7

Conditional on , and have bivariate normal distributions, where , ,

Thus, when and are known, can be found easily by evaluation of the bivariate normal cumulative distribution function in the expression (9). Similarly to Theorem 1, this probability is determined by , , , , , , and . Note that the dependence on is through . Also, similarly to Theorem 2, is an even function of . These results may be proved using similar, but much simpler, arguments to those used in the proofs of Theorems 1 and 2 which are given in the supplementary material.

## 4 Simulation methods, including the use of variance reduction, when the random error and random effect variances are unknown

In Section 3 we described how to find the coverage probability of the confidence interval resulting from the two-stage procedure, conditional on when and are known, using the bivariate normal distribution. In the practically important case that and are replaced by estimators, the results of Section 3 allow us to find control variates (for variance reduction) for the estimation by simulation of both the coverage probability and the scaled expected length. In Section 4.1 we describe the estimation by simulation of the coverage probability and in Section 4.2 we describe the estimation by simulation of the scaled expected length. The simulation methods described in this section apply to any of the pairs of estimators listed in Appendix B.

We consider the model (1) and choose the intercept , the parameter of interest and the values for , , , , , and . Of course, by Theorem 1 and Theorem 5, the coverage probability and the scaled expected length do not depend on either , or and depend on and only through . Our simulation methods consist of independent simulation runs. On the ’th simulation run (), we generate observations of the ’s and ’s using the assumptions made in Section 2, i.e. the ’s are i.i.d. and the ’s are i.i.d. with a multivariate normal distribution with mean and covariance matrix (2). As noted in Section 2, in simulations is determined by , which we write as . Let and denote the observed values of and , respectively, for the ’th simulation run.

### 4.1 Estimating the coverage probability by simulation

For the observed values in the ’th simulation run, we compute the following three quantities. The confidence interval resulting from the two-stage procedure, when and are assumed known, denoted by . The confidence interval resulting from the two-stage procedure, when and are estimated by and , respectively, denoted by . The coverage probability of , conditional on , when and are assumed known, is . Note that this conditional coverage probability is computed exactly using the bivariate normal distributions given in Theorem 7.

Let , the coverage probability of . Now define the unbiased estimator

of CP. This is the usual “brute-force” simulation estimator of CP. We estimate the variance of this estimator by noting that it is a binomial proportion. Let , the coverage probability of , when and are assumed known. Now define the unbiased estimator

of CPK. By the double expectation theorem, . Thus another unbiased estimator of is

which is a much more accurate estimator of CPK than .

Define the control variate , which has expected value zero. The simulation-based unbiased estimator of that employs variance reduction using this control variate, is

We expect that the correlation between and will be close to 1. Since is a much more accurate estimator of CPK than , we expect that the correlation between and the control variate will also be close to 1. Note that

We estimate the variance of this estimator by noting that it is an average of i.i.d. random variables.
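The idea of variance reduction by a zero-mean control variate can be illustrated on a much simpler toy problem than the model (1). The sketch below is an assumption-laden analogue, not the paper's simulation: it estimates the coverage of a confidence interval for a normal mean that plugs an estimated standard deviation into a standard normal quantile, and uses as control variate the coverage indicator of the known-variance interval minus its exactly known coverage. In the paper the known quantity is instead a conditional coverage computed from Theorem 7. All names and settings are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def toy_coverage_estimates(num_runs=20000, n=10, coverage=0.95, seed=1):
    """Brute-force and control variate estimates of a toy coverage probability."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    hits_est, hits_known = [], []
    for _ in range(num_runs):
        # Sample with true mean 0 and true standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        centre = sum(sample) / n
        sd_hat = math.sqrt(sum((v - centre) ** 2 for v in sample) / (n - 1))
        # Does each interval contain the true mean 0?
        hits_known.append(1.0 if abs(centre) <= z / math.sqrt(n) else 0.0)
        hits_est.append(1.0 if abs(centre) <= z * sd_hat / math.sqrt(n) else 0.0)

    # Brute-force estimator: a binomial proportion.
    cp_brute = fmean(hits_est)
    var_brute = cp_brute * (1 - cp_brute) / num_runs

    # Control variate: known-variance indicator minus its exact mean.
    adjusted = [e - (k - coverage) for e, k in zip(hits_est, hits_known)]
    cp_cv = fmean(adjusted)
    # Average of i.i.d. terms, so its variance is estimated in the usual way.
    var_cv = variance(adjusted) / num_runs
    return cp_brute, var_brute, cp_cv, var_cv


cp_brute, var_brute, cp_cv, var_cv = toy_coverage_estimates()
print(cp_brute, cp_cv, var_brute / var_cv)
```

Both estimators are unbiased for the same coverage probability. The control variate version has the smaller estimated variance here because the two coverage indicators are positively correlated.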

We evaluate the efficiency gain of using to estimate the coverage probability CP over , as follows. Let and denote the times taken to carry out simulation runs when we estimate CP by and , respectively. The efficiency of the control variate estimator relative to the “brute-force” estimator is

The larger this relative efficiency, the greater the gain in using the control variate estimator , by comparison with using the “brute-force” estimator . To give an example of the efficiency gained by using compared to , when estimating CP, we set , , , , , and . We obtain seconds, seconds, and . The time ratio is and the variance ratio is , so the efficiency of relative to is approximately 4.17. In other words, it would take approximately 4.17 times as long to compute the “brute-force” estimator with the same accuracy as the control variate estimator.
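The relative efficiency calculation can be written as a one-line function. The sketch assumes the usual definition, in which the efficiency is the product of the time ratio and the variance ratio (brute-force over control variate in both cases). The numerical inputs are hypothetical and are not the values obtained in the example above.

```python
def relative_efficiency(time_brute, var_brute, time_cv, var_cv):
    """Efficiency of the control variate estimator relative to brute force.

    Assumed definition: (time ratio) * (variance ratio), with the
    brute-force quantities in the numerators.
    """
    return (time_brute / time_cv) * (var_brute / var_cv)


# Hypothetical timings (seconds) and estimated variances.
eff = relative_efficiency(time_brute=8.0, var_brute=2.5e-6,
                          time_cv=10.0, var_cv=0.8e-6)
print(eff)
```

A value above 1 means that the extra time per run spent evaluating the control variate is more than repaid by the reduction in variance.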

We also use common random numbers to create smoother plots of the estimated coverage probability, as a function of . The estimates of the coverage probability are computed for an equally-spaced grid of values of . On the ’th simulation run we generate an observation of from observations of the random numbers and . So, on the ’th simulation run, for each value of in the grid, we use the same random numbers that are used to generate the observations of the ’s and the ’s. These observations are then used to construct our simulation-based estimate of CP. Therefore on the ’th simulation run, for each value of , we have an estimate of the coverage probability using the same random numbers.
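The smoothing effect of common random numbers can be seen in a toy sketch. It estimates a simple probability, P(Z + theta > 1) for standard normal Z, over a grid of theta values, either reusing the same draws at every grid point or drawing afresh each time. The target probability, the grid and all names are hypothetical and stand in for the coverage probability and its parameter grid.

```python
import random


def estimate_curve(grid, num_runs=2000, seed=7, common=True):
    """Estimate P(Z + theta > 1) on a grid, with or without common random numbers."""
    rng = random.Random(seed)
    # One set of draws that can be shared by every grid point.
    shared = [rng.gauss(0.0, 1.0) for _ in range(num_runs)]
    curve = []
    for theta in grid:
        if common:
            draws = shared  # same random numbers for every theta
        else:
            draws = [rng.gauss(0.0, 1.0) for _ in range(num_runs)]
        curve.append(sum(1 for z in draws if z + theta > 1.0) / num_runs)
    return curve


grid = [i / 20 for i in range(21)]
smooth = estimate_curve(grid, common=True)
rough = estimate_curve(grid, common=False)
print(smooth[:5])
print(rough[:5])
```

With common random numbers the estimated curve inherits the monotonicity of the true curve, because each draw contributes to every grid point in the same way. With independent draws, simulation noise differs from point to point and can make the plotted curve jagged.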

### 4.2 Estimating the scaled expected length by simulation

When estimating (7) by simulation we consider the expected values in the numerator and denominator separately. We begin with the expected value in the numerator of (7). Let and let be the value of when . An unbiased simulation estimator of the expected value of that does not use any variance reduction technique is

Now let , where is the statement . Also let be the value of when . Then is equal to

where . We compute using Theorem 7. An unbiased simulation estimator of the expected value of that makes use of a control variate for variance reduction is

Next we will consider the expected value in the denominator of (7). Let