Marginal and Conditional Multiple Inference in Linear Mixed Models

Marginal and Conditional Multiple Inference in Linear Mixed Models

\fnmsPeter \snmKramlinger\thanksrefm1label=e1]peter.kramlinger@uni-goettingen.de [    \fnmsTatyana \snmKrivobokova\thanksrefm1label=e2]tkrivob@gwdg.de [    \fnmsStefan \snmSperlich\thanksrefm2label=e3]stefan.sperlich@unige.ch [ Georg-August-Universität\thanksmarkm1 and Université de Genève\thanksmarkm2 Georg-August-Universität
Institute for Mathematical Stochastics
Goldschmidtstr. 7,
37077 Göttingen, Germany
Université de Genève
School of Economics and Management
40 Bd du Pont d’Arve,
1211 Genève 4, Switzerland
Abstract

This work introduces a general framework for multiple inference in linear mixed models. Inference in mixed models can be done about population effects (marginal) and about subject specific characteristics (conditional). For two asymptotic scenarios that adequately address settings arising in practice, we construct consistent simultaneous confidence sets for subject specific characteristics. We show that remarkably marginal confidence sets are still asymptotically valid for conditional inference. The resulting sets are suitable for multiple testing and thus fill a gap in the literature that is of practical relevance. We confirm our findings with a simulation study and real data example of Spanish income data.

[
\kwd
\runtitle

Multiple Inference in Linear Mixed Models

{aug}

,
and

class=MSC] \kwd[Primary ]62J15 \kwd[; secondary ]62F05 \kwd62F03 \kwd62J10

Marginal prediction \kwdconditional confidence \kwdsimultaneous inference \kwdmultiple testing \kwdsmall area estimation

1 Introduction

Linear mixed models (LMMs) were introduced by Charles Roy Henderson in 1950s [14, 15] and are applied if repeated measurements on several independent subjects of interest are available. Monographs [29], [8] and [17] give a comprehensive overview of LMMs and their generalizations. The classical LMM can be written as

 yi=Xiβ+Zivi+ei,i=1,…,mei∼Nni{0ni,Ri(δ)},vi∼Nqi{0qi,Gi(δ)}, (1)

with observations , known covariates and , independent random effects and error terms , such that . Parameters and are unknown and we denote , where and are known up to .

Model (1) accomodates both settings with a fixed number of subjects by a growing number of observations per subject , as well as settings with a growing number of subjects by few observations per subject , implying two possible asymptotic scenarios for mixed models, as noted by [18]. The latter case is often referred to as small area estimation [31].

Depending on the research question, the focus of estimation and inference might lay either on the population parameter or on subject specific parameters associated with . In the former case, a LMM (1) is interpreted as a linear regression model with mean and covariance matrix that accounts for complex dependences in the data. Inference about is referred to as marginal and is fairly well understood. If the research focus is rather on the subject specific characterteristics, then the inference should be carried out conditional on , which is more involved. This distinction between marginal and conditional inference was emphasized already in [13] and has got particular attention in the model selection context. For example, [36] argued that the conventional (marginal) Akaike information criterion (AIC) is applicable to the selection of population parameter only and suggested a conditional AIC that should be employed if the research focus is conditional. For a general discussion on marginal versus conditional inference in mixed models, see also [24].

In this work we address simultaneous inference about , , where and are known. Typically, are taken to be subject conditional means. Under two possible asymptotic scenarios we construct simultaneous confidence sets for all and discuss the corresponding multiple testing problem. Thereby, we distinguish between the marginal scenario, where are treated as proper random variables and the conditional scenario, where are considered as fixed parameters.

There is a large body of literature on the confidence intervals for a single under the small area asymptotic scenario. In particular, much attention is given to the estimation of the mean squared error , where the expectation is taken under the marginal law and is some estimator of , which depends on unknown . To estimate marginal MSE, one can either plug in an appropriate estimator of (e.g., restricted maximum likelihood (REML) or Hendersons method III estimator given in [33]) or use second-order unbiased margnial MSE approximations derived in [30, 4, 3]. Other distribution-free approaches to the estimation of marginal MSE comprise a diverse collection of bootstrap methods, for an extensive review consult [2].

Since inference about has a clear conditional focus (under the marginal law are simply not available), it seems counterintuitive to base inference about on the marginal MSE. In fact, we show that the nominal coverage of the pointwise confidence intervals for based on the marginal MSE holds under the conditional law on average (over subjects) only, see Proposition 1 in Section 4 for more details. However, are biased under the conditional law and this bias is, in general, difficult to handle. Ignoring the bias leads to a clear under-coverage, see [5, 6], while estimating the bias leads to unacceptably wide intervals, see [19, 25, 28].

In this article we construct simultaneous confidence sets for in LMMs under two possible asymptotic scenarios. This problem remained largly untreated, to the best of our knowledge, except for [10], who pointed out the need for simultaneous inference and considered a related problem of inference about certain linear combinations of in the Fay-Herriot model (a special case of (1) under small area asymptotics) employing the Bayesian approach.

We first consider simultaneous confidence sets for under the conditional law and show that the nominal coverage is attained at the usual parametric rate. Additionally, we show that, surprisingly, the simultaneous confidence sets built under the marginal law, being also accurate at the same parametric rate, are at the same time approximately valid when conditioning on the subjects. This, however, is not true for the pointwise confidence intervals, as pointed out already. Furthermore, we use the derived confidence sets for multiple testing and demonstrate usefulness of this inference tool on the Spanish income data.

The main results are given in Section 2. Applications for comparative statistics and multiple testing are elaborated in Section 3. The fundamental problem and our results are then visualized in an extensive simulation study in Section 4 and further exemplified on Spanish income data in Section 5. Finally, we conclude with a discussion in Section 6. All proofs are given in Appendix and some auxiliary results can be found in Supplement [23].

2 Simultaneous Inference

We start with basic notation and assumptions. In the notation of [31], the empirical BLUP (EBLUP) as estimator of for unknown reads as

 (2)

Under the mild assumptions below , if is finite [22], but .

We consider two asymptotic scenarios.

1. , while .

2. , while .

Hereafter the asymptotic results are given with respect to , which depends on the specified scenarios: under 1 we set , while under 2. The requirement in 1 that the number of observations for each subject remains bounded was introduced by [12]. In this scenario the bias of the BLUP under conditional law does not vanish asymptotically. Conversely, it does so under 2. Note that the asymptotic scenario 2 allows that , grow with different rates.

Further, we adopt the regularity conditions from [30] and [4]:

1. , , , , contain only finite values.

2. has entries for .

3. , for and .

4. is linear in the variance components .

5. is an estimator of for which for .

Conditions 1 - 3 ensure that the behaviour of those quantities does not outweigh the asymptotic considerations needed to achieve desired order of the coverage error. Condition 4 implies that the second derivatives of and w.r.t. are zero. Condition 5 is required to consistently estimate the variance components under 2. This condition is needed as the effective sample size for might not grow as is bounded, see [26].

For the variance components can be estimated using both REML and Hendersons method III. Those estimators are unbiased, even and translation invariant, which are the conditions of Kackar and Harville [22]. Subsequently, denotes an estimator of obtained with either one of these methods.

Now we turn to the construction of simultaneous confidence sets for . Since the inference focus in this case is conditional, we start by constructing a confidence set , such that , for a pre-specified level . In particular, for the conditional inference is treated as a fixed parameter and the assumption on normality of in (1) is ignored. Thereby, all parameter estimators are still obtained under model (1).

Let and be the approximately second-order unbiased estimator for , whose explicit form is given in equation (8) in Appendix.

Theorem 1.

Let model (1) hold and be as in (8). Under 1, 1-4 and 2, 1-5 it holds that

 P{∥∥ˆΣ−1/2c(^μ−μ)∥∥2<χ2m,1−α(^λ)∣∣∣v}=1−α+O(s−1/2),

where , is the -quantile of the -distribution and is the estimator for the non-centrality parameter

 λ

defined in (7).

Since is not unbiased under the conditional law, has to account for the conditional bias, whereas accounts for the correct variability under such law. Note that the result of Theorem 1 holds for any fixed , not necessarily a realisation of a normally distributed random variable.

From Theorem 1 we immediately obtain

 Cα={μ∈Rm:∥∥ˆΣ−1/2c(^μ−μ)∥∥2≤χ2m,1−α(^λ)},

which forms an approximate simultaneous confidence set over all subjects under the conditional law. The practical difficulty in construction of is the need to estimate the non-centrality parameter that introduces additional uncertainty.

If is treated as a proper random variable, this implies the following result.

Theorem 2.

Let model (1) hold and be an estimator for given in (5). Under 1, 1-4 and 2, 1-5 it holds that

 P{∥∥ˆΣ−1/2(^μ−μ)∥∥2<χ2m,1−α}=1−α+O(s−1/2),

where and is the -quantile of the -distribution.

As before, from Theorem 2 one obtains

 Mα={μ∈Rm:∥∥ˆΣ−1/2(^μ−μ)∥∥2≤χ2m,1−α},

such that , for . Note that such marginal confidence intervals have to interpreted with care, since under the marginal model is a random variable. However, it turns out that these marginal confidence intervals can be used for the simultaneous inference under the conditional law. Indeed, the following theorem states that , albeit derived under the marginal law, lead to the asymptotically correct coverage under the conditional law.

Theorem 3.

Let model (1) hold and be as in (5). Under 1, 1-4 and 2, 1-5 it holds that

From the proof one can see that the misspecification in using the marginal formulation under the conditional scenario is averaged out across the subjects under 1 or, less surprisingly, within the subjects under 2. Notably, the rates for the marginal formulation in the marginal versus conditional scenario coincide. Eventually, the result implies that .

Note that if the quadratic form in Theorem 3 is reformulated for one subject with in 1 we get rather

 P{(^μi−μi)2^σii<χ21,1−α∣∣∣v}=1−α+O(1).

In 2 however, the bias vanishes for each subject and the nominal coverage is attained for a single subject as well.

The results of this section suggest that simultaneous inference about under the conditional law can be performed based on the confidence sets obtained under the marginal law. In particular, this allows to circumvent estimation of a non-centrality parameter.

3 Multiple Testing

In practice, it is appealing to use the derived sets and for hypothesis testing. Assume, it is of interest to test

 H0:Lμ=cvs.H1:Lμ≠c, (3)

where and is a given -matrix with and . Value corresponds to the number of simultaneous tests of linear combinations, whereas each linear combination of interest is specified in the rows of . For example, if , the test and for two contrasts is carried out. Choosing and , can be used to test the hypothesis that is driven by subject specific characteristics. Such a test will be illustrated in Section 5.

For the conditional inference about , Theorem 1 implies that the -level test for (3) rejects if , where

This test is consistent with an error . is the non-centrality parameter that depends on the modified covariance . It has to be re-calculated if another multiple testing problem is of interest.

At the same time, Theorem 3 allows to employ the confidence set as well. An -level test rejects if , where

 Mα,L={c∈Rm∗:∥∥(LˆΣLt)−1/2(L^μ−c)∥∥2≤χ2m∗,1−α}.

For 2, this test is consistent with the same rate , whereas for 1 it is necessary that , for . This affirms that individual confidence intervals () can not be constructed using .

Multiple testing about subject specific effects has a clear conditional focus and should be done under the conditional law. If , then the test can be based on the marginal confidence set, which does not require estimation of the non-centrality parameter.

4 Simulation Study

Consider a special case of (1), the nested error regression model (see [1])

 yij=xtijβ+vi+eij,i=1,…,m,j=1,…,ni,eij∼N(0,σ2e),vi∼N(0,σ2v). (4)

The data are simulated as follows. For each given set of the parameters , , , , the value of the subject specific effect is obtained once as a realization of a -distributed random variable and remains fixed in all Monte Carlo samples. The values of where drawn from a standard normal distribution, whereas consists of a column of ’s and a column of entries drawn from the uniform distribution. For the results in Figure 1, the variance components where set to .

The parameter of interest is , where .

Before we report simulation results for simultaneous inference, we visualize consequences of using marginal law for pointwise inference about single , using the same simulation design. In particular, we set and under 1 and under 2. The results are based on 1.000 Monte Carlo samples. Figure 1 shows the pointwise coverage of confidence intervales for built under the marginal law. The left hand side plot of Figure 1 corresponds to the small area asymptotics 1. Apparently, subjects which comprise a large , being those with most prominent subject characteristic, exhibit a severe undercoverage. This is particularly annoying, since such subjects are arguably those that a practitioner is most interested in, as in Section 5. This behavior has also been previously noticed in [19]. On average (over all subjects), however, over- and undercoverage cancel each other out. Under 2, this problem is less pronounced, since the bias for every subject vanishes asymptotically and so does the difference between conditional and marginal variance, as visible on the right hand side plot of Figure 1. These observations are formalised in the following proposition.

Proposition 1.

Let model (1) hold, known, and the -quantile of the standard normal distribution. Under 1 and 2 with 1 and 2, it holds that

 1mm∑i=1P(|Ti|≤z1−α|v)=1−α+O(s−1/2).

That is, the coverage probability of marginal confidence intervals under the conditional law attains its nominal level on average over all subjects. For the simulated data in Figure 1 the average coverage under 1 is 95.4%, while under 2 it is 94.9%.

We turn now to the simultaneous inference. Table 1 contains results based in a Monte Carlo sample of size . For each Monte Carlo sample the estimates and , as well as and , are calculated and then it is checked whether lies within the confidence set. The resulting coverage probability is eventually reported in Table 1. For comparison, the coverage of the oracle confidence sets for known is also reported. The relative volume of the confidence sets to the volume of the REML-based marginal set is given in brackets.

Under 1, the asymptotic behavior relies on and thus studied for and . One case is carried out for only observations per subject, relating to the study of [1]. The other, for doubles the number of observations per subject, but is still very much in the small area framework. Under 2, percent of subjects grows with rate , while much faster with . The unpromising case of , is compared to , while the observations on other subjects only increase from to .

It is well known that the relation of and , the so-called intraclass correlation coefficient (ICC), plays a key role in the reliability of the estimators. Therefore, different ratios of and are considered.

The last four columns of Table 1 give the simulated coverage of the confidence sets for the nominal coverage of . The differences between each of the two marginal and conditional coverages displays the impact of the REML estimation. The estimation of the variance components is indeed very influential, in accuracy as well as in size. Further, a comparison between marginal REML and conditional REML coverages reveals the performance of and . The marginal sets are generally smaller. This is due to the conditional sets being amplified by the non-central quantile to meet the nominal level, but not stretched in the direction where the multivariate distribution of has the most mass.

The first two rows of of each configuration of show the asymptotic behavior for 1 with observations only, whereas the less extreme case for 1 is given in lines three and four. Indeed, values of larger produce better results. However, the reported coverage seems to be more significantly influenced by the number of observations in each subject, which should not be too small. This is the realm of case 2. Convergence for that scenario seems to be more sensitive, although this is likely due to the much smaller sample size.

More influential proves to be the relative size of the error variance to , with coverage being closest to the nominal level for large . This is not unexpected as these parameters determine the validity of the REML-estimates and this has already been observed for individual confidence intervals, see [3]. However, even for known , can exhibit undercoverage if the ICC is too small and too few data is available, whereas does not.

Let us turn to the test vs. , and . Power functions studying the error of the second kind for different parameters and , cf. Table 1, are given in Figure 2 with different ratios of the variance components.

Unsurprisingly, the power growths steeper for larger and , but again is sensitive to the relative size of to . The power of the tests based on the marginal set (solid line) is notably steeper than the slope of the power based on the conditional set (dashed line). Although being of less importance if is large, the plots favour the use of the marginal confidence sets for testing.

5 Study on Spanish Income

The proposed method is now applied to a case study on a data set of income for the Spanish working population obtained from the survey of living conditions in 2008 [7]. Variable of interest is the log-transformed yearly income for working people of age 50 and older. The subjects are now small areas that are formed by cross section of administrative provinces and city size. Available explanatory variables are gender, education level (primary or secondary school or university) and nationality. The observations are assumed to follow a nested error regression model (4).

In a first attempt, our interest lies in determining whether the hypothesis that no area-specific characteristic is present, , , , can be rejected. In total, observations are partitioned in areas, with a median of observations per area. The variance components are estimated via REML and yield and . If testing each area individually only of individual tests do not reject the hypothesis. Under this value should be for a nominal level of , thus indicating that should be rejected. As the previously available MSE-estimators did not require to consider the between-area variation of the estimators , a conservative Bonferroni correction was the only tool for simultaneous inference for the marginal approach. Considering all areas, the Bonferroni correction rejects . This is confirmed by the presented tools for accurate simultaneous inference, as the marginal approach gives

 ∥∥ˆΣ−1/2(^μ−¯¯¯¯¯X^β)∥∥2≈661>239≈χ2205,.95,

and so does the conditional approach:

 ∥∥ˆΣ−1/2c(^μ−¯¯¯¯¯X^β)∥∥2≈1061>239≈χ2205,.95(0).

Both sets have the same nominal coverage, and indeed, here they both yield the same result: They strongly reject . The conditional approach, however, fails to produce a positive estimate of for this data set. If interest lies in testing the hypothesis on only a subset of all small areas, the marginal set can be immediately applied with , whereas the conditional approach requires to re-estimate on the new subset of interest. This aspect additionally makes the application of the marginal set more appealing.

As a subset of interest consider now the autonomous community of Andalucía, with a total of small areas for observations with a median of per area. Again, only of individual confidence intervals do not reject the hypothesis, where it should be under . This time, the Bonferroni correction, being too conservative, does not reject the hypothesis. The accurate confidence set of the marginal approach however gives

 ∥∥(LAndˆΣLtAnd)−1/2LAnd(^μ−¯¯¯¯¯X^β)∥∥2≈90>55≈χ239,.95

and clearly rejects, correcting the defective inference made by Bonferroni correction.

As a variation of this case study, suppose now our interest lies in whether areas exhibit a statistically significant difference in income between genders. Thus, small areas are formed by a cross section of province and gender, and city size is included as an auxiliary variable in exchange. We study , for , where is the small area mean of income for males and for females, respectively. Along Section 3, matrix is the -matrix specifying the linear combination in each line for the corresponding provinces, such that . Again, the variance components are estimated via REML and yield and . As before, less than anticipated under , only of individual confidence sets do not reject the hypothesis. And again, both the Bonferroni correction as well as the marginal approach reject the hypothesis, as

 ∥∥(LˆΣLt)−1/2L^μ∥∥2≈85>70≈χ252,.95.

Looking at the confidence ellipsoid under we see that the rejection is mainly due to two provinces, namely Cuenca and Sevilla.

With , one can find the projection of onto and thus obtain for which would not be rejected, i.e. giving the smallest estimates for which you don’t reject. The result for some provinces with particular large differences in income by gender is shown in Table 2. Note that the higher income for females is due to the selection of the population in our study. This procedure indicates how much effort, and in which province is to be made to attain statistically insignificant differences. Such findings could not have been obtained from so far existing, conventional tools – typically provided for area-individual inference. This could be another interesting application of simultaneous inference in small area estimation, others are easily conceivable. In summary, the presented tools deliver a useful extension to accurately infer about multiple or all small area estimates simultaneously.

6 Discussion

In two asymptotic scenarios we considered simultaneous inference about subject specific characteristics in linear mixed models. We overcome the problem of bias estimation in the conditional case by showing that the coverage error in applying marginal sets is averaged out over multiple subjects. A simulation study confirms the finite sample behaviour of this effect.

Most uncertainty is induced by the estimation of . If the normality assumption of errors and random effects as stated in (1) is not met, it has been shown that the parameters for hierarchical [32] and non-hierarchical LMM [16] or by Hendersons Method III [11] are consistent and asymptotically normal. For the latter, one obtains asymptotically the same results [23]. However, in practice it is to be expected that depending on the deviations from normality larger samples are needed to reach the nominal coverage probability. A popular strategy is to transform the data in order to achieve normality for errors and random effects, with a recent review of methods given by [35]. Its usefulness is partly due to the availability of software that provides checks for the distributions of residuals and predictors [21]. Other than these, bootstrap methods for LMM can account for non-Gaussian data, with an extensive collection given in [9].

We expect that our results can be extended to other estimators of LMMs, such as the best predictor [20].

Acknowledgements

The authors thank Domingo Morales, John Rao, Carmen Cadarso-Suárez, Gauri Datta, Jiming Jiang, Maria-Jose Lombardia, and Wenceslao Gonzalez-Manteiga for helpful discussion. They also gratefully acknowledge the funding by the German Research Association (DFG) via Research Training Group 1644 “Scaling Problems in Statistics”.

Appendix

Notation

Throughout the appendix the following notation is used. The -th entry of matrix is denoted as or . The denotes the -th column vector of matrix . Other ways to display a vector or matrix is by e.g. , a stochastic matrix with each entry being of probabilistic order , or , if it is obvious that . Furthermore, for a matrix , is the Frobenius norm. For easier readability, the dependence on the is suppressed for various quantities. It should be clear from the context if e.g. or depend on or . Further, we adapt the notation of [31] and denote as given in (2) and analogously. Eventually for convenience, dropping the subject index labels the respective quantities over all observations, e.g. , and , etc. If the range of the index is clear from the context, it will not be dropped as well. For the proofs, denote the subject and the respective observation for the -th subject. Eventually, are indices referring the entries in .

Proof and Definitions for Theorem 2

The estimator for the across-area generalization of the Prasad-Rao MSE estimator from [30] is defined below. Let be the asymptotic covariance matrix of . Then,

 ˆΣ=ˆΣ(^δ) (5) K1(δ) =diag{hti(Gi−GiZtiV−1iZiGi)hi}i=1,…,m, K2(δ) ={dti(m∑l=1XtlV−1lXl)−1dk}i,k=1,…,m, ˆK3(δ) =diag{tr(∂bti∂δVi∂bi∂δt¯¯¯¯¯V)}i=1,…,m.

is a second-order unbiased estimator of . The leading term is an estimator for the variability induced in the prediction of the random effect, whereas describes the variability induced by the estimation of such that . Finally, th