
# Efficient and Robust Estimation for a Class of Generalized Linear Longitudinal Mixed Models

René Holst
rho@aqua.dtu.dk
National Institute of Aquatic Resources
Technical University of Denmark
Box 101, DK-9850 Hirtshals, Denmark
and
Bent Jørgensen
bentj@stat.sdu.dk
Department of Statistics
University of Southern Denmark
Campusvej 55, DK-5230 Odense M, Denmark

Abstract
We propose a versatile and computationally efficient estimating equation method for a class of hierarchical multiplicative generalized linear mixed models with additive dispersion components, based on explicit modelling of the covariance structure. The class combines longitudinal and random effects models and retains a marginal as well as a conditional interpretation. The estimation procedure combines that of generalized estimating equations for the regression with residual maximum likelihood estimation for the association parameters. This avoids the multidimensional integral of the conventional generalized linear mixed models likelihood and allows an extension of the robust empirical sandwich estimator for use with both association and regression parameters. The method is applied to a set of otolith data, used for age determination of fish.

Key words: Bias correction; Best linear unbiased predictor; Crowder weights; Dispersion components; Generalized estimating equation; Nuisance parameter insensitivity; Pearson estimating function; Residual maximum likelihood; Tweedie distribution.

## 1 Introduction

Ever since Nelder and Wedderburn (1972) introduced generalized linear models for independent data, there has been a steady development of methods for analysis of non-normal correlated data. This development was accelerated by the introduction of generalized estimating equations (Liang and Zeger, 1986) and generalized linear mixed models (Schall, 1991; Breslow and Clayton, 1993; Wolfinger and O’Connell, 1993). These two types of models differ conceptually and computationally as reflected in the conventional distinction between marginal and conditional models.

In practice, however, one is often faced with a combination of longitudinal and random effects, where neither of the two, on their own, are adequate. In this light, it is somewhat of an enigma why generalized estimating equations and generalized linear mixed models have continued to evolve along separate paths. With few exceptions (McCulloch and Searle, 2001; Diggle et al., 2002; Fitzmaurice et al., 2004), the literature is clearly divided into two separate strands. Pinheiro and Bates (2000) combined the two approaches for normal data. Ziegler et al. (1998) and Hall (2001) summarized the first decade of developments for generalized estimating equations and recent contributions include for example Hardin and Hilbe (2003), Wang and Carey (2004), Coull et al. (2006) and Wang and Hanfelt (2007). For generalized linear mixed models, we refer to recent monographs, such as Verbeke and Molenberghs (2000), Lee et al. (2006) and references therein.

In the present paper we propose a versatile and computationally efficient method for generalized linear longitudinal mixed models. The estimating equations used for the regression parameters are similar in style to those known from conventional generalized estimating equations, whereas Pearson estimating equations are used for the association parameters. The method combines longitudinal and random effects models while retaining a marginal as well as a conditional interpretation. A serial correlation structure is employed within clusters, but unlike the state space models of Jørgensen et al. (1999) and Jørgensen and Song (2007), where the latent process is non-stationary, our approach is based on a stationary latent process defined by means of a linear filter.

In order to avoid the intricate efficiency considerations associated with conventional generalized estimating equations, see eg. Wang and Hanfelt (2003), we emphasize explicit modelling of the covariance structure based on second-moment assumptions for the data-generating process. An example of the utility of this approach is the twin data analysis by Iachina et al. (2002), where the correlation between the two twins in a pair was estimated on the basis of a generalized estimating equations model. Nevertheless, the method allows the use of a working correlation structure and we extend the robust empirical sandwich estimator to handle the asymptotic variance for both regression and association parameters.

Our estimation of the association parameters compares to that of residual maximum likelihood by means of the bias-corrected Pearson estimating equations of Jørgensen and Knudsen (2004), following earlier work by McCullagh and Tibshirani (1990) and Hall and Severini (1998).

The estimating equations are solved by an efficient Newton scoring algorithm, thereby circumventing the problems associated with the multidimensional integral defining the likelihood in conventional generalized linear mixed models approaches such as Schall (1991), Breslow and Clayton (1993) and Wolfinger and O’Connell (1993).

The models covered by our approach are defined hierarchically via conditional Tweedie distributions, much like the multiplicative mixed effects models of Ma and Jørgensen (2007) and Ma et al. (2009). This leads to a multiplicative mean structure with a corresponding additive decomposition of the variance into dispersion components, thereby retaining much of the simplicity of classical linear models. The Tweedie family provides a flexible class of models for both positive continuous data, count data and positive continuous data with a point mass at zero (Jørgensen, 1997, Ch. 4), see also Jørgensen and Song (2007).

## 2 Model

### 2.1 Model specification

We now introduce the main model class of our approach, followed by a discussion of its covariance structure, which is crucial for the interpretation and estimation of the models. The model is based on the class of Tweedie exponential dispersion models (Jørgensen, 1997, Ch. 4). A Tweedie variable $Y \sim \mathrm{Tw}_r(\mu, \sigma^2)$ is characterized by having $E(Y) = \mu$ and $\mathrm{var}(Y) = \sigma^2 \mu^r$. Special cases include the normal ($r = 0$), Poisson ($r = 1$ and $\sigma^2 = 1$) and gamma ($r = 2$) families. The case $1 < r < 2$ corresponds to compound Poisson distributions, which are non-negative and continuous, except for a positive probability at zero.

For a given cluster $i = 1, \dots, I$ and time $t = 1, \dots, T$ consider a response vector $\boldsymbol{Y}_{it} = (Y_{it1}, \dots, Y_{itJ})^{\mathsf{T}}$ with conditionally independent Tweedie-distributed components. For ease of presentation we consider a balanced design with equidistant observation times common to all clusters. The model is readily adapted to ragged structures.

The cluster random effect is represented by independent latent Tweedie variables,

$$Z_i \sim \mathrm{Tw}_{r_1}(1, \sigma^2), \tag{1}$$

where $r_1 \geq 1$ in order to make $Z_i$ positive. Given the cluster variable $Z_i = z_i$, we consider a latent process based on Tweedie noise,

$$Z_{it} \mid Z_i = z_i \sim \mathrm{Tw}_{r_2}\!\left(\alpha_+^{-1} z_i,\; \alpha_+^{r_2} \omega^2 z_i^{1-r_2}\right), \tag{2}$$

where the $Z_{it}$ are assumed conditionally independent given the cluster variables $Z_i$, and again $r_2 \geq 1$; by the Tweedie variance function, (2) implies $\mathrm{var}(Z_{it} \mid z_i) = \omega^2 z_i$. Here $\alpha_+ = \sum_{s=0}^{\infty} \alpha_s$, with $\alpha_s \geq 0$ for $s = 0, 1, \dots$ and $\alpha_+ < \infty$. The coefficients $\alpha_s$ determine a conditionally weakly stationary latent process $\{Q_{it}\}$, defined by the linear filter

$$Q_{it} = \sum_{s=0}^{\infty} \alpha_s Z_{i,t-s}. \tag{3}$$

By construction, the middle layer of the latent process has conditional mean $E(Q_{it} \mid Z_i = z_i) = z_i$. At the observation level we assume

$$Y_{itj} \mid \boldsymbol{Q}_{i*} = \boldsymbol{q}_{i*} \sim \mathrm{Tw}_{r_3}\!\left(\mu_{it} q_{it},\; \rho^2 q_{it}^{1-r_3}\right), \tag{4}$$

with conditional independence given $\boldsymbol{Q}_{i*}$. Here the notation $i*$ denotes the vector obtained by letting the corresponding index run, so that $\boldsymbol{Q}_{i*} = (Q_{i1}, \dots, Q_{iT})^{\mathsf{T}}$ and so on. By definition of the Tweedie variance function we obtain linearity in the conditioning variable of the mean and variance, so that $E(Y_{itj} \mid \boldsymbol{q}_{i*}) = \mu_{it} q_{it}$ and $\mathrm{var}(Y_{itj} \mid \boldsymbol{q}_{i*}) = \rho^2 \mu_{it}^{r_3} q_{it}$. This property is essential for the derivation of the covariance structure and the estimating functions below. In the Poisson case ($r_3 = 1$) it is convenient to let the dispersion parameter $\rho^2$ accommodate potential over- or under-dispersion.

A variant of the model, applicable to normal response variables, assumes normal zero-mean random cluster effects, $Z_i \sim N(0, \sigma^2)$, replacing (1). The noise process (2) is then assumed to be Gaussian, $Z_{it} \mid Z_i = z_i \sim N(\alpha_+^{-1} z_i, \omega^2)$, while maintaining the filter (3) as above. At the response level, (4) is replaced by an additive structure with identity link function, $Y_{itj} \mid \boldsymbol{Q}_{i*} = \boldsymbol{q}_{i*} \sim N(\mu_{it} + q_{it}, \rho^2)$. Since the structure is linear, the corresponding covariance structure is easily derived.

The marginal means $\mu_{it} = E(Y_{itj})$ may depend on covariates via $g(\mu_{it}) = \boldsymbol{x}_{it}^{\mathsf{T}} \boldsymbol{\beta}$, where $\boldsymbol{\beta}$ denotes a vector of regression parameters and $g$ is a link function. With positive data, the log link is a suitable choice, providing a natural interpretation of the regression parameters. Furthermore the log link, along with the multiplicative structure, enables easy comparison with conventional generalized linear mixed models, where the random effects enter as a term in the linear predictor: $\log E(Y_{itj} \mid \boldsymbol{q}_{i*}) = \boldsymbol{x}_{it}^{\mathsf{T}} \boldsymbol{\beta} + \log q_{it}$.
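To make the three-layer hierarchy (1)–(4) concrete, it can be simulated directly in the special case $r_1 = r_2 = 2$ (gamma latent layers) and $r_3 = 1$ (Poisson responses), with a finite filter. The sketch below is ours, not part of the paper's software; it parameterizes gamma draws by mean and variance and assumes the conditional noise variance $\omega^2 z_i$ implied by (2).

```python
import numpy as np

def gamma_mv(mean, var, size, rng):
    """Gamma draws parameterized by mean and variance."""
    return rng.gamma(mean**2 / var, var / mean, size)

def simulate_cluster(mu, alpha, sigma2, omega2, T, rng, burn=200):
    """One cluster of the hierarchy (1)-(4) with gamma latent layers
    (r1 = r2 = 2) and Poisson responses (r3 = 1)."""
    alpha = np.asarray(alpha, float)
    a_plus = alpha.sum()
    # (1): cluster effect with mean 1 and variance sigma2
    z_i = gamma_mv(1.0, sigma2, None, rng)
    # (2): noise with conditional mean z_i / a_plus and variance omega2 * z_i
    z = gamma_mv(z_i / a_plus, omega2 * z_i, T + burn, rng)
    # (3): linear filter, so that E(Q_t | z_i) = z_i
    q = np.convolve(z, alpha, mode="full")[: T + burn][burn:]
    # (4): Poisson responses with conditional mean mu * q_t
    y = rng.poisson(mu * q)
    return y, q

# quick look: marginally, E(Y) is approximately mu
rng = np.random.default_rng(1)
y, q = simulate_cluster(mu=5.0, alpha=[1.0, 0.5, 0.25],
                        sigma2=0.1, omega2=0.05, T=50, rng=rng)
```

The burn-in truncates the infinite filter (3); with a finite `alpha` any burn-in at least as long as the filter suffices.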

### 2.2 Covariance structure

The marginal variance-covariance matrix of the observation vector is crucial for the inference and estimation procedures. Detailed derivations are found in Appendix 1.

The covariance between two observations at times $t$ and $t'$ within the $i$th cluster is

$$\mathrm{cov}(Y_{it}, Y_{it'}) = \sigma^2 \mu_{it} \mu_{it'} + \omega^2 \mu_{it} \mu_{it'} \sum_{s=0}^{\infty} \alpha_s \alpha_{s+|t-t'|} + \delta_{t't}\, \rho^2 \mu_{it}^{r_3},$$

where $\delta_{t't}$ is the Kronecker delta, being $1$ for $t = t'$ and zero otherwise. It is an important property of the model that the covariance does not depend on the shape parameters $r_1$ and $r_2$. From this we now derive a matrix expression for $\mathrm{var}(\boldsymbol{Y}_{i*})$.

First we consider the latent process correlation matrix $\boldsymbol{K}(\boldsymbol{\alpha})$, with $(t, t')$th entry $\sum_{s=0}^{\infty} \alpha_s \alpha_{s+|t-t'|}$. Next let $\boldsymbol{1}_T$ denote the $T$-vector of $1$s. In matrix notation the variance-covariance matrix of the response vector for the $i$th cluster may then be expressed as

$$\begin{aligned} \mathrm{var}(\boldsymbol{Y}_{i*}) &= \boldsymbol{\mu}_{i*} \boldsymbol{\mu}_{i*}^{\mathsf{T}} \odot \{\sigma^2\, \boldsymbol{1}_T \boldsymbol{1}_T^{\mathsf{T}} + \omega^2 \boldsymbol{K}(\boldsymbol{\alpha})\} + \rho^2\, \mathrm{diag}(\boldsymbol{\mu}_{i*}^{r_3}) \\ &= \sigma^2 \boldsymbol{\mu}_{i*} \boldsymbol{\mu}_{i*}^{\mathsf{T}} + \omega^2\, \mathrm{diag}(\boldsymbol{\mu}_{i*})\, \boldsymbol{K}(\boldsymbol{\alpha})\, \mathrm{diag}(\boldsymbol{\mu}_{i*}) + \rho^2\, \mathrm{diag}(\boldsymbol{\mu}_{i*}^{r_3}) = \boldsymbol{C}_i, \end{aligned} \tag{5}$$

say, where $\odot$ is the Hadamard (elementwise) product (Magnus and Neudecker, 1999, p. 45).

Similar to conventional linear mixed models, the variance is decomposed into components of dispersion corresponding to the different sources of variation. The three terms in (5) reflect, respectively, the variation between clusters, the serial covariance within clusters and the observation error.

The models accommodated by our approach hence extend the range of possible serial correlation patterns compared with the correlation structures usually considered in conventional generalized estimating equations. Particular covariance structures may be obtained by imposing restrictions on the linear filter parameter vector $\boldsymbol{\alpha}$ or on the dispersion parameters. Table 1 lists the more common covariance structures and the corresponding parameter restrictions.
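The matrix form (5) translates line by line into code. A minimal numpy sketch (our names; a finite filter $\boldsymbol{\alpha}$ is assumed, so the series defining $\boldsymbol{K}(\boldsymbol{\alpha})$ is a finite sum):

```python
import numpy as np

def filter_autocov(alpha, T):
    """K(alpha): (t, t') entry sum_s alpha_s alpha_{s+|t-t'|} (finite filter)."""
    alpha = np.asarray(alpha, float)
    acov = np.zeros(T)
    for h in range(min(T, len(alpha))):
        acov[h] = alpha[: len(alpha) - h] @ alpha[h:]
    t = np.arange(T)
    return acov[np.abs(t[:, None] - t[None, :])]

def cluster_cov(mu, alpha, sigma2, omega2, rho2, r3):
    """Model covariance C_i of (5), as a sum of dispersion components."""
    mu = np.asarray(mu, float)
    K = filter_autocov(alpha, len(mu))
    return (sigma2 * np.outer(mu, mu)        # between clusters
            + omega2 * np.outer(mu, mu) * K  # diag(mu) K(alpha) diag(mu)
            + rho2 * np.diag(mu ** r3))      # observation error
```

The elementwise product `np.outer(mu, mu) * K` is exactly $\mathrm{diag}(\boldsymbol{\mu})\boldsymbol{K}\,\mathrm{diag}(\boldsymbol{\mu})$, mirroring the Hadamard form in (5).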

## 3 Estimation

### 3.1 General issues

The set of parameters to be estimated is naturally partitioned into regression and association parameters, $\boldsymbol{\theta} = (\boldsymbol{\beta}^{\mathsf{T}}, \boldsymbol{\gamma}^{\mathsf{T}})^{\mathsf{T}}$, where the regression parameters $\boldsymbol{\beta}$ usually are those of interest, whereas the association parameters $\boldsymbol{\gamma}$, containing dispersion and correlation parameters, are often considered nuisance parameters. For estimation of the parameters we use a set of corresponding estimating functions denoted $\boldsymbol{\psi}(\boldsymbol{\theta}) = (\boldsymbol{\psi}_{\boldsymbol{\beta}}^{\mathsf{T}}, \boldsymbol{\psi}_{\boldsymbol{\gamma}}^{\mathsf{T}})^{\mathsf{T}}$.

The estimating function for the regression parameters is

$$\boldsymbol{\psi}_{\boldsymbol{\beta}} = \sum_{i=1}^{I} \boldsymbol{D}_i^{\mathsf{T}} \boldsymbol{C}_i^{-1} \{\boldsymbol{Y}_i - E(\boldsymbol{Y}_i)\}, \tag{6}$$

where $\boldsymbol{D}_i = \partial E(\boldsymbol{Y}_i)/\partial \boldsymbol{\beta}^{\mathsf{T}}$ and $\boldsymbol{C}_i = \mathrm{var}(\boldsymbol{Y}_i)$. Although (6) is similar to the well-known estimating function for the regression parameters from the conventional generalized estimating equations framework (Liang and Zeger, 1986), it corresponds to using the model covariance matrix (5) rather than the working covariance matrix. The conventional generalized estimating equations working covariance matrix is built around the working correlation matrix $\boldsymbol{R}(\alpha)$, so that

$$\mathrm{var}(\boldsymbol{Y}_i) = \phi\, \boldsymbol{A}_i^{1/2}(\boldsymbol{\mu}_i)\, \boldsymbol{R}(\alpha)\, \boldsymbol{A}_i^{1/2}(\boldsymbol{\mu}_i), \tag{7}$$

where $\phi$ is a dispersion parameter and $\boldsymbol{A}_i(\boldsymbol{\mu}_i)$ is the diagonal matrix of variance functions. In contrast, we emphasize the decomposition of the variance into components of dispersion and associate the process correlation matrix with an appropriate level in the hierarchy.

We use Pearson estimating functions for the estimation of the association parameters $\boldsymbol{\gamma}$. The entire vector of functions is denoted $\boldsymbol{\psi}_{\boldsymbol{\gamma}}(\boldsymbol{\beta}, \boldsymbol{\gamma})$, with $n$th component given by

$$\psi_{\gamma_n}(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \sum_{i=1}^{I} \mathrm{tr}\{\boldsymbol{W}_{in} (\boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}} - \boldsymbol{C}_i)\},$$

where $\boldsymbol{r}_i = \boldsymbol{Y}_i - E(\boldsymbol{Y}_i)$ and $\boldsymbol{W}_{in}$ are suitable weight matrices. This form emphasizes the model covariance matrix, in contrast to the more conventional expressions of the Pearson estimating function (Jørgensen and Knudsen, 2004).
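Both estimating functions are plain sums of per-cluster terms. A minimal numpy sketch (our names), assuming the caller supplies the derivative matrices $\boldsymbol{D}_i$, model covariances $\boldsymbol{C}_i$, weights $\boldsymbol{W}_{in}$ and residuals $\boldsymbol{r}_i$:

```python
import numpy as np

def psi_beta(D_list, C_list, r_list):
    """Quasi-score (6): sum_i D_i^T C_i^{-1} r_i, with r_i = Y_i - E(Y_i)."""
    return sum(D.T @ np.linalg.solve(C, r)
               for D, C, r in zip(D_list, C_list, r_list))

def psi_gamma_n(W_list, C_list, r_list):
    """Pearson component: sum_i tr{ W_in (r_i r_i^T - C_i) }."""
    return sum(np.trace(W @ (np.outer(r, r) - C))
               for W, C, r in zip(W_list, C_list, r_list))
```

Using `np.linalg.solve` instead of an explicit inverse of each $\boldsymbol{C}_i$ is the numerically preferred way to form $\boldsymbol{C}_i^{-1}\boldsymbol{r}_i$.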

The estimating functions $\boldsymbol{\psi}_{\boldsymbol{\beta}}$ and $\boldsymbol{\psi}_{\boldsymbol{\gamma}}$ are explained in more detail below, in Sections 3.4 and 3.5, respectively.

### 3.2 Sensitivity

Cox and Reid (1987) studied parameter orthogonality in the likelihood framework, corresponding to block diagonality of the Fisher information matrix. Jørgensen and Knudsen (2004) studied the corresponding property of nuisance parameter insensitivity in an estimating equation context, by means of the sensitivity matrix, defined by $\boldsymbol{S}_{\boldsymbol{\theta}} = E\{\nabla_{\boldsymbol{\theta}} \boldsymbol{\psi}(\boldsymbol{\theta})\}$, where $\nabla_{\boldsymbol{\theta}}$ is the gradient operator. The sensitivity matrix may be partitioned into blocks corresponding to $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ as follows:

$$\boldsymbol{S}_{\boldsymbol{\theta}} = \begin{bmatrix} \boldsymbol{S}_{\boldsymbol{\beta}} & \boldsymbol{S}_{\boldsymbol{\beta}\boldsymbol{\gamma}} \\ \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} & \boldsymbol{S}_{\boldsymbol{\gamma}} \end{bmatrix}.$$

Nuisance parameter insensitivity (for short denoted $\boldsymbol{\gamma}$-insensitivity) is defined by the upper right-hand block $\boldsymbol{S}_{\boldsymbol{\beta}\boldsymbol{\gamma}}$ being zero. First of all this implies efficiency-stable estimation of $\boldsymbol{\beta}$, meaning that the estimation of $\boldsymbol{\gamma}$ does not affect the asymptotic variance of $\hat{\boldsymbol{\beta}}$; see Section 3.7. Second, it simplifies the Newton scoring algorithm (Jørgensen and Knudsen, 2004), as detailed below. Third, it implies that $\hat{\boldsymbol{\beta}}_{\boldsymbol{\gamma}}$ varies only slowly with $\boldsymbol{\gamma}$, where $\hat{\boldsymbol{\beta}}_{\boldsymbol{\gamma}}$ is the estimate of $\boldsymbol{\beta}$ for known $\boldsymbol{\gamma}$. While nuisance parameter insensitivity does not ensure asymptotic independence of $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\gamma}}$, it does ease the computation of the asymptotic variance of $\hat{\boldsymbol{\gamma}}$.

Following Jørgensen and Knudsen (2004) it is easily seen that $\boldsymbol{\psi}_{\boldsymbol{\beta}}$ is $\boldsymbol{\gamma}$-insensitive. In fact, from (6) we see that $\boldsymbol{\psi}_{\boldsymbol{\beta}}$ depends on $\boldsymbol{\gamma}$ only via $\boldsymbol{C}_i^{-1}$, so $\nabla_{\boldsymbol{\gamma}} \boldsymbol{\psi}_{\boldsymbol{\beta}}$ is linear in the zero-mean residuals $\boldsymbol{Y}_i - E(\boldsymbol{Y}_i)$ and hence has zero mean, i.e. $\boldsymbol{S}_{\boldsymbol{\beta}\boldsymbol{\gamma}} = \boldsymbol{0}$.

In the rest of the paper we write $\hat{\boldsymbol{\beta}}$ for $\hat{\boldsymbol{\beta}}_{\boldsymbol{\gamma}}$ etc., whenever the meaning is unambiguous. The remaining blocks of $\boldsymbol{S}_{\boldsymbol{\theta}}$ are detailed along with the estimating functions in Sections 3.4 and 3.5.

### 3.3 Algorithm

Calculation of the parameter estimates is achieved by the Newton scoring algorithm (Jørgensen et al., 1996), in which we update the previous value of $\boldsymbol{\theta}$ by

$$\boldsymbol{\theta}^* = \boldsymbol{\theta} - \boldsymbol{S}_{\boldsymbol{\theta}}^{-1} \boldsymbol{\psi}(\boldsymbol{\theta}).$$

By the regularity of $\boldsymbol{S}_{\boldsymbol{\theta}}$, the $\boldsymbol{\gamma}$-insensitivity of $\boldsymbol{\psi}_{\boldsymbol{\beta}}$, and simple matrix manipulations, we may express the inverse of $\boldsymbol{S}_{\boldsymbol{\theta}}$ in blocks as follows:

$$\boldsymbol{S}_{\boldsymbol{\theta}}^{-1} = \begin{bmatrix} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & \boldsymbol{0} \\ -\boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \end{bmatrix}. \tag{8}$$

The algorithm therefore splits into a $\boldsymbol{\beta}$ step

$$\boldsymbol{\beta}^* = \boldsymbol{\beta} - \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{\psi}_{\boldsymbol{\beta}}(\boldsymbol{\theta}) \tag{9}$$

and a $\boldsymbol{\gamma}$ step

$$\begin{aligned} \boldsymbol{\gamma}^* &= \boldsymbol{\gamma} + \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{\psi}_{\boldsymbol{\beta}}(\boldsymbol{\theta}) - \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \boldsymbol{\psi}_{\boldsymbol{\gamma}}(\boldsymbol{\theta}) \\ &= \boldsymbol{\gamma} - \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \{\boldsymbol{\psi}_{\boldsymbol{\gamma}}(\boldsymbol{\theta}) - \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{\psi}_{\boldsymbol{\beta}}(\boldsymbol{\theta})\}. \end{aligned} \tag{10}$$

Following Jørgensen and Knudsen (2004) we insert $\boldsymbol{\beta}^*$ from (9) into equation (10). Since equation (9) can be rewritten as $\boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{\psi}_{\boldsymbol{\beta}}(\boldsymbol{\theta}) = \boldsymbol{\beta} - \boldsymbol{\beta}^*$, this makes $\boldsymbol{\psi}_{\boldsymbol{\gamma}}(\boldsymbol{\theta}) - \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{\psi}_{\boldsymbol{\beta}}(\boldsymbol{\theta}) \approx \boldsymbol{\psi}_{\boldsymbol{\gamma}}(\boldsymbol{\theta}^*)$, where $\boldsymbol{\theta}^* = (\boldsymbol{\beta}^{*\mathsf{T}}, \boldsymbol{\gamma}^{\mathsf{T}})^{\mathsf{T}}$ indicates that $\boldsymbol{\beta}^*$ is being used. Consequently the modified $\boldsymbol{\gamma}$ step becomes

$$\boldsymbol{\gamma}^* = \boldsymbol{\gamma} - \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \boldsymbol{\psi}_{\boldsymbol{\gamma}}(\boldsymbol{\theta}^*). \tag{11}$$

Analogously, we use the most recent estimate of $\boldsymbol{\gamma}$ when updating $\boldsymbol{\beta}$ in (9). This is however of less importance, due to the slow variation of $\hat{\boldsymbol{\beta}}_{\boldsymbol{\gamma}}$ with $\boldsymbol{\gamma}$. Jørgensen and Knudsen (2004) coined the scheme of alternating between (9) and (11) the chaser algorithm, with reference to the asymmetric interdependence between $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\gamma}}$. Our experience with the algorithm confirms this behaviour.
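The alternation between (9) and (11) can be sketched generically. This is our own minimal rendering (names are ours): the four callables return the estimating functions and sensitivity blocks at given parameter values, and each $\boldsymbol{\gamma}$ update is evaluated at the freshly updated $\boldsymbol{\beta}^*$, as in (11).

```python
import numpy as np

def chaser(beta, gamma, psi_beta, psi_gamma, S_beta, S_gamma,
           tol=1e-8, max_iter=100):
    """Chaser algorithm: alternate the beta step (9) with the
    modified gamma step (11).

    psi_beta(beta, gamma), psi_gamma(beta, gamma): estimating functions.
    S_beta(beta, gamma), S_gamma(beta, gamma): sensitivity blocks.
    """
    for _ in range(max_iter):
        # beta step (9): Newton scoring on psi_beta alone
        beta_new = beta - np.linalg.solve(S_beta(beta, gamma),
                                          psi_beta(beta, gamma))
        # modified gamma step (11): evaluated at the fresh beta_new
        gamma_new = gamma - np.linalg.solve(S_gamma(beta_new, gamma),
                                            psi_gamma(beta_new, gamma))
        step = max(np.max(np.abs(beta_new - beta)),
                   np.max(np.abs(gamma_new - gamma)))
        beta, gamma = beta_new, gamma_new
        if step < tol:
            break
    return beta, gamma
```

The stopping rule on the parameter change is one simple choice; a criterion on the estimating functions themselves would work equally well.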

### 3.4 Regression parameters β

Following Ma (1999), Ma et al. (2003) and Ma and Jørgensen (2007) we use the best linear unbiased predictor for predicting the random effects. The best linear unbiased predictor of a random variable $\boldsymbol{Q}$ given the observed data $\boldsymbol{Y}$ is defined by (Henderson, 1975; Ma, 1999)

$$\hat{\boldsymbol{Q}} = E(\boldsymbol{Q}) + \mathrm{cov}(\boldsymbol{Q}, \boldsymbol{Y})\, \mathrm{var}(\boldsymbol{Y})^{-1} \{\boldsymbol{Y} - E(\boldsymbol{Y})\}. \tag{12}$$
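Equation (12) is a single linear-algebra step once the moment quantities are available. A minimal sketch (our names; the caller supplies the moments):

```python
import numpy as np

def blup(EQ, cov_QY, var_Y, y, EY):
    """Best linear unbiased predictor (12):
    E(Q) + cov(Q, Y) var(Y)^{-1} {y - E(Y)}."""
    return EQ + cov_QY @ np.linalg.solve(var_Y, y - EY)
```

Solving the linear system rather than inverting $\mathrm{var}(\boldsymbol{Y})$ explicitly keeps the computation stable when the number of observations per cluster is large.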

The model specification using Tweedie distributions allows for derivation of the joint score function $u(\boldsymbol{\theta}; \boldsymbol{Y}, \boldsymbol{Q})$ (Ma and Jørgensen, 2007). We define unbiased estimating functions by substituting the random effects by their respective best linear unbiased predictors, i.e.

$$\boldsymbol{\psi}_{\boldsymbol{\beta}}(\boldsymbol{\theta}; \boldsymbol{Y}) = u(\boldsymbol{\theta}; \boldsymbol{Y}, \hat{\boldsymbol{Q}}).$$

It follows from (12) and the linearity of $E(\boldsymbol{Y} \mid \boldsymbol{Q})$ and $\mathrm{var}(\boldsymbol{Y} \mid \boldsymbol{Q})$ that the best linear unbiased predictor of $\boldsymbol{Q}$ given $\boldsymbol{Y}$ is of the form $\hat{\boldsymbol{Q}} = \boldsymbol{a} + \boldsymbol{B} \boldsymbol{Y}$, with $\boldsymbol{a}$ and $\boldsymbol{B}$ of suitable dimensions. Since $u$ is linear in both the observed and the latent variables, $\boldsymbol{Y}$ and $\boldsymbol{Q}$, we find that $u(\boldsymbol{\theta}; \boldsymbol{Y}, \hat{\boldsymbol{Q}})$ is the best linear unbiased predictor of the score function given the data. By (12) we therefore arrive at the conventional generalized estimating equations form of expression (6):

$$\boldsymbol{\psi}_{\boldsymbol{\beta}} = E(u) + \mathrm{cov}(u, \boldsymbol{Y})\, \boldsymbol{C}^{-1} \{\boldsymbol{Y} - E(\boldsymbol{Y})\} = \sum_{i=1}^{I} \boldsymbol{D}_i^{\mathsf{T}} \boldsymbol{C}_i^{-1} \{\boldsymbol{Y}_i - E(\boldsymbol{Y}_i)\}. \tag{13}$$

Here we have used the independence between clusters, along with the following Bartlett-type identity

$$\boldsymbol{D} = \nabla_{\boldsymbol{\beta}} E(\boldsymbol{Y}) = E(\boldsymbol{Y} u^{\mathsf{T}}) = \mathrm{cov}(\boldsymbol{Y}, u).$$

From (6) we furthermore obtain the sensitivity and variability as

$$\boldsymbol{S}_{\boldsymbol{\beta}} = -\boldsymbol{D}^{\mathsf{T}} \boldsymbol{C}^{-1} \boldsymbol{D}, \qquad \boldsymbol{V}_{\boldsymbol{\beta}} = \boldsymbol{D}^{\mathsf{T}} \boldsymbol{C}^{-1} \boldsymbol{D}. \tag{14}$$

The identity $\boldsymbol{V}_{\boldsymbol{\beta}} = -\boldsymbol{S}_{\boldsymbol{\beta}}$ is characteristic for quasi-score functions. We therefore conclude that $\boldsymbol{\psi}_{\boldsymbol{\beta}}$ is optimal within the class of linear estimating functions. This also follows from (13), as the best linear unbiased predictor is optimal among all linear predictors.

### 3.5 Association parameters γ

Our approach is akin to that of Ma and Jørgensen (2007), but deviates by allowing for correlation structures within clusters. Furthermore, Ma and Jørgensen used a closed-form ad hoc estimator for the association parameters. Our estimation of $\boldsymbol{\gamma}$ is based on Pearson estimating functions, following the path of Hall and Severini (1998) and Jørgensen and Knudsen (2004). For the $n$th component it is defined by

$$\begin{aligned} \psi_{\gamma_n}(\boldsymbol{\beta}, \boldsymbol{\gamma}) &= \sum_{i=1}^{I} \{\boldsymbol{r}_i^{\mathsf{T}} \boldsymbol{W}_{in} \boldsymbol{r}_i - \mathrm{tr}(\boldsymbol{W}_{in} \boldsymbol{C}_i)\} = \sum_{i=1}^{I} \{\mathrm{tr}(\boldsymbol{W}_{in} \boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}}) - \mathrm{tr}(\boldsymbol{W}_{in} \boldsymbol{C}_i)\} \\ &= \sum_{i=1}^{I} \mathrm{tr}\{\boldsymbol{W}_{in} (\boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}} - \boldsymbol{C}_i)\}, \end{aligned} \tag{15}$$

where $\boldsymbol{r}_i = \boldsymbol{Y}_i - E(\boldsymbol{Y}_i)$ and $\boldsymbol{W}_{in}$ are suitable weight matrices. By linearity of the trace, these estimating functions are clearly unbiased, since $E(\boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}}) = \boldsymbol{C}_i$.

In the conventional generalized estimating equations framework, the Pearson estimating function hinges the estimation of association parameters on a working correlation matrix, used for defining $\mathrm{var}(\boldsymbol{Y}_i)$ as shown in (7). In contrast, we emphasize the decomposition of the variance into components of dispersion and associate the process correlation matrix with an appropriate level in the hierarchy.

For $\boldsymbol{W}_{in}$ we use the weights proposed by Hall and Severini (1998),

$$\boldsymbol{W}_{in} = -\frac{\partial \boldsymbol{C}_i^{-1}}{\partial \gamma_n} = \boldsymbol{C}_i^{-1} \frac{\partial \boldsymbol{C}_i}{\partial \gamma_n} \boldsymbol{C}_i^{-1}.$$

In the normal case these weights lead to quasi-score functions and in general they provide estimating functions that resemble the structure of the estimating functions (6) for the regression parameters.
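The weights, and with them the sensitivity entries (16) below, only require the derivative matrices $\partial \boldsymbol{C}_i / \partial \gamma_n$. A small numpy sketch (our names), including a direct implementation of the identity $\boldsymbol{W}_{in} = -\partial \boldsymbol{C}_i^{-1}/\partial \gamma_n$:

```python
import numpy as np

def hall_severini_weight(C, dC):
    """W_in = C^{-1} (dC/dgamma_n) C^{-1} = -d(C^{-1})/dgamma_n."""
    C_inv = np.linalg.inv(C)
    return C_inv @ dC @ C_inv

def S_gamma_entry(C_list, dCn_list, dCm_list):
    """(16): {S_gamma}_nm = -sum_i tr(C^{-1} dC_n C^{-1} dC_m)."""
    total = 0.0
    for C, dCn, dCm in zip(C_list, dCn_list, dCm_list):
        C_inv = np.linalg.inv(C)
        total -= np.trace(C_inv @ dCn @ C_inv @ dCm)
    return total
```

For the dispersion parameters $\sigma^2$, $\omega^2$, $\rho^2$, the derivatives of (5) are simply the corresponding dispersion components, so `dC` is available in closed form.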

From (15) we may derive the $\boldsymbol{\theta}$-sensitivity of $\psi_{\gamma_n}$, namely

$$E\!\left(\frac{\partial}{\partial \theta_m} \psi_{\gamma_n}\right) = E\!\left[\frac{\partial}{\partial \theta_m} \sum_{i=1}^{I} \mathrm{tr}\{\boldsymbol{W}_{in} (\boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}} - \boldsymbol{C}_i)\}\right] = \sum_{i=1}^{I} \mathrm{tr}\!\left[\boldsymbol{W}_{in}\, E\!\left\{\frac{\partial}{\partial \theta_m}(\boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}} - \boldsymbol{C}_i)\right\}\right] = -\sum_{i=1}^{I} \mathrm{tr}\!\left(\boldsymbol{W}_{in} \frac{\partial \boldsymbol{C}_i}{\partial \theta_m}\right).$$

Here we have used that the derivatives of $\boldsymbol{r}_i$ do not depend on the data, so that $E\{\partial(\boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}})/\partial \theta_m\} = \boldsymbol{0}$ because $E(\boldsymbol{r}_i) = \boldsymbol{0}$.

The symmetry between the blocks $\boldsymbol{S}_{\boldsymbol{\gamma}}$ and $\boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}$ is highlighted by the forms of their $(n, m)$th entries,

$$\{\boldsymbol{S}_{\boldsymbol{\gamma}}\}_{nm} = -\sum_{i=1}^{I} \mathrm{tr}\!\left(\boldsymbol{C}_i^{-1} \frac{\partial \boldsymbol{C}_i}{\partial \gamma_n} \boldsymbol{C}_i^{-1} \frac{\partial \boldsymbol{C}_i}{\partial \gamma_m}\right) \tag{16}$$

and

$$\{\boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}\}_{nm} = -\sum_{i=1}^{I} \mathrm{tr}\!\left(\boldsymbol{C}_i^{-1} \frac{\partial \boldsymbol{C}_i}{\partial \gamma_n} \boldsymbol{C}_i^{-1} \frac{\partial \boldsymbol{C}_i}{\partial \beta_m}\right), \tag{17}$$

respectively.

### 3.6 Bias correction

The estimation of nuisance parameters may be subject to bias (McCullagh and Tibshirani, 1990; Jørgensen and Knudsen, 2004), caused by not taking into account the degrees of freedom spent on estimating the regression parameters.

In the spirit of Godambe (1960), Heyde (1997) and Jørgensen and Knudsen (2004), we adjust the estimating function for bias rather than the estimate. The corrected estimating function for $\gamma_n$ becomes

$$\begin{aligned} \breve{\psi}_{\gamma_n}(\boldsymbol{\beta}, \boldsymbol{\gamma}) &= \psi_{\gamma_n}(\boldsymbol{\beta}, \boldsymbol{\gamma}) + b_{\gamma_n}(\boldsymbol{\beta}, \boldsymbol{\gamma}) \\ &= \sum_{i=1}^{I} \mathrm{tr}\{\boldsymbol{W}_{in} (\boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}} - \boldsymbol{C}_i)\} + \mathrm{tr}\!\left\{\left(\sum_{i=1}^{I} \boldsymbol{D}_i^{\mathsf{T}} \boldsymbol{W}_{in} \boldsymbol{D}_i\right) \left(\sum_{i=1}^{I} \boldsymbol{D}_i^{\mathsf{T}} \boldsymbol{C}_i^{-1} \boldsymbol{D}_i\right)^{-1}\right\} \\ &= \sum_{i=1}^{I} \mathrm{tr}\{\boldsymbol{W}_{in} (\boldsymbol{r}_i \boldsymbol{r}_i^{\mathsf{T}} - \boldsymbol{C}_i)\} - \mathrm{tr}\!\left(\boldsymbol{J}_{\boldsymbol{\beta}}^{(\gamma_n)} \boldsymbol{J}_{\boldsymbol{\beta}}^{-1}\right), \end{aligned}$$

where $\boldsymbol{J}_{\boldsymbol{\beta}}^{(\gamma_n)} = -\sum_{i=1}^{I} \boldsymbol{D}_i^{\mathsf{T}} \boldsymbol{W}_{in} \boldsymbol{D}_i$ and $\boldsymbol{J}_{\boldsymbol{\beta}} = \sum_{i=1}^{I} \boldsymbol{D}_i^{\mathsf{T}} \boldsymbol{C}_i^{-1} \boldsymbol{D}_i$. The Godambe information $\boldsymbol{J}_{\boldsymbol{\theta}}$, see Section 3.7, plays a role in the estimating equation context analogous to that of the Fisher information in the likelihood framework, with $\boldsymbol{J}_{\boldsymbol{\beta}}^{-1}$ being the asymptotic variance of $\hat{\boldsymbol{\beta}}$. The penalty term therefore represents the $\gamma_n$-dependency of the covariance structure, weighted by the precision of the estimate $\hat{\boldsymbol{\beta}}$. In this way it corrects for the effect upon $\psi_{\gamma_n}$ of using $\hat{\boldsymbol{\beta}}$ in place of $\boldsymbol{\beta}$.

An alternative form of the penalty term may be more convenient in some applications.
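The penalty term in the bias-corrected estimating function is a single trace of two parameter-group-level sums. A minimal numpy sketch of $b_{\gamma_n}$ (our names; inputs as defined in the text):

```python
import numpy as np

def bias_correction(D_list, C_list, W_list):
    """Penalty term b_gamma_n:
    tr{ (sum_i D_i^T W_in D_i) (sum_i D_i^T C_i^{-1} D_i)^{-1} }."""
    num = sum(D.T @ W @ D for D, W in zip(D_list, W_list))
    J_beta = sum(D.T @ np.linalg.solve(C, D) for D, C in zip(D_list, C_list))
    return np.trace(num @ np.linalg.inv(J_beta))
```

Both sums are of dimension $\dim(\boldsymbol{\beta}) \times \dim(\boldsymbol{\beta})$, so the correction is cheap relative to the per-cluster computations.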

Since $b_{\gamma_n}$ does not depend on the data, we obtain the $\boldsymbol{\gamma}$- and $\boldsymbol{\beta}$-sensitivity of $\breve{\psi}_{\gamma_n}$ by amending $\boldsymbol{S}_{\boldsymbol{\gamma}}$ and $\boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}$, respectively, with the $\boldsymbol{\gamma}$- and $\boldsymbol{\beta}$-derivatives of the penalty term. For the $(n, m)$th entries we obtain

$$\frac{\partial}{\partial \gamma_m} b_{\gamma_n}(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \mathrm{tr}\!\left(\boldsymbol{J}_{\boldsymbol{\beta}}^{(\gamma_n)} \boldsymbol{J}_{\boldsymbol{\beta}}^{-1} \boldsymbol{J}_{\boldsymbol{\beta}}^{(\gamma_m)} \boldsymbol{J}_{\boldsymbol{\beta}}^{-1} - \boldsymbol{J}_{\boldsymbol{\beta}}^{(\gamma_n, \gamma_m)} \boldsymbol{J}_{\boldsymbol{\beta}}^{-1}\right) \tag{18}$$

and

$$\frac{\partial}{\partial \beta_m} b_{\gamma_n}(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \mathrm{tr}\!\left(\boldsymbol{J}_{\boldsymbol{\beta}}^{(\gamma_n)} \boldsymbol{J}_{\boldsymbol{\beta}}^{-1} \boldsymbol{J}_{\boldsymbol{\beta}}^{(\beta_m)} \boldsymbol{J}_{\boldsymbol{\beta}}^{-1} - \boldsymbol{J}_{\boldsymbol{\beta}}^{(\gamma_n, \beta_m)} \boldsymbol{J}_{\boldsymbol{\beta}}^{-1}\right). \tag{19}$$

The derivatives $\boldsymbol{J}_{\boldsymbol{\beta}}^{(\cdot)}$ are listed in Appendix 2.

### 3.7 Godambe information $\boldsymbol{J}_{\boldsymbol{\theta}}$

For joint inference on $\boldsymbol{\theta}$ we use the following asymptotic property, valid under mild regularity conditions:

$$\hat{\boldsymbol{\theta}} \sim N(\boldsymbol{\theta}, \boldsymbol{J}_{\boldsymbol{\theta}}^{-1}),$$

where $\boldsymbol{J}_{\boldsymbol{\theta}}^{-1} = \boldsymbol{S}_{\boldsymbol{\theta}}^{-1} \boldsymbol{V}_{\boldsymbol{\theta}} \boldsymbol{S}_{\boldsymbol{\theta}}^{-\mathsf{T}}$ is the inverse Godambe information, also known as the sandwich estimator.

The structure of the “bread” $\boldsymbol{S}_{\boldsymbol{\theta}}^{-1}$ in the sandwich estimator is given by (8), with blocks listed in (14), (16) and (17). The lower blocks, associated with $\boldsymbol{\gamma}$, are however amended with the bias-correction terms (18) and (19).

The “meat” part is the variability $\boldsymbol{V}_{\boldsymbol{\theta}} = \mathrm{var}\{\boldsymbol{\psi}(\boldsymbol{\theta})\}$ and may be structured analogously:

$$\boldsymbol{V}_{\boldsymbol{\theta}} = \begin{bmatrix} \boldsymbol{V}_{\boldsymbol{\beta}} & \boldsymbol{V}_{\boldsymbol{\beta}\boldsymbol{\gamma}} \\ \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}} & \boldsymbol{V}_{\boldsymbol{\gamma}} \end{bmatrix}, \tag{20}$$

where obviously $\boldsymbol{V}_{\boldsymbol{\beta}\boldsymbol{\gamma}} = \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}}$.

Using (8) and (20), $\boldsymbol{J}_{\boldsymbol{\theta}}^{-1}$ may be written as

$$\begin{aligned} \boldsymbol{J}_{\boldsymbol{\theta}}^{-1} &= \begin{bmatrix} \boldsymbol{J}^{\boldsymbol{\beta}} & \boldsymbol{J}^{\boldsymbol{\beta}\boldsymbol{\gamma}} \\ \boldsymbol{J}^{\boldsymbol{\gamma}\boldsymbol{\beta}} & \boldsymbol{J}^{\boldsymbol{\gamma}} \end{bmatrix} = \begin{bmatrix} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & \boldsymbol{0} \\ -\boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \end{bmatrix} \begin{bmatrix} \boldsymbol{V}_{\boldsymbol{\beta}} & \boldsymbol{V}_{\boldsymbol{\beta}\boldsymbol{\gamma}} \\ \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}} & \boldsymbol{V}_{\boldsymbol{\gamma}} \end{bmatrix} \begin{bmatrix} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & -\boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}} \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \\ \boldsymbol{0} & \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \end{bmatrix} \\ &= \begin{bmatrix} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{V}_{\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} (-\boldsymbol{V}_{\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}} + \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}}) \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \\ \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} (-\boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{V}_{\boldsymbol{\beta}} + \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}}) \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} (\boldsymbol{L} + \boldsymbol{V}_{\boldsymbol{\gamma}}) \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \end{bmatrix}, \end{aligned} \tag{21}$$

where $\boldsymbol{L} = \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{V}_{\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}} - \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}} - \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}}$.

The upper left block of (21) shows that the asymptotic variance of $\hat{\boldsymbol{\beta}}$ is unaffected by the estimation of $\boldsymbol{\gamma}$. On the other hand, the quantity $\boldsymbol{L}$ in the lower right block represents the inflation of the asymptotic variance of $\hat{\boldsymbol{\gamma}}$ caused by the estimation of $\boldsymbol{\beta}$. By (14), $\boldsymbol{V}_{\boldsymbol{\beta}} = -\boldsymbol{S}_{\boldsymbol{\beta}}$, and therefore the upper right block of (21) reduces to $\boldsymbol{S}_{\boldsymbol{\beta}}^{-1} (\boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}} + \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}}) \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1}$. If $\boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}} = -\boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}$, this block would vanish and $\boldsymbol{\psi}_{\boldsymbol{\gamma}}$ would have been a quasi-score. In that sense, the quantity $\boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} + \boldsymbol{V}_{\boldsymbol{\gamma}\boldsymbol{\beta}}$ measures how much $\boldsymbol{\psi}_{\boldsymbol{\gamma}}$ deviates from being a quasi-score.

Except for $\boldsymbol{V}_{\boldsymbol{\beta}}$, the blocks of $\boldsymbol{V}_{\boldsymbol{\theta}}$ rely on 3rd and 4th moments. For practical use this seems less tractable, and we will instead employ empirical variabilities, defined by $\hat{\boldsymbol{V}}_{\boldsymbol{\theta},\mathrm{Emp}} = \sum_{i=1}^{I} \boldsymbol{\psi}_i(\hat{\boldsymbol{\theta}})\, \boldsymbol{\psi}_i(\hat{\boldsymbol{\theta}})^{\mathsf{T}}$, where $\boldsymbol{\psi}_i$ denotes the contribution to $\boldsymbol{\psi}$ from the $i$th cluster. By plugging in $\hat{\boldsymbol{\theta}}$ we obtain the empirical sandwich estimator,

$$\hat{\boldsymbol{J}}_{\boldsymbol{\theta},\mathrm{Emp}}^{-1} = \begin{bmatrix} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \hat{\boldsymbol{V}}_{\boldsymbol{\beta},\mathrm{Emp}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} (-\hat{\boldsymbol{V}}_{\boldsymbol{\beta},\mathrm{Emp}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}}^{\mathsf{T}} + \hat{\boldsymbol{V}}_{\boldsymbol{\gamma}\boldsymbol{\beta},\mathrm{Emp}}^{\mathsf{T}}) \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \\ \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} (-\boldsymbol{S}_{\boldsymbol{\gamma}\boldsymbol{\beta}} \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} \hat{\boldsymbol{V}}_{\boldsymbol{\beta},\mathrm{Emp}} + \hat{\boldsymbol{V}}_{\boldsymbol{\gamma}\boldsymbol{\beta},\mathrm{Emp}}) \boldsymbol{S}_{\boldsymbol{\beta}}^{-1} & \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} (\hat{\boldsymbol{L}} + \hat{\boldsymbol{V}}_{\boldsymbol{\gamma},\mathrm{Emp}}) \boldsymbol{S}_{\boldsymbol{\gamma}}^{-1} \end{bmatrix}, \tag{22}$$

where $\hat{\boldsymbol{L}}$ is defined as $\boldsymbol{L}$ in (21) with the empirical variabilities inserted.
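In code, (22) need not be assembled block by block: with the full (bias-corrected) sensitivity matrix $\boldsymbol{S}_{\boldsymbol{\theta}}$ and the per-cluster contributions $\boldsymbol{\psi}_i(\hat{\boldsymbol{\theta}})$ stacked for both parameter groups, the whole matrix is a single sandwich product. A minimal sketch under those assumptions (names are ours):

```python
import numpy as np

def empirical_sandwich(S_theta, psi_contribs):
    """Empirical sandwich estimator: S^{-1} (sum_i psi_i psi_i^T) S^{-T}.

    S_theta      : full sensitivity matrix for theta = (beta, gamma)
    psi_contribs : per-cluster estimating-function contributions psi_i(theta_hat)
    """
    V_emp = sum(np.outer(p, p) for p in psi_contribs)  # empirical variability
    S_inv = np.linalg.inv(S_theta)
    return S_inv @ V_emp @ S_inv.T
```

Standard errors are then the square roots of the diagonal of the returned matrix.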

We may replace