
# Variational Inference for Generalized Linear Mixed Models Using Partially Noncentered Parametrizations

Linda S. L. Tan is a Ph.D. student and David J. Nott is Associate Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore.
###### Abstract

The effects of different parametrizations on the convergence of Bayesian computational algorithms for hierarchical models are well explored. Techniques such as centering, noncentering and partial noncentering can be used to accelerate convergence in MCMC and EM algorithms but are still not well studied for variational Bayes (VB) methods. As a fast deterministic approach to posterior approximation, VB is attracting increasing interest due to its suitability for large high-dimensional data. Use of different parametrizations for VB has not only computational but also statistical implications, as different parametrizations are associated with different factorized posterior approximations. We examine the use of partially noncentered parametrizations in VB for generalized linear mixed models (GLMMs). Our paper makes four contributions. First, we show how to implement an algorithm called nonconjugate variational message passing for GLMMs. Second, we show that the partially noncentered parametrization can adapt to the quantity of information in the data and determine a parametrization close to optimal. Third, we show that partial noncentering can accelerate convergence and produce more accurate posterior approximations than centering or noncentering. Finally, we demonstrate how the variational lower bound, produced as part of the computation, can be useful for model selection.

*Statistical Science* 2013, Vol. 28, No. 2, 168–188. DOI: 10.1214/13-STS418.


Linda S. L. Tan (e-mail: g0900760@nus.edu.sg) and David J. Nott (e-mail: standj@nus.edu.sg).

Key words and phrases: Variational Bayes, hierarchical centering, variational message passing, nonconjugate models, longitudinal data analysis.

## 1 Introduction

The convergence of Markov chain Monte Carlo (MCMC) algorithms depends greatly on the choice of parametrization, and simple reparametrizations can often give improved convergence. Here we investigate the use of centered, noncentered and partially noncentered parametrizations of hierarchical models in the context of variational Bayes (VB) (Attias (1999)). As a fast deterministic approach to approximation of the posterior distribution in Bayesian inference, VB is attracting increasing interest due to its suitability for large high-dimensional data (see, e.g., Braun and McAuliffe (2010); Hoffman et al. (2012)). VB methods approximate the intractable posterior by a factorized distribution which can be represented by a directed graph, and optimization of the factorized variational posterior can be decomposed into local computations that involve only neighboring nodes. Variational message passing (Winn and Bishop (2005)) is an algorithmic implementation of VB that can be applied to a general class of conjugate-exponential models (Attias (2000); Ghahramani and Beal (2001)). Knowles and Minka (2011) proposed an algorithm called nonconjugate variational message passing to extend variational message passing to nonconjugate models.

We examine the use of partially noncentered parametrizations in VB for generalized linear mixed models (GLMMs). Our paper makes four contributions. First, we show how to implement nonconjugate variational message passing for GLMMs. Second, we show that the partially noncentered parametrization is able to adapt to the quantity of information in the data, so that a choice between centering and noncentering need not be made in advance; the data determine a parametrization close to optimal. Third, we show that in addition to accelerating convergence, partial noncentering is a good strategy statistically for VB, producing more accurate approximations to the posterior than either centering or noncentering. Finally, we demonstrate how the variational lower bound, which is produced as part of the computation, can be useful for model selection.

GLMMs extend generalized linear models by the inclusion of random effects to account for correlation of observations in grouped data and are of wide applicability. Estimation of GLMMs using maximum likelihood is challenging, as the integral over random effects is intractable. Methods involving numerical quadrature or MCMC to approximate these integrals are computationally intensive. Various approximate methods such as penalized quasi-likelihood (Breslow and Clayton (1993)), Laplace approximation and its extension (Raudenbush, Yang and Yosef (2000)) and Gaussian variational approximation (Ormerod and Wand (2012)) have been developed. Fong, Rue and Wakefield (2010) considered a Bayesian approach using integrated nested Laplace approximations. We show how to fit GLMMs using nonconjugate variational message passing, focusing on Poisson and logistic mixed models and their applications in longitudinal data analysis.

The literature on parametrization of hierarchical models, including partial noncentering techniques for accelerating MCMC algorithms, is inspired by earlier similar work for the expectation maximization (EM) algorithm (see Meng and van Dyk, 1997, 1999; Liu and Wu (1999)). Gelfand, Sahu and Carlin (1995, 1996) proposed hierarchical centering for normal linear mixed models and GLMMs to improve the slow mixing in MCMC algorithms due to high correlations between model parameters. Papaspiliopoulos, Roberts and Sköld (2003, 2007) demonstrated that centering and noncentering play complementary roles in boosting MCMC efficiency and that neither is uniformly effective. They considered the partially noncentered parametrization, which is data dependent and lies on the continuum between the centered and noncentered parametrizations. Extending this idea, Christensen, Roberts and Sköld (2006) devised reparametrization techniques to improve performance of Hastings-within-Gibbs algorithms for spatial GLMMs. Yu and Meng (2011) introduced a strategy for boosting MCMC efficiency via interweaving the centered and noncentered parametrizations to reduce dependence between draws. Parameter-expanded VB methods were proposed by Qi and Jaakkola (2006) to reduce coupling in updates and speed up VB.

The idea of partial noncentering is to introduce a tuning parameter via reparametrization of the model and then seek its optimal value for fastest convergence. For the normal hierarchical model, Papaspiliopoulos, Roberts and Sköld (2003) showed that the partially noncentered parametrization has convergence properties superior to those of the centered and noncentered parametrizations for the Gibbs sampler. As the rate of convergence of an algorithm based on VB is equal to that of the corresponding Gibbs sampler when the target distribution is Gaussian (Tan and Nott (2013)), partial noncentering will similarly outperform centering and noncentering in the context of VB for the normal hierarchical model. This provides motivation to consider partial noncentering in the VB context. We illustrate this idea with the following example.

### 1.1 Motivating Example: Linear Mixed Model

Consider the linear mixed model

$$y_i = X_i\beta + X_iu_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2 I),\quad i = 1,\dots,n, \tag{2}$$

where $y_i$ is a vector of length $n_i$, $\beta$ is a vector of length $k$ of fixed effects, $X_i$ is an $n_i \times k$ matrix of covariates and $u_i$ is a vector of length $k$ of random effects independently distributed as $N(0, D)$. For simplicity, we specify a constant prior on $\beta$ and assume $\sigma^2$ and $D$ are known. Let

$$\alpha_i = \beta + u_i \quad\text{and}\quad \tilde\alpha_i = \alpha_i - W_i\beta,\quad i = 1,\dots,n,$$

where $W_i$ is a $k \times k$ tuning matrix to be specified. $W_i = 0$ corresponds to the centered and $W_i = I$ to the noncentered parametrization. For each $i$,

$$y_i = X_iW_i\beta + X_i\tilde\alpha_i + \varepsilon_i$$

and

$$\tilde\alpha_i \sim N\bigl((I - W_i)\beta,\, D\bigr).$$

This is the partially noncentered parametrization and the set of unknown parameters is $\theta = (\beta, \tilde\alpha)$, where $\tilde\alpha = (\tilde\alpha_1^T, \dots, \tilde\alpha_n^T)^T$. Let $y = (y_1^T, \dots, y_n^T)^T$ denote the observed data. Of interest is the posterior distribution of $\theta$, $p(\theta|y)$.

Suppose $p(\theta|y)$ is not analytically tractable. In the variational approach, we approximate $p(\theta|y)$ by a distribution $q(\theta)$ for which inference is more tractable, and $q$ is chosen to minimize the Kullback–Leibler divergence between $q(\theta)$ and $p(\theta|y)$, given by

$$\int q(\theta)\log\frac{q(\theta)}{p(\theta|y)}\,d\theta = \int q(\theta)\log\frac{q(\theta)}{p(y,\theta)}\,d\theta + \log p(y),$$

where $p(y) = \int p(y,\theta)\,d\theta$ is the marginal likelihood. Since the Kullback–Leibler divergence is nonnegative,

$$\log p(y) \;\ge\; \int q(\theta)\log\frac{p(y,\theta)}{q(\theta)}\,d\theta \;=\; E_q\{\log p(y,\theta)\} - E_q\{\log q(\theta)\} \;=\; L,$$

where $L$ is a lower bound on the log marginal likelihood. Maximization of $L$ is equivalent to minimization of the Kullback–Leibler divergence between $q(\theta)$ and $p(\theta|y)$. In VB, $q(\theta)$ is assumed to be of a factorized form, say, $\prod_{l=1}^m q_l(\theta_l)$ for some partition $\{\theta_1,\dots,\theta_m\}$ of $\theta$. Maximization of $L$ over each of $q_1,\dots,q_m$ leads to optimal densities satisfying $q_l(\theta_l) \propto \exp[E_{-\theta_l}\{\log p(y,\theta)\}]$, $l = 1,\dots,m$, where $E_{-\theta_l}$ denotes expectation with respect to the density $\prod_{j \ne l} q_j(\theta_j)$. See Ormerod and Wand (2010) for an explanation of variational approximation methods that is very accessible to statisticians.
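The bound $\log p(y) \ge L$, with equality exactly when $q$ equals the posterior, can be checked in closed form on a toy conjugate Gaussian model (a sketch; the model, data value and suboptimal $q$ below are illustrative choices, not from the paper):

```python
import numpy as np

# Toy conjugate model: y | theta ~ N(theta, sigma2), theta ~ N(0, tau2).
# For a Gaussian q(theta) = N(m, s2), the lower bound
#   L = E_q{log p(y, theta)} - E_q{log q(theta)}
# is available in closed form, and log p(y) = log N(y; 0, sigma2 + tau2).
sigma2, tau2, y = 1.0, 2.0, 1.5

def lower_bound(m, s2):
    # E_q{log p(y | theta)}
    e_lik = -0.5 * np.log(2 * np.pi * sigma2) - ((y - m) ** 2 + s2) / (2 * sigma2)
    # E_q{log p(theta)}
    e_pri = -0.5 * np.log(2 * np.pi * tau2) - (m ** 2 + s2) / (2 * tau2)
    # -E_q{log q(theta)}: entropy of N(m, s2)
    ent = 0.5 * np.log(2 * np.pi * s2) + 0.5
    return e_lik + e_pri + ent

log_py = -0.5 * np.log(2 * np.pi * (sigma2 + tau2)) - y ** 2 / (2 * (sigma2 + tau2))

# Exact posterior N(m_post, s2_post); at q = posterior, L attains log p(y).
s2_post = 1.0 / (1.0 / sigma2 + 1.0 / tau2)
m_post = s2_post * y / sigma2
```

Evaluating `lower_bound` at `(m_post, s2_post)` reproduces `log_py`, while any other `(m, s2)` gives a strictly smaller value.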

If we apply VB to (2) and approximate the posterior with $q(\theta) = q(\beta)\prod_{i=1}^n q(\tilde\alpha_i)$, the optimal densities can be derived to be Gaussian, $q(\beta) = N(\mu_\beta^q, \Sigma_\beta^q)$ and $q(\tilde\alpha_i) = N(\mu_{\tilde\alpha_i}^q, \Sigma_{\tilde\alpha_i}^q)$ for $i = 1,\dots,n$. The expressions for the variational parameters $\mu_\beta^q$, $\Sigma_\beta^q$ and $\mu_{\tilde\alpha_i}^q$, $\Sigma_{\tilde\alpha_i}^q$, $i = 1,\dots,n$, are, however, dependent on each other and can be computed by an iterative scheme such as that given in Algorithm 1.

Observe that Algorithm 1 converges in one iteration if $W_i$ is suitably chosen for each $i$, that is, if

$$W_i = \left(\frac{X_i^TX_i}{\sigma^2} + D^{-1}\right)^{-1}D^{-1} \quad\text{for } i = 1,\dots,n. \tag{5}$$

For this specification of the tuning parameters, partial noncentering gives more rapid convergence than centering or noncentering. Moreover, it can be shown that the true posteriors are recovered in this partially noncentered parametrization, so that a better fit is achieved than in the centered or noncentered parametrizations. This example suggests that with careful tuning of $W_i$, $i = 1,\dots,n$, the partially noncentered parametrization can potentially outperform the centered and noncentered parametrizations in the VB context.
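The adaptive behaviour of the tuning matrices in (5) can be illustrated with a small sketch. The scalar random-intercept simplification (taking $X_i = 1_{n_i}$ so that $W_i$ is a scalar) is our own illustrative assumption, not the paper's algorithm:

```python
# Sketch: behaviour of the tuning matrix in (5) for a random-intercept model,
# taking X_i = 1_{n_i} (an illustrative assumption), so that W_i in (5)
# reduces to the scalar (n_i / sigma2 + 1/D)^{-1} / D.
def tuning_w(n_i, sigma2, D):
    info = n_i / sigma2          # information in y_i about its random effect
    return 1.0 / (info + 1.0 / D) / D

# Large random-effect variance D: W_i is near 0 and centering is preferred.
w_centered = tuning_w(n_i=10, sigma2=1.0, D=1e6)

# Large error variance sigma2: W_i is near 1 and noncentering is preferred.
w_noncentered = tuning_w(n_i=10, sigma2=1e6, D=1.0)
```

The tuning parameter thus interpolates smoothly between the centered ($W_i = 0$) and noncentered ($W_i = 1$) extremes, driven by the relative sizes of the two variance components.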

The rest of the paper is organized as follows. Section 2 specifies the GLMM and priors used. Section 3 describes the partially noncentered parametrization for GLMMs. Section 4 describes the nonconjugate variational message passing algorithm for fitting GLMMs. Section 5 discusses briefly the use of the variational lower bound for model selection and Section 6 considers examples including real and simulated data. Section 7 concludes.

## 2 The Generalized Linear Mixed Model

Consider clustered data where $y_{ij}$ denotes the $j$th response from cluster $i$, $j = 1,\dots,n_i$, $i = 1,\dots,n$. Conditional on the $r$-dimensional random effects $u_i$, drawn independently from $N(0, D)$, $y_{ij}$ is independently distributed from some exponential family distribution with density

$$f(y_{ij}|u_i) = \exp\left\{\frac{y_{ij}\zeta_{ij} - b(\zeta_{ij})}{a(\phi)} + c(y_{ij},\phi)\right\}, \tag{6}$$

where $\zeta_{ij}$ is the canonical parameter, $\phi$ is the dispersion parameter, and $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are functions specific to the family. The conditional mean of $y_{ij}$, $\mu_{ij} = E(y_{ij}|u_i)$, is assumed to depend on the fixed and random effects through the linear predictor

$$\eta_{ij} = X_{Rij}^T\beta_R + X_{Gij}^T\beta_G + X_{Rij}^Tu_i$$

with $g(\mu_{ij}) = \eta_{ij}$ for some known link function $g(\cdot)$. Here, $X_{Rij}$ and $X_{Gij}$ are $r \times 1$ and $s \times 1$ vectors of covariates and $\beta = (\beta_R^T, \beta_G^T)^T$ is a vector of fixed effects. We considered the above breakdown (see Zhao et al. (2006)) for the linear predictor to allow for centering. For the $i$th cluster, let $y_i = (y_{i1},\dots,y_{in_i})^T$, $\eta_i = (\eta_{i1},\dots,\eta_{in_i})^T$, $X_{Ri} = [X_{Ri1} \cdots X_{Rin_i}]^T$ and $X_{Gi} = [X_{Gi1} \cdots X_{Gin_i}]^T$. Let $1_{n_i}$ denote the column vector of length $n_i$ with all entries equal to 1. We assume that the first column of $X_{Ri}$ is $1_{n_i}$ if $X_{Ri}$ is not a zero matrix. For Bayesian inference, we specify prior distributions on the fixed effects $\beta$ and the random effects covariance matrix $D$. In this paper, we focus on responses from the Bernoulli and Poisson families and the dispersion parameter is one in these cases, so we do not consider a prior for $\phi$. We assume a diffuse prior $\beta \sim N(0, \Sigma_\beta)$ for $\beta$ and an independent inverse Wishart prior $D \sim IW(\nu, S)$ for $D$. Following the suggestion by Kass and Natarajan (2006), we let the scale matrix $S$ be determined from first-stage data variability; in particular, $S$ is taken proportional to $\hat R$, where

$$\hat R = c\left(\frac{1}{n}\sum_{i=1}^n X_{Ri}^TM_i(\hat\beta)X_{Ri}\right)^{-1}, \tag{7}$$

where $M_i(\hat\beta)$ denotes the diagonal generalized linear model weight matrix with diagonal elements $\{v(\mu_{ij})g'(\mu_{ij})^2\}^{-1}$, $v(\cdot)$ is the variance function based on $b(\cdot)$ in (6) and $g(\cdot)$ is the link function. Here, $\hat\beta$ is an estimate of the regression coefficients from the generalized linear model obtained by pooling all data and setting $u_i = 0$ for all $i$, and $M_i(\hat\beta)$ is evaluated at the corresponding fitted means. The value of $c$ is an inflation factor representing the amount by which within-cluster variability should be increased in determining $\hat R$. We used the same value of $c$ in all examples.
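A minimal sketch of the computation of $\hat R$ in (7) for Poisson responses with the log link, where the GLM weight $\{v(\mu)g'(\mu)^2\}^{-1}$ reduces to $\mu$. The data, pooled estimate `beta_hat` and value of `c` below are all hypothetical, for illustration only:

```python
import numpy as np

# Sketch of the scale-matrix computation (7) for Poisson responses with the
# log link, where the GLM weight {v(mu) g'(mu)^2}^{-1} equals mu.
# Data, beta_hat and c are hypothetical.
rng = np.random.default_rng(0)
n, r = 20, 2                      # clusters, random-effect dimension
beta_hat = np.array([0.5, -0.2])  # hypothetical pooled-GLM estimate
c = 1.0                           # illustrative inflation factor

XR = [np.column_stack([np.ones(5), rng.normal(size=5)]) for _ in range(n)]

avg = np.zeros((r, r))
for XRi in XR:
    mu_i = np.exp(XRi @ beta_hat)     # fitted means with u_i = 0
    Mi = np.diag(mu_i)                # Poisson GLM weight matrix M_i
    avg += XRi.T @ Mi @ XRi / n
R_hat = c * np.linalg.inv(avg)        # equation (7)
```

The resulting `R_hat` is symmetric positive definite whenever the weighted design matrices have full column rank.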

## 3 A Partially Noncentered Parametrization for the Generalized Linear Mixed Model

We introduce the following partially noncentered parametrization for the GLMM. For each $i$, the linear predictor is $\eta_i = X_{Ri}\beta_R + X_{Gi}\beta_G + X_{Ri}u_i$. Let

$$X_{Gi}\beta_G = X_{G1i}\beta_{G1} + X_{G2i}\beta_{G2} = 1_{n_i}x_{G1i}^T\beta_{G1} + X_{G2i}\beta_{G2},$$

where $\beta_{G1}$ is the vector consisting of all parameters corresponding to subject-specific covariates (i.e., the rows of $X_{G1i}$ are all the same and equal to the vector $x_{G1i}^T$, say). Recall that the first column of $X_{Ri}$ is $1_{n_i}$ if $X_{Ri}$ is not a zero matrix. We have

$$\eta_i = X_{Ri}(C_i\beta_{RG1} + u_i) + X_{G2i}\beta_{G2},$$

where

$$C_i = \left[\, I_r \;\; \begin{matrix} x_{G1i}^T \\ 0 \end{matrix}\,\right] \quad\text{and}\quad \beta_{RG1} = \begin{bmatrix}\beta_R \\ \beta_{G1}\end{bmatrix}.$$

Let $\alpha_i = C_i\beta_{RG1} + u_i$ and $\tilde\alpha_i = \alpha_i - W_iC_i\beta_{RG1}$, where $W_i$ is an $r \times r$ matrix to be specified. The proportion of $C_i\beta_{RG1}$ subtracted from each $\alpha_i$ is allowed to vary with $i$, as in Papaspiliopoulos, Roberts and Sköld (2003), to reflect the varying informativity of each response $y_i$ about the underlying $\alpha_i$. $W_i = 0$ corresponds to the centered and $W_i = I$ to the noncentered parametrization. Finally,

$$\eta_i = X_{Ri}(\tilde\alpha_i + W_iC_i\beta_{RG1}) + X_{G2i}\beta_{G2} = V_i\beta + X_{Ri}\tilde\alpha_i,$$

where $V_i = [X_{Ri}W_iC_i \;\; X_{G2i}]$ and $\beta = (\beta_{RG1}^T, \beta_{G2}^T)^T$. We refer to (3) as the partially noncentered parametrization. Let $\tilde\alpha = (\tilde\alpha_1^T,\dots,\tilde\alpha_n^T)^T$ and let $\theta = \{\beta, \tilde\alpha, D\}$ denote the set of unknown parameters in the GLMM. The joint distribution of $y$ and $\theta$ is

$$p(y,\theta) = \left\{\prod_{i=1}^n p(y_i|\beta,\tilde\alpha_i)\,p(\tilde\alpha_i|\beta,D)\right\}p(\beta|\Sigma_\beta)\,p(D|\nu,S).$$

Figure 1 shows the factor graph for $p(y,\theta)$, where there is a node (circle) for every variable, which is shaded in the case of observed variables, and a node (filled rectangle) for each factor in the joint distribution. Constants or hyperparameters are denoted with smaller filled circles. Each factor node is connected by undirected links to all of the variable nodes on which that factor depends (see Bishop (2006)).
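The algebra of the reparametrization can be verified numerically: the partially noncentered predictor $V_i\beta + X_{Ri}\tilde\alpha_i$ must equal $X_{Ri}(C_i\beta_{RG1} + u_i) + X_{G2i}\beta_{G2}$ for any tuning matrix $W_i$. The dimensions and random inputs below are hypothetical:

```python
import numpy as np

# Check that the partially noncentered predictor V_i beta + X_Ri alpha_tilde_i
# reproduces eta_i = X_Ri (C_i beta_RG1 + u_i) + X_G2i beta_G2.
# All dimensions and inputs are hypothetical.
rng = np.random.default_rng(1)
n_i, r, g1, g2 = 6, 2, 3, 2

xG1 = rng.normal(size=g1)                      # shared subject-level covariates
XRi = np.column_stack([np.ones(n_i), rng.normal(size=(n_i, r - 1))])
XG2i = rng.normal(size=(n_i, g2))
beta_RG1 = rng.normal(size=r + g1)
beta_G2 = rng.normal(size=g2)
u_i = rng.normal(size=r)
Wi = np.diag(rng.uniform(0, 1, size=r))        # arbitrary tuning matrix

# C_i = [I_r | (xG1^T stacked over zeros)], exploiting the intercept column.
Ci = np.hstack([np.eye(r), np.vstack([xG1, np.zeros((r - 1, g1))])])

alpha_i = Ci @ beta_RG1 + u_i
alpha_tilde_i = alpha_i - Wi @ Ci @ beta_RG1
eta_centered = XRi @ alpha_i + XG2i @ beta_G2

Vi = np.hstack([XRi @ Wi @ Ci, XG2i])
beta = np.concatenate([beta_RG1, beta_G2])
eta_pnc = Vi @ beta + XRi @ alpha_tilde_i
```

The identity holds for any $W_i$, which is what makes $W_i$ a free tuning parameter: it changes the parametrization, not the model.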

Next, we consider specification of the tuning parameter $W_i$, referring to the linear mixed model example in Section 1.1, which is a special case of the GLMM in (6) with an identity link.

### 3.1 Specification of Tuning Parameter

It is interesting to note that for the linear mixed model in (2), the expression for $W_i$ leading to optimal performance in VB and in the Gibbs sampling algorithm is exactly the same (see Papaspiliopoulos, Roberts and Sköld, 2003). Gelfand, Sahu and Carlin (1995) also observed the importance of $W_i$ in assessing convergence properties of the centered parametrization. They showed that, for all $i$, $W_i$ is close to zero (centering is more efficient) when the generalized variance of $D$ is large. On the other hand, $W_i$ is close to $I$ (noncentering works better) when the error variance $\sigma^2$ is large. Outside the Gaussian context, Papaspiliopoulos, Roberts and Sköld (2003) considered partial noncentering for the spatial GLMM and specified the tuning parameters by using a quadratic expansion of the log-likelihood to obtain an indication of the information present in $y_i$. Observe that $W_i$ in (5) can be expressed as

$$W_i = (I_f + D^{-1})^{-1}D^{-1}, \tag{10}$$

if $\ell$ denotes the log-likelihood of $y_i$ and $I_f = -\partial^2\ell/\partial\alpha_i\,\partial\alpha_i^T$ is the associated information matrix. We use (10) to extend partially noncentered parametrizations to GLMMs and consider the specification of $I_f$ for responses from the Bernoulli and Poisson families in particular.

Recall that the linear predictor can be expressed as $\eta_i = X_{Ri}\alpha_i + X_{G2i}\beta_{G2}$. For Poisson responses with the log link function, we allow for an offset so that the conditional mean of $y_{ij}$ is $E_{ij}\exp(\eta_{ij})$. Let $E_i = (E_{i1},\dots,E_{in_i})^T$. We have

$$\ell = y_i^T(\log E_i + \eta_i) - E_i^T\exp(\eta_i) - 1_{n_i}^T\log(y_i!) \quad\text{and}\quad I_f = \sum_{j=1}^{n_i}E_{ij}\exp(\eta_{ij})X_{Rij}X_{Rij}^T \approx \sum_{j=1}^{n_i}y_{ij}X_{Rij}X_{Rij}^T, \tag{11}$$

if we approximate the conditional mean with the response. For Bernoulli responses with the logit link function, we have

$$\ell = y_i^T\eta_i - 1_{n_i}^T\log\{1_{n_i} + \exp(\eta_i)\} \quad\text{and}\quad I_f = \sum_{j=1}^{n_i}\frac{\exp(\eta_{ij})}{\{1+\exp(\eta_{ij})\}^2}X_{Rij}X_{Rij}^T. \tag{12}$$

The specification of $W_i$ depends on the random effects covariance $D$ and, for Bernoulli responses, on the linear predictor as well. In Algorithm 3, we initialize $W_i$ using (10) with estimates of $\beta$, $u_i$ and $D$ from penalized quasi-likelihood. Subsequently, we can either keep the $W_i$ fixed or update them by replacing $D^{-1}$ with $E_q(D^{-1})$, assuming the variational posterior of $D$ is inverse Wishart, and $\eta_i$ with its variational posterior mean computed from the variational posterior means of $\beta$ and $\tilde\alpha_i$. This can be done at the beginning of each iteration after new estimates of the variational parameters are obtained (see Algorithm 3 step 1).
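A sketch of the tuning-matrix specification (10), with $I_f$ computed from (11) for Poisson responses and (12) for Bernoulli responses. The design matrix, linear predictor, offsets and $D$ below are hypothetical inputs:

```python
import numpy as np

# Sketch of the tuning-matrix specification (10),
#   W_i = (I_f + D^{-1})^{-1} D^{-1},
# with I_f from (11) (Poisson, log link) or (12) (Bernoulli, logit link).
# All inputs (design, linear predictor, offsets, D) are hypothetical.
rng = np.random.default_rng(2)
n_i, r = 8, 2
XRi = np.column_stack([np.ones(n_i), rng.normal(size=n_i)])
eta_i = XRi @ np.array([0.2, 0.5])
D = np.array([[1.0, 0.3], [0.3, 0.5]])

def info_poisson(XRi, eta_i, E_i):
    w = E_i * np.exp(eta_i)                  # conditional means E_ij exp(eta_ij)
    return (XRi * w[:, None]).T @ XRi

def info_bernoulli(XRi, eta_i):
    p = 1.0 / (1.0 + np.exp(-eta_i))
    w = p * (1.0 - p)                        # exp(eta)/(1 + exp(eta))^2
    return (XRi * w[:, None]).T @ XRi

def tuning_matrix(I_f, D):
    Dinv = np.linalg.inv(D)
    return np.linalg.inv(I_f + Dinv) @ Dinv  # equation (10)

W_pois = tuning_matrix(info_poisson(XRi, eta_i, np.ones(n_i)), D)
W_bern = tuning_matrix(info_bernoulli(XRi, eta_i), D)
```

In both cases the eigenvalues of $W_i$ lie strictly between 0 and 1 when $I_f$ is positive definite, so the parametrization sits properly between the centered and noncentered extremes.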

## 4 Variational Inference for GLMMs

In this section we present the nonconjugate variational message passing algorithm recently developed in machine learning by Knowles and Minka (2011) for fitting GLMMs. Recall that in VB, the posterior distribution $p(\theta|y)$ is approximated by a $q(\theta)$ which is assumed to be of a factorized form, say $\prod_{l=1}^m q_l(\theta_l)$, for some partition $\{\theta_1,\dots,\theta_m\}$ of $\theta$. For conjugate-exponential models, the optimal densities will have the same form as the prior, so that it suffices to update the parameters of each $q_l$, such as in Algorithm 1. Variational message passing (Winn and Bishop (2005)) is an algorithm which allows VB to be applied to conjugate-exponential models without having to derive application-specific updates. In the case of GLMMs where the responses are from the Bernoulli or Poisson families, the factor $p(y_i|\beta,\tilde\alpha_i)$ of $p(y,\theta)$ in (3) is nonconjugate with respect to the prior distributions over $\beta$ and $\tilde\alpha_i$ for each $i$. Therefore, if we apply VB and assume, say, $q(\theta) = q(\beta)q(D)\prod_{i=1}^n q(\tilde\alpha_i)$, the optimal densities for $\beta$ and $\tilde\alpha_i$ will not belong to recognizable density families.

### 4.1 Nonconjugate Variational Message Passing

In nonconjugate variational message passing, besides assuming that $q(\theta)$ must factorize into $\prod_{l=1}^m q_l(\theta_l)$ for some partition $\{\theta_1,\dots,\theta_m\}$ of $\theta$, we impose another restriction that each $q_l$ must belong to some exponential family. In this way, we only have to find the parameters of each $q_l$ that maximize the lower bound $L$. Suppose each $q_i(\theta_i)$ can be written in the form

$$q_i(\theta_i) = \exp\{\lambda_i^Tt(\theta_i) - h(\lambda_i)\},$$

where $\lambda_i$ is the vector of natural parameters and $t(\theta_i)$ are the sufficient statistics. We wish to maximize $L$ with respect to the variational parameters $\lambda_1,\dots,\lambda_m$, which are the natural parameters of $q_1,\dots,q_m$, respectively. In the following, we show that nonconjugate variational message passing can be interpreted as a fixed-point iteration where updates are obtained from the condition that the gradient of $L$ with respect to each $\lambda_i$ is zero when $L$ is maximized.

From the expression for $L$ in Section 1.1, the gradient of $L$ with respect to $\lambda_i$ is

$$\frac{\partial L}{\partial\lambda_i} = \frac{\partial}{\partial\lambda_i}E_q\{\log p(y,\theta)\} - \frac{\partial}{\partial\lambda_i}E_q\{\log q(\theta)\}. \tag{13}$$

Let us consider the first term in (13). Suppose $p(y,\theta) = \prod_a f_a(y,\theta)$ factorizes into a product of factors $f_a$. We have

$$E_q\{\log p(y,\theta)\} = \sum_a S_a,$$

where

$$S_a = E_q\{\log f_a(y,\theta)\}.$$

Note that each $S_a$ is a function of the natural parameters $\lambda_1,\dots,\lambda_m$. Since we have assumed that $\theta_i$ is independent of all $\theta_j$, $j \ne i$, in the variational approximation $q$, the only terms in $\sum_a S_a$ which depend on $\lambda_i$ correspond to the factors connected to $\theta_i$ in the factor graph of $p(y,\theta)$. Therefore,

$$\frac{\partial}{\partial\lambda_i}E_q\{\log p(y,\theta)\} = \sum_{a\in N(\theta_i)}\frac{\partial S_a}{\partial\lambda_i}, \tag{14}$$

where the summation is over all factors in $N(\theta_i)$, the neighborhood of $\theta_i$ in the factor graph. For the second term in (13), we have

$$E_q\{\log q(\theta)\} = \sum_{l=1}^m E_q\{\log q_l(\theta_l)\},$$

where the only term in the sum that depends on $\lambda_i$ is the $i$th term. Hence,

$$\frac{\partial}{\partial\lambda_i}E_q\{\log q(\theta)\} = \frac{\partial}{\partial\lambda_i}\left\{\lambda_i^T\frac{\partial h(\lambda_i)}{\partial\lambda_i} - h(\lambda_i)\right\} = V(\lambda_i)\lambda_i,$$

where we have used the fact that $E_q\{t(\theta_i)\} = \partial h(\lambda_i)/\partial\lambda_i$, and $V(\lambda_i) = \partial^2h(\lambda_i)/\partial\lambda_i\,\partial\lambda_i^T$ denotes the variance–covariance matrix of $t(\theta_i)$. Note that $V(\lambda_i)$ is symmetric positive semi-definite. Putting (14) and the expression above together, the gradient of the lower bound is

$$\frac{\partial L}{\partial\lambda_i} = \sum_{a\in N(\theta_i)}\frac{\partial S_a}{\partial\lambda_i} - V(\lambda_i)\lambda_i$$

and is zero when $\lambda_i = V(\lambda_i)^{-1}\sum_{a\in N(\theta_i)}\partial S_a/\partial\lambda_i$, provided $V(\lambda_i)$ is invertible. This condition is used to obtain updates to $\lambda_i$ in nonconjugate variational message passing (Algorithm 2).
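The identity used in this derivation, that $V(\lambda_i)$ is simultaneously the Hessian of the log-partition function $h$ and the covariance of the sufficient statistics, can be checked numerically in the univariate Gaussian case. The closed form of $h$ below is obtained by writing $N(m, s^2)$ with $t(\theta) = (\theta^2, \theta)^T$ in natural-parameter form (a sketch, not part of the paper's algorithm):

```python
import numpy as np

# For q = N(m, s2) with t(theta) = (theta^2, theta)^T:
#   lambda = (-1/(2 s2), m/s2),
#   h(lambda) = -lambda2^2/(4 lambda1) + 0.5 log(-pi/lambda1).
# Check V(lambda) = Hessian of h = Cov{t(theta)}.
def h(lam):
    l1, l2 = lam
    return -l2 ** 2 / (4 * l1) + 0.5 * np.log(-np.pi / l1)

def numerical_hessian(f, x, eps=1e-5):
    x = np.asarray(x, dtype=float)
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            # central second difference
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

m, s2 = 0.7, 1.3
lam = np.array([-1.0 / (2 * s2), m / s2])
V_numeric = numerical_hessian(h, lam)

# Closed-form covariance of (theta^2, theta) under N(m, s2).
V_exact = np.array([[2 * s2 ** 2 + 4 * s2 * m ** 2, 2 * m * s2],
                    [2 * m * s2, s2]])
```

The agreement of the two matrices is what guarantees that the fixed-point condition above is a genuine stationarity condition for $L$.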

The updates can be simplified when the factor $f_a$ is conjugate to $q_i(\theta_i)$, that is, $f_a$ has the same functional form as $q_i$ with respect to $\theta_i$. Let $\theta_{-i}$ denote $\{\theta_j : j \ne i\}$. Suppose

$$f_a(y,\theta) = \exp\{g_a(y,\theta_{-i})^Tt(\theta_i) - h_a(y,\theta_{-i})\}.$$

Then $\partial S_a/\partial\lambda_i = V(\lambda_i)E_q\{g_a(y,\theta_{-i})\}$, since $E_q\{h_a(y,\theta_{-i})\}$ does not depend on $\lambda_i$. When every factor in the neighborhood of $\theta_i$ is conjugate to $q_i$, the gradient of the lower bound can be simplified to $V(\lambda_i)[\sum_{a\in N(\theta_i)}E_q\{g_a(y,\theta_{-i})\} - \lambda_i]$ and the updates in nonconjugate variational message passing reduce to

$$\lambda_i \leftarrow \sum_{a\in N(\theta_i)}E_q\{g_a(y,\theta_{-i})\}. \tag{16}$$

These are precisely the updates in variational message passing. Nonconjugate variational message passing thus reduces to variational message passing for conjugate factors (see also Knowles and Minka (2011)). Unlike variational message passing, however, the Kullback–Leibler divergence is not guaranteed to decrease at each step and sometimes convergence problems may be encountered. Knowles and Minka (2011) suggested using damping to fix convergence problems. We did not encounter any convergence problems in the examples considered in this paper.
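The conjugate update (16) can be illustrated on the textbook example of a Gaussian mean with a Gaussian prior (a sketch with illustrative numbers): each factor contributes a natural-parameter "message" $g_a$ with respect to $t(\theta) = (\theta^2, \theta)^T$, and their sum is the exact posterior in a single pass.

```python
import numpy as np

# Conjugate case of update (16): theta ~ N(0, tau2), y_j | theta ~ N(theta, sigma2).
# With t(theta) = (theta^2, theta)^T the messages are
#   g_prior = (-1/(2 tau2), 0),   g_j = (-1/(2 sigma2), y_j / sigma2),
# and summing them gives the exact posterior natural parameters.
sigma2, tau2 = 1.0, 4.0
y = np.array([0.8, 1.1, 1.4])

g_prior = np.array([-1.0 / (2 * tau2), 0.0])
g_lik = np.array([[-1.0 / (2 * sigma2), yj / sigma2] for yj in y]).sum(axis=0)
lam = g_prior + g_lik                      # update (16)

# Recover (mean, variance) of q from the natural parameters.
s2_q = -1.0 / (2 * lam[0])
m_q = lam[1] * s2_q

# Exact conjugate posterior for comparison.
s2_post = 1.0 / (len(y) / sigma2 + 1.0 / tau2)
m_post = s2_post * y.sum() / sigma2
```

Here `(m_q, s2_q)` coincides with the exact posterior, which is why variational message passing converges immediately on fully conjugate factor graphs.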

### 4.2 Updates for Multivariate Gaussian Variational Distribution

Suppose $q_i(\theta_i)$ is Gaussian. While the updates in Algorithm 2 are expressed in terms of the natural parameters $\lambda_i$, it might be more convenient to express them in terms of the mean $\mu_{q\theta_i}$ and covariance $\Sigma_{q\theta_i}$ of $q_i$. Knowles and Minka (2011) have considered the univariate case and Wand (2013) derived fully simplified updates for the multivariate case. However, as Wand (2013) is in preparation, we give enough details of the update so that the derivation can be understood. Magnus and Neudecker (1988) is a good reference for the matrix differential calculus techniques used in the derivation.

Suppose $q_i(\theta_i) = N(\mu_{q\theta_i}, \Sigma_{q\theta_i})$, where $\theta_i$ is a vector of length $d$. For a square matrix $A$, $\mathrm{vec}(A)$ denotes the vector obtained by stacking the columns of $A$ under each other, from left to right in order, and $\mathrm{vech}(A)$ denotes the vector obtained from $\mathrm{vec}(A)$ by eliminating all supradiagonal elements of $A$. We can write $q_i(\theta_i)$ as

$$\exp\left\{\lambda_i^T\begin{bmatrix}\mathrm{vech}(\theta_i\theta_i^T)\\ \theta_i\end{bmatrix} - h(\lambda_i)\right\},$$

where

$$\lambda_i = \begin{bmatrix}-\frac{1}{2}D_d^T\,\mathrm{vec}(\Sigma_{q\theta_i}^{-1})\\[2pt] \Sigma_{q\theta_i}^{-1}\mu_{q\theta_i}\end{bmatrix}$$

and $h(\lambda_i)$ is the corresponding log-partition function. The matrix $D_d$ is the unique matrix that transforms $\mathrm{vech}(A)$ into $\mathrm{vec}(A)$ if $A$ is symmetric, that is, $D_d\,\mathrm{vech}(A) = \mathrm{vec}(A)$. Let $D_d^+$ denote the Moore–Penrose inverse of $D_d$. If we let $\lambda_{i1}$ and $\lambda_{i2}$ denote the first and second blocks of $\lambda_i$, respectively, then $\partial S_a/\partial\lambda_i$ can be expressed as

$$\begin{bmatrix}\dfrac{\partial S_a}{\partial\lambda_{i1}}\\[6pt] \dfrac{\partial S_a}{\partial\lambda_{i2}}\end{bmatrix} = \begin{bmatrix}\dfrac{\partial\,\mathrm{vec}(\Sigma_{q\theta_i})}{\partial\lambda_{i1}} & \dfrac{\partial\mu_{q\theta_i}}{\partial\lambda_{i1}}\\[6pt] \dfrac{\partial\,\mathrm{vec}(\Sigma_{q\theta_i})}{\partial\lambda_{i2}} & \dfrac{\partial\mu_{q\theta_i}}{\partial\lambda_{i2}}\end{bmatrix}\begin{bmatrix}\dfrac{\partial S_a}{\partial\,\mathrm{vec}(\Sigma_{q\theta_i})}\\[6pt] \dfrac{\partial S_a}{\partial\mu_{q\theta_i}}\end{bmatrix} = U(\lambda_i)\begin{bmatrix}\dfrac{\partial S_a}{\partial\,\mathrm{vec}(\Sigma_{q\theta_i})}\\[6pt] \dfrac{\partial S_a}{\partial\mu_{q\theta_i}}\end{bmatrix},$$

where

$$U(\lambda_i) = \begin{bmatrix}2D_d^+(\Sigma_{q\theta_i}\otimes\Sigma_{q\theta_i}) & 2D_d^+(\mu_{q\theta_i}\otimes\Sigma_{q\theta_i})\\ 0 & \Sigma_{q\theta_i}\end{bmatrix}$$

and $\otimes$ denotes the Kronecker product. Moreover, $V(\lambda_i)$ can be derived to be

$$V(\lambda_i) = \begin{bmatrix}2D_d^+(\mu_{q\theta_i}\mu_{q\theta_i}^T\otimes\Sigma_{q\theta_i} + \Sigma_{q\theta_i}\otimes\mu_{q\theta_i}\mu_{q\theta_i}^T + \Sigma_{q\theta_i}\otimes\Sigma_{q\theta_i})D_d^{+T} & 2D_d^+(\mu_{q\theta_i}\otimes\Sigma_{q\theta_i})\\ \{2D_d^+(\mu_{q\theta_i}\otimes\Sigma_{q\theta_i})\}^T & \Sigma_{q\theta_i}\end{bmatrix}.$$

The update for $\lambda_i$ can be computed as

$$\lambda_i \leftarrow V(\lambda_i)^{-1}U(\lambda_i)\sum_{a\in N(\theta_i)}\begin{bmatrix}\dfrac{\partial S_a}{\partial\,\mathrm{vec}(\Sigma_{q\theta_i})}\\[6pt] \dfrac{\partial S_a}{\partial\mu_{q\theta_i}}\end{bmatrix}$$

and

$$V(\lambda_i)^{-1}U(\lambda_i) = \begin{bmatrix}D_d^T & 0\\ -2(\mu_{q\theta_i}^T\otimes I)D_d^{+T}D_d^T & I\end{bmatrix}.$$
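The duplication matrix $D_d$ and its Moore–Penrose inverse, which appear throughout these expressions, can be constructed and checked directly (a sketch; the symmetric matrix $A$ is an arbitrary test case):

```python
import numpy as np

# Construct the duplication matrix D_d, which maps vech(A) to vec(A) for
# symmetric A, and its Moore-Penrose inverse D_d^+.
def duplication_matrix(d):
    pairs = [(i, j) for j in range(d) for i in range(j, d)]  # vech ordering
    Dd = np.zeros((d * d, len(pairs)))
    for col, (i, j) in enumerate(pairs):
        Dd[j * d + i, col] = 1.0    # position of A[i, j] in vec(A)
        Dd[i * d + j, col] = 1.0    # position of A[j, i] in vec(A)
    return Dd

d = 3
Dd = duplication_matrix(d)
Dd_plus = np.linalg.pinv(Dd)        # Moore-Penrose inverse D_d^+

A = np.array([[2.0, 0.5, 0.1], [0.5, 1.0, 0.3], [0.1, 0.3, 1.5]])
vec_A = A.reshape(-1, order="F")    # column-major stacking
vech_A = np.array([A[i, j] for j in range(d) for i in range(j, d)])
```

Here $D_d$ is $d^2 \times d(d+1)/2$, $D_d\,\mathrm{vech}(A) = \mathrm{vec}(A)$, and $D_d^+\,\mathrm{vec}(A) = \mathrm{vech}(A)$ for symmetric $A$.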

Wand (2013) showed that the updates simplify to

$$\Sigma_{q\theta_i} \leftarrow -\frac{1}{2}\left[\mathrm{vec}^{-1}\left(\sum_{a\in N(\theta_i)}\frac{\partial S_a}{\partial\,\mathrm{vec}(\Sigma_{q\theta_i})}\right)\right]^{-1} \quad\text{and}\quad \mu_{q\theta_i} \leftarrow \mu_{q\theta_i} + \Sigma_{q\theta_i}\sum_{a\in N(\theta_i)}\frac{\partial S_a}{\partial\mu_{q\theta_i}}. \tag{17}$$

A more detailed version of the argument will be given in the forthcoming manuscript of Wand (2013).
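The simplified updates (17) can be exercised on a conjugate linear-Gaussian example, where the required gradients of $S = E_q\{\log p(y,\theta)\}$ are available in closed form and a single application of (17) lands on the exact posterior. The model, dimensions and data below are illustrative assumptions:

```python
import numpy as np

# Check of updates (17) on y | theta ~ N(X theta, sigma2 I), theta ~ N(0, tau2 I).
# For this model, S = E_q{log p(y, theta)} gives
#   dS/dvec(Sigma_q) = -0.5 vec(P)   and   dS/dmu_q = b - P mu_q,
# with P = X^T X / sigma2 + I / tau2 and b = X^T y / sigma2, so one application
# of (17) yields the exact posterior N(P^{-1} b, P^{-1}).
rng = np.random.default_rng(3)
d, n_obs, sigma2, tau2 = 2, 10, 1.0, 4.0
X = rng.normal(size=(n_obs, d))
y = rng.normal(size=n_obs)

P = X.T @ X / sigma2 + np.eye(d) / tau2
b = X.T @ y / sigma2

mu_q = np.zeros(d)                       # arbitrary starting value
dS_dvecSigma = -0.5 * P.reshape(-1, order="F")
dS_dmu = b - P @ mu_q

# Updates (17): vec^{-1} is column-major reshape.
Sigma_q = -0.5 * np.linalg.inv(dS_dvecSigma.reshape(d, d, order="F"))
mu_q = mu_q + Sigma_q @ dS_dmu
```

Substituting the gradients shows why: the covariance update returns $(-\tfrac12)(-\tfrac12 P)^{-1} = P^{-1}$, and the mean update then moves $\mu_q$ to $P^{-1}b$ regardless of its starting value.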