A note on MCMC for nested multilevel regression models via belief propagation
In the quest for scalable Bayesian computational algorithms we need to exploit the full potential of existing methodologies. In this note we show that the posterior distribution of the regression parameters conditionally on covariance hyperparameters in a large family of multilevel regression models is a high-dimensional Gaussian that can be sampled exactly (as well as marginalized) using belief propagation at a cost that scales linearly in the number of parameters and data. We derive an algorithm that works efficiently even for conditionally singular Gaussian distributions, e.g., when there are linear constraints between the parameters at different levels. We provide a synthesis of similar ideas that have been suggested in the literature before.
1 Belief propagation
Graphical models are now mainstream for modelling high-dimensional vectors of dependent random variables and for doing efficient computations with the joint density thereof, such as marginalisations and maximisations. Among others, two algorithmic paradigms have been established for scalable computations in graphs. More popular within statistics is the junction tree algorithm, see for example Chapter 6 of Cowell et al. (1999), and more popular within machine learning is belief propagation, see for example Section 8.4 in Bishop (2006). The connections between the two paradigms are understood, see for example Chapter 2 of Wainwright and Jordan (2008). For the class of models we consider in this article the two paradigms are essentially equivalent, so we will use the belief propagation formulation, which we find more intuitive in our context and we describe below. The starting point of belief propagation is a factorisation of the joint density of a vector of random variables. (In Section 2 we work with a broader framework that starts with a factorisation of the joint law of the variables.) The set of variables and factors are then represented by a bipartite graph known as the factor graph. There are two sets of nodes: variable nodes (one for each of the random variables) and factor nodes (one for each of the factors in the density factorisation). The edges in the graph connect each factor node to the variable nodes that correspond to those variables involved in the corresponding factor in the factorisation. Details on this construction can be found for example in Section 8.4 of Bishop (2006).
An example is the two-level hierarchical model
where is the identity matrix whose dimensions vary depending on the context; denotes the Gaussian distribution, later will denote the corresponding Gaussian density with argument ; is a matrix of covariates, one for each , where superscription by is explained in Section 2 in the context of a generalisation of (1). The last line of (1) defines a linear regression model, , where is the th row of matrix . The second line of the model pools the local regression coefficients for each towards a global parameter vector . The directed acyclic graph of this model for given covariance and design matrices is shown in Figure 1(left). Each of the conditional densities in the model specification corresponds to a factor and the associated factor graph is shown in Figure 1(right).
Belief propagation involves exchange of messages between variable and factor nodes. Let denote a variable node in the graph, the corresponding variable, a neighbouring factor node, the corresponding factor in the density factorisation, and be a function that takes as input a node index and returns the set of neighbouring nodes in the graph. By construction factor nodes neighbour variable nodes only and variable nodes neighbour factor nodes only. Hence, is a function of and . Messages exchanged between variable and factor nodes are functions of the variable attached to the variable node and are defined by the following equations:
In terms of notation used in this paper, for sets and , denotes the set difference, we identify the variable index with the one-element set that contains it, for a set of indices , denotes the set of variables indexed by elements in , and denote generic function arguments. If the factor graph is a tree, as for example in Figure 1(right), and the integrals in (2) can be computed, then the system of equations defined by (2) can be efficiently solved with a forward-backward algorithm. After choosing an arbitrary variable node as the root of the tree, the direction from the leaves to the root is defined as forward and the reverse as backward. For a given node , its neighbour on its unique path towards the root will be called its parent and denoted by . The algorithm is initialised by setting the messages at the variables nodes at the leaves (if any) equal to 1. The forward pass of belief propagation computes messages going from the leaves to the root and the backward pass computes messages from the root to the leaves. After two sweeps we obtain the set of messages solving (2). The resulting messages can then be used to compute the marginal density at any variable node as . When some variables in the definition of the graphical model have been observed, say , the message passing steps that involve these variables keep them fixed at their observed values and do not integrate them out. Then, after the forward and backward step the product of the messages at a variable node is proportional to the conditional density . Although belief propagation is typically developed for computing marginal distributions and normalising constants, it can also be used to simulate from conditional distributions. This is exploited in the following section, where in the backward pass simulation replaces message computation.
2 Bayesian nested multilevel models
2.1 Model formulation
The methods we develop are appropriate for -level models with the following nested structure
The level closest to the data will be understood as the deepest one. The last line of (3) is a linear regression model where we have concatenated observations at the deepest level into the vector , like we did in (1). We allow for level-specific covariates that explain the variation of regression coefficients at that level. Throughout the following presentation there will be two sets of quantities associated to the deepest level and that are subscripted in the same way. To distinguish between them, we add the superscript to those that link to the data; therefore, is the matrix of covariates used to explain the dataset , whereas is used to explain . We also allow for flat prior on the hyperparameters, which yields a proper posterior under a condition on the matrix defined in (10) below. The set of all observations will be referred to as data, that of all regression parameters, i.e. and all , as coefficients and that of all covariance hyperparameters as covariances, e.g. we will use data | covariances to refer to marginal likelihood of observations with regression parameters integrated out.
By allowing rank deficient covariance matrices we can fit in this framework mixed-effects models (also known as linear mixed models) as discussed for example in Section 13.3 of Gelman and Hill (2007). To clarify the concept, an example more complex than (1) is helpful. Suppose we have data on individuals that belong to different countries , that we model as a regression , where and are individual-level covariates. We want to allow variation of the corresponding coefficients that we partially explain using a country-level covariate , and with and independent. In order to obtain a factor graph that is a tree, we need to clump coefficients at the deepest level into one vector , and the covariates and a vector of 1’s into the matrix . The ’s are artificial copies of . Even though we model the variation on independently of that on , having them clumped together forces us to work with them as a single vector. Hence, we write where , , and is a matrix with rows , , and . Hence, we have obtained the nested multilevel structure of (3) with . An example of a structure that does not fit the framework of (3) is the crossed-effects model, , to which the belief propagation algorithm we describe below does not apply.
2.2 Message passing for posterior simulation
We describe the messages for belief propagation in nested multilevel models as defined in (3), and how they can be used for computing data | covariances and simulating from coefficients | data,covariances. The factor graph associated with (3) is a tree; throughout our presentation we will take to be its root. The factor nodes will be denoted by ; as already discussed the factor node that links to will be denoted by and that linking to will be denoted by ; see Figure 1. Messages are functions that can be parameterised in terms of a triplet , where is a positive scalar, a positive semidefinite matrix and a vector, and for brevity we write
First we show how to update this set of parameters during the forward step, which works from leaves to the root, i.e., from the deepest level in the hierarchy to the most shallow. With every factor node, except for those at the deepest level linked to the observations, there are associated two messages, one that arrives from the variable node deeper and one that is sent to the variable node one level higher up. We distinguish between the two corresponding triplets by using tildes for the latter. In this context, the message that arrives at a variable node from below coincides with the density of the data on the leaves that originate from the given branch of the tree, conditional on the variable at this node but marginal with respect to all other variables in-between. The messages from factor to variable nodes at the leaves are
where is the size of . At higher levels, the factor to variable messages are
where , and implicitly depend on and are defined by
The triplet in (6) corresponds to the message . The matrix decompositions in (7) are used with Schur’s complement to obtain formulae above that are valid without any assumptions on invertibility of either or . is invertible by contruction because is positive semidefinite.
The messages from variable to factor nodes are as follows: at the deepest level
while for any
where runs over the appropriate index set, which depends on the numbers of offsprings has on the Bayesian network.
At the root we collect messages from for , which come from the level below, and from , which comes from the prior. The normalised message is the posterior density at the root:
where the alternative expressions depend on whether a Gaussian or flat prior is used for . Flat prior leads to proper posterior provided defined above is positive definite. Additionally,
where the first term in nominator is omitted for flat prior, and any value can be used. Then, for any intermediate level regression coefficients, we have
We have not given the details of the calculations that produce the formulae for the messages and the conditional distributions. We have worked under a framework mathematically richer than that of (2), where the definition of a factor graph is extended to correspond to a factorisation of the joint probability measure in terms of regular conditional densities, along the lines of the disintegration theorem as in Theorem 5.4 of Kallenberg (1997). This type of construction is suitable for the factor graph representation of Bayesian networks with conditional Gaussian distributions with semi-definite covariance matrices. In the context of a multilevel model, the messages are Radon-Nikodym derivatives between Gaussian measures. It can be checked (but we have omitted the details here) that indeed the posterior Gaussian laws (12) are absolutely continuous with respect to the prior Gaussian laws in the definition of the multilevel model in (3) with Radon-Nikodym derivative proportional to the message .
2.3 Complexity considerations
Both computation of data | covariances and simulation from coefficients | data,covariances via belief propagation as described above involve a computational cost that scales linearly with the total number of regression coefficient vectors. The forward and backward steps at each level involve computations that without further structural assumptions scale cubically with the dimension of the level-specific coefficients. On the other hand, if we wish to draw several samples for the coefficients for given values of the covariances, the cost per step can be made quadratic in the dimension of the coefficient vector, since matrix decompositions do not have to be redone. In other words, whereas the computational cost of the algorithm has the same dependence on the characteristics of the model as that of a Gibbs sampling algorithm that simulates each regression coefficient conditionally on the rest, it achieves exact draws from the high-dimensional distribution coefficients | data,covariances. On the other hand, to obtain a factor tree graph we might have to clump together variables, hence increasing the dimension of the variable node dimensions, as we did for the mixed-effects model example in Section 2.1. In such cases the computational cost per iteration of the Gibbs sampler will be smaller. However, such differences will typically be small relative to the overall cost of the algorithms. Additionally, one can use belief propagation to update a subset of the variables conditionally on the rest, as we discuss in the next section, hence obtaining again the same cost per iteration as the Gibbs sampler. Sampling using belief propagation lends itself to parallelisation, but we do not develop this idea further here, although it should be considered for software development.
3 Blocked Gibbs sampling and marginal algorithms
The algorithm of Section 2 can be used in the popular context where the covariance components are given a prior distribution and Bayesian inferences are performed on the joint distribution . Posterior computations are typically performed using a Gibbs sampling scheme that alternates sampling from covariances | data, coefficients and . If conditionally conjugate priors (e.g. Wishart) are used, sampling from can be efficiently accomplished exploiting the conditional independence across covariance components given the coefficients and the data. Belief propagation can then be used to sample from coefficients | data,covariances efficiently.
An alternative approach is to perform inferences on the marginal space covariances | data. Even in this case, the belief propagation algorithm of Section 2 is crucial to allow efficient point-wise evaluation of the marginal likelihood data | covariances, and thus of covariances | data up to proportionality. The relative merits of working with the extended posterior coefficients, covariances | data or with the marginal one covariances | data are case-specific. In general one would expect the marginal approach to be preferable when the distribution is small or moderate dimensional, so that numerical or Monte Carlo integration can be efficiently employed in the marginal space. When instead the distribution is high-dimensional it may be challenging to perform inferences in the marginal space because, having integrated out the regression coefficients, the covariance components are not anymore conditionally independent and one ends up with a potentially complex high-dimensional correlated distribution. In this case the blocked Gibbs Sampling approach may be more efficient.
The methodology we presented so far is well-suited to contexts where the number of regression coefficients is large (i.e. tree with many nodes) and the size of each coefficient is small, since the cost of operations grows as the cube of that size but only linearly with the number of nodes in the tree. When the size of variable nodes is large, we can still use the message passing algorithm of Section 2 to update a subset of regression coefficients (i.e. some components of each ) jointly across all nodes of the tree conditionally on the remaining coefficients. The resulting blocked Gibbs sampling scheme can still exhibit large improvements over the single-site updating. A more detailed study of this case is left to future work.
4 Synthesis of the related literature
The forward-backward iteration we describe for sampling efficiently the high-dimensional Gaussian distribution, , is a generalisation of the so-called forward filtering backward sampling algorithm, which is used in the time series community for Bayesian inference for state space models, see for example Algorithm 13.4 in Frühwirth-Schnatter (2006). State space models have factor graphs with single-branch tree structure. The extension of Kalman filtering recursion to tree structures has been long-known, especially in the context of multiscale systems (e.g. Chou et al., 1994). Similar ideas have been exploited in spatial statistics contexts, see for example Huang and Cressie (2001) where algorithms for spatial Gaussian models with tree-structured dependence are developed. Zhang and Agarwal (2008) exploit Kalman filter recursions to perform posterior maximization for some multilevel models that are subset of the framework we consider here. These previous works typically focus on computing marginal distributions and do not develop sampling algorithms, and they do not make a clear connection with belief propagation. The connection between belief propagation and Kalman recursions is recognised in the early technical report Dempster (1990) and using a message-passing formulation in Normand and Tritchler (1992), who provide references to works even older than Dempster (1990).
The role of belief propagation within Bayesian computation for multilevel models is fully recognised in Wilkinson and Yeung (2002) who work along the same lines we have followed in this note, in particular their Section 2.4 on message passing for sampling posteriors that arise in Gaussian tree models (the approach we use in Section 2.2 corresponds to what they call the canonical parameterisation of the multivariate Gaussian). Relative to that work the main novelty in this note is that we have worked out messages with no assumptions on invertibility of prior covariance matrices. This extension has allowed us to also cast mixed effect models as nested multilevel models and use belief propagation for those too. For the two-level mixed effect models Chib and Carlin (1999) derive an efficient algorithm for sampling the posterior distribution of the regression coefficients, which is an instance of the generic algorithm of Section 2, where the structure of 0’s in prior covariances that results from clumping is explicitly exploited to simplify some of the matrix computations.
Seen as latent Gaussian models, the nested multilevel models we consider give rise to large sparse precision matrices. Wilkinson and Yeung (2002) and Wilkinson and Yeung (2004) show how to compute the canonical parameters of the Gaussian prior and posterior using sparse linear algebra computations and exploiting the graphical model structure to identify the zeros in the precision matrices. The potential of sparse linear algebra computations for inference and simulation in latent Gaussian models has been recognised at least since Rue (2001); this work, and inter alia the follow-ups Knorr-Held and Rue (2002) and Rue and Held (2005) have illustrated how the precision matrices that arise in spatiotemporal latent Gaussian models can be treated using sparse linear algebra algorithms, such as algorithms for banded matrices and algorithms that attempt to produce Choleksy factors nearly as sparse as the posterior precision. This work as has also recognised the connections between their fast linear algebra approach and Kalman filters/smoothers; see for example the overview in Section 1.2.1 of Rue and Held (2005). The way numerical analysis techniques for sparse matrices can be used for computations with Gaussian graphical models is exposed in Section 2.4 of Rue and Held (2005), and Section 2.5 of the same book carries out a simulation study that tries two black-box algorithms for sparse matrices in the context of precision matrices that arise in spatiotemporal graphical models. Wilkinson and Yeung (2004) suggests the use of such sparse linear algebra methods for inference in Gaussian hierarchical models; first compile the sparse precision using computations that exploit the known positions of the zeros and then feed the precision into a sparse linear algebra algorithm that returns its Cholesky factor. The article does not study whether the structure of zeros in the precision from common families of hierarchical models, such as for example the nested hierarchical models considered in this note, is such that the extraction of the Cholesky factor is actually fast or whether the resultant factor is sparse, or the complexity of the associated operations. These type of questions have been studied in more depth in the case of spatial models in Chapter 2 of Rue and Held (2005).
In the opposite direction, the linear algebra community has explored the use of belief propagation as an efficient iterative method for solving linear systems. An example of this line of work is Shental et al. (2008), who use belief propagation for solving linear systems with positive definite matrices by linking the system solution to the marginal means (or the mode) of a Gaussian distribution with precision matrix given by the system matrix. They discuss connections of this approach to direct and iterative methods for solving systems and find it competitive.
The authors are grateful to Darren Wilkinson for his input.
- Bishop (2006) Bishop, C. M. (2006). Pattern recognition and machine learning. Information Science and Statistics. Springer, New York.
- Chib and Carlin (1999) Chib, S. and Carlin, B. P. (1999). On mcmc sampling in hierarchical longitudinal models. Statistics and Computing, (9):17–26.
- Chou et al. (1994) Chou, K. C., Willsky, A. S., and Nikoukhah, R. (1994). Multiscale systems, Kalman filters, and Riccati equations. IEEE Transactions on Automatic Control, 39(3):479–492.
- Cowell et al. (1999) Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Probabilistic networks and expert systems. Statistics for Engineering and Information Science. Springer-Verlag, New York.
- Dempster (1990) Dempster, A. (1990). Normal belief functions and the kalman filter.
- Frühwirth-Schnatter (2006) Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer Series in Statistics. Springer, New York.
- Gelman and Hill (2007) Gelman, A. and Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models, volume 3. Cambridge University Press New York, New York, USA.
- Huang and Cressie (2001) Huang, H.-C. and Cressie, N. (2001). Multiscale graphical modeling in space: applications to command and control. In Spatial statistics: methodological aspects and applications, volume 159 of Lect. Notes Stat., pages 83–113. Springer, New York.
- Kallenberg (1997) Kallenberg, O. (1997). Foundations of modern probability. Probability and its Applications (New York). Springer-Verlag, New York.
- Knorr-Held and Rue (2002) Knorr-Held, L. and Rue, H. v. (2002). On block updating in Markov random field models for disease mapping. Scand. J. Statist., 29(4):597–614.
- Normand and Tritchler (1992) Normand, S.-L. and Tritchler, D. (1992). Parameter updating in a Bayes network. Journal of the American Statistical Association, 87(420):1109–1115.
- Rue and Held (2005) Rue, H. and Held, L. (2005). Gaussian Markov random fields: theory and applications. Chapman & Hall.
- Rue (2001) Rue, H. v. (2001). Fast sampling of Gaussian Markov random fields. J. R. Stat. Soc. Ser. B Stat. Methodol., 63(2):325–338.
- Shental et al. (2008) Shental, O., Bickson, D., Siegel, P. H., Wolf, J. K., and Dolev, D. (2008). Gaussian belief propagation solver for systems of linear equations. In EEE Int. Symp. on Inform. Theory (ISIT).
- Wainwright and Jordan (2008) Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.
- Wilkinson and Yeung (2002) Wilkinson, D. J. and Yeung, S. K. H. (2002). Conditional simulation from highly structured Gaussian systems, with application to blocking-MCMC for the Bayesian analysis of very large linear models. Stat. Comput., 12(3):287–300.
- Wilkinson and Yeung (2004) Wilkinson, D. J. and Yeung, S. K. H. (2004). A sparse matrix approach to Bayesian computation in large linear models. Comput. Statist. Data Anal., 44(3):493–516.
- Zhang and Agarwal (2008) Zhang, L. and Agarwal, D. (2008). Fast computation of posterior mode in multi-level hierarchical models. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1913–1920.