Efficient MCMC for Gibbs Random Fields using pre-computation

# Efficient MCMC for Gibbs Random Fields using pre-computation

Aidan Boland Nial Friel Florian Maire
###### Abstract

Bayesian inference of Gibbs random fields (GRFs) is often referred to as a doubly intractable problem, since the likelihood function is intractable. The exploration of the posterior distribution of such models is typically carried out with a sophisticated Markov chain Monte Carlo (MCMC) method, the exchange algorithm (murray06), which requires simulations from the likelihood function at each iteration. The purpose of this paper is to consider an approach to dramatically reduce this computational overhead. To this end we introduce a novel class of algorithms which use realizations of the GRF model, simulated offline, at locations specified by a grid that spans the parameter space. This strategy speeds up dramatically the posterior inference, as illustrated on several examples. However, using the pre-computed graphs introduces a noise in the MCMC algorithm, which is no longer exact. We study the theoretical behaviour of the resulting approximate MCMC algorithm and derive convergence bounds using a recent theoretical development on approximate MCMC methods.

## 1 Introduction

The focus of this study is on Bayesian inference of Gibbs random fields (GRFs), a class of models used in many areas of statistics, such as the autologistic model (besag74) in spatial statistics, the exponential random graph model in social network analysis (robins07), etc. Unfortunately, for all but trivially small graphs, GRFs suffer from intractability of the likelihood function making standard analysis impossible. Such models are often referred to as doubly-intractable in the Bayesian literature, since the normalizing constant of both the likelihood function and the posterior distribution form a source of intractability. In the recent past there has been considerable research activity in designing Bayesian algorithms which overcome this intractability all of which rely on simulation from the intractable likelihood. Such methods include Approximate Bayesian Computation initiated by pritchard99 (see e.g. marin2012approximate for an excellent review) and Pseudo-Marginal algorithms (AndRob). Perhaps the most popular approach to infer a doubly-intractable posterior distribution is the exchange algorithm (murray06). The exchange algorithm is a Markov chain Monte Carlo (MCMC) method that extends the Metropolis-Hastings (MH) algorithm (metropolis1953equation) to situations where the likelihood is intractable. Compared to MH, the exchange uses a different acceptance probability and this has two main implications:

• theoretically: the exchange chain is less efficient than the MH chain, in terms of mixing time and asymptotic variance (see peskun1973optimum and tierney1998note for a discussion on the optimality of the MH chain)

• computationally: at each iteration, the exchange requires exact and independent draws from the likelihood model at the current state of the Markov chain to calculate the acceptance probability, a step that may substantially impact upon the computational performance of the algorithm

For many likelihood models, it is not possible to simulate exactly from the likelihood function. In those situations, cucala09 and caimo11 replace the exact sampling step in the exchange algorithm with the simulation of an auxiliary Markov chain targeting the likelihood function, whereby inducing a noise process in the main Markov chain. This approximation was extended further by alquier16 who used multiple samples to speed up the convergence of the exchange algorithm.

This short literature review of the exchange algorithm and its variants shows that simulations from the likelihood function, either exactly or approximately, is central to those methods. However, this simulation step often compromises their practical implementation, especially for large graph models. Indeed, for a realistic run time, a user may end up with a limited number of draws from the posterior as most of the computational budget is dedicated to obtaining likelihood realizations. In addition, note that since the likelihood draws are conditioned on the Markov chain states, those simulation steps are intrinsically incompatible with parallel computing (friel2016exploiting).

Intuitively, there is a redundance of simulation. Indeed, should the Markov chain return to an area previously visited, simulation of the likelihood is nevertheless carried out as it had never been done before. This is precisely the point we address in this paper. We propose a novel class of algorithms where likelihood realizations are generated and then subsequently re-used at in an online inference phase. More precisely, a regular grid spanning the parameter space is specified and draws from the likelihood at locations given by the vertices of this grid are obtained offline in a parallel fashion. The grid is tailored to the posterior topology using estimators of the gradient and the Hessian matrix to ensure that the pre-computation sampling covers the posterior areas of high probability. However, using realizations of the likelihood at pre-specified grid points instead of at the actual Markov chain state introduces a noise process in the algorithm. This leads us to study the theoretical behaviour of the resulting approximate MCMC algorithm and to derive quantitative convergence bounds using the noisy MCMC framework developed in alquier16. Essentially, our results allow one to quantify how the noise induced by the pre-computing step propagates through to the stationary distribution of the approximate chain. We find an upper bound on the bias between this distribution and the posterior of interest, which depends on the pre-computing step parameters i.e. the distance between the grid points and the number of graphs drawn at each grid point. We also show that the bias vanishes asymptotically in the number of simulated graphs at each grid point, regardless of the grid structure.

Note that moores14 suggested a similar strategy to speed-up ABC algorithms by learning about the sufficient statistics of simulated data through an estimated mapping function that uses draws from the likelihood function at a pre-defined set of parameter values. This method was shown to be computationally very efficient but its suitability for models with more than one parameter can be questioned. Finally, we note that a related approach has been presented by everitt17 which also relies on previously sampled likelihood draws in order to estimate the intractable ratio of normalising constants. However this approach falls within a sequential Monte Carlo framework.

The paper is organised as follows. Section 2 introduces the intractable likelihood that we focus on and details our class of approximate MCMC schemes which uses pre-computed likelihood simulations. We also detail how we automatically specific the grid of parameter values. In Section 3, we establish some theoretical results for noisy MCMC algorithms making use of a pre-computation step. In Section 4, the inference of a number of GRFs is carried out using both pre-computed algorithms and exact algorithms such as the exchange. Results show a dramatic improvement of our method over exact methods in time normalized experiments. Finally, this paper concludes with some related open problems.

## 2 Pre-computing Metropolis algorithms

### 2.1 Preliminary notation

We frame our analysis in the setting of Gibbs random fields (GRFs) and we denote by the observed graph. A graph is identified by its adjacency matrix and is taken as where is the number of nodes in the graph. The likelihood function of is paramaterized by a vector and is defined as

 f(y|θ)=qθ(y)Z(θ)=exp{θTs(y)}Z(θ),

where is a vector of statistics which are sufficient for the likelihood. The normalizing constant,

 Z(θ)=∑y∈Yexp{θTs(y)},

depends on and is intractable for all but trivially small graphs. The aim is to infer the parameters through the posterior distribution

 π(θ|y)∝qθ(y)Z(θ)p(θ),

where denotes the prior distribution of . In absence of ambiguity, a distribution and its probability density function will share the same notation.

### 2.2 Computational complexity of MCMC algorithms for doubly intractable distributions

In Bayesian statistics, Markov chain Monte Carlo methods (MCMC, see e.g. gilks1995markov for an introduction) remain the most popular way to explore . MCMC algorithms proceed by creating a Markov chain whose invariant distribution has a density equal to the posterior distribution. One such algorithm, the Metropolis-Hastings (MH) algorithm metropolis1953equation, creates a Markov chain by sequentially drawing candidate parameters from a proposal distribution and accepting the proposed new parameter with probability

 α(θ,θ′):=1∧a(θ,θ′),a(θ,θ′):=qθ′(y)p(θ′)h(θ|θ′)qθ(y)p(θ)h(θ′|θ)×Z(θ)Z(θ′). (1)

This acceptance probability depends on the ratio of the intractable normalising constants and cannot therefore be calculated in the case of GRFs. As a result, the MH algorithm cannot be implemented to infer GRFs.

As detailed in the introduction section, a number of variants of the MH algorithm bypass the need to calculate the ratio , replacing it in Eq. (1) by an unbiased estimator

 ϱn(θ,θ′,x)=1nn∑k=1qθ(xk)qθ′(xk),x1,x2,…∼%iidf(⋅|θ′). (2)

Perhaps surprisingly, when the resulting algorithm, known as the exchange algorithm (murray06), is -invariant. The general implementation using auxiliary draws was proposed in alquier16 and referred therein as the noisy exchange algorithm. It is not -invariant but the asymptotic bias in distribution was studied in (alquier16). We note however that when is large, the resulting algorithm bears little resemblance with the exchange algorithm and really aims at approximating the MH acceptance ratio (1). For clarity, we will therefore refer to the exchange algorithm whenever draw of the likelihood is needed at each iteration and to the noisy Metropolis-Hastings whenever .

From Eq. (2), we see that those modified MH algorithms crucially rely on the ability to sample efficiently from the likelihood distribution ( for any ). While perfect sampling is possible for certain GRFs, for example for the Ising model (propp96), it can be computationally expensive in some cases, including large Ising graphs. For some GRFs such as the exponential random graph model, perfect sampling does not even exist yet. cucala09 and caimo11 substituted the iid sampling in Eq. (2) with draw from a long auxiliary Markov chain that admits as stationary distribution. Convergence of this type of approximate exchange algorithm was studied in Everitt12 under certain assumptions on the main Markov chain. The computational bottleneck of those methods is clearly the simulation step, a drawback which is amplified when is large and inference is on high-dimensional data such as large graphs.

Intuitively, obtaining a likelihood sample at each step independently of the past history of the chain seems to be an inefficient strategy. Indeed, the Markov chain may return to areas of the state space previously visited. As a result, realizations from the likelihood function are simulated at similar parameter values multiple times, throughout the algorithm. Under general assumptions on the likelihood function, data simulated at similar parameter values will share similar statistical features. Hence, repeated sampling without accounting for previous likelihood simulations seems to lead to an inefficient use of computational time. However, the price to pay to use information from the past history of the chain to speed up the simulation step is the loss of the Markovian dynamic of the chain, leading to a so-called adaptive Markov chain (see e.g. andrieu2008tutorial). We do not pursue this approach in this paper, essentially since convergence results for adaptive Markov chains depart significantly from the theoretical arguments supporting the validity of the exchange and its variants.

In a different context, moores14 addressed the computational expense of repeated simulations of Gibbs random fields used within an Approximate Bayesian Computation algorithm (ABC). The authors defined a pre-processing step designed to learn about the distribution of the summary statistics of simulated data. Part of the total computational budget is spent offline, simulating data from parameter values across the parameter space . Those pre-simulated data are interpolated to create a mapping function that is then used during the course of the ABC algorithm to assign an (estimated) sufficient statistics vector to any parameter for which simulation would be otherwise needed. moores15 examined a particular GRF, the single parameter hidden Potts model. They combined the pre-processing idea with path sampling (gelman98) to estimate the ratio of intractable normalising constants. The method presented in moores15 is suitable for single parameter models but the interpolation step remains a challenge when the dimension of the parameter space is greater than .

Inspired by the efficiency of a pre-computation step, we develop a novel class of MCMC algorithms, Pre-computing Metropolis-Hastings, which uses pre-computed data simulated offline to estimate each normalizing constant ratio in Eq. (1). This makes the extension to multi-parameter models straightforward. The steps undertaken during the pre-computing stage are now outlined.

### 2.3 Pre-computation step

Firstly, a set of parameter values, referred to as a grid, must be chosen from which to sample graphs from. should cover the full state space and especially the areas of high probability of . Finding areas of high probability is not straightforward as this requires knowledge of the posterior distribution. Fortunately, for GRFs we can use Monte Carlo methods to obtain estimates of the gradient and the Hessian matrix of the log posterior at different values of the parameters, which will allow to build a meaningful grid. For a GRF, the well known identity

 ∇θlogπ(θ|y)=s(y)−Ef(⋅|θ)s(X)+∇θlogp(θ)

allows the derivation of the following unbiased estimate of the gradient of the log posterior at a parameter :

 G(θ,y):=s(y)−1NN∑i=1s(Xi)+∇θlogp(θ),X1,X2…∼iidf(⋅|θ). (3)

Similarly, the Hessian matrix of the log posterior at a parameter can be unbiasedly estimated by:

 H(θ):=1N−1N∑i=1{s(Xi)−¯s}{s(Xi)−¯s}T+∇2logp(θ),X1,X2…∼iidf(⋅|θ), (4)

where is the average vector of simulated sufficient statistics.

The grid specification begins by estimating the mode of the posterior . This is achieved by mean of a stochastic approximation algorithm (e.g. the Robbins-Monro algorithm (robbins51)), using the log posterior gradient estimate defined at Eq. (3).

The second step is to estimate the Hessian matrix of the log posterior at using Eq. (4), in order to get an insight of the posterior curvature at the mode. We denote by the matrix whose columns are the eigenvectors of the inverse Hessian at the mode and by the diagonal matrix filled with its eigenvalues. The idea is to construct a grid that preserves the correlations between the variables. It is achieved by taking regular steps in the uncorrelated space i.e. the space spanned by , starting from and until subsequent estimated gradients are close to each other. The idea is that, for regular models, once the estimated gradients of two successive parameters are similar, the grid has hit the posterior distribution support boundary. Two tuning parameters are required: a threshold parameter for the gradient comparison and an exploratory magnitude parameter . The grid specification is rigorously outlined in Algorithm 1. Note that in Algorithm 1, we have used the notation for the -dimensional indicator vector of direction i.e.

The left panel of Figure 1 shows an example of a naively chosen grid built following standard coordinate directions for a two dimensional posterior distribution. The grid on the right hand side is adapted to the topology of the posterior distribution as described above. This method can be extended to higher dimensional models, but the number of sample grid points would then increase exponentially with dimension. In this paper we do not look beyond two dimensions.

Hereafter, we denote by the parameters constituting the grid , assuming grid points in total. The second step of the pre-computing step is to sample for each , iid random variables from the likelihood function . Note that this step is easily parallelised and samples can therefore be obtained from several grid points simultaneously. Parallel processing can be used to reduce considerably the time taken to sample from every pre-computed grid value. Essentially, these draws allow to form unbiased estimators for any ratio of the type :

 ˆZ(θ)Z(˙θm)n:=1nn∑k=1qθ(Xkm)q˙θm(Xkm)=1nn∑k=1exp(θ−˙θm)Ts(Xkm). (5)

Note that those estimators depend on the simulated data only through the sufficient statistics . As a consequence, only the sufficient statistics need to be saved, as opposed to the actual collection of simulated graphs at each grid point. In the following we denote by the collection of the pre-computing data comprising of the grid and the simulated sufficient statistics .

### 2.4 Estimators of the ratio of normalising constants

We now detail several pre-computing version of the Metropolis-Hastings algorithm. The central idea is to replace the ratio of normalizing constants in the Metropolis-Hastings acceptance probability (1) by an estimator based on . As a starting point this can be done by observing that for all ,

 Z(θ)Z(θ′)=Z(θ)Z(˙θ)Z(˙θ)Z(θ′)=Z(θ)Z(˙θ)/Z(θ′)Z(˙θ), (6)

and in particular for any grid point . We thus consider a general class of estimators of written as

 ρXn(θ,θ′,U):=ΨXn(θ,θ′,U)ΦXn(θ,θ′,U), (7)

where and are unbiased estimators of the numerator and the denominator of the right hand side of (6), respectively, based on . In (7), simply denotes the different type of estimators considered. To simplify notations and in absence of ambiguity, the dependence of , and on , , and is made implicit and we stress that given , the estimators and are deterministic.

We first note that as defined in (7) is not an unbiased estimator of . In fact, resorting to biased estimators of the normalizing constants ratio is the price to pay for using the pre-computed data. This represents a significant departure compared to the algorithms designed in the noisy MCMC literature (alquier16; medina2016stability). Nevertheless, as we shall see in the next Section, this does not prevent us from controlling the distance between the distribution of the pre-computing Markov chain and .

We propose a number of different estimators of and . Those estimators share in common the idea that, given the current chain location and an attempted move , a path of grid point(s) connects to .

The simplest path consists of the singleton , where is any grid point. Since only one grid point is used, we refer to this estimator as the One Pivot estimator. Following (6), the estimators and are defined as

 {ΨOPn(θ,θ′,U):=1/n∑nk=1qθ(Xkτ)/q˙θτ(Xkτ),ΦOPn(θ,θ′,U):=1/n∑nk=1qθ′(Xkτ)/q˙θτ(Xkτ). (8)

However, for some , the variance of or defined in Eq. (8) may be large. This is especially likely when or . The following Example illustrates this situation.

###### Example 1.

Consider the Erdös-Renyi graph model, where all graphs with the same number of edges are equally likely. More precisely, the dyads are independent and connected with a probability for any . The likelihood function is given for any by . For this model, the normalizing constant is tractable. In particular, where and is the number of nodes in the graph.

For all , consider estimating the ratio with for some using the estimator

 ˆZ(θ+h)Z(θ)∣∣ ∣∣n=1nn∑k=1qθ+h(Xk)qθ(Xk)=1nn∑k=1exp{hs(Xk)},Xk∼iidf(⋅|θ).

Then, when increases, the variance of this estimator diverges exponentially i.e.

 nvn(h)∼exp(2h¯p)ν(θ), (9)

where denotes here the asymptotic equivalence notation and is a constant. Remarkably, can be interpreted as the variance of the Bernoulli trial with the full graph and its complementary event as outcomes.

###### Proof.

By straightforward algebra, we have

 vn(h)=1n{1+exp(2h+θ)1+exp(θ)}¯p{1−R(θ,h)},

where

 R(θ,h)={1+exp(θ+h)}2¯p{1+exp(2h+θ)}¯p{1+exp(θ)}¯p.

Asymptotically in , we have

 R(θ,h)∼exp(¯pθ){1+exp(θ)}¯p=ϱ(θ)¯p

and noting that

 {1+exp(2h+θ)1+exp(θ)}¯p∼exp(2h¯p)exp{¯pθ}{1+exp(θ)}¯p=exp(2h¯p)ϱ(θ)¯p

concludes the proof. ∎

This is a concern since as we shall see in the next Section, the noise introduced by the pre-computing step in the Markov chain is intimately related to the variance of the estimator of . In particular, the distance between the pre-computing chain distribution and can only be controlled when the variance of and is bounded. Example 1 shows that this is not necessarily the case, for some Gibbs random fields at least. The following Proposition hints at the possibility to control the variance of and when .

###### Proposition 1.

For any Gibbs random field model and all , the variance of the normalizing constant estimator

 ˆZ(θ)Z(θ′)∣∣ ∣∣n:=1nn∑k=1qθ(Xk)qθ′(Xk),Xk∼iidf(⋅|θ′)

decreases when and more precisely

 varˆZ(θ)Z(θ′)∣∣ ∣∣n=O(∥θ−θ′∥2). (10)

Proposition 1 motivates the consideration of estimators that may have smaller variability than the One Pivot estimator.

1. Direct Path estimator: the path between and consists now of two grid points defined such that and . We therefore extend (6) and write

 Z(θ)Z(θ′)=Z(θ)Z(˙θ1)Z(˙θ1)Z(˙θ2)Z(˙θ2)Z(θ′)=Z(θ)Z(˙θ1)Z(˙θ1)Z(˙θ2)/Z(θ′)Z(˙θ2).

This leads to two estimators and defined as

 {ΨDPn(θ,θ′,U):=1/n∑nk=1qθ(Xk1)/q˙θ1(Xk1)×1/n∑nk=1q˙θ1(X2k)/q˙θ2(X2k),ΦDPn(θ,θ′,U):=1/n∑nk=1qθ′(Xk2)/q˙θ2(Xk2). (11)
2. Full Path estimator: the path between and consists now of adjacent grid points , where is a number that depends on and . Note that given , there is not only one path such as connecting to . However, for any possible path, two adjacent points always satisfy the following identity (in the basis given by the eigenvector of ):

 ∃j∈{1,…,d},VT(˙θi−˙θi+1)=±εδj,

where refers to the -dimensional indicator vector of direction i.e. . As before, we extend (6) to accommodate this situation and write

 Z(θ)Z(θ′)=Z(θ)Z(˙θ1)Z(˙θ1)Z(˙θ2)×⋯×Z(˙θC−1)Z(˙θC)Z(˙θC)Z(θ′)=Z(θ)Z(˙θ1)Z(˙θ1)Z(˙θ2)×⋯×Z(˙θC−1)Z(˙θC)/Z(θ′)Z(˙θc).

This then lead to consider two estimators and defined as

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩ΨFPn(θ,θ′,U):=1/n∑nk=1qθ(Xk1)/q˙θ1(Xk1)×1/n∑nk=1q˙θ1(X2k)/q˙θ2(X2k)×⋯×1/n∑nk=1q˙θC−1(XkC−1)/q˙θC(XkC),ΦFPn(θ,θ′,U):=1/n∑nk=1qθ′(XkC)/q˙θC(XkC). (12)

Variants of the Direct Path and Full Path estimators exist. For the Direct Path, could be estimating and the ratio . For the Full Path, defining as a middle point of , and could respectively be defined as estimators of and using the same number of grid points in both estimators. However, our experiments have shown that these alternative estimators have very similar behaviour with those defined in Eqs. (11) and (12). In particular, the variance of an estimator does not vary much when path points are removed from the numerator estimator and added to the denominator estimator, or conversely. As hinted by Proposition 1, the discriminant feature between those estimators is the distance between grid points constituting the path. In this respect, the variance of the Full Path estimator was always found to be lower than that of the Direct Path or One Pivot estimators. Even though establishing a rigorous comparison result between those estimators is a challenge on its own, a reader might be interested in the following result that somewhat formalizes our empirical observations.

###### Proposition 2.

Let and consider the Direct Path and Full Path estimators of defined at (11) and (12). Denoting by a full path connecting to , we define for as the estimator of and as the estimator of i.e.

 Rin=1nn∑k=1q˙θi−1(Xki)q˙θi(Xki),R2Cn=1nn∑k=1q˙θ1(XkC)q˙θC(XkC),Xki∼iidf(⋅|˙θi). (13)

Let and be the variance of the Full Path and Direct Path estimators using pre-computed sufficient statistics are drawn at each grid point.

Assume and are independent. Then, there exists a positive constant such that

 vDPn−vFPn=γ{var(R2Cn)−var(R2n×⋯×RCn)}. (14)

Moreover,

 var(R2Cn)=1nvarexp{(˙θ1−˙θC)Ts(XC)} (15)

and for large and and small we have

 (16)

where is a sequence of finite numbers such that .

Proposition 2 shows that for a large enough number of pre-computed draws , a long enough path and a dense grid i.e. , the variance of the Full Path estimator is several order of magnitude less than that of the Direct Path estimator. In particular, unlike the Full Path estimator, the grid refinement does not help to reduce the variance of the Direct Path estimator. Proposition 2 coupled with the observation made at Example 1 helps to understand the variance reduction achieved with the Full Path estimator compared to the Direct Path estimator.

Note that when the parameter space is two-dimensional or higher, there is more than one choice of path connecting to . The right panel of Figure 2 shows two different paths. In this situation, one could simply average the Full Path estimators obtained through each (or a number of) possible path. The different steps included in the Pre-computing Metropolis algorithm are summarized in Algorithm 2.