Model comparison for Gibbs random fields using noisy reversible jump Markov chain Monte Carlo

Model comparison for Gibbs random fields using
noisy reversible jump Markov chain Monte Carlo

Lampros Bouranis lampros.bouranis@insight-centre.org Nial Friel nial.friel@ucd.ie Florian Maire florian.maire@ucd.ie School of Mathematics and Statistics & Insight Centre for Data Analytics,
University College Dublin, Ireland
Abstract

The reversible jump Markov chain Monte Carlo (RJMCMC) method offers an across-model simulation approach for Bayesian estimation and model comparison, by exploring the sampling space that consists of several models of varying dimensions. The implementation of RJMCMC to models like Gibbs random fields suffers from computational difficulties: the posterior distribution for each model is termed doubly-intractable since computation of the likelihood function is rarely available. Consequently, it is simply impossible to simulate a transition of the Markov chain in the presence of likelihood intractability. In this paper we present a variant of RJMCMC, called noisy RJMCMC, where we replace the underlying transition kernel with an approximation based on unbiased estimators. Building upon the theoretical developments of Alquier et al. (2016), we provide convergence guarantees for the noisy RJMCMC algorithm. Our experiments show that the noisy RJMCMC algorithm can be much more efficient than other exact methods, provided that an estimator with lower variance is used, a fact which is in agreement with our theoretical analysis.

keywords:
Bayes factors, Intractable likelihoods, Markov random fields, Noisy MCMC.
journal:

1 Introduction

Model selection is a problem of great importance in statistical science. The aim is to choose which model among a set of possible ones best describes the data . From a Bayesian perspective, the prior beliefs for each model are reflected by a prior distribution and this information is then updated objectively when data are observed. This step is typically carried out by calculating the marginal likelihood or evidence which is defined as the integrated likelihood with respect to the prior measure. In many cases, this quantity cannot be derived analytically and thus needs to be estimated.

This paper considers the problem of Bayesian model comparison of doubly-intractable distributions. The motivating application is Gibbs random fields (GRFs), which are discrete-valued Markov random fields where an intractable normalising constant that depends on the model parameters, , is present for the tractable un-normalised likelihood . The likelihood density, given a vector of parameters and a vector of statistics that are sufficient for the likelihood, is

(1)

Posterior parameter estimation for GRFs, which have found applicability in areas like image analysis and disease mapping (Friel and Rue, 2007), genetic analysis (François et al., 2006) and social network analysis (Wasserman and Pattison, 1996), is termed a doubly-intractable problem because both the likelihood density and the posterior distribution are intractable. Bayesian model comparison of such models has attracted the attention of researchers and several methods have been considered, relying on likelihood simulations (Friel, 2013; Caimo and Friel, 2013; Everitt et al., 2017a) or approximations to the intractable likelihood (Bouranis et al., 2017a).

In this paper we explore trans-dimensional Markov chain Monte Carlo (MCMC) for GRFs, focusing on the direct approach of a single across-model Markov chain using the celebrated reversible jump MCMC (RJMCMC) technique (Green, 1995). The advantage of across-model approaches is that they avoid the need for computing the evidence for each competing model by treating the model indicator as a parameter, where the chain explores simultaneously the model set and the parameter space. In the context of GRFs, however, RJMCMC techniques simply cannot be implemented because the likelihood normalising constant cannot be computed point-wise.

The three main contributions of this paper are:

  1. To develop a variant of the RJMCMC algorithm where the intractable ratio of normalising constants that the acceptance probability depends on is approximated by an unbiased estimator. The resulting algorithm falls in the noisy MCMC framework (Alquier et al., 2016) as it simulates a chain that is not invariant for the target distribution. We extend the theoretical analysis of noisy MCMC algorithms proposed in Alquier et al. (2016) to trans-dimensional kernels.

  2. The construction of efficient jump proposal distributions for random walk (RW) noisy RJMCMC, which could be useful in the context of a large number of nested competing models.

  3. To show that noisy RJMCMC algorithms are only useful when the estimator of the ratio of normalising constants has a small variance. Motivated by Gelman and Meng (1998), we propose a smoother transition path between different models and resort to an alternative estimator with lower variance. We demonstrate empirically that this idea simultaneously: (i) improves the mixing of the RJ Markov chain and (ii) decreases the asymptotic bias between the exact RJ Markov chain and its noisy approximation.

The outline of the article is as follows. Section 2 introduces the reader to basic concepts regarding Bayesian model comparison and to the reversible jump MCMC formulation. In Section 3 we discuss the extension of the reversible jump methodology to doubly-intractable posterior distributions and present theoretical properties and practical aspects of the RJMCMC samplers under such computational difficulties. Chapter 4 presents some proposal tuning strategies for noisy RJMCMC. In Section 5 we study the theoretical behavior of the noisy RJMCMC algorithm and derive convergence bounds. We investigate the performance of noisy RJMCMC with a detailed numerical study that focuses on social network analysis in Section 6. We conclude the paper in Section 7 with final remarks.

2 Preliminaries

Suppose a set of competing models are under consideration to describe the data . In the Bayesian setting each model is characterised by a likelihood function , parameterised by an unknown parameter vector . Each model is also associated with a prior distribution , used to express the beliefs about the parameter vector prior to observing the data . The focus of interest in Bayesian inference for each competing model is the posterior distribution

(2)

The prior beliefs for each model are expressed through a prior distribution , such that .

2.1 Bayesian model comparison

The marginal likelihood or model evidence for model is

and is rarely analytically tractable. However, knowledge of the evidence is required for a quantitative discrimination between competing models with the posterior model probabilities. Using Bayes theorem the posterior model probability for model is

(3)

The probabilities are treated as a measure of the uncertainty of model . Comparison of two competing models in the Bayesian setting is performed through the Bayes factor

which provides evidence in favor of model compared with model . Using (3), the Bayes factor can also be expressed as the ratio of the posterior model odds to the prior odds,

A comprehensive review of Bayes factors is presented by Kass and Raftery (1995). There are at least two computational approaches to estimate Bayes factors: (i) within-model simulation, where the evidence is estimated separately for each model, see Friel and Wyse (2012) for a recent review on related methods; (ii) across-model simulation with trans-dimensional MCMC methods, which involves estimation of posterior model odds for chosen prior model odds. Reversible jump MCMC (Green, 1995) is a popular trans-dimensional MCMC method and is the focus of this work.

2.2 Reversible jump MCMC

Reversible jump MCMC (Green, 1995) generalises the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970) to allow for sampling from a distribution on a union of spaces of (possibly) differing dimensions, permitting state-dependent choices of move types. Let be a distribution over the general state space ,

The dimension of the parameter space of model is denoted by . The target is a joint posterior distribution on a model and a parameter, {linenomath}

(4)

which can be factorised as the product of posterior model probabilities and model-specific parameter posteriors, {linenomath}

(5)

To sample the model indicator and the model parameters jointly a Markov chain is constructed with state space and stationary distribution . The state space is a finite union of subspaces of varying dimensions. By marginalisation, we obtain the probability of being in subspace .

The reversible jump MCMC scheme allows for Metropolis-Hastings moves between states defined by and that may have different dimensions and , respectively. Since its a MH-type algorithm it is -reversible and thus -invariant. As a consequence, the Markov chain simulated by RJMCMC produces samples .

Below we formulate the RJMCMC algorithm in a way that generalises non trans-dimensional MCMC. This will ease the analysis of Section 5. Let be the proposal distribution for the transition from to . We denote by the probability of proposing a jump to model when the chain is currently at model and by the proposal for the parameter vector. We define as the set of common parameters between models , and by the set of parameters that are in , but not in . Then for the case of RW updates the proposal distribution for the parameter vector takes the form

(6)

where and are some proposal distributions and is the indicator function. The RJMCMC scheme proceeds as in Algorithm 1. Green (1995, 2003) offers a representation in terms of random numbers to eliminate the apparent difficulty of moving between spaces of different dimensions.

The advantage of RJMCMC is that the posterior model probabilities are readily available along with parameter estimates for each competing model. To assess support for the models under examination with the RJMCMC scheme, the output from the trans-dimensional chain is processed to calculate the Bayes factor. Assuming equal prior probabilities on models and , this motivates the simple estimate of as , where is the frequency of visits (out of a chain of length ) to model .

1:Initialise .
2:for  do
3:     Select a candidate model with probability .
4:     Propose .
5:      Estimate the acceptance probability , (7)
6:     Set with probability , else .
7:end for
8:return
Algorithm 1 Reversible jump MCMC

3 Reversible jump MCMC in the presence of likelihood intractability

The reversible jump algorithm presents an asymptotically exact MCMC method by targeting the posterior distribution of interest, . When the posterior involves an intractable likelihood the implementation of RJMCMC for GRFs is challenging because it is simply not possible to simulate a transition of this exact chain.

Indeed, given the current state , the proposed value of the next state of the chain, , is accepted with probability , where

(8)

The acceptance probability is dependent on a ratio of intractable normalising constants and, thus, cannot be evaluated. GRF models belong to the exponential family and satisfy the identity

(9)

This expectation can be estimated with importance sampling, where the algebraic part, of , is viewed as an importance distribution for the "target" .

Below we present variants of the RJMCMC algorithm that bypass the need to calculate the intractable ratio in (8), by replacing it with an unbiased estimator. It is important to note that using an estimator instead of the true ratio has significant consequences on the convergence of the Markov chain. In particular, the Markov chain is usually no longer asymptotically exact and in this context, such an algorithm is referred to as noisy MCMC. However, Alquier et al. (2016) have shown that the distribution of the noisy MCMC Markov chain can be made arbitrarily close to the desired stationary distribution, provided that an estimator with an arbitrarily small variance is available. In the following, we refer to as noisy RJMCMC the implementation of the RJMCMC algorithm where the ratio is estimated.

The efficiency of the noisy RJMCMC depends on the asymptotic approximation of , an estimator of with small variance and the mixing property of the chain. The first and second points are discussed in the remainder of this Section through the presentation of different estimators of the ratio of normalising constants. The third point is considered at Section 4, where efficient proposal distributions are designed.

3.1 Reversible jump exchange algorithm

Murray et al. (2006) presented the exchange algorithm that allows inference for doubly-intractable distributions and circumvents the issue of intractability of (1). The exchange algorithm introduces an auxiliary variable that is used to estimate the intractable ratio of normalising constants with a one-sample unbiased importance sampling estimator of (9).

Pseudo-marginal algorithms (Andrieu and Roberts, 2009) share the same principles with the exchange algorithm, replacing the intractable likelihood function with a positive and unbiased estimate that can be obtained using likelihood simulations or importance sampling.

Caimo and Friel (2014) devised a trans-dimensional extension of the exchange algorithm which is -invariant and generalises the exchange algorithm of Murray et al. (2006) to trans-dimensional settings in the same way that RJMCMC generalises the Metropolis-Hastings algorithm. We refer to this instance as the RJ exchange when draws from the likelihood are used to estimate (9).

Despite yielding a Markov chain that is -invariant, the one-sample unbiased importance sampling estimator usually has a large variance (see discussion below). Empirical results in Alquier et al. (2016) suggest that such an exchange algorithm is likely to be inefficient because of slow mixing properties and propose taking multiple auxiliary draws . In the trans-dimensional setting, this motivates a noisy RJMCMC algorithm that uses likelihood draws or another estimator with small variance.

A graphical illustration that attempts to explain the inefficiency of the RJ exchange can be seen in Figure 1. This example involves transitions with a random walk proposal from ERG model to model , which we provide details of in Section 6.2. It is evident that the RJ exchange acceptance ratio will frequently underestimate the RJ MH acceptance ratio, making it more likely to reject the proposed move and affecting the convergence speed of the algorithm, particularly as the norm increases.

3.2 Noisy reversible jump MCMC

Estimating ratios of intractable normalising constants using (9) has been proposed in Alquier et al. (2016). A Monte Carlo estimator based on multiple auxiliary draws from the likelihood with respect to the proposed parameter, , approximates (9) by

(10)

For the remainder of the paper, we will refer to (10) as the standard importance sampling estimator (ISE). The special case where yields the RJ exchange algorithm. We now address the issue of estimator variance.

The work of Stoehr et al. (2017) that considered non trans-dimensional moves demonstrated that the quality of the ISE strongly decreases when the distance increases. This is even more involved when and belong to different parameter spaces. Boland et al. (2017) proved that the variance of the estimator (10) is controlled when the distance between the parameters decreases. This result can be extended to jumps between models of different dimensions. For instance, in the case of an independence sampler the full parameter vector is updated. Consequently, can be large and so may the variance of the estimator. This is a serious concern that may affect the performance of the algorithm, as mentioned at the beginning of Section 3.

Figure 1: In the context of a trans-dimensional move between 2 different exponential random graph (ERG) models (see Section 6), 10 samples are drawn at random. For each of them we plot the MH acceptance rate (yellow spade) and the distribution of the exchange acceptance rate (violin plot). In practice the MH acceptance rate is not tractable but we use (10) with to estimate it precisely. As the distance between and increases, the acceptance ratio in the noisy RJ algorithm becomes more variable.

To study the variance of the ISE in the trans-dimensional setting, we consider for simplicity the case of a random walk transition between two nested models where and define . A RW allows us to control the distance , as it is simply equal to , and to apply Proposition 1 below.

Proposition 1 (Proposition 1 in Boland et al. (2017)).

For any GRF model and , the variance of the estimator in (10) decreases when , such that

(11)

To further decrease the variance we consider an alternative estimator that takes advantage of the auxiliary draws from the likelihood, and one which has also been considered in previous works (Friel, 2013; Bouranis et al., 2017a; Stoehr et al., 2017; Everitt et al., 2017b). Let us introduce an auxiliary variable and discretise as . We remain in the case of nested model jumps and define the vector-valued mapping

The idea is to replace the ratio of normalising constants by a telescopic product of ratios of normalising constants taken at two consecutive points of the path , so that are close to each other. More precisely, the ratio of normalising constants can be written as

(12)

Any factor in the RHS of this equation is estimated as follows. The un-normalised likelihood can be considered as an importance distribution for the "target" distribution , noting that

An unbiased importance sampling estimator of this expectation can be obtained by simulating multiple draws , yielding

Since the expectation of a product of independent random variables is the product of the expectations, then

(13)

gives an unbiased estimator of the intractable ratio of normalising constants.

This telescopic product estimator (TPE) will come at an extra computational cost, similarly to (10). The larger the number of path steps and the number of draws (where ), the more precise the estimate of the ratio of normalising constants at the expense of computational time. Nevertheless, empirical results in Section 6 show that this estimator can lead to improved performance results. Algorithm 2 contains pseudo-code for the noisy RJMCMC algorithm.

4 Choice of reversible jump proposal densities

All algorithms in this paper were run using a proposal distribution for the transition from to that was set to a Uniform , where is the cardinality of the model set. Below we present the strategies that we followed in order to tune the proposal distribution for the parameter vector.

1:Initialise .
2:for  do
3:     Select a candidate model with probability .
4:     Propose .
5:      Draw depending on the type of estimator of the ratio , using (10) or (13).
6:      Estimate the acceptance probability , (14)
7:     Set with probability , else .
8:end for
9:return
Algorithm 2 Noisy reversible jump MCMC

4.1 Independent proposals

We adopted the auto-reversible jump (Auto-RJ) exchange algorithm that was proposed by Caimo and Friel (2013) to perform Bayesian model comparison of exponential random graph models in the area of Bayesian network analysis.

The Auto-RJ takes the form of an independence sampler, making use of a parametric approximation of the posterior distribution for each model in an attempt to bypass the issue of tuning the parameters of the jump proposal distributions and increase within-model acceptance rates. The Auto-RJ consists of pilot MCMC runs that are used to sample from the posterior distribution of each competing model with the exchange algorithm (Murray et al., 2006) and then to approximate the estimated posterior by Gaussian distributions determined by the first moments of each sample. The online step of the Auto-RJ sets to this approximation of the posterior density for both the within- and between-model jumps.

The TP estimator (13) is particularly suitable to the independence sampler, as it allows for "long" transitions while having an estimator of the intractable ratio of normalising constants with small variance. On the other hand, the requirement of pilot MCMC runs reduces the appeal of this method when the number of models under investigation is large. Indeed, in such a case a within-model simulation approach may be as efficient. Additionally, the Auto-RJ is likely to fail when the posterior of at least one model is multi-modal.

4.2 Random walk proposals

A natural alternative to the independence sampler comes from allowing the proposal to depend on the current state . Despite eliminating the need for pilot runs, an open question remains as to how optimally scale the proposal variance parameter. An inefficient proposal mechanism will result in a Markov chain which slowly explores the state space and has high auto-correlation, increasing the asymptotic variance of Monte Carlo estimators. Such inefficiency can be caused by proposing large moves away from the current state of the chain or by proposing moves with prohibitively small associated acceptance probabilities.

For the within-model RW updates, a proposal distribution with a variance-covariance matrix in the form of was assumed, to account for possible correlations between the model parameters (Chib and Jeliazkov, 2001; Martin et al., 2011). The prior precision is denoted by and is a positive Metropolis tuning parameter. The precision matrix is set as the negative Hessian . Estimation of the MLE for each ERG model in Section 6 was performed with the Monte Carlo Maximum Likelihood Estimation (MC-MLE) procedure proposed by Geyer and Thompson (1992). Alternative procedures exist, see Hunter and Handcock (2006). The gradient of the log-likelihood function can be written as

(15)

Then the Hessian matrix can be found using the identity

(16)

where denotes the covariance matrix of the vector with respect to . For GRF models those Hessian matrices are intractable and so we resort to Monte Carlo sampling from in order to estimate . This setting helped us reach a reasonable mixing rate within each model.

For the between-model moves, a popular choice in the implementation of the RJMCMC algorithm for nested cases is the second order method of Brooks et al. (2003). The second order method is based on a Taylor series expansion of the acceptance probability around certain canonical jumps, providing a fully automated framework for achieving local adaptation of the proposal density. It attempts to maximise the marginal acceptance probability and not the extended acceptance probability that uses the auxiliary data and so it is incompatible with the reversible jump exchange algorithm. While such a strategy would be sensible for noisy RJMCMC, it involves estimation of (4.2) and (4.2), which are unavailable in closed form for models like GRFs. The gradients can be numerically approximated with Monte Carlo simulation; this relies on repeated likelihood simulations, increasing considerably the computational cost per iteration of the noisy RJMCMC algorithm and so we do not pursue this strategy further.

Our implementation of the between-model moves is inspired by Ehlers and Brooks (2008), who proved that the full-conditional posterior distribution, , which is based upon the current state of the chain, is the optimal proposal distribution for the random vector . We tackled the intractability of the full-conditional by first considering a multivariate Gaussian distribution , assuming diffuse prior distributions. Standard theory can be used to derive a mean and covariance matrix and approximate the full-conditional distribution with a multivariate Gaussian distribution, . To summarise, the proposal distribution for our implementation of the noisy RW RJMCMC is

(17)

5 Theoretical guarantees for noisy reversible jump MCMC

The goal of this section is to extend the results of Alquier et al. (2016) to the trans-dimensional setting and then to apply these to the context of noisy RJMCMC with the different estimators developed earlier. To do so, we first present RJMCMC as a simple generalisation of the MH algorithm. Following Section 2.2, we consider again the general state space that defines the target distribution . Our interest lies in constructing a Markov chain with transition kernel , with as the invariant distribution.

Let us consider as a probability measure on the compact set . Then is an invariant distribution if {linenomath}

It is also required that the respective Markov chain is reversible, so that for all Borel sets {linenomath}

(18)

Let be a proposal measure for the move from the current state to the proposed state , where is accepted with probability . The transition kernel is given by {linenomath}

(19)

where {linenomath}

is the probability of rejecting the proposed move, while at state . By substituting (19) into (18), it is straightforward to show that {linenomath}

The above formulation encompasses standard MCMC.

We denote by the transition kernel of the Markov chain resulting from the estimators (10) and (13), where the acceptance ratio in is replaced by an estimator , yielding .

Theorem 2 (Corollary 3.1 in Mitrophanov (2005)).

We assume that

(H1)

The reversible jump MH Markov chain with transition kernel and acceptance probability is uniformly ergodic, for which it holds that {linenomath}

for some constants and , where is the total variation distance.

Then for any and for any starting point it holds that {linenomath}

where is the total variation measure between the two kernels and .

An application of Theorem 2 is now provided when an approximation to the true transition kernel arises from Algorithm 2.

Corollary 3.

Let us assume that

(H1)

The Markov chain with transition kernel is uniformly ergodic holds.

(H2)

There exists a function , such that for all ,

(20)

Then for any and for any starting point it holds that

where .

When the upper bound in (20) is uniformly bounded such that for all it holds that , then {linenomath}

Consequently, letting yields {linenomath}

Below we make a series of assumptions that will help us show that the noisy RJMCMC algorithm will yield a Markov chain which will converge to the target posterior density as: (i) , if the ratio is estimated with (10) or (ii) , if the TP estimator (13) is considered, instead.

(A1)

For each prior over the model-specific parameters, there is a constant such that
.

(A2)

There is a constant such that .

(A3)

A prior probability is assigned to each model.

(A4)

For any and , .

These assumptions are met when is a bounded set, such that

are finite, which gives for any and , using the Cauchy-Schwartz inequality. Hence, we can set .

We acknowledge that Assumptions (A1) to (A4) are, in general, quite strong. However, for the specific case of exponential random graph models, these assumptions can be deemed realistic for the following reason. These models are known to feature degenerate regions, i.e. parts of the parameter space where the model generates only full or empty graphs. Hence, the model parameters will effectively lie on a compact space and so a Bayesian analysis of those models may be carried out, as a matter of course, with priors whose support is included in the non-degenerate region. In this sense, we feel that Assumptions (A1) to (A4) become rather mild so that the theoretical results are then relevant.

Lemma 4.

Under assumptions (A1)-(A4), for any , estimated with (10) satisfies

Theorem 5 places a bound on the total variation between the Markov chain of a noisy RJMCMC algorithm and a Markov chain with the desired target distribution, when the intractable ratio of normalising constants is estimated by (10). Theorem 3.1 in Alquier et al. (2016) is the special case of Theorem 5 when a within-model transition is attempted.

Theorem 5.

Under assumptions (A1)-(A3) then (H2) in Corollary 3 is satisfied with

and

where is explicitly known.

Empirical results in this paper have demonstrated that the use of the TP estimator can lead to an improved RJMCMC algorithm relative to the RJ exchange algorithm. We present the following Lemma that shows how to control the variance of the TP estimator with .

Lemma 6 (Lemma 6 in Boland et al. (2017)).

The TPE requires 1 sets of simulated data,

that are used to estimate the ratio of normalising constants with (13). Let be i.i.d one-dimensional sample mean estimators, such that

When assumptions (A1)-(A4) are satisfied, then

Below we provide a bound on the total variation distance between the Markov chain of a noisy RJMCMC algorithm with the TP estimator and a Markov chain with the desired target distribution with respect to the factors .

Lemma 7.

Under assumptions (A1)-(A4), for any , estimated with (13) satisfies