Bayesian computation for statistical models with intractable normalizing constants


Abstract

This paper deals with some computational aspects of the Bayesian analysis of statistical models with intractable normalizing constants. In the presence of intractable normalizing constants in the likelihood function, traditional MCMC methods cannot be applied. We propose an approach to sample from such posterior distributions. The method can be thought of as a Bayesian version of the MCMC-MLE approach of [8]. To the best of our knowledge, this is the first general and asymptotically consistent Monte Carlo method for such problems. We illustrate the method with examples from image segmentation and social network modeling. We also study the asymptotic behavior of the algorithm and obtain a strong law of large numbers for empirical averages.

Running title: Bayesian computation for intractable normalizing constants

Yves F. Atchadé (Department of Statistics, University of Michigan, email: yvesa@umich.edu), Nicolas Lartillot (LIRMM, Université de Montpellier 2, email: nicolas.lartillot@lirmm.fr) and Christian P. Robert (CEREMADE, Université Paris-Dauphine and CREST, INSEE, email: xian@ceremade.dauphine.fr)
(April 2008)

AMS subject classifications: Primary 60C05, 60J27, 60J35, 65C40.

Keywords: Monte Carlo methods, adaptive MCMC, Bayesian inference, Ising model, image segmentation, social network modeling.

1 Introduction

Statistical inference for models with intractable normalizing constants poses a major computational challenge. This problem occurs in the statistical modeling of many scientific problems. Examples include the analysis of spatial point processes ([13]), image analysis ([10]), protein design ([11]) and many others. The problem can be described as follows. Suppose we have a dataset x_0 generated from a statistical model with parameter θ and likelihood proportional to e^{E(x_0,θ)}/Z(θ), where the normalizing constant Z(θ) depends on θ and is not available in closed form. Let μ(θ) be the prior density of the parameter θ. The posterior distribution of θ given x_0 is then given by

 \pi(\theta) \propto \frac{1}{Z(\theta)}\, e^{E(x_0,\theta)}\, \mu(\theta). (1)

When Z(θ) cannot be easily evaluated, Monte Carlo simulation from this posterior distribution is problematic even using Markov Chain Monte Carlo (MCMC). [14] uses the term doubly intractable distribution to refer to posterior distributions of the form (1). Current Monte Carlo sampling methods do not allow one to deal with such models in a Bayesian framework. For example, a Metropolis-Hastings algorithm with proposal kernel q(θ, θ′) and target distribution (1) would have an acceptance ratio involving the ratio Z(θ)/Z(θ′), which cannot be computed as it involves the intractable normalizing constant evaluated at θ and θ′.

An early attempt to deal with this problem is the pseudo-likelihood approximation of Besag ([4]), which approximates the model by a more tractable one. Pseudo-likelihood inference provides a first approximation but typically performs poorly (see e.g. [5]). Maximum likelihood inference is nevertheless possible: MCMC-MLE, a maximum likelihood approach based on MCMC, was developed in the 1990s ([8, 7]). Another related approach to finding MLE estimates is Younes' algorithm ([18]), based on stochastic approximation. An interesting simulation study comparing these three methods is presented in [10].

Comparatively little work has been done to develop asymptotically exact methods for the Bayesian approach to this problem, though various approximate algorithms exist in the literature, often based on path sampling ([6]). Recently, [12] have shown that if exact sampling of x from e^{E(x,θ)}/Z(θ) (as a density in x) is possible, then a valid MCMC algorithm to sample from (1) can be developed. See also [14] for some improvements. Their approach uses a clever auxiliary variable algorithm. But intractable normalizing constants often occur in models for which exact sampling of x is not possible or is very expensive. Another recent development is the approximate Bayesian computation scheme of Plagnol-Tavaré ([15]), which, however, samples only approximately from the posterior distribution.

In this paper, we propose an adaptive Monte Carlo approach to sample from (1). Our algorithm generates a stochastic process (not Markov in general) such that as n → ∞, the marginal distribution of θ_n converges to (1). It is clear that any method to sample from (1) will have to deal with the intractable normalizing constant Z(θ). In the auxiliary variable method of [12], computing Z(θ) is replaced in a sense by perfect sampling from e^{E(x,θ)}/Z(θ). This strategy works well so long as perfect sampling is feasible and inexpensive. In the present work, we take another approach, building on the idea of estimating the entire function θ ↦ Z(θ) from a single Monte Carlo sampler. The starting point of the method is importance sampling. Suppose that for some θ^{(1)} ∈ Θ, we can sample (perhaps by MCMC) from the density e^{E(x,θ^{(1)})}/Z(θ^{(1)}) in x. Using this sample, we can certainly estimate Z(θ)/Z(θ^{(1)}) for any θ. This is the same idea behind the MCMC-MLE algorithm of [8]. But it is well known that these estimates are typically very poor as θ gets far from θ^{(1)}. Now, suppose that instead of a single point θ^{(1)}, we generate a population θ^{(1)}, …, θ^{(d)} in Θ and that we can sample from an appropriate joint distribution on X × {1,…,d}. Then we show that, in principle, efficient estimation of Z(θ) is possible for any θ. Building on [3] and the ideas sketched above, we propose an algorithm that generates a random process (X_n, I_n, θ_n) such that the marginal distribution of (X_n, I_n) converges to that joint distribution and the marginal distribution of θ_n converges to (1). This random process is not a Markov chain in general, but we show (from first principles) that θ_n has the posterior as its limiting distribution and satisfies a strong law of large numbers.

The paper is organized as follows. A full description of the method, including practical implementation details, is given in Section 2. We illustrate the algorithm with three examples: the Ising model, a Bayesian image segmentation example, and Bayesian modeling of social networks. The examples are presented in Section 4. Some theoretical aspects of the method are discussed in Section 3, with the proofs postponed to Section 6.

2 Sampling from posterior distributions with intractable normalizing constants

Throughout, we fix the sample space X and the parameter space Θ. The problem of interest is to sample from the posterior distribution (1) with

 Z(\theta) = \int_{\mathsf{X}} e^{E(x,\theta)}\, \lambda(dx). (2)

Let θ^{(1)}, …, θ^{(d)} be a sequence of points (particles) in Θ. Let Λ_* be the probability measure on X × {1,…,d} given by:

 \Lambda_*(x,i) = \frac{e^{E(x,\theta^{(i)})}}{d\, Z(\theta^{(i)})}, \qquad x \in \mathsf{X},\; i \in \{1,\dots,d\}. (3)

Let κ be a similarity kernel on Θ such that \sum_{i=1}^d \kappa(\theta, \theta^{(i)}) = 1 for all θ ∈ Θ. The starting point of the algorithm is the following decomposition of the partition function:

 Z(\theta) = \int_{\mathsf{X}} e^{E(x,\theta)}\, \lambda(dx) (4)
 = \sum_{i=1}^d \kappa(\theta,\theta^{(i)}) \int_{\mathsf{X}} e^{E(x,\theta)-E(x,\theta^{(i)})}\, e^{E(x,\theta^{(i)})}\, \lambda(dx)
 = d \sum_{i=1}^d \kappa(\theta,\theta^{(i)})\, Z(\theta^{(i)}) \int_{\mathsf{X}} e^{E(x,\theta)-E(x,\theta^{(i)})}\, \frac{e^{E(x,\theta^{(i)})}}{d\, Z(\theta^{(i)})}\, \lambda(dx)
 = \sum_{i=1}^d \int_{\mathsf{X}} \Lambda_*(x,i)\, h_\theta(x,i)\, \lambda(dx),

where

 h_\theta(x,i) = d\, \kappa(\theta,\theta^{(i)})\, Z(\theta^{(i)})\, e^{E(x,\theta)-E(x,\theta^{(i)})}. (5)

The interest of the decomposition (4) is that Λ_* and samples from it do not depend on θ. Therefore, using samples from Λ_*, this decomposition gives an approach to estimating Z(θ) for all θ. This estimate should be reliable provided θ is close to at least one particle θ^{(i)}. The problem of sampling from probability measures such as Λ_* has recently been considered by [3], building on the Wang-Landau algorithm of [17]. We follow and improve that approach here. The resulting estimate of Z can then continuously be fed to a second Monte Carlo sampler that carries out the simulation with respect to θ. This suggests an adaptive Monte Carlo sampler for (1), which we develop next.
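The importance-sampling identity underlying this construction can be checked on a toy model where Z(θ) is available in closed form. The sketch below (the three-point model, parameter values and sample size are all illustrative choices of ours) verifies the single-particle version, Z(θ)/Z(θ^{(1)}) = E_{θ^{(1)}}[e^{E(X,θ)−E(X,θ^{(1)})}]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model on X = {-1, 0, 1} with E(x, theta) = theta * x, so that
# Z(theta) = sum_x exp(theta * x) is available in closed form and we can
# check the identity  Z(theta)/Z(theta0) = E_theta0[ exp((theta-theta0) X) ].
support = np.array([-1.0, 0.0, 1.0])

def Z(theta):
    return np.exp(theta * support).sum()

theta0, theta = 0.5, 0.8
p0 = np.exp(theta0 * support) / Z(theta0)   # density at theta0

# draw from the density at theta0 and reweight
x = rng.choice(support, size=200_000, p=p0)
ratio_hat = np.exp((theta - theta0) * x).mean()

print(ratio_hat, Z(theta) / Z(theta0))      # the two values should be close
```

As the introduction notes, the estimator degrades quickly as θ moves away from θ^{(1)}; the multi-particle decomposition (4) is precisely what mitigates this.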

For any c = (c(1), …, c(d)) ∈ ℝ^d, we define the following density function on X × {1,…,d}:

 \Lambda_c(x,i) \propto e^{E(x,\theta^{(i)}) - c(i)}. (6)

With c_*(i) = log Z(θ^{(i)}), we have Λ_{c_*} = Λ_*. The reader should think of c(i) as an estimate of log Z(θ^{(i)}), up to a common additive constant. The algorithm will adaptively adjust c such that the marginal distribution of i under Λ_c is approximately uniform, in which case we should have e^{c(i)} ∝ Z(θ^{(i)}). Let {γ_n} be a sequence of (possibly random) positive numbers. We propose a non-Markovian adaptive sampler that lives in X × {1,…,d} × ℝ^d × Θ. We start from an initial state (X_0, I_0, c_0, θ_0), where c_0 is the initial estimate of (log Z(θ^{(1)}), …, log Z(θ^{(d)})); for example, c_0 = (0, …, 0). At time n, given (X_n, I_n, c_n, θ_n), we first generate X_{n+1} from P_{c_n}(X_n, ·), where P_c is a transition kernel on X with invariant distribution proportional to \sum_{i=1}^d e^{E(x,\theta^{(i)})-c(i)}, the x-marginal of Λ_c. Next, we generate I_{n+1} from the distribution on {1,…,d} proportional to e^{E(X_{n+1},\theta^{(i)})-c_n(i)}. Then we update the current estimate of c_* to c_{n+1} given by:

 c_{n+1}(i) = c_n(i) + \gamma_n\, \frac{e^{E(X_{n+1},\theta^{(i)})-c_n(i)}}{\sum_{j=1}^d e^{E(X_{n+1},\theta^{(j)})-c_n(j)}}, \qquad i = 1,\dots,d. (7)

In view of (4), we can estimate Z(θ) by:

 Z_{n+1}(\theta) = \sum_{i=1}^d \kappa(\theta,\theta^{(i)})\, e^{c_{n+1}(i)} \left[ \frac{\sum_{k=1}^{n+1} e^{E(X_k,\theta)-E(X_k,\theta^{(i)})}\, \mathbf{1}_i(I_k)}{\sum_{k=1}^{n+1} \mathbf{1}_i(I_k)} \right], (8)

with the convention that 0/0 = 0. Finally, for any positive function ζ on Θ, let Q_ζ be a transition kernel on Θ with invariant distribution

 \pi_\zeta(\theta) \propto \frac{1}{\zeta(\theta)}\, e^{E(x_0,\theta)}\, \mu(\theta). (9)

Given θ_n, we generate θ_{n+1} from Q_{Z_{n+1}}(θ_n, ·), where Z_{n+1} is the function defined by (8).

The algorithm can be summarized as follows.

Algorithm 2.1.

Initialization: let (X_0, I_0, c_0, θ_0) be the initial state of the algorithm and let {γ_n} be a (possibly random) sequence of positive numbers. At time n, given (X_n, I_n, c_n, θ_n):

1.

Generate X_{n+1} from P_{c_n}(X_n, ·), where P_c is any ergodic kernel on X with invariant distribution proportional to \sum_{i=1}^d e^{E(x,\theta^{(i)})-c(i)}.

2.

Generate I_{n+1} by sampling from the distribution on {1,…,d} proportional to e^{E(X_{n+1},\theta^{(i)})-c_n(i)}, i = 1,…,d.

3.

Compute c_{n+1}, the new estimate of (log Z(θ^{(1)}), …, log Z(θ^{(d)})), using (7).

4.

Using the function Z_{n+1} defined by (8), generate θ_{n+1} from Q_{Z_{n+1}}(θ_n, ·).
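The four steps above can be sketched on a toy model where everything is checkable in closed form. In the sketch below, X = {0,…,4}, E(x,θ) = θx, and all tuning values (particle grid, kernel width h, step-size sequence, proposal scale, iteration count) are illustrative choices of ours, not the paper's; since the joint space is tiny, Steps 1-2 sample (X, I) exactly from Λ_c, which is a valid ergodic kernel here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: X = {0,...,4}, E(x, theta) = theta * x, observed x0 = 3,
# uniform prior on Theta = [0, 2]; log Z(theta) is known exactly.
support = np.arange(5.0)
x0 = 3.0
particles = np.linspace(0.0, 2.0, 5)   # theta^(1..d), illustrative grid
d = len(particles)
h = 0.5                                # smoothing bandwidth for kappa
c = np.zeros(d)                        # running estimates of log Z(theta^(i))
N = np.zeros((d, len(support)))        # counts of (I_k = i, X_k = x); enough
theta = 1.0                            # for (8) since E is linear in x

def kappa(th):
    w = np.exp(-0.5 * (th - particles) ** 2 / h**2)
    return w / w.sum()

def log_Z_hat(th):
    # adaptive estimate (8), via per-particle empirical averages v_i(th)
    v = np.zeros(d)
    for i in range(d):
        tot = N[i].sum()
        v[i] = (N[i] @ np.exp((th - particles[i]) * support)) / tot if tot > 0 else 1.0
    return np.log((kappa(th) * np.exp(c) * v).sum())

thetas = []
for n in range(30_000):
    gamma = 20.0 / (n + 20.0)
    # steps 1-2: draw (I, X) exactly from Lambda_c on this finite space
    joint = np.exp(particles[:, None] * support[None, :] - c[:, None])
    p = (joint / joint.sum()).ravel()
    idx = rng.choice(d * len(support), p=p)
    i_new, x_new = divmod(idx, len(support))
    N[i_new, x_new] += 1
    # step 3: Rao-Blackwellized update (7), using P(I = i | X_{n+1})
    w = np.exp(particles * support[int(x_new)] - c)
    c += gamma * w / w.sum()
    # step 4: one Random Walk Metropolis move targeting e^{theta*x0}/Z_n(theta)
    prop = theta + 0.3 * rng.standard_normal()
    if 0.0 <= prop <= 2.0:
        log_ratio = (prop - theta) * x0 - (log_Z_hat(prop) - log_Z_hat(theta))
        if np.log(rng.uniform()) < log_ratio:
            theta = prop
    thetas.append(theta)

# the learned differences c(i) - c(1) should match the exact log Z differences
true_logZ = np.log(np.exp(particles[:, None] * support[None, :]).sum(axis=1))
print(np.abs((c - c[0]) - (true_logZ - true_logZ[0])).max())
```

The final line reports the worst-case error in the learned log-partition differences; only differences are identified, since (7) lets all the c(i) drift by a common additive constant.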

Remark 2.1.
1. The algorithm can be seen as an MCMC-MCMC analog of the MCMC-MLE of [8]. Indeed, with d = 1, the decomposition (4) becomes

 Z(\theta)/Z(\theta^{(1)}) = \mathbb{E}\left[ e^{E(X,\theta)-E(X,\theta^{(1)})} \right],

where the expectation is taken with respect to the density e^{E(x,\theta^{(1)})}/Z(\theta^{(1)}). But as discussed in the introduction, when e^{E(X,\theta)-E(X,\theta^{(1)})} has a large variance, the resulting estimate is terribly poor.

2. We introduce κ to serve as a smoothing factor, so that the particles θ^{(i)} close to θ contribute more to the estimation of Z(θ). We expect this smoothing step to reduce the variance of the overall estimate of Z(θ). In the simulations we choose

 \kappa(\theta,\theta^{(i)}) = \frac{e^{-\frac{1}{2h^2}\|\theta-\theta^{(i)}\|^2}}{\sum_{j=1}^d e^{-\frac{1}{2h^2}\|\theta-\theta^{(j)}\|^2}}.

The value of the smoothing parameter h is set by trial and error for each example.

3. The implementation of the algorithm requires keeping track of all the samples X_k that are generated (Equation (8)). X can be a very high-dimensional space, and we are aware that in practice this bookkeeping can significantly slow down the algorithm. But in many cases the function E takes the linear form E(x,θ) = Σ_l θ_l S_l(x) for some real-valued statistics S_l. In these cases, we only need to keep track of the statistics S_l(X_k). All the examples discussed in the paper fall in this latter category.

4. As mentioned earlier, the update of c_n is essentially the Wang-Landau algorithm as developed in [3], with the following important difference: [3] propose to update one component per iteration,

 c_{n+1}(i) = c_n(i) + \gamma_n\, \mathbf{1}_{\{i\}}(I_{n+1}).

We improve on this scheme in (7) by Rao-Blackwellization, where we integrate out I_{n+1}.
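The variance reduction from integrating out the label can be seen numerically: by the law of total variance, the conditional probability P(I = i | X) has the same mean as the indicator 1{I = i} but strictly smaller variance. A small synthetic check (the Dirichlet-generated conditional probabilities are a stand-in for the real ones):

```python
import numpy as np

rng = np.random.default_rng(1)

# Compare the plain Wang-Landau increment 1{I = i} with its
# Rao-Blackwellized version P(I = i | X): same mean, smaller variance.
d = 4
# synthetic conditional probabilities p(i | X_k) for 10_000 draws of X
probs = rng.dirichlet(np.ones(d), size=10_000)
I = np.array([rng.choice(d, p=p) for p in probs])   # sampled labels
indicator = (I == 0).astype(float)                  # plain increment for i = 0
rao_black = probs[:, 0]                             # Rao-Blackwellized increment
print(indicator.mean(), rao_black.mean())           # nearly equal
print(indicator.var(), rao_black.var())             # variance clearly smaller
```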

5. As mentioned above, and we stress this again, this algorithm is not Markovian. The process {(X_n, I_n, c_n)} is not a Markov chain in general, although it becomes a nonhomogeneous Markov chain if we let {γ_n} be a deterministic sequence. {θ_n}, the main random process of interest, is not a Markov chain either. Nevertheless, the marginal distribution of θ_n will typically converge to (1). This is because the conditional distribution of θ_{n+1} given the past is Q_{Z_{n+1}}(θ_n, ·), Z_n converges to Z (up to a multiplicative constant) as n → ∞, and Q_Z is a kernel with invariant distribution (1). We make this precise by showing that a strong law of large numbers holds for additive functionals of {θ_n}.

We now discuss the choice of the parameters of the algorithm.

2.1 Choosing d and the particles {θ(i)}

We do not have a general approach for choosing d and the particles {θ^{(i)}}, but we give some guidelines. The general idea is that the particles need to cover reasonably well the range of the posterior density and be such that, for any θ ∈ Θ, the density e^{E(x,θ)}/Z(θ) in x can be well approximated by at least one of the densities e^{E(x,θ^{(i)})}/Z(θ^{(i)}). One possibility is to sample the θ^{(i)} independently from the prior distribution μ, some tempered version of it, or some other similar distribution. We follow this approach in the examples below. Another possibility is to use a grid of points in Θ. The value of d, the number of particles, should depend on the size of Θ. We need to choose the θ^{(i)} such that the distributions e^{E(x,θ^{(i)})}/Z(θ^{(i)}) (in x) overlap; otherwise, estimating the constants Z(θ^{(i)}) can be difficult. This implies that d should not be too small.

2.2 Choosing the step-size {γn}

It is shown in [3] that the recursion (7) can also be written as a stochastic approximation algorithm with step-size γ_n, so that, in theory, any positive sequence such that Σ_n γ_n = ∞ and Σ_n γ_n² < ∞ can be used. But the convergence of c_n to c_* is very sensitive to this choice. If the γ_n are overly small, the recursion (7) makes very small steps; if they are overly large, the algorithm has a large variance. In both cases, convergence to the solution is slow. Overall, choosing the right step-size for a stochastic approximation algorithm is a difficult problem. Here we follow [3], which elaborates on a heuristic approach to this problem originally proposed by [17].

The main idea of the method is that typically, the larger γ_n, the easier it is for the process {(X_n, I_n)} to move around the state space. Therefore, at the beginning γ is set at a relatively large value. This value is kept until {I_n} has visited equally well all the points of {1,…,d}. Let τ_1 be the first time at which the occupation measure of {1,…,d} by {I_n} is approximately uniform. At that time, we set γ to some smaller value (for example γ/2) and iterate the process until γ becomes sufficiently small, at which point we can choose to switch to a deterministic sequence {γ_n} decreasing to zero. Combining this idea with Algorithm 2.1, we get the following.

Algorithm 2.2.

Initialization: let γ > 0 (the initial step size) and ε > 0 (a tolerance) be constants, and let (X_0, I_0, c_0, θ_0) be some arbitrary initial state of the algorithm. Set v(i) = 0 for i = 1,…,d. While γ > ε, and given (X_n, I_n, c_n, θ_n):

1.

Generate (X_{n+1}, I_{n+1}, c_{n+1}, θ_{n+1}) as in Algorithm 2.1.

2.

For i = 1,…,d: set v(i) = v(i) + 1_{\{i\}}(I_{n+1}).

3.

If the occupation measure is approximately uniform (say, max_{1≤i≤d} |v(i)/Σ_j v(j) − 1/d| falls below a fixed threshold), then set γ = γ/2 and reset v(i) = 0 for all i.

Remark 2.2.

We use this algorithm in the examples below with the following specifications. The initial value of γ and the tolerance ε are set for each example. We run {(X_n, I_n, c_n)} until γ falls below ε before starting the chain {θ_n} and switching to a deterministic sequence {γ_n}.
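The flat-histogram schedule above can be sketched in a few lines. In the toy sketch below the labels are drawn uniformly as a stand-in for the I-chain, and all numeric values (d, the 20% flatness threshold, the initial γ and the stopping level) are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)

# Flat-histogram step-size schedule: keep gamma fixed, track how often
# each label i in {1..d} has been visited, and halve gamma each time the
# occupation frequencies are all within a tolerance of uniform.
d, gamma, gamma_min = 8, 1.0, 1e-3
counts = np.zeros(d)
stage_lengths = []                       # iterations spent at each gamma level
n_in_stage = 0
while gamma > gamma_min:
    counts[rng.integers(d)] += 1         # stand-in draw of I_{n+1}
    n_in_stage += 1
    freq = counts / counts.sum()
    if counts.min() > 0 and np.abs(freq - 1.0 / d).max() < 0.2 / d:
        stage_lengths.append(n_in_stage)
        gamma /= 2.0                     # flat histogram reached: shrink gamma
        counts[:] = 0.0                  # and restart the occupation measure
        n_in_stage = 0
print(len(stage_lengths), gamma)
```

Halving from 1.0 down past 10^{-3} takes ten stages; each stage necessarily lasts at least d iterations, since every label must be visited before the histogram can be flat.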

3 Convergence analysis

In this section, we derive a law of large numbers under some verifiable conditions. The process of interest here is {θ_n}. Let (Ω, F, Pr) be the reference probability triplet. We equip it with the filtration {F_n}, where F_n = σ((X_k, I_k, c_k, θ_k, γ_k), k ≤ n). Note that F_n includes γ_n, since these random variables are used in generating (X_{n+1}, I_{n+1}, c_{n+1}, θ_{n+1}). From the definition of the algorithm, we have:

 \Pr(\theta_{n+1} \in A \,|\, \mathcal{F}_n) = Q_{Z_{n+1}}(\theta_n, A), \quad \Pr\text{-a.s.} (10)

We see from (10) that {θ_n} is an adaptive Monte Carlo algorithm with a varying target distribution. In analyzing {θ_n} here, we do not strive for the most general result but restrict ourselves to conditions that can be easily checked in the examples considered in the paper. We assume that Θ is a compact subset of the finite-dimensional Euclidean space, equipped with its Borel σ-algebra and the Lebesgue measure. Firstly, we assume that the function E is bounded:

(A1): There exist constants −∞ < m ≤ M < ∞ such that:

 m \le E(x,\theta) \le M, \qquad x \in \mathsf{X},\; \theta \in \Theta. (11)

In many applications, and this is the case for the examples discussed below, X is a finite set (typically very large) and Θ is a compact set. In these cases, (A1) is easily checked. In order to proceed any further, we need some notation. A transition kernel P on Θ operates on measurable real-valued functions f as Pf(θ) = ∫ P(θ, dθ′) f(θ′), and the product of two transition kernels P and Q is the transition kernel defined as PQ(θ, A) = ∫ P(θ, dθ′) Q(θ′, A). We can then define recursively P^n = P P^{n−1}, n ≥ 1, with P^0(θ, ·) the Dirac mass at θ. For two probability measures μ, ν, the total variation distance between μ and ν is defined as ∥μ − ν∥_TV = sup_{|f|≤1} |μ(f) − ν(f)|. We say that a transition kernel P is geometrically ergodic if P is irreducible, aperiodic and has an invariant distribution π such that:

 \|P^n(x,\cdot) - \pi\|_{\mathrm{TV}} \le M(x)\, \rho^n, \qquad n \ge 0,

for some ρ ∈ (0, 1) and some finite function M(x).

Our next assumption involves the transition kernel Q_ζ.

(A2): For any positive function ζ, Q_ζ is a Metropolis kernel with invariant distribution π_ζ and proposal kernel density p(θ, θ′). There exist ε > 0 and an integer n_0 ≥ 1 such that for all θ, θ′ ∈ Θ:

 p^{n_0}(\theta, \theta') \ge \varepsilon. (12)
Remark 3.1.
1. The condition (12) clearly holds for most symmetric proposal kernels, provided that p(θ, θ′) remains bounded away from 0 for θ′ in some ball centered at θ.

2. (12) often implies that Q_ζ is uniformly ergodic. Indeed,

 Q_\zeta(\theta,A) \ge \int_A \min\left(1,\, e^{E(x_0,\theta')-E(x_0,\theta)}\, \frac{\zeta(\theta)}{\zeta(\theta')}\right) p(\theta,\theta')\, d\theta' \ge e^{m-M} \inf_{\theta,\theta'\in\Theta}\left(\frac{\zeta(\theta)}{\zeta(\theta')}\right) \int_A p(\theta,\theta')\, d\theta'.

Therefore, provided inf_{θ,θ′∈Θ} ζ(θ)/ζ(θ′) > 0, if (12) holds then Q_ζ^{n_0} satisfies a minorization condition Q_ζ^{n_0}(θ, ·) ≥ ε_0 ν(·) for some ε_0 > 0 and some probability measure ν, and Q_ζ is thus uniformly ergodic.

(A3): {γ_n} is a random sequence, adapted to {F_n}, such that γ_n > 0, Σ_n γ_n = ∞ and Σ_n γ_n² < ∞, Pr-a.s.

Theorem 3.1.

Assume (A1)-(A3). Assume also that each kernel P_c on X is geometrically ergodic. Let h be a measurable function on Θ such that |h| ≤ 1. Then:

 \frac{1}{n}\sum_{k=1}^n h(\theta_k) \to \pi(h), \quad \Pr\text{-a.s.} (13)

See Section 6.∎

4 Examples

4.1 Ising model

We test the algorithm with the Ising model on a rectangular lattice; this is a simulated example. The model is given by p(x|θ) ∝ e^{θE(x)}, where

 E(x) = \sum_{i=1}^m \sum_{j=1}^{n-1} x_{ij}\, x_{i,j+1} + \sum_{i=1}^{m-1} \sum_{j=1}^n x_{ij}\, x_{i+1,j}, (14)

and x_{ij} ∈ {−1, 1}. We generate the data x_0 from the model by perfect sampling using the Propp-Wilson algorithm. Using x_0 and postulating the model p(x|θ) ∝ e^{θE(x)}/Z(θ), we would like to infer θ. We use a uniform prior on (0, 1) and generate the particles θ^{(i)} independently and uniformly in (0, 1). As described in Section 2.2, we use the flat histogram approach in selecting {γ_n}, running the algorithm until γ becomes sufficiently small. Then we start feeding the adaptive chain {θ_n}, which is run for the remaining iterations. In updating θ_n, we use a Random Walk Metropolis sampler with Gaussian proposal (with reflection at the boundaries) whose scale is adaptively tuned to reach a prescribed acceptance rate (see e.g. [2]). We discard the first part of the run as a burn-in period. The results are plotted in Figure 1. As we can see from these plots, the sampler appears to have converged to the posterior distribution. The mixing rate of the algorithm, as inferred from the autocorrelation function, seems fairly good. In addition, the algorithm yields an estimate of the partition function Z(θ), which can be re-used in other sampling problems.
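For concreteness, the sufficient statistic (14) can be computed with vectorized array slicing rather than explicit double loops; the 4×4 test configurations below are illustrative.

```python
import numpy as np

# Ising sufficient statistic E(x) from (14): sum of products of
# horizontally and vertically adjacent spins on an m-by-n lattice.
def ising_energy(x):
    x = np.asarray(x, dtype=float)
    horiz = (x[:, :-1] * x[:, 1:]).sum()   # x_ij * x_{i,j+1}
    vert = (x[:-1, :] * x[1:, :]).sum()    # x_ij * x_{i+1,j}
    return horiz + vert

x = np.ones((4, 4))        # fully aligned configuration
print(ising_energy(x))     # 12 horizontal + 12 vertical pairs -> 24.0
```

On a checkerboard configuration every neighboring pair has opposite signs, so the same call returns −24.0.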

Figure 1: Output for the Ising model. (a): estimation of log Z(θ) up to an additive constant; (b)-(d): trace plot, histogram and autocorrelation function of the adaptive sampler {θ_n}.

4.2 An application to image segmentation

We use the Ising model above to illustrate an application of the methodology to image segmentation. In image segmentation, the goal is the reconstruction of images from noisy observations (see e.g. [10, 9]). We represent the image by a vector x = (x(s), s ∈ S), where S is a lattice. Each s ∈ S represents a pixel, and x(s) is the color of pixel s, which takes one of a finite number of colors. Here we assume that x(s) ∈ {−1, 1}, so that each pixel is either black or white. In the image segmentation problem, we do not observe x but a noisy approximation y. We assume that:

 y_i \,|\, x, \sigma^2 \;\stackrel{\mathrm{ind}}{\sim}\; \mathcal{N}(x_i, \sigma^2), (15)

for some unknown parameter σ² > 0. Even though (15) is a continuous model, it has been shown to provide a relatively good framework for image segmentation problems with multiple additive sources of noise ([10]).

We assume that the true image x is generated from an Ising model (see Section 4.1) with interaction parameter θ. We assume that θ follows a uniform prior distribution on (0, 1) and that σ² has a prior distribution proportional to 1/σ². The posterior distribution is then given by:

 \pi(\theta,\sigma^2,x \,|\, y) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{|S|}{2}+1} \frac{e^{\theta E(x)}}{Z(\theta)}\, e^{-\frac{1}{2\sigma^2}\sum_{s\in S}(y(s)-x(s))^2}\, \mathbf{1}_{(0,1)}(\theta)\, \mathbf{1}_{(0,\infty)}(\sigma^2), (16)

where E(x) is as in (14).

We sample from this posterior distribution using the adaptive chain {(X_n, I_n, c_n)}, updated following Steps (1)-(3) of Algorithm 2.1. It is used to form the adaptive estimate Z_n(θ) of Z(θ) as given by (8). These estimates are used to update (θ, σ², x) using a Metropolis-within-Gibbs scheme. More specifically, given (σ², x), we sample θ with a Random Walk Metropolis step with Gaussian proposal (with reflection at the boundaries) and target proportional to e^{θE(x)}/Z_n(θ) on (0, 1). Given (θ, x), we generate σ² by sampling from the Inverse-Gamma distribution with parameters (|S|/2, \frac{1}{2}\sum_{s\in S}(y(s)-x(s))^2). And given (θ, σ²), we sample each x(s) from its conditional distribution given the rest. This conditional distribution is given by

 p(x(s)=a \,|\, x(u), u \ne s) \propto \exp\!\left(\theta a \sum_{u\sim s} x(u) - \frac{1}{2\sigma^2}(y(s)-a)^2\right), \qquad a \in \{-1, 1\}.

Here u ∼ s means that pixels u and s are neighbors.
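The single-site conditional above can be sketched as a small Gibbs update. This is a minimal sketch assuming a free-boundary lattice; the helper name and the numeric values in the demonstration call are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

# One Gibbs step for the segmentation model: the conditional probability
# of pixel value a in {-1, +1} combines the Ising prior term
# theta * a * (sum of neighboring spins) with the Gaussian data term.
def gibbs_update_pixel(x, y, s, theta, sigma2, rng):
    i, j = s
    m, n = x.shape
    nb = sum(x[u] for u in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
             if 0 <= u[0] < m and 0 <= u[1] < n)   # free boundary
    logits = np.array([theta * a * nb - (y[i, j] - a) ** 2 / (2.0 * sigma2)
                       for a in (-1.0, 1.0)])
    p = np.exp(logits - logits.max())              # stable normalization
    p /= p.sum()
    x[i, j] = rng.choice([-1.0, 1.0], p=p)
    return x

x = np.ones((3, 3))
y = np.ones((3, 3))   # data strongly supporting +1 at every pixel
gibbs_update_pixel(x, y, (1, 1), theta=0.4, sigma2=0.1, rng=rng)
```

With y(s) = +1, a small σ² and all neighbors at +1, the conditional probability of flipping to −1 is vanishingly small, so the center pixel stays at +1.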

To test this algorithm, we generate a simulated dataset y according to model (15), with x generated from the Ising model by perfect sampling. For the implementation details of the algorithm, we make exactly the same choices as in Section 4.1 above. In particular, we generate the θ^{(i)} uniformly in (0, 1). The results are given in Figure 2. Once again, the sample paths clearly suggest that the distribution of the chain has converged to the posterior (16) with a good mixing rate, as inferred from the autocorrelation plots.

Figure 2: Output for the image segmentation model. (a)-(c): plots of ; (d)-(f): plots of .

4.3 Social network modeling

We now give an application of the method to a Bayesian analysis of social networks. Statistical modeling of social networks is a growing subject in social science (see e.g. [16] and the references therein for more details). The setup is the following. We have n actors. For each pair (i, j), define y_{ij} = 1 if actor i has ties with actor j and y_{ij} = 0 otherwise. In the example below, we only consider the case of a symmetric relationship, where y_{ij} = y_{ji} for all (i, j). The ties referred to here can be of various natures. For example, we might be interested in modeling how friendships build up between co-workers or how research collaboration takes place between colleagues. Another interesting example from political science is modeling co-sponsorship ties (for a given piece of legislation) between members of a house of representatives or parliament.

In this example, we study the Medici business network dataset taken from [16], which describes the business ties between 16 Florentine families. Numbering the families arbitrarily from 1 to 16, we plot the observed social network in Figure 3. The dataset contains relatively few ties between families and even fewer transitive ties.

Figure 3: Business Relationships between Florentine families.

One of the most popular models for social networks is the class of exponential random graph models. In these models, we assume that the observed network y is a sample generated from the distribution

 p(y \,|\, \theta_1,\dots,\theta_K) \propto \exp\!\left(\sum_{i=1}^K \theta_i S_i(y)\right), (17)

for some parameters θ_1, …, θ_K, where S_i(y) is a statistic used to capture some aspect of the network. For this example, and following [16], we consider a 4-dimensional model with statistics

 S_1(y) = \sum_{i<j} y_{ij}, the number of edges of y; S_2(y), the number of two-stars; S_3(y), the number of three-stars; and S_4(y), the number of triangles.

We assume a uniform prior distribution on a compact set D for θ = (θ_1, …, θ_4), and the posterior distribution writes:

 \pi(\theta \,|\, y) \propto \frac{1}{Z(\theta)} \exp\!\left(\sum_{k=1}^4 \theta_k S_k(y)\right) \mathbf{1}_D(\theta). (18)
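Network statistics of this kind are cheap to evaluate from the adjacency matrix. The sketch below computes two standard exponential-random-graph statistics, edge and triangle counts, for a symmetric binary matrix (the exact S_k used in the example are model-specific, and the tiny test graph is illustrative):

```python
import numpy as np

# Two standard network statistics for a symmetric adjacency matrix y.
def edge_count(y):
    return np.triu(y, k=1).sum()          # each undirected edge counted once

def triangle_count(y):
    # trace(y^3) counts each triangle 6 times (3 vertices x 2 directions)
    return np.trace(np.linalg.matrix_power(y, 3)) / 6

y = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3)]:   # one triangle plus a tail
    y[i, j] = y[j, i] = 1.0
print(edge_count(y), triangle_count(y))
```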

We use Algorithm 2.1 to sample from (18). For this example, the particles θ^{(i)} are generated from a normal distribution. We use the same parametrization as in the previous examples to update (X_n, I_n, c_n). For the adaptive chain {θ_n} we use a slightly different strategy. It turns out that some of the components of the target distribution are strongly related. Therefore we sample θ in one block, using a Random Walk Metropolis step with a Gaussian kernel (restricted to D) with covariance s²Σ for some s > 0 and a positive definite matrix Σ. We adaptively tune s so as to reach a prescribed acceptance rate. Ideally, we would like Σ to be the variance-covariance matrix of the posterior distribution which, of course, is not available; we therefore adaptively estimate Σ during the simulation, as in [2]. As before, we run {(X_n, I_n, c_n)} until the step size is sufficiently small. Then we start {θ_n} and run the full chain. The posterior distributions of the parameters are given in Figures 4a-4d. In Table 1, we give the sample posterior means together with lower and upper quantiles of the posterior distribution. Overall, these results are consistent with the maximum likelihood estimates obtained by [16] using MCMC-MLE. The main difference appears in one coefficient, which we find here not to be significant. As a by-product, the sampler also gives an estimate of the variance-covariance matrix of the posterior distribution.

Figure 4a: The adaptive MCMC output from (18): plots for θ_1.

Figure 4b: The adaptive MCMC output from (18): plots for θ_2.

Figure 4c: The adaptive MCMC output from (18): plots for θ_3.

Figure 4d: The adaptive MCMC output from (18): plots for θ_4.

5 Conclusion

Sampling from posterior distributions with intractable normalizing constants is a difficult computational problem. Thus far, all methods proposed in the literature but one entail approximations that do not vanish asymptotically, and the only exception ([12]) requires exact sampling in the data space, which is possible only in very specific cases. In this work, we propose an approach that is both more general than [12] and satisfies a strong law of large numbers with limiting distribution equal to the target distribution. The few applications we have presented here suggest that the method is promising. It remains to be determined how the method will scale with the dimensionality and the size of the problems, although in this respect, adaptations of the method involving annealing/tempering schemes are easy to imagine, which would allow large problems to be analysed properly.

Acknowledgements

The research of the third author was partly supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy 75012 Paris) through the 2006-2008 project Adap'MC.

6 Proof of Theorem 3.1

Proof.

Throughout the proof, C will denote a finite constant whose actual value can change from one equation to the next. The convergence of the Wang-Landau algorithm has been studied in [3]. It is shown in this work that, under the conditions of Theorem 3.1, (X_n, I_n) has limiting distribution Λ_* and, more importantly, e^{c_n(i)} converges almost surely to Z(θ^{(i)}) (up to a multiplicative constant).

Define

 \omega_n(i) = \frac{e^{c_n(i)}}{\sum_{j=1}^d e^{c_n(j)}},
 v_{n,i}(\theta) = \frac{\sum_{k=1}^n e^{E(X_k,\theta)-E(X_k,\theta^{(i)})}\, \mathbf{1}_i(I_k)}{\sum_{k=1}^n \mathbf{1}_i(I_k)},

and

 \tilde{Z}_n(\theta) := \frac{Z_n(\theta)}{\sum_{j=1}^d e^{c_n(j)}} = \sum_{i=1}^d \kappa(\theta,\theta^{(i)})\, \omega_n(i)\, v_{n,i}(\theta). (20)

Instead of Z_n, we work with \tilde{Z}_n. This is equivalent because the normalization \sum_{j=1}^d e^{c_n(j)} does not depend on θ, and Z_n always appears in Q_{Z_n} through ratios. We have:

 \inf_{\theta\in\Theta} \tilde{Z}_n(\theta) \ge e^{m-M}, (21)
 \inf_{\theta,\theta'\in\Theta}\left(\frac{\tilde{Z}_n(\theta)}{\tilde{Z}_n(\theta')}\right) \ge e^{2(m-M)}. (22)

Combining (22) with (12) and part 2 of Remark 3.1, we deduce that there exists ε_0 ∈ (0, 1) such that for all j ≥ 0:

 \sup_{|h|\le 1}\left| Q_{Z_n}^j h(\theta) - \pi_{Z_n}(h) \right| \le 2(1-\varepsilon_0)^j, \quad \Pr\text{-a.s.} (23)

We introduce the notation \bar{Q}_n^j h(\theta) := Q_{Z_n}^j h(\theta) - \pi_{Z_n}(h). It follows from (23) that for any measurable h with |h| ≤ 1, the following function is well defined:

 g_n(\theta) = \sum_{j=0}^{\infty} \bar{Q}_n^j h(\theta).

Moreover, |g_n(θ)| ≤ 2/ε_0 for all θ ∈ Θ, and g_n satisfies Poisson's equation for Q_{Z_n} and h:

 g_n(\theta) - \bar{Q}_n g_n(\theta) = h(\theta) - \pi_{Z_n}(h). (24)
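The identity (24) can be verified directly: since \pi_{Z_n}(\bar{Q}_n^j h) = 0, one has \bar{Q}_n(\bar{Q}_n^j h) = \bar{Q}_n^{j+1} h, so the series defining g_n (taken to start at j = 0, with \bar{Q}_n^0 h = h - \pi_{Z_n}(h)) telescopes:

```latex
g_n(\theta) - \bar{Q}_n g_n(\theta)
  = \sum_{j=0}^{\infty} \bar{Q}_n^j h(\theta)
    - \sum_{j=1}^{\infty} \bar{Q}_n^j h(\theta)
  = \bar{Q}_n^0 h(\theta)
  = h(\theta) - \pi_{Z_n}(h),
```

both series converging absolutely thanks to the geometric bound (23).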

Using this, we can rewrite \frac{1}{n}\sum_{k=1}^n (h(\theta_k) - \pi_{Z_k}(h)) as:

 \frac{1}{n}\sum_{k=1}^n \left(h(\theta_k) - \pi_{Z_k}(h)\right) = \frac{1}{n}\sum_{k=1}^n \left(g_k(\theta_k) - \bar{Q}_k g_k(\theta_{k-1})\right) + \frac{1}{n}\sum_{k=1}^n \left(\bar{Q}_k g_k(\theta_{k-1}) - \bar{Q}_{k-1} g_{k-1}(\theta_{k-1})\right) (25)
 + \frac{1}{n}\left(\bar{Q}_0 g_0(\theta_0) - \bar{Q}_n g_n(\theta_n)\right).

Since |g_n| ≤ 2/ε_0, a similar bound holds for \bar{Q}_n g_n, and we conclude that the last term of (25) converges almost surely to 0 as n → ∞. Writing D_k := g_k(\theta_k) - \bar{Q}_k g_k(\theta_{k-1}), it is easily seen that {D_k} is a martingale-difference sequence with bounded increments, and we deduce from martingale theory that the first term of (25) converges almost surely to 0 as n → ∞.

Since Q_ζ is a Metropolis kernel, and using the fact that \tilde{Z}_n is bounded above and below uniformly in n (by (21)-(22)), we deduce that for any function h such that |h| ≤ 1,

 \left|\left(Q_{\tilde{Z}_n} - Q_{\tilde{Z}_{n-1}}\right) h(\theta)\right| \le \int \left|\frac{\tilde{Z}_n(\theta)}{\tilde{Z}_n(\theta')} - \frac{\tilde{Z}_{n-1}(\theta)}{\tilde{Z}_{n-1}(\theta')}\right| e^{E(x_0,\theta')-E(x_0,\theta)}\, p(\theta,\theta')\, \left|h(\theta') - h(\theta)\right|\, d\theta'