Pairwise MRF Calibration by Perturbation of the Bethe Reference Point
Abstract
We investigate different ways of generating approximate solutions to the pairwise Markov random field (MRF) selection problem. We focus mainly on the inverse Ising problem, but discuss also the somewhat related inverse Gaussian problem because both types of MRF are suitable for inference tasks with the belief propagation algorithm (BP) under certain conditions. Our approach consists in to take a Bethe meanfield solution obtained with a maximum spanning tree (MST) of pairwise mutual information, referred to as the Bethe reference point, for further perturbation procedures. We consider three different ways following this idea: in the first one, we select and calibrate iteratively the optimal links to be added starting from the Bethe reference point; the second one is based on the observation that the natural gradient can be computed analytically at the Bethe point; in the third one, assuming no local field and using low temperature expansion we develop a dual loop joint model based on a well chosen fundamental cycle basis. We indeed identify a subclass of planar models, which we refer to as Bethedual graph models, having possibly many loops, but characterized by a singly connected dual factor graph, for which the partition function and the linear response can be computed exactly in respectively O(N) and operations, thanks to a dual weight propagation (DWP) message passing procedure that we set up. When restricted to this subclass of models, the inverse Ising problem being convex, becomes tractable at any temperature. Experimental tests on various datasets with refined or regularization procedures indicate that these approaches may be competitive and useful alternatives to existing ones.
Contents
1 Introduction
The problem at stake is a model selection problem, in the MRF families, where variables are observed pair by pair. The optimal solution is the MRF with maximal entropy obeying moment constraints. For binary variables, this happens then to be the Ising model with highest loglikelihood. It is a difficult problem, where both the graph structure and the values of the fields and couplings have to be found. In addition, we wish to ensure that the model is compatible with the fast inference algorithm “belief propagation” (BP) to be useful at large scale for realtime inference tasks. This leads us to look at least for a good tradeoff between likelihood and sparsity.
Concerning the Inverse Ising Problem (IIP), the existing approaches fall mainly in the following categories:

Common analytical approaches are based on the Plefka expansion [39] of the Gibbs free energy by making the assumption that the coupling constants are small. The picture is then of a weakly correlated unimodal probability measure. For example, the recent approach proposed in [9] is based on this assumption.

Another possibility is to assume that relevant coupling have locally a treelike structure. The Bethe approximation [43] can then be used with possibly loop corrections. Again this corresponds to having a weakly correlated unimodal probability measure and these kind of approaches are referred to as pseudomoment matching methods in the literature for the reason explained in the previous section. For example the approaches proposed in [29, 41, 34, 42] are based on this assumption.

In the case where a multimodal distribution is expected, then a model with many attraction basins is to be found and Hopfieldlike models [24, 10] are likely to be more relevant. To be mentioned also is a recent meanfield methods [38] which allows one to find in some simple cases the Ising couplings of a low temperature model, i.e. displaying multiple probabilistic modes.
On the side of inverse Gaussian problem, not surprisingly similar methods have been developed by explicitly performing and matrix norm penalizations on the inverse covariance matrices, so as to determine sparse nonzero couplings in estimated inverse covariance matrices for largescale statistical inference applications [17, 25] where direct inversion is not amenable. In our context the goal is a bit different. In general cases, the underlying inverse covariance matrix is not necessarily sparse. What we aim to find is a good sparse approximation to the exact inverse covariance matrix. Furthermore, sparsity constraint is not enough for constructing graph structure to be used in conjunction with BP. Known sufficient conditions celled walksummability [31] (WS) are likely to be imposed instead of (or in addition to) the sparsity constraint. To the best of our knowledge not much work is taking this point into consideration at the noticeable exception of [2] by restricting the class of learned graph structures. To complete this overview, let us mention also that some authors proposed information based structure learning methods [36] quite in line with some approaches to be discussed in the present paper.
In some preceding work dealing with a road traffic inference application, with large scale and real time specifications [20, 19, 18], we have noticed that these methods could not be used blindly without drastic adjustment, in particular to be compatible with belief propagation. This led us to develop some heuristic models related to the Bethe approximation. The present work is an attempt to give a theoretical basis and firmer ground to these heuristics and to develop new ones.
The paper is organized as follows: in Section 2 we review some standard statistical physics approaches to the IIP, mainly based on perturbation expansions, and derive some new and useful expressions of susceptibility coefficients in Section 2.2. In Section 3 we explain what we mean by the Bethe reference point and develop an iterative proportional scaling (IPS) based method to incrementally, link by link, refine this approximate solution, both for the inverse Ising and inverse GMRF problems. Section 4 explores a second way to refine the Bethe reference point, based on the Plefka’s expansion and on results of Section 2.2. A third method, based on strong coupling expansion and leading to a dual weight propagation algorithm (DWP) is presented in Section 5. Merits of these methods differs, which makes them complementary to each other. The first one is particularly useful when the underlying graph structure is not given; the second one, by giving explicitly the natural gradient direction, can be used to reduce the number of parameters to tune; finally the third one can be fast and exact for given very sparse structure, assuming in addition no local fields. For sake of comparison, we explain in Section 6 how to adapt standard and normpenalized sparsity inducing optimization framework for finding relevant approximate solutions to the inverse GMRF problem in our context. Comparison is then made in Section 7 with our own IPS based approach, in addition to other numerical experiments illustrating performances of the other methods.
2 Preliminaries
2.1 Inverse Ising problem
In this section we consider binary variables (), which at our convenience may be also written as spin variables . We assume that from a set of historical observations, the empirical mean (resp. covariance ) is given for each variable (resp. each pair of variable ). In this case, from Jayne’s maximum entropy principle [28], imposing these moments to the joint distribution leads to a model pertaining to the exponential family, the socalled Ising model:
(2.1) 
where the local fields and the coupling constants are the Lagrange multipliers associated respectively to mean and covariance constraints when maximizing the entropy of . They are obtained as minimizers of the dual optimization problem, namely
(2.2) 
with
(2.3) 
the log likelihood. This leads to invert the linear response equations:
(2.4)  
(2.5) 
being the empirical expectation of . As noted e.g. in [9], the solution is minimizing the cross entropy, a KullbackLeibler distance between the empirical distribution based on historical data and the Ising model:
(2.6) 
The set of equations (2.4,2.5) cannot be solved exactly in general because the computational cost of is exponential. Approximations resorting to various mean field methods can be used to evaluate .
Plefka’s expansion
To simplify the problem, it is customary to make use of the Gibbs free energy, i.e. the Legendre transform of the free energy, to impose the individual expectations for each variable:
(with , is the ordinary scalar product) where depends implicitly on through the set of constraints
(2.7) 
Note that by duality we have
(2.8) 
and
(2.9) 
i.e. the inverse susceptibility matrix. Finding a set of satisfying this last relation along with (2.8) yields a solution to the inverse Ising problem since the ’s and ’s are given. Still a way to connect the couplings directly with the covariance matrix is given by the relation
(2.10) 
The Plefka expansion is used to expand the Gibbs free energy in power of the coupling assumed to be small. Multiplying all coupling by yields the following cluster expansion:
(2.11)  
(2.12) 
where each term corresponds to cluster contributions of size in the number of links involved, and depends implicitly on in order to always fulfill (2.7). This precisely is the Plefka expansion, and each term of the expansion (2.12) can be obtained by successive derivation of (2.11). We have
Letting
using (2.7), the two first derivatives of (2.11) w.r.t read
(2.13)  
(2.14) 
where subscript indicates that expectations, variance and covariance are taken at given . To get successive derivatives of one can use (2.8). Another possibility is to express the fact that is fixed,
giving
(2.15) 
where is the susceptibility delivered by the model when . To get the first two terms in the Plefka expansion, we need to compute these quantities at :
(by convention in these sums). The first and second orders then finally read:
and correspond respectively to the mean field and to the TAP approximation. Higher order terms have been computed in [21].
At this point we are in position to find an approximate solution to the inverse Ising problem, either by inverting equation (2.9) or (2.10). To get a solution at a given order in the coupling, solving (2.10) requires at order , while it is needed at order in (2.9).
Taking the expression of up to second order gives
and (2.10) leads directly to the basic meanfield solution:
(2.16) 
At this level of approximation for , using (2.8) we also have
which corresponds precisely to the TAP equations. Using now (2.9) gives
Ignoring the diagonal terms, the TAP solution is conveniently expressed in terms of the inverse empirical susceptibility,
(2.17) 
where the branch corresponding to a vanishing coupling in the limit of small correlation i.e. small and for , has been chosen.
Bethe approximate solution
When the graph formed by the pairs for which the correlations are given by some observations is a tree, the following form of the joint probability corresponding to the Bethe approximation:
(2.18) 
yields actually an exact solution to the inverse problem (2.2), where the are the single and pair variables empirical marginal given by the observations. Using the following identity
where the following parametrization of the ’s
(2.19)  
(2.20) 
relating the empirical frequency statistics to the empirical “magnetizations” , can be used. Summing up the different terms gives us the mapping onto an Ising model (2.1) with
(2.21)  
(2.22) 
where is the number of neighbors of , using the notation for “ neighbor of ”. The partition function is then explicitly given by
(2.23) 
The corresponding Gibbs free energy can then be written explicitly using (2.21–2.23). With fixed magnetizations ’s, and given a set of couplings , the parameters are implicit function
obtained by inverting the relations (2.22). For the linear response, we get from (2.21) a result derived first in [41]:
Using (2.22), we can also express
so that with little assistance of Maple, we may finally reach the expression given in [37]
(2.24) 
equivalent to the original one derived in [41], albeit written in a different form, more suitable to discuss the inverse Ising problem. This expression is quite paradoxical since the inverse of the matrix, which coefficients appear on the right hand side of this equation, should coincide with the left hand side, given as input of the inverse Ising problem. The existence of an exact solution can therefore be checked directly as a selfconsistency property of the input data : for a given pair either:

then but , which can be nonvanishing, is obtained by inverting defined by (2.24).
Finally, complete consistency of the solution is checked on the diagonal elements in (2.24). If full consistency is not verified, this equation can nevertheless be used to find approximate solutions. Remark that, if we restrict the set of equations (2.24), e.g. by some thresholding procedure, in such a way that the corresponding graph is a spanning tree, then, by construction, will be solution on this restricted set of edges, simply because the BP equations are exact on a tree. The various methods proposed for example in [34, 42] actually correspond to different heuristics for finding approximate solutions to this set of constraints. As noted in [37], a direct way to proceed is to eliminate in the equations obtained from (2.22) and (2.24):
This leads directly to
(2.25) 
while the corresponding computed of , instead of the observed one , has to be inserted in (2.21) to be fully consistent. Note that and coincide at second order in .
Hopfield model
As mentioned in the introduction when the distribution to be modeled is multimodal, the situation corresponds to finding an Ising model in the low temperature phase with many modes, referred to as Mattis states in the physics literature. Previous methods assume implicitly a high temperature where only one single mode, “the paramagnetic state” is selected. The Hopfield model, introduced originally to model autoassociative memories, is a special case of an Ising model, where the coupling matrix is of low rank and corresponds to the sum of outer products of given spin vectors , each one representing a specific attractive pattern:
In our inference context, these patterns are not given directly, the input of the model being the covariance matrix. In [10] these couplings are interpreted as the contribution stemming from the largest principle axes of the correlation matrix. This lead in particular the authors to propose an extension of the Hopfield model by introducing repulsive patterns to take as well into account the smallest principal axes. Assuming small patterns coefficients , they come up with the following couplings with highest likelihood:
at first order of the perturbation. At this order of approximation the local fields are given by
In a previous study [19] we found a connection between the plain direct BP method with the Hopfield model, by considering a parameter deformation of the Bethe inference model (2.18)
(2.26) 
with . We observed indeed that when the data corresponds to some multimodal measure with well separated components, this measure coincides asymptotically with an Hopfield model made only of attractive pattern, representative of each component of the underlying measure. represents basically the inverse temperature of the model and is easy to calibrate in practice.
2.2 More on the Bethe susceptibility
The explicit relation (2.24) between susceptibility and inverse susceptibility coefficients is not the only one that can be obtained. In fact, it is the specific property of a singly connected factor graph that two variables and , conditionally to a variable are independent if is on the path between and along the tree:
Using the parametrization (2.19,2.20) with yields immediately
(2.27) 
By recurrence we get, as noticed in e.g. [35], given the path between and along the tree
reflecting the factorization of the joint measure. This expression actually coincides with (2.24) only on a tree. On a loopy graph, this last expression should be possibly replaced by a sum over paths.
Higher order susceptibility coefficients are built as well in terms of elementary building blocks given by the pairwise susceptibility coefficients . The notations generalize into the following straightforward manner:
where and are respectively three and four point susceptibilities. The quantities being related to the corresponding marginals similarly to (2.19,2.20):
Using the basic fact that, on the tree
when is on the path given by , and
when path , and along intersect on vertex , we obtain
For the fourth order, more cases have to be distinguished. When , , and are aligned as on Figure 2.1.c, in this order on the path along we have
which leads to
When path , and along intersect on vertex as on Figure 2.1.d, we obtain instead ^{1}^{1}1This apparently nonsymmetric expression can be symmetrized with help of (2.27).
For the situation corresponding to Figure 2.1.e, we have
which leads to
Finally, for the situation corresponding to Figure 2.1.f, we have
leading to
2.3 Sparse inverse estimation of covariance matrix
Let us leave the Ising modeling issue aside for a while and introduce another related graph selection problem, named sparse inverse covariance estimation, which is defined on real continuous random variables. This method aims at constructing a sparse factor graph structure by identifying conditionally independent pairs of nodes in the graph, given empirical covariances of random variables. Assuming that all nodes in the graph follow a joint multivariate Gaussian distribution with mean and covariance matrix , the existing correlation between the nodes and given all the other nodes in the graph are indicated by the nonzero th entry of the precision matrix , while zero entries correspond to independent pairs of variables. Therefore, under the joint normal distribution assumption, selection of factor graph structures amounts to finding the sparse precision matrix that best describes the underlying data distribution, given the fixed empirical covariance matrix. When the derived inverse estimation is sparse, it becomes easier to compute marginal distribution of each random variable and conduct statistical inference. To achieve that goal, optimizations methods have been developed based on or norm penalty for the estimation of , to enhance its sparse structure constraint on the estimated inverse of covariance matrix and discover underlying conditionally independent parts.
Let be the empirical covariance matrix of random variables (represented as the nodes in the graph model). The sparsity penalized maximum likelihood estimation of the precision matrix can be derived by solving the following positive definite cone program:
(2.28) 
where
(2.29) 
is the log likelihood of the distribution defined by , being the logarithm of the determinant, and is a sparsity inducing regularization term [17]. is the regularization coefficient balancing the datafitting oriented likelihood and sparsity penalty. Since the precision matrix of joint normal distribution should be positive definite, any feasible solution to this optimization problem is thus required to locate within a positive definite cone. The penalty term is typically constructed using sparsity inducing matrix norm, also known as sparse learning in the domain of statistical learning. There are two typical configurations of :

The norm of the matrix , which counts the number of nonzero elements in the matrix. It is also known as the cardinality or the nonzero support of the matrix. Given its definition, it is easy to find that norm based penalty is a strong and intuitive appeal for sparsity structure of the estimated precision matrix. However, it is computationally infeasible to solve exact norm minimization directly, due to the fact that exact norm penalty is discontinuous and nondifferentiable. In practice, one either uses a continuous approximation to the form of the penalty, or solve it using a greedy method. Due to the nonconvexity of the exact norm penalty, only a local optimum of the feasible solution can be guaranteed. Nevertheless, norm penalty usually leads to much sparser structure in the estimation, while local optimum is good enough for most practical cases.

The matrix norm . norm penalty was firstly introduced into the standard least square estimation problem by Tibshirani [40], under the name ”Lasso regression”. Minimizing the norm based penalty encourages sparse nonzero entries in the estimated precision matrix , which achieves a selection of informative variables for regression and reduces complexity of regression model efficiently. Further extension of the basic norm penalty function allows one assigning different nonnegative weight values to different entries , as . This weighted combination can constrain the sparse penalties only on the offdiagonal entries, so as to avoid unnecessary sparsity on the diagonal elements. Furthermore, this extension allows us to introduce prior knowledge about the conditional independence structure of the graph into the joint combination problem.
For further understanding of the relation between the exact and norm penalty, we illustrate them with respect to one scalar variable in Figure 2.2. As we can see, within , norm penalty plays as a convex envelope of the exact norm penalty. Due to the convexity property of norm penalty, the global optimum of the convex programming problem can be achieved with even linear computational complexity [40, 17]. However, although norm based penalty leads to computationally sound solutions to the original issue, it also introduces modeling bias into the penalized maximum likelihood estimation. As illustrated in the figure, when the underlying true values of the matrix entries are sufficiently large, the corresponding norm based regularization performs linearly increased penalty to those entries, which thus results in a severe bias w.r.t. the maximum likelihood estimation [16]. In contrast, norm based penalty avoids such issue by constraining the penalties of nonzero elements to be constant. It has been proved in [40] that the norm penalty discovers the underlined sparse structure when some suitable assumptions are satisfied. However, in general cases, the quality of the solutions is not clear.
3 Iterative proportional scaling from the Bethe reference point
3.1 Bethe reference point and optimal link correction
As observed in the previous section, when using the Bethe approximation to find an approximate solution to the IIP, the consistency check should then be that either the factor graph be sparse, nearly a tree, either the coupling are small. There are then two distinct ways of using the Bethe approximation:

the direct way, where the form of the joint distribution (2.18) is assumed with a complete graph. There is then by construction a belief propagation fixed point for which the beliefs satisfy all the constraints. This solution to be meaningful requires small correlations, so that the belief propagation fixed point be stable and unique, allowing the corresponding log likelihood to be well approximated. Otherwise, this solution is not satisfactory, but a pruning procedure, which amounts to select a subgraph based on mutual information, can be used. The first step is to find the maximum spanning tree (MST) with mutual information taken as edges weights. How to add new links to this baseline solution in a consistent way will be the subject of the next section.

the indirect way consists in first inverting the potentially nonsparse correlation matrix. If the underlying interaction matrix is actually a tree, this will be visible in the inverse correlation matrix, indicated directly by the nonzero entries. Corresponding couplings are then determined through (2.24). This procedure seems to work better than the previous one also when no sparsity but weak coupling is assumed. It corresponds in fact to the equations solved iteratively by the susceptibility propagation algorithm [34].
Let us first justify the intuitive assertion concerning the optimal model with tree like factor graphs, valid for any type of MRF. Suppose that we are given a set of single and pairwise empirical marginals and for a set of real variables . If we start from an empty graph with no link, the joint probability distribution is simply the product form
Adding a link to the empty graph is optimally done by multiplying by . The gain in log likelihood is then simply the mutual information between and . Thus, as long as no loop get closed by the procedure, the best candidate link corresponds to the pair of variables with maximum mutual information and the measure reads after steps
This suggests that a good initialization point for the algorithm is the maximum spanning tree with edges weights given by the relevant mutual information. This corresponds to the classical results of [8] concerning inference using dependence trees. It is optimal in the class of singly connected graphical models. In the following, we will refer in the text to this specific approximate solution as the Bethe reference point.
Starting from this point, we want now to add one factor to produce the distribution
(3.1) 
with
The log likelihood corresponding to this new distribution reads
Since the the functional derivative w.r.t. is