
## Abstract

Maximum a posteriori (MAP) inference is a fundamental computational paradigm for statistical inference. In the setting of graphical models, MAP inference entails solving a combinatorial optimization problem to find the most likely configuration of the discrete-valued model. Linear programming (LP) relaxations in the Sherali-Adams hierarchy are widely used to attempt to solve this problem, and smooth message passing algorithms have been proposed to solve regularized versions of these LPs with great success. This paper leverages recent work in entropy-regularized LPs to analyze convergence rates of a class of edge-based smooth message passing algorithms to ε-optimality in the relaxation. With an appropriately chosen regularization constant, we present a theoretical guarantee on the number of iterations sufficient to recover the true integral MAP solution when the LP is tight and the solution is unique.

# Convergence Rates of Smooth Message Passing with Rounding

## 1 Introduction

Undirected graphical models are a central modeling formalism in machine learning, providing a compact and powerful way to model dependencies between variables. Here we focus on the important class of discrete-valued pairwise models. Inference in discrete-valued graphical models has applications in many areas including computer vision, statistical physics, information theory, and genome research (Antonucci et al., 2014; Wainwright and Jordan, 2008; Mezard and Montanari, 2009).

We focus on the problem of identifying a configuration of all variables that has highest probability, termed maximum a posteriori (MAP) inference. This problem has an extensive literature across multiple communities, where it is described by various names, including energy minimization (Kappes et al., 2013) and constraint satisfaction (Schiex et al., 1995). In the binary case, the MAP problem is sometimes described as quadratic pseudo-Boolean optimization (Hammer et al., 1984), and it is known to be NP-hard to compute exactly (Kolmogorov and Zabin, 2004; Cooper, 1990) or even to approximate (Dagum and Luby, 1993). Consequently, much work has attempted to identify settings where polynomial-time methods are feasible. We call such settings “tractable” and the methods “efficient.” A general framework for obtaining tractable methodology involves “relaxation”—the MAP problem is formulated as an integer linear program (ILP) and is then relaxed to a linear program (LP). If the vertex at which the LP achieves optimality is integral, then it provides an exact solution to the original problem. In this case we say that the LP is tight. If the LP is performed over the convex hull of all integral assignments, otherwise known as the marginal polytope, then it will always be tight. Inference over the marginal polytope is generally intractable because it requires exponentially many constraints to enforce global consistency.

A popular workaround is to relax the marginal polytope M to the local polytope L_2 (Wainwright and Jordan, 2008). Instead of enforcing global consistency, the local polytope enforces consistency only over pairs of variables, thus yielding pseudo-marginals which are pairwise consistent but may not correspond to any true global distribution. The number of constraints needed to specify the local polytope is linear in the number of edges. More generally, Sherali and Adams (1990) introduced a series of successively tighter relaxations of the marginal polytope, or convex hull, while retaining control on the number of constraints. However, even with these relaxations, it has been observed that standard LP solvers do not scale well (Yanover et al., 2006), motivating the study of solvers that exploit the structure of the problem, such as message passing algorithms.

Of particular interest to this paper are smooth message passing algorithms, i.e. algorithms derived from regularized versions of the relaxed LP (Meshi et al., 2012; Savchynskyy et al., 2011, 2012; Hazan and Shashua, 2008; Ravikumar et al., 2010). These regularized LPs lend themselves to efficient optimization in practice and have the special property that their fixed points are unique and optimal; however, this comes at the cost of solving an approximation of the true MAP problem and, without rounding, they do not recover integral solutions in general. Non-asymptotic convergence rates to the optimal regularized function value have been studied (Meshi et al., 2012), but guarantees on the number of iterations sufficient to recover the optimal integral assignment of the true MAP problem have not, to our knowledge, been considered.

In this work we provide a sharp analysis of the entropy-regularized MAP inference problem with Sherali-Adams relaxations. We first characterize the approximation error of the regularized LP in ℓ1 distance, based on new results on entropy-regularized LPs (Weed, 2018). We then analyze an edge-based smooth message passing algorithm, modified from the algorithms described in Werner (2007) and Ravikumar et al. (2010). We prove a rate of convergence of the iterates in ℓ1 distance. Combining the approximation error and convergence results, we present a guarantee on the number of iterations sufficient for recovery of the true integral MAP assignment using a standard vertex rounding scheme when the LP relaxation is tight and the solution is unique.

## 2 Related Work

The idea of entropy regularization to aid optimization in inference problems is well studied. It is well known that solving a scaled and entropy-regularized linear program over the marginal polytope yields the scaled Gibbs free energy, intimately related to the log partition function, when the temperature parameter equals one (Wainwright and Jordan, 2008). As the temperature parameter is driven to zero, the calculation of the free energy reduces to the value of the MAP problem. However, this problem is intractable due to the difficulty of both computing the exact entropy and characterizing the marginal polytope (Deza and Laurent, 2009). Therefore, there has been much work in trying to turn this observation into tractable inference algorithms. The standard Bethe approximation instead minimizes an approximation of the true entropy (Bethe, 1935). It was shown by Yedidia et al. (2003) that fixed points of loopy belief propagation correspond to its stationary points, but the optimization problem resulting from this approximation is non-convex and convergence is not always guaranteed.

To alleviate convergence issues, much work has considered convexifying the free energy problem, leading to classes of convergent convex belief propagation algorithms often derived directly from convex regularizers (Meshi et al., 2009; Heskes, 2006; Hazan and Shashua, 2008; Johnson and Willsky, 2008; Savchynskyy et al., 2012). For instance, Weiss et al. (2007) proposed a general convexified belief propagation and explored some sufficient conditions that enable heuristically recovering the MAP solution of the LP via a convex sum-product variant. However, the approximation error remained unclear and non-asymptotic convergence rates were not considered. A number of algorithms have also been proposed to directly optimize the unregularized LP relaxation, often with only asymptotic convergence guarantees, such as block-coordinate methods (Werner, 2007; Globerson and Jaakkola, 2008; Kovalevsky and Koval, 1975; Tourani et al., 2018; Kappes et al., 2013) and tree-reweighted message passing (Wainwright et al., 2005; Kolmogorov, 2006). The relationship between the regularized and unregularized problems can equivalently be viewed as applying a soft-max to the dual objective typically considered in the latter to recover that of the former (Nesterov, 2005; Sontag et al., 2011). Many other convergent methods exist, such as augmented Lagrangian (Martins et al., 2011; Meshi and Globerson, 2011), bundle (Kappes et al., 2012), and steepest descent (Schwing et al., 2012, 2014) approaches, but again they are difficult to compare without rates.

Most closely related to our work is recent work in convergence analysis of certain smoothed message passing algorithms that aim to solve the regularized LP objective. Savchynskyy et al. (2011) proposed an accelerated gradient method that achieves convergence to the optimal regularized dual objective value. Convergence of the primal iterates was only shown asymptotically. Meshi et al. (2012) considered a general dual coordinate minimization algorithm based on the entropy-regularized MAP objective. They proved upper bounds on the rate of convergence to the optimal regularized dual objective value; however, closeness to the true MAP assignment was not formally characterized. Furthermore, convergence in the dual objective value again does not make it easy to determine when the true MAP assignment can be recovered. Meshi et al. (2015) later studied the benefits of adding a quadratic term to the LP objective instead and proved similar guarantees. Ravikumar et al. (2010) also considered entropic and quadratic regularization, using a proximal minimization scheme with inner and outer loops. They additionally provided rounding guarantees to recover true primal solutions. However, as noted by the authors, the inexact calculation of the inner loop prevents a convergence rate analysis once combined with the outer loop. Additionally, rates on the inner loop convergence were not addressed.

The approach of this paper can be understood as bridging the gap between Meshi et al. (2012) and Ravikumar et al. (2010). Our first contribution is a characterization of the approximation error of the entropy-regularized MAP inference problem. We then study an edge-based message passing algorithm that solves the regularized LP, which is essentially a smoothed max-sum diffusion (Werner, 2007) or the inner loop of the proximal steps of Ravikumar et al. (2010). For our main contribution, we provide non-asymptotic guarantees on recovering the integral MAP assignment with this message passing algorithm when the LP is tight and the solution is unique. To our knowledge, this is the first analysis with rates guaranteeing recovery of the true MAP assignment for smooth methods.

## 3 Background

We denote the d-dimensional probability simplex by Σ_d := { p ∈ R^d_{≥0} : Σ_x p(x) = 1 }. The set of joint distributions which give rise to marginals p, q ∈ Σ_d is defined as U_d(p, q) := { P ∈ R^{d×d}_{≥0} : P𝟙 = p, P^⊤𝟙 = q }. For any two vectors or matrices A and B having the same number of elements, we use ⟨A, B⟩ to denote the dot product, i.e. elementwise multiplication then sum over all elements. We use ‖A‖₁ to denote the sum of absolute values of the elements of A. The Bregman divergence between x and y with respect to a strictly convex function Φ is D_Φ(x, y) := Φ(x) − Φ(y) − ⟨∇Φ(y), x − y⟩. We will consider the Bregman divergence with respect to the negative entropy Φ(x) := ⟨x, log x − 𝟙⟩, where x need not be a distribution. When x and y are distributions, this corresponds to the Kullback-Leibler (KL) divergence. The Bregman projection with respect to Φ of y onto the set X is defined as proj_X(y) := argmin_{x ∈ X} D_Φ(x, y). The Hellinger distance between p, q ∈ Σ_d is defined as h(p, q) := (1/√2) ‖√p − √q‖₂, where ‖·‖₂ is the ℓ2-norm. We denote the square of the Hellinger distance by h²(p, q). We will often deal with marginal vectors μ, which are ordered collections of joint and marginal distributions in the form of matrices and vectors, respectively.
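As a concrete reference for these definitions, the short sketch below (function names are our own) implements the negative-entropy Bregman divergence and the Hellinger distance, and checks the standard relation between the ℓ1 norm and the Hellinger distance that is used later in Section 6:

```python
import numpy as np

def bregman_neg_entropy(x, y):
    # Generalized KL: D_Phi(x, y) = <x, log x - log y> - <1, x - y>
    # for Phi(x) = <x, log x - 1>; equals the KL divergence when x, y are distributions.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(x * (np.log(x) - np.log(y))) - np.sum(x - y))

def hellinger(p, q):
    # h(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
```

For distributions, ‖p − q‖₁² ≤ 8 h²(p, q), which is the inequality invoked in the proof sketch of Theorem 2.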

### 3.1 Pairwise Models

For a set of vertices V and edges E ⊆ V × V, a pairwise graphical model G = (V, E) is a Markov random field that represents the joint distribution of variables X_1, …, X_n, each taking on values from a set χ of d states. We assume that each vertex has at least one edge. For pairwise models, the joint distribution can be written as a function of doubleton and singleton potentials θ_ij and θ_i. We wish to find maximum a posteriori (MAP) estimates of this model. That is, we consider the integer program:

 max_{x_V ∈ χⁿ} Σ_{i∈V} θ_i(x_i) + Σ_{ij∈E} θ_ij(x_i, x_j).    (Int)

The maximization in (Int) can be written as a linear program by defining a marginal vector μ over variable vertices and variable edges. The vector μ_i ∈ R^d represents the marginal distribution probabilities on vertex i, while the matrix μ_ij ∈ R^{d×d} represents the joint distribution probabilities shared between vertices i and j. We follow the notation of Globerson and Jaakkola (2008) and denote indexing into the vector and matrix variables with parentheses, e.g. μ_i(x_i) for x_i ∈ χ. The set of marginal vectors that are valid probability distributions is known as the marginal polytope M and is defined as

 M := { μ : ∃ P such that P_{X_i}(x_i) = μ_i(x_i) ∀ i, x_i, and P_{X_i,X_j}(x_i, x_j) = μ_ij(x_i, x_j) ∀ ij, x_i, x_j }.    (1)

We can think of M as the set of mean parameters of the model for which there exists a globally consistent distribution P. We abuse notation slightly and dually view θ as a potential “vector.” The edge matrix θ_ij is indexed as θ_ij(x_i, x_j), indicating the element at the x_i-th row and x_j-th column. The vertex vector θ_i is indexed as θ_i(x_i), indicating the x_i-th element. The MAP problem in (Int) can be shown to be equivalent to the following LP (Wainwright and Jordan, 2008):

 max ⟨θ, μ⟩  s.t.  μ ∈ M,

where ⟨θ, μ⟩ := Σ_{i∈V} ⟨θ_i, μ_i⟩ + Σ_{ij∈E} ⟨θ_ij, μ_ij⟩.

The number of constraints in M is unfortunately superpolynomial (Sontag, 2010). This motivates considering relaxations of the marginal polytope to outer polytopes that involve fewer constraints. For example, the local outer polytope L_2 is obtained by enforcing consistency only on edges and vertices:

 L_2 := { μ ≥ 0 : μ_i ∈ Σ_d ∀ i ∈ V,  μ_ij ∈ U_d(μ_i, μ_j) ∀ ij ∈ E }.    (2)
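To make the constraints in (2) concrete, the following sketch (names are our own) checks whether a pseudo-marginal for a single edge satisfies the local-polytope constraints:

```python
import numpy as np

def in_local_polytope(mu_i, mu_j, mu_ij, tol=1e-9):
    # mu_i, mu_j: vertex marginals in the simplex; mu_ij: d x d joint distribution.
    mu_i, mu_j, mu_ij = (np.asarray(a, dtype=float) for a in (mu_i, mu_j, mu_ij))
    return (
        bool(np.all(mu_ij >= -tol))
        and abs(mu_i.sum() - 1.0) < tol and abs(mu_j.sum() - 1.0) < tol
        and np.allclose(mu_ij.sum(axis=1), mu_i, atol=tol)   # rows marginalize to mu_i
        and np.allclose(mu_ij.sum(axis=0), mu_j, atol=tol)   # columns marginalize to mu_j
    )

mu_i = np.array([0.6, 0.4])
mu_j = np.array([0.5, 0.5])
mu_ij = np.array([[0.3, 0.3], [0.2, 0.2]])  # consistent with both marginals
```

Note that pairwise consistency of every edge does not imply the existence of a global distribution, which is exactly why L_2 is an outer relaxation of M.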

Relaxations of higher orders have also been studied, in particular by Sherali and Adams (1990), who introduced a hierarchy of polytopes L_m by enforcing consistency on joint distributions of increasing order up to m. The corresponding Sherali-Adams LP relaxation of order m is then

 max ⟨θ, μ⟩  s.t.  μ ∈ L_m,    (LP)

where m ≥ 2. Because L_m is an outer polytope of M, we no longer have that the solution μ* to (LP) recovers the true MAP solution of (Int) in general. However, if the solution μ* to (LP) is integral, then it recovers the optimal solution of the true MAP problem. In this case, we say (LP) is tight.

## 4 Entropy-Regularized Map

In this section, we present our first main technical contribution, characterizing the approximation error in the entropy-regularized MAP problem for Sherali-Adams relaxations. In contrast to solving the exact (LP), we aim to solve the entropy-regularized LP:

 min ⟨C, μ⟩ − (1/η) H(μ)  s.t.  μ ∈ L_m,    (Reg)

where C := −θ and H(μ) := −⟨μ, log μ⟩ is the elementwise entropy. The hyperparameter η > 0 adjusts the level of regularization. Denote by μ*_η the solution of (Reg), where we omit the reference to m to alleviate notation. In addition to their extensive history in inference problems, entropy-regularized LPs have arisen in a number of other fields to aid optimization when standard LP solvers are insufficient. For example, recent work in optimal transport has relied on entropy regularization to derive alternating projection algorithms (Cuturi, 2013; Benamou et al., 2015) which admit almost linear time convergence guarantees in the size of the cost matrix (Altschuler et al., 2017). Some of our theoretical results draw inspiration from these works.

### 4.1 Approximation Error

When (LP) is tight and the solution is unique, we show that approximate solutions from solving (Reg) are not necessarily detrimental because we can apply standard vertex rounding schemes to yield consistent integral solutions. It was shown by Cominetti and San Martín (1994), and later refined by Weed (2018), that the approximation error of general entropy-regularized linear programs converges to zero at an exponential rate in η. Furthermore, it is possible to determine how large η should be chosen in order for rounding to exactly recover the optimal solution to (Int). The result is summarized in the following extension of Theorem 1 of Weed (2018).

###### Theorem 1.

Let V(L_m) be the set of vertices of L_m, and let V*(L_m) ⊆ V(L_m) be the set of optimal vertices with respect to ⟨C, ·⟩. Let Δ be the smallest gap in objective value between an optimal vertex and any suboptimal vertex of L_m. Suppose (LP) is tight and the solution is unique. If η is chosen sufficiently large relative to log |V(L_m)| / Δ, the following rounded solution is a MAP assignment:

 (round(μ*_η))_i := argmax_{x∈χ} (μ*_η)_i(x).
###### Proof.

Applying Corollary 9 of Weed (2018) to the LP over L_m with cost C shows that the ℓ1 distance from μ*_η to the set of optimal vertices decays exponentially in η; with η chosen as above, this distance is small enough that it does not affect the per-vertex argmax. Since (LP) is assumed to be tight and the solution is unique, the set of optimal vertices contains a single integral vertex μ*, so the per-vertex argmax of μ*_η agrees with μ*, i.e. round(μ*_η) is a MAP assignment. ∎

Consequently, we have the following corollary:

###### Corollary 1.

If (LP) is tight, the solution is unique, and η is chosen as in Theorem 1, the rounded solution round(μ*_η) is a MAP assignment.

In general, the dependence of η on Δ suggested by Theorem 1 is not improvable (Weed, 2018). Nevertheless, when m = 2 and d = 2, all vertices of L_2 have entries equal to 0, 1/2, or 1—see Padberg (1989) or Theorem 3 of Weller et al. (2016)—so if the entries of θ are all integral, we have Δ ≥ 1/2, thus yielding a more concrete guarantee. The disadvantage of choosing η exorbitantly large is that efficient computation of solutions often becomes more difficult in practice (Weed, 2018; Benamou et al., 2015; Altschuler et al., 2017). Thus, in practice, there exists a trade-off between computation time and approximation error that is controlled by η. We will provide a precise theoretical characterization of the trade-off in Section 6. In our guarantees, multiplying C by a constant (and therefore multiplying Δ by that constant) is equivalent to multiplying η by the same value.
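The vertex rounding scheme of Theorem 1 is simply a per-vertex argmax over the (approximate) marginals. A minimal sketch, with our own naming conventions:

```python
import numpy as np

def round_marginals(mu_vertices):
    # mu_vertices: dict mapping vertex -> marginal distribution over d states.
    # Returns the integral assignment x_i = argmax_x mu_i(x) for each vertex,
    # as used in round(mu*_eta).
    return {i: int(np.argmax(mu_i)) for i, mu_i in mu_vertices.items()}

# Two-vertex example: the rounded assignment picks the dominant state at each vertex.
mu = {0: np.array([0.9, 0.1]), 1: np.array([0.2, 0.8])}
```

When (LP) is tight, the solution is unique, and η is large enough, Theorem 1 states that this argmax recovers the exact MAP assignment even though μ*_η itself is fractional.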

### 4.2 Equivalent Bregman Projection

The objective (Reg) can be interpreted as a Bregman projection. This interpretation has been explored by Ravikumar et al. (2010) as a basis for proximal updates and also Benamou et al. (2015) for the optimal transport problem. The objective is equivalent to

 min D_Φ(μ, exp(−ηC))  s.t.  μ ∈ L_m,    (Proj)

where exp(−ηC) denotes the elementwise exponential. The derivation, based on a mirror descent step, can be found in the appendix. The projection, however, cannot be computed in closed form in general due to the complex geometry of L_m.

Ravikumar et al. (2010) proposed using the Bregman method (Bregman, 1966), which has been applied in many fields to solve difficult constrained problems (Benamou et al., 2015; Goldstein and Osher, 2009; Osher et al., 2005, 2010), to compute the projection in the inner loop of their proximal algorithm. While the outer loop proximal algorithm can be shown to converge at least linearly, the inner loop rate was not analyzed and the constants (possibly dependent on dimension) were not made clear. Furthermore, the Bregman method is in general inexact, which makes the approximation and its effect on the outer loop unclear (Liu and Ihler, 2013).

## 5 Smooth Message Passing

We are interested in analyzing a class of algorithms closely inspired by max-sum diffusion (MSD) as presented by Werner (2007) and the proximal updates of Ravikumar et al. (2010) to solve (Proj) over the L_2 polytope. We describe it in detail here, with a few minor modifications and variations to facilitate theoretical analysis. In L_2, the constraints occur only over edges between vertices. Given an edge ij ∈ E, we must enforce the constraints prescribed by (2), which is the intersection of the following sets:

1. X_{ij→i} = { μ : μ_ij 𝟙 = μ_i }
2. X_{ij,i} = { μ : μ_i^⊤ 𝟙 = 1, 𝟙^⊤ μ_ij 𝟙 = 1 }
3. X_{ij→j} = { μ : μ_ij^⊤ 𝟙 = μ_j }
4. X_{ij,j} = { μ : μ_j^⊤ 𝟙 = 1, 𝟙^⊤ μ_ij 𝟙 = 1 }.

The normalization of the joint distribution in 2 and 4 is actually a redundant constraint, but it facilitates analysis as we demonstrate in Section 6. For each of these affine constraints, we can compute the Bregman projections in closed form with simple multiplicative updates.

###### Proposition 1.

For a given edge ij ∈ E, the closed-form solutions of the Bregman projections onto each of the above individual constraint sets are given below. Let μ' denote the projected marginal vector.

1. Left consistency: If μ' = proj_{X_{ij→i}}(μ), then for all x_i, x_j ∈ χ, μ'_ij(x_i, x_j) = μ_ij(x_i, x_j) √(μ_i(x_i) / (μ_ij𝟙)(x_i)) and μ'_i(x_i) = √(μ_i(x_i) (μ_ij𝟙)(x_i)).

2. Left normalization: If μ' = proj_{X_{ij,i}}(μ), then for all x_i, x_j ∈ χ, μ'_ij(x_i, x_j) = μ_ij(x_i, x_j) / (𝟙^⊤μ_ij𝟙) and μ'_i(x_i) = μ_i(x_i) / (μ_i^⊤𝟙).

3. Right consistency: If μ' = proj_{X_{ij→j}}(μ), then for all x_i, x_j ∈ χ, μ'_ij(x_i, x_j) = μ_ij(x_i, x_j) √(μ_j(x_j) / (μ_ij^⊤𝟙)(x_j)) and μ'_j(x_j) = √(μ_j(x_j) (μ_ij^⊤𝟙)(x_j)).

4. Right normalization: If μ' = proj_{X_{ij,j}}(μ), then for all x_i, x_j ∈ χ, μ'_ij(x_i, x_j) = μ_ij(x_i, x_j) / (𝟙^⊤μ_ij𝟙) and μ'_j(x_j) = μ_j(x_j) / (μ_j^⊤𝟙).

These update rules are similar to a number of algorithms throughout the literature on LP relaxations. Notably, they can be viewed as a smoothed version of MSD (Werner, 2007; Kovalevsky and Koval, 1975) in that the updates enforce agreement between variables on the edges and vertices. Nearly identical smoothed updates were also initially proposed by Ravikumar et al. (2010). As in MSD, it is common for message passing schemes derived from LP relaxations to operate on the dual objective instead. We presented the primal view here because the Bregman projections lend semantic meaning to the updates and, ultimately, to the stopping conditions in the algorithms. An equivalent dual view is presented in Appendix C.1.
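Each projection in Proposition 1 is a cheap multiplicative update. The sketch below implements the left-consistency and left-normalization updates for a single edge, assuming the geometric-mean form that follows from the α* derivation in Appendix B; the right-hand updates are symmetric, and the function names are our own:

```python
import numpy as np

def project_left_consistency(mu_ij, mu_i):
    # Bregman projection onto {mu : mu_ij @ 1 = mu_i}: multiply each row of mu_ij
    # by e^{-alpha*} and mu_i by e^{alpha*}, where alpha*(x_i) = 0.5 * log(row_sum / mu_i).
    row = mu_ij.sum(axis=1)
    scale = np.sqrt(mu_i / row)                    # e^{-alpha*}
    # Both updated quantities equal sqrt(mu_i * row), so the constraint holds exactly.
    return mu_ij * scale[:, None], mu_i / scale

def project_left_normalization(mu_ij, mu_i):
    # Bregman projection onto {mu : sum(mu_i) = 1, sum(mu_ij) = 1}: plain normalization.
    return mu_ij / mu_ij.sum(), mu_i / mu_i.sum()

mu_ij = np.array([[0.4, 0.1], [0.1, 0.4]])
mu_i = np.array([0.3, 0.7])
new_ij, new_i = project_left_consistency(mu_ij, mu_i)
```

After the consistency step, the rows of the joint sum exactly to the updated vertex marginal, matching constraint set 1.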

Based on these update rules, we formally outline the algorithms we wish to analyze, which we call edge-based message passing (EMP) for convenience. We consider two variants: EMP-cyclic (Algorithm 1), which cyclically applies the updates to each edge in each iteration, and EMP-greedy (Algorithm 2), which applies a single projection update to only the edge with the greatest constraint violation in each iteration. We emphasize that these algorithms are not fundamentally new; our analysis in the next section is our main contribution. EMP-cyclic is the Bregman method, almost exactly the inner loop proposed by Ravikumar et al. (2010). In both variants, μ is initialized to the normalized value of exp(−ηC). The GreedyEdge operation in EMP-greedy selects the edge with the greatest ℓ1 constraint violation between the joint distribution and its marginals.

These procedures are repeated until the stopping criterion is met, namely that μ is ε-close to satisfying the constraint that the joint distributions sum to the marginals for all edges. Both algorithms also conclude with a rounding operation. Any fixed point of EMP must correspond to the optimal μ*_η (see details in the appendix). Computationally, EMP-greedy requires a search over the edges to identify the greatest constraint violation, which can be efficiently implemented using a max-heap (Nutini et al., 2015).
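For intuition, here is a self-contained sketch of an EMP-cyclic-style loop on a single-edge model. This is a simplification of Algorithm 1 under our own naming: it tracks only primal variables, uses a fixed iteration cap, and checks the ℓ1 stopping criterion after each sweep:

```python
import numpy as np

def emp_cyclic_single_edge(C_ij, C_i, C_j, eta=10.0, eps=1e-6, max_iters=10000):
    # Initialize at exp(-eta * C), as in the Bregman-projection view (Proj).
    mu_ij, mu_i, mu_j = np.exp(-eta * C_ij), np.exp(-eta * C_i), np.exp(-eta * C_j)
    for _ in range(max_iters):
        # Left consistency: geometric-mean scaling so rows of mu_ij match mu_i.
        s = np.sqrt(mu_i / mu_ij.sum(axis=1))
        mu_ij, mu_i = mu_ij * s[:, None], mu_i / s
        mu_ij, mu_i = mu_ij / mu_ij.sum(), mu_i / mu_i.sum()   # left normalization
        # Right consistency: same scaling applied to columns and mu_j.
        s = np.sqrt(mu_j / mu_ij.sum(axis=0))
        mu_ij, mu_j = mu_ij * s[None, :], mu_j / s
        mu_ij, mu_j = mu_ij / mu_ij.sum(), mu_j / mu_j.sum()   # right normalization
        # Stopping criterion: total l1 violation of the marginalization constraints.
        viol = (np.abs(mu_ij.sum(axis=1) - mu_i).sum()
                + np.abs(mu_ij.sum(axis=0) - mu_j).sum())
        if viol <= eps:
            break
    return mu_ij, mu_i, mu_j
```

Rounding the returned vertex marginals with a per-vertex argmax then yields an integral assignment, which is exact when the relaxation is tight, the solution is unique, and η is large enough.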

## 6 Theoretical Analysis

We now present our main contribution, a theoretical analysis of EMP-cyclic and EMP-greedy. This result combines two aspects. First, we present a convergence guarantee on the number of iterations sufficient to solve (Proj), satisfying the constraints with ε error in ℓ1 distance. We note that, in finitely many iterations, the pseudo-marginals of EMP are not primal feasible in general due to this ε-error. We then combine this result with our guarantee on the approximation error in Theorem 1 to show a bound on the number of iterations sufficient to recover the true integral MAP assignment by rounding, assuming the LP is tight and the solution is unique. This holds with sufficient iterations and a sufficiently large regularization constant, even though the pseudo-marginals may not be primal feasible. We emphasize that these theorems are a departure from usual convergence rates in the literature (Meshi et al., 2012, 2015). Prior work has guaranteed convergence in objective value to the optimum of the regularized objective (Proj), making it unclear whether the optimal MAP assignment can be recovered, e.g. by rounding. We address this ambiguity in our results.

We begin with the upper bound on the number of iterations needed to obtain ε-close solutions, which follows from two facts that we show. The first is that the updates in Proposition 1 monotonically improve a Lyapunov (potential) function by an amount proportional to the constraint violation, as measured via the Hellinger distance. The second is that the difference between the initial and optimal values of the Lyapunov function is bounded.

Let d_max denote the maximum degree of the graph G and define:

 S := Σ_{ij∈E} [ log Σ_{x_i,x_j∈χ} e^{−ηC_ij(x_i,x_j)} + Σ_{x_i,x_j∈χ} (η/d²) C_ij(x_i,x_j) ] + Σ_{i∈V} [ log Σ_{x∈χ} e^{−ηC_i(x)} + Σ_{x∈χ} (η/d) C_i(x) ].
###### Theorem 2.

For any ε > 0, EMP is guaranteed to satisfy ‖μ_ij𝟙 − μ_i‖₁ ≤ ε and ‖μ_ij^⊤𝟙 − μ_j‖₁ ≤ ε for all ij ∈ E within O(S/ε²) iterations of EMP-cyclic and of EMP-greedy, where the constants differ between the two variants.

In this theorem, we give our guarantee in terms of ℓ1 distance rather than function-value convergence. As we will see, this is significant, allowing us to relate this result to Theorem 1 in order to derive the main result. The proof is similar in style to Altschuler et al. (2017). We leave the full proof for EMP-cyclic to the appendix due to a need to handle tedious edge cases, but we state several intermediate results and sketch the proof for EMP-greedy for intuition, as it reveals how similar message passing algorithms might be analyzed. We first introduce a Lyapunov function written in terms of dual variables λ and ξ, indexed by the edges and vertices to which they belong. We denote the iteration-indexed dual variables as λ^(t), ξ^(t). For a given edge ij, the constraints enforcing row and column consistency correspond to λ_{ij→i} and λ_{ij→j}, respectively. Normalization constraints correspond to ξ_{ij,i} and ξ_{ij,j}. The Lyapunov function, L(λ, ξ), is shown in Figure 3.

We note that maximizing L over (λ, ξ) satisfies all constraints and yields the solution to (Proj) by first-order optimality conditions. We now present a result that establishes the monotone improvement in L due to the updates in Proposition 1.

###### Lemma 1.

For a given edge ij ∈ E, let μ' and (λ', ξ') denote the updated primal and dual variables after a projection onto one of the sets 1–4 in Proposition 1. We have the following improvements on L. If the projection is onto:

1. X_{ij→i}, then L(λ', ξ') − L(λ, ξ) = 2h²(μ_ij𝟙, μ_i)

2. X_{ij,i}, then L(λ', ξ') − L(λ, ξ) ≥ 0

3. X_{ij→j}, then L(λ', ξ') − L(λ, ξ) = 2h²(μ_ij^⊤𝟙, μ_j)

4. X_{ij,j}, then L(λ', ξ') − L(λ, ξ) ≥ 0.

This result shows that L improves monotonically after each of the four updates in Proposition 1. Furthermore, at every consistency update, L improves by twice the squared Hellinger distance of the constraint violation between the joint and the marginals.

###### Lemma 2.

Let λ*, ξ* denote the maximizers of L. The difference in function value between the optimal value of L and the first-iteration value is upper bounded as

 L(λ*, ξ*) − L(λ^(1), ξ^(1)) ≤ S.

Turning to Theorem 2, the result is obtained by observing that, as long as the constraints are violated by an amount greater than ε (i.e., the algorithm has not terminated), the Lyapunov function must improve by a known positive amount at each iteration. We provide a proof sketch for EMP-greedy.

###### Proof Sketch of Theorem 2 for EMP-greedy.

We now show how to combine the results of Lemma 1 and Lemma 2 to obtain Theorem 2. Let T be the first iteration such that the termination condition in Algorithm 2 holds with respect to some ε > 0. Then, for any iteration t < T, GreedyEdge selects an edge ij such that either ‖μ_ij𝟙 − μ_i‖₁ > ε or ‖μ_ij^⊤𝟙 − μ_j‖₁ > ε.

Without loss of generality, suppose ‖μ_ij𝟙 − μ_i‖₁ > ε. Therefore, we have

 ε²/4 ≤ (1/4) ‖μ_ij𝟙 − μ_i‖₁² ≤ 2 h²(μ_ij𝟙, μ_i),

where again h² denotes the squared Hellinger distance and the last inequality follows from the standard relation between the ℓ1 and Hellinger distances. Since μ_ij and μ_i are normalized at each iteration, this inequality is valid. Thus, L improves by at least ε²/4 when a consistency update occurs and by a non-negative amount when a normalization update occurs, by Lemma 1. Therefore, we can guarantee improvement of at least ε²/4 in each iteration that applies a consistency update. Since the optimality gap is at most S by Lemma 2, the algorithm must terminate in O(S/ε²) iterations. ∎

We now turn to our main theoretical result. We combine our approximation and iteration-complexity guarantees to fully characterize the convergence of EMP to the optimal MAP assignment when the relaxation is tight and the solution is unique.

###### Theorem 3.

Suppose (LP) is tight and the solution is unique. With η chosen as in Theorem 1 and ε sufficiently small, the EMP algorithm returns a MAP assignment after O(S/ε²) iterations of EMP-cyclic or of EMP-greedy.

When θ is integral, Δ ≥ 1/2, yielding a bound stated entirely in terms of known parameters. The main technical challenge in producing this result is to relate the termination condition of EMP to the ℓ1 distance between μ^(T) and μ* (the MAP assignment), as μ^(T) may lie outside the polytope L_m. It does not suffice to provide convergence guarantees in function value, as the goal of MAP inference is to produce integral assignments. The proof proceeds in two steps. First, we show that μ^(T) is the entropy-regularized solution to the objective over a “slack” polytope, where the slack vector corresponds to the constraint violations of μ^(T). We use this characterization to “project” μ^(T) onto a nearby feasible point. Second, we use the properties of the primal objective to bound the relevant ℓ1 distances. The full proof is in the appendix.

## 7 Numerical Experiments

We illustrate our theoretical results in a practical application of the EMP algorithms. Ravikumar et al. (2010) already gave empirical evidence that the basic EMP-cyclic is competitive with standard solvers. Therefore, the objective of these experiments is to understand how graph and algorithm properties affect approximation (Theorem 1) and convergence (Theorem 2). We consider the family of multi-label Potts models (Wainwright et al., 2005) with d labels per vertex. For each trial, the vertex costs C_i are drawn at random, and

 C_ij(x_i, x_j) = β_ij if x_i = x_j, and 0 otherwise,  ∀ ij ∈ E, x_i, x_j ∈ χ,

where the coupling parameters β_ij are drawn at random. The graphs considered are structured as grids (Globerson and Jaakkola, 2008; Ravikumar et al., 2010; Erdogdu et al., 2017) and as Erdős-Rényi random graphs with fixed edge probability. To evaluate recovery of the optimal MAP assignment, we first solved each graph with the ECOS LP solver (Domahidi et al., 2013) and selected graphs that were tight. Solving the LP to find the ground truth was the main computational bottleneck. Further details can be found in Appendix E.
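The Potts construction above can be instantiated as follows; the grid indexing, parameter ranges, and random seed are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def potts_costs(n_side, d=2, beta_scale=1.0):
    # Build an n_side x n_side grid graph with Potts edge costs:
    # C_ij(x, x') = beta_ij if x == x', else 0 (a scaled identity matrix).
    V = [(r, c) for r in range(n_side) for c in range(n_side)]
    E = [((r, c), (r, c + 1)) for r in range(n_side) for c in range(n_side - 1)]
    E += [((r, c), (r + 1, c)) for r in range(n_side - 1) for c in range(n_side)]
    C_vertex = {v: rng.uniform(-1.0, 1.0, size=d) for v in V}
    C_edge = {e: rng.uniform(-beta_scale, beta_scale) * np.eye(d) for e in E}
    return V, E, C_vertex, C_edge

V, E, Cv, Ce = potts_costs(3)
```

Negative β_ij encourages neighboring vertices to agree and positive β_ij penalizes agreement, which is what makes some instances tight and others not.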

#### Approximation

In Figure 4, we evaluate the effect of regularization and graph size on the quality of the nearly converged solution from EMP over iterations on grids. The box plots indicate that large choices of η often yield the exact MAP solution (cyan and purple). Moderate choices still yield competitive solutions, but not optimal ones for larger graphs (orange and green). Low choices of η generally give poor solutions with high spread for all graph sizes (red and blue).

#### Convergence

We then investigate the effects of regularization on convergence for both variants. Figure 5 illustrates the ℓ1 distance of the rounded solution to the optimal MAP solution over projection steps on grids. EMP-greedy converges sharply, and varying regularization has less of an effect on its convergence rate. Finally, in Figure 6, we look at Erdős-Rényi random graphs to observe the effect of the graph structure for both variants. We considered degree-limited random graphs of varying maximum degree. The figure shows convergence over projection steps for fixed-size graphs. For both variants, the convergence rate deteriorates for higher degrees.

## 8 Conclusion

In this paper, we investigated the approximation effects of entropy regularization on MAP inference objectives. We combined these approximation guarantees with a convergence analysis of an edge-based message passing algorithm that solves the regularized objective to derive guarantees on the number of iterations sufficient to recover the true MAP assignment. We also showed empirically the effect of regularization and graph properties on both the approximation and convergence. In future work, we wish to extend the analyses and proof techniques to higher-order polytopes and general block-coordinate minimization algorithms.

#### Acknowledgements

We thank the anonymous reviewers and Marco Pavone for their invaluable feedback.

## Appendix A Bregman Projection Derivation

The objective (Reg) can be equivalently interpreted as a Bregman projection. This interpretation has been explored by Ravikumar et al. (2010) as a basis for proximal updates and also Benamou et al. (2015) for the optimal transport problem. Here, we review the transformation because it is central to the algorithm of Ravikumar et al. (2010), upon which our main theoretical results are based.

By definition of the Bregman divergence with respect to the negative entropy Φ, we have

 D_Φ(μ, 𝟙) = ⟨μ, log μ − 𝟙⟩ − ⟨log 𝟙, μ − 𝟙⟩ ≐ −H(μ),

where 𝟙 is a vector of ones of the same size as the marginal vector and ≐ denotes that the two sides are equal up to an additive constant. Substituting this into (Reg) and multiplying through by η yields the objective:

 min η⟨C, μ⟩ + D_Φ(μ, 𝟙)  s.t.  μ ∈ L_m.

Note the similarity to a projected mirror descent update over L_m starting from 𝟙 (Nemirovsky and Yudin, 1983; Bubeck, 2015). Using this insight and performing a single gradient update in the dual, we can transform the problem into a single Bregman projection of the vector exp(−ηC). The unprojected marginal vector μ' satisfies

 ∇Φ(μ') = ∇Φ(𝟙) − ηC,

where ∇Φ = log(·) is the dual map and ∇Φ^{−1} = exp(·) is the inverse dual map. We have ∇Φ(𝟙) = 0, so the solution to the mirror descent update is μ' = exp(−ηC). Therefore it is sufficient to solve the following Bregman projection problem:

 min D_Φ(μ, exp(−ηC))  s.t.  μ ∈ L_m.

The projection, however, cannot be computed in closed form due to the complex geometry of L_m. Sinkhorn-like algorithms such as those used in Cuturi (2013) are unavailable because the transportation polytopes U_d(μ_i, μ_j) depend on the variables μ_i and μ_j, which are themselves involved in the projection operation.
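As a sanity check on the (Reg)-to-(Proj) equivalence, note that over a single probability simplex (no coupling constraints) both formulations reduce to normalizing exp(−ηC), i.e. a softmax. A sketch with illustrative values:

```python
import numpy as np

eta, C = 2.0, np.array([0.3, 0.1, 0.7])

# Entropy-regularized LP over the simplex: argmin <C, mu> - H(mu)/eta.
# The minimizer is the softmax of -eta * C.
reg_solution = np.exp(-eta * C) / np.exp(-eta * C).sum()

# Bregman (KL) projection of the unprojected point exp(-eta * C) onto the
# simplex is plain normalization, so the two solutions coincide.
unprojected = np.exp(-eta * C)
proj_solution = unprojected / unprojected.sum()
```

The difficulty described above arises only once the simplex is replaced by L_m, where the marginals appearing in the edge constraints are themselves optimization variables.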

## Appendix B Derivation of EMP Update Rules

We present the derivations of the update rules, similar to Ravikumar et al. (2010), for a given edge ij ∈ E based on the Bregman projections onto the individual constraint sets X_{ij→i}, X_{ij,i}, X_{ij→j}, X_{ij,j}. We refer the reader to Ravikumar et al. (2010) for the original algorithm and derivation. We derive only the first two projections; the last two can be obtained by exchanging the indices.

• For the projection onto X_{ij→i}, where

 X_{ij→i} = { μ : μ_ij 𝟙 = μ_i },

there are no constraints on any edges or vertices other than ij and i. Therefore, μ'_kl = μ_kl for all kl ∈ E with kl ≠ ij, and similarly μ'_k = μ_k for all k ∈ V with k ≠ i.

The Lagrangian of the projection is given in terms of the primal variables μ' and dual variables α:

 L(μ', α) = Σ_{x_i,x_j} μ'_ij(x_i, x_j) ( log( μ'_ij(x_i, x_j) / μ_ij(x_i, x_j) ) − 1 + α(x_i) ) + Σ_{x_i} μ'_i(x_i) ( log( μ'_i(x_i) / μ_i(x_i) ) − 1 − α(x_i) ).

By the first-order optimality condition, the primal solution in terms of the dual variables is

 μ'_ij(x_i, x_j) = μ_ij(x_i, x_j) e^{−α(x_i)},   μ'_i(x_i) = μ_i(x_i) e^{α(x_i)}.

Substituting this solution back into the Lagrangian, we have

 L(α) = −Σ_{x_i,x_j} μ_ij(x_i, x_j) e^{−α(x_i)} − Σ_{x_i} μ_i(x_i) e^{α(x_i)}.

Again, by the first-order optimality condition, the dual solution is

 α*(x_i) = (1/2) log( Σ_{x_j} μ_ij(x_i, x_j) / μ_i(x_i) ).

Substituting this value of α* into the primal solution yields the desired result.
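The update can be verified numerically: after applying α*, the row sums of μ'_ij and the entries of μ'_i both equal √(μ_i(x_i) · Σ_{x_j} μ_ij(x_i, x_j)), so the consistency constraint holds exactly. A sketch with illustrative values:

```python
import numpy as np

mu_ij = np.array([[0.25, 0.05], [0.10, 0.60]])
mu_i = np.array([0.4, 0.6])

# alpha*(x_i) = 0.5 * log( sum_xj mu_ij(x_i, x_j) / mu_i(x_i) )
alpha = 0.5 * np.log(mu_ij.sum(axis=1) / mu_i)

# Multiplicative updates from the first-order optimality conditions.
mu_ij_new = mu_ij * np.exp(-alpha)[:, None]
mu_i_new = mu_i * np.exp(alpha)
```

Both updated quantities land on the geometric mean of the old marginal and the old row sums, which is the smoothed-MSD "averaging" behavior referenced in Section 5.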

• Again, for the projection onto

 X_{ij,i} = { μ : μ_i^⊤𝟙 = 1, 𝟙^⊤μ_ij𝟙 = 1 },

only μ_ij and μ_i are affected. X_{ij,i} enforces that the variables μ_i and μ_ij each sum to one. It is well known and easy to show that the Bregman projection with respect to the negative entropy is simply μ_i and μ_ij normalized by their sums. This normalization can also be written as a multiplicative update of the same form by observing that

 μ'_ij(x_i, x_j) = μ_ij(x_i, x_j) e^{−ξ*_ij},   μ'_i(x_i) = μ_i(x_i) e^{−ξ*_i},

where ξ*_ij = log(𝟙^⊤μ_ij𝟙) and ξ*_i = log(μ_i^⊤𝟙). Again, these can be derived via the Lagrangian.

## Appendix C Extensions of EMP

### c.1 Dual EMP

We may also equivalently interpret the multiplicative updates in Algorithm 1 and Algorithm 2 as additive updates of the dual variables. The dual interpretation is consistent with past work on dual MAP algorithms (Sontag et al., 2011) and may be more practical for avoiding numerical issues in implementation. Instead of tracking the primal variables μ, we track a running sum ζ of the applied dual updates for each vertex and edge. Enforcing consistency between a given joint distribution and its marginals as in constraint set 1 yields the updated dual variable sums

 ζ'_ij(x_i, x_j) ← ζ_ij(x_i, x_j) − α*(x_i),   ζ'_i(x_i) ← ζ_i(x_i) + α*(x_i),

where again