Alternating Directions Dual Decomposition
Abstract
We propose AD³, a new algorithm for approximate maximum a posteriori (MAP) inference on factor graphs, based on the alternating directions method of multipliers. Like other dual decomposition algorithms, AD³ uses worker nodes to iteratively solve local subproblems and a controller node to combine these local solutions into a global update. The key characteristic of AD³ is that each local subproblem has a quadratic regularizer, leading to faster convergence to a consensus than subgradient-based dual decomposition, both theoretically and in practice. We provide closed-form solutions for these AD³ subproblems for binary pairwise factors and factors imposing first-order logic constraints. For arbitrary factors (large or combinatorial), we introduce an active set method which requires only an oracle for computing a local MAP configuration, making AD³ applicable to a wide range of problems. Experiments on synthetic and real-world problems show that AD³ compares favorably with the state of the art.
1 Introduction
Graphical models enable compact representations of probability distributions, and are widely used in computer vision, natural language processing (NLP), and computational biology (Pearl, 1988; Lauritzen, 1996; Koller and Friedman, 2009). When using these models, a central problem is that of inferring the most probable (a.k.a. maximum a posteriori, or MAP) configuration. Unfortunately, exact MAP inference is intractable for many graphical models of interest in applications, such as those involving non-local features and/or structural constraints. This fact has motivated a significant research effort on approximate MAP inference.
A class of methods that has proved effective for approximate inference is based on linear programming relaxations of the MAP problem (LP-MAP) (Schlesinger, 1976). Several message-passing algorithms have been proposed to address the resulting LP problems, taking advantage of the underlying graph structure (Wainwright et al., 2005; Kolmogorov, 2006; Werner, 2007; Globerson and Jaakkola, 2008; Ravikumar et al., 2010). In the same line, Komodakis et al. (2007) proposed a method based on classical dual decomposition (Dantzig and Wolfe, 1960; Everett III, 1963; Shor, 1985), which breaks the original problem into a set of smaller subproblems; this set of subproblems, together with the constraints that they should agree on the shared variables, yields a constrained optimization problem, the Lagrange dual of which is solved with a projected subgradient method. Initially proposed for computer vision (Komodakis et al., 2007; Komodakis and Paragios, 2009), this technique has also been shown effective in several NLP tasks that involve combinatorial constraints, such as parsing and machine translation (Koo et al., 2010; Rush et al., 2010; Auli and Lopez, 2011; Rush and Collins, 2011; Chang and Collins, 2011; DeNero and Macherey, 2011). The major drawback is that the subgradient algorithm is very slow to achieve consensus in problems with a large number of overlapping components, requiring $O(1/\epsilon^2)$ iterations for an $\epsilon$-accurate solution. Jojic et al. (2010) addressed this drawback by introducing an accelerated method that enjoys a faster $O(1/\epsilon)$ convergence rate; however, that method is sensitive to previous knowledge of $\epsilon$ and to the prescription of a temperature parameter, which may yield numerical instabilities.
In this paper, we ally the simplicity of dual decomposition (DD) with the effectiveness of augmented Lagrangian methods, which have a long-standing history in optimization (Hestenes, 1969; Powell, 1969). In particular, we employ the alternating directions method of multipliers (ADMM), yielding an algorithm that we call AD³ (alternating directions dual decomposition).
Our main contributions are:

We derive AD³ and establish its convergence properties, blending classical and newer results about ADMM (Glowinski and Le Tallec, 1989; Eckstein and Bertsekas, 1992; Wang and Banerjee, 2012). We show that the algorithm has the same form as the DD method of Komodakis et al. (2007), with the local MAP subproblems replaced by quadratic programs.

We show that these AD³ subproblems can be solved exactly and efficiently in many cases of interest, including Ising models and a wide range of hard factors representing arbitrary constraints in first-order logic. Up to a logarithmic term, the asymptotic cost is the same as that of passing messages or doing local MAP inference.

For factors lacking a closed-form solution of the AD³ subproblems, we introduce a new active set method. Remarkably, our method requires only a black box that returns local MAP configurations for each factor (the same requirement as the projected subgradient algorithm). This paves the way for using AD³ with large dense or structured factors, based on off-the-shelf combinatorial algorithms (e.g., Viterbi, Chu-Liu-Edmonds, Ford-Fulkerson).

We show that AD³ can be wrapped into a branch-and-bound procedure to retrieve the exact MAP configuration.
AD³ was originally introduced by Martins et al. (2010a, 2011a) (then called DD-ADMM). In addition to a considerably more detailed presentation, this paper contains contributions that substantially extend that preliminary work in several directions: the convergence rate analysis, the active set method for general factors, and the branch-and-bound procedure for exact MAP inference. It also reports a wider set of experiments and the release of open-source code (available at http://www.ark.cs.cmu.edu/AD3), which may be useful to other researchers in the field.
This paper is organized as follows. We start by providing background material: MAP inference in graphical models and its LP-MAP relaxation (Section 2); the projected subgradient algorithm for the DD of Komodakis et al. (2007) (Section 3). In Section 4, we derive AD³ and analyze its convergence. The AD³ local subproblems are addressed in Section 5, where closed-form solutions are derived for Ising models and several structural constraint factors. In Section 6, we introduce an active set method to solve the AD³ subproblems for arbitrary factors. Experiments with synthetic models, as well as in protein design and dependency parsing (Section 7), attest to the success of our approach. Finally, related work is discussed in Section 8, and Section 9 concludes the paper.
2 Background
2.1 Factor Graphs
Let $X := (X_1, \ldots, X_M)$ be a tuple of random variables, where each $X_i$ takes values in a finite set $\mathcal{X}_i$, and $X$ takes values in the corresponding Cartesian product set $\mathcal{X} := \mathcal{X}_1 \times \cdots \times \mathcal{X}_M$. Graphical models (e.g., Bayesian networks, Markov random fields) are structural representations of joint probability distributions, which conveniently model statistical (in)dependencies among the variables. In this paper, we focus on factor graphs, introduced by Tanner (1981) and Kschischang et al. (2001).
Definition 1 (Factor graph)
A factor graph is a bipartite graph $G := (V, F, E)$, comprised of:

a set of variable nodes $V := \{1, \ldots, M\}$, corresponding to the variables $X_1, \ldots, X_M$;

a set of factor nodes $F$ (disjoint from $V$);

a set of edges $E \subseteq V \times F$ linking variable nodes to factor nodes.
For notational convenience, we use Latin letters ($i, j, \ldots$) and Greek letters ($\alpha, \beta, \ldots$) to refer to variable and factor nodes, respectively. We denote by $N(\cdot)$ the neighborhood set of its node argument, whose cardinality is called the degree of the node. Formally, $N(i) := \{\alpha \in F : (i,\alpha) \in E\}$ for variable nodes, and $N(\alpha) := \{i \in V : (i,\alpha) \in E\}$ for factor nodes. We use the short notation $X_\alpha$ to denote the tuple of variables linked to factor $\alpha$, i.e., $X_\alpha := (X_i)_{i \in N(\alpha)}$, and lowercase letters to denote instantiations of random variables, e.g., $x \in \mathcal{X}$ and $x_\alpha \in \mathcal{X}_\alpha := \prod_{i \in N(\alpha)} \mathcal{X}_i$.
The joint probability distribution of $X$ is said to factor according to the factor graph $G$ if it can be written as

(1) $P(X = x) \propto \exp\Bigl( \sum_{i \in V} \theta_i(x_i) + \sum_{\alpha \in F} \theta_\alpha(x_\alpha) \Bigr),$

where $\theta_i(\cdot)$ and $\theta_\alpha(\cdot)$ are called, respectively, the unary and higher-order log-potential functions. To accommodate hard constraints, we allow these functions to take values in $\mathbb{R} \cup \{-\infty\}$. Fig. 1 shows examples of factor graphs with hard constraint factors (to be studied in detail in Section 5.3). For notational convenience, we represent the potential functions as vectors, $\theta_i := (\theta_i(x_i))_{x_i \in \mathcal{X}_i}$ and $\theta_\alpha := (\theta_\alpha(x_\alpha))_{x_\alpha \in \mathcal{X}_\alpha}$, where $\theta_i \in \mathbb{R}^{|\mathcal{X}_i|}$ and $\theta_\alpha \in \mathbb{R}^{|\mathcal{X}_\alpha|}$. We denote by $\theta$ the vector of dimension $\sum_{i \in V} |\mathcal{X}_i| + \sum_{\alpha \in F} |\mathcal{X}_\alpha|$ obtained by stacking all these vectors.
2.2 MAP Inference
Given a factor graph $G$, along with all the log-potential functions—which fully specify the joint distribution of $X$—we are interested in finding an assignment with maximal probability (the so-called MAP assignment/configuration):

(2) $\widehat{x} \in \arg\max_{x \in \mathcal{X}} \; \sum_{i \in V} \theta_i(x_i) + \sum_{\alpha \in F} \theta_\alpha(x_\alpha).$
This combinatorial problem, which is NP-hard in general (Koller and Friedman, 2009), can be transformed into a linear program by using the concept of marginal polytope (Wainwright and Jordan, 2008), as described next. Let $\Delta^K$ denote the $(K-1)$-dimensional probability simplex,

(3) $\Delta^K := \bigl\{ u \in \mathbb{R}^K : u \ge 0, \; \mathbf{1}^\top u = 1 \bigr\},$
We introduce "local" probability distributions over the variables and factors, $p_i \in \Delta^{|\mathcal{X}_i|}$ and $q_\alpha \in \Delta^{|\mathcal{X}_\alpha|}$. We stack these distributions into vectors $p$ and $q$, with dimensions $P := \sum_{i \in V} |\mathcal{X}_i|$ and $Q := \sum_{\alpha \in F} |\mathcal{X}_\alpha|$, respectively, and denote by $\mu := (p, q)$ their concatenation. We say that $\mu$ is globally consistent if the local distributions $p_i$ and $q_\alpha$ are the marginals of some global joint distribution. The set of all globally consistent vectors $\mu$ is called the marginal polytope of $G$, denoted $\mathcal{M}(G)$. There is a one-to-one mapping between the vertices of $\mathcal{M}(G)$ and the set of possible configurations $\mathcal{X}$: each $x \in \mathcal{X}$ corresponds to a vertex of $\mathcal{M}(G)$ consisting of "degenerate" distributions $p_i := e_{x_i}$ and $q_\alpha := e_{x_\alpha}$, for each $i \in V$ and $\alpha \in F$, where $e_k$ has all entries equal to zero except for the $k$th one, which is equal to one. The MAP problem in (2) is then equivalent to the following linear program (LP):
(4) maximize $\theta^\top \mu$ with respect to $\mu \in \mathcal{M}(G)$.
The equivalence between (2) and (4) stems from the fact that any linear program on a bounded convex constraint polytope attains a solution at a vertex of that polytope (see, e.g., Rockafellar, 1970). Unfortunately, the linear program (4) is not easier to solve than (2): for a graph with cycles (which induce global consistency constraints that are hard to specify concisely), the number of linear inequalities that characterize $\mathcal{M}(G)$ may grow superpolynomially with the size of $G$. As a consequence, approximations of $\mathcal{M}(G)$ have been actively investigated in recent years.
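To make the MAP problem (2) concrete, the following sketch brute-forces the MAP configuration of a tiny factor graph with two variables and one pairwise factor. The log-potential values are made-up toy numbers, not from the paper; exhaustive search like this is exponential in the number of variables in general, which is precisely what motivates the LP relaxations.

```python
import itertools

def brute_force_map(theta_1, theta_2, theta_12):
    """Exhaustive MAP for two variables sharing one pairwise factor.

    theta_1, theta_2: unary log-potentials (lists indexed by state);
    theta_12: pairwise log-potentials (nested list). Hypothetical toy
    instance, for illustration only.
    """
    best_score, best_x = float("-inf"), None
    for x1, x2 in itertools.product(range(len(theta_1)), range(len(theta_2))):
        score = theta_1[x1] + theta_2[x2] + theta_12[x1][x2]
        if score > best_score:
            best_score, best_x = score, (x1, x2)
    return best_x, best_score
```

For instance, with unary scores favoring state 1 of the first variable and a pairwise score rewarding joint state (1, 1), the MAP assignment is (1, 1).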
2.3 LP-MAP Inference
Tractable approximations of $\mathcal{M}(G)$ can be built by using weaker constraints that all realizable marginals must satisfy to ensure local consistency:

Nonnegativity: all local marginals $p_i$ and $q_\alpha$ must be nonnegative.

Normalization: local marginals must be properly normalized, i.e., $\mathbf{1}^\top p_i = 1$ for all $i \in V$, and $\mathbf{1}^\top q_\alpha = 1$ for all $\alpha \in F$.

Marginalization: a variable participating in a factor must have a marginal that is consistent with the factor marginals, i.e., $\sum_{x_\alpha \sim x_i} q_\alpha(x_\alpha) = p_i(x_i)$ for all $(i,\alpha) \in E$ and $x_i \in \mathcal{X}_i$, where $x_\alpha \sim x_i$ denotes the factor configurations $x_\alpha$ that are consistent with $x_i$.
Note that some of these constraints are redundant: the nonnegativity of $q_\alpha$ and the marginalization constraint imply the nonnegativity of $p_i$; the normalization of $q_\alpha$ and the marginalization constraints imply $\mathbf{1}^\top p_i = 1$. For convenience, we express these constraints in vector notation. For each $(i,\alpha) \in E$, we define a (consistency) matrix $M_{i\alpha} \in \{0,1\}^{|\mathcal{X}_i| \times |\mathcal{X}_\alpha|}$ as follows:

(5) $[M_{i\alpha}]_{x_i, x_\alpha} := \begin{cases} 1, & \text{if } x_\alpha \sim x_i \\ 0, & \text{otherwise.} \end{cases}$
The local marginalization constraints can then be expressed as $M_{i\alpha} q_\alpha = p_i$, for all $(i,\alpha) \in E$. Combining all the local consistency constraints, and dropping the redundant ones, yields the so-called local polytope:

(6) $\mathcal{L}(G) := \bigl\{ (p, q) : q_\alpha \in \Delta^{|\mathcal{X}_\alpha|} \;\forall \alpha \in F, \;\; M_{i\alpha} q_\alpha = p_i \;\forall (i,\alpha) \in E \bigr\}.$
The number of constraints that define $\mathcal{L}(G)$ grows only linearly with the size of $G$, rather than superpolynomially. The elements of $\mathcal{L}(G)$ are called pseudo-marginals. Since any true marginal vector must satisfy the constraints above, $\mathcal{L}(G)$ is an outer approximation, i.e.,

(7) $\mathcal{M}(G) \subseteq \mathcal{L}(G),$
as illustrated in Fig. 2.
This approximation is tight for tree-structured graphs and other special cases, but not in general (even for small graphs with cycles) (Wainwright and Jordan, 2008). The LP-MAP problem results from replacing $\mathcal{M}(G)$ by $\mathcal{L}(G)$ in (4) (Schlesinger, 1976):

(8) maximize $\theta^\top \mu$ with respect to $\mu \in \mathcal{L}(G)$.
It can be shown that the points of $\mathcal{L}(G)$ with integer entries are the vertices of $\mathcal{M}(G)$; therefore, the solution of LP-MAP provides an upper bound on the optimal objective of MAP.
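To make the local consistency conditions concrete, here is a small checker (an illustrative sketch of our own, not code from the paper) for a single pairwise factor: it verifies nonnegativity, normalization, and the marginalization constraints tying the factor distribution $q_\alpha$ to the variable distributions $p_i$. Note that, as discussed above, normalization of $q_\alpha$ plus marginalization already implies normalization of $p_1$, so the redundancy is visible here too.

```python
def locally_consistent(p1, p2, q, tol=1e-9):
    """Check the local polytope constraints for one pairwise factor.

    q[x1][x2] is the factor distribution; p1 and p2 are the variable
    distributions. Returns True iff (p1, p2, q) is a pseudo-marginal.
    """
    vals = list(p1) + list(p2) + [x for row in q for x in row]
    if any(v < -tol for v in vals):                        # nonnegativity
        return False
    if abs(sum(p1) - 1) > tol or abs(sum(p2) - 1) > tol:   # normalization
        return False
    for x1 in range(len(p1)):                              # marginalization
        if abs(sum(q[x1]) - p1[x1]) > tol:
            return False
    for x2 in range(len(p2)):
        if abs(sum(q[x1][x2] for x1 in range(len(p1))) - p2[x2]) > tol:
            return False
    return True
```

For example, uniform variable marginals with all mass of $q$ on the agreeing configurations pass the check, while a $q$ whose row sums disagree with $p_1$ fail it.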
2.4 LP-MAP Inference Algorithms
While any off-the-shelf LP solver can be used for solving (8), specialized algorithms have been designed to exploit the graph structure, achieving superior performance on several benchmarks (Yanover et al., 2006). Most of these specialized algorithms belong to two classes: block (dual) coordinate descent methods, which take the form of message-passing algorithms, and projected subgradient algorithms, based on dual decomposition.
Block coordinate descent methods address the dual of (8) by alternately optimizing over blocks of coordinates. Different choices of blocks lead to different algorithms: max-sum diffusion (Kovalevsky and Koval, 1975; Werner, 2007); max-product sequential tree-reweighted belief propagation (TRW-S) (Wainwright et al., 2005; Kolmogorov, 2006); max-product linear programming (MPLP) (Globerson and Jaakkola, 2008). These algorithms work by passing messages (which require computing max-marginals) between factors and variables. Under certain conditions, if the relaxation is tight, one may obtain optimality certificates. If the relaxation is not tight, it is sometimes possible to reduce the problem size or use a cutting-plane method to move toward the exact MAP (Sontag et al., 2008). A disadvantage of block coordinate descent algorithms is that they may stop at suboptimal solutions, since the objective is nonsmooth (Bertsekas et al. 1999, Section 6.3.4).
Projected subgradient based on dual decomposition is a classical technique (Dantzig and Wolfe, 1960; Everett III, 1963; Shor, 1985), first proposed in the context of graphical models by Komodakis et al. (2007) and Johnson et al. (2007). Due to its strong relationship with the approach pursued in this paper, that method is described in detail in the next section.
3 Dual Decomposition with the Projected Subgradient Algorithm
The projected subgradient method for dual decomposition can be seen as an iterative “consensus algorithm,” alternating between a broadcast operation, where local subproblems are distributed across worker nodes, and a gather operation, where the local solutions are assembled by a master node to update the current global solution. At each round, the worker nodes perform local MAP inference independently; hence, the algorithm is completely modular and trivially parallelizable.
For each edge $(i, \alpha) \in E$, define a potential function $\theta_{i\alpha}$ that satisfies $\sum_{\alpha \in N(i)} \theta_{i\alpha} = \theta_i$; a trivial choice is $\theta_{i\alpha} := \theta_i / |N(i)|$. The objective in (8) can be rewritten as

(9) $\theta^\top \mu = \sum_{\alpha \in F} \Bigl( \theta_\alpha^\top q_\alpha + \sum_{i \in N(\alpha)} \theta_{i\alpha}^\top p_i \Bigr),$

because $\sum_{i \in V} \theta_i^\top p_i = \sum_{i \in V} \sum_{\alpha \in N(i)} \theta_{i\alpha}^\top p_i$. The LP-MAP problem (8) is then equivalent to the following primal formulation, which we call LP-MAP-P:

(10) maximize $\sum_{\alpha \in F} \Bigl( \theta_\alpha^\top q_\alpha + \sum_{i \in N(\alpha)} \theta_{i\alpha}^\top M_{i\alpha} q_\alpha \Bigr)$
with respect to $q_\alpha \in \Delta^{|\mathcal{X}_\alpha|}$, $\alpha \in F$, and $p$,
subject to $M_{i\alpha} q_\alpha = p_i$, for all $(i,\alpha) \in E$.
Note that, although the variables $p_i$ do not appear in the objective of (10), they play a fundamental role through the constraints in the last line, which are necessary to ensure that the marginals encoded in the $q_\alpha$ variables are consistent on their overlaps. Indeed, it is this set of constraints that complicates the optimization problem, which would otherwise be separable into independent subproblems, one per factor. Introducing a Lagrange multiplier $\lambda_{i\alpha}$ for each of these equality constraints leads to the Lagrangian function
(11) $L(q, p, \lambda) := \sum_{\alpha \in F} \Bigl( \theta_\alpha^\top q_\alpha + \sum_{i \in N(\alpha)} (\theta_{i\alpha} + \lambda_{i\alpha})^\top M_{i\alpha} q_\alpha \Bigr) - \sum_{i \in V} \Bigl( \sum_{\alpha \in N(i)} \lambda_{i\alpha} \Bigr)^{\!\top} p_i,$
the maximization of which w.r.t. $q$ and $p$ yields the (Lagrangian) dual objective. Since the $p_i$ variables are unconstrained, we have

(12) $\max_{q, p} L(q, p, \lambda) = \begin{cases} g(\lambda), & \text{if } \sum_{\alpha \in N(i)} \lambda_{i\alpha} = 0, \;\forall i \in V \\ +\infty, & \text{otherwise,} \end{cases}$

where $g(\lambda)$ is the dual objective function,

(13) $g(\lambda) := \sum_{\alpha \in F} g_\alpha(\lambda),$

and each $g_\alpha(\lambda)$ is the solution of a local problem (called the $\alpha$-subproblem):

(14) $g_\alpha(\lambda) := \max_{q_\alpha \in \Delta^{|\mathcal{X}_\alpha|}} \Bigl( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top (\theta_{i\alpha} + \lambda_{i\alpha}) \Bigr)^{\!\top} q_\alpha$
(15) $\quad\;\; = \max_{x_\alpha \in \mathcal{X}_\alpha} \; \theta_\alpha(x_\alpha) + \sum_{i \in N(\alpha)} \bigl( \theta_{i\alpha}(x_i) + \lambda_{i\alpha}(x_i) \bigr);$

the last equality is justified by the fact that maximizing a linear objective over the probability simplex gives the largest component of the score vector. Finally, the dual problem is

(16) minimize $g(\lambda)$ with respect to $\lambda \in \Lambda := \bigl\{ \lambda : \sum_{\alpha \in N(i)} \lambda_{i\alpha} = 0, \;\forall i \in V \bigr\}.$
Problem (16) is often referred to as the master or controller problem, and each subproblem (14)–(15) as a slave or worker problem. Note that each of these subproblems is itself a local MAP problem, involving only factor $\alpha$. As a consequence, the solution of (14) is an indicator vector corresponding to a particular configuration $\widehat{x}_\alpha$ (the solution of (15)), that is, $\widehat{q}_\alpha = e_{\widehat{x}_\alpha}$.
The dual problem (16) can be solved with a projected subgradient algorithm: a subgradient of $g$ at $\lambda$ is readily obtained from the solutions of the subproblems,

(17) $\frac{\partial g(\lambda)}{\partial \lambda_{i\alpha}} = M_{i\alpha} \widehat{q}_\alpha,$

and the projection onto $\Lambda$ amounts to a centering operation (subtracting, for each $i$, the average over the factors $\alpha \in N(i)$).
The subproblems (14)–(15) can be handled in parallel; their solutions are then gathered for computing this projection and updating the Lagrange variables.
Putting these pieces together yields Algorithm 1, which assumes a black-box procedure ComputeMAP that returns a local MAP assignment (as an indicator vector), given log-potentials as input. At each iteration, the algorithm broadcasts the current Lagrange multipliers to all the factors. Each factor adjusts its internal unary log-potentials (line 6) and invokes the ComputeMAP procedure (line 7).
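The loop structure of Algorithm 1 can be mimicked in a few lines for the simplest possible case: one binary variable shared by two factors, each carrying only unary log-potentials. All names and numbers below are hypothetical toy choices, not the paper's code. Each worker solves its local MAP, the master averages the local solutions, and the multipliers are adjusted with a diminishing step size $1/t$ (the subtraction of the average is the centering/projection step).

```python
def local_map(theta):
    """Local MAP over {0, 1}, returned as an indicator vector."""
    k = 0 if theta[0] >= theta[1] else 1
    e = [0.0, 0.0]
    e[k] = 1.0
    return e

def dd_subgradient(theta_a, theta_b, iters=100):
    """Projected subgradient DD for one binary variable in two factors."""
    lam_a = [0.0, 0.0]
    lam_b = [0.0, 0.0]
    for t in range(1, iters + 1):
        eta = 1.0 / t  # diminishing step size
        # broadcast: each worker solves a local MAP with adjusted potentials
        qa = local_map([theta_a[k] + lam_a[k] for k in range(2)])
        qb = local_map([theta_b[k] + lam_b[k] for k in range(2)])
        # gather: average the local solutions
        avg = [(qa[k] + qb[k]) / 2.0 for k in range(2)]
        # multiplier update (deviation from the consensus)
        lam_a = [lam_a[k] - eta * (qa[k] - avg[k]) for k in range(2)]
        lam_b = [lam_b[k] - eta * (qb[k] - avg[k]) for k in range(2)]
    return qa, qb
```

With one factor preferring state 1 (score 1) and the other weakly preferring state 0 (score 0.5), the workers initially disagree, and the price adjustments drive both to the globally best state 1 after a few iterations.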
Proposition 2 (Convergence rate)
If the step size sequence $(\eta_t)_{t \in \mathbb{N}}$ is diminishing and nonsummable ($\lim_t \eta_t = 0$ and $\sum_t \eta_t = \infty$), then Algorithm 1 converges to a solution of the dual problem (16); moreover, $O(1/\epsilon^2)$ iterations are sufficient to achieve a dual objective within $\epsilon$ of the optimum.
Proof: This is a well-known property of projected subgradient algorithms (see, e.g., Bertsekas et al. 1999).
Proposition 3 (Certificate of optimality)
If, at some iteration of Algorithm 1, all the local subproblems are in agreement (i.e., their solutions are consistent on the shared variables), then the current $\lambda$ is a solution of the dual problem (16), and the agreed-upon configuration is a MAP configuration.
Proof: If all local subproblems are in agreement, then a vacuous update will occur in line 11, and no further changes will occur. Since the algorithm is guaranteed to converge, the current $\lambda$ is optimal. Also, if all local subproblems are in agreement, the averaging in line 10 necessarily yields a binary vector. Since any binary solution of LP-MAP is also a solution of MAP, the result follows.
Propositions 2–3 imply that, if the parameters are such that LP-MAP is a tight relaxation, then Algorithm 1 yields the exact MAP configuration along with a certificate of optimality. According to Proposition 2, even if the relaxation is not tight, Algorithm 1 still converges to a solution of LP-MAP. In some problem domains, the LP-MAP relaxation is often tight (Koo and Collins, 2010; Rush et al., 2010); for problems with a relaxation gap, techniques to tighten the relaxation have been developed (Rush and Collins, 2011). However, in large graphs with many overlapping factors, it has been observed that Algorithm 1 can converge quite slowly in practice (Martins et al., 2011b). This is not surprising, given that it attempts to reach a consensus among all overlapping components; the larger this number, the harder it is to achieve consensus. In the next section, we propose a new LP-MAP inference algorithm that is more suitable for this class of problems.
4 Alternating Directions Dual Decomposition (AD³)
4.1 Addressing the Weaknesses of Dual Decomposition with Projected Subgradient
The main weaknesses of Algorithm 1 reside in the following two aspects.

The dual objective function is nonsmooth, which is the reason why "subgradients" are used instead of "gradients." It is well-known that nonsmooth optimization lacks some of the good properties of its smooth counterpart. Namely, there is no guarantee of monotonic improvement in the objective (see Bertsekas et al. 1999, p. 611). Ensuring convergence requires using a diminishing step size sequence, which leads to slow convergence rates. In fact, as stated in Proposition 2, $O(1/\epsilon^2)$ iterations are required to guarantee $\epsilon$ accuracy.

A close look at Algorithm 1 reveals that the consensus is promoted solely by the Lagrange multipliers (line 6). In an economic interpretation, these represent "price adjustments" that lead to a reallocation of resources. However, no "memory" exists about past allocations or adjustments, so the workers never know how far they are from consensus. It is thus conceivable that a smart use of these quantities could accelerate convergence.
To obviate the first of these problems, Johnson et al. (2007) proposed smoothing the objective function through an "entropic" perturbation (controlled by a "temperature" parameter), which boils down to replacing the max in (16) by a "soft-max." As a result, all the local subproblems become marginal (rather than MAP) computations. A related and asymptotically faster method was proposed later by Jojic et al. (2010), who address the resulting smooth optimization problem with an accelerated gradient method (Nesterov, 1983, 2005). That approach guarantees an $\epsilon$-accurate solution after $O(1/\epsilon)$ iterations, an improvement over the $O(1/\epsilon^2)$ bound of Algorithm 1. However, those methods have some drawbacks. First, they need to operate at near-zero temperatures; e.g., the iteration bound of Jojic et al. (2010) requires setting the temperature to $O(\epsilon)$, which scales the potentials by $O(1/\epsilon)$ and may lead to numerical instabilities in some problems. Second, the solutions of the local subproblems are always dense; although some marginal values may be low, they are never exactly zero. This contrasts with the projected subgradient algorithm, for which the solutions of the local subproblems are MAP configurations. These configurations can be cached across iterations, leading to important speedups (Koo et al., 2010).
We will show that AD³ also yields an $O(1/\epsilon)$ iteration bound without suffering from the two drawbacks above. Unlike Algorithm 1, AD³ broadcasts the current global solution to the workers, thus allowing them to regularize their subproblems toward that solution. This promotes a faster consensus, without sacrificing the modularity of dual decomposition. Another advantage of AD³ is that it keeps track of primal and dual residuals, allowing us to monitor the LP-MAP optimization process and to stop when a desired accuracy level is attained.
4.2 Augmented Lagrangians and the Alternating Directions Method of Multipliers
Let us start with a brief overview of augmented Lagrangian methods. Consider the following general convex optimization problem with equality constraints:

(18) maximize $f(q) + g(p)$ with respect to $q \in \mathcal{Q}$, $p \in \mathcal{P}$, subject to $Aq + Bp = c$,

where $\mathcal{Q}$ and $\mathcal{P}$ are convex sets and $f$ and $g$ are concave functions. Note that the LP-MAP-P problem stated in (10) has this form. For any $\rho \ge 0$, consider the problem

(19) maximize $f(q) + g(p) - \frac{\rho}{2} \| Aq + Bp - c \|^2$ with respect to $q \in \mathcal{Q}$, $p \in \mathcal{P}$, subject to $Aq + Bp = c$,
which differs from (18) by the extra term penalizing violations of the equality constraints; since this term vanishes at feasibility, the two problems have the same solution. The reason to consider (19) is that its objective is strongly concave along the constraint-violating directions, even if $f$ and $g$ are not strongly concave. The Lagrangian of problem (19),

(20) $L_\rho(q, p, \lambda) := f(q) + g(p) + \lambda^\top (Aq + Bp - c) - \frac{\rho}{2} \| Aq + Bp - c \|^2,$
is the augmented Lagrangian of problem (18). The so-called augmented Lagrangian (AL) methods (Bertsekas et al., 1999, Section 4.2) address problem (18) by seeking a saddle point of $L_{\rho_t}$, for some sequence $(\rho_t)_{t \in \mathbb{N}}$. The earliest instance is the method of multipliers (MM) (Hestenes, 1969; Powell, 1969), which alternates between a joint update of $q$ and $p$,

(21) $(q^{t+1}, p^{t+1}) := \arg\max_{q \in \mathcal{Q},\, p \in \mathcal{P}} L_{\rho_t}(q, p, \lambda^t),$

and a gradient update of the Lagrange multiplier,

(22) $\lambda^{t+1} := \lambda^t - \rho_t \bigl( A q^{t+1} + B p^{t+1} - c \bigr).$
Under some conditions, this method is convergent, and even superlinear if the sequence $(\rho_t)$ is increasing (Bertsekas et al. 1999, Section 4.2). A shortcoming of the MM is that problem (21) may be difficult, since the penalty term of the augmented Lagrangian couples the variables $q$ and $p$. The alternating directions method of multipliers (ADMM) avoids this shortcoming, by replacing the joint optimization (21) with a single block Gauss-Seidel-type step:

(23) $q^{t+1} := \arg\max_{q \in \mathcal{Q}} L_\rho(q, p^t, \lambda^t),$
(24) $p^{t+1} := \arg\max_{p \in \mathcal{P}} L_\rho(q^{t+1}, p, \lambda^t).$
In general, problems (23)–(24) are simpler than the joint maximization in (21). ADMM was proposed by Glowinski and Marroco (1975) and Gabay and Mercier (1976) and is related to other optimization methods, such as Douglas-Rachford splitting (Eckstein and Bertsekas, 1992) and proximal point methods (see Boyd et al. 2011 for a historical overview).
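To see the ADMM steps (23)–(24) and the multiplier update in action outside the graphical-model setting, here is a minimal scalar instance of our own (not from the paper): minimize $(x-a)^2 + (z-b)^2$ subject to $x = z$, whose solution is the midpoint $(a+b)/2$. Both alternating updates have closed forms, obtained by setting the derivative of the augmented Lagrangian to zero.

```python
def admm_consensus(a, b, rho=1.0, iters=500):
    """ADMM for: minimize (x - a)^2 + (z - b)^2 subject to x = z."""
    x = z = lam = 0.0
    for _ in range(iters):
        # x-update: argmin_x (x - a)^2 + lam*(x - z) + (rho/2)*(x - z)^2
        x = (2 * a - lam + rho * z) / (2 + rho)
        # z-update: argmin_z (z - b)^2 - lam*z + (rho/2)*(x - z)^2
        z = (2 * b + lam + rho * x) / (2 + rho)
        # multiplier update on the residual x - z
        lam += rho * (x - z)
    return x, z
```

With a = 0 and b = 4, both blocks converge to the consensus value 2 with a fixed penalty parameter, illustrating the fixed-step-size behavior that distinguishes ADMM from the diminishing-step subgradient method.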
4.3 Derivation of AD³
The LP-MAP-P problem (10) can be cast into the form (18) by proceeding as follows:

let $\mathcal{Q}$ in (18) be the Cartesian product of simplices, $\mathcal{Q} := \prod_{\alpha \in F} \Delta^{|\mathcal{X}_\alpha|}$, and let $\mathcal{P} := \mathbb{R}^P$;

let $f(q) := \sum_{\alpha \in F} \bigl( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top \theta_{i\alpha} \bigr)^{\!\top} q_\alpha$ and $g(p) := 0$;

let $A$ in (18) be a block-diagonal matrix, with one block per factor $\alpha$, each block being a vertical concatenation of the matrices $\{M_{i\alpha}\}_{i \in N(\alpha)}$;

let $B$ be a matrix of grid-structured blocks, where the block in the row corresponding to edge $(i,\alpha)$ and the column corresponding to variable $i$ is a negative identity matrix of size $|\mathcal{X}_i|$, and all the other blocks are zero;

let $c := 0$.
The augmented Lagrangian associated with (10) is

(25) $L_\rho(q, p, \lambda) := L(q, p, \lambda) - \frac{\rho}{2} \sum_{(i,\alpha) \in E} \bigl\| M_{i\alpha} q_\alpha - p_i \bigr\|^2.$
This is the standard Lagrangian in (11) plus the Euclidean penalty term. The ADMM updates are
Broadcast: (26) $q^{t+1} := \arg\max_{q \in \mathcal{Q}} \; L_\rho(q, p^t, \lambda^t),$
Gather: (27) $p^{t+1} := \arg\max_{p} \; L_\rho(q^{t+1}, p, \lambda^t),$
Multiplier update: (28) $\lambda_{i\alpha}^{t+1} := \lambda_{i\alpha}^t - \rho \bigl( M_{i\alpha} q_\alpha^{t+1} - p_i^{t+1} \bigr), \quad \forall (i,\alpha) \in E.$
We next analyze the broadcast and gather steps, and prove a proposition about the multiplier update.
Broadcast step. The maximization (26) can be carried out in parallel at the factors, as in Algorithm 1. The only difference is that, instead of a local MAP computation, each worker now needs to solve a quadratic program of the form:

(29) $\widehat{q}_\alpha := \arg\max_{q_\alpha \in \Delta^{|\mathcal{X}_\alpha|}} \Bigl( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top (\theta_{i\alpha} + \lambda_{i\alpha}^t) \Bigr)^{\!\top} q_\alpha - \frac{\rho}{2} \sum_{i \in N(\alpha)} \bigl\| M_{i\alpha} q_\alpha - p_i^t \bigr\|^2.$
The subproblem (29) differs from the linear subproblem (14)–(15) in the projected subgradient algorithm by including a Euclidean penalty term, which penalizes deviations from the global consensus. Sections 5 and 6 propose procedures for solving these local subproblems.
Gather step. The solution of (27) has a closed form. Indeed, this problem is separable into independent optimizations, one for each $i \in V$; defining $q_{i\alpha}^{t+1} := M_{i\alpha} q_\alpha^{t+1}$,

(30) $p_i^{t+1} := \frac{1}{|N(i)|} \sum_{\alpha \in N(i)} \Bigl( q_{i\alpha}^{t+1} - \frac{1}{\rho} \lambda_{i\alpha}^t \Bigr).$
Proposition 4
The sequence of Lagrange multipliers produced by the updates (28) satisfies, for every $t$,

(31) $\sum_{\alpha \in N(i)} \lambda_{i\alpha}^t = 0, \quad \forall i \in V.$

As a consequence, the update (30) reduces to a simple averaging,

(32) $p_i^{t+1} = \frac{1}{|N(i)|} \sum_{\alpha \in N(i)} q_{i\alpha}^{t+1}.$

Proof: The proof is by induction: $\lambda^0 := 0$ satisfies (31); for $t \ge 0$, if $\lambda^t$ satisfies (31), then, summing the update (28) over $\alpha \in N(i)$ and plugging in (30),

$\sum_{\alpha \in N(i)} \lambda_{i\alpha}^{t+1} = \sum_{\alpha \in N(i)} \lambda_{i\alpha}^t - \rho \sum_{\alpha \in N(i)} \bigl( q_{i\alpha}^{t+1} - p_i^{t+1} \bigr) = 0,$

i.e., $\lambda^{t+1}$ also satisfies (31).
Assembling all these pieces together leads to AD³ (Algorithm 2). Notice that AD³ retains the modular structure of Algorithm 1: both are iterative consensus algorithms, alternating between a broadcast operation, where subproblems are distributed across local workers (lines 5–9 in Algorithm 2), and a gather operation, where the local solutions are assembled by a master node, which updates the global solution (line 10) and adjusts the multipliers to promote a consensus (line 11). The key difference is that AD³ also broadcasts the current global solution to the workers, allowing them to regularize their subproblems toward that solution, thus speeding up the consensus. This is embodied in the procedure SolveQP (line 7), which replaces the ComputeMAP procedure of Algorithm 1.
4.4 Convergence Analysis
Proof of Convergence
Convergence of AD³ follows directly from the general convergence properties of ADMM. Remarkably, unlike in Algorithm 1 (projected subgradient), convergence is ensured with a fixed step size.
Proposition 5 (Convergence)
Let $(q^t, p^t, \lambda^t)$ be the sequence of iterates produced by Algorithm 2 with a fixed $\rho > 0$. Then the following holds:

1. primal feasibility of (10) is achieved in the limit, i.e., $M_{i\alpha} q_\alpha^t - p_i^t \to 0$, for all $(i,\alpha) \in E$;

2. the sequence $(q^t, p^t)$ converges to a solution of the LP-MAP-P problem (10);

3. the sequence $\lambda^t$ converges to a solution of the dual problem (16);

4. every iterate $\lambda^t$ is dual feasible.
Proof: 1, 2, and 3 are general properties of ADMM (Glowinski and Le Tallec, 1989, Theorem 4.2). Statement 4 stems from Proposition 4 (simply compare (31) with (13)).
Although Proposition 5 guarantees convergence for any choice of $\rho$, this parameter may strongly impact the behavior of the algorithm, as illustrated in Fig. 3. In our experiments, we dynamically adjust $\rho$ in the earlier iterations using the heuristic described in Boyd et al. (2011), and freeze it afterwards, so as not to compromise convergence.
Primal and dual residuals
Recall from Proposition 3 that the projected subgradient algorithm yields optimality certificates if the relaxation is tight (i.e., if the solution of the LP-MAP problem is the exact MAP), whereas in problems with a relaxation gap it is harder to decide when to stop. It would be desirable to have similar guarantees concerning the relaxed primal. In AD³, this role is played by the primal residual $r_P^t$, the (normalized) amount by which the agreement constraints are violated,

(34) $r_P^t := \frac{1}{C} \sum_{(i,\alpha) \in E} \bigl\| M_{i\alpha} q_\alpha^t - p_i^t \bigr\|^2,$

where the constant $C$ in the denominator ensures that $r_P^t \le 1$ (one may take $C := 2|E|$, since $\|a - b\|^2 \le 2$ for any two points $a$, $b$ in a probability simplex). The dual residual $r_D^t$,

(35) $r_D^t := \frac{1}{C} \sum_{(i,\alpha) \in E} \bigl\| p_i^t - p_i^{t-1} \bigr\|^2,$

is the amount by which a dual optimality condition is violated (Boyd et al., 2011). We adopt as stopping criterion that these two residuals fall below a threshold, e.g., $10^{-6}$.
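A minimal sketch of such a stopping test is given below, using a flattened representation and normalization constants of our own choosing (the paper's exact constants may differ); the structure, primal residual from constraint violations and dual residual from the change in the global solution scaled by $\rho$, is the standard ADMM recipe.

```python
import math

def residuals(q_edge, p_new, p_old, rho):
    """Normalized primal and dual residuals for one ADMM iteration.

    q_edge: flattened factor-side copies (the M_ia q_a terms) over all edges;
    p_new, p_old: the flattened global vector p at iterations t and t-1.
    The 1/len normalization inside the square root is one convenient
    choice, not necessarily the paper's exact constant.
    """
    n = len(p_new)
    primal = math.sqrt(sum((qe - pe) ** 2 for qe, pe in zip(q_edge, p_new)) / n)
    dual = rho * math.sqrt(sum((pn - po) ** 2 for pn, po in zip(p_new, p_old)) / n)
    return primal, dual

def should_stop(q_edge, p_new, p_old, rho, eps=1e-6):
    """Stop when both residuals fall below the threshold eps."""
    rp, rd = residuals(q_edge, p_new, p_old, rho)
    return rp < eps and rd < eps
```

When the workers agree with the global solution and the global solution has stopped moving, both residuals are zero and the test fires.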
Approximate solutions of the local subproblems
The next proposition states that convergence may still hold, even if the local subproblems are solved approximately, provided the sequence of errors is summable.

Proposition 6 (Eckstein and Bertsekas, 1992)
Let $(\eta_t)_{t \in \mathbb{N}}$ be a sequence such that each subproblem (26) is solved with error at most $\eta_t$ at iteration $t$. If $\sum_{t=1}^{\infty} \eta_t < \infty$, then the convergence statements in Proposition 5 still hold.

The importance of this result will be clear in Section 6, where we describe a general iterative algorithm for solving the local quadratic subproblems. Essentially, Proposition 6 allows these subproblems to be solved numerically up to some accuracy without compromising global convergence, as long as the accuracy of the solutions improves sufficiently fast over the ADMM iterations.
Convergence rate
Although ADMM was invented in the 1970s, its convergence rate was unknown until recently. The next proposition states the $O(1/\epsilon)$ iteration bound of AD³, asymptotically equivalent to that of the algorithm of Jojic et al. (2010), and therefore better than the $O(1/\epsilon^2)$ bound of Algorithm 1.
Proposition 7 (Convergence rate)
Let $\bar{\lambda}^T := \frac{1}{T} \sum_{t=1}^{T} \lambda^t$ be the averaged Lagrange multipliers after $T$ iterations of Algorithm 2, and let $\lambda^*$ be a solution of the dual problem (16). Then $g(\bar{\lambda}^T) - g(\lambda^*) \le \epsilon$ after

(37) $T = O\Bigl( \frac{1}{\epsilon} \Bigl( \rho \sum_{\alpha \in F} \sum_{i \in N(\alpha)} |\mathcal{X}_i| + \frac{\|\lambda^*\|^2}{\rho} \Bigr) \Bigr)$

iterations.
Proof: The proof (detailed in Appendix A) uses recent results of He and Yuan (2011) and Wang and Banerjee (2012), concerning the convergence of ADMM in a variational inequality setting.
As expected, the bound (37) increases with the number of overlapping variables, the size of the sets $\mathcal{X}_i$, and the magnitude of the optimal dual vector $\lambda^*$. Note that if there is a good estimate of $\|\lambda^*\|$, then (37) can be used to choose a step size $\rho$ that minimizes the bound. Moreover, unlike the bound of the accelerated method of Jojic et al. (2010), which requires specifying $\epsilon$ beforehand, AD³ does not require prespecifying a desired accuracy.
The bounds derived so far for all these algorithms are with respect to the dual problem—an open problem is to obtain bounds related to primal convergence.
Runtime and caching strategies
In practice, considerable speedups can be achieved by caching the subproblems, a strategy which has also been proposed for the projected subgradient algorithm (Koo et al., 2010). After a few iterations, many variables reach a consensus (i.e., $q_{i\alpha}^{t+1} = p_i^t$) and enter an idle state: they are left unchanged by the update (32), and so are the corresponding Lagrange variables (28). If, at iteration $t$, all variables in a subproblem at factor $\alpha$ are idle, then the solution of that subproblem remains unchanged, hence it does not need to be solved again. Typically, many variables and subproblems enter this idle state after the first few rounds. We will show the practical benefits of caching in the experimental section (Section 7.5).
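The idle-subproblem optimization amounts to memoization: skip a factor whose inputs have not changed since the last iteration. A minimal sketch (class and method names are our own, not the paper's code):

```python
class FactorCache:
    """Reuse a factor's subproblem solution while its inputs are unchanged."""

    def __init__(self, solver):
        self.solver = solver   # function: inputs -> solution
        self.cache = {}        # factor id -> (inputs, solution)
        self.solves = 0        # number of actual (non-cached) solves

    def solve(self, fid, inputs):
        hit = self.cache.get(fid)
        if hit is not None and hit[0] == inputs:
            return hit[1]      # factor is idle: reuse the cached solution
        sol = self.solver(inputs)
        self.solves += 1
        self.cache[fid] = (list(inputs), sol)
        return sol
```

Repeated calls with identical inputs (the idle case) cost a dictionary lookup instead of a quadratic program solve; only when the multipliers or the global solution touching the factor change does the solver run again.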
4.5 Exact Inference with Branch-and-Bound
Recall that AD³, as just described, solves the LP-MAP relaxation of the actual problem. In some problems, this relaxation is tight (in which case a certificate of optimality will be obtained), but this is not always the case. When a fractional solution is obtained, it is desirable to have a strategy to recover the exact MAP solution.
Two observations are noteworthy. First, as we saw in Section 2.3, the optimal value of the relaxed problem LP-MAP provides an upper bound on the original MAP problem. In particular, any feasible dual point provides an upper bound on the original problem's optimal value. Second, during the execution of the AD³ algorithm, we always keep track of a sequence of feasible dual points (as guaranteed by Proposition 5, item 4). Therefore, each iteration constructs tighter and tighter upper bounds. In recent work (?), we proposed a branch-and-bound search procedure that finds the exact solution of the ILP. The procedure works recursively as follows:

1. Initialize $v^* := -\infty$ (our best value so far).

2. Run AD³ on the LP-MAP relaxation of the current problem, obtaining an upper bound on its optimal value; if this upper bound is not greater than $v^*$, prune and backtrack; if an integer solution is obtained, update $v^*$ and backtrack.

3. Find the "most fractional" component of the solution (call it $p_i$) and branch: for every $x_i \in \mathcal{X}_i$, constrain $p_i := e_{x_i}$ and go to step 2, eventually obtaining an integer solution or infeasibility. Return the configuration that yields the largest objective value.
Although this procedure has worst-case exponential runtime, in many problems for which the relaxations are near-exact it is empirically found to be very effective. We will see one example in Section 7.4.
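The recursion above can be sketched on a toy instance of our own devising (not the paper's implementation): exact MAP over binary variables under an exactly-one (XOR) constraint, where the upper bound comes from simply ignoring the constraint on the still-free variables, and branching is on the first free variable rather than the most-fractional component.

```python
def branch_and_bound(theta):
    """Exact MAP under a toy exactly-one constraint, via branch-and-bound.

    theta: scores for setting each binary variable to 1. Feasible
    configurations have exactly one variable equal to 1.
    """
    n = len(theta)
    best = (-float("inf"), None)

    def upper_bound(fixed):
        # relaxation: drop the XOR constraint for the free variables
        ub = sum(theta[i] for i, v in fixed.items() if v == 1)
        ub += sum(max(theta[i], 0.0) for i in range(n) if i not in fixed)
        return ub

    def feasible_value(fixed):
        if len(fixed) < n:
            return None
        ones = [i for i, v in fixed.items() if v == 1]
        return theta[ones[0]] if len(ones) == 1 else None

    def recurse(fixed):
        nonlocal best
        if upper_bound(fixed) <= best[0]:
            return                       # prune: bound cannot beat incumbent
        val = feasible_value(fixed)
        if val is not None:
            if val > best[0]:
                best = (val, dict(fixed))
            return
        if len(fixed) == n:
            return                       # complete but infeasible assignment
        i = min(set(range(n)) - set(fixed))
        for v in (1, 0):                 # branch on the first free variable
            fixed[i] = v
            recurse(fixed)
            del fixed[i]

    recurse({})
    return best
```

Under the exactly-one constraint the optimum is just the largest score, which the search finds while pruning subtrees whose relaxed bound cannot improve on the incumbent.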
5 Local Subproblems in AD³
5.1 Introduction
This section shows how to solve the AD³ local subproblems (29) exactly and efficiently in several cases, including Ising models and logic constraint factors. These results will be complemented in Section 6, where a new procedure to handle arbitrary factors widens the applicability of AD³.
By subtracting a constant from the objective (29), rescaling, and turning the maximization into a minimization, the problem can be written more compactly as

minimize $\frac{1}{2} \| u - a \|^2 - b^\top v$
with respect to $u := (u_i)_{i \in N(\alpha)}$, $v \in \Delta^{|\mathcal{X}_\alpha|}$
subject to $u_i = M_{i\alpha} v$, $i \in N(\alpha)$,   (38)

where $a := (a_i)_{i \in N(\alpha)}$, with $a_i := p_i + \rho^{-1} (\theta_{i\alpha} + \lambda_{i\alpha})$, and $b := \rho^{-1} \theta_\alpha$.
We show that (38) has a closed-form solution, or can be solved exactly and efficiently, in several cases; e.g., for Ising models, for factor graphs imposing first-order logic (FOL) constraints, and for Potts models (after binarization). In these cases, AD³ and the projected subgradient algorithm have (asymptotically) the same computational cost per iteration, up to a logarithmic factor.
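A recurring primitive in solving quadratic subproblems of the form (38) is the Euclidean projection onto the probability simplex; for instance, when the quadratic term dominates and the factor imposes no structure beyond normalization, the subproblem reduces to exactly such a projection. The classic sort-and-threshold routine is shown below as an illustrative building block (a well-known $O(n \log n)$ algorithm, not code from the paper):

```python
def project_onto_simplex(a):
    """Euclidean projection of vector a onto {u : u >= 0, sum(u) = 1}.

    Sorts the entries, finds the largest support size rho for which the
    threshold keeps all supported entries positive, then clips.
    """
    u = sorted(a, reverse=True)
    css = 0.0
    rho, csr = 0, 0.0
    for k, uk in enumerate(u):
        css += uk
        if uk - (css - 1.0) / (k + 1) > 0:   # entry stays above threshold
            rho, csr = k + 1, css
    tau = (csr - 1.0) / rho                   # the clipping threshold
    return [max(x - tau, 0.0) for x in a]
```

Note how the projection can return exactly-zero coordinates (e.g., projecting [2, 0] yields [1, 0]); this sparsity of the quadratic subproblem solutions is one of the practical advantages discussed in Section 4.1.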
5.2 Ising Models
Ising models are factor graphs containing only binary pairwise factors. A binary pairwise factor (say, $\alpha$) is one connecting two binary variables (say, $X_1$ and $X_2$); thus $\mathcal{X}_1 = \mathcal{X}_2 = \{0, 1\}$ and $\mathcal{X}_\alpha = \{0, 1\}^2$. Given that the variables are binary, we can write $u_1 := u_1(1)$ and $u_2 := u_2(1)$. Furthermore, since $v \ge 0$ and marginalization requires that $u_1 = v(1,0) + v(1,1)$ and $u_2 = v(0,1) + v(1,1)$, we can also write