Alternating Directions Dual Decomposition
We propose AD, a new algorithm for approximate maximum a posteriori (MAP) inference on factor graphs based on the alternating directions method of multipliers. Like dual decomposition algorithms, AD uses worker nodes to iteratively solve local subproblems and a controller node to combine these local solutions into a global update. The key characteristic of AD is that each local subproblem has a quadratic regularizer, leading to a faster consensus than subgradient-based dual decomposition, both theoretically and in practice. We provide closed-form solutions for these AD subproblems for binary pairwise factors and factors imposing first-order logic constraints. For arbitrary factors (large or combinatorial), we introduce an active set method which requires only an oracle for computing a local MAP configuration, making AD applicable to a wide range of problems. Experiments on synthetic and real-world problems show that AD compares favorably with the state-of-the-art.
Graphical models enable compact representations of probability distributions, being widely used in computer vision, natural language processing (NLP), and computational biology (Pearl, 1988; Lauritzen, 1996; Koller and Friedman, 2009). When using these models, a central problem is that of inferring the most probable (a.k.a. maximum a posteriori – MAP) configuration. Unfortunately, exact MAP inference is an intractable problem for many graphical models of interest in applications, such as those involving non-local features and/or structural constraints. This fact has motivated a significant research effort on approximate MAP inference.
A class of methods that proved effective for approximate inference is based on linear programming relaxations of the MAP problem (LP-MAP) (Schlesinger, 1976). Several message-passing algorithms have been proposed to address the resulting LP problems, taking advantage of the underlying graph structure (Wainwright et al., 2005; Kolmogorov, 2006; Werner, 2007; Globerson and Jaakkola, 2008; Ravikumar et al., 2010). In the same line, Komodakis et al. (2007) proposed a method based on the classical dual decomposition (Dantzig and Wolfe, 1960; Everett III, 1963; Shor, 1985), which breaks the original problem into a set of smaller subproblems; this set of subproblems, together with the constraints that they should agree on the shared variables, yields a constrained optimization problem, the Lagrange dual of which is solved using a projected subgradient method. Initially proposed for computer vision (Komodakis et al., 2007; Komodakis and Paragios, 2009), this technique has also been shown effective in several NLP tasks that involve combinatorial constraints, such as parsing and machine translation (Koo et al., 2010; Rush et al., 2010; Auli and Lopez, 2011; Rush and Collins, 2011; Chang and Collins, 2011; DeNero and Macherey, 2011). The major drawback is that the subgradient algorithm is very slow to achieve consensus in problems with a large number of overlapping components, requiring O(1/ε²) iterations for an ε-accurate solution. Jojic et al. (2010) addressed this drawback by introducing an accelerated method that enjoys a faster O(1/ε) convergence rate; however, that method is sensitive to prior knowledge of ε and to the prescription of a temperature parameter, which may yield numerical instabilities.
In this paper, we ally the simplicity of dual decomposition (DD) with
the effectiveness of augmented Lagrangian methods, which have a
long-standing history in optimization (Hestenes, 1969; Powell, 1969).
In particular, we employ the alternating directions method of multipliers (ADMM).
Our main contributions are:
We derive AD and establish its convergence properties, blending classical and newer results about ADMM (Glowinski and Le Tallec, 1989; Eckstein and Bertsekas, 1992; Wang and Banerjee, 2012). We show that the algorithm has the same form as the DD method of Komodakis et al. (2007), with the local MAP subproblems replaced by quadratic programs.
We show that these AD subproblems can be solved exactly and efficiently in many cases of interest, including Ising models and a wide range of hard factors representing arbitrary constraints in first-order logic. Up to a logarithmic term, the asymptotic cost is the same as that of passing messages or doing local MAP inference.
For factors lacking a closed-form solution of the AD subproblems, we introduce a new active set method. Remarkably, our method requires only a black box that returns local MAP configurations for each factor (the same requirement as the projected subgradient algorithm). This paves the way for using AD with large dense or structured factors, based on off-the-shelf combinatorial algorithms (e.g., Viterbi, Chu-Liu-Edmonds, Ford-Fulkerson).
We show that AD can be wrapped into a branch-and-bound procedure to retrieve the exact MAP configuration.
AD was originally introduced by Martins et al. (2010a, 2011a) (then called DD-ADMM). In addition to a considerably more detailed presentation, this paper contains contributions that substantially extend that preliminary work in several directions: the rate of convergence, the active set method for general factors, and the branch-and-bound procedure for exact MAP inference. It also reports a wider set of experiments and the release of open-source code (available at http://www.ark.cs.cmu.edu/AD3), which may be useful to other researchers in the field.
This paper is organized as follows. We start by providing background material: MAP inference in graphical models and its LP-MAP relaxation (Section 2); the projected subgradient algorithm for the DD of Komodakis et al. (2007) (Section 3). In Section 4, we derive AD and analyze its convergence. The AD local subproblems are addressed in Section 5, where closed-form solutions are derived for Ising models and several structural constraint factors. In Section 6, we introduce an active set method to solve the AD subproblems for arbitrary factors. Experiments with synthetic models, as well as in protein design and dependency parsing (Section 7), attest to the success of our approach. Finally, related work is discussed in Section 8, and Section 9 concludes the paper.
2.1 Factor Graphs
Let X := (X_1, …, X_n) be a tuple of random variables, where each X_i takes values in a finite set 𝒳_i and X takes values in the corresponding Cartesian product set 𝒳 := 𝒳_1 × ⋯ × 𝒳_n. Graphical models (e.g., Bayesian networks, Markov random fields) are structural representations of joint probability distributions, which conveniently model statistical (in)dependencies among the variables. In this paper, we focus on factor graphs, introduced by Tanner (1981) and Kschischang et al. (2001).
Definition 1 (Factor graph.)
A factor graph is a bipartite graph G := (V, F, E), comprised of:
a set of variable nodes V := {1, …, n}, corresponding to the variables X_1, …, X_n;
a set of factor nodes F (disjoint from V);
a set of edges E ⊆ V × F linking variable nodes to factor nodes.
For notational convenience, we use Latin letters (i, j, …) and Greek letters (α, β, …) to refer to variable and factor nodes, respectively. We denote by N(·) the neighborhood set of its node argument, whose cardinality is called the degree of the node. Formally, N(i) := {α ∈ F : (i, α) ∈ E} for variable nodes, and N(α) := {i ∈ V : (i, α) ∈ E} for factor nodes. We use the short notation X_α to denote the tuple of variables linked to factor α, i.e., X_α := (X_i)_{i ∈ N(α)}, and lowercase letters to denote instantiations of random variables, e.g., x = (x_1, …, x_n) and x_α = (x_i)_{i ∈ N(α)}.
The joint probability distribution of X is said to factor according to the factor graph G if it can be written as

P(X = x) ∝ exp( Σ_{i ∈ V} θ_i(x_i) + Σ_{α ∈ F} θ_α(x_α) ),

where θ_i(·) and θ_α(·) are called, respectively, the unary and higher-order log-potential functions. To accommodate hard constraints, we allow these functions to take values in ℝ ∪ {−∞}. Fig. 1 shows examples of factor graphs with hard constraint factors (to be studied in detail in Section 5.3). For notational convenience, we represent potential functions as vectors, θ_i := (θ_i(x_i))_{x_i ∈ 𝒳_i} and θ_α := (θ_α(x_α))_{x_α ∈ 𝒳_α}, where 𝒳_α := ∏_{i ∈ N(α)} 𝒳_i. We denote by θ the vector, of dimension Σ_{i ∈ V} |𝒳_i| + Σ_{α ∈ F} |𝒳_α|, obtained by stacking all these vectors.
2.2 MAP Inference
Given a factor graph G, along with all the log-potential functions—which fully specify the joint distribution of X—we are interested in finding an assignment x̂ ∈ 𝒳 with maximal probability (the so-called MAP assignment/configuration):
This combinatorial problem, which is NP-hard in general (Koller and Friedman, 2009), can be transformed into a linear program by using the concept of marginal polytope (Wainwright and Jordan, 2008), as next described. Let Δ_d := conv{e_1, …, e_d} denote the d-dimensional probability simplex,
where e_k has all entries equal to zero except for the kth one, which is equal to one. We introduce "local" probability distributions over the variables and factors, p_i ∈ Δ_{|𝒳_i|} and q_α ∈ Δ_{|𝒳_α|}. We stack these distributions into vectors p and q, with dimensions Σ_{i ∈ V} |𝒳_i| and Σ_{α ∈ F} |𝒳_α|, respectively, and denote by μ := (p, q) their concatenation. We say that μ is globally consistent if the local distributions p_i and q_α are the marginals of some global joint distribution. The set of all globally consistent vectors μ is called the marginal polytope of G, denoted M(G). There is a one-to-one mapping between the vertices of M(G) and the set of possible configurations 𝒳: each x ∈ 𝒳 corresponds to a vertex of M(G) consisting of "degenerate" distributions p_i := e_{x_i} and q_α := e_{x_α}, for each i ∈ V and α ∈ F. The MAP problem in (2) is then equivalent to the following linear program (LP):
The equivalence between (2) and (4) stems from the fact that any linear program on a bounded convex constraint polytope attains a solution at a vertex of that polytope (see, e.g., Rockafellar (1970)). Unfortunately, the linear program (4) is not easier to solve than (2): for a graph with cycles (which induce global consistency constraints that are hard to specify concisely), the number of linear inequalities that characterize M(G) may grow superpolynomially with the size of G. As a consequence, approximations of M(G) have been actively investigated in recent years.
2.3 LP-MAP Inference
Tractable approximations of M(G) can be built by using weaker constraints that all realizable marginals must satisfy to ensure local consistency:
Non-negativity: all local marginals p_i and q_α must be non-negative.
Normalization: local marginals must be properly normalized, i.e., Σ_{x_i} p_i(x_i) = 1, for all i ∈ V, and Σ_{x_α} q_α(x_α) = 1, for all α ∈ F.
Marginalization: a variable participating in a factor must have a marginal which is consistent with the factor marginals, i.e., Σ_{x_α ∼ x_i} q_α(x_α) = p_i(x_i), for all (i, α) ∈ E and x_i ∈ 𝒳_i, where the sum is over the factor configurations x_α consistent with x_i.
Note that some of these constraints are redundant: the non-negativity of q_α and the marginalization constraint imply the non-negativity of p_i; the normalization of q_α and the marginalization constraints imply the normalization of p_i. For convenience, we express those constraints in vector notation. For each (i, α) ∈ E, we define a (consistency) matrix M_iα, whose (x_i, x_α) entry is 1 if x_α is consistent with x_i and 0 otherwise.
The local marginalization constraints can then be expressed as M_iα q_α = p_i, for all (i, α) ∈ E. Combining all the local consistency constraints, and dropping redundant ones, yields the so-called local polytope L(G):
The number of constraints that define L(G) grows linearly with the size of G, rather than superpolynomially. The elements of L(G) are called pseudo-marginals. Since any true marginal vector μ ∈ M(G) must satisfy the constraints above, L(G) is an outer approximation, i.e., M(G) ⊆ L(G),
as illustrated in Fig. 2.
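To make these constraint families concrete, the sketch below (our own illustrative helper, not from the paper's released code) checks the three conditions for a single pairwise factor linking two binary variables:

```python
import numpy as np

def is_locally_consistent(p_i, p_j, q_a, tol=1e-9):
    """Check the local-polytope constraints for one pairwise factor:
    p_i, p_j are unary pseudo-marginals; q_a is the 2x2 factor one."""
    p_i, p_j, q_a = np.asarray(p_i), np.asarray(p_j), np.asarray(q_a)
    if (q_a < -tol).any():                      # non-negativity
        return False
    if abs(q_a.sum() - 1.0) > tol:              # normalization
        return False
    # Marginalization: summing out either variable recovers the unaries
    # (as noted above, this makes the unary constraints redundant).
    return (np.allclose(q_a.sum(axis=1), p_i, atol=tol)
            and np.allclose(q_a.sum(axis=0), p_j, atol=tol))
```

For instance, the fractional point p_i = p_j = (0.5, 0.5) with q_a placing mass 0.5 on each agreeing configuration passes the check; with a single factor, local consistency coincides with global consistency, and a gap between M(G) and L(G) only arises on graphs with cycles.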
This approximation is tight for tree-structured graphs and other special cases, but not in general (even for small graphs with cycles) (Wainwright and Jordan, 2008). The LP-MAP problem results from replacing M(G) by L(G) in (4) (Schlesinger, 1976):
It can be shown that the points of L(G) that are integer are precisely the vertices of M(G); moreover, since L(G) is an outer approximation of M(G), the solution of LP-MAP provides an upper bound on the optimal objective of MAP.
2.4 LP-MAP Inference Algorithms
While any off-the-shelf LP solver can be used for solving (8), specialized algorithms have been designed to exploit the graph structure, achieving superior performance on several benchmarks (Yanover et al., 2006). Most of these specialized algorithms belong to two classes: block (dual) coordinate descent methods, which take the form of message-passing algorithms, and projected subgradient methods, based on dual decomposition.
Block coordinate descent methods address the dual of (8) by alternately optimizing over blocks of coordinates. Different choices of blocks lead to different algorithms: max-sum diffusion (Kovalevsky and Koval, 1975; Werner, 2007); max-product sequential tree-reweighted belief propagation (TRW-S) (Wainwright et al., 2005; Kolmogorov, 2006); max-product linear programming (MPLP) (Globerson and Jaakkola, 2008). These algorithms work by passing messages (that require computing max-marginals) between factors and variables. Under certain conditions, if the relaxation is tight, one may obtain optimality certificates. If the relaxation is not tight, it is sometimes possible to reduce the problem size or use a cutting-plane method to move toward the exact MAP (Sontag et al., 2008). A disadvantage of block coordinate descent algorithms is that they may stop at suboptimal solutions, since the objective is non-smooth (Bertsekas et al. 1999, Section 6.3.4).
Projected subgradient based on dual decomposition is a classical technique (Dantzig and Wolfe, 1960; Everett III, 1963; Shor, 1985), which was first proposed in the context of graphical models by Komodakis et al. (2007) and Johnson et al. (2007). Due to its strong relationship with the approach pursued in this paper, that method is described in detail in the next section.
3 Dual Decomposition with the Projected Subgradient Algorithm
The projected subgradient method for dual decomposition can be seen as an iterative “consensus algorithm,” alternating between a broadcast operation, where local subproblems are distributed across worker nodes, and a gather operation, where the local solutions are assembled by a master node to update the current global solution. At each round, the worker nodes perform local MAP inference independently; hence, the algorithm is completely modular and trivially parallelizable.
For each edge (i, α) ∈ E, define a potential function θ_iα that satisfies Σ_{α ∈ N(i)} θ_iα = θ_i; a trivial choice is θ_iα = θ_i / |N(i)|. The objective (8) can be rewritten as
because the θ_iα sum to θ_i over α ∈ N(i). The LP-MAP problem (8) is then equivalent to the following primal formulation, which we call LP-MAP-P:
Note that, although the p-variables do not appear in the objective of (10), they play a fundamental role through the constraints in the last line, which are necessary to ensure that the marginals encoded in the q-variables are consistent on their overlaps. Indeed, it is this set of constraints that complicates the optimization problem, which would otherwise be separable into independent subproblems, one per factor. Introducing a Lagrange multiplier λ_iα for each of these equality constraints leads to the Lagrangian function
the maximization of which w.r.t. p and q will yield the (Lagrangian) dual objective. Since the p-variables are unconstrained, we have
where g(λ) is the dual objective function,
and each q̂_α is the solution of a local problem (called the α-subproblem):
the last equality is justified by the fact that maximizing a linear objective over the probability simplex gives the largest component of the score vector. Finally, the dual problem is
Problem (16) is often referred to as the master or controller problem, and each α-subproblem (14)–(15) as a slave or worker. Note that each of these subproblems is itself a local MAP problem, involving only factor α. As a consequence, the solution of (14) is an indicator vector corresponding to a particular configuration x̂_α (the solution of (15)), that is, q̂_α = e_{x̂_α}.
and the projection onto the feasible set of multipliers amounts to a centering operation.
The α-subproblems (14)–(15)
can be handled in parallel, and their solutions are then gathered
to compute this projection and update the Lagrange variables.
Putting these pieces together yields Algorithm 1,
which assumes a black-box procedure ComputeMap that returns
a local MAP assignment (as an indicator vector), given log-potentials as input.
At each iteration, the algorithm broadcasts the current Lagrange multipliers
to all the factors.
Each factor adjusts its internal unary log-potentials (line 6) and
invokes the ComputeMap procedure (line 7).
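As a toy instance of Algorithm 1 (our own sketch with hypothetical names, not the released code), consider a single binary variable shared by two slaves, each holding only unary log-potentials, so that ComputeMap reduces to an argmax:

```python
import numpy as np

def dd_subgradient(theta_slaves, n_iters=200):
    """Projected subgradient dual decomposition for one shared variable."""
    n_slaves, n_states = theta_slaves.shape
    lam = np.zeros((n_slaves, n_states))   # one multiplier vector per slave
    consensus = np.full(n_states, 1.0 / n_states)
    for t in range(1, n_iters + 1):
        # Broadcast: each slave solves its local MAP (here, an argmax)
        # with multiplier-adjusted potentials (cf. lines 6-7 of Algorithm 1).
        locals_ = np.zeros((n_slaves, n_states))
        for s in range(n_slaves):
            locals_[s, np.argmax(theta_slaves[s] + lam[s])] = 1.0
        # Gather: average the local solutions.
        consensus = locals_.mean(axis=0)
        # Subgradient update with a diminishing 1/t step size; note that
        # the multipliers automatically keep summing to zero over slaves.
        lam -= (1.0 / t) * (locals_ - consensus)
    return consensus

theta = np.array([[0.0, 1.0],    # slave 1 mildly prefers state 1
                  [2.0, 0.0]])   # slave 2 strongly prefers state 0
consensus = dd_subgradient(theta)  # settles on state 0, the joint argmax
```

The diminishing 1/t schedule is what convergence requires; with many overlapping slaves, this same schedule is what makes reaching consensus slow in practice.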
Proposition 2 (Convergence rate)
Proof: This is a property of projected subgradient algorithms (see, e.g., Bertsekas et al. 1999).
Proposition 3 (Certificate of optimality)
Proof: If all local subproblems are in agreement, then a vacuous update occurs in line 11, and no further changes can occur. Since the algorithm is guaranteed to converge, the current λ is optimal. Also, if all local subproblems are in agreement, the averaging in line 10 necessarily yields a binary vector. Since any binary solution of LP-MAP is also a solution of MAP, the result follows.
Propositions 2–3 imply that, if the parameters are such that LP-MAP is a tight relaxation, then Algorithm 1 yields the exact MAP configuration along with a certificate of optimality. According to Proposition 2, even if the relaxation is not tight, Algorithm 1 still converges to a solution of LP-MAP. In some problem domains, the LP-MAP relaxation is often tight (Koo and Collins, 2010; Rush et al., 2010); for problems with a relaxation gap, techniques to tighten the relaxation have been developed (Rush and Collins, 2011). However, in large graphs with many overlapping factors, it has been observed that Algorithm 1 can converge quite slowly in practice (Martins et al., 2011b). This is not surprising, given that it attempts to reach a consensus among all overlapping components; the larger this number, the harder it is to achieve consensus. In the next section, we propose a new LP-MAP inference algorithm that is more suitable for this class of problems.
4 Alternating Directions Dual Decomposition (AD)
4.1 Addressing the Weaknesses of Dual Decomposition with Projected Subgradient
The main weaknesses of Algorithm 1 reside in the following two aspects.
The dual objective function is non-smooth, which is why "subgradients" are used instead of "gradients." It is well known that non-smooth optimization lacks some of the good properties of its smooth counterpart. Namely, there is no guarantee of monotonic improvement in the objective (see Bertsekas et al. 1999, p. 611). Ensuring convergence requires using a diminishing step size sequence, which leads to slow convergence rates. In fact, as stated in Proposition 2, O(1/ε²) iterations are required to guarantee ε-accuracy.
A close look at Algorithm 1 reveals that the consensus is promoted solely by the Lagrange multipliers (line 6). In an economic interpretation, these represent "price adjustments" that lead to a reallocation of resources. However, no "memory" exists about past allocations or adjustments, so the workers never know how far they are from consensus. It is thus conceivable that a smart use of these quantities could accelerate convergence.
To obviate the first of these problems, Johnson et al. (2007) proposed smoothing the objective function through an "entropic" perturbation (controlled by a "temperature" parameter), which boils down to replacing the max in (16) by a "soft-max." As a result, all the local subproblems become marginal (rather than MAP) computations. A related and asymptotically faster method was proposed later by Jojic et al. (2010), who address the resulting smooth optimization problem with an accelerated gradient method (Nesterov, 1983, 2005). That approach guarantees an ε-accurate solution after O(1/ε) iterations, an improvement over the O(1/ε²) bound of Algorithm 1. However, those methods have some drawbacks. First, they need to operate at near-zero temperatures; e.g., the iteration bound of Jojic et al. (2010) requires setting the temperature to O(ε), which scales the potentials by O(1/ε) and may lead to numerical instabilities in some problems. Second, the solutions of the local subproblems are always dense; although some marginal values may be low, they are never exactly zero. This contrasts with the projected subgradient algorithm, for which the solutions of the local subproblems are MAP configurations. These configurations can be cached across iterations, leading to important speedups (Koo et al., 2010).
We will show that AD also yields an O(1/ε) iteration bound without suffering from the two drawbacks above. Unlike Algorithm 1, AD broadcasts the current global solution to the workers, thus allowing them to regularize their subproblems toward that solution. This promotes a faster consensus, without sacrificing the modularity of dual decomposition. Another advantage of AD is that it keeps track of primal and dual residuals, allowing the user to monitor the LP-MAP optimization process and to stop when a desired accuracy level is attained.
4.2 Augmented Lagrangians and the Alternating Directions Method of Multipliers
Let us start with a brief overview of augmented Lagrangian methods. Consider the following general convex optimization problem with equality constraints:
where U and V are convex sets and f and g are concave functions. Note that the LP-MAP problem stated in (10) has this form. For any η > 0, consider the problem
which differs from (18) in the extra term penalizing violations of the equality constraints; since this term vanishes at feasibility, the two problems have the same solution. The reason to consider (19) is that the quadratic penalty adds curvature to its objective, even when f and g are not strongly concave. The Lagrangian of problem (19),
is the η-augmented Lagrangian of problem (18). The so-called augmented Lagrangian (AL) methods (Bertsekas et al., 1999, Section 4.2) address problem (18) by seeking a saddle point of the η_t-augmented Lagrangian, for some sequence (η_t)_t. The earliest instance is the method of multipliers (MM) (Hestenes, 1969; Powell, 1969), which alternates between a joint update of u and v through
and a gradient update of the Lagrange multiplier,
Under some conditions, this method is convergent, and even superlinear, if the sequence (η_t)_t is increasing (Bertsekas et al. 1999, Section 4.2). A shortcoming of the MM is that problem (21) may be difficult to solve, since the penalty term of the augmented Lagrangian couples the variables u and v. The alternating directions method of multipliers (ADMM) avoids this shortcoming, by replacing the joint optimization (21) with a single block Gauss-Seidel-type step:
In general, problems (23)–(24) are simpler than the joint maximization in (21). ADMM was proposed by Glowinski and Marroco (1975) and Gabay and Mercier (1976) and is related to other optimization methods, such as Douglas-Rachford splitting (Eckstein and Bertsekas, 1992) and proximal point methods (see Boyd et al. 2011 for a historical overview).
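A minimal scalar instance may help fix ideas (our toy, not from the paper): minimize (u − a)² + (v − b)² subject to u = v. Both block updates have closed forms, and the multiplier follows a gradient step on the dual:

```python
def admm_toy(a, b, rho=1.0, n_iters=200):
    """ADMM for: minimize (u - a)^2 + (v - b)^2  subject to  u = v."""
    u = v = lam = 0.0
    for _ in range(n_iters):
        # u-update: argmin_u (u - a)^2 + lam*u + (rho/2)(u - v)^2
        u = (2 * a - lam + rho * v) / (2 + rho)
        # v-update (Gauss-Seidel style: it uses the fresh u)
        v = (2 * b + lam + rho * u) / (2 + rho)
        # multiplier (dual gradient) update with fixed step rho
        lam += rho * (u - v)
    return u, v

u, v = admm_toy(0.0, 1.0)   # both variables converge to the consensus 0.5
```

Note how each block update only requires minimizing over its own variable with the other held fixed, which is exactly what makes ADMM cheaper than the joint MM update (21).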
4.3 Derivation of AD
let U in (18) be the Cartesian product of simplices, U := ∏_{α ∈ F} Δ_{|𝒳_α|}, and let V be a Euclidean space of dimension Σ_{i ∈ V} |𝒳_i|;
let f in (18) be the objective of (10), as a function of q, and let g := 0;
let A in (18) be a block-diagonal matrix, with one block per factor α, each block being a vertical concatenation of the matrices M_iα, for i ∈ N(α);
let B be a matrix of grid-structured blocks, where the block in the (i, α)th block-row and the ith block-column is a negative identity matrix of size |𝒳_i|, and all the other blocks are zero;
The η-augmented Lagrangian associated with (10) is
This is the standard Lagrangian in (11) plus the Euclidean penalty term. The ADMM updates are
We next analyze the broadcast and gather steps, and prove a proposition about the multiplier update.
Broadcast step. The maximization (26) can be carried out in parallel at the factors, as in Algorithm 1. The only difference is that, instead of a local MAP computation, each soft-factor worker now needs to solve a quadratic program of the form:
The subproblem (29) differs from the linear subproblem (14)–(15) in the projected subgradient algorithm by including a Euclidean penalty term, which penalizes deviations from the global consensus. Sections 5 and 6 propose procedures for solving the local subproblems (29).
Gather step. The solution of (27) has a closed form. Indeed, this problem is separable into independent optimizations, one for each i ∈ V; defining the average over the factors α ∈ N(i),
Assembling all these pieces together leads to AD (Algorithm 2). Notice that AD retains the modular structure of Algorithm 1: both are iterative consensus algorithms, alternating between a broadcast operation, where subproblems are distributed across local workers (lines 5–9 in Algorithm 2), and a gather operation, where the local solutions are assembled by a master node, which updates the global solution (line 10) and adjusts multipliers to promote a consensus (line 11). The key difference is that AD also broadcasts the current global solution to the workers, allowing them to regularize their subproblems toward that solution, thus speeding up the consensus. This is embodied in the procedure SolveQP (line 7), which replaces ComputeMAP of Algorithm 1.
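The following self-contained sketch (our illustration; variable names are ours, and the real solver handles general factors) runs this broadcast/gather scheme on the same toy setting used earlier for Algorithm 1: one variable shared by purely unary slaves, for which SolveQP reduces to a Euclidean projection onto the simplex:

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(x)) + 1))[0][-1]
    return np.maximum(x - css[rho] / (rho + 1.0), 0.0)

def ad3_toy(theta_slaves, eta=1.0, n_iters=300):
    """Consensus ADMM for one variable shared by purely unary slaves."""
    n_slaves, n_states = theta_slaves.shape
    z = np.full(n_states, 1.0 / n_states)   # global solution
    lam = np.zeros((n_slaves, n_states))    # Lagrange multipliers
    q = np.zeros((n_slaves, n_states))      # local solutions
    for _ in range(n_iters):
        # Broadcast: each slave solves its quadratic subproblem (29);
        # for a purely unary slave, this is a projection onto the simplex
        # of the consensus shifted by the adjusted potentials.
        for s in range(n_slaves):
            q[s] = project_simplex(z + (theta_slaves[s] + lam[s]) / eta)
        # Gather: with multipliers summing to zero, the update is an average.
        z = q.mean(axis=0)
        # Multiplier update with the *fixed* step size eta.
        lam -= eta * (q - z)
    return z

z = ad3_toy(np.array([[0.0, 1.0], [2.0, 0.0]]))  # consensus on state 0
```

Note the contrast with the subgradient sketch: the step size η stays fixed, and the quadratic term pulls each slave toward the current consensus z, so agreement is typically reached in far fewer rounds.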
4.4 Convergence Analysis
Proof of Convergence
Convergence of AD follows directly from the general convergence properties of ADMM. Remarkably, unlike in Algorithm 1 (projected subgradient), convergence is ensured with a fixed step size.
Proposition 5 (Convergence)
Let (q^t, p^t, λ^t)_t be the sequence of iterates produced by Algorithm 2 with a fixed η > 0. Then the following holds:
Although Proposition 5 guarantees convergence for any choice of η, this parameter may strongly impact the behavior of the algorithm, as illustrated in Fig. 3. In our experiments, we dynamically adjust η in earlier iterations using the heuristic described in Boyd et al. (2011), and freeze it afterwards, so as not to compromise convergence.
Primal and dual residuals
Recall from Proposition 3 that the projected subgradient algorithm yields
optimality certificates if the relaxation is tight (i.e., if the solution of the LP-MAP
problem is the exact MAP); in problems with a relaxation gap, however, it is harder to decide when to stop.
It would be desirable to have similar guarantees concerning the relaxed primal.
where the constant in the denominator ensures that the residual is normalized. The dual residual,
the amount by which a dual optimality condition is violated (Boyd et al., 2011), plays a complementary role. We adopt as a stopping criterion that these two residuals fall below a threshold, e.g., 10⁻⁶.
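A sketch of such a stopping test (the exact normalizations below are ours, chosen for illustration; Boyd et al. 2011 discuss several variants):

```python
import numpy as np

def residuals(q_locals, z, z_prev, eta):
    """Primal residual: disagreement between local solutions and the
    global z. Dual residual: how much z moved in the last iteration."""
    q_locals, z, z_prev = map(np.asarray, (q_locals, z, z_prev))
    r_primal = np.linalg.norm(q_locals - z) / np.sqrt(q_locals.size)
    r_dual = eta * np.linalg.norm(z - z_prev) / np.sqrt(z.size)
    return r_primal, r_dual

def should_stop(q_locals, z, z_prev, eta, tol=1e-6):
    rp, rd = residuals(q_locals, z, z_prev, eta)
    return rp < tol and rd < tol
```

Both residuals must be small: a tiny primal residual alone only says the workers agree with the current z, not that z has stopped moving toward the LP-MAP optimum.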
Approximate solutions of the local subproblems
The next proposition states that convergence may still hold, even if the local subproblems are solved approximately, provided the sequence of errors is summable. The importance of this result will be clear in Section 6, where we describe a general iterative algorithm for solving the local quadratic subproblems. Essentially, Proposition 6 allows these subproblems to be solved numerically up to some accuracy without compromising global convergence, as long as the accuracy of the solutions improves sufficiently fast over ADMM iterations.
Although ADMM was invented in the 1970s, its convergence rate was unknown until recently. The next proposition states the O(1/ε) iteration bound of AD, asymptotically equivalent to that of the algorithm of Jojic et al. (2010), and therefore better than the O(1/ε²) bound of Algorithm 1.
As expected, the bound (37) increases with the number of overlapping variables, the sizes of the variable domains, and the magnitude of the optimal dual vector. Note that if there is a good estimate of that magnitude, then (37) can be used to choose a step size η that minimizes the bound. Moreover, unlike the bound of the accelerated method of Jojic et al. (2010), which requires specifying ε beforehand, AD does not require pre-specifying a desired accuracy.
The bounds derived so far for all these algorithms are with respect to the dual problem—an open problem is to obtain bounds related to primal convergence.
Runtime and caching strategies
In practice, considerable speed-ups can be achieved by caching the subproblems, a strategy which has also been proposed for the projected subgradient algorithm (Koo et al., 2010). After a few iterations, many variables reach a consensus (i.e., their local solutions agree with the global one) and enter an idle state: they are left unchanged by the global update (32), and so are the Lagrange variables (28). If, at some iteration, all variables in the subproblem of a factor α are idle, then that subproblem's input is unchanged, hence it does not need to be solved again. Typically, many variables and subproblems enter this idle state after the first few rounds. We will show the practical benefits of caching in the experimental section (Section 7.5).
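The caching idea can be sketched as a memoization layer around the subproblem solver (function names are ours; the released code implements a finer-grained version of this bookkeeping):

```python
import numpy as np

def solve_all(factor_inputs, solver, cache):
    """Solve each factor subproblem, reusing the cached solution whenever
    the factor's input (its z-values and multipliers, combined into one
    array) is unchanged since the previous iteration."""
    solutions, n_solved = {}, 0
    for fid, x in factor_inputs.items():
        key = x.tobytes()
        hit = cache.get(fid)
        if hit is not None and hit[0] == key:
            solutions[fid] = hit[1]          # idle factor: reuse solution
        else:
            solutions[fid] = solver(x)       # active factor: solve again
            cache[fid] = (key, solutions[fid])
            n_solved += 1
    return solutions, n_solved
```

On a second call with identical inputs, no subproblem is solved at all, which is exactly what happens once all of a factor's variables go idle.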
4.5 Exact Inference with Branch-and-Bound
Recall that AD, as just described, solves the LP-MAP relaxation of the actual problem. In some problems, this relaxation is tight (in which case a certificate of optimality will be obtained), but this is not always the case. When a fractional solution is obtained, it is desirable to have a strategy to recover the exact MAP solution.
Two observations are noteworthy. First, as we saw in Section 2.3, the optimal value of the relaxed problem LP-MAP provides an upper bound to the original problem MAP. In particular, any feasible dual point provides an upper bound to the original problem's optimal value. Second, during execution of the AD algorithm, we always keep track of a sequence of feasible dual points (as guaranteed by Proposition 5, item 4). Therefore, each iteration constructs tighter and tighter upper bounds. In recent work, we proposed a branch-and-bound search procedure that finds the exact solution of the integer LP. The procedure works recursively as follows:
Initialize the incumbent value to −∞ (our best value so far).
Solve the LP-MAP relaxation under the current constraints, obtaining an upper bound; if this bound is no larger than the incumbent, prune and backtrack; if the relaxed solution is integer, update the incumbent and backtrack.
Otherwise, find the "most fractional" component of the relaxed solution and branch: for every possible value of that component, constrain it to that value and go to step 2, eventually obtaining an integer solution or infeasibility. Return the assignment that yields the largest objective value.
Although this procedure has worst-case exponential runtime, it is empirically very effective in many problems whose relaxations are near-exact. We will see one example in Section 7.4.
5 Local Subproblems in AD
This section shows how to solve the AD local subproblems (29) exactly and efficiently, in several cases, including Ising models and logic constraint factors. These results will be complemented in Section 6, where a new procedure to handle arbitrary factors widens the applicability of AD.
By subtracting a constant from the objective (29), re-scaling, and turning the maximization into a minimization, the problem can be written more compactly as
with respect to the factor marginals q_α ∈ Δ_{|𝒳_α|},
where the vector a aggregates the adjusted unary terms and b the factor log-potentials.
We show that (5.1) has a closed-form solution or can be solved exactly and efficiently, in several cases; e.g., for Ising models, for factor graphs imposing first-order logic (FOL) constraints, and for Potts models (after binarization). In these cases, AD and the projected subgradient algorithm have (asymptotically) the same computational cost per iteration, up to a logarithmic factor.
5.2 Ising Models
Ising models are factor graphs containing only binary pairwise factors. A binary pairwise factor (say, α) is one connecting two binary variables (say, X_i and X_j); thus 𝒳_i = 𝒳_j = {0, 1} and 𝒳_α = {0, 1}². Given that p_i ∈ Δ_2, we can write p_i = (1 − z_1, z_1), p_j = (1 − z_2, z_2). Furthermore, since q_α ∈ Δ_4 and marginalization requires that q_α(1, 1) + q_α(1, 0) = z_1 and q_α(1, 1) + q_α(0, 1) = z_2, we can also write