Bregman Alternating Direction Method of Multipliers
The mirror descent algorithm (MDA) generalizes gradient descent by using a Bregman divergence to replace squared Euclidean distance. In this paper, we similarly generalize the alternating direction method of multipliers (ADMM) to Bregman ADMM (BADMM), which allows the choice of different Bregman divergences to exploit the structure of problems. BADMM provides a unified framework for ADMM and its variants, including generalized ADMM, inexact ADMM and Bethe ADMM. We establish the global convergence and the iteration complexity for BADMM. In some cases, BADMM can be faster than ADMM by a factor that depends on the dimension of the problem. Experimental results are illustrated on the mass transportation problem, which can be solved in parallel by BADMM. BADMM is faster than ADMM and the highly optimized commercial software Gurobi, particularly when implemented on GPU.
1 Introduction
In recent years, the alternating direction method of multipliers (ADM or ADMM) [boyd10] has been successfully applied in a broad spectrum of applications, ranging from image processing [Figueiredo10, osher10:admm] to applied statistics and machine learning [yang09, wang12:oadm, wang13:ADMMMAP]. For further understanding of ADMM, we refer the reader to the comprehensive review by [boyd10] and references therein. In particular, ADMM considers the problem of minimizing composite objective functions subject to an equality constraint:
$$\min_{x \in \mathcal{X},\, z \in \mathcal{Z}} f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c, \tag{1}$$
where $f$ and $g$ are convex functions, $A$ and $B$ are matrices and $c$ is a vector of compatible dimensions, and $\mathcal{X}$ and $\mathcal{Z}$ are convex sets. $f$ and $g$ can be non-smooth functions, including indicator functions of convex sets. Many machine learning problems can be cast into the framework of minimizing a composite objective [nest07:composite, Duchi10_comid], where $f$ is a loss function such as hinge or logistic loss, and $g$ is a regularizer, e.g., $\ell_1$ norm, $\ell_2$ norm, nuclear norm or total variation. The two functions usually have different structures and constraints because they serve different purposes in data mining. Therefore, it is useful and sometimes necessary to split and solve them separately, which is exactly the forte of ADMM.
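In each iteration, ADMM updates the splitting variables separately and alternately by minimizing the augmented Lagrangian of (1), which is defined as follows:
$$L_\rho(x, z, y) = f(x) + g(z) + \langle y, Ax + Bz - c \rangle + \frac{\rho}{2}\|Ax + Bz - c\|_2^2, \tag{2}$$
where $y$ is the dual variable, $\rho > 0$ is the penalty parameter, and the quadratic penalty term penalizes violation of the equality constraint. ADMM consists of the following three updates:
$$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} f(x) + \langle y_t, Ax + Bz_t - c \rangle + \frac{\rho}{2}\|Ax + Bz_t - c\|_2^2, \tag{3}$$
$$z_{t+1} = \operatorname{argmin}_{z \in \mathcal{Z}} g(z) + \langle y_t, Ax_{t+1} + Bz - c \rangle + \frac{\rho}{2}\|Ax_{t+1} + Bz - c\|_2^2, \tag{4}$$
$$y_{t+1} = y_t + \rho(Ax_{t+1} + Bz_{t+1} - c). \tag{5}$$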
Since the computational complexity of the $y$ update (5) is trivial, the computational complexity of ADMM lies in the $x$ and $z$ updates (3)-(4), which amount to solving proximal minimization problems using the quadratic penalty term. Inexact ADMM [yang09, boyd10] and generalized ADMM [deng12:admm] have also been proposed to solve the updates inexactly by linearizing the functions and adding additional quadratic terms. Recently, online ADMM [wang12:oadm] and Bethe-ADMM [wang13:ADMMMAP] add an additional Bregman divergence to the $x$ update while keeping or linearizing the quadratic penalty term $\frac{1}{2}\|Ax + Bz - c\|_2^2$. As far as we know, all existing ADMMs use quadratic penalty terms.
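As a minimal sketch of the three ADMM updates (3)-(5), consider the toy instance $f(x) = \frac{1}{2}\|x - a\|_2^2$ and $g(z) = \lambda\|z\|_1$ with the constraint $x - z = 0$ (that is, $A = I$, $B = -I$, $c = 0$). This instance, the function names `soft_threshold` and `admm_l1_denoise`, and the default parameter values are illustrative assumptions, not taken from the paper. Both subproblems have closed forms here, and the iterates should approach the soft-thresholded vector.

```python
import numpy as np


def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def admm_l1_denoise(a, lam, rho=1.0, iters=200):
    """ADMM for min 0.5*||x - a||^2 + lam*||z||_1  s.t.  x - z = 0.

    Toy instance chosen for illustration (an assumption, not from the paper).
    """
    x = np.zeros_like(a)
    z = np.zeros_like(a)
    y = np.zeros_like(a)  # dual variable
    for _ in range(iters):
        # x update: minimize a quadratic, set the gradient to zero
        x = (a - y + rho * z) / (1.0 + rho)
        # z update: l1 proximal step on x + y / rho
        z = soft_threshold(x + y / rho, lam / rho)
        # dual update with step size rho
        y = y + rho * (x - z)
    return x, z, y
```

The $y$ update is a single vector operation, which is why the cost of the method sits in the $x$ and $z$ updates.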
A large body of literature shows that replacing the quadratic term by a Bregman divergence in gradient-type methods can greatly boost their performance in solving constrained optimization problems. First, the use of a Bregman divergence can effectively exploit the structure of problems [chen93:proxBreg, Beck03, Duchi10_comid], e.g., in computerized tomography [ben01:mda], clustering problems and exponential family distributions [bane05:bregman]. Second, in some cases, the gradient descent method with Kullback-Leibler (KL) divergence can outperform the method with the quadratic term by a factor that depends on the dimensionality of the problem [Beck03, ben01:mda]. The mirror descent algorithm (MDA) and composite objective mirror descent (COMID) [Duchi10_comid] use a Bregman divergence to replace the quadratic term in gradient descent or proximal gradient [comb09:prox]. The proximal point method with D-functions (PMD) [chen93:proxBreg, ceze98] and Bregman proximal minimization (BPM) [kiwiel95:gbregman] generalize the proximal point method by using a Bregman divergence to replace the quadratic term.
On the side of ADMM, it is still unknown whether the quadratic penalty term in ADMM can be replaced by a Bregman divergence, although the convergence of ADMM is well understood. The proof of global convergence of ADMM can be found in [Gabay83, boyd10]. Recently, it has been shown that ADMM converges at a rate of $O(1/T)$ [wang12:oadm, he12:vi], where $T$ is the number of iterations. For strongly convex functions, the dual objective of an accelerated version of ADMM can converge at a rate of $O(1/T^2)$ [Goldstein12:fadmm]. Under suitable assumptions like strongly convex functions or a sufficiently small step size for the dual variable update, ADMM can achieve a linear convergence rate [deng12:admm, luo12:admm]. However, as pointed out by [boyd10], “There is currently no proof of convergence known for ADMM with nonquadratic penalty terms.”
In this paper, we propose Bregman ADMM (BADMM), which uses Bregman divergences to replace the quadratic penalty term in ADMM, answering the question raised in [boyd10]. More specifically, the quadratic penalty term in the $x$ and $z$ updates (3)-(4) will be replaced by a Bregman divergence in BADMM. We also introduce a generalized version of BADMM where two additional Bregman divergences are added to the $x$ and $z$ updates. The generalized BADMM (BADMM for short) provides a unified framework for solving (1), which allows one to choose a suitable Bregman divergence so that the $x$ and $z$ updates can be solved efficiently. BADMM includes ADMM and its variants as special cases. In particular, BADMM replaces all quadratic terms in generalized ADMM [deng12:admm] with Bregman divergences. By choosing a proper Bregman divergence, we also show that inexact ADMM [yang09] and Bethe ADMM [wang13:ADMMMAP] can be considered as special cases of BADMM. BADMM generalizes ADMM in the same way that MDA generalizes gradient descent and PMD generalizes proximal methods. In BADMM, the $x$ and $z$ updates can take the form of MDA or PMD. We establish the global convergence and the iteration complexity for BADMM. In some cases, we show that BADMM can outperform ADMM by a factor that depends on the dimension of the problem. We evaluate the performance of BADMM in solving the linear programming problem of mass transportation [cock41:mt]. By exploiting the structure of the problem, BADMM leads to massive parallelism and can easily run on GPU. BADMM can even be orders of magnitude faster than the highly optimized commercial software Gurobi. While Gurobi breaks down in solving a linear program with hundreds of millions of parameters on a server, BADMM takes hundreds of seconds running on a single GPU.
The rest of the paper is organized as follows. In Section 2, we propose Bregman ADMM and discuss several special cases of BADMM. In Section 3, we establish the convergence of BADMM. In Section 4, we consider illustrative applications of BADMM, and conclude in Section 5.
2 Bregman Alternating Direction Method of Multipliers
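Let $\phi: \Omega \to \mathbb{R}$ be a continuously differentiable and strictly convex function on the relative interior of a convex set $\Omega$. Denote $\nabla\phi(y)$ as the gradient of $\phi$ at $y$. We define the Bregman divergence $B_\phi$ induced by $\phi$ as
$$B_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle.$$
(The definition of Bregman divergence has been generalized to nondifferentiable functions [kiwiel95:gbregman, teda12:gbregman].) Since $\phi$ is strictly convex, $B_\phi(x, y) \ge 0$, where equality holds if and only if $x = y$. More details about Bregman divergence can be found in [chen93:proxBreg, bane05:bregman]. Two of the most commonly used examples are the squared Euclidean distance $B_\phi(x, y) = \frac{1}{2}\|x - y\|_2^2$ and the KL divergence $B_\phi(x, y) = \sum_{i} x_i \log\frac{x_i}{y_i}$.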
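The definition translates directly into code. The sketch below is a minimal illustration, assuming the generating function and its gradient are passed in as callables; the helper names are hypothetical. It recovers the two examples above: half the squared Euclidean norm gives the squared Euclidean distance, and the negative entropy gives the KL divergence when both arguments lie on the probability simplex.

```python
import numpy as np


def bregman(phi, grad_phi, x, y):
    """Bregman divergence B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)


def half_sq_norm(x):
    return 0.5 * np.dot(x, x)


def half_sq_norm_grad(x):
    return x


def neg_entropy(x):
    # Requires strictly positive entries.
    return np.sum(x * np.log(x))


def neg_entropy_grad(x):
    return np.log(x) + 1.0
```

For example, `bregman(neg_entropy, neg_entropy_grad, x, y)` equals `np.sum(x * np.log(x / y))` whenever `x` and `y` are strictly positive and have the same total mass.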
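Assuming $B_\phi(c - Ax, Bz)$ is well defined, we replace the quadratic penalty term in the augmented Lagrangian (2) by a Bregman divergence as follows:
$$L_\rho^\phi(x, z, y) = f(x) + g(z) + \langle y, Ax + Bz - c \rangle + \rho B_\phi(c - Ax, Bz). \tag{6}$$
Unfortunately, we cannot derive Bregman ADMM (BADMM) updates by simply minimizing $L_\rho^\phi(x, z, y)$ alternately as ADMM does, because Bregman divergences are not necessarily convex in the second argument. More specifically, given $(z_t, y_t)$, $x_{t+1}$ can be obtained by solving $\min_{x \in \mathcal{X}} L_\rho^\phi(x, z_t, y_t)$, where the quadratic penalty term $\frac{1}{2}\|Ax + Bz_t - c\|_2^2$ for ADMM in (3) is replaced with $B_\phi(c - Ax, Bz_t)$ in the $x$ update of BADMM. However, given $(x_{t+1}, y_t)$, we cannot obtain $z_{t+1}$ by solving $\min_{z \in \mathcal{Z}} L_\rho^\phi(x_{t+1}, z, y_t)$, since the term $B_\phi(c - Ax_{t+1}, Bz)$ need not be convex in $z$. This observation motivates a closer look at the role of the quadratic term in ADMM.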
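To see concretely that a Bregman divergence need not be convex in its second argument, the following sketch uses the Itakura-Saito divergence, generated by $-\log x$ on the positive reals. This choice of divergence and the helper names are illustrative assumptions, not an example from the paper. A positive midpoint gap certifies that the function is not convex on the chosen segment.

```python
import math


def itakura_saito(x, y):
    """Bregman divergence of phi(x) = -log(x) for scalars x, y > 0."""
    return x / y - math.log(x / y) - 1.0


def midpoint_gap(func, a, b):
    """func((a+b)/2) - (func(a)+func(b))/2; a positive value rules out convexity."""
    return func(0.5 * (a + b)) - 0.5 * (func(a) + func(b))


# Second argument varies, first argument fixed at 1: convexity fails on [3, 5].
gap_second = midpoint_gap(lambda y: itakura_saito(1.0, y), 3.0, 5.0)

# First argument varies, second argument fixed at 1: no violation on [3, 5].
gap_first = midpoint_gap(lambda x: itakura_saito(x, 1.0), 3.0, 5.0)
```

Here `gap_second` is positive while `gap_first` is negative, matching the fact that a Bregman divergence is always convex in its first argument but not necessarily in its second.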
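In standard ADMM, the quadratic augmentation term added to the Lagrangian is just a penalty term to ensure the new updates do not violate the constraint significantly. In keeping with this goal, we propose the $z$ update augmentation term of BADMM to be $B_\phi(Bz, c - Ax_{t+1})$, instead of the quadratic penalty term $\frac{1}{2}\|Ax_{t+1} + Bz - c\|_2^2$ in (4). Then, we get the following updates for BADMM:
$$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} f(x) + \langle y_t, Ax + Bz_t - c \rangle + \rho B_\phi(c - Ax, Bz_t), \tag{7}$$
$$z_{t+1} = \operatorname{argmin}_{z \in \mathcal{Z}} g(z) + \langle y_t, Ax_{t+1} + Bz - c \rangle + \rho B_\phi(Bz, c - Ax_{t+1}), \tag{8}$$
$$y_{t+1} = y_t + \rho(Ax_{t+1} + Bz_{t+1} - c). \tag{9}$$
Compared to ADMM (3)-(5), BADMM simply uses a Bregman divergence to replace the quadratic penalty term in the $x$ and $z$ updates. It is worth noting that the same Bregman divergence $B_\phi$ is used in the $x$ and $z$ updates.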
We consider a special case when $A = -I$, $B = I$, $c = 0$. Then (7) reduces to
$$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} f(x) + \langle y_t, -x + z_t \rangle + \rho B_\phi(x, z_t). \tag{10}$$
If $\phi$ is a quadratic function, the constrained problem (10) requires the projection onto the constraint set $\mathcal{X}$. However, in some cases, with a proper choice of Bregman divergence, (10) can be solved efficiently or has a closed-form solution. For example, if $f$ is a linear function and $\mathcal{X}$ is the unit simplex, $B_\phi$ should be the KL divergence, leading to the exponentiated gradient [Beck03, ben01:mda, Nemi83:complexity]. Interestingly, if the $z$ update is also the exponentiated gradient, we have alternating exponentiated gradients. In Section 4, we will show the mass transportation problem can be cast into this scenario.
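The exponentiated-gradient case can be sketched as follows. Assume a linear objective with cost vector `cost`, the unit simplex as constraint set, the constraint $-x + z = 0$, and the KL divergence as penalty, so that the update minimizes $\langle \text{cost}, x\rangle + \langle y_t, z_t - x\rangle + \rho\,\mathrm{KL}(x, z_t)$ over the simplex. Setting the gradient of the Lagrangian to zero gives $x_{t+1} \propto z_t \exp((y_t - \text{cost})/\rho)$. The names `eg_update` and `objective` and the cost vector are hypothetical and introduced only for this illustration.

```python
import numpy as np


def eg_update(cost, y, z, rho):
    """Closed-form minimizer over the unit simplex of
    <cost, x> + <y, z - x> + rho * KL(x, z), assuming z > 0 elementwise."""
    logits = np.log(z) + (y - cost) / rho
    logits = logits - logits.max()  # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()


def objective(x, cost, y, z, rho):
    """Objective minimized by eg_update; x and z must be strictly positive."""
    return cost @ x + y @ (z - x) + rho * np.sum(x * np.log(x / z))
```

No projection is needed: normalization keeps the iterate on the simplex, which is the practical advantage over a quadratic penalty in this setting.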
While the updates (7)-(8) use the same Bregman divergence, efficiently solving the $x$ and $z$ updates may not be feasible, especially when the structure of the original functions $f, g$, the function $\phi$ used for augmentation, and the constraint sets $\mathcal{X}, \mathcal{Z}$ are rather different. For example, if $f(x)$ is a logistic function in (10), it will not have a closed-form solution even if $B_\phi$ is the KL divergence and $\mathcal{X}$ is the unit simplex. To address such concerns, we propose a generalized version of BADMM in Section 2.1.
2.1 Generalized BADMM
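To allow the use of different Bregman divergences in the $x$ and $z$ updates (7)-(8) of BADMM, the generalized BADMM simply introduces an additional Bregman divergence for each update. The generalized BADMM has the following updates:
$$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} f(x) + \langle y_t, Ax + Bz_t - c \rangle + \rho B_\phi(c - Ax, Bz_t) + \rho_x B_{\varphi_x}(x, x_t), \tag{11}$$
$$z_{t+1} = \operatorname{argmin}_{z \in \mathcal{Z}} g(z) + \langle y_t, Ax_{t+1} + Bz - c \rangle + \rho B_\phi(Bz, c - Ax_{t+1}) + \rho_z B_{\varphi_z}(z, z_t), \tag{12}$$
$$y_{t+1} = y_t + \tau(Ax_{t+1} + Bz_{t+1} - c), \tag{13}$$
where $\rho > 0$, $\tau > 0$, $\rho_x \ge 0$ and $\rho_z \ge 0$. Note that we allow the use of a different step size $\tau$ in the dual variable update [deng12:admm, luo12:admm]. There are three Bregman divergences in the generalized BADMM. While the Bregman divergence $B_\phi$ is shared by the $x$ and $z$ updates, the $x$ update has its own Bregman divergence $B_{\varphi_x}$ and the $z$ update has its own Bregman divergence $B_{\varphi_z}$. The two additional Bregman divergences in generalized BADMM are variable specific, and can be chosen to make sure that the updates are efficient. If all three Bregman divergences are quadratic functions, the generalized BADMM reduces to the generalized ADMM [deng12:admm]. We prove convergence of generalized BADMM in Section 3, which yields the convergence of BADMM with $\rho_x = \rho_z = 0$.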
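In the following, we illustrate how to choose a proper Bregman divergence $B_{\varphi_x}$ so that the $x$ update can be solved efficiently, e.g., in closed form, noting that the same arguments apply to the $z$ update. Consider the first three terms in (11) as $s(x) + h(x)$, where $s(x)$ denotes an easy term and $h(x)$ is the problematic term which needs to be linearized for an efficient $x$ update. We illustrate the idea with several examples later in the section. Now, we have
$$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} s(x) + h(x) + \rho_x B_{\varphi_x}(x, x_t), \tag{14}$$
where efficient updates are difficult due to the mismatch in structure between $h$ and $\varphi_x$. The goal is to ‘linearize’ the function $h$ by using the fact that the Bregman divergence $B_h(x, x_t)$ captures all the higher-order (beyond linear) terms in $h(x)$ so that:
$$h(x) - B_h(x, x_t) = h(x_t) + \langle x - x_t, \nabla h(x_t) \rangle \tag{15}$$
is a linear function of $x$. Let $\psi_x$ be another convex function such that one can efficiently solve $\min_{x \in \mathcal{X}} s(x) + \psi_x(x) + \langle x, b \rangle$ for any constant $b$. Assuming $\varphi_x(x) = \psi_x(x) - \frac{1}{\rho_x} h(x)$ is convex, we add a Bregman divergence based proximal term to the original problem so that:
$$\operatorname{argmin}_{x \in \mathcal{X}} s(x) + h(x) + \rho_x B_{\varphi_x}(x, x_t) = \operatorname{argmin}_{x \in \mathcal{X}} s(x) + \langle \nabla h(x_t), x - x_t \rangle + \rho_x B_{\psi_x}(x, x_t), \tag{16}$$
where the latter problem can be solved efficiently, by our assumption. To ensure $\varphi_x$ is convex, we need the following condition: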
If $h$ is smooth and has Lipschitz continuous gradients with constant $\nu$ under a $p$-norm, then $\varphi_x$ is convex whenever $\psi_x$ is $\nu/\rho_x$-strongly convex w.r.t. the $p$-norm.
This condition has been widely used in gradient-type methods, including MDA and COMID. Note that the convergence analysis of generalized BADMM in Section 3 holds for any additional Bregman divergence based proximal terms, and does not rely on such specific choices. Using the above idea, one can ‘linearize’ different parts of the $x$ update to yield an efficient update.
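The linearization step can be spelled out in two lines. The derivation below is a sketch under the notation assumed here: $h$ is the term to be linearized, $\psi_x$ is the easy convex function, $\varphi_x = \psi_x - \frac{1}{\rho_x} h$, and $x_t$ is the current iterate. It uses only the fact that the Bregman divergence is linear in its generating function.

```latex
\begin{align*}
\rho_x B_{\varphi_x}(x, x_t)
  &= \rho_x B_{\psi_x}(x, x_t) - B_h(x, x_t)
  && \text{since } \varphi_x = \psi_x - \tfrac{1}{\rho_x} h, \\
h(x) + \rho_x B_{\varphi_x}(x, x_t)
  &= \bigl( h(x) - B_h(x, x_t) \bigr) + \rho_x B_{\psi_x}(x, x_t) \\
  &= h(x_t) + \langle \nabla h(x_t),\, x - x_t \rangle + \rho_x B_{\psi_x}(x, x_t).
\end{align*}
```

The constant $h(x_t)$ does not affect the minimizer, so the update only involves the easy term, a linear term, and the proximal term generated by $\psi_x$.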
We consider three special cases, respectively focusing on linearizing the function $f(x)$, linearizing the Bregman divergence based augmentation term $B_\phi(c - Ax, Bz_t)$, and linearizing both terms, along with examples for each case.
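Case 1: Linearization of smooth function $f$: Letting $h(x) = f(x)$ in (16), we have
$$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} \langle \nabla f(x_t), x - x_t \rangle + \langle y_t, Ax \rangle + \rho B_\phi(c - Ax, Bz_t) + \rho_x B_{\psi_x}(x, x_t), \tag{17}$$
where $\nabla f(x_t)$ is the gradient of $f$ at $x_t$.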
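Consider the following ADMM form for the sparse logistic regression problem [hatf09, boyd10]:
$$\min_{x} h(x) + \lambda\|z\|_1 \quad \text{s.t.} \quad x = z, \tag{18}$$
where $h(x)$ is the logistic loss function. If we use ADMM to solve (18), the $x$ update is as follows [boyd10]:
$$x_{t+1} = \operatorname{argmin}_{x} h(x) + \langle y_t, x - z_t \rangle + \frac{\rho}{2}\|x - z_t\|_2^2, \tag{19}$$
which is a ridge-regularized logistic regression problem, and one needs an iterative algorithm like L-BFGS to solve it. Instead, if we linearize $h(x)$ at $x_t$ and set $B_{\psi_x}$ to be a quadratic function, then the $x$ update becomes
$$x_{t+1} = \operatorname{argmin}_{x} \langle \nabla h(x_t), x - x_t \rangle + \langle y_t, x - z_t \rangle + \frac{\rho}{2}\|x - z_t\|_2^2 + \frac{\rho_x}{2}\|x - x_t\|_2^2, \tag{20}$$
which has a simple closed-form solution.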
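A minimal sketch of the linearized update follows, assuming the constraint is written as $x - z = 0$ and the proximal term is $\frac{\rho_x}{2}\|x - x_t\|_2^2$. Setting the gradient of the linearized objective to zero gives $x_{t+1} = (\rho z_t + \rho_x x_t - y_t - \nabla h(x_t)) / (\rho + \rho_x)$. The function names and the data layout (`features` as rows, `labels` in $\{-1, +1\}$) are assumptions made for this illustration.

```python
import numpy as np


def logistic_loss_grad(x, features, labels):
    """Gradient of h(x) = sum_i log(1 + exp(-labels_i * <features_i, x>))."""
    margins = labels * (features @ x)
    return -(features.T @ (labels / (1.0 + np.exp(margins))))


def linearized_x_update(x_t, z_t, y_t, grad, rho, rho_x):
    """Minimizer of <grad, x> + <y_t, x - z_t> + rho/2 ||x - z_t||^2
    + rho_x/2 ||x - x_t||^2, obtained by setting the gradient to zero."""
    return (rho * z_t + rho_x * x_t - y_t - grad) / (rho + rho_x)
```

Each iteration then costs one gradient evaluation plus a few vector operations, in place of an inner L-BFGS solve.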
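Case 2: Linearization of the quadratic penalty term: In ADMM, $B_\phi(c - Ax, Bz_t) = \frac{1}{2}\|Ax + Bz_t - c\|_2^2$. Let $h(x) = \frac{\rho}{2}\|Ax + Bz_t - c\|_2^2$. Then $\nabla h(x_t) = \rho A^T(Ax_t + Bz_t - c)$, and we have
$$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} f(x) + \langle y_t + \rho(Ax_t + Bz_t - c), Ax \rangle + \rho_x B_{\psi_x}(x, x_t). \tag{21}$$
This case mainly addresses the difficulty caused by the $Ax$ term in the quadratic penalty, which makes the $x$ update nonseparable, whereas the linearized version can be solved with separable (parallel) updates. Several problems have benefited from the linearization of the quadratic term [deng12:admm], e.g., when $f$ is the $\ell_1$ loss function [hatf09], and projection onto the unit simplex or $\ell_1$ ball [Duchi08].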
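The separability gained by linearizing the quadratic penalty can be sketched for the illustrative choice $f(x) = \lambda\|x\|_1$ with a quadratic proximal term $\frac{\rho_x}{2}\|x - x_t\|_2^2$ and no further constraint on $x$; these choices and the function names are assumptions for this example. The update then reduces to elementwise soft-thresholding of a gradient-like step.

```python
import numpy as np


def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def linearized_penalty_update(x_t, z_t, y_t, A, B, c, lam, rho, rho_x):
    """x update with the quadratic penalty linearized at x_t and
    f(x) = lam * ||x||_1 (illustrative choice)."""
    w = y_t + rho * (A @ x_t + B @ z_t - c)
    return soft_threshold(x_t - (A.T @ w) / rho_x, lam / rho_x)


def surrogate(x, x_t, z_t, y_t, A, B, c, lam, rho, rho_x):
    """Objective minimized by linearized_penalty_update (constants dropped)."""
    w = y_t + rho * (A @ x_t + B @ z_t - c)
    return lam * np.abs(x).sum() + w @ (A @ x) + 0.5 * rho_x * np.sum((x - x_t) ** 2)
```

Because the matrix $A$ only enters through the fixed vector `A.T @ w`, every coordinate of $x$ is updated independently, which is what makes parallel implementations straightforward.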
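Case 3: Mirror Descent: In some settings, we want to linearize both the function $f$ and the quadratic augmentation term $B_\phi(c - Ax, Bz_t) = \frac{1}{2}\|Ax + Bz_t - c\|_2^2$. Letting $h(x) = f(x) + \langle y_t, Ax \rangle + \frac{\rho}{2}\|Ax + Bz_t - c\|_2^2$, we have
$$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} \langle \nabla h(x_t), x \rangle + \rho_x B_{\psi_x}(x, x_t). \tag{22}$$
Note that (22) is an MDA-type update. Further, one can do a similar exercise with a general Bregman divergence based augmentation term $B_\phi(c - Ax, Bz_t)$, although there has to be a good motivation for going this route.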
Example (Bethe-ADMM [wang13:ADMMMAP]): Consider an undirected graph $G = (V, E)$, where $V$ is the vertex set and $E$ is the edge set. Assume a random discrete variable $X_i$ associated with node $i \in V$ can take $K$ values. In a pairwise MRF, the joint distribution of a set of discrete random variables $X = \{X_1, \ldots, X_n\}$ ($n$ is the number of nodes in the graph) is defined in terms of nodes and cliques [wain08:graphical]. Consider solving the following graph-structured problem:
$$\min_{\mu} l(\mu) \quad \text{s.t.} \quad \mu \in L(G), \tag{23}$$
where $l(\mu)$ is a decomposable function of $\mu$ and $L(G)$ is the so-called local polytope [wain08:graphical] determined by the marginalization and normalization (MN) constraints for each node and edge in the graph $G$:
$$L(G) = \Bigl\{ \mu \ge 0 : \sum_{x_i} \mu_i(x_i) = 1,\ \sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i) \Bigr\}, \tag{24}$$
where $\mu_i, \mu_{ij}$ are pseudo-marginal distributions of node $i$ and edge $ij$ respectively. In particular, (23) serves as an LP relaxation of the MAP inference problem in a pairwise MRF if $l(\mu)$ is defined as follows:
where $\theta_i, \theta_{ij}$ are the potential functions of node $i$ and edge $ij$ respectively.
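The complexity of the polytope $L(G)$ makes (23) difficult to solve. One possible way is to decompose the graph into trees such that
$$\min_{\mu_\tau} \sum_{\tau} c_\tau l_\tau(\mu_\tau) \quad \text{s.t.} \quad \mu_\tau \in T_\tau,\ \mu_\tau = m_\tau, \tag{26}$$
where $T_\tau$ denotes the MN constraints (24) in the tree $\tau$. $\mu_\tau$ is a vector of pseudo-marginals of nodes and edges in the tree $\tau$. $m$ is a global variable which contains all trees and $m_\tau$ corresponds to the tree $\tau$ in the global variable. $c_\tau$ is the weight for sharing variables. The augmented Lagrangian is
$$L_\rho(\mu_\tau, m, \lambda_\tau) = \sum_{\tau} c_\tau l_\tau(\mu_\tau) + \langle \lambda_\tau, \mu_\tau - m_\tau \rangle + \frac{\rho}{2}\|\mu_\tau - m_\tau\|_2^2, \tag{27}$$
which leads to the following update for $\mu_\tau^{t+1}$ in ADMM:
$$\mu_\tau^{t+1} = \operatorname{argmin}_{\mu_\tau \in T_\tau} c_\tau l_\tau(\mu_\tau) + \langle \lambda_\tau^t, \mu_\tau \rangle + \frac{\rho}{2}\|\mu_\tau - m_\tau^t\|_2^2. \tag{28}$$
If $\psi_x$ is the negative Bethe entropy of $\mu_\tau$, the update of $\mu_\tau$ becomes the Bethe entropy problem [wain08:graphical] and can be solved exactly by the sum-product algorithm in time linear in the size of the tree.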
3 Convergence Analysis of BADMM
We need the following assumption in establishing the convergence of BADMM:
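(a) $f$ and $g$ are closed, proper and convex.
(b) An optimal solution exists.
(c) The Bregman divergence $B_\phi$ is defined on an $\alpha$-strongly convex function $\phi$ with respect to a $p$-norm $\|\cdot\|_p$, i.e., $B_\phi(u, v) \ge \frac{\alpha}{2}\|u - v\|_p^2$, where $\alpha > 0$.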
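Assumption (c) can be checked numerically for a familiar case. By Pinsker's inequality, the KL divergence between two probability vectors is at least half the squared $\ell_1$ distance, so the negative entropy is 1-strongly convex with respect to the 1-norm on the simplex. The sketch below evaluates the slack in that inequality; the helper names are hypothetical, and the snippet is an illustration of the assumption rather than part of the analysis.

```python
import numpy as np


def kl_divergence(u, v):
    """KL divergence between strictly positive probability vectors u and v."""
    return float(np.sum(u * np.log(u / v)))


def strong_convexity_gap(u, v, alpha=1.0, p=1):
    """B_phi(u, v) - (alpha / 2) * ||u - v||_p^2 for phi = negative entropy.

    A nonnegative value is consistent with alpha-strong convexity
    w.r.t. the p-norm at the pair (u, v).
    """
    return kl_divergence(u, v) - 0.5 * alpha * np.linalg.norm(u - v, ord=p) ** 2
```

Since the 2-norm is bounded by the 1-norm, the same bound with $p = 2$ also holds on the simplex, though it is looser.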
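We start with the Lagrangian, which is defined as follows:
$$L(x, z, y) = f(x) + g(z) + \langle y, Ax + Bz - c \rangle. \tag{29}$$
Assume that $\{x^*, z^*, y^*\}$ satisfies the KKT conditions of (29), i.e.,
$$-A^T y^* \in \partial f(x^*), \tag{30}$$
$$-B^T y^* \in \partial g(z^*), \tag{31}$$
$$Ax^* + Bz^* - c = 0. \tag{32}$$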
Define the residuals of optimality conditions (35) at as:
Using the convexity of $f$ and its subgradient given in (33), we have
Setting and using , we have
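where the last equality uses the three-point property of Bregman divergence, i.e.,
$$\langle \nabla\phi(u) - \nabla\phi(v), w - u \rangle = B_\phi(w, v) - B_\phi(w, u) - B_\phi(u, v).$$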
Similarly, using the convexity of $g$ and its subgradient given in (34), for any $z \in \mathcal{Z}$,