# Bregman Alternating Direction Method of Multipliers

Huahua Wang
Dept of Computer Science & Engg
University of Minnesota, Twin Cities
huwang@cs.umn.edu
Arindam Banerjee
Dept of Computer Science & Engg
University of Minnesota, Twin Cities
banerjee@cs.umn.edu

## 1 Introduction

In recent years, the alternating direction method of multipliers (ADM or ADMM) [boyd10] has been successfully applied in a broad spectrum of applications, ranging from image processing [Figueiredo10, osher10:admm] to applied statistics and machine learning [yang09, wang12:oadm, wang13:ADMMMAP]. For further understanding of ADMM, we refer the readers to the comprehensive review by [boyd10] and references therein. In particular, ADMM considers the problem of minimizing composite objective functions subject to an equality constraint:

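$$\min_{x \in \mathcal{X},\, z \in \mathcal{Z}} \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c , \tag{1}$$

where $f$ and $g$ are convex functions, $A \in \mathbb{R}^{m \times n_1}$, $B \in \mathbb{R}^{m \times n_2}$, $c \in \mathbb{R}^{m}$, and $\mathcal{X} \subseteq \mathbb{R}^{n_1}$ and $\mathcal{Z} \subseteq \mathbb{R}^{n_2}$ are convex sets. $f$ and $g$ can be non-smooth functions, including indicator functions of convex sets. Many machine learning problems can be cast into the framework of minimizing a composite objective [nest07:composite, Duchi10_comid], where $f$ is a loss function such as hinge or logistic loss, and $g$ is a regularizer, e.g., $\ell_1$ norm, $\ell_2$ norm, nuclear norm or total variation. The two functions usually have different structures and constraints because they play different roles in data mining tasks. Therefore, it is useful and sometimes necessary to split them and solve for them separately, which is exactly the forte of ADMM.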

In each iteration, ADMM updates the splitting variables separately and alternately by minimizing the augmented Lagrangian of (1), which is defined as follows:

$$L_\rho(x, z, y) = f(x) + g(z) + \langle y, Ax + Bz - c \rangle + \frac{\rho}{2}\|Ax + Bz - c\|_2^2 , \tag{2}$$

where $y$ is the dual variable, $\rho > 0$ is the penalty parameter, and the quadratic penalty term penalizes violation of the equality constraint. ADMM consists of the following three updates:

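$$\begin{aligned}
x_{t+1} &= \operatorname*{argmin}_{x \in \mathcal{X}} \; f(x) + \langle y_t, Ax + Bz_t - c \rangle + \frac{\rho}{2}\|Ax + Bz_t - c\|_2^2 , && \text{(3)} \\
z_{t+1} &= \operatorname*{argmin}_{z \in \mathcal{Z}} \; g(z) + \langle y_t, Ax_{t+1} + Bz - c \rangle + \frac{\rho}{2}\|Ax_{t+1} + Bz - c\|_2^2 , && \text{(4)} \\
y_{t+1} &= y_t + \rho(Ax_{t+1} + Bz_{t+1} - c) . && \text{(5)}
\end{aligned}$$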
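As a concrete illustration of the updates (3)-(5), the following minimal Python sketch runs ADMM on a toy problem that is not taken from the paper: $f(x) = \frac{1}{2}\|x - a\|_2^2$, $g(z) = \lambda\|z\|_1$, with the constraint $x - z = 0$ (that is, $A = I$, $B = -I$, $c = 0$). The function names `soft_threshold` and `admm_l1_denoise`, the data vector `a`, and the default parameters are assumptions made for this example only.

```python
def soft_threshold(v, kappa):
    """Proximal operator of kappa * |.| applied to a scalar v."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0


def admm_l1_denoise(a, lam=1.0, rho=1.0, iters=100):
    """ADMM for min 0.5*||x - a||^2 + lam*||z||_1  s.t.  x - z = 0.

    Toy instance of (1) with A = I, B = -I, c = 0 (an illustrative assumption).
    """
    n = len(a)
    x = [0.0] * n
    z = [0.0] * n
    y = [0.0] * n
    for _ in range(iters):
        # x update (3): minimize 0.5*(x - a)^2 + y*(x - z) + rho/2*(x - z)^2
        x = [(a[i] + rho * z[i] - y[i]) / (1.0 + rho) for i in range(n)]
        # z update (4): minimize lam*|z| + y*(x - z) + rho/2*(x - z)^2
        z = [soft_threshold(x[i] + y[i] / rho, lam / rho) for i in range(n)]
        # dual update (5): y <- y + rho * (A x + B z - c) = y + rho * (x - z)
        y = [y[i] + rho * (x[i] - z[i]) for i in range(n)]
    return x, z, y


if __name__ == "__main__":
    x, z, y = admm_l1_denoise([3.0, 0.5, -3.0])
    print(z)
```

For this separable toy problem the minimizer is the soft-thresholded vector, so the iterates can be checked against `soft_threshold(a_i, lam)` entry by entry.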

A large body of literature shows that replacing the quadratic term by a Bregman divergence in gradient-type methods can greatly boost their performance in solving constrained optimization problems. First, the use of a Bregman divergence can effectively exploit the structure of problems [chen93:proxBreg, Beck03, Duchi10_comid], e.g., in computerized tomography [ben01:mda], clustering problems and exponential family distributions [bane05:bregman]. Second, in some cases, the gradient descent method with Kullback-Leibler (KL) divergence can outperform the method with the quadratic term by a factor that grows with the dimensionality $n$ of the problem [Beck03, ben01:mda]. The mirror descent algorithm (MDA) and composite objective mirror descent (COMID) [Duchi10_comid] use a Bregman divergence to replace the quadratic term in gradient descent or proximal gradient [comb09:prox]. The proximal point method with D-functions (PMD) [chen93:proxBreg, ceze98] and Bregman proximal minimization (BPM) [kiwiel95:gbregman] generalize the proximal point method by using a Bregman divergence to replace the quadratic term.

The rest of the paper is organized as follows. In Section 2, we propose Bregman ADMM and discuss several special cases of BADMM. In Section 3, we establish the convergence of BADMM. In Section 4, we consider illustrative applications of BADMM, and conclude in Section 5.

## 2 Bregman Alternating Direction Method of Multipliers

Let $\phi: \Omega \to \mathbb{R}$ be a continuously differentiable and strictly convex function on the relative interior of a convex set $\Omega$. Denote $\nabla\phi(y)$ as the gradient of $\phi$ at $y$. We define the Bregman divergence $B_\phi$ induced by $\phi$ (the definition has been generalized to nondifferentiable functions [kiwiel95:gbregman, teda12:gbregman]) as

$$B_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle .$$

Since $\phi$ is strictly convex, $B_\phi(x, y) \ge 0$, where equality holds if and only if $x = y$. More details about Bregman divergence can be found in [chen93:proxBreg, bane05:bregman]. Two of the most commonly used examples are the squared Euclidean distance $B_\phi(x, y) = \frac{1}{2}\|x - y\|_2^2$ and the KL divergence $B_\phi(x, y) = \sum_{i} x_i \log\frac{x_i}{y_i}$.
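The two examples above can be recovered from the general definition by choosing $\phi(x) = \frac{1}{2}\|x\|_2^2$ and the negative entropy $\phi(x) = \sum_i x_i \log x_i$, respectively. The short Python sketch below evaluates the definition directly; the helper names (`bregman`, `half_sq_norm`, `neg_entropy`, and their gradients) are hypothetical and introduced only for illustration.

```python
import math


def bregman(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    g = grad_phi(y)
    inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
    return phi(x) - phi(y) - inner


def half_sq_norm(x):
    """phi(x) = 0.5 * ||x||_2^2, which induces the squared Euclidean distance."""
    return 0.5 * sum(v * v for v in x)


def grad_half_sq_norm(x):
    return list(x)


def neg_entropy(x):
    """phi(x) = sum_i x_i log x_i (entries must be positive)."""
    return sum(v * math.log(v) for v in x)


def grad_neg_entropy(x):
    return [math.log(v) + 1.0 for v in x]


if __name__ == "__main__":
    p = [0.2, 0.3, 0.5]
    q = [0.4, 0.4, 0.2]
    print(bregman(half_sq_norm, grad_half_sq_norm, p, q))  # 0.5 * ||p - q||^2
    print(bregman(neg_entropy, grad_neg_entropy, p, q))    # KL(p, q) on the simplex
```

With the negative entropy, the general formula gives $\sum_i x_i \log(x_i / y_i) - \sum_i x_i + \sum_i y_i$, which reduces to the KL divergence when both arguments lie on the unit simplex.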

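Assuming $B_\phi(c - Ax, Bz)$ is well defined, we replace the quadratic penalty term in the augmented Lagrangian (2) by a Bregman divergence as follows:

$$L_\rho^\phi(x, z, y) = f(x) + g(z) + \langle y, Ax + Bz - c \rangle + \rho B_\phi(c - Ax, Bz) . \tag{6}$$

Unfortunately, we cannot derive Bregman ADMM (BADMM) updates by simply minimizing $L_\rho^\phi(x, z, y)$ alternately as ADMM does, because Bregman divergences are not necessarily convex in the second argument. More specifically, given $(z_t, y_t)$, $x_{t+1}$ can be obtained by solving $\min_{x \in \mathcal{X}} L_\rho^\phi(x, z_t, y_t)$, where the quadratic penalty term $\frac{\rho}{2}\|Ax + Bz_t - c\|_2^2$ for ADMM in (3) is replaced with $\rho B_\phi(c - Ax, Bz_t)$ in the $x$ update of BADMM. However, given $(x_{t+1}, y_t)$, we cannot obtain $z_{t+1}$ by solving $\min_{z \in \mathcal{Z}} L_\rho^\phi(x_{t+1}, z, y_t)$, since the term $B_\phi(c - Ax_{t+1}, Bz)$ need not be convex in $z$. This observation motivates a closer look at the role of the quadratic term in ADMM.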

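In standard ADMM, the quadratic augmentation term added to the Lagrangian is just a penalty term to ensure the new updates do not violate the constraint significantly. In keeping with this goal, we propose the $z$ update augmentation term of BADMM to be $B_\phi(Bz, c - Ax_{t+1})$, instead of the quadratic penalty term $\frac{1}{2}\|Ax_{t+1} + Bz - c\|_2^2$ in (4). Then, we get the following updates for BADMM:

$$\begin{aligned}
x_{t+1} &= \operatorname*{argmin}_{x \in \mathcal{X}} \; f(x) + \langle y_t, Ax + Bz_t - c \rangle + \rho B_\phi(c - Ax, Bz_t) , && \text{(7)} \\
z_{t+1} &= \operatorname*{argmin}_{z \in \mathcal{Z}} \; g(z) + \langle y_t, Ax_{t+1} + Bz - c \rangle + \rho B_\phi(Bz, c - Ax_{t+1}) , && \text{(8)} \\
y_{t+1} &= y_t + \rho(Ax_{t+1} + Bz_{t+1} - c) . && \text{(9)}
\end{aligned}$$

Compared to ADMM (3)-(5), BADMM simply uses a Bregman divergence to replace the quadratic penalty term in the $x$ and $z$ updates. It is worth noting that the same Bregman divergence $B_\phi$ is used in the $x$ and $z$ updates.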

We consider the special case $A = -I$, $B = I$, $c = 0$, in which (7) reduces to

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \; f(x) + \langle y_t, -x + z_t \rangle + \rho B_\phi(x, z_t) . \tag{10}$$

If $\phi$ is a quadratic function, the constrained problem (10) requires a projection onto the constraint set $\mathcal{X}$. However, in some cases, with a properly chosen Bregman divergence, (10) can be solved efficiently or has a closed-form solution. For example, if $f$ is a linear function and $\mathcal{X}$ is the unit simplex, $B_\phi$ should be the KL divergence, leading to the exponentiated gradient [Beck03, ben01:mda, Nemi83:complexity]. Interestingly, if the $z$ update is also the exponentiated gradient, we have alternating exponentiated gradients. In Section 4, we will show that the mass transportation problem can be cast into this scenario.
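To make the alternating exponentiated gradient idea concrete, here is a minimal Python sketch of BADMM (7)-(9) with the KL divergence on a toy problem that is an assumption of this example rather than an application from the paper: $\min \langle a, x \rangle + \langle b, z \rangle$ subject to $x = z$, with both $x$ and $z$ on the unit simplex (so $A = -I$, $B = I$, $c = 0$). The names `eg_step` and `alternating_eg` are hypothetical. Each subproblem has the form $\min_u \langle \text{cost}, u \rangle + \rho\, \mathrm{KL}(u, \text{base})$ over the simplex, whose solution is $u_i \propto \text{base}_i \exp(-\text{cost}_i / \rho)$.

```python
import math


def eg_step(base, cost, rho):
    """Minimize <cost, u> + rho * KL(u, base) over the unit simplex.

    Closed form: u_i is proportional to base_i * exp(-cost_i / rho).
    """
    w = [b * math.exp(-c / rho) for b, c in zip(base, cost)]
    total = sum(w)
    return [v / total for v in w]


def alternating_eg(a, b, rho=1.0, iters=50):
    """BADMM with KL divergence for min <a, x> + <b, z> s.t. x = z on the simplex.

    Toy instance with A = -I, B = I, c = 0 (an illustrative assumption).
    """
    n = len(a)
    x = [1.0 / n] * n
    z = [1.0 / n] * n
    y = [0.0] * n
    for _ in range(iters):
        # x update (10): linear cost a - y_t, Bregman centre z_t
        x = eg_step(z, [a[i] - y[i] for i in range(n)], rho)
        # z update (8): linear cost b + y_t, Bregman centre x_{t+1}
        z = eg_step(x, [b[i] + y[i] for i in range(n)], rho)
        # dual update (9): y <- y + rho * (A x + B z - c) = y + rho * (z - x)
        y = [y[i] + rho * (z[i] - x[i]) for i in range(n)]
    return x, z, y


if __name__ == "__main__":
    x, z, y = alternating_eg([0.2, 0.5, 0.1], [0.1, 0.5, 0.7])
    print(x, z)
```

Because the constraint forces $x = z$, the toy problem amounts to minimizing $\langle a + b, x \rangle$ over the simplex, so the iterates are expected to concentrate on the coordinate with the smallest entry of $a + b$. No projection step is needed at any point, which is the practical appeal of the KL choice in this setting.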

While the updates (7)-(8) use the same Bregman divergence, efficiently solving the $x$ and $z$ updates may not be feasible, especially when the structure of the original functions $f$ and $g$, the function $\phi$ used for augmentation, and the constraint sets $\mathcal{X}$ and $\mathcal{Z}$ are rather different. For example, if $f(x)$ is a logistic function in (10), it will not have a closed-form solution even if $B_\phi$ is the KL divergence and $\mathcal{X}$ is the unit simplex. To address such concerns, we propose a generalized version of BADMM in Section 2.1.

To allow the use of different Bregman divergences in the $x$ and $z$ updates (7)-(9) of BADMM, the generalized BADMM simply introduces an additional Bregman divergence for each update. The generalized BADMM has the following updates:

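$$\begin{aligned}
x_{t+1} &= \operatorname*{argmin}_{x \in \mathcal{X}} \; f(x) + \langle y_t, Ax + Bz_t - c \rangle + \rho B_\phi(c - Ax, Bz_t) + \rho_x B_{\varphi_x}(x, x_t) , && \text{(11)} \\
z_{t+1} &= \operatorname*{argmin}_{z \in \mathcal{Z}} \; g(z) + \langle y_t, Ax_{t+1} + Bz - c \rangle + \rho B_\phi(Bz, c - Ax_{t+1}) + \rho_z B_{\varphi_z}(z, z_t) , && \text{(12)} \\
y_{t+1} &= y_t + \tau(Ax_{t+1} + Bz_{t+1} - c) . && \text{(13)}
\end{aligned}$$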

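In the following, we illustrate how to choose a proper Bregman divergence $B_{\varphi_x}$ so that the $x$ update can be solved efficiently, e.g., in closed form, noting that the same arguments apply to the $z$ update. Consider the first three terms in (11) as $s(x) + h(x)$, where $s(x)$ denotes an easy term and $h(x)$ is the problematic term which needs to be linearized for an efficient $x$ update. We illustrate the idea with several examples later in the section. Now, we have

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \; s(x) + h(x) + \rho_x B_{\varphi_x}(x, x_t) , \tag{14}$$

where efficient updates are difficult due to the mismatch in structure between $h$ and $\varphi_x$. The goal is to 'linearize' the function $h$ by using the fact that the Bregman divergence $B_h(x, x_t)$ captures all the higher-order (beyond linear) terms in $h(x)$, so that

$$h(x) - B_h(x, x_t) = h(x_t) + \langle x - x_t, \nabla h(x_t) \rangle \tag{15}$$

is a linear function of $x$. Let $\psi$ be another convex function such that one can efficiently solve $\min_{x \in \mathcal{X}} s(x) + \psi(x) + \langle x, b \rangle$ for any constant $b$. Assuming $\varphi_x(x) = \psi(x) - \frac{1}{\rho_x} h(x)$ is convex, we have $\rho_x B_{\varphi_x}(x, x_t) = \rho_x B_\psi(x, x_t) - B_h(x, x_t)$, so adding this Bregman divergence based proximal term to the original problem gives

$$\operatorname*{argmin}_{x \in \mathcal{X}} \; s(x) + h(x) + \rho_x B_{\varphi_x}(x, x_t) = \operatorname*{argmin}_{x \in \mathcal{X}} \; s(x) + \langle \nabla h(x_t), x - x_t \rangle + \rho_x B_\psi(x, x_t) , \tag{16}$$

where the latter problem can be solved efficiently, by our assumption. To ensure $\varphi_x$ is convex, we need the following condition: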

###### Proposition 1

If $h$ is smooth and has Lipschitz continuous gradients with constant $\nu$ under a $p$-norm, then $\varphi_x$ is $\nu/\rho_x$-strongly convex w.r.t. the $p$-norm.

This condition has been widely used in gradient-type methods, including MDA and COMID. Note that the convergence analysis of generalized BADMM in Section 3 holds for any additional Bregman divergence based proximal terms, and does not rely on such specific choices. Using the above idea, one can 'linearize' different parts of the $x$ update to yield an efficient update.

We consider three special cases, respectively focusing on linearizing the function $f(x)$, linearizing the Bregman divergence based augmentation term $B_\phi(c - Ax, Bz_t)$, and linearizing both terms, along with examples for each case.

Case 1: Linearization of smooth function $f$: Letting $\varphi_x(x) = \psi_x(x) - \frac{1}{\rho_x} f(x)$ in (16), we have

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \; \langle \nabla f(x_t), x - x_t \rangle + \langle y_t, Ax \rangle + \rho B_\phi(c - Ax, Bz_t) + \rho_x B_{\psi_x}(x, x_t) , \tag{17}$$

where $\nabla f(x_t)$ is the gradient of $f(x)$ at $x_t$.

###### Example 1

Consider the following ADMM form of the sparse logistic regression problem [hatf09, boyd10]:

$$\min_{x} \; h(x) + \lambda\|z\|_1 \quad \text{s.t.} \quad x = z , \tag{18}$$

where $h(x)$ is the logistic function. If we use ADMM to solve (18), the $x$ update is as follows [boyd10]:

$$x_{t+1} = \operatorname*{argmin}_{x} \; h(x) + \langle y_t, x - z_t \rangle + \frac{\rho}{2}\|x - z_t\|_2^2 , \tag{19}$$

which is a ridge-regularized logistic regression problem that requires an iterative algorithm such as L-BFGS. Instead, if we linearize $h(x)$ at $x_t$ and set $B_\psi$ to be a quadratic function, then in

$$x_{t+1} = \operatorname*{argmin}_{x} \; \langle \nabla h(x_t), x - x_t \rangle + \langle y_t, x - z_t \rangle + \frac{\rho}{2}\|x - z_t\|_2^2 + \frac{\rho_x}{2}\|x - x_t\|_2^2 , \tag{20}$$

the $x$ update has a simple closed-form solution.
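Setting the gradient of the objective in (20) to zero gives $x_{t+1} = \big(\rho z_t + \rho_x x_t - y_t - \nabla h(x_t)\big) / (\rho + \rho_x)$. The Python sketch below spells this out; it assumes the logistic loss takes the common form $h(x) = \sum_i \log(1 + \exp(-b_i \langle a_i, x \rangle))$ with labels $b_i \in \{-1, +1\}$, which the text does not state explicitly, and the names `logistic_grad` and `linearized_x_update` are hypothetical.

```python
import math


def logistic_grad(x, features, labels):
    """Gradient of h(x) = sum_i log(1 + exp(-b_i * <a_i, x>)), with b_i in {-1, +1}.

    The exact form of h is an assumption made for this illustration.
    """
    grad = [0.0] * len(x)
    for a_i, b_i in zip(features, labels):
        margin = b_i * sum(aj * xj for aj, xj in zip(a_i, x))
        coeff = -b_i / (1.0 + math.exp(margin))
        for j, aj in enumerate(a_i):
            grad[j] += coeff * aj
    return grad


def linearized_x_update(x_t, z_t, y_t, grad_h, rho, rho_x):
    """Closed-form minimizer of (20), obtained from the stationarity condition
    grad_h + y_t + rho * (x - z_t) + rho_x * (x - x_t) = 0."""
    return [
        (rho * z + rho_x * x - y - g) / (rho + rho_x)
        for x, z, y, g in zip(x_t, z_t, y_t, grad_h)
    ]


if __name__ == "__main__":
    features = [[1.0, 2.0], [-1.0, 0.5]]
    labels = [1, -1]
    x_t = [0.0, 0.0]
    g = logistic_grad(x_t, features, labels)
    print(linearized_x_update(x_t, [1.0, 1.0], [0.5, -0.5], g, rho=2.0, rho_x=2.0))
```

Each coordinate is updated independently at the cost of one gradient evaluation, in contrast to the inner iterative solver needed for (19).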

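Case 2: Linearization of the quadratic penalty term: In ADMM, $B_\phi(c - Ax, Bz_t) = \frac{1}{2}\|Ax + Bz_t - c\|_2^2$. Let $h(x) = \frac{\rho}{2}\|Ax + Bz_t - c\|_2^2$. Then $\nabla h(x_t) = \rho A^T(Ax_t + Bz_t - c)$, and we have

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \; f(x) + \langle y_t + \rho(Ax_t + Bz_t - c), Ax \rangle + \rho_x B_\psi(x, x_t) . \tag{21}$$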

This case mainly addresses the difficulty caused by the term $Ax$, which makes the $x$ update nonseparable, whereas the linearized version can be solved with separable (parallel) updates. Several problems have benefited from the linearization of the quadratic term [deng12:admm], e.g., when $f$ is the $\ell_1$ loss function [hatf09], and projection onto the unit simplex or $\ell_1$ ball [Duchi08].

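Case 3: Mirror Descent: In some settings, we want to linearize both the function $f$ and the quadratic augmentation term $B_\phi(c - Ax, Bz_t) = \frac{1}{2}\|Ax + Bz_t - c\|_2^2$. Let $h(x) = f(x) + \langle y_t, Ax \rangle + \frac{\rho}{2}\|Ax + Bz_t - c\|_2^2$; we have

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \; \langle \nabla h(x_t), x \rangle + \rho_x B_\psi(x, x_t) . \tag{22}$$

Note that (22) is an MDA-type update. Further, one can do a similar exercise with a general Bregman divergence based augmentation term $B_\phi(c - Ax, Bz_t)$, although there has to be a good motivation for going this route.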

###### Example 2

(Bethe-ADMM [wang13:ADMMMAP]) Consider an undirected graph $G = (V, E)$, where $V$ is the vertex set and $E$ is the edge set. Assume a random discrete variable $X_i$ associated with node $i \in V$ can take $K$ values. In a pairwise MRF, the joint distribution of a set of discrete random variables $X = \{X_1, \ldots, X_n\}$ ($n$ is the number of nodes in the graph) is defined in terms of nodes and cliques [wain08:graphical]. Consider solving the following graph-structured problem:

$$\min \; l(\mu) \quad \text{s.t.} \quad \mu \in L(G) , \tag{23}$$

where $l(\mu)$ is a decomposable function of $\mu$ and $L(G)$ is the so-called local polytope [wain08:graphical] determined by the marginalization and normalization (MN) constraints for each node and edge in the graph $G$:

$$L(G) = \Big\{ \mu \ge 0 , \; \sum_{x_i} \mu_i(x_i) = 1 , \; \sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i) \Big\} , \tag{24}$$

where $\mu_i$ and $\mu_{ij}$ are pseudo-marginal distributions of node $i$ and edge $ij$ respectively. In particular, (23) serves as an LP relaxation of the MAP inference problem in a pairwise MRF if $l(\mu)$ is defined as follows:

$$l(\mu) = \sum_{i} \sum_{x_i} \theta_i(x_i)\mu_i(x_i) + \sum_{ij \in E} \sum_{x_{ij}} \theta_{ij}(x_i, x_j)\mu_{ij}(x_i, x_j) , \tag{25}$$

where $\theta_i$ and $\theta_{ij}$ are the potential functions of node $i$ and edge $ij$ respectively.

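The complexity of the polytope $L(G)$ makes (23) difficult to solve. One possible way is to decompose the graph into trees such that

$$\min \; \sum_{\tau} c_\tau l_\tau(\mu_\tau) \quad \text{s.t.} \quad \mu_\tau \in T_\tau , \; \mu_\tau = m_\tau , \tag{26}$$

where $T_\tau$ denotes the MN constraints (24) in the tree $\tau$, and $\mu_\tau$ is a vector of pseudo-marginals of nodes and edges in the tree $\tau$. $m$ is a global variable which contains all trees, and $m_\tau$ corresponds to the tree $\tau$ in the global variable. $c_\tau$ is the weight for sharing variables. The augmented Lagrangian is

$$L_\rho(\mu_\tau, m, \lambda_\tau) = \sum_{\tau} c_\tau l_\tau(\mu_\tau) + \langle \lambda_\tau, \mu_\tau - m_\tau \rangle + \frac{\rho}{2}\|\mu_\tau - m_\tau\|_2^2 , \tag{27}$$

and the corresponding ADMM update of $\mu_\tau$ is

$$\mu_\tau^{t+1} = \operatorname*{argmin}_{\mu_\tau \in T_\tau} \; c_\tau l_\tau(\mu_\tau) + \langle \lambda_\tau^t, \mu_\tau \rangle + \frac{\rho}{2}\|\mu_\tau - m_\tau^t\|_2^2 . \tag{28}$$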

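(28) is difficult to solve due to the MN constraints in the tree. Let $h(\mu_\tau)$ be the objective of (28). Linearizing $h(\mu_\tau)$ and adding a Bregman divergence in (28), we have:

$$\begin{aligned}
\mu_\tau^{t+1} &= \operatorname*{argmin}_{\mu_\tau \in T_\tau} \; \langle \nabla h(\mu_\tau^t), \mu_\tau \rangle + \rho_x B_\psi(\mu_\tau, \mu_\tau^t) \\
&= \operatorname*{argmin}_{\mu_\tau \in T_\tau} \; \langle \nabla h(\mu_\tau^t) - \rho_x \nabla\psi(\mu_\tau^t), \mu_\tau \rangle + \rho_x \psi(\mu_\tau) .
\end{aligned}$$

If $\psi(\mu_\tau)$ is the negative Bethe entropy of $\mu_\tau$, the update of $\mu_\tau$ becomes the Bethe entropy problem [wain08:graphical] and can be solved exactly by the sum-product algorithm in time linear in the size of the tree.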

## 3 Convergence Analysis of BADMM

We need the following assumption in establishing the convergence of BADMM:

###### Assumption 1

(a) $f$ and $g$ are closed, proper and convex.

(b) An optimal solution exists.

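(c) The Bregman divergence $B_\phi$ is defined on an $\alpha$-strongly convex function $\phi$ with respect to a $p$-norm $\|\cdot\|_p$, i.e., $B_\phi(u, v) \ge \frac{\alpha}{2}\|u - v\|_p^2$, where $\alpha > 0$.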

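We start with the Lagrangian, which is defined as follows:

$$L(x, y, z) = f(x) + g(z) + \langle y, Ax + Bz - c \rangle . \tag{29}$$

Assume that $\{x^*, z^*, y^*\}$ satisfies the KKT conditions of (29), i.e.,

$$\begin{aligned}
-A^T y^* &\in \partial f(x^*) , && \text{(30)} \\
-B^T y^* &\in \partial g(z^*) , && \text{(31)} \\
Ax^* + Bz^* - c &= 0 . && \text{(32)}
\end{aligned}$$

Then $\{x^*, z^*\}$ is an optimal solution. The optimality conditions of (11) and (12) are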

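$$\begin{aligned}
-A^T\{y_t + \rho(-\nabla\phi(c - Ax_{t+1}) + \nabla\phi(Bz_t))\} - \rho_x(\nabla\varphi_x(x_{t+1}) - \nabla\varphi_x(x_t)) &\in \partial f(x_{t+1}) , && \text{(33)} \\
-B^T\{y_t + \rho(\nabla\phi(Bz_{t+1}) - \nabla\phi(c - Ax_{t+1}))\} - \rho_z(\nabla\varphi_z(z_{t+1}) - \nabla\varphi_z(z_t)) &\in \partial g(z_{t+1}) . && \text{(34)}
\end{aligned}$$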

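If $Ax_{t+1} + Bz_{t+1} = c$, then $y_{t+1} = y_t$ by (13). Therefore, (30) is satisfied if $Ax_{t+1} + Bz_t = c$ and $x_{t+1} = x_t$ in (33). Similarly, (31) is satisfied if $z_{t+1} = z_t$ in (34). Overall, the KKT conditions (30)-(32) are satisfied if the following optimality conditions are satisfied:

$$\begin{aligned}
B_{\varphi_x}(x_{t+1}, x_t) = 0 , \quad B_{\varphi_z}(z_{t+1}, z_t) = 0 , && \text{(35a)} \\
Ax_{t+1} + Bz_t - c = 0 , \quad Ax_{t+1} + Bz_{t+1} - c = 0 . && \text{(35b)}
\end{aligned}$$

For the exact BADMM, $\rho_x = \rho_z = 0$ in (11) and (12), and the optimality conditions are (35b), which are equivalent to the optimality conditions used in the proof of ADMM in [boyd10], i.e.,

$$Bz_{t+1} - Bz_t = 0 , \quad Ax_{t+1} + Bz_{t+1} - c = 0 . \tag{36}$$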

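Define the residuals of the optimality conditions (35) at $(t+1)$ as:

$$R(t+1) = \frac{\rho_x}{\rho} B_{\varphi_x}(x_{t+1}, x_t) + \frac{\rho_z}{\rho} B_{\varphi_z}(z_{t+1}, z_t) + B_\phi(c - Ax_{t+1}, Bz_t) + \gamma\|Ax_{t+1} + Bz_{t+1} - c\|_2^2 , \tag{37}$$

where $\gamma > 0$. If $R(t+1) = 0$, the optimality conditions (35a) and (35b) are satisfied. To establish the convergence of BADMM, it is therefore sufficient to show that $R(t+1)$ converges to zero. We need the following lemma.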

###### Lemma 1

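Let the sequence $\{x_t, z_t, y_t\}$ be generated by Bregman ADMM (11)-(13). For any $x^*, z^*$ satisfying $Ax^* + Bz^* = c$, we have

$$\begin{aligned}
f(x_{t+1}) + g(z_{t+1}) - (f(x^*) + g(z^*)) \le{} & -\langle y_t, Ax_{t+1} + Bz_{t+1} - c \rangle - \rho\big(B_\phi(c - Ax_{t+1}, Bz_t) + B_\phi(Bz_{t+1}, c - Ax_{t+1})\big) \\
& + \rho\big(B_\phi(Bz^*, Bz_t) - B_\phi(Bz^*, Bz_{t+1})\big) \\
& + \rho_x\big(B_{\varphi_x}(x^*, x_t) - B_{\varphi_x}(x^*, x_{t+1}) - B_{\varphi_x}(x_{t+1}, x_t)\big) \\
& + \rho_z\big(B_{\varphi_z}(z^*, z_t) - B_{\varphi_z}(z^*, z_{t+1}) - B_{\varphi_z}(z_{t+1}, z_t)\big) .
\end{aligned} \tag{38}$$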
###### Proof.

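Using the convexity of $f$ and its subgradient given in (33), for any $x \in \mathcal{X}$ we have

$$\begin{aligned}
f(x_{t+1}) - f(x) &\le \langle -A^T\{y_t + \rho(-\nabla\phi(c - Ax_{t+1}) + \nabla\phi(Bz_t))\} - \rho_x(\nabla\varphi_x(x_{t+1}) - \nabla\varphi_x(x_t)), x_{t+1} - x \rangle \\
&= -\langle y_t, A(x_{t+1} - x) \rangle + \rho\langle \nabla\phi(c - Ax_{t+1}) - \nabla\phi(Bz_t), A(x_{t+1} - x) \rangle \\
&\quad - \rho_x\langle \nabla\varphi_x(x_{t+1}) - \nabla\varphi_x(x_t), x_{t+1} - x \rangle .
\end{aligned} \tag{39}$$

Setting $x = x^*$ and using $Ax^* + Bz^* = c$, we have

$$\begin{aligned}
f(x_{t+1}) - f(x^*) &\le -\langle y_t, Ax_{t+1} + Bz^* - c \rangle + \rho\langle \nabla\phi(c - Ax_{t+1}) - \nabla\phi(Bz_t), Bz^* - (c - Ax_{t+1}) \rangle \\
&\quad - \rho_x\langle \nabla\varphi_x(x_{t+1}) - \nabla\varphi_x(x_t), x_{t+1} - x^* \rangle \\
&= -\langle y_t, Ax_{t+1} + Bz^* - c \rangle + \rho\{B_\phi(Bz^*, Bz_t) - B_\phi(Bz^*, c - Ax_{t+1}) - B_\phi(c - Ax_{t+1}, Bz_t)\} \\
&\quad + \rho_x\big(B_{\varphi_x}(x^*, x_t) - B_{\varphi_x}(x^*, x_{t+1}) - B_{\varphi_x}(x_{t+1}, x_t)\big) ,
\end{aligned} \tag{40}$$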

where the last equality uses the three-point property of Bregman divergence, i.e.,

$$\langle \nabla\phi(u) - \nabla\phi(v), w - u \rangle = B_\phi(w, v) - B_\phi(w, u) - B_\phi(u, v) . \tag{41}$$
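As a quick numerical sanity check of (41), the Python sketch below evaluates both sides for the negative entropy $\phi(x) = \sum_i x_i \log x_i$, whose Bregman divergence is the generalized KL divergence $\sum_i x_i \log(x_i / y_i) - x_i + y_i$. The choice of $\phi$, the test vectors, and the function names are assumptions made for illustration; the identity itself holds for any differentiable $\phi$.

```python
import math


def gen_kl(x, y):
    """Bregman divergence of the negative entropy: sum x*log(x/y) - x + y."""
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))


def grad_neg_entropy(x):
    """Gradient of phi(x) = sum_i x_i log x_i."""
    return [math.log(v) + 1.0 for v in x]


def three_point_gap(u, v, w):
    """Left-hand side minus right-hand side of the three-point property (41)."""
    gu = grad_neg_entropy(u)
    gv = grad_neg_entropy(v)
    lhs = sum((a - b) * (wi - ui) for a, b, wi, ui in zip(gu, gv, w, u))
    rhs = gen_kl(w, v) - gen_kl(w, u) - gen_kl(u, v)
    return lhs - rhs


if __name__ == "__main__":
    u = [0.2, 0.3, 0.5]
    v = [0.6, 0.1, 0.3]
    w = [0.25, 0.25, 0.5]
    print(three_point_gap(u, v, w))  # should be zero up to rounding error
```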

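Similarly, using the convexity of $g$ and its subgradient given in (34), for any $z \in \mathcal{Z}$,

$$\begin{aligned}
g(z_{t+1}) - g(z) &\le \langle -B^T\{y_t + \rho(\nabla\phi(Bz_{t+1}) - \nabla\phi(c - Ax_{t+1}))\} - \rho_z(\nabla\varphi_z(z_{t+1}) - \nabla\varphi_z(z_t)), z_{t+1} - z \rangle \\
&= -\langle y_t, B(z_{t+1} - z) \rangle + \rho\langle \nabla\phi(Bz_{t+1}) - \nabla\phi(c - Ax_{t+1}), Bz - Bz_{t+1} \rangle \\
&\quad - \rho_z\langle \nabla\varphi_z(z_{t+1}) - \nabla\varphi_z(z_t), z_{t+1} - z \rangle \\
&= -\langle y_t, B(z_{t+1} - z) \rangle + \rho\{B_\phi(Bz, c - Ax_{t+1}) - B_\phi(Bz, Bz_{t+1}) - B_\phi(Bz_{t+1}, c - Ax_{t+1})\} \\
&\quad + \rho_z\big(B_{\varphi_z}(z, z_t) - B_{\varphi_z}(z, z_{t+1}) - B_{\varphi_z}(z_{t+1}, z_t)\big) ,
\end{aligned} \tag{42}$$

where the last equality uses the three-point property of Bregman divergence (41). Set $z = z^*$ in (42). Adding (40) and (42) completes the proof. ∎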

Under Assumption 1(c), the following lemma shows that (37) is bounded by a telescoping series of $D(w^*, w_t)$, where $D(w^*, w_t)$ defines the distance from the current iterate $w_t$ to a KKT point $w^*$ as follows:

 (43)
###### Lemma 2

Let the sequence $\{x_t, z_t, y_t\}$ be generated by Bregman ADMM (11)-(13) and let $\{x^*, z^*, y^*\}$ satisfy (30)-(32). Let Assumption 1 hold. $R(t+1)$ and $D(w^*, w_t)$ are defined in (37) and (43) respectively. Set , where and . Then

$$R(t+1) \le D(w^*, w_t) - D(w^*, w_{t+1}) . \tag{44}$$