# Convergence of Bregman Alternating Direction Method with Multipliers for Nonconvex Composite Problems

## Abstract

The alternating direction method with multipliers (ADMM) has been one of the most powerful and successful methods for solving various convex or nonconvex composite problems that arise in the fields of image & signal processing and machine learning. In convex settings, numerous convergence results have been established for ADMM as well as its variants. However, there have been few studies on the convergence properties of ADMM under nonconvex frameworks, since the convergence analysis of nonconvex algorithms is generally very difficult. In this paper we study the Bregman modification of ADMM (BADMM), which includes the conventional ADMM as a special case and can significantly improve the performance of the algorithm. Under some assumptions, we show that the iterative sequence generated by BADMM converges to a stationary point of the associated augmented Lagrangian function. The obtained results underline the feasibility of ADMM in applications under nonconvex settings.

Keywords: nonconvex regularization, nonconvex sparse minimization, alternating direction method, sub-analytic function, K-L inequality, Bregman distance.

## I Introduction

Many problems arising in the fields of signal & image processing and machine learning [5, 28] involve finding a minimizer of a composite objective function. More specifically, such problems can be formulated as:

$$\min\ f(x)+g(y)\quad \text{s.t.}\quad Ax=By, \tag{1}$$

where $A$ and $B$ are given matrices, $f$ is usually a (quadratic or logistic) loss function, and $g$ is often a regularizer such as the $\ell_1$ norm or the $\ell_{1/2}$ quasi-norm.

Because of its separable structure, problem (1) can be efficiently solved by the alternating direction method with multipliers (ADMM), which decomposes the original joint minimization problem into two easily solved subproblems. The standard ADMM for problem (1) takes the form:

$$y_{k+1}=\arg\min_{y\in\mathbb{R}^{n_2}} L_\alpha(x_k,y,p_k), \tag{2}$$

$$x_{k+1}=\arg\min_{x\in\mathbb{R}^{n_1}} L_\alpha(x,y_{k+1},p_k), \tag{3}$$

$$p_{k+1}=p_k+\alpha(Ax_{k+1}-By_{k+1}), \tag{4}$$

where $\alpha>0$ is a penalty parameter and

$$L_\alpha(x,y,p):=f(x)+g(y)+\langle p,Ax-By\rangle+\frac{\alpha}{2}\|Ax-By\|^2$$

is the associated augmented Lagrangian function with multiplier $p$. Generally speaking, the augmented Lagrangian is first minimized with respect to $y$ for fixed values of $x$ and $p$, then with respect to $x$ with $y$ and $p$ fixed, and finally maximized with respect to $p$ with $x$ and $y$ fixed. Updating the dual variable $p$ in the above system is a trivial task, but this is not so simple for the primal variables $x$ and $y$. Indeed, in many cases the $x$-subproblem (3) and the $y$-subproblem (2) cannot easily be solved. Recently, the Bregman modification of ADMM (BADMM) has been adopted by several researchers to improve the performance of the conventional ADMM algorithm [16, 35, 36, 47]. BADMM takes the following iterative form:

$$y_{k+1}=\arg\min_{y\in\mathbb{R}^{n_2}} L_\alpha(x_k,y,p_k)+\triangle_\psi(y,y_k), \tag{5}$$

$$x_{k+1}=\arg\min_{x\in\mathbb{R}^{n_1}} L_\alpha(x,y_{k+1},p_k)+\triangle_\phi(x,x_k), \tag{6}$$

$$p_{k+1}=p_k+\alpha(Ax_{k+1}-By_{k+1}), \tag{7}$$

where $\triangle_\psi$ and $\triangle_\phi$ respectively denote the Bregman distances with respect to the functions $\psi$ and $\phi$. The difference between this algorithm and the standard ADMM is that the objective function in (2)-(3) is replaced by the sum of a Bregman distance function and the augmented Lagrangian function. Moreover, as shown in [36, 47, 26] and the following section, an appropriate choice of Bregman distance does indeed simplify the original subproblems.
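As an illustration only, the iteration (5)-(7) can be sketched on a scalar toy problem. Everything specific here is an assumption made for the example and is not taken from the paper: $f(x)=\tfrac12(x-a)^2$, $g(y)=\lambda|y|$, $A=B=1$, and $\phi(u)=\psi(u)=\tfrac{c}{2}u^2$, so that both Bregman distances equal $\tfrac{c}{2}(u-u_k)^2$. The function names `soft_threshold` and `badmm_scalar` and all parameter values are hypothetical. The toy problem is convex, so it only demonstrates the mechanics of the updates, not the nonconvex theory.

```python
def soft_threshold(v, t):
    """Proximal map of t*|.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def badmm_scalar(a=3.0, lam=1.0, alpha=1.0, c=0.5, iters=300):
    """Run updates (5)-(7) for min 0.5*(x-a)^2 + lam*|y| s.t. x = y.

    Assumed Bregman generators: phi(u) = psi(u) = c/2 * u^2.
    """
    x, y, p = 0.0, 0.0, 0.0
    for _ in range(iters):
        # (5): argmin_y lam*|y| - p*y + alpha/2*(x - y)^2 + c/2*(y - y_k)^2
        y = soft_threshold((p + alpha * x + c * y) / (alpha + c),
                           lam / (alpha + c))
        # (6): argmin_x 0.5*(x - a)^2 + p*x + alpha/2*(x - y)^2 + c/2*(x - x_k)^2
        x = (a - p + alpha * y + c * x) / (1.0 + alpha + c)
        # (7): multiplier update
        p = p + alpha * (x - y)
    return x, y, p


x, y, p = badmm_scalar()
print(x, y, p)
```

For these assumed values the stationarity system $-p=x-a$, $p\in\lambda\,\partial|y|$, $x=y$ has the unique solution $x=y=2$, $p=1$, which the iterates should approach.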

ADMM was introduced in the early 1970s [18, 17], and its convergence properties for convex objective functions have been extensively studied. The convergence of ADMM was first established for strongly convex functions [18, 17], before being extended to general convex functions [13, 14]. It has been shown that ADMM converges at a sublinear rate of $O(1/k)$ [20, 30], or $O(1/k^2)$ for the accelerated version [19]; furthermore, a linear convergence rate was also shown under certain additional assumptions [12]. The convergence of BADMM for convex objective functions has also been examined with the Euclidean distance [10], Mahalanobis distance [47], and the general Bregman distance [47].

Recent studies on nonnegative matrix factorization, distributed matrix factorization, distributed clustering, sparse zero variance discriminant analysis, polynomial optimization, tensor decomposition, and matrix completion have led to growing interest in ADMM for nonconvex objective functions (see e.g. [21, 27, 38, 44, 46]). It has been shown that the nonconvex ADMM works extremely well for these particular examples.

However, because the convergence analysis of nonconvex algorithms is generally very difficult, there have been few studies on the convergence properties of ADMM under nonconvex frameworks. One major difficulty is that the Fejér monotonicity of iterative sequences does not hold in the absence of convexity. Very recently, [22] analyzed the convergence of ADMM for certain nonconvex consensus and sharing problems. They demonstrated that with and set to the identity matrices, ADMM converges to the set of stationary solutions as long as the penalty parameter is sufficiently large. To show the convergence of ADMM to a stationary point, additional assumptions are required on the functions involved. For example, if and are both semi-algebraic, [26] proved that ADMM converges to a stationary point when is the identity matrix. This result requires that function is strongly convex or matrix has full-column rank.

In this paper, we study the convergence of BADMM under nonconvex frameworks. First, we extend the convergence of BADMM from semi-algebraic functions to sub-analytic functions. In particular, this implies that BADMM is convergent for logistic sparse loss functions, which are not semi-algebraic. Second, we establish a global convergence theorem for cases when $B$ has full-column rank. This allows us to choose $\psi=0$, which covers a recent result in [26]. We also study the case when $B$ does not have full-column rank. In this instance, a suitable Bregman distance also leads to global BADMM convergence. This enhanced flexibility of BADMM enables its application to more general cases. More importantly, the main idea of our convergence analysis is different from that used in [26]. Instead of employing an augmented Lagrangian function at each iteration, we demonstrate global convergence using the descent property of an auxiliary function.

The paper is organized as follows. In Section 2, we recall the definitions of subdifferentials, Bregman distance, and Kurdyka-Łojasiewicz inequality. In Section 3, we establish the global convergence of BADMM to a critical point under certain assumptions. In Section 4, we conduct experimental studies to verify the convergence of BADMM.

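## II Preliminaries

In what follows, $\mathbb{R}^n$ will stand for the $n$-dimensional Euclidean space,

$$\langle x,y\rangle=x^\top y=\sum_{i=1}^{n}x_iy_i,\qquad \|x\|=\sqrt{\langle x,x\rangle},$$

where $x,y\in\mathbb{R}^n$ and $\top$ stands for the transpose operation.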

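### II-A Subdifferentials

Given a function $f:\mathbb{R}^n\to\mathbb{R}\cup\{+\infty\}$, we denote by $\operatorname{dom} f$ the domain of $f$, namely $\operatorname{dom} f:=\{x\in\mathbb{R}^n: f(x)<+\infty\}$. A function $f$ is said to be proper if $\operatorname{dom} f\neq\emptyset$; lower semicontinuous at the point $x_0$ if

$$\liminf_{x\to x_0}f(x)\ge f(x_0).$$

If $f$ is lower semicontinuous at every point of its domain of definition, then it is simply called a lower semicontinuous function.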

###### Definition II.1.

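Let $f:\mathbb{R}^n\to\mathbb{R}\cup\{+\infty\}$ be a proper lower semi-continuous function.

• Given $x\in\operatorname{dom} f$, the Fréchet subdifferential of $f$ at $x$, written $\hat\partial f(x)$, is the set of all elements $u\in\mathbb{R}^n$ which satisfy

$$\liminf_{y\neq x,\ y\to x}\frac{f(y)-f(x)-\langle u,y-x\rangle}{\|x-y\|}\ge 0.$$
• The limiting subdifferential, or simply subdifferential, of $f$ at $x$, written $\partial f(x)$, is defined as

$$\partial f(x)=\{u\in\mathbb{R}^n:\ \exists\, x_k\to x,\ f(x_k)\to f(x),\ u_k\in\hat\partial f(x_k)\to u,\ k\to\infty\}.$$
• A critical point or stationary point of $f$ is a point $x^*$ in the domain of $f$ satisfying $0\in\partial f(x^*)$.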

###### Definition II.2.

An element $(x^*,y^*,p^*)$ is called a critical point or stationary point of the Lagrangian function $L_\alpha$ if it satisfies:

$$\begin{cases}-A^\top p^*=\nabla f(x^*),\\ B^\top p^*\in\partial g(y^*),\\ Ax^*=By^*.\end{cases}\tag{8}$$

Let us now collect some basic properties of the subdifferential (see [31]).

###### Proposition II.1.

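Let $f$ and $g$ be proper lower semi-continuous functions.

• $\hat\partial f(x)\subseteq\partial f(x)$ for each $x\in\mathbb{R}^n$. Moreover, the first set is closed and convex, while the second is closed, and not necessarily convex.

• Let $(x_k)$ and $(u_k)$ be sequences such that $x_k\to x$, $u_k\to u$, $f(x_k)\to f(x)$ and $u_k\in\partial f(x_k)$. Then by the definition of the subdifferential, we have $u\in\partial f(x)$.

• Fermat's rule remains true: if $x_0$ is a local minimizer of $f$, then $x_0$ is a critical point or stationary point of $f$, that is, $0\in\partial f(x_0)$.

• If $f$ is a continuously differentiable function, then $\partial(f+g)(x)=\nabla f(x)+\partial g(x)$.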

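A function $f$ is said to be $\ell_f$-Lipschitz continuous if

$$\|f(x)-f(y)\|\le\ell_f\|x-y\|$$

for any $x,y\in\operatorname{dom} f$; $\mu$-strongly convex if

$$f(y)\ge f(x)+\langle\xi(x),y-x\rangle+\frac{\mu}{2}\|y-x\|^2, \tag{9}$$

for any $x,y\in\operatorname{dom} f$ and $\xi(x)\in\partial f(x)$.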

### II-B Kurdyka-Łojasiewicz inequality

The Kurdyka-Łojasiewicz (K-L) inequality plays an important role in our subsequent analysis. This inequality was first introduced by Łojasiewicz [32] for real analytic functions, then extended by Kurdyka [24] to smooth functions whose graph belongs to an o-minimal structure, and recently further extended to nonsmooth sub-analytic functions [3].

###### Definition II.3 (K-L inequality).

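A function $f$ is said to satisfy the K-L inequality at $x_0$ if there exist $\eta>0$, $\delta>0$ and $\varphi\in\mathcal{A}_\eta$ such that for all $x$ with $\|x-x_0\|<\delta$ and $f(x_0)<f(x)<f(x_0)+\eta$,

$$\varphi'(f(x)-f(x_0))\,\operatorname{dist}(0,\partial f(x))\ge 1,$$

where $\operatorname{dist}(0,\partial f(x)):=\inf\{\|u\|:u\in\partial f(x)\}$ and $\mathcal{A}_\eta$ stands for the class of functions $\varphi:[0,\eta)\to[0,+\infty)$ such that (a) $\varphi$ is continuous on $[0,\eta)$; (b) $\varphi$ is smooth and concave on $(0,\eta)$; (c) $\varphi(0)=0$ and $\varphi'(s)>0$ for all $s\in(0,\eta)$.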

The following is an extension of the conventional K-L inequality [4].

###### Lemma II.2 (K-L inequality on compact subsets).

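Let $f$ be a proper lower semi-continuous function and let $\Omega$ be a compact set. If $f$ is constant on $\Omega$ and satisfies the K-L inequality at each point in $\Omega$, then there exist $\eta>0$, $\delta>0$ and a function $\varphi$ in the class of Definition II.3 such that for all $x_0\in\Omega$ and for all $x$ with $\operatorname{dist}(x,\Omega)<\delta$ and $f(x_0)<f(x)<f(x_0)+\eta$,

$$\varphi'(f(x)-f(x_0))\,\operatorname{dist}(0,\partial f(x))\ge 1.$$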

Typical functions satisfying the K-L inequality include strongly convex functions, real analytic functions, semi-algebraic functions and sub-analytic functions.
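A small numerical check can make the inequality concrete. The sketch below is an illustration that is not taken from the paper: it assumes $f(x)=x^2$, $x_0=0$ and the desingularizing function $\varphi(s)=\sqrt{s}$, for which $\varphi'(f(x)-f(x_0))\,\operatorname{dist}(0,\partial f(x))=\frac{1}{2|x|}\cdot 2|x|=1$, so the K-L inequality holds with equality. The helper name `kl_product` is hypothetical.

```python
import math


def kl_product(x):
    """Return phi'(f(x) - f(0)) * dist(0, subdiff f(x)) for the assumed
    choices f(x) = x^2 and phi(s) = sqrt(s)."""
    gap = x * x                    # f(x) - f(0)
    dphi = 0.5 / math.sqrt(gap)    # phi'(s) = 1 / (2 * sqrt(s))
    grad_norm = abs(2.0 * x)       # subdiff f(x) = {2x}
    return dphi * grad_norm


# Sample points near x0 = 0 (excluding 0, where f(x) = f(x0)).
values = [kl_product(x) for x in (-0.5, -0.01, 0.001, 0.3)]
print(values)
```

Each product equals 1 up to rounding, which is the boundary case of the inequality.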

A subset $C\subset\mathbb{R}^n$ is said to be semi-algebraic if it can be written as

$$C=\bigcup_{j=1}^{r}\bigcap_{i=1}^{s}\{x\in\mathbb{R}^n:\ g_{i,j}(x)=0,\ h_{i,j}(x)<0\},$$

where $g_{i,j}$ and $h_{i,j}$ are real polynomial functions. Then a function $f$ is called semi-algebraic if its graph

$$\mathcal{G}(f):=\{(x,y)\in\mathbb{R}^{n+1}:\ f(x)=y\}$$

is a semi-algebraic subset in $\mathbb{R}^{n+1}$. For example, the quasi norm with , the sup-norm $\|x\|_\infty$, the Euclidean norm $\|x\|$, , and are all semi-algebraic functions [4, 39].

A real function on $\mathbb{R}$ is said to be analytic if it possesses derivatives of all orders and agrees with its Taylor series in a neighborhood of every point. A real function $f$ on $\mathbb{R}^n$ is said to be analytic if the function of one variable $t\mapsto f(x+ty)$ is analytic for any $x,y\in\mathbb{R}^n$. It is readily seen that real polynomial functions such as quadratic functions are analytic. Moreover, the $\varepsilon$-smoothed $\ell_q$ norm with $\varepsilon>0$ and the logistic loss function $\log(1+e^{-t})$ are also examples of real analytic functions [39].

A subset $C\subset\mathbb{R}^n$ is said to be sub-analytic if it can be written as

$$C=\bigcup_{j=1}^{r}\bigcap_{i=1}^{s}\{x\in\mathbb{R}^n:\ g_{i,j}(x)=0,\ h_{i,j}(x)<0\},$$

where $g_{i,j}$ and $h_{i,j}$ are real analytic functions. Then a function $f$ is called sub-analytic if its graph $\mathcal{G}(f)$ is a sub-analytic subset in $\mathbb{R}^{n+1}$. It is clear that both real analytic and semi-algebraic functions are sub-analytic. Generally speaking, the sum of two sub-analytic functions is not necessarily sub-analytic. As shown in [3, 39], for two sub-analytic functions, if at least one function maps bounded sets to bounded sets, then their sum is also sub-analytic. In particular, the sum of a sub-analytic function and an analytic function is sub-analytic. Some sub-analytic functions that are widely used are as follows:

• ;

• ;

• ;

• .

### II-C Bregman distance

The Bregman distance, first introduced in 1967 [6], plays an important role in various iterative algorithms. As a generalization of the squared Euclidean distance, the Bregman distance shares many of its nice properties. However, the Bregman distance is not a metric, since it satisfies neither the triangle inequality nor symmetry. For a convex differentiable function $\phi$, the associated Bregman distance is defined as

$$\triangle_\phi(x,y)=\phi(x)-\phi(y)-\langle\nabla\phi(y),x-y\rangle.$$

In particular, if we let $\phi(x)=\|x\|^2$ in the above, then it reduces to $\|x-y\|^2$, namely the classical (squared) Euclidean distance. Some nontrivial examples of Bregman distance include [2]:

• Itakura-Saito distance: $\sum_i\big(x_i/y_i-\log(x_i/y_i)-1\big)$;

• Kullback-Leibler divergence: $\sum_i x_i\log(x_i/y_i)$;

• Mahalanobis distance: $\|x-y\|_Q^2=\langle Q(x-y),x-y\rangle$ with $Q$ a symmetric positive definite matrix.
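The definition can be checked numerically with a few lines of code. The sketch below is illustrative only: the helper `bregman` and the test vectors are hypothetical, and the negative entropy $\phi(x)=\sum_i x_i\log x_i$ is an assumed generator (it yields the generalized Kullback-Leibler divergence $\sum_i (x_i\log(x_i/y_i)-x_i+y_i)$), used here to show the lack of symmetry.

```python
import math


def bregman(phi, grad_phi, x, y):
    """Bregman distance phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


# phi(x) = ||x||^2 generates the squared Euclidean distance.
def sq(v):
    return sum(t * t for t in v)


def grad_sq(v):
    return [2.0 * t for t in v]


# Assumed generator: negative entropy, defined for positive entries.
def negent(v):
    return sum(t * math.log(t) for t in v)


def grad_negent(v):
    return [math.log(t) + 1.0 for t in v]


x, y = [1.0, 2.0], [3.0, 5.0]
d_euclid = bregman(sq, grad_sq, x, y)        # equals ||x - y||^2 = 4 + 9
d_xy = bregman(negent, grad_negent, x, y)
d_yx = bregman(negent, grad_negent, y, x)
print(d_euclid, d_xy, d_yx)
```

Both entropy-based values are nonnegative but differ from each other, which illustrates why the Bregman distance is not a metric.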

Let us now collect some useful properties about Bregman distance.

###### Proposition II.3.

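Let $\phi$ be a convex differentiable function and $\triangle_\phi(x,y)$ the associated Bregman distance.

• Non-negativity: $\triangle_\phi(x,y)\ge 0$ for all $x,y$.

• Convexity: $\triangle_\phi(x,y)$ is convex in $x$, but not necessarily in $y$.

• Strong Convexity: If $\phi$ is $\mu$-strongly convex, then $\triangle_\phi(x,y)\ge\frac{\mu}{2}\|x-y\|^2$ for all $x,y$.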

As shown below, an appropriate choice of Bregman distance simplifies the $x$- and $y$-subproblems, which in turn improves the performance of the algorithm. For example, in the $y$-subproblem (5), taking $g(y)=\|y\|_{1/2}^{1/2}$ and $\psi=0$ leads to minimizing the function

$$\|y\|_{1/2}^{1/2}-\langle p_k,By\rangle+\frac{\alpha}{2}\|By-Ax_k\|^2.$$

In general, finding a minimizer of this function is not an easy task. However, if we take $\psi(y)=\frac{\alpha}{2}\big(\mu\|y\|^2-\|By\|^2\big)$ with $\mu>\|B\|^2$, then the problem becomes that of minimizing

$$\|y\|_{1/2}^{1/2}+\frac{\alpha\mu}{2}\Big\|y-\big(y_k-\mu^{-1}B^\top(By_k-Ax_k-p_k/\alpha)\big)\Big\|^2.$$

Such a problem has a closed-form solution (see [40]), and thus it can be solved very easily.
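To see where the shifted quadratic comes from, the following derivation is a sketch under the assumption that $\psi(y)=\frac{\alpha}{2}\big(\mu\|y\|^2-\|By\|^2\big)$ with $\mu>\|B\|^2$ (so that $\psi$ is convex), which is the choice consistent with the formula above. Terms independent of $y$ are collected in "const".

$$\begin{aligned}
\triangle_\psi(y,y_k)&=\frac{\alpha}{2}\big(\mu\|y-y_k\|^2-\|B(y-y_k)\|^2\big),\\
\frac{\alpha}{2}\|By-Ax_k\|^2+\triangle_\psi(y,y_k)&=\frac{\alpha\mu}{2}\|y-y_k\|^2+\alpha\langle B^\top(By_k-Ax_k),\,y-y_k\rangle+\text{const},\\
-\langle p_k,By\rangle+\frac{\alpha}{2}\|By-Ax_k\|^2+\triangle_\psi(y,y_k)&=\frac{\alpha\mu}{2}\Big\|y-\big(y_k-\mu^{-1}B^\top(By_k-Ax_k-p_k/\alpha)\big)\Big\|^2+\text{const}.
\end{aligned}$$

The second line uses $\|By-Ax_k\|^2=\|B(y-y_k)\|^2+2\langle B(y-y_k),By_k-Ax_k\rangle+\|By_k-Ax_k\|^2$, so the term $\|B(y-y_k)\|^2$ cancels against the Bregman distance; the third line completes the square. The $y$-update thus becomes a proximal step of $g$ at a shifted point, with no coupling through $B^\top B$.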

### II-D Basic assumption

We need the following basic assumptions on problem (1). A basic assumption to guarantee the convergence of BADMM is that the matrix $A$ has full row rank. The only difference between Assumptions 1 and 2 is that Assumption 1 requires $B$ to have full column rank, while Assumption 2 requires $\psi$ to be strongly convex. It is worth noting that one can choose $\psi=0$ under Assumption 1, so that BADMM includes the standard ADMM as a special case. It is also worth noting that the choice $\psi=0$ is not available under Assumption 2.

###### Assumption 1.

Let $f$ be a continuously differentiable function and $g$ a proper lower semi-continuous function. Assume that the following hold.

• (a) $AA^\top\succeq\mu_0 I$ for some $\mu_0>0$, and $B$ is injective;

• (b) either $L_\alpha(x,y,p)$ with respect to $x$ or $\phi$ is strongly convex;

• (c) is a sub-analytic function, and and are Lipschitz continuous.

In condition (b), the strong convexity of $\phi$ is easily attained, for example with $\phi(x)=\|x\|^2$, while the strong convexity of $L_\alpha(x,y,p)$ in $x$ can be deduced from some standard assumptions, for example the Neumann boundary condition in image processing [15]. Condition (b) will be used to guarantee the sufficient descent property of the augmented Lagrangian function. More specifically, it implies

$$L_\alpha(x_{k+1},y_{k+1},p_k)\le L_\alpha(x_k,y_{k+1},p_k)-\frac{\mu_1}{2}\|x_{k+1}-x_k\|^2, \tag{10}$$

where $(x_k,y_k,p_k)$ is generated by algorithm (5)-(7). As a matter of fact, if $L_\alpha(x,y,p)$ with respect to $x$ is $\mu_1$-strongly convex, then $L_\alpha(x,y_{k+1},p_k)+\triangle_\phi(x,x_k)$ is also $\mu_1$-strongly convex in $x$ because $\triangle_\phi(\cdot,x_k)$ is convex by Proposition II.3. Thus the desired inequality follows from the definition of strong convexity, the optimality of $x_{k+1}$ in (6), and the non-negativity in Proposition II.3. If $\phi$ is $\mu_1$-strongly convex, then it follows again from Proposition II.3 that

$$\triangle_\phi(x_{k+1},x_k)\ge\frac{\mu_1}{2}\|x_{k+1}-x_k\|^2,$$

which together with the definition of $x_{k+1}$ yields the desired inequality.

The sub-analyticity condition in (c) will be used to guarantee that the auxiliary function constructed in the following section satisfies the K-L inequality. We notice that all functions mentioned in subsection II-B satisfy assumption (c). The Lipschitz continuity is a standard assumption for various algorithms, even in convex settings.

We also consider BADMM under another set of conditions listed in Assumption 2 below. The only difference between Assumptions 1 and 2 is that Assumption 1 requires $B$ to have full column rank, whereas Assumption 2 requires $\psi$ to be strongly convex. It is worth noting that one can choose $\psi=0$ under Assumption 1, so that BADMM includes the standard ADMM as a special case.

###### Assumption 2.

Let $f$ be a continuously differentiable function and $g$ a proper lower semi-continuous function. Assume that the following hold.

• (a) $AA^\top\succeq\mu_0 I$ for some $\mu_0>0$, and $\psi$ is strongly convex.

• (b) either $L_\alpha(x,y,p)$ with respect to $x$ or $\phi$ is strongly convex.

• (c) is a sub-analytic function, and and are Lipschitz continuous.

## III Convergence Analysis

In this section we prove the convergence of BADMM under two different assumptions. In both assumptions, the parameter $\alpha$ is chosen so that

$$\alpha>\frac{4\big((\ell_f+\ell_\phi)^2+\ell_\phi^2\big)}{\mu_1\mu_0},$$

where $\ell_f$ and $\ell_\phi$ respectively stand for the Lipschitz constants of $\nabla f$ and $\nabla\phi$.

According to a recent work [1], the key point in the convergence analysis of nonconvex algorithms is to show the descent property of the augmented Lagrangian function. This is, however, not easily attained, since the dual variable is updated by maximizing the augmented Lagrangian function. As an alternative, we construct an auxiliary function below, which helps us to deduce the global convergence of BADMM.

### III-A The case $B$ is injective

###### Lemma III.1.

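Let Assumption 1 be fulfilled. Then there exists $\sigma_1>0$ such that

$$\sigma_1\|x_{k+1}-x_k\|^2\le\hat L(x_k,y_k,p_k,x_{k-1})-\hat L(x_{k+1},y_{k+1},p_{k+1},x_k),$$

where $\hat L(x,y,p,\hat x):=L_\alpha(x,y,p)+\sigma_0\|x-\hat x\|^2$ with $\sigma_0:=2\ell_\phi^2/(\alpha\mu_0)$.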

###### Proof.

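First we show that for each $k$

$$\|p_{k+1}-p_k\|^2\le\frac{2(\ell_f+\ell_\phi)^2}{\mu_0}\|x_{k+1}-x_k\|^2+\frac{2\ell_\phi^2}{\mu_0}\|x_k-x_{k-1}\|^2. \tag{11}$$

Indeed, applying Fermat's rule to (6) yields

$$\nabla f(x_{k+1})+A^\top p_k+\alpha A^\top(Ax_{k+1}-By_{k+1})+\nabla\phi(x_{k+1})-\nabla\phi(x_k)=0, \tag{12}$$

which together with (7) implies that

$$A^\top p_{k+1}=A^\top\big(p_k+\alpha(Ax_{k+1}-By_{k+1})\big)=-\nabla f(x_{k+1})+\nabla\phi(x_k)-\nabla\phi(x_{k+1}). \tag{13}$$

It then follows that

$$\begin{aligned}
\|A^\top(p_{k+1}-p_k)\|^2&=\|\nabla f(x_{k+1})-\nabla f(x_k)+(\nabla\phi(x_{k+1})-\nabla\phi(x_k))+(\nabla\phi(x_{k-1})-\nabla\phi(x_k))\|^2\\
&\le\big(\|\nabla f(x_{k+1})-\nabla f(x_k)\|+\|\nabla\phi(x_{k+1})-\nabla\phi(x_k)\|+\|\nabla\phi(x_{k-1})-\nabla\phi(x_k)\|\big)^2\\
&\le\big(\ell_f\|x_{k+1}-x_k\|+\ell_\phi\|x_k-x_{k+1}\|+\ell_\phi\|x_k-x_{k-1}\|\big)^2\\
&\le 2(\ell_f+\ell_\phi)^2\|x_{k+1}-x_k\|^2+2\ell_\phi^2\|x_k-x_{k-1}\|^2.
\end{aligned}$$

Since the matrix $A$ is surjective, so that $AA^\top\succeq\mu_0 I$ with $\mu_0>0$, we have

$$\begin{aligned}
\|A^\top(p_{k+1}-p_k)\|^2&=\langle A^\top(p_{k+1}-p_k),A^\top(p_{k+1}-p_k)\rangle\\
&=\langle AA^\top(p_{k+1}-p_k),p_{k+1}-p_k\rangle\\
&\ge\mu_0\|p_{k+1}-p_k\|^2,
\end{aligned}$$

which at once implies (11), as desired.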

Next we claim that

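$$L_\alpha(x_{k+1},y_{k+1},p_{k+1})-L_\alpha(x_k,y_k,p_k)\le-\frac{\mu_1}{2}\|x_{k+1}-x_k\|^2+\frac{1}{\alpha}\|p_{k+1}-p_k\|^2. \tag{14}$$

To see this, we deduce from (10) and (5)-(7) that

$$\begin{aligned}
L_\alpha(x_k,y_{k+1},p_k)&\le L_\alpha(x_k,y_k,p_k),\\
L_\alpha(x_{k+1},y_{k+1},p_k)&\le L_\alpha(x_k,y_{k+1},p_k)-\frac{\mu_1}{2}\|x_{k+1}-x_k\|^2,\\
L_\alpha(x_{k+1},y_{k+1},p_{k+1})-L_\alpha(x_{k+1},y_{k+1},p_k)&=\langle p_{k+1}-p_k,Ax_{k+1}-By_{k+1}\rangle=\frac{1}{\alpha}\|p_{k+1}-p_k\|^2.
\end{aligned}$$

Adding up the above formulas at once yields (14).

Finally it follows from (11) and (14) that

$$L_\alpha(x_{k+1},y_{k+1},p_{k+1})-L_\alpha(x_k,y_k,p_k)\le\Big(\frac{2(\ell_f+\ell_\phi)^2}{\alpha\mu_0}-\frac{\mu_1}{2}\Big)\|x_{k+1}-x_k\|^2+\frac{2\ell_\phi^2}{\alpha\mu_0}\|x_k-x_{k-1}\|^2,$$

which is equivalent to

$$\begin{aligned}
L_\alpha(x_{k+1},y_{k+1},p_{k+1})+\frac{2\ell_\phi^2}{\alpha\mu_0}\|x_{k+1}-x_k\|^2\le{}&L_\alpha(x_k,y_k,p_k)+\frac{2\ell_\phi^2}{\alpha\mu_0}\|x_k-x_{k-1}\|^2\\
&-\Big(\frac{\mu_1}{2}-\frac{2(\ell_f+\ell_\phi)^2}{\alpha\mu_0}-\frac{2\ell_\phi^2}{\alpha\mu_0}\Big)\|x_k-x_{k+1}\|^2.
\end{aligned}$$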

Let us now define

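$$\sigma_0=\frac{2\ell_\phi^2}{\alpha\mu_0},\qquad\sigma_1=\frac{\mu_1}{2}-\frac{2(\ell_f+\ell_\phi)^2}{\alpha\mu_0}-\frac{2\ell_\phi^2}{\alpha\mu_0}.$$

Since $\alpha>4\big((\ell_f+\ell_\phi)^2+\ell_\phi^2\big)/(\mu_1\mu_0)$, we have $\sigma_1>0$, while $\sigma_0\ge 0$; thus the desired inequality follows. ∎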
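The role of the lower bound on $\alpha$ can be checked with a short computation. The sketch below is illustrative only: the function names `descent_constants` and `alpha_threshold` and the numerical constants are hypothetical, while the formulas for $\sigma_0$, $\sigma_1$ and the threshold are those stated in this section.

```python
def descent_constants(l_f, l_phi, mu0, mu1, alpha):
    """Return (sigma0, sigma1) as defined in the proof of Lemma III.1."""
    sigma0 = 2.0 * l_phi ** 2 / (alpha * mu0)
    sigma1 = mu1 / 2.0 - 2.0 * (l_f + l_phi) ** 2 / (alpha * mu0) - sigma0
    return sigma0, sigma1


def alpha_threshold(l_f, l_phi, mu0, mu1):
    """Lower bound on alpha: 4((l_f + l_phi)^2 + l_phi^2) / (mu1 * mu0)."""
    return 4.0 * ((l_f + l_phi) ** 2 + l_phi ** 2) / (mu1 * mu0)


# Hypothetical constants chosen only for illustration.
l_f, l_phi, mu0, mu1 = 1.0, 1.0, 1.0, 1.0
alpha_min = alpha_threshold(l_f, l_phi, mu0, mu1)

# sigma1 is positive above the threshold and negative below it.
s0_above, s1_above = descent_constants(l_f, l_phi, mu0, mu1, 2.0 * alpha_min)
s0_below, s1_below = descent_constants(l_f, l_phi, mu0, mu1, 0.5 * alpha_min)
print(alpha_min, s1_above, s1_below)
```

Rearranging $\sigma_1>0$ gives exactly the stated condition on $\alpha$, so the threshold is the smallest penalty for which the auxiliary function is guaranteed to decrease.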

###### Lemma III.2.

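If the sequence $z_k:=(x_k,y_k,p_k)$ is bounded, then

$$\sum_{k=0}^{\infty}\|z_k-z_{k+1}\|^2<\infty.$$

In particular the sequence $(z_k)$ is asymptotically regular, namely $\|z_k-z_{k+1}\|\to 0$ as $k\to\infty$. Moreover, any cluster point of $(z_k)$ is a stationary point of $L_\alpha$.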

###### Proof.

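Let $\hat z_k:=(x_k,y_k,p_k,x_{k-1})$. Since $(\hat z_k)$ is clearly bounded, there exists a subsequence $(\hat z_{k_j})$ that converges to some element $\hat z^*$. By our hypothesis the function $\hat L$ is lower semicontinuous, which leads to

$$\liminf_{j\to\infty}\hat L(\hat z_{k_j})\ge\hat L(\hat z^*),$$

so that $\hat L(\hat z_{k_j})$ is bounded from below. By the previous lemma, $\hat L(\hat z_k)$ is nonincreasing, so that $\hat L(\hat z_{k_j})$ is convergent. Moreover $\hat L(\hat z_k)$ is also convergent and $\hat L(\hat z_k)\ge\hat L(\hat z^*)$ for each $k$.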

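Now fix $k$. It then follows from Lemma III.1 that

$$\sigma_1\sum_{i=0}^{k}\|x_i-x_{i+1}\|^2\le\sum_{i=0}^{k}\big(\hat L(\hat z_i)-\hat L(\hat z_{i+1})\big)=\hat L(\hat z_0)-\hat L(\hat z_{k+1})\le\hat L(\hat z_0)-\hat L(\hat z^*)<\infty.$$

Since $k$ is chosen arbitrarily, we have $\sum_{k=0}^{\infty}\|x_k-x_{k+1}\|^2<\infty$, which with (11) implies $\sum_{k=0}^{\infty}\|p_k-p_{k+1}\|^2<\infty$. Since $B$ is injective, it is readily seen that there exists $\mu_B>0$ so that

$$\begin{aligned}
\alpha^2\mu_B\|y_k-y_{k+1}\|^2&\le\|\alpha B(y_k-y_{k+1})\|^2\\
&=\|(p_k-p_{k+1})+(p_k-p_{k-1})+\alpha(Ax_{k+1}-Ax_k)\|^2\\
&\le 3\big(\|p_k-p_{k+1}\|^2+\|p_k-p_{k-1}\|^2+\alpha^2\|A\|^2\|x_{k+1}-x_k\|^2\big).
\end{aligned}\tag{15}$$

Hence $\sum_{k=0}^{\infty}\|y_k-y_{k+1}\|^2<\infty$, and therefore $\sum_{k=0}^{\infty}\|z_k-z_{k+1}\|^2<\infty$, so that in particular $\|z_k-z_{k+1}\|\to 0$.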

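Let $z^*=(x^*,y^*,p^*)$ be any cluster point of $(z_k)$ and let $(z_{k_j})$ be a subsequence of $(z_k)$ converging to $z^*$. Since $\|z_k-z_{k+1}\|$ tends to zero as $k\to\infty$, $(z_{k_j})$ and $(z_{k_j+1})$ have the same limit point $z^*$. Since $\hat L(\hat z_k)$ is convergent, it is not hard to see that $g(y_k)$ is also convergent. It then follows from (5)-(7) that

$$\begin{aligned}
p_{k+1}&=p_k+\alpha(Ax_{k+1}-By_{k+1}),\\
-\nabla f(x_{k+1})&=A^\top p_{k+1}+\nabla\phi(x_{k+1})-\nabla\phi(x_k),\\
\partial g(y_{k+1})&\ni B^\top p_k+\alpha B^\top(Ax_k-By_{k+1})+\nabla\psi(y_k)-\nabla\psi(y_{k+1})\\
&=B^\top p_{k+1}+\alpha B^\top(Ax_k-Ax_{k+1})+\nabla\psi(y_k)-\nabla\psi(y_{k+1}).
\end{aligned}$$

Letting $k=k_j$ and $j\to\infty$ in the above formulas yields

$$A^\top p^*=-\nabla f(x^*),\qquad B^\top p^*\in\partial g(y^*),\qquad Ax^*=By^*,$$

which implies that $z^*$ is a stationary point. ∎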

###### Lemma III.3.

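Let $\hat z_k:=(x_k,y_k,p_k,x_{k-1})$. Then there exists $\kappa>0$ such that for each $k$

$$\operatorname{dist}(0,\partial\hat L(\hat z_{k+1}))\le\kappa\big(\|x_k-x_{k+1}\|+\|x_k-x_{k-1}\|+\|x_{k-1}-x_{k-2}\|\big).$$

###### Proof.

By the definitions of $\hat L$ and algorithm (5)-(7), we have

$$\begin{aligned}
\partial_x\hat L(\hat z_{k+1})&=\nabla f(x_{k+1})+A^\top p_{k+1}+2\sigma_0(x_{k+1}-x_k)+\alpha A^\top(Ax_{k+1}-By_{k+1})\\
&=\nabla\phi(x_k)-\nabla\phi(x_{k+1})+2\sigma_0(x_{k+1}-x_k)+\alpha A^\top(Ax_{k+1}-By_{k+1})\\
&=\nabla\phi(x_k)-\nabla\phi(x_{k+1})+2\sigma_0(x_{k+1}-x_k)+A^\top(p_{k+1}-p_k),
\end{aligned}$$

where the second equality follows from (13) and the last from (7). On the other hand, it follows from (5) that

$$\begin{aligned}
0&\in\partial g(y_{k+1})-B^\top p_k-\alpha B^\top(Ax_k-By_{k+1})+\nabla\psi(y_{k+1})-\nabla\psi(y_k)\\
&=\partial g(y_{k+1})-B^\top p_{k+1}-\alpha B^\top(Ax_k-Ax_{k+1})+\nabla\psi(y_{k+1})-\nabla\psi(y_k),
\end{aligned}$$

which implies

$$\begin{aligned}
\partial_y\hat L(\hat z_{k+1})&=\partial g(y_{k+1})-B^\top p_{k+1}+\alpha B^\top(By_{k+1}-Ax_{k+1})\\
&\ni\nabla\psi(y_k)-\nabla\psi(y_{k+1})+\alpha B^\top(Ax_k-Ax_{k+1})-\alpha B^\top(Ax_{k+1}-By_{k+1})\\
&=\nabla\psi(y_k)-\nabla\psi(y_{k+1})+\alpha B^\top(Ax_k-Ax_{k+1})+B^\top(p_k-p_{k+1}).
\end{aligned}$$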