Convergence of Bregman Alternating Direction Method with Multipliers for Nonconvex Composite Problems
Abstract
The alternating direction method with multipliers (ADMM) has been one of the most powerful and successful methods for solving various convex or nonconvex composite problems that arise in the fields of image & signal processing and machine learning. In convex settings, numerous convergence results have been established for ADMM as well as its variants. However, there have been few studies of the convergence properties of ADMM under nonconvex frameworks, since the convergence analysis of nonconvex algorithms is generally very difficult. In this paper we study the Bregman modification of ADMM (BADMM), which includes the conventional ADMM as a special case and can significantly improve the performance of the algorithm. Under some assumptions, we show that the iterative sequence generated by BADMM converges to a stationary point of the associated augmented Lagrangian function. The obtained results underline the feasibility of ADMM in applications under nonconvex settings.
I Introduction
Many problems arising in the fields of signal & image processing and machine learning [5, 28] involve finding a minimizer of a composite objective function. More specifically, such problems can be formulated as
(1)  $\min_{x,y}\; f(x)+g(y)\quad\text{subject to}\quad Ax+By=b,$
where $A$ and $B$ are given matrices, and the objective combines a (quadratic or logistic) loss function with a regularizer such as the $\ell_1$ norm or an $\ell_q$ quasi-norm.
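For instance (an illustrative special case, not spelled out in the original text), the $\ell_1$-regularized least-squares (lasso) problem can be written in the form (1) by taking
$f(x)=\lambda\|x\|_{1},\qquad g(y)=\tfrac12\|Cy-d\|^{2},\qquad A=I,\ B=-I,\ b=0,$
so that the constraint $x-y=0$ simply forces the two blocks to agree.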
Because of its separable structure, problem (1) can be efficiently solved by the alternating direction method with multipliers (ADMM), which decomposes the original joint minimization problem into two easily solved subproblems. The standard ADMM for problem (1) takes the form:
(2)  $x_{k+1}\in\arg\min_{x}\,L_{\beta}(x,y_{k},p_{k}),$
(3)  $y_{k+1}\in\arg\min_{y}\,L_{\beta}(x_{k+1},y,p_{k}),$
(4)  $p_{k+1}=p_{k}+\beta\,(Ax_{k+1}+By_{k+1}-b),$
where $\beta>0$ is a penalty parameter and
$L_{\beta}(x,y,p)=f(x)+g(y)+\langle p,\,Ax+By-b\rangle+\frac{\beta}{2}\|Ax+By-b\|^{2}$
is the associated augmented Lagrangian function with multiplier $p$.
Generally speaking, the augmented Lagrangian is first minimized with respect to $x$ for fixed values of $y$ and $p$, then with respect to $y$ with $x$ and $p$ fixed, and finally the multiplier $p$ is updated by a dual ascent step with $x$ and $y$ fixed.
Updating the dual variable $p$ in the above system is a trivial task, but this is not so simple for the primal variables $x$ and $y$. Indeed, in many cases subproblems (2) and (3) cannot easily be solved. Recently, the Bregman modification of ADMM (BADMM) has been adopted by several researchers to improve the performance of the conventional ADMM algorithm [16, 35, 36, 47]. BADMM takes the following iterative form:
(5)  $x_{k+1}\in\arg\min_{x}\,\bigl\{L_{\beta}(x,y_{k},p_{k})+\Delta_{\varphi}(x,x_{k})\bigr\},$
(6)  $y_{k+1}\in\arg\min_{y}\,\bigl\{L_{\beta}(x_{k+1},y,p_{k})+\Delta_{\psi}(y,y_{k})\bigr\},$
(7)  $p_{k+1}=p_{k}+\beta\,(Ax_{k+1}+By_{k+1}-b),$
where $\Delta_{\varphi}$ and $\Delta_{\psi}$ respectively denote the Bregman distances with respect to the functions $\varphi$ and $\psi$. The difference between this algorithm and the standard ADMM is that the objective functions in (2) and (3) are replaced by the sum of a Bregman distance and the augmented Lagrangian function. Moreover, as shown in [36, 47, 26] and in the following section, an appropriate choice of Bregman distance does indeed simplify the original subproblems.
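To make the scheme concrete, here is a minimal NumPy sketch of BADMM applied to the lasso instance mentioned above (an illustration under assumed choices, not the authors' implementation): the $x$-update uses the trivial kernel $\Delta_{\varphi}=0$ and reduces to soft-thresholding, the $y$-update uses the kernel $\psi(y)=\frac{\mu}{2}\|y\|^{2}$, and $\lambda$, $\beta$, $\mu$ are illustrative parameters.

```python
import numpy as np

def soft_threshold(v, tau):
    """Closed-form minimizer of  tau*||x||_1 + 0.5*||x - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def badmm_lasso(C, d, lam=0.1, beta=1.0, mu=1.0, iters=200):
    """Hypothetical BADMM sketch for
        min  lam*||x||_1 + 0.5*||C y - d||^2   s.t.  x - y = 0,
    i.e. A = I, B = -I, b = 0 in problem (1)."""
    n = C.shape[1]
    x, y, p = np.zeros(n), np.zeros(n), np.zeros(n)
    # The y-subproblem is quadratic; its system matrix is constant across iterations.
    H = C.T @ C + (beta + mu) * np.eye(n)
    for _ in range(iters):
        # x-update (5) with Delta_phi = 0: prox of the l1 norm (soft-thresholding).
        x = soft_threshold(y - p / beta, lam / beta)
        # y-update (6) with Delta_psi(y, y_k) = (mu/2)*||y - y_k||^2.
        rhs = C.T @ d + p + beta * x + mu * y   # "mu * y" still holds y_k here
        y = np.linalg.solve(H, rhs)
        # Dual update (7): ascent step on the multiplier.
        p = p + beta * (x - y)
    return x, y, p

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C = rng.standard_normal((40, 60))
    d = rng.standard_normal(40)
    x, y, p = badmm_lasso(C, d)
    print("constraint residual:", np.linalg.norm(x - y))
```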
ADMM was introduced in the early 1970s [18, 17], and its convergence properties for convex objective functions have been extensively studied. The convergence of ADMM was first established for strongly convex functions [18, 17], before being extended to general convex functions [13, 14]. It has been shown that ADMM converges at a sublinear rate of $O(1/k)$ [20, 30], or $O(1/k^{2})$ for the accelerated version [19]; furthermore, a linear convergence rate was also shown under certain additional assumptions [12]. The convergence of BADMM for convex objective functions has also been examined with the Euclidean distance [10], the Mahalanobis distance [47], and the general Bregman distance [47].
Recent studies on nonnegative matrix factorization, distributed matrix factorization, distributed clustering, sparse zero variance discriminant analysis, polynomial optimization, tensor decomposition, and matrix completion have led to growing interest in ADMM for nonconvex objective functions (see e.g. [21, 27, 38, 44, 46]). It has been shown that the nonconvex ADMM works extremely well for these particular examples.
However, because the convergence analysis of nonconvex algorithms is generally very difficult, there have been few studies on the convergence properties of ADMM under nonconvex frameworks. One major difficulty is that the Fejér monotonicity of iterative sequences does not hold in the absence of convexity. Very recently, [22] analyzed the convergence of ADMM for certain nonconvex consensus and sharing problems. They demonstrated that, with the constraint matrices set to identity matrices, ADMM converges to the set of stationary solutions as long as the penalty parameter is sufficiently large. To show the convergence of ADMM to a stationary point, additional assumptions are required on the functions involved. For example, if the objective functions are both semialgebraic, [26] proved that ADMM converges to a stationary point when one of the constraint matrices is the identity; this result requires that one of the objective functions be strongly convex or that the other constraint matrix have full column rank.
In this paper, we study the convergence of BADMM under nonconvex frameworks. First, we extend the convergence of BADMM from semialgebraic functions to subanalytic functions. In particular, this implies that BADMM is convergent for sparse logistic loss functions, which are not semialgebraic. Second, we establish a global convergence theorem for the case when $A$ has full column rank. This allows us to choose a trivial (zero) Bregman kernel, which covers a recent result in [26]. We also study the case when $A$ does not have full column rank; in this instance, a suitable Bregman distance also leads to global convergence of BADMM. This enhanced flexibility enables the application of BADMM to more general cases. More importantly, the main idea of our convergence analysis differs from that used in [26]: instead of employing the augmented Lagrangian function at each iteration, we demonstrate global convergence using the descent property of an auxiliary function.
The paper is organized as follows. In Section II, we recall the definitions of subdifferentials, the Bregman distance, and the Kurdyka-Łojasiewicz inequality. In Section III, we establish the global convergence of BADMM to a critical point under certain assumptions. In Section IV, we conduct experimental studies to verify the convergence of BADMM.
II Preliminaries
In what follows, $\mathbb{R}^{n}$ will stand for the $n$-dimensional Euclidean space, equipped with the inner product $\langle x,y\rangle=x^{T}y$ and the induced norm $\|x\|=\sqrt{\langle x,x\rangle}$, where $(\cdot)^{T}$ stands for the transpose operation.
II-A Subdifferentials
Given a function $f:\mathbb{R}^{n}\to(-\infty,+\infty]$, we denote by $\mathrm{dom}\,f$ the domain of $f$, namely $\mathrm{dom}\,f:=\{x\in\mathbb{R}^{n}:\ f(x)<+\infty\}$. A function $f$ is said to be proper if $\mathrm{dom}\,f\neq\emptyset$, and lower semicontinuous at the point $x_{0}$ if
$\liminf_{x\to x_{0}} f(x)\ \ge\ f(x_{0}).$
If $f$ is lower semicontinuous at every point of its domain of definition, then it is simply called a lower semicontinuous function.
Definition II.1.
Let $f$ be a proper lower semicontinuous function.
(i) Given $x\in\mathrm{dom}\,f$, the Fréchet subdifferential of $f$ at $x$, written $\hat\partial f(x)$, is the set of all elements $u$ which satisfy
$\liminf_{y\to x,\;y\neq x}\ \frac{f(y)-f(x)-\langle u,\,y-x\rangle}{\|y-x\|}\ \ge\ 0.$
(ii) The limiting subdifferential, or simply subdifferential, of $f$ at $x$, written $\partial f(x)$, is defined as
$\partial f(x)=\bigl\{u:\ \exists\,x_{k}\to x,\ f(x_{k})\to f(x),\ u_{k}\in\hat\partial f(x_{k}),\ u_{k}\to u\bigr\}.$
(iii) A critical point or stationary point of $f$ is a point $x^{*}$ in the domain of $f$ satisfying $0\in\partial f(x^{*})$.
Definition II.2.
An element $(x^{*},y^{*},p^{*})$ is called a critical point or stationary point of the augmented Lagrangian function $L_{\beta}$ if it satisfies:
(8)  $-A^{T}p^{*}\in\partial f(x^{*}),\qquad -B^{T}p^{*}\in\partial g(y^{*}),\qquad Ax^{*}+By^{*}=b.$
Let us now collect some basic properties of the subdifferential (see [31]).
Proposition II.1.
Let $f$ and $g$ be proper lower semicontinuous functions.
(i) $\hat\partial f(x)\subseteq\partial f(x)$ for each $x\in\mathrm{dom}\,f$. Moreover, the first set is closed and convex, while the second is closed, but not necessarily convex.
(ii) Let $(x_{k},u_{k})$ be sequences such that $x_{k}\to x$, $u_{k}\to u$, $u_{k}\in\partial f(x_{k})$ and $f(x_{k})\to f(x)$. Then, by the definition of the subdifferential, we have $u\in\partial f(x)$.
(iii) Fermat's rule remains true: if $x$ is a local minimizer of $f$, then $x$ is a critical point or stationary point of $f$, that is, $0\in\partial f(x)$.
(iv) If $f$ is a continuously differentiable function, then $\partial(f+g)(x)=\nabla f(x)+\partial g(x)$.
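As a concrete illustration (not part of the original text), consider $f(x)=-|x|$ on $\mathbb{R}$; the two subdifferentials differ at the origin, and the limiting subdifferential there is closed but not convex:
$\hat\partial f(0)=\emptyset,\qquad \partial f(0)=\{-1,+1\},\qquad \partial f(x)=\{-\mathrm{sign}(x)\}\ \ (x\neq 0).$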
A function $f$ is said to be Lipschitz continuous (with constant $L>0$) if
$\|f(x)-f(y)\|\ \le\ L\,\|x-y\|$
for any $x,y$; and strongly convex (with modulus $\sigma>0$) if
(9)  $f(y)\ \ge\ f(x)+\langle u,\,y-x\rangle+\frac{\sigma}{2}\,\|x-y\|^{2}$
for any $x,y$ and $u\in\partial f(x)$.
II-B Kurdyka-Łojasiewicz inequality
The Kurdyka-Łojasiewicz (KL) inequality plays an important role in our subsequent analysis. This inequality was first introduced by Łojasiewicz [32] for real analytic functions, was then extended by Kurdyka [24] to smooth functions whose graph belongs to an o-minimal structure, and was recently further extended to nonsmooth subanalytic functions [3].
Definition II.3 (KL inequality).
A proper lower semicontinuous function $f$ is said to satisfy the KL inequality at $\bar x\in\mathrm{dom}\,\partial f$ if there exist $\eta\in(0,+\infty]$, a neighborhood $U$ of $\bar x$, and a function $\varphi\in\Phi_{\eta}$, such that for all $x\in U$ with $f(\bar x)<f(x)<f(\bar x)+\eta$,
$\varphi'\bigl(f(x)-f(\bar x)\bigr)\,\mathrm{dist}\bigl(0,\partial f(x)\bigr)\ \ge\ 1,$
where $\Phi_{\eta}$ stands for the class of functions $\varphi:[0,\eta)\to[0,+\infty)$ such that (a) $\varphi$ is continuous on $[0,\eta)$ with $\varphi(0)=0$; (b) $\varphi$ is smooth and concave on $(0,\eta)$; (c) $\varphi'(s)>0$ for all $s\in(0,\eta)$.
The following is an extension of the conventional KL inequality [4].
Lemma II.2 (KL inequality on compact subsets).
Let $f$ be a proper lower semicontinuous function and let $\Omega$ be a compact set. If $f$ is constant on $\Omega$ and satisfies the KL inequality at each point of $\Omega$, then there exist $\varepsilon>0$, $\eta>0$ and $\varphi\in\Phi_{\eta}$, such that for all $\bar x\in\Omega$ and for all $x$ with $\mathrm{dist}(x,\Omega)<\varepsilon$ and $f(\bar x)<f(x)<f(\bar x)+\eta$,
$\varphi'\bigl(f(x)-f(\bar x)\bigr)\,\mathrm{dist}\bigl(0,\partial f(x)\bigr)\ \ge\ 1.$
Typical functions satisfying the KL inequality include strongly convex functions, real analytic functions, semialgebraic functions and subanalytic functions.
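For instance (a simple illustration, not taken from the original text), the function $f(x)=x^{2}$ satisfies the KL inequality at $\bar x=0$ with the desingularizing function $\varphi(s)=\sqrt{s}$:
$\varphi'\bigl(f(x)-f(0)\bigr)\,\mathrm{dist}\bigl(0,\partial f(x)\bigr)=\frac{1}{2\sqrt{x^{2}}}\cdot|2x|=1\ \ge\ 1\qquad\text{for all }x\neq 0.$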
A subset $S\subseteq\mathbb{R}^{n}$ is said to be semialgebraic if it can be written as
$S=\bigcup_{j=1}^{p}\bigcap_{i=1}^{q}\bigl\{x\in\mathbb{R}^{n}:\ P_{ij}(x)=0,\ Q_{ij}(x)>0\bigr\},$
where $P_{ij},Q_{ij}$ are real polynomial functions. Then a function $f$ is called semialgebraic if its graph
$\bigl\{(x,t)\in\mathbb{R}^{n+1}:\ f(x)=t\bigr\}$
is a semialgebraic subset of $\mathbb{R}^{n+1}$. For example, the $\ell_{q}$ quasi-norm with rational exponent, the sup-norm, the Euclidean norm, and real polynomial functions are all semialgebraic functions [4, 39].
A real function on $\mathbb{R}$ is said to be analytic if it possesses derivatives of all orders and agrees with its Taylor series in a neighborhood of every point. A real function $f$ on $\mathbb{R}^{n}$ is said to be analytic if the function of one variable $t\mapsto f(x+ty)$ is analytic for any $x,y\in\mathbb{R}^{n}$. It is readily seen that real polynomial functions, such as quadratic functions, are analytic. Moreover, the smoothed $\ell_{q}$ norm and the logistic loss function are also examples of real analytic functions [39].
A subset $S\subseteq\mathbb{R}^{n}$ is said to be subanalytic if it can be written as
$S=\bigcup_{j=1}^{p}\bigcap_{i=1}^{q}\bigl\{x\in\mathbb{R}^{n}:\ P_{ij}(x)=0,\ Q_{ij}(x)>0\bigr\},$
where $P_{ij},Q_{ij}$ are real analytic functions. Then a function $f$ is called subanalytic if its graph is a subanalytic subset of $\mathbb{R}^{n+1}$. It is clear that both real analytic and semialgebraic functions are subanalytic. Generally speaking, the sum of two subanalytic functions is not necessarily subanalytic. As shown in [3, 39], for two subanalytic functions, if at least one of them maps bounded sets to bounded sets, then their sum is also subanalytic. In particular, the sum of a subanalytic function and an analytic function is subanalytic. Some subanalytic functions that are widely used are as follows:
- the $\ell_{q}$ quasi-norm with $0<q<1$ (semialgebraic, hence subanalytic);
- the logistic loss function (real analytic, hence subanalytic);
- the smoothed $\ell_{q}$ norm (real analytic, hence subanalytic);
- the sparse logistic loss, i.e., the sum of the logistic loss and an $\ell_{q}$ quasi-norm (subanalytic by the sum rule above).
II-C Bregman distance
The Bregman distance, first introduced in 1967 [6], plays an important role in various iterative algorithms. As a generalization of the squared Euclidean distance, the Bregman distance shares many of its nice properties. However, the Bregman distance is not a metric, since it satisfies neither the triangle inequality nor symmetry. For a convex differentiable function $\varphi$, the associated Bregman distance is defined as
$\Delta_{\varphi}(x,y)=\varphi(x)-\varphi(y)-\langle\nabla\varphi(y),\,x-y\rangle.$
In particular, if we let $\varphi(x)=\|x\|^{2}$ in the above, then it reduces to $\Delta_{\varphi}(x,y)=\|x-y\|^{2}$, namely the classical squared Euclidean distance. Some nontrivial examples of Bregman distances include [2]:
- Itakura-Saito distance: $\Delta_{\varphi}(x,y)=\frac{x}{y}-\log\frac{x}{y}-1$, generated by $\varphi(x)=-\log x$;
- Kullback-Leibler divergence: $\Delta_{\varphi}(x,y)=\sum_{i}\bigl(x_{i}\log\frac{x_{i}}{y_{i}}-x_{i}+y_{i}\bigr)$, generated by $\varphi(x)=\sum_{i}x_{i}\log x_{i}$;
- Mahalanobis distance: $\Delta_{\varphi}(x,y)=\frac12(x-y)^{T}M(x-y)$, generated by $\varphi(x)=\frac12 x^{T}Mx$ with $M$ a symmetric positive definite matrix.
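The following snippet evaluates the definition directly and confirms the two quadratic special cases just discussed (an illustrative check; the kernels and points are arbitrary):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman distance D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Illustrative kernels (hypothetical helpers, not from the paper).
M = np.array([[2.0, 0.5], [0.5, 1.0]])            # symmetric positive definite

sq_euclid   = (lambda v: v @ v,            lambda v: 2.0 * v)
mahalanobis = (lambda v: 0.5 * v @ M @ v,  lambda v: M @ v)

x = np.array([1.0, -2.0])
y = np.array([0.5, 1.0])

# phi(v) = ||v||^2 recovers the squared Euclidean distance ...
print(np.isclose(bregman(*sq_euclid, x, y), np.sum((x - y) ** 2)))           # True
# ... and phi(v) = 0.5 v^T M v gives the Mahalanobis distance 0.5 (x-y)^T M (x-y).
print(np.isclose(bregman(*mahalanobis, x, y), 0.5 * (x - y) @ M @ (x - y)))  # True
```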
Let us now collect some useful properties of the Bregman distance.
Proposition II.3.
Let $\varphi$ be a convex differentiable function and $\Delta_{\varphi}$ the associated Bregman distance.
- Nonnegativity: $\Delta_{\varphi}(x,y)\ge 0$ for all $x,y$.
- Convexity: $\Delta_{\varphi}(x,y)$ is convex in $x$, but not necessarily in $y$.
- Strong convexity: if $\varphi$ is $\sigma$-strongly convex, then $\Delta_{\varphi}(x,y)\ge\frac{\sigma}{2}\|x-y\|^{2}$ for all $x,y$.
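For completeness (this short verification is not in the original text), the last property follows directly from the strong convexity inequality (9) applied to $\varphi$ with $u=\nabla\varphi(y)$:
$\varphi(x)\ \ge\ \varphi(y)+\langle\nabla\varphi(y),\,x-y\rangle+\frac{\sigma}{2}\|x-y\|^{2},\qquad\text{i.e.}\qquad \Delta_{\varphi}(x,y)\ \ge\ \frac{\sigma}{2}\|x-y\|^{2}.$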
As shown below, an appropriate choice of Bregman distance simplifies the $x$- and $y$-subproblems, which in turn improves the performance of the algorithm. For example, in subproblem (5), taking $\Delta_{\varphi}=0$ leaves us to minimize the function
$f(x)+\langle p_{k},\,Ax\rangle+\frac{\beta}{2}\|Ax+By_{k}-b\|^{2}.$
In general, finding a minimizer of this function is not an easy task because of the coupling matrix $A$. However, if we take $\Delta_{\varphi}(x,x_{k})=\frac12(x-x_{k})^{T}(\mu I-\beta A^{T}A)(x-x_{k})$ with $\mu\ge\beta\|A^{T}A\|$, then the quadratic coupling term is cancelled and the subproblem is transformed into minimizing a problem of the form
$f(x)+\frac{\mu}{2}\|x-c_{k}\|^{2},\qquad c_{k}=x_{k}-\frac{1}{\mu}A^{T}\bigl(p_{k}+\beta(Ax_{k}+By_{k}-b)\bigr).$
Such a problem has a closed-form solution for common regularizers (see [40]), and thus it can be solved very easily.
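The following is a small numerical check of this reduction (a sketch under the notation above; the matrices and parameters are made up for illustration): with the kernel $\varphi(x)=\frac12 x^{T}(\mu I-\beta A^{T}A)x$, the smooth part of subproblem (5) coincides, up to an additive constant, with $\frac{\mu}{2}\|x-c_{k}\|^{2}$.

```python
import numpy as np

# Illustrative check: with phi(x) = 0.5 x^T (mu*I - beta*A^T A) x, the x-subproblem
# of (5) reduces to a pure proximal step  min_x f(x) + (mu/2)||x - c_k||^2
# up to an additive constant (f itself is untouched by this reformulation).
rng = np.random.default_rng(0)
m, n, q = 5, 8, 6
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, q))
b = rng.standard_normal(m)
x_k = rng.standard_normal(n)
y_k = rng.standard_normal(q)
p_k = rng.standard_normal(m)
beta = 1.0
mu = beta * np.linalg.norm(A.T @ A, 2) + 1.0    # mu >= beta * ||A^T A||

M = mu * np.eye(n) - beta * A.T @ A             # positive definite by the choice of mu

def smooth_part(x):
    """Smooth terms of subproblem (5), excluding f, plus the Bregman term."""
    r = A @ x + B @ y_k - b
    breg = 0.5 * (x - x_k) @ M @ (x - x_k)
    return p_k @ (A @ x) + 0.5 * beta * r @ r + breg

c_k = x_k - (A.T @ (p_k + beta * (A @ x_k + B @ y_k - b))) / mu

def prox_quadratic(x):
    return 0.5 * mu * np.sum((x - c_k) ** 2)

# The two expressions agree up to a constant, so they define the same subproblem.
xs = rng.standard_normal((3, n))
diffs = [smooth_part(x) - prox_quadratic(x) for x in xs]
print(np.allclose(diffs, diffs[0]))   # True
```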
II-D Basic assumptions
We need the following basic assumptions on problem (1). A basic assumption to guarantee the convergence of BADMM is that the matrix $B$ has full row rank.
Assumption 1.
Let $g$ be a continuously differentiable function and $f$ a proper lower semicontinuous function. Assume that the following hold.
(a) $B$ has full row rank and $A$ is injective;
(b) either the augmented Lagrangian $L_{\beta}$, as a function of $y$, or the Bregman kernel $\psi$ is strongly convex;
(c) $f+g$ is a subanalytic function, and $\nabla g$ and $\nabla\psi$ are Lipschitz continuous.
In condition (b), the strong convexity of $\psi$ is easily attained, for example by taking $\psi(y)=\|y\|^{2}$, while the strong convexity of $L_{\beta}$ in $y$ can be deduced from some standard assumptions, for example the Neumann boundary condition in image processing [15]. Condition (b) will be used to guarantee the sufficient descent property of the augmented Lagrangian function. More specifically, it implies that for some $\sigma>0$,
(10)  $L_{\beta}(x_{k+1},y_{k+1},p_{k})+\frac{\sigma}{2}\|y_{k+1}-y_{k}\|^{2}\ \le\ L_{\beta}(x_{k+1},y_{k},p_{k}),$
where $(x_{k},y_{k},p_{k})$ is the sequence generated by algorithm (5)-(7). As a matter of fact, if $L_{\beta}$ with respect to $y$ is strongly convex, then $L_{\beta}(x_{k+1},\cdot,p_{k})+\Delta_{\psi}(\cdot,y_{k})$ is also strongly convex, because $\Delta_{\psi}$ is convex by Proposition II.3. Thus the desired inequality follows from the definition of strong convexity and Proposition II.3. If $\psi$ is strongly convex, then it follows again from Proposition II.3 that
$\Delta_{\psi}(y_{k+1},y_{k})\ \ge\ \frac{\sigma}{2}\|y_{k+1}-y_{k}\|^{2},$
which together with the definition of $y_{k+1}$ yields the desired inequality.
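For completeness, here is the elementary estimate behind this claim, written in the notation introduced above (a sketch under the stated strong convexity assumption, not copied from the original): if $h(\cdot):=L_{\beta}(x_{k+1},\cdot,p_{k})+\Delta_{\psi}(\cdot,y_{k})$ is $\sigma$-strongly convex, then $y_{k+1}$ minimizes $h$, so $0\in\partial h(y_{k+1})$ and (9) give
$L_{\beta}(x_{k+1},y_{k},p_{k})=h(y_{k})\ \ge\ h(y_{k+1})+\frac{\sigma}{2}\|y_{k}-y_{k+1}\|^{2}\ \ge\ L_{\beta}(x_{k+1},y_{k+1},p_{k})+\frac{\sigma}{2}\|y_{k+1}-y_{k}\|^{2},$
where the first equality uses $\Delta_{\psi}(y_{k},y_{k})=0$ and the last inequality uses $\Delta_{\psi}\ge 0$; this is exactly (10).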
The subanalyticity required in condition (c) will be used to guarantee that the auxiliary function constructed in the following section satisfies the KL inequality. We note that all of the functions mentioned in Subsection II-B satisfy assumption (c). The Lipschitz continuity is a standard assumption for various algorithms, even in convex settings.
We also consider BADMM under another set of conditions, listed in Assumption 2 below. The only difference between Assumptions 1 and 2 is that Assumption 1 requires $A$ to have full column rank, whereas in Assumption 2 we instead assume that $\varphi$ is strongly convex. It is worth noting that one can choose trivial (zero) Bregman kernels under Assumption 1, so that BADMM includes the standard ADMM as a special case; this choice is not available under Assumption 2.
Assumption 2.
Let $g$ be a continuously differentiable function and $f$ a proper lower semicontinuous function. Assume that the following hold.
(a) $B$ has full row rank and $\varphi$ is strongly convex;
(b) either the augmented Lagrangian $L_{\beta}$, as a function of $y$, or the Bregman kernel $\psi$ is strongly convex;
(c) $f+g$ is a subanalytic function, and $\nabla g$ and $\nabla\psi$ are Lipschitz continuous.
III Convergence Analysis
In this section we prove the convergence of BADMM under the two sets of assumptions above. In both cases, the penalty parameter $\beta$ is chosen sufficiently large relative to $L_{g}$ and $L_{\psi}$, the Lipschitz constants of $\nabla g$ and $\nabla\psi$, respectively.
According to the recent work [1], the key point in the convergence analysis of nonconvex algorithms is to show the descent property of the augmented Lagrangian function. This is, however, not easily attained, since the dual variable is updated by an ascent step on the augmented Lagrangian. As an alternative, we construct an auxiliary function below, which helps us to deduce the global convergence of BADMM.
III-A The case where $A$ is injective
Lemma III.1.
Proof.
First we show that for each $k$,
(11)  $\|p_{k+1}-p_{k}\|^{2}\ \le\ c_{1}\bigl(\|y_{k+1}-y_{k}\|^{2}+\|y_{k}-y_{k-1}\|^{2}\bigr)$
for some constant $c_{1}>0$. Indeed, applying Fermat's rule to (6) yields
(12)  $0=\nabla g(y_{k+1})+B^{T}p_{k}+\beta B^{T}(Ax_{k+1}+By_{k+1}-b)+\nabla\psi(y_{k+1})-\nabla\psi(y_{k}),$
which together with (7) implies that
(13)  $B^{T}p_{k+1}=-\nabla g(y_{k+1})-\bigl(\nabla\psi(y_{k+1})-\nabla\psi(y_{k})\bigr).$
It then follows that
$\|B^{T}(p_{k+1}-p_{k})\|\ \le\ (L_{g}+L_{\psi})\|y_{k+1}-y_{k}\|+L_{\psi}\|y_{k}-y_{k-1}\|.$
Since the matrix $B$ is surjective, we have
$\|p_{k+1}-p_{k}\|\ \le\ c\,\|B^{T}(p_{k+1}-p_{k})\|$
for some $c>0$, which at once implies (11), as desired.
Lemma III.2.
If the sequence $\{(x_{k},y_{k},p_{k})\}$ generated by (5)-(7) is bounded, then
$\sum_{k}\bigl(\|x_{k+1}-x_{k}\|^{2}+\|y_{k+1}-y_{k}\|^{2}+\|p_{k+1}-p_{k}\|^{2}\bigr)<\infty.$
In particular, the sequence is asymptotically regular, namely $\|x_{k+1}-x_{k}\|+\|y_{k+1}-y_{k}\|+\|p_{k+1}-p_{k}\|\to 0$ as $k\to\infty$. Moreover, any cluster point of $\{(x_{k},y_{k},p_{k})\}$ is a stationary point of $L_{\beta}$.
Proof.
Let $\hat L_{k}$ denote the auxiliary function evaluated at the $k$-th iterate. Since $\{(x_{k},y_{k},p_{k})\}$ is bounded, there exists a subsequence converging to some element $(x^{*},y^{*},p^{*})$. By our hypothesis the objective is lower semicontinuous, which implies that $\{\hat L_{k}\}$ is bounded from below. By the previous lemma, $\{\hat L_{k}\}$ is nonincreasing, so that it is convergent; moreover, $\{L_{\beta}(x_{k},y_{k},p_{k})\}$ is also convergent.
Now fix $K\ge 1$. It then follows from Lemma III.1 that
$\sum_{k=1}^{K}\|y_{k+1}-y_{k}\|^{2}\ \le\ c\,\bigl(\hat L_{1}-\hat L_{K+1}\bigr)\ <\ \infty.$
Since $K$ is chosen arbitrarily, we have $\sum_{k}\|y_{k+1}-y_{k}\|^{2}<\infty$, which with (11) implies $\sum_{k}\|p_{k+1}-p_{k}\|^{2}<\infty$. Since $A$ is injective, it is readily seen that there exists $c>0$ so that
(15)  $\|x_{k+1}-x_{k}\|\ \le\ c\,\|Ax_{k+1}-Ax_{k}\|.$
Since, by (7), $Ax_{k}+By_{k}-b=(p_{k}-p_{k-1})/\beta$, the summability of the $y$- and $p$-increments implies that $\sum_{k}\|Ax_{k+1}-Ax_{k}\|^{2}<\infty$; hence, by (15), $\sum_{k}\|x_{k+1}-x_{k}\|^{2}<\infty$ as well, and in particular the sequence is asymptotically regular.
Let $(x^{*},y^{*},p^{*})$ be any cluster point of $\{(x_{k},y_{k},p_{k})\}$ and let $\{(x_{k_{j}},y_{k_{j}},p_{k_{j}})\}$ be a subsequence converging to it. Since $\|x_{k+1}-x_{k}\|+\|y_{k+1}-y_{k}\|+\|p_{k+1}-p_{k}\|$ tends to zero as $k\to\infty$, the shifted subsequence $\{(x_{k_{j}+1},y_{k_{j}+1},p_{k_{j}+1})\}$ has the same limit point $(x^{*},y^{*},p^{*})$. Since $\{L_{\beta}(x_{k},y_{k},p_{k})\}$ is convergent, it is not hard to see that $\{f(x_{k_{j}})+g(y_{k_{j}})\}$ is also convergent. It then follows from (5)-(7) that the corresponding optimality conditions hold at the iterates; letting $j\to\infty$ in these conditions yields (8), which implies that $(x^{*},y^{*},p^{*})$ is a stationary point. ∎
Lemma III.3.
Let . Then there exists such that for each