Alternating Direction Method of Multipliers for A Class of Nonconvex and Nonsmooth Problems with Applications to Background/Foreground Extraction
Abstract
In this paper, we study a general optimization model, which covers a large class of existing models for many applications in imaging sciences. To solve the resulting possibly nonconvex, nonsmooth and non-Lipschitz optimization problem, we adapt the alternating direction method of multipliers (ADMM) with a general dual stepsize to solve a reformulation that contains three blocks of variables, and analyze its convergence. We show that for any dual stepsize less than the golden ratio, there exists a computable threshold such that if the penalty parameter is chosen above such a threshold and the sequence thus generated by our ADMM is bounded, then any cluster point of the sequence gives a stationary point of the nonconvex optimization problem. We achieve this via a potential function specifically constructed for our ADMM. Moreover, we establish the global convergence of the whole sequence if, in addition, this special potential function is a Kurdyka-Łojasiewicz function. Furthermore, we present a simple strategy for initializing the algorithm to guarantee boundedness of the sequence. Finally, we perform numerical experiments comparing our ADMM with the proximal alternating linearized minimization (PALM) proposed in [5] on the background/foreground extraction problem with real data. The numerical results show that our ADMM with a nontrivial dual stepsize is efficient.
Keywords: Nonsmooth and nonconvex optimization; alternating direction method of multipliers; dual stepsize; background/foreground extraction
1 Introduction
In this paper, we consider the following optimization problem:
(1) 
where

are proper closed nonnegative functions, and is convex, while is possibly nonconvex, nonsmooth and non-Lipschitz;

are linear maps and , are injective.
In particular, and in (1) can be regularizers used for inducing the desired structures. For instance, can be used for inducing low rank in . One possible choice is (see the next section for notation and definitions). Alternatively, one may consider , where is a compact convex set such as with , or with ; the former choice restricts to have rank at most and makes (1) nuclear-norm-free (see [30, 33]). On the other hand, can be used for inducing sparsity. In the literature, is typically separable, i.e., taking the form
(2) 
where is a nonnegative continuous function with and is a regularization parameter. Some concrete examples of include:

fraction penalty [20]: for ;

logistic penalty [39]: for ;

smoothly clipped absolute deviation [16]: for ;

minimax concave penalty [49]: for ;

hard thresholding penalty function [17]: .
The bridge penalty and the logistic penalty have also been considered in [13]. Finally, the linear map can be suitably chosen to model different scenarios. For example, can be chosen to be the identity map for extracting and from a noisy data , and the blurring map for a blurred data . The linear map can be the identity map or some “dictionary” that spans the data space (see, for example, [34]), and can be chosen to be the identity map or the inverse of certain sparsifying transform (see, for example, [40]). More examples of (1) can be found in [9, 8, 10, 47, 13, 41].
One representative application that is frequently modeled by (1) via a suitable choice of , , , and is the background/foreground extraction problem, which is an important problem in video processing; see [6, 7] for recent surveys. In this problem, one attempts to separate the relatively static information called “background” and the moving objects called “foreground” in a video. The problem can be modeled by (1), and such models are typically referred to as RPCA-based models. In these models, each image is stacked as a column of a data matrix , the relatively static background is then modeled as a low rank matrix, while the moving foreground is modeled as sparse outliers. The data matrix is then decomposed (approximately) as the sum of a low rank matrix modeling the background and a sparse matrix modeling the foreground. Various approximations are then used to induce low rank and sparsity, resulting in different RPCA-based models, most of which take the form of (1). One example is to set to be the nuclear norm of , i.e., the sum of singular values of , to promote low rank in and to be the norm of to promote sparsity in , as in [10]. Besides convex regularizers, nonconvex models have also been widely studied recently and their performance is promising; see [13, 44] for background/foreground extraction and [4, 12, 22, 38, 39, 50] for other problems in image processing. There are also nuclear-norm-free models that do not require matrix decomposition of the matrix variable when solving them, making the model more practical especially when the size of the matrix is large. For instance, in [30], the authors set to be the norm of and to be the indicator function of . A similar approach was also adopted in [33] with promising performance.
Clearly, for nuclear-norm-free models, one can also take to be some nonconvex sparsity-inducing regularizers, resulting in a special case of (1) that has not been explicitly considered in the literature before; we will consider these models in our numerical experiments in Section 5. The above discussion shows that problem (1) is flexible enough to cover a wide range of RPCA-based models for background/foreground extraction.
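To make the data layout of RPCA-based models concrete, the following minimal Python sketch stacks each frame of a video as a column of a data matrix. It is an illustration only and is not taken from the experiments of this paper: the function name `stack_frames`, the array shapes and the synthetic video are all hypothetical. It also checks that a perfectly static scene yields a rank-one data matrix, and that a single moving pixel per frame yields a sparse residual.

```python
import numpy as np

def stack_frames(frames):
    """Stack a video of shape (T, h, w) into a data matrix of shape (h*w, T).

    Column j of the result is frame j flattened in row-major order.
    """
    frames = np.asarray(frames, dtype=float)
    return frames.reshape(frames.shape[0], -1).T

# Synthetic video (assumption for illustration): 5 identical 3x4 frames.
background = np.arange(12.0).reshape(3, 4) + 1.0
static_video = np.stack([background] * 5)
D_static = stack_frames(static_video)

# Add one bright "moving" pixel per frame to mimic a sparse foreground.
moving_video = static_video.copy()
for t in range(5):
    moving_video[t, 1, t % 4] += 10.0
D_moving = stack_frames(moving_video)

# The difference between the two data matrices is the sparse foreground part.
S_true = D_moving - D_static
```

In this toy setting, `D_static` has identical columns (hence rank one), while `S_true` has exactly one nonzero entry per column.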
Problem (1), though nonconvex in general, as we will show later in Section 3, can be reformulated into an optimization problem with three blocks of variables. Problems of this kind, containing several blocks of variables, have been widely studied in the literature; see, for example, [30, 37, 41]. Hence, it is natural to adapt the algorithm used there, namely, the alternating direction method of multipliers (ADMM), for solving (1). Classically, the ADMM can be applied to solving problems of the following form that contains 2 blocks of variables:
(3) 
where and are proper closed convex functions, and are linear operators. The iterative scheme of ADMM is
where is the dual stepsize and is the augmented Lagrangian function for (3) defined as
with being the penalty parameter. Under some mild conditions, the sequence generated by the above ADMM can be shown to converge to an optimal solution of (3); see, for example, [3, 15, 19, 21]. However, the ADMM used in [30, 37, 41] does not have a convergence guarantee; indeed, it is shown recently in [11] that the ADMM, when applied to a convex optimization problem with blocks of variables, can be divergent in general. This motivates the study of many provably convergent variants of the ADMM for convex problems with more than 2 blocks of variables; see, for example, [24, 25, 36, 35]. Recently, Hong et al. [26] established the convergence of the multi-block ADMM for certain types of nonconvex problems whose objective is a sum of a possibly nonconvex Lipschitz differentiable function and several convex nonsmooth functions when the penalty parameter is chosen above a computable threshold. The problem they considered covers (1) when is convex, or smooth and possibly nonconvex. Later, Wang et al. [44] considered a more general type of nonconvex problems that contains (1) as a special case and allows some nonconvex nonsmooth functions in the objective. To solve problems of this type, they considered a variant of the ADMM whose subproblems are simplified by adding a Bregman proximal term. However, their results cannot be applied to the direct adaptation of the ADMM for solving (1).
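As an illustration of the classical two-block scheme with a dual stepsize, consider the toy convex problem of minimizing 0.5*||x - d||^2 + lam*||y||_1 subject to x - y = 0, whose solution is the soft-thresholding of d at level lam. The sketch below is ours and rests on assumptions: the names `admm_two_block`, `beta` (penalty parameter) and `tau` (dual stepsize) are hypothetical, and we assume the sign convention in which the augmented Lagrangian is f(x) + g(y) - <z, x - y> + (beta/2)*||x - y||^2 and the multiplier is updated by z - tau*beta*(x - y).

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal mapping of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_two_block(d, lam, beta=1.0, tau=1.0, iters=500):
    """Two-block ADMM for min 0.5||x-d||^2 + lam||y||_1  s.t.  x - y = 0."""
    d = np.asarray(d, dtype=float)
    x = np.zeros_like(d)
    y = np.zeros_like(d)
    z = np.zeros_like(d)  # Lagrange multiplier
    for _ in range(iters):
        # x-update: minimize 0.5||x-d||^2 - <z, x> + (beta/2)||x - y||^2.
        x = (d + z + beta * y) / (1.0 + beta)
        # y-update: minimize lam||y||_1 + <z, y> + (beta/2)||x - y||^2.
        y = soft_threshold(x - z / beta, lam / beta)
        # Multiplier update with dual stepsize tau.
        z = z - tau * beta * (x - y)
    return x, y

d = np.array([3.0, -0.5, 1.2, -2.0])
x_unit, y_unit = admm_two_block(d, lam=1.0, tau=1.0)
x_long, y_long = admm_two_block(d, lam=1.0, tau=1.6)
```

Both calls should return iterates close to `soft_threshold(d, 1.0)`; the second one uses a dual stepsize larger than 1 but below the golden ratio.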
In this paper, following the studies in [26, 44] on convergence of nonconvex ADMM and its variant, and the recent studies in [1, 31, 45], we manage to analyze the convergence of the ADMM applied to solving the possibly nonconvex problem (1). In addition, we would like to point out that all the aforementioned nonconvex ADMM have a dual stepsize of . While it is known that the classical ADMM converges for any for convex problems, and that empirically works best (see, for example, [18, 19, 21, 36]), to our knowledge, the algorithm with a dual stepsize has never been studied in the nonconvex scenarios. Thus, we also study the ADMM with a general dual stepsize, which will allow more flexibility in the design of algorithms.
The contributions of this paper are as follows:

We show that for any positive dual stepsize less than the golden ratio, any cluster point of the sequence generated by our ADMM gives a stationary point of (1) if the penalty parameter is chosen above a computable threshold depending on , whenever the sequence is bounded. We achieve this via a potential function specifically constructed for our ADMM. To the best of our knowledge, this is the first convergence result for the ADMM in the nonconvex scenario with a possibly nontrivial dual stepsize (). This result is also new for the convex scenario for the multi-block ADMM.

We establish global convergence of the whole sequence generated by the ADMM under the additional assumption that the special potential function is a Kurdyka-Łojasiewicz function. Following the discussions in [2, Section 4], one can check that this condition is satisfied for all the aforementioned .

Furthermore, we discuss an initialization strategy to guarantee the boundedness of the sequence generated by the ADMM.
We also conduct numerical experiments to evaluate the performance of our ADMM by using different nonconvex regularizers and real data. Our computational results illustrate the efficiency of our ADMM with a nontrivial dual stepsize.
2 Notation and preliminaries
In this paper, we use to denote the set of all matrices. For a matrix , we let denote its th entry and denote its th column. The number of nonzero entries in is denoted by and the largest entry in magnitude is denoted by . Moreover, the Frobenius norm is denoted by , the nuclear norm is denoted by , which is the sum of singular values of ; and norm and quasi-norm () are given by and , respectively. Furthermore, for two matrices and of the same size, we denote their trace inner product by . Finally, for the linear map in (1), its adjoint is denoted by , while the largest (resp., smallest) eigenvalue of the linear map is denoted by (resp., ). The identity map is denoted by .
For an extended-real-valued function , we say that it is proper if for all and its domain is nonempty. For a proper function , we use the notation to denote and . Our basic (limiting) subdifferential [42, Definition 8.3] of at used in this paper, denoted by , is defined as
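For readers who wish to experiment numerically, the matrix quantities named above can be computed as in the following sketch. It is an illustration under stated assumptions: the helper name `matrix_quantities` is hypothetical, the entrywise l_p quasi-norm is taken in its standard form (sum of |x_ij|^p)^(1/p) with 0 < p < 1, and the trace inner product is computed as the sum of entrywise products.

```python
import numpy as np

def matrix_quantities(X, p=0.5):
    """Return the norms and counts named in the notation paragraph for a matrix X."""
    X = np.asarray(X, dtype=float)
    return {
        "nnz": int(np.count_nonzero(X)),          # number of nonzero entries
        "max_abs": float(np.max(np.abs(X))),      # largest entry in magnitude
        "fro": float(np.linalg.norm(X, "fro")),   # Frobenius norm
        "nuc": float(np.linalg.norm(X, "nuc")),   # nuclear norm (sum of singular values)
        "l1": float(np.sum(np.abs(X))),           # entrywise l_1 norm
        "lp": float(np.sum(np.abs(X) ** p) ** (1.0 / p)),  # entrywise l_p quasi-norm
    }

def trace_inner(X, Y):
    """Trace inner product trace(X^T Y) of two matrices of the same size."""
    return float(np.sum(np.asarray(X, dtype=float) * np.asarray(Y, dtype=float)))

X = np.array([[3.0, 0.0], [0.0, -4.0]])
q = matrix_quantities(X, p=0.5)
```

For this diagonal example, the singular values are 4 and 3, so the nuclear norm and the entrywise l_1 norm coincide.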
where denotes the Fréchet subdifferential of at , which is the set of all satisfying
From the above definition, we can easily observe that
(4) 
We also recall that when is continuously differentiable or convex, the above subdifferential coincides with the classical concept of derivative or convex subdifferential of ; see, for example, [42, Exercise 8.8] and [42, Proposition 8.12]. Moreover, from the generalized Fermat’s rule [42, Theorem 10.1], we know that if is a local minimizer of , then . Additionally, for a function with several groups of variables, we write (resp., ) for the subdifferential (resp., derivative) of with respect to the group of variables .
For a compact convex set , its indicator function is defined by
The normal cone of at the point is given by . We also use to denote the distance from to , i.e., , and to denote the unique closest point to in .
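As a concrete instance of these notions, the sketch below takes the compact convex set to be a Frobenius-norm ball of radius r centered at the origin (our choice for illustration; the function names are hypothetical) and evaluates its indicator function, the projection onto it, and the distance to it.

```python
import numpy as np

def proj_fro_ball(X, r):
    """Unique closest point to X in the ball {Y : ||Y||_F <= r}."""
    X = np.asarray(X, dtype=float)
    nrm = np.linalg.norm(X)  # Frobenius norm for 2-D arrays
    return X.copy() if nrm <= r else X * (r / nrm)

def indicator_fro_ball(X, r):
    """Indicator function: 0 inside the ball, +infinity outside."""
    return 0.0 if np.linalg.norm(np.asarray(X, dtype=float)) <= r else np.inf

def dist_fro_ball(X, r):
    """Distance from X to the ball, computed via the projection."""
    X = np.asarray(X, dtype=float)
    return float(np.linalg.norm(X - proj_fro_ball(X, r)))

X = np.array([[3.0, 4.0]])      # ||X||_F = 5
P = proj_fro_ball(X, 1.0)       # rescaled onto the unit sphere
```

Points already inside the ball are left unchanged by the projection and have distance zero.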
Next, we recall the Kurdyka-Łojasiewicz (KL) property, which plays an important role in our global convergence analysis. For notational simplicity, we use () to denote the class of concave functions satisfying: (1) ; (2) is continuously differentiable on and continuous at ; (3) for all . Then the KL property can be described as follows.
Definition (KL property and KL function). Let be a proper lower semicontinuous function.

For , if there exist an , a neighborhood of and a function such that for all , it holds that
then is said to have the Kurdyka-Łojasiewicz (KL) property at .

If satisfies the KL property at each point of , then is called a KL function.
We refer the interested readers to [2] and references therein for examples of KL functions. We also recall the following uniformized KL property, which was established in [5, Lemma 6].
[Uniformized KL property] Suppose that is a proper lower semicontinuous function and is a compact set. If on for some constant and satisfies the KL property at each point of , then there exist , and such that
for all .
Before ending this section, we discuss first-order necessary conditions for (1). First, recall that (1) is the same as
Hence, from [42, Theorem 10.1], we have at any local minimizer of (1). On the other hand, from [42, Exercise 8.8] and [42, Proposition 10.5], we see that
Consequently, the first-order necessary conditions of (1) at the local minimizer are given by:
(5) 
In this paper, we say that is a stationary point of (1) if satisfies (5) in place of .
3 Alternating direction method of multipliers
In this section, we present an ADMM for solving (1), which can be equivalently written as
(6) 
To describe the iterates of the ADMM, we first introduce the augmented Lagrangian function of the above optimization problem:
where is the Lagrangian multiplier and is the penalty parameter. The ADMM for solving (6) (equivalently (1)) is then presented as follows:
Algorithm 1 ADMM for solving (6) Input: Initial point , dual stepsize parameter , penalty parameter , while a termination criterion is not met, do Step 1. Set
Compared with the ADMM considered in [26], the above algorithm has an extra dual stepsize parameter in the update. Such a dual stepsize was introduced in [19, 21] for the classical ADMM (i.e., for convex problems with two separate blocks of variables), and was further studied in [48, 18, 43, 36] for other variants of the ADMM. Numerically, it was also demonstrated in [43] that a larger dual stepsize () results in faster convergence for the convex problems they consider. Thus, we adapt this dual stepsize in our algorithm above. Surprisingly, in our numerical experiments, a parameter choice of leads to the worst performance for our nonconvex problems.
When , the above algorithm is a special case of the general algorithm studied in [26] when and are smooth functions, or convex nonsmooth functions. The algorithm is shown to converge when is chosen above a computable threshold. However, their convergence result cannot be directly applied when or when is nonsmooth and nonconvex. Nevertheless, following their analysis and the related studies [31, 45, 44], the above algorithm can be shown to be convergent under suitable assumptions. We will present the convergence analysis in Section 4.
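To fix ideas, the following Python sketch runs a three-block ADMM with a dual stepsize on a small fully convex instance. Everything in it is an assumption made for illustration rather than a transcription of Algorithm 1: we take all three linear maps to be the identity, introduce an auxiliary block Z with the constraint L + S = Z, choose the first regularizer as the indicator function of the box {lo <= L_ij <= hi} and the second as lam times the entrywise l_1 norm, and use the augmented Lagrangian with the term -<Lam, L + S - Z> + (beta/2)*||L + S - Z||_F^2. The names `admm_three_block`, `lam`, `lo`, `hi`, `beta` and `tau` are hypothetical, and the value of `beta` is an arbitrary choice, not the computable threshold from our analysis.

```python
import numpy as np

def soft_threshold(V, t):
    """Entrywise soft-thresholding: the proximal mapping of t * ||.||_1."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def admm_three_block(D, lam=0.5, lo=0.0, hi=1.0, beta=2.0, tau=1.0, iters=300):
    """ADMM sketch for
        min  box_indicator(L) + lam*||S||_1 + 0.5*||D - Z||_F^2   s.t.  L + S = Z.
    """
    D = np.asarray(D, dtype=float)
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    Z = np.zeros_like(D)
    Lam = np.zeros_like(D)  # Lagrange multiplier
    for _ in range(iters):
        # L-update: project Z - S + Lam/beta onto the box.
        L = np.clip(Z - S + Lam / beta, lo, hi)
        # S-update: proximal mapping of (lam/beta)*||.||_1 at Z - L + Lam/beta.
        S = soft_threshold(Z - L + Lam / beta, lam / beta)
        # Z-update: solve (1 + beta) Z = D - Lam + beta (L + S).
        Z = (D - Lam + beta * (L + S)) / (1.0 + beta)
        # Multiplier update with dual stepsize tau.
        Lam = Lam - tau * beta * (L + S - Z)
    return L, S, Z, Lam

D = np.array([[3.0, 0.5], [-2.0, 0.2]])
L, S, Z, Lam = admm_three_block(D)
```

On this separable toy instance the minimizer can be worked out entry by entry, which makes it convenient for checking an implementation; replacing the box by another simple set or the l_1 norm by a nonconvex penalty changes only the L-update and the S-update.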
Before ending this section, we further discuss the three subproblems in Algorithm 1. First, notice that the update and update are given by
In general, these two subproblems are not easy to solve. However, when and are chosen to be some common regularizers used in the literature, for example, and , then these subproblems can be solved efficiently via the proximal gradient method. Additionally, when with being a closed convex set and , the update can be given explicitly by
which can be computed efficiently if is simple, for example, when for some . For the update, when is given by (2) with being one of the penalty functions presented in the introduction and , it can be solved efficiently via a simple root-finding procedure. Finally, from the optimality conditions of (9), the can be obtained by solving the following linear system
whose complexity would depend on the choice of in our model (1). For example, when is just the identity map, the is given explicitly by
4 Convergence analysis
In this section, we discuss the convergence of Algorithm 1 for . We first present the first-order optimality conditions for the subproblems in Algorithm 1 as follows, which will be used repeatedly in our convergence analysis below.
(10)  
(11)  
(12)  
(13) 
Our convergence analysis is largely based on the following potential function:
where
(14) 
Note that is a convex and nonnegative function on . Thus, for any , we have for , and the equality holds when (so that ).
Our convergence analysis also relies on the following assumption.
Assumption 4.1
, , , , and satisfy

for some and for some ;

is continuous in its domain;

the first iterate satisfies
Remark 4.1 (Note on Assumption 4.1)
(i) Since and in (1) are injective, (a1) holds trivially; (ii) (a2) holds for many common regularizers (for example, the nuclear norm) or the indicator function of a set; (iii) (a3) places conditions on the first iterate of the algorithm. It is not hard to observe that this assumption holds trivially if both and are coercive, i.e., if . We will discuss more sufficient conditions for this assumption after our convergence results, i.e., after Theorem 4.3.
We now start our convergence analysis by proving the following preparatory lemma, which states that the potential function is decreasing along the sequence generated from Algorithm 1 if the penalty parameter is chosen above a computable threshold.
Suppose that and is a sequence generated by Algorithm 1. If (a1) in Assumption 4.1 holds, then for , we have
(15) 
Moreover, if , then the sequence , is decreasing.
Proof. We start our proof by noticing that
(16)  
where the last equality follows from (13). We next derive an upper bound of . To proceed, we first note from (12) that
where the second equality follows from (13). Hence, for ,
(17)  
We now consider two separate cases: and .

For , it follows from the convexity of that
We further add to both sides of the above inequality and simplify the resulting inequality to get
(18) where the last equality follows from (13).
Thus, for , combining (18), (19) and recalling the definition of in (14), we have
(20) 
Next, note that the function is strongly convex with modulus at least . Using this fact and the definition of as a minimizer in (9), we see that
(21) 
Moreover, using the fact that is a minimizer in (8), we have
(22) 
Finally, note that is strongly convex with modulus at least from (a1) in Assumption 4.1. From this, we can similarly obtain