Nonconvex Nonsmooth Low-Rank Minimization via Iteratively Reweighted Nuclear Norm
The nuclear norm is widely used as a convex surrogate of the rank function in compressive sensing for low rank matrix recovery, with applications in image recovery and signal processing. However, solving the nuclear norm based relaxed convex problem usually leads to a suboptimal solution of the original rank minimization problem. In this paper, we propose to apply a family of nonconvex surrogates of the $\ell_0$-norm to the singular values of a matrix to approximate the rank function. This leads to a nonconvex nonsmooth minimization problem, which we propose to solve by the Iteratively Reweighted Nuclear Norm (IRNN) algorithm. IRNN iteratively solves a Weighted Singular Value Thresholding (WSVT) problem, which has a closed form solution due to the special properties of the nonconvex surrogate functions. We also extend IRNN to solve the nonconvex problem with two or more blocks of variables. In theory, we prove that IRNN decreases the objective function value monotonically, and any limit point is a stationary point. Extensive experiments on both synthesized data and real images demonstrate that IRNN enhances the low-rank matrix recovery compared with state-of-the-art convex algorithms.
Benefiting from the success of Compressive Sensing (CS), sparse and low rank matrix structures have attracted considerable research interest from the computer vision and machine learning communities. Many applications exploit these two structures. For instance, sparse coding has been widely used for face recognition, image classification and super-resolution, while low rank models are applied to background modeling, motion segmentation [7, 8] and collaborative filtering.
Conventional CS recovery uses the $\ell_1$-norm, i.e., $\|x\|_1 = \sum_i |x_i|$, as the surrogate of the $\ell_0$-norm, i.e., $\|x\|_0 = \#\{i : x_i \neq 0\}$, and the resulting convex problem can be solved by fast first-order solvers [10, 11]. Although for certain problems the $\ell_1$-minimization is equivalent to the $\ell_0$-minimization under certain incoherence conditions, the solution obtained by $\ell_1$-minimization is usually suboptimal for the original $\ell_0$-minimization, since the $\ell_1$-norm is a loose approximation of the $\ell_0$-norm. This motivates approximating the $\ell_0$-norm by nonconvex continuous surrogate functions. Many nonconvex surrogates of the $\ell_0$-norm have been proposed, including the $\ell_p$-norm ($0 < p < 1$), Smoothly Clipped Absolute Deviation (SCAD), Logarithm, Minimax Concave Penalty (MCP), Capped $\ell_1$, Exponential-Type Penalty (ETP), Geman and Laplace. We summarize their definitions in Table I and visualize them in Figure 1. Numerical studies [21, 22] have shown that nonconvex sparse optimization usually outperforms convex models in the areas of signal recovery, error correction and image processing.
| Penalty | Formula $g_\lambda(\theta)$, $\theta \ge 0$, $\lambda > 0$ | Supergradient $\partial g_\lambda(\theta)$ |
The low rank structure of a matrix is the sparsity defined on its singular values. A particularly interesting model is the low rank matrix recovery problem
$$\min_{X} \; \lambda\,\mathrm{rank}(X) + \frac{1}{2}\|\mathcal{A}(X) - b\|_F^2, \tag{1}$$
where $\mathcal{A}$ is a linear mapping and $\lambda > 0$. The above low rank minimization problem arises in many computer vision tasks such as multiple category classification, matrix completion, multi-task learning and low-rank representation with squared loss for subspace segmentation. Similar to the $\ell_0$-minimization, the rank minimization problem (1) is also challenging to solve. Thus, the rank function is usually replaced by the convex nuclear norm, $\|X\|_* = \sum_i \sigma_i(X)$, where $\sigma_i(X)$'s denote the singular values of $X$. This leads to a relaxed convex formulation of (1):
$$\min_{X} \; \lambda\|X\|_* + \frac{1}{2}\|\mathcal{A}(X) - b\|_F^2. \tag{2}$$
The above convex problem can be efficiently solved by many known solvers [26, 27]. However, the solution obtained by solving (2) is usually suboptimal for (1), since the nuclear norm is also a loose approximation of the rank function. Such a phenomenon is similar to the difference between the $\ell_1$-norm and the $\ell_0$-norm for sparse vector recovery. However, unlike the nonconvex surrogates of the $\ell_0$-norm, nonconvex rank surrogates and the corresponding optimization solvers have not been well studied before.
In this paper, to achieve a better approximation of the rank function, we extend the nonconvex surrogates of the $\ell_0$-norm shown in Table I onto the singular values of the matrix, and show how to solve the following general nonconvex nonsmooth low rank minimization problem
$$\min_{X \in \mathbb{R}^{m \times n}} F(X) = \sum_{i=1}^{m} g_\lambda(\sigma_i(X)) + f(X), \tag{3}$$
where $\sigma_i(X)$ denotes the $i$-th singular value of $X \in \mathbb{R}^{m \times n}$ (we assume that $m \le n$ in this work). The penalty function $g_\lambda$ and loss function $f$ satisfy the following assumptions:
A1. $g_\lambda: \mathbb{R}^+ \to \mathbb{R}^+$ is continuous, concave and monotonically increasing on $[0, \infty)$. It is possibly nonsmooth.
A2. $f: \mathbb{R}^{m \times n} \to \mathbb{R}^+$ is a smooth function of type $C^{1,1}$, i.e., the gradient is Lipschitz continuous,
$$\|\nabla f(X) - \nabla f(Y)\|_F \le L(f)\,\|X - Y\|_F, \tag{4}$$
for any $X, Y \in \mathbb{R}^{m \times n}$; $L(f) > 0$ is called the Lipschitz constant of $\nabla f$. $f$ is possibly nonconvex.
Note that problem (3) is very general. All the nonconvex surrogates of the $\ell_0$-norm in Table I satisfy the assumption A1, so $\sum_{i=1}^{m} g_\lambda(\sigma_i(X))$ is a nonconvex surrogate of the rank function. (The singular values of a matrix are always nonnegative, so we only consider the nonconvex $g_\lambda$ defined on $\mathbb{R}^+$.) It is expected to approximate the rank function better than the convex nuclear norm. To see this more intuitively, we show the balls of constant penalties for a symmetric matrix in Figure 2. For the loss function $f$ in assumption A2, the most widely used one is the squared loss $\frac{1}{2}\|\mathcal{A}(X) - b\|_F^2$.
Some related work also considers nonconvex rank surrogates, but it differs from this work. The works [28, 29] extend the $\ell_p$-norm of a vector to the Schatten-$p$ norm ($0 < p < 1$) and use the iteratively reweighted least squares (IRLS) algorithm to solve the nonconvex rank minimization problem with affine constraint. IRLS is also applied to the unconstrained problem with the smoothed Schatten-$p$ norm regularizer. However, the solution obtained by IRLS may not be naturally of low rank, or it may require many iterations to reach a low rank solution. One may perform singular value thresholding to achieve a low rank solution, but there is no theoretically sound rule for choosing a correct threshold. Another nonconvex rank surrogate is the truncated nuclear norm. The alternating updating optimization algorithm proposed for it may not be efficient due to double loops of iterations, and it cannot be applied to solve (3). The nonconvex low rank matrix completion problem considered in prior work is a special case of our problem (3), and our solver shown later for (3) is much more general. Another line of work uses the nonconvex log-det heuristic for image recovery, but its augmented Lagrangian multiplier based solver lacks a convergence guarantee. A possible method to solve (3) is the proximal gradient algorithm, which requires computing the proximal mapping of the nonconvex function $g_\lambda$. However, computing the proximal mapping requires solving a nonconvex problem exactly. To the best of our knowledge, without additional assumptions on $g_\lambda$ (e.g., the convexity of $\nabla g_\lambda$), there does not exist a general solver for computing the proximal mapping of the general nonconvex $g_\lambda$ in assumption A1.
In this work, we observe that all the existing nonconvex surrogates in Table I are concave and monotonically increasing on $[0, \infty)$. Thus their gradients (or supergradients at the nonsmooth points) are nonnegative and monotonically decreasing. Based on this key fact, we propose an Iteratively Reweighted Nuclear Norm (IRNN) algorithm to solve (3). It computes the proximal operator of the weighted nuclear norm, which has a closed form solution due to the nonnegative and monotonically decreasing supergradients. The cost is the same as that of the singular value thresholding widely used in convex nuclear norm minimization. In theory, we prove that IRNN monotonically decreases the objective function value and any limit point is a stationary point.
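Furthermore, note that problem (3) contains only one block of variables. However, some work aims at finding several low rank matrices simultaneously. So we further extend IRNN to solve the following problem with $p \ge 2$ blocks of variables
$$\min_{X} F(X) = \sum_{j=1}^{p} \sum_{i=1}^{m_j} g_j(\sigma_i(X_j)) + f(X), \tag{5}$$
where $X = \{X_1, \dots, X_p\}$, $X_j \in \mathbb{R}^{m_j \times n_j}$ (assume $m_j \le n_j$), $g_j$'s satisfy the assumption A1, and $\nabla f$ is Lipschitz continuous in the following sense.
Definition 1. Let $f: \mathbb{R}^{n_1} \times \dots \times \mathbb{R}^{n_p} \to \mathbb{R}$ be differentiable. Then $\nabla f$ is called Lipschitz continuous if there exist $L_i(f) > 0$, $i = 1, \dots, p$, such that
$$\left| f(x) - f(y) - \langle \nabla f(y), x - y \rangle \right| \le \sum_{i=1}^{p} \frac{L_i(f)}{2} \|x_i - y_i\|_2^2, \tag{6}$$
for any $x = [x_1; \dots; x_p]$ and $y = [y_1; \dots; y_p]$ with $x_i, y_i \in \mathbb{R}^{n_i}$. We call $L_i(f)$'s the Lipschitz constants of $\nabla f$.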
Note that the Lipschitz continuity of the multivariable function is crucial for the extension of IRNN to (5). This definition is new, and it is different from the one-block case defined in (4). For $p = 1$, (6) holds if (4) holds (see Lemma 1.2.3 of the cited reference), which motivates the above definition. The converse is not guaranteed: (6) does not imply (4). This difference makes the extension of IRNN to problem (5) nontrivial. A widely used function which satisfies (6) is $f(x) = \frac{1}{2}\left\|\sum_{i=1}^{p} A_i x_i - b\right\|_2^2$. Its Lipschitz constants are $L_i(f) = p\|A_i\|_2^2$, $i = 1, \dots, p$, where $\|A\|_2$ denotes the spectral norm of the matrix $A$. This is easy to verify by using the property $\left\|\sum_{i=1}^{p} x_i\right\|_2^2 \le p \sum_{i=1}^{p} \|x_i\|_2^2$, where the $x_i$'s are of compatible size.
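The verification mentioned above can be written out in a few lines. The following derivation is a sketch under the assumption that the example function is the least squares loss $f(x) = \frac{1}{2}\|\sum_i A_i x_i - b\|_2^2$ and that $d_i = x_i - y_i$ is shorthand introduced here for illustration.

```latex
% f is quadratic, so its first-order Taylor remainder is exact:
\begin{aligned}
f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle
  &= \frac{1}{2}\Big\| \sum_{i=1}^{p} A_i d_i \Big\|_2^2,
  \qquad d_i = x_i - y_i, \\
  &\le \frac{p}{2} \sum_{i=1}^{p} \| A_i d_i \|_2^2
  && \text{(since } \|\textstyle\sum_i z_i\|_2^2 \le p \sum_i \|z_i\|_2^2 \text{)} \\
  &\le \sum_{i=1}^{p} \frac{p\,\|A_i\|_2^2}{2}\, \| x_i - y_i \|_2^2 .
\end{aligned}
```

Comparing the last line with the right-hand side of the block-wise Lipschitz condition gives the constants $p\|A_i\|_2^2$.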
In theory, we prove that IRNN for (5) also has the convergence guarantee. In practice, we propose a new nonconvex low rank tensor representation problem which is a special case of (5) for subspace clustering. The results demonstrate the effectiveness of nonconvex models over the convex counterpart.
In summary, the contributions of this paper are as follows.
We further extend IRNN to solve the nonconvex nonsmooth low rank minimization problem (5) with blocks of variables. Note that such an extension is nontrivial based on our new definition of Lipschitz continuity of the multivariable function in (6). In theory, we prove that IRNN converges with decreasing objective function values and any limit point is a stationary point.
For applications, we apply the nonconvex low rank models on image recovery and subspace clustering. Extensive experiments on both synthesized and real-world data well demonstrate the effectiveness of the nonconvex models.
The remainder of this paper is organized as follows: Section II presents the IRNN method for solving problem (3). Section III extends IRNN for solving problem (5) and provides the convergence analysis. The experimental results are presented in Section IV. Finally, we conclude this paper in Section V.
II Nonconvex Nonsmooth Low-Rank Minimization
In this section, we show how to solve the general problem (3). Note that $g_\lambda$ in (3) is not necessarily smooth. A known example is the Capped $\ell_1$ norm, see Figure 1. To handle the nonsmooth penalty $g_\lambda$, we first introduce the concept of the supergradient of a concave function.
II-A Supergradient of a Concave Function
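If $g$ is convex but nonsmooth, its subgradient $u$ at $x$ is defined by
$$g(x) + \langle u, y - x \rangle \le g(y). \tag{7}$$
If $g$ is concave and differentiable at $x$, it is known that
$$g(x) + \langle \nabla g(x), y - x \rangle \ge g(y). \tag{8}$$
Definition 2. Let $g: \mathbb{R}^n \to \mathbb{R}$ be concave. A vector $v$ is a supergradient of $g$ at the point $x \in \mathbb{R}^n$ if for every $y \in \mathbb{R}^n$, the following inequality holds
$$g(x) + \langle v, y - x \rangle \ge g(y). \tag{9}$$
The supergradient at a nonsmooth point may not be unique. All supergradients of $g$ at $x$ are called the superdifferential of $g$ at $x$. We denote the set of all the supergradients at $x$ as $\partial g(x)$. If $g$ is differentiable at $x$, then $\nabla g(x)$ is the unique supergradient, i.e., $\partial g(x) = \{\nabla g(x)\}$. Figure 3 illustrates the supergradients of a concave function at both differentiable and nondifferentiable points.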
For concave $g$, $-g$ is convex, and vice versa. From this fact, we have the following relationship between the supergradient of $g$ and the subgradient of $-g$.
Lemma 1. Let $g(x)$ be concave and $h(x) = -g(x)$. For any $v \in \partial g(x)$, $u = -v \in \partial h(x)$, and vice versa.
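It is trivial to prove the above fact by using (7) and (9). The relationship between the supergradient and the subgradient shown in Lemma 1 is useful for exploring some properties of the supergradient. It is known that the subdifferential of a convex function $h$ is a monotone operator, i.e.,
$$\langle u - v, x - y \rangle \ge 0, \tag{10}$$
for any $u \in \partial h(x)$, $v \in \partial h(y)$. Now we show that the superdifferential of a concave function is an antimonotone operator.
Lemma 2. The superdifferential of a concave function $g$ is an antimonotone operator, i.e.,
$$\langle u - v, x - y \rangle \le 0, \tag{11}$$
for any $u \in \partial g(x)$ and $v \in \partial g(y)$.
This follows from Lemma 1 and (10). In particular, for a scalar function $g$ satisfying the assumption A1, Lemma 2 implies that
$$u \ge v, \quad \text{for any } u \in \partial g(x) \text{ and } v \in \partial g(y), \tag{12}$$
when $x < y$. That is to say, the supergradient of $g$ is monotonically decreasing on $[0, \infty)$. The supergradients of some usual concave functions are shown in Table I. We also visualize them in Figure 1. Note that for the $\ell_p$ penalty, we further define that $\partial g(0) = \{+\infty\}$. This will not affect our algorithm and convergence analysis as shown later. The Capped $\ell_1$ penalty is nonsmooth at $\theta = \gamma$ with its superdifferential $\partial g_\lambda(\gamma) = [0, \lambda]$.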
II-B Iteratively Reweighted Nuclear Norm Algorithm
In this subsection, based on the above concept of the supergradient of a concave function, we show how to solve the general nonconvex and possibly nonsmooth problem (3). For simplicity of notation, we denote $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_m \ge 0$ as the singular values of $X$. The variable $X$ in the $k$-th iteration is denoted as $X^k$, and $\sigma_i^k = \sigma_i(X^k)$ is the $i$-th singular value of $X^k$.
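In assumption A1, $g_\lambda$ is concave on $[0, \infty)$. So, by the definition (9) of the supergradient, we have
$$g_\lambda(\sigma_i) \le g_\lambda(\sigma_i^k) + w_i^k (\sigma_i - \sigma_i^k), \tag{13}$$
where
$$w_i^k \in \partial g_\lambda(\sigma_i^k). \tag{14}$$
Since $\sigma_1^k \ge \sigma_2^k \ge \dots \ge \sigma_m^k \ge 0$, by the antimonotone property of supergradient (12), we have
$$0 \le w_1^k \le w_2^k \le \dots \le w_m^k. \tag{15}$$
In (15), the nonnegativeness of $w_i^k$'s is due to the monotonically increasing property of $g_\lambda$ in assumption A1. As we will see later, property (15) plays an important role for solving the subproblem of our proposed IRNN.
Motivated by (13), one may use its right-hand side as a surrogate of $g_\lambda(\sigma_i)$ in (3) and update $X^{k+1}$ by solving the weighted nuclear norm problem
$$X^{k+1} = \arg\min_{X} \sum_{i=1}^{m} w_i^k \sigma_i + f(X), \tag{16}$$
which is the matrix analogue of the weighted $\ell_1$-norm problem used in iteratively reweighted $\ell_1$ minimization,
$$\min_{x} \sum_{i} w_i^k |x_i| + l(x). \tag{17}$$
However, the weighted nuclear norm in (16) is nonconvex (it is convex if and only if $w_1^k \ge w_2^k \ge \dots \ge w_m^k \ge 0$), while the weighted $\ell_1$-norm in (17) is convex. For convex $f$ in (16) and $l$ in (17), solving the nonconvex problem (16) is much more challenging than the convex weighted $\ell_1$-norm problem. In fact, it is not easier than solving the original problem (3).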
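Algorithm 1: IRNN for problem (3)
Input: $\mu > L(f)$ - A Lipschitz constant of $\nabla f$.
Initialize: $k = 0$, $X^k$, and $w_i^k$, $i = 1, \dots, m$.
while not converged do
  Update $X^{k+1}$ by solving problem (20).
  Update the weights $w_i^{k+1}$, $i = 1, \dots, m$, by $w_i^{k+1} \in \partial g_\lambda(\sigma_i(X^{k+1}))$.
end while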
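Instead of updating $X^{k+1}$ by solving (16), we linearize $f(X)$ at $X^k$ and add a proximal term:
$$X^{k+1} = \arg\min_{X} \sum_{i=1}^{m} w_i^k \sigma_i + f(X^k) + \langle \nabla f(X^k), X - X^k \rangle + \frac{\mu}{2}\|X - X^k\|_F^2 = \arg\min_{X} \sum_{i=1}^{m} w_i^k \sigma_i + \frac{\mu}{2}\left\|X - \left(X^k - \frac{1}{\mu}\nabla f(X^k)\right)\right\|_F^2, \tag{20}$$
where $\mu > L(f)$.
Lemma 3 [39, Theorem 2.3]. For any $\lambda > 0$, $Y \in \mathbb{R}^{m \times n}$ and $0 \le w_1 \le w_2 \le \dots \le w_s$ ($s = \min(m, n)$), a globally optimal solution to the following problem
$$\min_{X} \; \lambda \sum_{i=1}^{s} w_i \sigma_i(X) + \frac{1}{2}\|X - Y\|_F^2 \tag{21}$$
is given by the Weighted Singular Value Thresholding (WSVT)
$$X^* = U \mathcal{S}_{\lambda w}(\Sigma) V^T, \tag{22}$$
where $Y = U \Sigma V^T$ is the SVD of $Y$, and $\mathcal{S}_{\lambda w}(\Sigma) = \mathrm{Diag}\{(\Sigma_{ii} - \lambda w_i)_+\}$.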
From Lemma 3, it can be seen that (15) plays an important role in solving (20) by using (22), and that it holds for all $g_\lambda$ satisfying the assumption A1. If $g_\lambda(x) = \lambda x$, then $\sum_{i=1}^{m} g_\lambda(\sigma_i)$ reduces to the convex nuclear norm $\lambda\|X\|_*$. In this case, $w_i^k = \lambda$ for all $i = 1, \dots, m$. Then WSVT reduces to the conventional Singular Value Thresholding (SVT), which is an important subroutine in convex low rank optimization. The updating rule (20) then reduces to the known proximal gradient method.
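The closed form solution described above is short to implement. The following NumPy sketch illustrates weighted singular value thresholding under the assumption that the weights are nonnegative and nondecreasing; the function name `wsvt` and its argument names are hypothetical and not taken from any released code.

```python
import numpy as np

def wsvt(Y, weights, lam=1.0):
    """Weighted singular value thresholding (illustrative sketch).

    Returns U * diag(max(s_i - lam * w_i, 0)) * V^T, where Y = U diag(s) V^T.
    The weights must be nonnegative and nondecreasing, matching the ordering
    required for the closed form to be globally optimal.
    """
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or np.any(np.diff(w) < 0):
        raise ValueError("weights must be nonnegative and nondecreasing")
    # Thin SVD; NumPy returns singular values in nonincreasing order.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if w.shape != s.shape:
        raise ValueError("need one weight per singular value")
    s_shrunk = np.maximum(s - lam * w, 0.0)
    return (U * s_shrunk) @ Vt
```

With all weights equal, this reduces to ordinary singular value thresholding. Larger weights on the smaller singular values shrink them more aggressively, which is what promotes low rank.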
After updating $X^{k+1}$ by solving (20), we then update the weights $w_i^{k+1} \in \partial g_\lambda(\sigma_i(X^{k+1}))$, $i = 1, \dots, m$. Iteratively updating $X^{k+1}$ and the weights corresponding to its singular values leads to the proposed Iteratively Reweighted Nuclear Norm (IRNN) algorithm. The whole procedure of IRNN is shown in Algorithm 1. If the Lipschitz constant $L(f)$ is not known or computable, the backtracking rule can be used to estimate $\mu$ in each iteration.
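As an illustration of the whole procedure, the sketch below runs the reweighting loop on a matrix completion loss with the Logarithm penalty. It is a minimal sketch, not the authors' implementation: the loss $\frac{1}{2}\|\mathcal{P}_\Omega(X - M)\|_F^2$, the penalty formula, the zero initialization, the fixed iteration count and all default parameter values are assumptions made for the example.

```python
import numpy as np

def irnn_matrix_completion(M, mask, lam=1.0, gamma=10.0, mu=1.1, n_iter=100):
    """Illustrative IRNN loop for 0.5 * ||mask * (X - M)||_F^2 plus a log penalty.

    mu must exceed the Lipschitz constant of the loss gradient (1 for this loss).
    Returns the final iterate and the objective value after every iteration.
    """
    def penalty(sigma):
        # Assumed Logarithm surrogate: lam / log(gamma + 1) * log(gamma * sigma + 1).
        return lam / np.log(gamma + 1.0) * np.log(gamma * sigma + 1.0)

    def supergrad(sigma):
        # Derivative of the Logarithm surrogate; nonnegative and decreasing.
        return gamma * lam / ((gamma * sigma + 1.0) * np.log(gamma + 1.0))

    X = np.zeros_like(M, dtype=float)
    # Weights at the singular values of the zero initial point.
    w = supergrad(np.zeros(min(M.shape)))
    history = []
    for _ in range(n_iter):
        grad = mask * (X - M)                      # gradient of the loss at X^k
        U, s, Vt = np.linalg.svd(X - grad / mu, full_matrices=False)
        s_new = np.maximum(s - w / mu, 0.0)        # weighted thresholding step
        X = (U * s_new) @ Vt
        w = supergrad(s_new)                       # reweight from new singular values
        loss = 0.5 * np.sum((mask * (X - M)) ** 2)
        history.append(float(np.sum(penalty(s_new)) + loss))
    return X, history
```

Because the singular values are sorted in nonincreasing order and the supergradient is decreasing, the new weights are automatically nondecreasing, so the thresholding step stays in the regime where the closed form applies.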
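It is worth mentioning that for the $\ell_p$ penalty, if $\sigma_i^k = 0$, then $w_i^k \in \partial g_\lambda(\sigma_i^k) = \{+\infty\}$. By the updating rule of $X^{k+1}$ in (20), we have $\sigma_i^{k+1} = 0$. This guarantees that the rank of the sequence $\{X^k\}$ is nonincreasing.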
In theory, we can prove that IRNN converges. Since IRNN is a special case of IRNN with Parallel Splitting (IRNN-PS) in Section III, we only give the convergence results of IRNN-PS later.
At the end of this section, we would like to remark some more differences between previous work and ours.
Our IRNN and IRNN-PS for nonconvex low rank minimization are different from previous iteratively reweighted solvers for nonconvex sparse minimization, e.g., [21, 30]. The key difference is that the weighted nuclear norm regularized problem is nonconvex, while the weighted $\ell_1$-norm regularized problem is convex. This makes the convergence analysis different.
III Extensions of IRNN and the Convergence Analysis
In this section, we extend IRNN to solve two types of problems which are more general than (3). The first type consists of problems similar to (3) but with more general nonconvex penalties. The second is problem (5), which has $p \ge 2$ blocks of variables.
III-A IRNN for the Problems with More General Nonconvex Penalties
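IRNN can be extended to solve the following problem
$$\min_{X} \sum_{i=1}^{m} g_i(\sigma_i(X)) + f(X), \tag{23}$$
where $g_i$'s are concave and their supergradients satisfy $0 \le v_1 \le v_2 \le \dots \le v_m$ for any $v_i \in \partial g_i(\sigma_i(X))$, $i = 1, \dots, m$. The truncated nuclear norm $\|X\|_r = \sum_{i=r+1}^{m} \sigma_i(X)$ is an interesting example. Indeed, let
$$g_i(x) = \begin{cases} 0, & i = 1, \dots, r, \\ x, & i = r+1, \dots, m. \end{cases} \tag{24}$$
Then $\|X\|_r = \sum_{i=1}^{m} g_i(\sigma_i(X))$ and its supergradient is
$$\partial g_i(x) = \begin{cases} 0, & i = 1, \dots, r, \\ 1, & i = r+1, \dots, m. \end{cases} \tag{25}$$
Compared with the alternating updating algorithm proposed for the truncated nuclear norm, which requires double loops, our IRNN is more efficient and comes with a stronger convergence guarantee.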
III-B IRNN for the Multi-Blocks Problem (5)
Here we propose a more general Tensor Low Rank Representation (TLRR) as follows
where is an -way tensor and denotes the -mode product . TLRR is an extension of LRR  and LatLRR. It can also be applied for subspace clustering, see Section IV. If we replace in (26) as with ’s satisfying the assumption A1, then we have the Nonconvex TLRR (NTLRR) model which is a special case of (5).
III-C Convergence Analysis
In this section, we give the convergence analysis of IRNN-PS for (5). For the simplicity of notation, we denote as the -th singular value of in the -th iteration.
In problem (5), assume that ’s satisfies the assumption A1 and is Lipschitz continuous. Then the sequence generated by IRNN-PS satisfies the following properties:
is monotonically decreasing. Indeed,
Proof. First, since is optimal to (III-B), we have
It can be rewritten as
Second, since is Lipschitz continuous, by (6), we have
Summing the above three equations for all and leads to
Thus is monotonically decreasing. Summing the above inequality for , we get
This implies that .
Proof. Due to the above assumption, is bounded. Thus there exists a matrix and a subsequence such that . Note that in Theorem 1, we have . Thus for and . By Lemma 1, implies that . From the upper semi-continuous property of the subdifferential [42, Proposition 2.1.5], there exists such that . Again by Lemma 1, and .
Denote . Since is optimal to (III-B), there exists , such that
Let in (30). Then there exists , such that
Thus is a stationary point to (5).
IV Experiments
In this section, we present several experiments to demonstrate that the models with nonconvex rank surrogates outperform the ones with convex nuclear norm. We conduct three experiments. The first two aim to examine the convergence behavior of IRNN for the matrix completion problem  on both synthetic data and real images. The last experiment is tested on the tensor low rank representation problem (27) solved by IRNN-PS for face clustering.
For the first two experiments, we consider the nonconvex low rank matrix completion problem
$$\min_{X} \sum_{i=1}^{m} g_\lambda(\sigma_i(X)) + \frac{1}{2}\|\mathcal{P}_\Omega(X - M)\|_F^2, \tag{32}$$
where $\Omega$ is the set of indices of samples, and $\mathcal{P}_\Omega$ is a linear operator that keeps the entries in $\Omega$ unchanged and sets those outside $\Omega$ to zero. The gradient of the squared loss function in (32) is Lipschitz continuous, with Lipschitz constant $L(f) = 1$. We set in IRNN. For the choice of $g_\lambda$, we use five nonconvex surrogates in Table I, including the $\ell_p$-norm, SCAD, Logarithm, MCP and ETP. The other three nonconvex surrogates, Capped $\ell_1$, Geman and Laplace, are not used since we find that their recovery performances are very sensitive to the choices of $\gamma$ and $\lambda$ in different cases. For the choice of $\lambda$ in $g_\lambda$, we use a continuation technique to enhance the low rank matrix recovery. The initial value of $\lambda$ is set to a larger value $\lambda_0$, and dynamically decreased by $\lambda = \eta^k \lambda_0$ with $\eta < 1$. The decrease stops once $\lambda$ reaches a predefined target $\lambda_t$. $X$ is initialized as a zero matrix. For the choice of parameters (e.g., $p$ and $\gamma$) in $g_\lambda$, we search them from a candidate set and use the one which obtains good performance in most cases.
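The continuation technique described above can be sketched as a geometric schedule wrapped around any fixed-parameter solver. This is a minimal sketch under assumptions: the helper names, the geometric decrease by a constant factor, and the warm start from the previous solution are illustrative choices, and `solve_fixed_lambda` is a hypothetical callable standing in for a run of IRNN at one regularization value.

```python
def continuation_schedule(lam0, lam_target, eta):
    """Yield a decreasing sequence lam0, eta*lam0, eta^2*lam0, ..., ending at lam_target."""
    if not (0.0 < eta < 1.0):
        raise ValueError("eta must lie strictly between 0 and 1")
    if lam_target <= 0.0 or lam0 < lam_target:
        raise ValueError("need lam0 >= lam_target > 0")
    lam = lam0
    while lam > lam_target:
        yield lam
        lam *= eta
    # Finish exactly at the predefined target value.
    yield lam_target

def solve_with_continuation(solve_fixed_lambda, X0, lam0, lam_target, eta):
    """Warm-start the solver at each value of the schedule.

    solve_fixed_lambda(X, lam) is a hypothetical callable that returns an
    approximate solution for the given lam, started from X.
    """
    X = X0
    for lam in continuation_schedule(lam0, lam_target, eta):
        X = solve_fixed_lambda(X, lam)
    return X
```

For example, `list(continuation_schedule(1.0, 0.1, 0.5))` gives `[1.0, 0.5, 0.25, 0.125, 0.1]`. Starting from a large value keeps early iterates low rank, and each stage reuses the previous solution as its starting point.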
IV-A Low Rank Matrix Recovery on the Synthetic Data
We first compare the low rank matrix recovery performances of nonconvex model (32) with the convex one by using nuclear norm  on the synthetic data. We conduct two tasks. The first one is tested on the observed matrix without noises, while the other one is tested on with noises.
For the noise free case, we generate the rank $r$ matrix $M$ as $M = M_L M_R$, where $M_L$ and $M_R$ are generated by the Matlab command randn. We randomly set elements of $M$ to be missing. The Augmented Lagrange Multiplier (ALM) method is used to solve the noise free problem
The default parameters in the released code of ALM (http://perception.csl.illinois.edu/matrix-rank/sample_code.html) are used. Problem (32) is solved by IRNN with the parameters , and . The algorithm is stopped when . The matrix recovery performance is evaluated by the Relative Error defined as
$$\text{Relative Error} = \frac{\|X^* - M\|_F}{\|M\|_F}, \tag{34}$$
where $X^*$ is the matrix recovered by the respective algorithm. If the Relative Error is smaller than , then $X^*$ is regarded as a successful recovery of $M$. For each $r$, we repeat the experiments $s$ times. Then we define the Frequency of Success as $\hat{s}/s$, where $\hat{s}$ is the number of successful recoveries. We also vary the underlying rank $r$ of $M$ from 20 to 33 for each algorithm. We show the frequency of success in Figure 3(a). The legend IRNN-$\ell_p$ in Figure 3(a) denotes the model (32) with the $\ell_p$ penalty solved by IRNN. It can be seen that IRNN for (32) with nonconvex rank surrogates significantly outperforms ALM for (33) with the convex rank surrogate. This is because the nonconvex surrogates approximate the rank function much better than the convex nuclear norm. This also verifies that our IRNN achieves good solutions of (32), though its optimal solutions are in general not computable.
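The two evaluation measures used here are simple to compute. The sketch below is illustrative only: the helper names are hypothetical, and the matrix sizes, rank and tolerance in the usage lines are placeholder values chosen for the example rather than the settings of the experiments.

```python
import numpy as np

def make_low_rank(m, n, r, rng):
    """Rank-r test matrix built as a product of two Gaussian factors."""
    return rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

def relative_error(X_rec, M):
    """Frobenius-norm error of the recovery relative to the ground truth M."""
    return np.linalg.norm(X_rec - M, "fro") / np.linalg.norm(M, "fro")

def frequency_of_success(errors, tol):
    """Fraction of trials whose relative error is strictly below tol."""
    errors = np.asarray(errors, dtype=float)
    return np.count_nonzero(errors < tol) / errors.size

# Usage with placeholder values (not the paper's experimental settings).
rng = np.random.default_rng(0)
M = make_low_rank(30, 30, 3, rng)
noisy_guess = M + 1e-6 * rng.standard_normal(M.shape)
err = relative_error(noisy_guess, M)
```

Reporting the fraction of successful trials per rank, rather than a single error value, is what produces the phase-transition style curves typically shown for matrix completion.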
For the second task, we assume that the observed matrix is noisy. It is generated by adding 0.1 × randn noise to the noise free observations. We compare IRNN for (32) with the convex Accelerated Proximal Gradient with Line search (APGL; code: http://www.math.nus.edu.sg/~mattohkc/NNLS.html), which solves the noisy problem
For this task, we set and in IRNN. We run the experiments 100 times, and the underlying rank varies from 15 to 35. For each test, we compute the relative error in (34). Then we show the mean relative error over 100 tests in Figure 3(c). Similar to the noise free case, IRNN with nonconvex rank surrogates achieves much smaller recovery error than APGL for the convex problem (35).
It is worth mentioning that although Logarithm seems to perform better than the other nonconvex penalties for low rank matrix completion in Figure 4, it is still not clear which one is the best rank surrogate, since the obtained solutions are not globally optimal. Answering this question is beyond the scope of this work.
Figure 3(b) shows the running times of the compared methods. It can be seen that IRNN is slower than the convex ALM. This is due to the reinitialization of IRNN when using the continuation technique. Figure 3(d) plots the objective function values at each iteration of IRNN with different nonconvex penalties. As verified in theory, it can be seen that the values are decreasing.
IV-B Application to Image Recovery
In this section, we apply the low rank matrix completion models (35) and (3) to image recovery. We follow the experimental settings in . Here we consider two types of noise on the real images. The first one replaces of pixels with random values (sample image (1) in Figure 4(b)). The other one adds some unrelated text on the image (sample image (2) in Figure 4(b)). The goal is to remove the noise by using low rank matrix completion. Real images may not actually be of low rank, but their top singular values carry the main information. Thus, the image can be approximately recovered by a low-rank matrix. A color image has three channels, and matrix completion is applied to each channel independently. We compare IRNN with some state-of-the-art methods on this task, including APGL, Low-Rank Matrix Fitting (LMaFit; code: http://lmafit.blogs.rice.edu/) and Truncated Nuclear Norm Regularization (TNNR; code: https://sites.google.com/site/zjuyaohu/). For the obtained solution, we evaluate its quality by the Peak Signal-to-Noise Ratio (PSNR) and the relative error (34).
Figure 5 (c)-(g) show the images recovered by different methods. It can be seen that our IRNN method for nonconvex models achieves much better recovery performance than APGL and LMaFit. The performances of the low rank models (3) using different nonconvex surrogates are quite similar, so we only show the results of IRNN-$\ell_p$ and IRNN-SCAD due to the limit of space. Some more results are shown in Figure 6. Figure 7 shows the PSNR values, relative errors and running time of different methods on all the tested images. It can be seen that IRNN with all the evaluated nonconvex functions achieves higher PSNR values and smaller relative error. This verifies that the nonconvex penalty functions are effective in this situation. The nonconvex truncated nuclear norm is close to our methods, but its running time is 35 times of ours.
IV-C Tensor Low-Rank Representation
In this section, we consider using the Tensor Low-Rank Representation (TLRR) (27) for face clustering [46, 36]. Problem (27) can be solved by the Accelerated Proximal Gradient (APG) method with the optimal convergence rate $O(1/K^2)$, where $K$ is the number of iterations. The corresponding Nonconvex TLRR (NTLRR) related to (27) is