Nonconvex Sparse Spectral Clustering by Alternating Direction Method of Multipliers and Its Convergence Analysis
Spectral Clustering (SC) is a widely used data clustering method which first learns a low-dimensional embedding $U$ of data by computing the eigenvectors of the normalized Laplacian matrix, and then performs k-means on the rows of $U$ to get the final clustering result. The Sparse Spectral Clustering (SSC) method extends SC with a sparse regularization on $UU^\top$ by using the block diagonal structure prior of $UU^\top$ in the ideal case. However, encouraging $UU^\top$ to be sparse leads to a heavily nonconvex problem which is challenging to solve, and the work (?) proposes a convex relaxation that pursues this aim indirectly. The convex relaxation generally leads to a loose approximation and the quality of the solution is not clear. This work instead solves the nonconvex formulation of SSC, which directly encourages $UU^\top$ to be sparse. We propose an efficient Alternating Direction Method of Multipliers (ADMM) to solve the nonconvex SSC and provide the convergence guarantee. In particular, we prove that the sequences generated by ADMM always have a limit point and that any limit point is a stationary point. Our analysis does not impose any assumptions on the iterates and thus is practical. Our proposed ADMM for nonconvex problems allows the stepsize to be increasing but upper bounded, and this makes it very efficient in practice. Experimental analysis on several real data sets verifies the effectiveness of our method.
Canyi Lu, Jiashi Feng, Zhouchen Lin (corresponding author), Shuicheng Yan
Department of Electrical and Computer Engineering, National University of Singapore
Key Laboratory of Machine Perception (MOE), School of EECS, Peking University
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University
360 AI Institute
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Data clustering is one of the most fundamental topics in unsupervised learning and has been widely applied in computer vision, data mining and many others. Clustering aims to divide the unlabeled data set into groups which consist of similar data points. Many clustering methods have been proposed up to now, e.g. k-means, spectral clustering (?; ?) and nonnegative matrix factorization (?). The Spectral Clustering (SC) is one of the most widely used methods and it has a lot of applications in computer vision and signal analysis, e.g., image segmentation (?), motion segmentation (?), and co-clustering problems of words and documents (?).
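Assume that we are given data points $X = [x_1, \dots, x_n] = [X_1, \dots, X_k] \in \mathbb{R}^{d \times n}$, where $X_j$ denotes the $j$-th group with $n_j$ points, $\sum_{j=1}^{k} n_j = n$, and $k$ is the number of clusters. SC (?) partitions these $n$ points into $k$ clusters by the following procedure: First, compute an affinity matrix $W \in \mathbb{R}^{n \times n}$ with its element $w_{ij}$ measuring the similarity between $x_i$ and $x_j$. Second, construct the normalized Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$, where $D$ is a diagonal matrix with each diagonal element $d_{ii} = \sum_{j=1}^{n} w_{ij}$ and $I$ is the identity matrix. Third, compute $U \in \mathbb{R}^{n \times k}$ by solving
$$\min_{U \in \mathbb{R}^{n \times k}} \langle UU^\top, L \rangle, \quad \text{s.t.} \quad U^\top U = I. \tag{1}$$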
Finally, compute $\hat{U} \in \mathbb{R}^{n \times k}$ by normalizing each row of $U$ to have unit Euclidean length, treat the rows of $\hat{U}$ as data points in $\mathbb{R}^k$, and cluster them into $k$ groups by k-means. Due to the significance of SC, many variants of SC have been proposed based on different ways of affinity matrix construction and different normalizations of the Laplacian matrix (?; ?).
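The SC procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `normalized_laplacian` and `spectral_embedding` and the toy affinity matrix are assumptions introduced here, and the final k-means step on the rows of the normalized embedding is omitted.

```python
import numpy as np

def normalized_laplacian(W):
    """Return L = I - D^{-1/2} W D^{-1/2} for a symmetric nonnegative affinity W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated points
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def spectral_embedding(W, k):
    """Return (U, U_hat): the k eigenvectors of L with smallest eigenvalues,
    and the same matrix with each row scaled to unit Euclidean length."""
    L = normalized_laplacian(W)
    _, vecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    U = vecs[:, :k]
    row_norms = np.linalg.norm(U, axis=1, keepdims=True)
    U_hat = U / np.maximum(row_norms, 1e-12)
    return U, U_hat

# Toy ideal case (an assumption for illustration): two disconnected groups.
W = np.zeros((5, 5))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

U, U_hat = spectral_embedding(W, k=2)
G = U_hat @ U_hat.T   # block diagonal here: ones within a group, zeros across groups
```

In this toy case the rows of `U_hat` coincide within each group, so any reasonable k-means run on them recovers the two groups.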
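A recent work (?) proposes the Sparse Spectral Clustering (SSC) method which computes the low-dimensional embedding $U$ in a different way:
$$\min_{U \in \mathbb{R}^{n \times k}} \langle UU^\top, L \rangle + \beta \|UU^\top\|_0, \quad \text{s.t.} \quad U^\top U = I, \tag{2}$$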
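where $\beta > 0$ and $\|\cdot\|_0$ is the $\ell_0$-norm which encourages $UU^\top$ to be sparse. The motivation for such a sparse regularizer is that $UU^\top$ is block diagonal (thus sparse) when $W$ is block diagonal in the ideal case. Consider the ideal case that the affinity matrix $W$ is block diagonal, i.e., $w_{ij} = 0$ if $x_i$ and $x_j$ are from different clusters. Let $C \in \mathbb{R}^{n \times k}$ denote the indicator matrix whose row entries indicate to which group the points belong. That is, $c_{ij} = 1$ if $x_i$ belongs to the group $j$, and $c_{il} = 0$ for all $l \neq j$. Then, for any orthogonal matrix $R \in \mathbb{R}^{k \times k}$, we have $\hat{U} = CR$. In this case, $\hat{U}\hat{U}^\top = CC^\top$ is block diagonal, i.e.,
$$\hat{U}\hat{U}^\top = CC^\top = \begin{bmatrix} 1_{n_1} 1_{n_1}^\top & 0 & \cdots & 0 \\ 0 & 1_{n_2} 1_{n_2}^\top & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1_{n_k} 1_{n_k}^\top \end{bmatrix},$$
where $1_m$ denotes the all-ones vector of length $m$, so that each diagonal block is an all-ones matrix. Hence, $\hat{U}\hat{U}^\top$ implies the true membership of the data clusters and it is naturally sparse. Note that $\hat{U}$ is obtained by normalizing each row of $U$ and thus $UU^\top$ is also sparse. However, such a block diagonal or sparse property may not appear in real applications since the affinity matrix $W$ is usually not block diagonal. This motivates the sparse regularization on $UU^\top$ and thus leads to the SSC model in (2). The key challenge is that problem (2) is nonconvex and difficult to solve. The work (?) proposes a convex relaxation formulation as follows
$$\min_{P \in \mathbb{R}^{n \times n}} \langle P, L \rangle + \beta \|P\|_1, \quad \text{s.t.} \quad 0 \preceq P \preceq I, \ \mathrm{Tr}(P) = k, \tag{3}$$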
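where the $\ell_1$-norm $\|\cdot\|_1$ is used as a surrogate of the $\ell_0$-norm while the nonconvex constraint set consisting of all the fixed rank projection matrices, i.e., $\{UU^\top : U^\top U = I\}$, is replaced by its convex hull $\{P \in \mathbb{S}^n : 0 \preceq P \preceq I, \ \mathrm{Tr}(P) = k\}$ (?). Here, $\mathbb{S}^n$ denotes the set of symmetric matrices. For $A, B \in \mathbb{S}^n$, $A \preceq B$ means that $B - A$ is positive semi-definite. Problem (3) is convex and the optimal solution can be computed by the Alternating Direction Method of Multipliers (ADMM) (?; ?), which is efficient in practice. After solving (3) with the solution $P$, the low-dimensional embedding $U$ of data can be approximated by using the first $k$ eigenvectors corresponding to the largest $k$ eigenvalues of $P$. This is equivalent to computing $U$ by solving
$$\min_{U \in \mathbb{R}^{n \times k}} \|P - UU^\top\|^2, \quad \text{s.t.} \quad U^\top U = I, \tag{4}$$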
where $\|\cdot\|$ denotes the Frobenius norm of a matrix. After obtaining $U$, one is able to cluster the data points into $k$ groups as in SC.
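The second stage of the convex approach, recovering an embedding from the relaxed solution, can be illustrated numerically. The sketch below assumes the relaxed solution is a symmetric matrix with eigenvalues in [0, 1] and trace equal to the number of clusters; the helper name `embedding_from_relaxation` and the synthetic matrix are assumptions made for illustration. It shows that the distance between the relaxed solution and the recovered rank-constrained projection can be strictly positive.

```python
import numpy as np

def embedding_from_relaxation(P, k):
    """Eigenvectors of the k largest eigenvalues of the symmetric part of P.
    These minimize ||P - U U^T||_F over matrices U with orthonormal columns."""
    P_sym = 0.5 * (P + P.T)
    _, vecs = np.linalg.eigh(P_sym)   # ascending eigenvalues
    return vecs[:, -k:]

# Synthetic matrix in the relaxed set: eigenvalues in [0, 1], trace equal to k = 2.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
eigenvalues = np.array([1.0, 0.7, 0.3, 0.0, 0.0, 0.0])
P = Q @ np.diag(eigenvalues) @ Q.T

U = embedding_from_relaxation(P, k=2)
gap_squared = np.linalg.norm(P - U @ U.T) ** 2   # (0.7 - 1)^2 + 0.3^2 = 0.18
```

Because the relaxed matrix is not itself a rank-2 projection, the squared gap is 0.18 rather than 0, which is the kind of looseness the nonconvex formulation is meant to avoid.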
From the above discussion, it can be seen that a main limitation of the convex SSC relaxation (3) is that the obtained solution may be far from optimal for (2). The reason is that the difference $\|P - UU^\top\|^2$ in (4) is not guaranteed to be 0 or sufficiently small. Thus, enforcing $P$ to be sparse in (3) does not guarantee a sparse $UU^\top$ in (4). Such an issue is mainly caused by the use of the convex hull $\{P \in \mathbb{S}^n : 0 \preceq P \preceq I, \ \mathrm{Tr}(P) = k\}$ as a relaxation instead of $\{UU^\top : U^\top U = I\}$.
In this work, we aim to address the above possibly loose relaxation issue of the convex SSC model by directly solving the following nonconvex problem
$$\min_{U \in \mathbb{R}^{n \times k}} \langle L, UU^\top \rangle + g(UU^\top), \quad \text{s.t.} \quad U^\top U = I, \tag{5}$$
where $g$ is a sparse regularizer. The choice of the sparse regularizer is not very important in SSC. We allow $g$ to be nonconvex (see more detailed assumptions in the next section). Problem (5) is nonconvex and challenging to solve due to the orthogonal constraint $U^\top U = I$ and the complex term $g(UU^\top)$. This work proposes to solve it by the Alternating Direction Method of Multipliers (ADMM) and provides the convergence guarantee. Our proposed ADMM is flexible as we allow the stepsize in ADMM to be monotonically increasing (but upper bounded). Such a choice of stepsize is widely used and has been verified to be effective in improving the convergence of ADMM for convex optimization. To the best of our knowledge, we are the first to use such a choice of stepsize in ADMM for nonconvex problems and to support it in theory. When characterizing the convergence of ADMM, we show that the augmented Lagrangian function value is monotonically decreasing. This further guarantees that the sequences generated by the proposed ADMM are bounded, that there always exists a limit point, and that any limit point is a stationary point.
The Proposed ADMM Algorithm
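In this section, we present the ADMM algorithm for solving the nonconvex problem (5). We first reformulate it as the following equivalent problem
$$\min_{P \in \mathbb{R}^{n \times n},\, U \in \mathbb{R}^{n \times k}} \langle L, UU^\top \rangle + g(P), \quad \text{s.t.} \quad P = UU^\top, \ U^\top U = I. \tag{6}$$
The standard augmented Lagrangian function is
$$\mathcal{L}(P, U, Y_1, Y_2, \mu) = \langle L, UU^\top \rangle + g(P) + \langle Y_1, P - UU^\top \rangle + \langle Y_2, U^\top U - I \rangle + \frac{\mu}{2} \left( \|P - UU^\top\|^2 + \|U^\top U - I\|^2 \right),$$
where $Y_1$ and $Y_2$ are the dual variables and $\mu > 0$. However, it is difficult to update $U$ by minimizing the above augmented Lagrangian function when fixing other variables. To update $U$ efficiently, we instead use the following partial augmented Lagrangian function
$$\mathcal{L}(P, U, Y, \mu) = \langle L, UU^\top \rangle + g(P) + \langle Y, P - UU^\top \rangle + \frac{\mu}{2} \|P - UU^\top\|^2. \tag{7}$$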
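Then we can solve problem (6) by the Alternating Direction Method of Multipliers with the following rules.
1. Fix $P = P_k$ and update $U$ by
$$U_{k+1} = \arg\min_{U^\top U = I} \mathcal{L}(P_k, U, Y_k, \mu_k). \tag{8}$$
2. Fix $U = U_{k+1}$ and update $P$ by
$$P_{k+1} = \arg\min_{P} \mathcal{L}(P, U_{k+1}, Y_k, \mu_k). \tag{9}$$
3. Update the dual variable by
$$Y_{k+1} = Y_k + \mu_k \left( P_{k+1} - U_{k+1} U_{k+1}^\top \right). \tag{10}$$
4. Update the stepsize by
$$\mu_{k+1} = \min(\mu_{\max}, \rho \mu_k), \quad \rho > 1. \tag{11}$$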
The whole procedure of ADMM for (6) is given in Algorithm 1. It can be seen that the $U$-subproblem (8) has a closed form solution. The $P$-subproblem (9) requires computing the proximal mapping of $g$. It usually has a closed form solution when $g$ is simple.
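The four update rules can be sketched as follows. This is a hedged illustration under several assumptions that are not taken from the text: the regularizer is a scaled Huber function (a smoothed ℓ1 penalty with weight `beta` and smoothing parameter `delta`), the stepsize grows geometrically by a factor `rho` up to `mu_max`, the stopping test and all default parameter values are illustrative and not tuned to satisfy assumption A3, and the function names are hypothetical. The U-step uses the fact that, under the orthogonality constraint, the squared norm of U U^T is constant, so minimizing the partial augmented Lagrangian over U reduces to picking the eigenvectors of the smallest eigenvalues of L - Y - mu * P.

```python
import numpy as np

def huber_prox(V, beta, delta, mu):
    """Elementwise minimizer of beta*h(p) + (mu/2)*(p - v)^2, where h is the
    Huber function: p^2/(2*delta) if |p| <= delta, else |p| - delta/2."""
    in_quadratic_zone = np.abs(V) <= delta + beta / mu
    return np.where(in_quadratic_zone,
                    V * (mu * delta) / (beta + mu * delta),
                    V - (beta / mu) * np.sign(V))

def admm_ssc_sketch(L, k, beta=0.01, delta=0.01, mu0=1.0, rho=1.05,
                    mu_max=1e4, max_iter=300, tol=1e-6):
    """Illustrative ADMM loop for min <L, UU^T> + g(P) s.t. P = UU^T, U^T U = I."""
    n = L.shape[0]
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                      # start from the plain SC embedding
    P = U @ U.T
    Y = np.zeros((n, n))
    mu = mu0
    for _ in range(max_iter):
        # U-step: eigenvectors of the k smallest eigenvalues of L - Y - mu*P.
        M = L - Y - mu * P
        _, vecs = np.linalg.eigh(0.5 * (M + M.T))
        U = vecs[:, :k]
        UUt = U @ U.T
        # P-step: proximal mapping of g evaluated at UU^T - Y/mu.
        P_new = huber_prox(UUt - Y / mu, beta, delta, mu)
        # Dual step, then the increasing but upper bounded stepsize.
        Y = Y + mu * (P_new - UUt)
        residual = np.linalg.norm(P_new - UUt)
        change = np.linalg.norm(P_new - P)
        P = P_new
        mu = min(mu_max, rho * mu)
        if max(residual, change) < tol:  # illustrative stopping test
            break
    return U, P, residual

# Toy usage on a random symmetric affinity matrix (an assumption for illustration).
rng = np.random.default_rng(0)
W = rng.random((8, 8))
W = 0.5 * (W + W.T)
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
L = np.eye(8) - W / np.sqrt(np.outer(d, d))
U, P, residual = admm_ssc_sketch(L, k=2)
```

Only one dual variable is needed because the orthogonality constraint is kept inside the U-step instead of being dualized, which is what makes that step an eigenvalue problem.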
We would like to emphasize that, for nonconvex optimization, our ADMM allows the stepsize to be increasing (but upper bounded), while previous nonconvex ADMM methods have to fix it. Such a choice of stepsize has been verified to be effective in improving the convergence of ADMM for convex optimization in practice and the convergence guarantee is also known (?; ?). To the best of our knowledge, this is the first work which uses varying stepsize in ADMM for nonconvex problems and the convergence analysis for supporting this is also different from the convex case.
Main Result: The Convergence Analysis
The most important contribution is the convergence analysis of the proposed ADMM in Algorithm 1 for nonconvex problems which is generally challenging. This section gives the details of the convergence results. We first introduce the subgradient of any function (?), which will be used later.
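Let $S \subseteq \mathbb{R}^n$ and $\bar{x} \in S$. A vector $v$ is normal to $S$ at $\bar{x}$ in the regular sense, denoted as $v \in \hat{N}_S(\bar{x})$, if
$$\langle v, x - \bar{x} \rangle \le o(\|x - \bar{x}\|), \quad x \in S,$$
where $o(\|x - \bar{x}\|)$ is defined by $\lim_{x \to \bar{x}} o(\|x - \bar{x}\|) / \|x - \bar{x}\| = 0$. A vector $v$ is normal to $S$ at $\bar{x}$ in the general sense, denoted as $v \in N_S(\bar{x})$, if there exist sequences $\{x^k\} \subset S$, $\{v^k\}$ such that $x^k \to \bar{x}$ and $v^k \to v$ with $v^k \in \hat{N}_S(x^k)$. The cone $N_S(\bar{x})$ is called the normal cone to $S$ at $\bar{x}$.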
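Consider a lower semi-continuous function $h: \mathbb{R}^n \to (-\infty, +\infty]$ and a point $\bar{x}$ with $h(\bar{x})$ finite. For a vector $v \in \mathbb{R}^n$, one says that
1. $v$ is a regular subgradient of $h$ at $\bar{x}$, denoted as $v \in \hat{\partial} h(\bar{x})$, if
$$h(x) \ge h(\bar{x}) + \langle v, x - \bar{x} \rangle + o(\|x - \bar{x}\|);$$
2. $v$ is a (general) subgradient of $h$ at $\bar{x}$, denoted as $v \in \partial h(\bar{x})$, if there exist sequences $\{x^k\}$, $\{v^k\}$ such that $x^k \to \bar{x}$, $h(x^k) \to h(\bar{x})$ and $v^k \to v$ with $v^k \in \hat{\partial} h(x^k)$.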
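Let $S$ be a closed non-empty subset of $\mathbb{R}^n$ and its indicator function be
$$\iota_S(x) = \begin{cases} 0, & x \in S, \\ +\infty, & \text{otherwise}. \end{cases}$$
Then its subgradient is $\partial \iota_S(x) = N_S(x)$, $x \in S$. In this work, we denote $S_0 = \{U \in \mathbb{R}^{n \times k} : U^\top U = I\}$ and the indicator function as $\iota_{S_0}(U)$.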
To guarantee the convergence, we need some assumptions for problem (6) as follows:
A1. $L$ is positive semi-definite.
A2. $g$ is lower bounded and differentiable, and $\nabla g$ is Lipschitz continuous, i.e., there exists $l > 0$ such that $\|\nabla g(X) - \nabla g(Y)\| \le l \|X - Y\|$ for all $X$ and $Y$.
A3. The stepsize $\mu_k$ is chosen large enough such that the $P$-subproblem (9) is strongly convex with modulus $\gamma_k$.
We have some remarks regarding the above assumptions. First, A1 holds for the SSC model since $L$ is the normalized Laplacian matrix. Second, $g$ can be nonconvex. In SSC, one may use a convex or nonconvex sparse regularizer $g$, but $g$ should be Lipschitz differentiable, which can be achieved by using the smoothing technique (?) if necessary (see an example in the Experiment section). Third, the $P$-subproblem (9) is eventually strongly convex when $\mu_k$ is large enough.
In the following, we prove several properties of Algorithm 1 and give the convergence results.
Under assumptions A1-A3, all the subproblems in Algorithm 1 are well defined.
Under assumptions A1-A3, we have
Under assumptions A1-A3, the sequences generated by Algorithm 1 satisfy
is monotonically decreasing, i.e.,
for some constant .
When , , and .
The sequences , and are bounded.
There exists , where
The proof of Lemma 3 can be found in the supplementary material. The property (12) is important for proving (15), which guarantees that the augmented Lagrangian function value is monotonically decreasing due to the choice of the stepsize in assumption A3. This, combined with the lower boundedness, guarantees that the augmented Lagrangian function value converges to some constant. For convex problems, there are several different quantities to characterize the convergence of ADMM, see (?; ?; ?). However, they are not applicable to ADMM for nonconvex problems. Here, the convergence is characterized in a different way, by using the decreasing sequence of augmented Lagrangian function values. Note that this decreasing property does not necessarily hold for ADMM for convex optimization. This difference implies that the nonconvex problems which can be solved by ADMM are relatively limited and that the convergence guarantee of ADMM for nonconvex problems is much more challenging. Based on (15), many other properties are proved. For example, Lemma 3 (c) gives some necessary results when the algorithm converges, and (d) and (e) are important for proving the convergence to a stationary point shown below.
From the boundedness of in Lemma 3, there exists a convergent subsequence and a limit point, denoted by as . Then, by using , and (16) in Lemma 3, for , there exists such that . In particular, as . By the definition of general subgradient, we have . This implies that (17)-(19) hold. Thus, any limit point is a stationary point. ∎
For every , the sequences generated by Algorithm 1 satisfies
Theorem 2 gives the convergence rate of our proposed ADMM based on the smallest difference of iterates of , and the residual. To the best of our knowledge, this is the first convergence rate of ADMM for nonconvex problems. In theory, such a result is relatively weaker than the convex case since the used measure is but not . But in practice, we observe that the sequence seems to decrease (see Figure 2 (b) in the Experiment section), though this is currently not clear in theory. This observation in practice implies that the above convergence rate makes sense.
It is worth mentioning that the convergence guarantee of ADMM for convex problems has been well established (?; ?; ?). However, for nonconvex cases, the convergence analysis of ADMM for different nonconvex problems is quite different. There are some recent works (?; ?) which apply ADMM to solve nonconvex problems and provide some analysis. However, these works are not able to solve our problem (6) since the constraints in their considered problems should be relatively simple while our problem has a special nonconvex constraint. The work (?) is not applicable to our problem since it requires all the subproblems to be strongly convex while our $U$-subproblem (8) is nonconvex. To apply the method in (?) to solve (5), one has to exactly solve a problem of the following type
in each iteration. This is generally very challenging even when $g$ is convex. Also, we would like to emphasize that our ADMM allows the stepsize to be increasing (but upper bounded), while previous nonconvex ADMM algorithms simply fix it. Though our considered problem is specific, the analysis for the varying stepsize is applicable to other nonconvex problems in (?). In practice, the convergence speed of ADMM is sensitive to the choice of the stepsize, but it is generally difficult to find a proper constant stepsize for fast convergence. Our choice of stepsize has been shown to be effective in improving the convergence speed and is widely used in convex optimization (?; ?). In practice, we find that such a technique is also very useful for fast implementation of ADMM for nonconvex problems. We are also the first to give the convergence rate (in the sense of Theorem 2) of ADMM for nonconvex problems.
In this section, we conduct some experiments to analyze the convergence of the proposed ADMM for nonconvex SSC and show its effectiveness for data clustering. We solve the following nonconvex SSC model
where is the smoothed -norm with a smoothness parameter defined as follows
where . According to Theorem 1 in (?), the gradient of is given by and is Lipschitz continuous with Lipschitz constant . Note that is convex. So we set , which guarantees the assumption A3 holds. In Algorithm 1, we set , , and is initialized as the eigenvectors associated to the smallest eigenvalues of , where is the number of the clusters and is the normalized Laplacian matrix constructed based on the given affinity matrix . Then we set and . We use the following stopping criteria for Algorithm 1
which is implied by our convergence analysis. For all the experiments, we use (in practice, we find that the clustering performance is not sensitive when ).
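To make the smoothing step concrete, the sketch below assumes the standard Nesterov smoothing of the ℓ1-norm, which yields an elementwise Huber function whose gradient is a clipped rescaling of the input and is Lipschitz continuous with constant one over the smoothing parameter. The exact definition used in the experiments is not reproduced in the text above, so the form below, the function names, and the parameter name `delta` are assumptions.

```python
import numpy as np

def smoothed_l1(P, delta):
    """Huber smoothing of the l1-norm: sum of p^2/(2*delta) where |p| <= delta,
    and |p| - delta/2 elsewhere. Equals max over |Z| <= 1 (elementwise) of
    <P, Z> - (delta/2)*||Z||_F^2."""
    A = np.abs(P)
    return float(np.sum(np.where(A <= delta, P ** 2 / (2.0 * delta), A - delta / 2.0)))

def smoothed_l1_grad(P, delta):
    """Gradient of smoothed_l1: the maximizing Z, i.e. P/delta clipped to [-1, 1]."""
    return np.clip(P / delta, -1.0, 1.0)

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 4))
delta = 0.1
value = smoothed_l1(P, delta)
grad = smoothed_l1_grad(P, delta)
l1_gap = np.sum(np.abs(P)) - value   # lies in [0, delta/2 * number of entries]
```

A smaller smoothing parameter gives a tighter approximation of the ℓ1-norm but a larger Lipschitz constant, which in turn pushes up the stepsize needed for the strong convexity requirement in A3.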
We conduct two experiments based on two different ways of affinity matrix construction. The first experiment considers the subspace clustering problem in (?). A sparse affinity matrix $W$ is computed by using the sparse subspace clustering method ($\ell_1$-graph) (?), and then it is used as the input for SC, convex SSC (?) and our nonconvex SSC model solved by ADMM. The second experiment instead uses the Gaussian kernel to construct the affinity matrix, which is generally dense. We will show the effectiveness of nonconvex SSC in both settings.
Affinity matrix construction by the $\ell_1$-graph
For the first experiment, we consider the case that the affinity matrix is constructed by the $\ell_1$-graph (?). We test on the Extended Yale B database (?) to analyze the effectiveness of our nonconvex SSC model in (20). The Extended Yale B dataset consists of 2,414 face images of 38 subjects. Each subject has 64 faces. We resize the images to $32 \times 32$ and vectorize them as 1,024-dimensional data points. We construct 5 subsets which consist of all the images of the randomly selected 2, 3, 5, 8 and 10 subjects of this dataset. For each trial, we follow the settings in (?) to construct the affinity matrix $W$ by solving a sparse representation model (or $\ell_1$-graph), which is the most widely used method. Based on the affinity matrix learned by the $\ell_1$-graph, the following four methods are compared to find the final clustering results:
SC: traditional spectral clustering method (?).
SSC: convex SSC model (?).
SSC-ADMM: our nonconvex SSC model solved by the proposed ADMM in Algorithm 1.
SSC-PG: our nonconvex SSC model (5) solved by the Proximal Gradient (PG) method (?) (a special case of Algorithm 1 in (?)). In each iteration, PG updates $U$ by
It is easy to see that the above problem has a closed form solution. We use the stopping criteria . We name the above method as SSC-PG.
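One way such a closed form can arise is sketched below. The PG update itself is not reproduced in the text above, so this assumes it takes the common form of minimizing a linearization of the objective plus a quadratic proximity term over matrices with orthonormal columns. Under that assumption the update is the nearest orthonormal matrix to a gradient step, which an SVD gives directly. The names `nearest_orthonormal`, `pg_step`, `grad_g` and `step` are hypothetical.

```python
import numpy as np

def nearest_orthonormal(A):
    """Closest matrix to A (in Frobenius norm) with orthonormal columns,
    obtained from the thin SVD A = Q diag(s) Vt as Q @ Vt."""
    Q, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Q @ Vt

def pg_step(U, L, grad_g, step):
    """One assumed proximal gradient step for h(U) = <L, UU^T> + g(UU^T),
    with symmetric L and grad_g returning the gradient of g at a matrix."""
    G = grad_g(U @ U.T)
    grad_h = 2.0 * L @ U + (G + G.T) @ U
    return nearest_orthonormal(U - step * grad_h)

# Toy usage with a smoothed l1 gradient (an assumption for illustration).
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
L = B @ B.T / 6.0                                   # symmetric PSD stand-in
U0 = nearest_orthonormal(rng.standard_normal((6, 2)))
grad_g = lambda P: 0.01 * np.clip(P / 0.01, -1.0, 1.0)
U1 = pg_step(U0, L, grad_g, step=0.1)
```

Each such step is cheap, but the quadratic surrogate can be loose, so many iterations may be needed compared with ADMM.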
Note that all the above four methods use the same affinity matrix as the input and their main difference is the different ways of learning of low-dimensional embedding . In the nonconvex model (20), we set . The experiments are repeated 20 times and the mean and standard deviation of the clustering error rates (see the definition in (?)) are reported.
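The clustering error rate is commonly computed as the fraction of misassigned points under the best one-to-one matching between predicted and true cluster labels. The cited definition is not reproduced here, so the sketch below assumes that common form. It searches all label permutations by brute force, which is only practical for a small number of clusters; the function name `clustering_error` is hypothetical.

```python
from itertools import permutations

import numpy as np

def clustering_error(true_labels, pred_labels):
    """Fraction of points mislabeled under the best injective mapping from
    predicted cluster ids to true cluster ids (brute force over permutations)."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    true_ids = np.unique(t)
    pred_ids = np.unique(p)
    if len(pred_ids) > len(true_ids):
        raise ValueError("more predicted clusters than true clusters")
    best_matches = 0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        mapped = np.array([mapping[c] for c in p])
        best_matches = max(best_matches, int(np.sum(mapped == t)))
    return 1.0 - best_matches / len(t)

truth = [0, 0, 1, 1, 2, 2]
error_perfect = clustering_error(truth, [1, 1, 0, 0, 2, 2])   # relabeling only
error_one_off = clustering_error(truth, [1, 1, 0, 2, 2, 2])   # one point misplaced
```

For ten or more clusters, an assignment solver such as the Hungarian algorithm would be the usual replacement for the permutation search.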
The clustering results are shown in Table 1. It can be seen that our nonconvex SSC models outperform convex SSC in most cases. The main reason is that nonconvex SSC is able to directly encourage $UU^\top$ to be sparse while SSC achieves this in a two-stage way (it requires solving (3) and (4)). Considering a clustering example with subjects from the Yale B dataset, Figure 1 plots the affinity matrix $W$ learned by the $\ell_1$-graph, and $UU^\top$ learned by SC, SSC and SSC-ADMM, respectively. Note that $UU^\top$ is important for data clustering since $\hat{U}\hat{U}^\top$ ($\hat{U}$ is the row normalization of $U$) implies the true membership of the data clusters in the ideal case (see the discussions in the Introduction section). It can be seen that $UU^\top$ by SSC-ADMM looks more discriminative since the within-cluster connections are much stronger than the between-cluster connections. Also, for the convergence of the proposed ADMM, we plot the augmented Lagrangian function and the stopping criterion in (22). It can be seen that the augmented Lagrangian function value is monotonically decreasing and the stopping criterion converges towards 0. The convergence behavior is consistent with our theoretical analysis.
Affinity matrix construction by the Gaussian kernel
For the second experiment, we consider the case that the affinity matrix is constructed by the Gaussian kernel. We test on 10 datasets which are of different sizes and are widely used in pattern analysis. They include 5 datasets (Wine, USPS, Glass, Letter, Vehicle) from the UCI website (?) (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), UMIST (https://www.sheffield.ac.uk/eee/research/iel/research/face), PIE (?), COIL20 (http://www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php), CSTR (http://www.cs.rochester.edu/trs/) and AR (?). For some datasets, e.g., USPS and Letter, we use a subset instead due to the relatively large size. For some image datasets, e.g., UMIST, PIE, and COIL20, the images are resized and then vectorized as features. The statistics of these datasets are summarized in Table 2. The key difference of this experiment from the first one is that we use the Gaussian kernel instead of sparse subspace clustering to construct the affinity matrix $W$.
The Gaussian kernel parameter is tuned over a grid. We use the same affinity matrix as the input of SC, SSC (?), SSC-PG and SSC-ADMM. The regularization parameter in SSC, SSC-PG and SSC-ADMM is also searched over a grid. We further compare these methods with k-means and Nonnegative Matrix Factorization (NMF). Under each parameter setting of each method mentioned above, we repeat the clustering 20 times and compute the average result. We report the best average accuracy for each method in Table 3.
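A dense affinity matrix of this kind can be built as in the sketch below. The exact kernel form and parameter grid used in the experiments are not given in the text above, so the common form with squared Euclidean distances divided by twice the squared bandwidth is assumed, as is the choice to zero the diagonal. The function name `gaussian_affinity`, the parameter name `sigma`, and the layout with one data point per row are assumptions made for illustration.

```python
import numpy as np

def gaussian_affinity(X, sigma):
    """Affinity w_ij = exp(-||x_i - x_j||^2 / (2*sigma^2)) for the rows x_i of X,
    with the diagonal set to zero (no self-similarity)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)      # clip tiny negatives from round-off
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [5.0, 5.0]])
W = gaussian_affinity(X, sigma=1.0)   # W[0, 1] = exp(-0.5); W[0, 2] is nearly 0
```

Unlike the sparse graph of the first experiment, every off-diagonal entry here is strictly positive, so the resulting matrix is dense and far from block diagonal unless the bandwidth is small.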
From Table 3, we have the following observations. First, the SSC models (SSC, SSC-PG and SSC-ADMM) improve on traditional SC, and our SSC-ADMM achieves the best performance in most cases. Second, this experiment not only verifies the superiority of SSC over SC, but also shows the importance of the nonconvex SSC and the effectiveness of our solver. Third, this experiment implies that the nonconvex SSC improves on the convex SSC when given the dense affinity matrix constructed by the Gaussian kernel, which is different from the sparse $\ell_1$-graph in the first experiment. Beyond the accuracy, we further compare the efficiency of SSC-PG and SSC-ADMM, which use different solvers for the equivalent nonconvex SSC model. Figure 3 gives the average running time of both methods. It can be seen that SSC-ADMM is much more efficient than SSC-PG. The main reason is that SSC-PG has to construct a relatively loose majorant surrogate in each iteration (?) and thus requires many more iterations. Note that the same phenomenon appears in convex optimization (?).
Conclusion and Future Work
This paper addressed the loose convex relaxation issue of SSC proposed in (?). We proposed to use ADMM to solve the nonconvex SSC problem (6) directly instead of the convex relaxation. More importantly, we provided the convergence guarantee of ADMM for such a nonconvex problem, which is challenging and has not been addressed before. It is worth mentioning that our ADMM method and analysis allow the stepsize to be monotonically increased (but upper bounded). Such a technique has been verified to be effective in improving the convergence in practice for convex optimization, and this is the first work which introduces it to ADMM for nonconvex optimization. Also, our convergence guarantee requires relatively weak assumptions, e.g., there is no assumption on the iterates and the subproblems need not be strongly convex. Thus it is more practical and can be used to solve other related problems. Beyond the convergence guarantee, we also gave some experimental studies to verify our analysis, and the clustering results on several real datasets demonstrated the improvement of nonconvex SSC over convex SSC.
There are several interesting directions for future work. Though the nonconvex problem solved in this work is specific, problems with a nonconvex orthogonal constraint are interesting, and such a nonconvex constraint or related ones appear in many models in component analysis. It will be interesting to apply ADMM to solve other problems with similar constraints and provide the convergence guarantee. It will also be interesting to apply our technique to solve some other nonconvex low rank minimization models in (?; ?).
J. Feng is partially supported by National University of Singapore startup grant R-263-000-C08-133 and Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112. Z. Lin is supported by National Basic Research Program of China (973 Program) (Grant no. 2015CB352502) and National Natural Science Foundation (NSF) of China (Grant nos. 61625301 and 61731018).
- [Beck and Teboulle 2009] Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.
- [Dhillon 2001] Dhillon, I. S. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, 269–274. ACM.
- [Elhamifar and Vidal 2013] Elhamifar, E., and Vidal, R. 2013. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI 35(11):2765–2781.
- [Fillmore and Williams 1971] Fillmore, P., and Williams, J. 1971. Some convexity theorems for matrices. Glasgow Mathematical Journal 12(02):110–117.
- [Gabay and Mercier 1976] Gabay, D., and Mercier, B. 1976. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications 2(1):17–40.
- [Georghiades, Belhumeur, and Kriegman 2001] Georghiades, A. S.; Belhumeur, P. N.; and Kriegman, D. J. 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. TPAMI 23(6):643–660.
- [He and Yuan 2012] He, B., and Yuan, X. 2012. On the convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis 50(2):700–709.
- [Hong, Luo, and Razaviyayn 2016] Hong, M.; Luo, Z.-Q.; and Razaviyayn, M. 2016. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization 26(1):337–364.
- [Lee and Seung 2001] Lee, D. D., and Seung, H. S. 2001. Algorithms for non-negative matrix factorization. In NIPS, 556–562.
- [Lichman 2013] Lichman, M. 2013. UCI machine learning repository.
- [Lin, Chen, and Ma 2010] Lin, Z.; Chen, M.; and Ma, Y. 2010. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055.
- [Liu, Lin, and Su 2013] Liu, R.; Lin, Z.; and Su, Z. 2013. Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. In ACML, 116–132.
- [Lu et al. 2012] Lu, C.-Y.; Min, H.; Zhao, Z.-Q.; Zhu, L.; Huang, D.-S.; and Yan, S. 2012. Robust and efficient subspace segmentation via least squares regression. ECCV 347–360.
- [Lu et al. 2015] Lu, C.; Zhu, C.; Xu, C.; Yan, S.; and Lin, Z. 2015. Generalized singular value thresholding. In AAAI, 1805–1811.
- [Lu et al. 2016] Lu, C.; Tang, J.; Yan, S.; and Lin, Z. 2016. Nonconvex nonsmooth low rank minimization via iteratively reweighted nuclear norm. TIP 25(2):829–839.
- [Lu et al. 2017] Lu, C.; Feng, J.; Yan, S.; and Lin, Z. 2017. A unified alternating direction method of multipliers by majorization minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- [Lu, Yan, and Lin 2016] Lu, C.; Yan, S.; and Lin, Z. 2016. Convex sparse spectral clustering: Single-view to multi-view. TIP 25(6):2833–2843.
- [Mairal 2013] Mairal, J. 2013. Optimization with first-order surrogate functions. In ICML, 783–791.
- [Martinez 1998] Martinez, A. M. 1998. The AR face database. CVC Technical Report 24.
- [Nesterov 2005] Nesterov, Y. 2005. Smooth minimization of non-smooth functions. Mathematical programming 103(1):127–152.
- [Ng et al. 2002] Ng, A. Y.; Jordan, M. I.; Weiss, Y.; et al. 2002. On spectral clustering: Analysis and an algorithm. NIPS 2:849–856.
- [Rockafellar and Wets 2009] Rockafellar, R. T., and Wets, R. J.-B. 2009. Variational analysis, volume 317. Springer Science & Business Media.
- [Shi and Malik 2000] Shi, J., and Malik, J. 2000. Normalized cuts and image segmentation. TPAMI 22(8):888–905.
- [Sim, Baker, and Bsat 2002] Sim, T.; Baker, S.; and Bsat, M. 2002. The CMU pose, illumination, and expression (PIE) database. In IEEE International Conference on Automatic Face and Gesture Recognition, 46–51. IEEE.
- [Von Luxburg 2007] Von Luxburg, U. 2007. A tutorial on spectral clustering. Statistics and computing 17(4):395–416.
- [Wang, Yin, and Zeng 2015] Wang, Y.; Yin, W.; and Zeng, J. 2015. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324.
Proof of Lemma 3
Proof of (a). We deduce
Consider the first two terms in (23), we have
Consider the optimality of to problem (9), note that is strongly convex with modulus , we have
where we use Lemma B.5 in (?).
By the choice of and in assumption A3 and (15), we can see that is monotonically decreasing.
Proof of (b). To show that converges to some constant , we only need to show that is lower bounded. Indeed,
where ④ uses (14) and ⑤ uses the Lipschitz continuous gradient property of and by assumption A3. Note that since by assumption A1. This combines with the lower bounded assumption of in assumption A2 implies that is lower bounded.
Proof of (c). Summing over both sides of (15) from 0 to leads to
This implies that under assumption A3. Thus due to (12). Finally, since is bounded ().
Proof of (d). First, it is obvious that is bounded due to the constraint . Thus, is bounded. Then, we deduce