Abstract
Constructing the adjacency graph is fundamental to graph-based clustering. Graph learning in kernel space has shown impressive performance on a number of benchmark data sets. However, its performance is largely determined by the chosen kernel matrix. To address this issue, multiple kernel learning algorithms have previously been applied to learn an optimal kernel from a group of predefined kernels. This approach might be sensitive to noise and limits the representation ability of the consensus kernel. In contrast to existing methods, we propose to learn a low-rank kernel matrix which exploits the similarity nature of the kernel matrix and seeks an optimal kernel from the neighborhood of candidate kernels. By formulating graph construction and kernel learning in a unified framework, the graph and the consensus kernel can be iteratively enhanced by each other. Extensive experimental results validate the efficacy of the proposed method.
keywords:
Low-rank Kernel Matrix, Graph Construction, Multiple Kernel Learning, Clustering, Noise
1 Introduction
Clustering is a fundamental and important technique in machine learning, data mining, and pattern recognition jain1999data (); zhu2018yotube (); huang2018robust (). It aims to divide data samples into clusters such that similar objects lie in the same group. It has been utilized in various domains, such as image segmentation felzenszwalb2004efficient (), gene expression analysis jiang2004cluster (), motion segmentation elhamifar2009sparse (), image clustering yang2015class (), heterogeneous data analysis liu2015spectral (), document clustering yan2017novel (); huang2018adaptive (), social media analysis he2014comment (), and subspace learning gao2015multi (); chen2012fgkm (). During the past decades, clustering has been extensively studied and many clustering methods have been developed, such as K-means clustering macqueen1967some (); chen2013twkm (), spectral clustering von2007tutorial (); kang2018unified (), subspace clustering liu2010robust (); peng2018structured (), hierarchical clustering johnson1967hierarchical (), matrix-factorization-based algorithms ding2010convex (); huang2018self (); huang2018robustDMKD (), graph-based clustering huang2015new (); xuan2015topic (), and kernel-based clustering kang2017twin (). Among them, K-means and spectral clustering are especially popular and have been extensively applied in practice.
Basically, the K-means method iteratively assigns data points to their closest clusters and updates the cluster centers. Nonetheless, it cannot partition arbitrarily shaped clusters and is notorious for its sensitivity to the initialization of cluster centers ng2002spectral (). Later, kernel K-means (KKM) was proposed to characterize the nonlinear structure of data scholkopf1998nonlinear (). However, the user has to specify a kernel matrix as input, i.e., the user must assume a certain shape of the data distribution, which is generally unknown. Consequently, the performance of KKM is highly dependent on the choice of the kernel matrix. This is a stumbling block for the practical use of kernel methods in real applications. This issue is partially alleviated by the multiple kernel learning (MKL) technique, which lets an algorithm pick from, or combine, a set of candidate kernels liu2017optimal (); zhou2015recovery (). However, since the candidate kernels might be corrupted due to the contamination of the original data with noise and outliers, the induced kernel might still not be optimal gonen2011multiple (). Moreover, enforcing the optimal kernel to be a linear combination of base kernels could lead to limited representation ability of the optimal kernel. Sometimes, the MKL approach indeed performs worse than a single kernel method kang2018self ().
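To make the kernel dependence concrete, the following is a minimal NumPy sketch of kernel K-means (our own illustration, not the authors' code; all names are hypothetical). Using the kernel trick, the distance from a point to a cluster mean in feature space is computed entirely from entries of the kernel matrix K:

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Minimal kernel k-means: assignments via feature-space distances,
    computed only from the kernel matrix K (kernel trick):
    ||phi(x_i) - mu_c||^2 = K_ii - (2/|c|) sum_{j in c} K_ij
                          + (1/|c|^2) sum_{j,l in c} K_jl."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            mask = labels == c
            if not mask.any():              # re-seed an empty cluster
                mask[rng.integers(n)] = True
            m = mask.sum()
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, mask].sum(1) / m
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        new_labels = dist.argmin(1)
        if (new_labels == labels).all():
            break
        labels = new_labels
    return labels
```

With a linear kernel K = XXᵀ this reduces to ordinary K-means; swapping in a different kernel matrix changes the partition, which is exactly the sensitivity discussed above.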
Spectral clustering, another classic method, is more capable of detecting complex structures in data than other clustering methods yang2015multitask (); yang2017discrete (). It works by embedding the data points into a vector space spanned by the spectrum of the affinity matrix (or data similarity matrix). Therefore, the quality of the similarity graph is crucial to the performance of a spectral clustering algorithm. Traditionally, the Gaussian kernel function is employed to build the graph matrix. Unfortunately, how to select a proper Gaussian parameter is an open problem zelnik2004self (). Moreover, the Gaussian kernel function is sensitive to noise and outliers.
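The standard pipeline can be sketched as follows (our own illustration, with hypothetical names): a Gaussian affinity, whose result depends strongly on the bandwidth sigma, followed by the spectral embedding from the normalized Laplacian.

```python
import numpy as np

def gaussian_affinity(X, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); sensitive to sigma."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)            # no self-loops
    return W

def spectral_embedding(W, k):
    """k smallest eigenvectors of the normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2}; rows are the embedded points."""
    d = W.sum(1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)    # ascending eigenvalues
    return vecs[:, :k]
```

Running K-means on the embedding rows completes spectral clustering; for two well-separated groups, the second eigenvector alone already splits them by sign.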
Recently, some advanced techniques have been developed to construct better similarity graphs. For instance, Zhu et al. zhu2014constructing () used a random-forest-based method to identify discriminative features, so that subtle and weak data affinities can be captured. More importantly, the adaptive neighbors method nie2014clustering () and the self-expression approach kang2017kernel () have been proposed to learn a graph automatically from the data. This automatic strategy can tackle data with structures at different scales of size and density and often provides a high-quality graph, as demonstrated in clustering nie2014clustering (); patel2014kernel (), semi-supervised classification zhuang2012non (); li2015learning (), and many other tasks.
In this paper, we learn the graph in kernel space. To address the kernel dependence issue, we develop a novel method to learn the consensus kernel. Finally, a unified model which seamlessly integrates graph learning and kernel learning is proposed. On the one hand, the quality of the graph will be enhanced if it is learned with an adaptive kernel. On the other hand, the learned graph will help to improve the kernel learning, since the graph and the kernel are the same in essence in terms of the pairwise similarity measure.
The main novelty of this paper is revealing the underlying structure of the kernel matrix by imposing a low-rank regularizer on it. Moreover, we find an ideal kernel in the neighborhood of the base kernels, which can improve the robustness of the learned kernel. This is beneficial in practice since the candidate kernels are often corrupted. Consequently, the optimal kernel can reside in some kernels' neighborhood. In summary, we highlight the main contributions of this paper as follows:

We propose a unified model for learning an optimal consensus kernel and a similarity graph matrix, where the result of one task is used to improve the other one. In other words, we consider the possibility that these two learning processes may need to negotiate with each other to achieve the overall optimality.

By assuming a low-rank structure of the kernel matrix, our model is in a better position to deal with real data. Instead of enforcing the optimal kernel to be a linear combination of predefined kernels, our model allows the most suitable kernel to reside in their neighborhood.

Extensive experiments are conducted to compare the performance of our proposed method with existing state-of-the-art clustering methods. Experimental results demonstrate the superiority of our method.
The rest of the paper is organized as follows. Section 2 describes related works. Section 3 introduces the proposed graph and kernel learning method. Experimental results and analysis are presented in Section 4. Section 5 draws conclusions.
Notations. Given a data matrix X ∈ R^{m×n} with m features and n samples, we denote its (i, j)-th element and i-th column as x_{ij} and x_i, respectively. The ℓ_2-norm of a vector x is represented by ‖x‖_2 = √(x^T x), where x^T is the transpose of x. The ℓ_1-norm of X is denoted by ‖X‖_1 = Σ_{ij} |x_{ij}|. The squared Frobenius norm is defined as ‖X‖_F^2 = Σ_{ij} x_{ij}^2. The definition of X's nuclear norm is ‖X‖_* = Σ_i σ_i, where σ_i is the i-th singular value of X. I represents the identity matrix with proper size. Tr(·) denotes the trace operator. Z ≥ 0 means all elements of Z are nonnegative. The inner product is denoted by ⟨X, Y⟩ = Tr(X^T Y).
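These norm definitions can be checked numerically; the following is a small NumPy illustration of our own (not part of the paper):

```python
import numpy as np

X = np.diag([3.0, 4.0])                  # matrix with known singular values {4, 3}

l1 = np.abs(X).sum()                     # ||X||_1: sum of absolute entries
fro2 = (X ** 2).sum()                    # ||X||_F^2: sum of squared entries
nuclear = np.linalg.svd(X, compute_uv=False).sum()   # ||X||_*: sum of singular values

print(l1, fro2, nuclear)                 # 7.0 25.0 7.0
```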
2 Related Work
To cope with noise and outliers, the robust kernel K-means (RKKM) du2015robust () algorithm has been proposed recently. In this method, the squared ℓ_2-norm of the reconstruction error term is replaced by the ℓ_{2,1}-norm. RKKM demonstrates compelling performance on a number of benchmark data sets. To alleviate the effort of exhaustively searching for the most suitable kernel over a pre-specified pool of kernels, the authors further proposed a robust multiple kernel K-means (RMKKM) algorithm. RMKKM conducts robust K-means by learning an appropriate consensus kernel from a linear combination of multiple candidate kernels. It is shown that RMKKM has great potential to integrate complementary information from different sources along with heterogeneous features yu2012optimized (). This leads to better performance of RMKKM than that of RKKM.
As aforementioned, graph-based clustering methods have achieved impressive performance. To resolve the graph construction challenge, simplex sparse representation (SSR) huang2015new () was proposed to learn the affinity between pairs of samples. It is based on the so-called self-expression property, i.e., each data point can be represented as a weighted combination of the other points liu2010robust (). More similar data points receive larger weights. Therefore, the induced weight matrix reveals the relationships between data points and encodes the data structure. The learned affinity graph matrix is then fed to the spectral clustering algorithm. Empirical experiments demonstrate the superior performance of this approach.
Recently, Kang et al. kang2017twin () proposed to learn the similarity matrix in kernel space based on self-expression. They built a joint framework for similarity matrix construction and cluster label learning. Both a single kernel method (SCSK) and a multiple kernel approach (SCMK) were developed. They learn an optimal kernel in the same way as RMKKM. Specifically, SCMK and RMKKM directly replace the kernel matrix in the single kernel model with a combined kernel, which is expressed as a linear combination of pre-specified kernels in the constraint. This is a straightforward and popular approach in the literature. However, it ignores the structure information of the kernel matrix. In essence, the kernel matrix is a measure of pairwise similarity between data points; hence, the kernel matrix is low-rank in general xia2014robust (). Moreover, they strictly require that the optimal kernel be a linear combination of the base kernels. This might limit the realistic applicability, since real-world data are often corrupted and the ideal kernel might reside in the neighborhood of the combined kernel. Besides, it is time-consuming and impractical to design a large pool of kernels, so it is impossible to obtain a globally optimal kernel. What we can do is find a way to make the best use of the candidate kernels.
In this paper, we propose to learn a similarity graph and a kernel matrix jointly by exploring the kernel matrix structure. With the low-rank requirement on the kernel matrix, we expect to exploit the similarity nature of the kernel matrix. Different from existing methods, we relax the strict condition that the optimal kernel is a linear combination of predefined kernels, in order to account for noise in real data. This enlarges the region from which an ideal kernel can be chosen, and is therefore in a better position than previous approaches to find a more suitable kernel. In particular, in a spirit similar to robust principal component analysis (RPCA) kang2015robust (), the combined kernel is factorized into a low-rank component (the optimal kernel matrix) and a residual.
3 Proposed Methodology
3.1 Formulation
In general, the self-expression-based graph learning problem can be formulated as
$$\min_{Z}\ \|X - XZ\|_F^2 + \alpha\,\rho(Z), \quad \mathrm{s.t.}\ \ Z \geq 0 \tag{1}$$
where α > 0 is a regularization parameter, the self-expression coefficient matrix Z ∈ R^{n×n} is often assumed to be nonnegative, and ρ(Z) is the regularization term on Z. Two commonly used assumptions about Z are low-rankness and sparsity, corresponding to ρ(Z) = ‖Z‖_* and ρ(Z) = ‖Z‖_1, respectively. Suppose φ maps the data points from the input space to a reproducing kernel Hilbert space H. Then, based on the kernel trick, the (i, j)-th element of the kernel matrix K is K_{ij} = φ(x_i)^T φ(x_j). In kernel space, Eq. (1) gives
$$\min_{Z}\ \mathrm{Tr}\big(K - 2KZ + Z^{\top}KZ\big) + \alpha\,\rho(Z), \quad \mathrm{s.t.}\ \ Z \geq 0 \tag{2}$$
This model is capable of recovering the linear relationships among the data samples in the new space, and thus the nonlinear relationships in the original representation. One limitation of Eq. (2) is that its performance heavily depends on the input kernel matrix K. To overcome this drawback, we can learn a suitable kernel from r predefined kernels {K^i}_{i=1}^{r}. Different from existing MKL methods, we aim to increase the consensus kernel's representation ability by taking the effect of noise into account. Finally, our proposed Low-rank Kernel learning for Graph matrix (LKG) is formulated as follows
$$\min_{Z,\,K,\,w}\ \mathrm{Tr}\big(K - 2KZ + Z^{\top}KZ\big) + \alpha\,\rho(Z) + \beta\,\|K\|_* + \gamma\,\Big\|K - \sum_{i=1}^{r} w_i K^i\Big\|_F^2, \quad \mathrm{s.t.}\ \ Z \geq 0,\ K \geq 0,\ \sum_{i=1}^{r} w_i = 1,\ w_i \geq 0 \tag{3}$$
where w_i is the weight for kernel K^i, the consensus kernel matrix K is nonnegative, and the constraints on w follow the standard MKL method. If a kernel K^i is not appropriate due to a bad choice of metric or parameter, or is severely corrupted by noise or outliers, the corresponding w_i will be assigned a small value.
In Eq. (3), the nuclear norm ‖K‖_* explores the structure of the kernel matrix, so that the learned K respects the correlations among samples, i.e., the cluster structure of the data. Moreover, enforcing the nuclear norm regularizer on K makes K robust to noise and errors. The last term in Eq. (3) means that we seek an optimal kernel K in the neighborhood of the combined kernel Σ_i w_i K^i, which places our model in a better position than previous approaches to identify a more suitable kernel. Due to noise and outliers, Σ_i w_i K^i could be a noisy observation of the ideal kernel K. As a matter of fact, this is similar to RPCA candes2011robust (); kang2015robustpca (), where the original noisy data are decomposed into a low-rank part and an error part. Formulating Z, K, and w in a unified model reinforces the underlying connections between learning the optimal kernel and graph learning. By iteratively updating Z, K, and w, they can be repeatedly improved.
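The pieces of the model can be made concrete with a small evaluation sketch (our own hypothetical reading of the objective described above, with the sparse choice of ρ; function and variable names are our own): a self-expression data term in kernel space, a penalty on Z, a nuclear norm on K, and the neighborhood term tying K to the combined kernel.

```python
import numpy as np

def lkg_objective(K, Z, K_list, w, alpha, beta, gamma):
    """Evaluate one reading of the LKG objective:
    Tr(K - 2KZ + Z^T K Z) + alpha*||Z||_1 + beta*||K||_* + gamma*||K - sum_i w_i K^i||_F^2."""
    data = np.trace(K - 2 * K @ Z + Z.T @ K @ Z)        # kernel-space reconstruction error
    sparse = np.abs(Z).sum()                            # rho(Z) = ||Z||_1 (sparse choice)
    nuclear = np.linalg.svd(K, compute_uv=False).sum()  # ||K||_*
    K_w = sum(wi * Ki for wi, Ki in zip(w, K_list))     # combined kernel
    neigh = ((K - K_w) ** 2).sum()                      # ||K - K_w||_F^2
    return data + alpha * sparse + beta * nuclear + gamma * neigh
```

For example, with K = I, Z = 0, and a single base kernel K¹ = I fully weighted, the data term is Tr(I) and the nuclear term ‖I‖_*, while the other two terms vanish.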
3.2 Optimization
We propose to solve problem (3) based on the alternating direction method of multipliers (ADMM) boyd2011distributed (). First, we introduce two auxiliary variables S and J to make the variables separable and rewrite problem (3) in the following equivalent form
$$\min_{Z,\,S,\,J,\,K,\,w}\ \mathrm{Tr}\big(K - 2KZ + Z^{\top}KZ\big) + \alpha\,\rho(S) + \beta\,\|J\|_* + \gamma\,\Big\|K - \sum_i w_i K^i\Big\|_F^2, \quad \mathrm{s.t.}\ \ S = Z,\ J = K,\ S \geq 0,\ \sum_i w_i = 1,\ w_i \geq 0 \tag{4}$$
The corresponding augmented Lagrangian function is
$$\mathcal{L} = \mathrm{Tr}\big(K - 2KZ + Z^{\top}KZ\big) + \alpha\,\rho(S) + \beta\,\|J\|_* + \gamma\,\Big\|K - \sum_i w_i K^i\Big\|_F^2 + \frac{\mu}{2}\Big(\Big\|Z - S + \frac{Y_1}{\mu}\Big\|_F^2 + \Big\|K - J + \frac{Y_2}{\mu}\Big\|_F^2\Big) \tag{5}$$
where μ > 0 is a penalty parameter and Y_1, Y_2 are the Lagrangian multipliers. These variables can be updated alternatingly, one at a time, while keeping the others fixed.
To solve Z, the objective function (5) becomes
$$\min_{Z}\ \mathrm{Tr}\big(K - 2KZ + Z^{\top}KZ\big) + \frac{\mu}{2}\Big\|Z - S + \frac{Y_1}{\mu}\Big\|_F^2 \tag{6}$$
It can be solved by setting its first derivative to zero. Then we have
$$Z = (2K + \mu I)^{-1}\,(2K + \mu S - Y_1) \tag{7}$$
Similarly, by setting the derivative of (5) with respect to K to zero, we can obtain the updating rule for K as
$$K = \frac{2\gamma \sum_i w_i K^i + \mu J - Y_2 - \big(I - Z - Z^{\top} + ZZ^{\top}\big)}{2\gamma + \mu} \tag{8}$$
To solve S, the subproblem is
$$\min_{S}\ \alpha\,\rho(S) + \frac{\mu}{2}\Big\|S - \Big(Z + \frac{Y_1}{\mu}\Big)\Big\|_F^2 \tag{9}$$
Depending on the regularization strategy, we obtain different closed-form solutions for S. Let us define W = Z + Y_1/μ and write the singular value decomposition (SVD) of W as UΣV^T. Then, for the low-rank representation ρ(S) = ‖S‖_*, it yields
$$S = U\,\mathrm{diag}\Big(\max\Big(\sigma_i - \frac{\alpha}{\mu},\, 0\Big)\Big)\,V^{\top} \tag{10}$$
To obtain a sparse representation, i.e., ρ(S) = ‖S‖_1 with S ≥ 0, we can update S element-wise as
$$S_{ij} = \max\Big(W_{ij} - \frac{\alpha}{\mu},\, 0\Big) \tag{11}$$
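Both closed-form updates are standard proximal operators; a NumPy sketch of our own (the threshold tau stands for the ratio α/μ above) is:

```python
import numpy as np

def svt(W, tau):
    """Singular value thresholding: prox of tau*||.||_* at W (low-rank case)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def soft_threshold_nonneg(W, tau):
    """Elementwise prox of tau*||.||_1 under nonnegativity (sparse case)."""
    return np.maximum(W - tau, 0)
```

For instance, thresholding the singular values {3, 1} of diag(3, 1) at tau = 2 leaves a rank-one matrix diag(1, 0).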
To solve J, we have
$$\min_{J}\ \beta\,\|J\|_* + \frac{\mu}{2}\Big\|J - \Big(K + \frac{Y_2}{\mu}\Big)\Big\|_F^2 \tag{12}$$
By letting Q = K + Y_2/μ and writing its SVD as Q = U_Q Σ_Q V_Q^T, we have
$$J = U_Q\,\mathrm{diag}\Big(\max\Big(\sigma_{Q,i} - \frac{\beta}{\mu},\, 0\Big)\Big)\,V_Q^{\top} \tag{13}$$
To solve w, the optimization problem (3) becomes
$$\min_{w}\ w^{\top} M w - 2\,w^{\top} b, \quad \mathrm{s.t.}\ \ \sum_{i=1}^{r} w_i = 1,\ w_i \geq 0 \tag{14}$$
where M_{ij} = ⟨K^i, K^j⟩ and b_i = ⟨K, K^i⟩. It is a quadratic programming problem with linear constraints, which can be easily solved with existing packages. In summary, our algorithm for solving problem (3) is outlined in Algorithm 1.
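As a dependency-free illustration of this small QP (our own sketch, not the solver used in the paper), projected gradient descent with a Euclidean projection onto the simplex works well, since the number of base kernels is small:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0)

def solve_kernel_weights(K_list, K, n_iter=500):
    """min_w w^T M w - 2 w^T b on the simplex, with M_ij = <K^i, K^j>
    and b_i = <K, K^i> (Frobenius inner products)."""
    r = len(K_list)
    M = np.array([[np.sum(Ki * Kj) for Kj in K_list] for Ki in K_list])
    b = np.array([np.sum(K * Ki) for Ki in K_list])
    w = np.full(r, 1.0 / r)
    step = 1.0 / (2 * np.linalg.norm(M, 2) + 1e-12)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        w = project_simplex(w - step * (2 * M @ w - 2 * b))
    return w
```

When the consensus kernel K coincides with one base kernel, the solver places all weight on that kernel.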
After obtaining the graph Z, we can use it for clustering, semi-supervised classification, and so on. In this work, we focus on the clustering task. Specifically, we run the spectral clustering ng2002spectral () algorithm on Z to obtain the final results.
3.3 Complexity Analysis
The time complexity for constructing each kernel is O(n^2). The computational cost for updating Z and K is O(n^3), dominated by matrix inversion and multiplication. Updating J requires an SVD in every iteration, whose complexity is O(n^3); this can be reduced to O(pn^2) (where p is the lowest rank we can find) if we employ a partial SVD based on the package PROPACK larsen2004propack (). For S, the complexity depends on the choice of regularizer: for the low-rank representation, it is the same as that of J, while the complexity of obtaining a sparse solution is O(n^2). The update of w is a quadratic programming problem, which can be solved in polynomial time; fortunately, the size of w is a small number r. Updating Y_1 and Y_2 costs O(n^2).
4 Experiments
4.1 Data Sets
We examine the effectiveness of our method on nine real-world benchmark data sets, which are commonly used in the literature. The basic information of the data sets is shown in Table 1. Specifically, the first five data sets are images, and the other four are text corpora (http://www-users.cs.umn.edu/~han/data/tmdata.tar.gz, http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html).
The five image data sets include four famous face databases (ORL: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, YALE: http://vision.ucsd.edu/content/yalefacedatabase, AR: http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html, and JAFFE: http://www.kasrl.org/jaffe.html), and a binary alpha digits data set, BA (http://www.cs.nyu.edu/~roweis/data.html). As shown in Figure 1(a), BA consists of digits "0" through "9" and capital letters "A" through "Z". In YALE, ORL, AR, and JAFFE, each subject has images with different facial expressions or configurations due to capture times, illumination conditions, and glasses/no glasses. Hence, these data sets are contaminated at different levels. Figures 1(b) and 1(c) show some example images from the YALE and JAFFE databases.
Following the setting in kang2017twin (), we manually construct 12 kernels. They consist of seven Gaussian kernels of the form K(x, y) = exp(-‖x - y‖^2 / (t d_max^2)) with t ∈ {0.01, 0.05, 0.1, 1, 10, 50, 100}, where d_max denotes the maximal distance between data points; a linear kernel K(x, y) = x^T y; and four polynomial kernels of the form K(x, y) = (a + x^T y)^b with a ∈ {0, 1} and b ∈ {2, 4}. In addition, all kernel matrices are normalized to the [0, 1] range. This is done by dividing each element by the largest element of its corresponding kernel matrix.
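The 12-kernel pool can be reproduced roughly as follows (a sketch under the stated settings; the function and variable names are our own):

```python
import numpy as np

def build_kernel_pool(X):
    """Construct the 12 candidate kernels described above; each kernel is
    normalized by dividing by its largest element. Rows of X are samples."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    d_max2 = sq.max()                                     # squared maximal distance
    lin = X @ X.T                                         # Gram matrix x^T y
    kernels = []
    for t in (0.01, 0.05, 0.1, 1, 10, 50, 100):           # seven Gaussian kernels
        kernels.append(np.exp(-sq / (t * d_max2)))
    kernels.append(lin)                                   # one linear kernel
    for a in (0, 1):                                      # four polynomial kernels
        for b in (2, 4):
            kernels.append((a + lin) ** b)
    return [K / K.max() for K in kernels]                 # normalize each kernel
```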
Data set  # instances  # features  # classes
YALE  165  1024  15
JAFFE  213  676  10
ORL  400  1024  40
AR  840  768  120
BA  1404  320  36
TR11  414  6429  9
TR41  878  7454  10
TR45  690  8261  10
TDT2  9394  36771  30
4.2 Evaluation Metrics
To quantitatively assess our algorithm’s performance on the clustering task, we use the popular measures, i.e., accuracy (Acc) and normalized mutual information (NMI).
Acc discovers the one-to-one relationship between clusters and classes. Let c_i and y_i be the clustering result label and the ground truth label of sample x_i, respectively. Then Acc is defined by

$$\mathrm{Acc} = \frac{\sum_{i=1}^{n} \delta\big(y_i, \mathrm{map}(c_i)\big)}{n}$$
where n is the total number of samples, the delta function δ(x, y) equals one if and only if x = y and zero otherwise, and map(·) is the best permutation mapping function that maps each cluster index to a true class label based on the Kuhn-Munkres algorithm.
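For small numbers of clusters, the best mapping can simply be found by brute force over permutations, as in the illustrative sketch below (our own code; for many clusters, the Kuhn-Munkres route, e.g. `scipy.optimize.linear_sum_assignment`, is the usual choice):

```python
import itertools
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Acc with the best one-to-one cluster-to-class mapping.
    Assumes the number of clusters equals the number of classes;
    brute force over permutations is fine for small label sets."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    best = 0
    for perm in itertools.permutations(classes):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for p, t in zip(y_pred, y_true))
        best = max(best, hits)
    return best / len(y_true)
```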
The NMI measures the quality of clustering. Given two sets of clusters C and C′,

$$\mathrm{NMI}(C, C') = \frac{\sum_{c \in C,\, c' \in C'} p(c, c') \log \frac{p(c, c')}{p(c)\,p(c')}}{\max\big(H(C), H(C')\big)}$$
where p(c) and p(c′) represent the marginal probability distribution functions of C and C′, respectively, induced from the joint distribution p(c, c′) of C and C′, and H(·) is the entropy function. A greater NMI means better clustering performance.
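A compact implementation of this definition (our own sketch, using the max-entropy normalization, which is one common variant) is:

```python
import numpy as np

def nmi(labels_a, labels_b):
    """NMI(C, C') = MI(C, C') / max(H(C), H(C'))."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(a)
    ua, ub = np.unique(a), np.unique(b)
    # joint and marginal probabilities from the contingency table
    p_ab = np.array([[np.sum((a == x) & (b == y)) for y in ub] for x in ua]) / n
    p_a, p_b = p_ab.sum(1), p_ab.sum(0)
    nz = p_ab > 0
    mi = np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a[:, None] * p_b[None, :])[nz]))
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return mi / max(h(p_a), h(p_b))
```

NMI is invariant to label permutations: a perfect clustering with swapped label names still scores 1, while statistically independent labelings score 0.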
4.3 Comparison Methods
To fully examine the effectiveness of our proposed algorithm, we compare with both graph-based clustering methods and kernel methods. More concretely, we compare with Kernel K-means (KKM) scholkopf1998nonlinear (), Spectral Clustering (SC) ng2002spectral (), Robust Kernel K-means (RKKM) du2015robust (), Simplex Sparse Representation (SSR) huang2015new (), and SCSK kang2017twin (). Among them, SC, SSR, and SCSK are graph-based clustering methods. Since SSR is developed in the feature space, we only need to run it once. For the other techniques, we run them on each kernel and report their best performance as well as their average performance over all kernels.
We also compare with a number of multiple kernel learning methods. We directly run the downloaded programs of the comparison methods on the 12 kernels:
Multiple Kernel K-means (MKKM) (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/mkfc/code). MKKM huang2012multiple () is an extension of K-means to the situation where multiple kernels exist.
Affinity Aggregation for Spectral Clustering (AASC) (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/aasc/code). AASC huang2012affinity () extends spectral clustering to deal with multiple affinities.
Robust Multiple Kernel K-means (RMKKM) (https://github.com/csliangdu/RMKKM). RMKKM du2015robust () extends K-means to deal with noise and outliers in a multiple-kernel setting.
Twin learning for Similarity and Clustering with Multiple Kernel (SCMK) kang2017twin (). A recently proposed graph-based clustering method with multiple kernel learning capability. Both RMKKM and SCMK strictly require that the consensus kernel be a combination of the base kernels.
Low-rank Kernel learning for Graph matrix (LKG). Our proposed low-rank kernel learning method for graph-based clustering. After obtaining the similarity graph matrix Z, we run the spectral clustering algorithm to finish the clustering task. We examine both the low-rank and the sparse regularizer, and denote the corresponding methods as LKGr and LKGs, respectively.
4.4 Results


For the compared methods, we either use their existing parameter settings or tune them to obtain the best performance. In particular, we can directly obtain the optimal results for the KKM, SC, RKKM, MKKM, AASC, and RMKKM methods by running the package from du2015robust (). SSR is a parameter-free model. Hence, we only need to tune the parameters for SCSK and SCMK. The experimental results are presented in Table 2. In most cases, our proposed method LKG achieves the best performance among all state-of-the-art algorithms. In particular, we have the following observations.

For non-multiple-kernel-based techniques, we see big differences between the best and the average results. This validates the fact that the choice of kernel has a big impact on the final results. Therefore, it is imperative to develop multiple kernel learning methods.

As expected, multiple kernel methods work better than single kernel approaches. This is consistent with our belief that multiple kernel methods often exploit complementary information.

Graph-based clustering methods often perform much better than K-means and its extensions. As can be seen, SSR, SCSK, SCMK, and LKG improve clustering performance considerably.

By comparing the performance of SCMK and LKG, we can clearly see the advantage of our low-rank kernel learning approach. This demonstrates that it is beneficial to adopt our proposed kernel learning method.
Table 3: p-values of the Wilcoxon signed rank test comparing LKGs and LKGr with the other methods.

Method  Metric  KKM  SC  RKKM  SSR  SCSK  MKKM  AASC  RMKKM  SCMK
LKGs  Acc  .0039  .0078  .0039  .0117  .3008  .0039  .0039  .0078  .5703
LKGs  NMI  .0039  .0039  .0039  .0078  .2031  .0039  .0039  .0039  .5703
LKGr  Acc  .0039  .0078  .0039  .0273  .3008  .0039  .0039  .0039  .4268
LKGr  NMI  .0039  .0039  .0039  .0195  .0547  .0039  .0039  .0078  .3008
To assess the significance of the improvements, we further apply the Wilcoxon signed rank test to the results in Table 2. We show the p-values in Table 3. We note that the p-values are below 0.05 when comparing LKGs and LKGr to all other methods except SCSK and SCMK, which were proposed in 2017. Therefore, LKGs and LKGr outperform KKM, SC, RKKM, SSR, MKKM, AASC, and RMKKM with statistical significance.
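The magnitudes of the reported p-values can be understood with a small exact implementation of the one-sided signed rank test (our own sketch, assuming no zero or tied differences): with eight paired comparisons all favoring one method, the smallest attainable p-value is 1/2^8 ≈ .0039, which matches the most frequent entry in Table 3.

```python
import itertools
import numpy as np

def wilcoxon_signed_rank_p(diffs):
    """Exact one-sided p-value of the Wilcoxon signed rank test for small
    samples: enumerate all 2^n sign patterns of the ranks of |d|.
    Assumes no zero differences and no ties among |d|."""
    d = np.asarray(diffs, dtype=float)
    assert np.all(d != 0), "drop zero differences beforehand"
    ranks = np.abs(d).argsort().argsort() + 1.0   # ranks of |d|, 1..n
    w_plus = ranks[d > 0].sum()                   # observed positive-rank sum
    n = len(d)
    # Under the null, each sign is equally likely: count patterns with W+ >= observed
    count = sum(
        (ranks * np.array(signs)).sum() >= w_plus
        for signs in itertools.product((0, 1), repeat=n)
    )
    return count / 2 ** n
```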
4.5 Parameter Sensitivity
There are three hyperparameters in our model: α, β, and γ. To better see the effects of α and γ, we fix β and search α and γ over a grid of values. We analyze the sensitivity of our model LKGr to them, using the YALE, JAFFE, and ORL data sets as examples, in terms of accuracy. Figures 2 to 4 show that our model gives reasonable results over a wide range of parameters.
4.6 Examination on Multiview Data
Nowadays, multi-view data are prevalent. Hence, we test our model on multi-view data in this subsection. We employ two widely used multi-view data sets for performance evaluation, namely Cora sen2008collective () and NUS-WIDE chua2009nus (). Note that most of the data sets used in this paper have imbalanced clusters. For example, the seven clusters in Cora contain 818, 180, 217, 426, 351, 418, and 298 samples, respectively. For clustering, the imbalance issue is seldom discussed zhu2017entropy (). Hence, we expect our method to work well in general circumstances. For a comprehensive evaluation, more measures are used here, including F-score, Precision, Recall, Adjusted Rand Index (ARI), Entropy, and Purity. Each metric characterizes different properties of the clustering. Except for Entropy, a larger value of these metrics indicates better performance.
We run the algorithms on each view and report the clustering results in Tables 4 and 5. For our algorithms LKGs and LKGr, we repeat the experiments 20 times and report the mean values and standard deviations. As can be seen, our approach performs better than all the other baselines in most measures. It is unsurprising that different views give different performances. Our proposed method works well in general.




5 Conclusion
In this paper, we propose a multiple-kernel-learning-based graph clustering method. Different from existing multiple kernel learning methods, our method explicitly assumes that the consensus kernel matrix should be low-rank and lie in the neighborhood of the combined kernel. As a result, the learned graph is more informative and discriminative, especially when the data are subject to noise and outliers. Experimental results on both image clustering and document clustering demonstrate that our method indeed improves clustering performance compared to existing clustering techniques.
Acknowledgments
This paper was in part supported by Grants from the Natural Science Foundation of China (Nos. 61806045, 61572111, and 61772115), a 985 Project of UESTC (No. A1098531023601041), three Fundamental Research Funds for the Central Universities of China (Nos. A03017023701012, ZYGX2017KYQD177, and ZYGX2016J086), and the China Postdoctoral Science Foundation (No. 2016M602677).
References
 (1) A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM computing surveys (CSUR) 31 (3) (1999) 264–323.
 (2) H. Zhu, R. Vial, S. Lu, X. Peng, H. Fu, Y. Tian, X. Cao, Yotube: Searching action proposal via recurrent and static regression networks, IEEE Transactions on Image Processing 27 (6) (2018) 2609–2622.
 (3) S. Huang, Y. Ren, Z. Xu, Robust multiview data clustering with multiview cappednorm kmeans, Neurocomputing.
 (4) P. F. Felzenszwalb, D. P. Huttenlocher, Efficient graphbased image segmentation, International journal of computer vision 59 (2) (2004) 167–181.
 (5) D. Jiang, C. Tang, A. Zhang, Cluster analysis for gene expression data: a survey, IEEE Transactions on knowledge and data engineering 16 (11) (2004) 1370–1386.
 (6) E. Elhamifar, R. Vidal, Sparse subspace clustering, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 2790–2797.
 (7) S. Yang, Z. Yi, X. He, X. Li, A class of manifold regularized multiplicative update algorithms for image clustering, IEEE Transactions on Image Processing 24 (12) (2015) 5302–5314.
 (8) H. Liu, T. Liu, J. Wu, D. Tao, Y. Fu, Spectral ensemble clustering, in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, 2015, pp. 715–724.
 (9) W. Yan, B. Zhang, S. Ma, Z. Yang, A novel regularized concept factorization for document clustering, KnowledgeBased Systems 135 (2017) 147–158.
 (10) S. Huang, Z. Xu, J. Lv, Adaptive local structure learning for document coclustering, KnowledgeBased Systems 148 (2018) 74–84.
 (11) X. He, M.Y. Kan, P. Xie, X. Chen, Commentbased multiview clustering of web 2.0 items, in: Proceedings of the 23rd international conference on World wide web, ACM, 2014, pp. 771–782.
 (12) H. Gao, F. Nie, X. Li, H. Huang, Multiview subspace clustering, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4238–4246.
 (13) X. Chen, Y. Ye, X. Xu, J. Z. Huang, A feature group weighting method for subspace clustering of highdimensional data, Pattern Recognition 45 (1) (2012) 434–446.
 (14) J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, Oakland, CA, USA., 1967, pp. 281–297.
 (15) X. Chen, X. Xu, Y. Ye, J. Z. Huang, TWkmeans: Automated Twolevel Variable Weighting Clustering Algorithm for Multiview Data, IEEE Transactions on Knowledge and Data Engineering 25 (4) (2013) 932–944.
 (16) U. Von Luxburg, A tutorial on spectral clustering, Statistics and computing 17 (4) (2007) 395–416.
 (17) Z. Kang, C. Peng, Q. Cheng, Z. Xu, Unified spectral clustering with optimal graph., in: AAAI, 2018, pp. 3366–3373.
 (18) G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by lowrank representation, in: Proceedings of the 27th international conference on machine learning (ICML10), 2010, pp. 663–670.
 (19) X. Peng, J. Feng, S. Xiao, W.Y. Yau, J. T. Zhou, S. Yang, Structured autoencoders for subspace clustering, IEEE transactions on image processing 27 (10) (2018) 5076–5086.
 (20) S. C. Johnson, Hierarchical clustering schemes, Psychometrika 32 (3) (1967) 241–254.
 (21) C. H. Ding, T. Li, M. I. Jordan, Convex and seminonnegative matrix factorizations, IEEE transactions on pattern analysis and machine intelligence 32 (1) (2010) 45–55.
 (22) S. Huang, Z. Kang, Z. Xu, Selfweighted multiview clustering with soft capped norm, KnowledgeBased Systems 158 (2018) 1–8.
 (23) S. Huang, H. Wang, T. Li, T. Li, Z. Xu, Robust graph regularized nonnegative matrix factorization for clustering, Data Mining and Knowledge Discovery 32 (2) (2018) 483–503.
 (24) J. Huang, F. Nie, H. Huang, A new simplex sparse learning model to measure data similarity for clustering, in: Proceedings of the 24th International Conference on Artificial Intelligence, AAAI Press, 2015, pp. 3569–3575.
 (25) J. Xuan, J. Lu, G. Zhang, X. Luo, Topic model for graph mining., IEEE Trans. Cybernetics 45 (12) (2015) 2792–2803.
 (26) Z. Kang, C. Peng, Q. Cheng, Twin learning for similarity and clustering: A unified kernel approach., in: AAAI, 2017, pp. 2080–2086.
 (27) A. Y. Ng, M. I. Jordan, Y. Weiss, et al., On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems 2 (2002) 849–856.
 (28) B. Schölkopf, A. Smola, K.R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural computation 10 (5) (1998) 1299–1319.
 (29) X. Liu, S. Zhou, Y. Wang, M. Li, Y. Dou, E. Zhu, J. Yin, H. Li, Optimal neighborhood kernel clustering with multiple kernels., in: AAAI, 2017, pp. 2266–2272.
 (30) P. Zhou, L. Du, L. Shi, H. Wang, Y.D. Shen, Recovery of corrupted multiple kernels for clustering., in: IJCAI, 2015, pp. 4105–4111.
 (31) M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, Journal of Machine Learning Research 12 (Jul) (2011) 2211–2268.
 (32) Z. Kang, X. Lu, J. Yi, Z. Xu, Selfweighted multiple kernel learning for graphbased clustering and semisupervised classification, in: IJCAI, 2018, pp. 2312–2318.
 (33) Y. Yang, Z. Ma, Y. Yang, F. Nie, H. T. Shen, Multitask spectral clustering by exploring intertask correlation, IEEE transactions on cybernetics 45 (5) (2015) 1083–1094.
 (34) Y. Yang, F. Shen, Z. Huang, H. T. Shen, X. Li, Discrete nonnegative spectral clustering, IEEE Transactions on Knowledge and Data Engineering 29 (9) (2017) 1834–1845.
 (35) L. ZelnikManor, P. Perona, Selftuning spectral clustering., in: NIPS, Vol. 17, 2004, p. 16.
 (36) X. Zhu, C. Change Loy, S. Gong, Constructing robust affinity graphs for spectral clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1450–1457.
 (37) F. Nie, X. Wang, H. Huang, Clustering and projected clustering with adaptive neighbors, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014, pp. 977–986.
 (38) Z. Kang, C. Peng, Q. Cheng, Kerneldriven similarity learning, Neurocomputing 267 (2017) 210–219.
 (39) V. M. Patel, R. Vidal, Kernel sparse subspace clustering, in: 2014 IEEE International Conference on Image Processing (ICIP), IEEE, 2014, pp. 2849–2853.
 (40) L. Zhuang, H. Gao, Z. Lin, Y. Ma, X. Zhang, N. Yu, Nonnegative low rank and sparse graph for semisupervised learning, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2328–2335.
 (41) C.G. Li, Z. Lin, H. Zhang, J. Guo, Learning semisupervised representation towards a unified optimization framework for semisupervised learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2767–2775.
 (42) L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, Y.-D. Shen, Robust multiple kernel k-means using ℓ2,1-norm, in: Proceedings of the 24th International Conference on Artificial Intelligence, AAAI Press, 2015, pp. 3476–3482.
 (43) S. Yu, L. Tranchevent, X. Liu, W. Glanzel, J. A. Suykens, B. De Moor, Y. Moreau, Optimized data fusion for kernel kmeans clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (5) (2012) 1031–1039.
 (44) R. Xia, Y. Pan, L. Du, J. Yin, Robust multiview spectral clustering via lowrank and sparse decomposition., in: AAAI, 2014, pp. 2149–2155.
 (45) Z. Kang, C. Peng, Q. Cheng, Robust subspace clustering via smoothed rank approximation, IEEE Signal Processing Letters 22 (11) (2015) 2088–2092.
 (46) E. J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, Journal of the ACM (JACM) 58 (3) (2011) 11.
 (47) Z. Kang, C. Peng, Q. Cheng, Robust pca via nonconvex rank approximation, in: Proceedings of the 2015 IEEE International Conference on Data Mining (ICDM), IEEE Computer Society, 2015, pp. 211–220.
 (48) S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends® in Machine Learning 3 (1) (2011) 1–122.
 (49) R. M. Larsen, PROPACK: software for large and sparse SVD calculations, available online: http://sun.stanford.edu/~rmunk/PROPACK (2004).
 (50) H.C. Huang, Y.Y. Chuang, C.S. Chen, Multiple kernel fuzzy clustering, IEEE Transactions on Fuzzy Systems 20 (1) (2012) 120–134.
 (51) H.C. Huang, Y.Y. Chuang, C.S. Chen, Affinity aggregation for spectral clustering, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 773–780.
 (52) P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. EliassiRad, Collective classification in network data, AI magazine 29 (3) (2008) 93.
 (53) T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, Nuswide: a realworld web image database from national university of singapore, in: Proceedings of the ACM international conference on image and video retrieval, ACM, 2009, p. 48.
 (54) C. Zhu, Z. Wang, Entropybased matrix learning machine for imbalanced data sets, Pattern Recognition Letters 88 (2017) 72–80.