Low-rank Kernel Learning for Graph-based Clustering
Abstract

Constructing the adjacency graph is fundamental to graph-based clustering. Graph learning in kernel space has shown impressive performance on a number of benchmark data sets. However, its performance is largely determined by the chosen kernel matrix. To address this issue, multiple kernel learning algorithms have previously been applied to learn an optimal kernel from a group of predefined kernels. This approach may be sensitive to noise and limits the representation ability of the consensus kernel. In contrast to existing methods, we propose to learn a low-rank kernel matrix which exploits the similarity nature of the kernel matrix and seeks an optimal kernel from the neighborhood of candidate kernels. By formulating graph construction and kernel learning in a unified framework, the graph and the consensus kernel can be iteratively enhanced by each other. Extensive experimental results validate the efficacy of the proposed method.

keywords:
Low-rank Kernel Matrix, Graph Construction, Multiple Kernel Learning, Clustering, Noise

1 Introduction

Clustering is a fundamental and important technique in machine learning, data mining, and pattern recognition jain1999data (); zhu2018yotube (); huang2018robust (). It aims to divide data samples into certain clusters such that similar objects lie in the same group. It has been utilized in various domains, such as image segmentation felzenszwalb2004efficient (), gene expression analysis jiang2004cluster (), motion segmentation elhamifar2009sparse (), image clustering yang2015class (), heterogeneous data analysis liu2015spectral (), document clustering yan2017novel (); huang2018adaptive (), social media analysis he2014comment (), subspace learning gao2015multi (); chen2012fgkm (). During the past decades, clustering has been extensively studied and many clustering methods have been developed, such as K-means clustering macqueen1967some (); chen2013twkm (), spectral clustering von2007tutorial (); kang2018unified (), subspace clustering liu2010robust (); peng2018structured (), hierarchical clustering johnson1967hierarchical (), matrix factorization-based algorithms ding2010convex (); huang2018self (); huang2018robustDMKD (), graph-based clustering huang2015new (); xuan2015topic (), and kernel-based clustering kang2017twin (). Among them, the K-means and spectral clustering are especially popular and have been extensively applied in practice.

Basically, the K-means method iteratively assigns data points to their closest clusters and updates the cluster centers. Nonetheless, it cannot partition arbitrarily shaped clusters and is notorious for its sensitivity to the initialization of cluster centers ng2002spectral (). Later, kernel K-means (KKM) was proposed to characterize the nonlinear structure of data scholkopf1998nonlinear (). However, the user has to specify a kernel matrix as input, i.e., the user must assume a certain shape of the data distribution, which is generally unknown. Consequently, the performance of KKM is highly dependent on the choice of the kernel matrix, and this is a stumbling block for the practical use of kernel methods in real applications. This issue is partially alleviated by the multiple kernel learning (MKL) technique, which lets the algorithm pick or combine from a set of candidate kernels liu2017optimal (); zhou2015recovery (). However, since the candidate kernels might be corrupted by noise and outliers in the original data, the induced kernel might still not be optimal gonen2011multiple (). Moreover, enforcing the optimal kernel to be a linear combination of base kernels could limit its representation ability. Sometimes, the MKL approach indeed performs worse than a single kernel method kang2018self ().

Spectral clustering, another classic method, is more capable of detecting complex structures in data than many other clustering methods yang2015multitask (); yang2017discrete (). It works by embedding the data points into a vector space spanned by the spectrum of the affinity matrix (or data similarity matrix). Therefore, the quality of the similarity graph is crucial to the performance of the spectral clustering algorithm. Traditionally, the Gaussian kernel function is employed to build the graph matrix. Unfortunately, how to select a proper Gaussian parameter is an open problem zelnik2004self (). Moreover, the Gaussian kernel function is sensitive to noise and outliers.

Recently, some advanced techniques have been developed to construct better similarity graphs. For instance, Zhu et al. zhu2014constructing () used a random forest-based method to identify discriminative features, so that subtle and weak data affinity can be captured. More importantly, adaptive neighbors method nie2014clustering () and self-expression approach kang2017kernel () have been proposed to learn a graph automatically from the data. This automatic strategy can tackle data with structures at different scales of size and density and often provides a high-quality graph, as demonstrated in clustering nie2014clustering (); patel2014kernel (), semi-supervised classification zhuang2012non (); li2015learning (), and many others.

In this paper, we learn the graph in kernel space. To address the kernel dependence issue, we develop a novel method to learn a consensus kernel. Finally, a unified model which seamlessly integrates graph learning and kernel learning is proposed. On one hand, the quality of the graph will be enhanced if it is learned with an adaptive kernel. On the other hand, the learned graph will help to improve the kernel learning, since the graph and the kernel both encode pairwise similarity in essence.

The main novelty of this paper is revealing the underlying structure of the kernel matrix by imposing a low-rank regularizer on it. Moreover, we seek an ideal kernel in the neighborhood of the base kernels, which improves the robustness of the learned kernel. This is beneficial in practice since the candidate kernels are often corrupted, so the optimal kernel may reside in the neighborhood of the candidates rather than in their exact linear span. In summary, we highlight the main contributions of this paper as follows:

  • We propose a unified model for learning an optimal consensus kernel and a similarity graph matrix, where the result of one task is used to improve the other one. In other words, we consider the possibility that these two learning processes may need to negotiate with each other to achieve the overall optimality.

  • By assuming a low-rank structure of the kernel matrix, our model is in a better position to deal with real data. Instead of enforcing the optimal kernel to be a linear combination of predefined kernels, our model allows the most suitable kernel to reside in the neighborhood of that combination.

  • Extensive experiments are conducted to compare the performance of our proposed method with existing state-of-the-art clustering methods. Experimental results demonstrate the superiority of our method.

The rest of the paper is organized as follows. Section 2 describes related works. Section 3 introduces the proposed graph and kernel learning method. Experimental results and analysis are presented in Section 4. Section 5 draws conclusions.

Notations. Given a data matrix $X\in\mathbb{R}^{m\times n}$ with $m$ features and $n$ samples, we denote its $(i,j)$-th element and $i$-th column as $x_{ij}$ and $x_i$, respectively. The $\ell_2$-norm of a vector $x$ is represented by $\|x\|_2=\sqrt{x^\top x}$, where $x^\top$ is the transpose of $x$. The $\ell_1$-norm of $X$ is denoted by $\|X\|_1=\sum_{ij}|x_{ij}|$. The squared Frobenius norm is defined as $\|X\|_F^2=\sum_{ij}x_{ij}^2$. The definition of $X$'s nuclear norm is $\|X\|_*=\sum_i\sigma_i$, where $\sigma_i$ is the $i$-th singular value of $X$. $I$ represents the identity matrix with proper size. $Tr(\cdot)$ denotes the trace operator. $Z\ge 0$ means all elements of $Z$ are nonnegative. The inner product is denoted by $\langle\cdot,\cdot\rangle$.

2 Related Work

To cope with noise and outliers, the robust kernel K-means (RKKM) du2015robust () algorithm has been proposed recently. In this method, the squared $\ell_2$-norm of the reconstruction error term is replaced by the $\ell_{2,1}$-norm. RKKM demonstrates compelling performance on a number of benchmark data sets. To alleviate the effort of exhaustively searching for the most suitable kernel in a pre-specified pool of kernels, the authors further proposed a robust multiple kernel K-means (RMKKM) algorithm. RMKKM conducts robust K-means by learning an appropriate consensus kernel from a linear combination of multiple candidate kernels. It shows that RMKKM has great potential to integrate complementary information from different sources along with heterogeneous features yu2012optimized (). This leads to better performance of RMKKM than RKKM.

As aforementioned, the graph-based clustering methods have achieved impressive performance. To resolve the graph construction challenge, simplex sparse representation (SSR) huang2015new () was proposed to learn the affinity between pairs of samples. It is based on the so-called self-expression property, i.e., each data point can be represented as a weighted combination of other points liu2010robust (). More similar data points will receive larger weights. Therefore, the induced weight matrix reveals the relationships between data points and encodes the data structure. Next, the learned affinity graph matrix is inputted to the spectral clustering algorithm. Empirical experiments demonstrate the superior performance of this approach.

Recently, Kang et al. kang2017twin () have proposed to learn the similarity matrix in kernel space based on self-expression. They built a joint framework for similarity matrix construction and cluster label learning. Both a single kernel method (SCSK) and a multiple kernel approach (SCMK) were developed. SCMK learns an optimal kernel in the same way as RMKKM: it directly replaces the kernel matrix in the single kernel model with a combined kernel, expressed as a linear combination of pre-specified kernels in the constraint. This is a straightforward and popular approach in the literature. However, it ignores the structure information of the kernel matrix. In essence, the kernel matrix is a measure of pairwise similarity between data points; hence, the kernel matrix is low-rank in general xia2014robust (). Moreover, these methods strictly require that the optimal kernel is a linear combination of base kernels. This might limit their application in practice, since real-world data are often corrupted and the ideal kernel might reside in the neighborhood of the combined kernel. Besides, it is time-consuming and impractical to design a large pool of kernels, so it is impossible to obtain a globally optimal kernel. What we can do is to make the best use of the candidate kernels.

In this paper, we propose to learn the similarity graph and the kernel matrix jointly by exploring the structure of the kernel matrix. With the low-rank requirement on the kernel matrix, we exploit its similarity nature. Different from existing methods, we relax the strict condition that the optimal kernel is a linear combination of predefined kernels, in order to account for noise in real data. This enlarges the region from which an ideal kernel can be chosen and is therefore in a better position than previous approaches to find a more suitable kernel. In particular, in a spirit similar to robust principal component analysis (RPCA) kang2015robust (), the combined kernel is factorized into a low-rank component (the optimal kernel matrix) and a residual.

3 Proposed Methodology

3.1 Formulation

In general, the self-expression based graph learning problem can be formulated as

$\min_{Z\ge 0}\ \|X-XZ\|_F^2+\alpha\,\rho(Z)$   (1)

where $\alpha>0$ is a regularization parameter, the self-expression coefficient matrix $Z\in\mathbb{R}^{n\times n}$ is often assumed to be nonnegative, and $\rho(Z)$ is the regularization term on $Z$. Two commonly used assumptions about $Z$ are low-rank and sparse, corresponding to $\rho(Z)=\|Z\|_*$ and $\rho(Z)=\|Z\|_1$, respectively. Suppose $\phi:\mathbb{R}^m\rightarrow\mathcal{H}$ maps the data points from the input space to a reproducing kernel Hilbert space $\mathcal{H}$. Then, based on the kernel trick, the $(i,j)$-th element of the kernel matrix $K$ is $K_{ij}=\phi(x_i)^\top\phi(x_j)$. In kernel space, Eq. (1) gives

$\min_{Z\ge 0}\ Tr(K-2KZ+Z^\top KZ)+\alpha\,\rho(Z)$   (2)

This model is capable of recovering the linear relationships among the data samples in the new space, and thus the nonlinear relationships in the original representation. One limitation of Eq. (2) is that its performance heavily depends on the input kernel matrix. To overcome this drawback, we can learn a suitable kernel from a set of predefined kernels $\{K^i\}_{i=1}^r$. Different from existing MKL methods, we aim to increase the consensus kernel's representation ability by taking the effect of noise into account. Finally, our proposed Low-rank Kernel learning for Graph matrix (LKG) is formulated as follows

$\min_{Z,\,K,\,w}\ Tr(K-2KZ+Z^\top KZ)+\alpha\,\rho(Z)+\beta\|K\|_*+\gamma\big\|K-\sum_{i=1}^r w_iK^i\big\|_F^2\quad s.t.\ Z\ge 0,\ K\ge 0,\ \sum_{i=1}^r w_i=1,\ w_i\ge 0$   (3)

where $w_i$ is the weight for kernel $K^i$, the consensus kernel matrix $K$ is nonnegative, and the constraints on $w$ come from the standard MKL method. If a kernel $K^i$ is not appropriate due to a bad choice of metric or parameter, or is severely corrupted by noise or outliers, the corresponding $w_i$ will be assigned a small value.

In Eq. (3), $\|K\|_*$ explores the structure of the kernel matrix, so that the learned $K$ will respect the correlations among samples, i.e., the cluster structure of the data. Moreover, enforcing the nuclear norm regularizer on $K$ makes $K$ robust to noise and errors. The last term in Eq. (3) means that we seek an optimal kernel $K$ in the neighborhood of the combined kernel $\sum_i w_iK^i$, which puts our model in a better position than previous approaches to identify a more suitable kernel. Due to noise and outliers, $\sum_i w_iK^i$ could be a noisy observation of the ideal kernel $K$. As a matter of fact, this is similar to RPCA candes2011robust (); kang2015robustpca (), where the original noisy data is decomposed into a low-rank part and an error part. Formulating graph learning and kernel learning in a unified model reinforces the underlying connections between the two tasks. By iteratively updating $Z$, $K$, and $w$, they can be repeatedly improved.
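To make the roles of the terms in Eq. (3) concrete, here is a minimal numpy sketch that evaluates the objective for given $Z$, $K$, and $w$. The function and variable names are ours, and the expression follows our reconstruction of Eq. (3) above rather than any released implementation.

```python
import numpy as np

def lkg_objective(Z, K, kernels, w, alpha, beta, gamma, reg="lowrank"):
    """Evaluate the LKG objective of Eq. (3) (illustrative sketch).

    Z        : (n, n) self-expression / graph matrix
    K        : (n, n) consensus kernel
    kernels  : list of r candidate kernel matrices K^i, each (n, n)
    w        : (r,) kernel weights on the simplex
    """
    # Self-expression term in kernel space: Tr(K - 2KZ + Z^T K Z)
    fit = np.trace(K - 2 * K @ Z + Z.T @ K @ Z)
    # Regularizer on Z: nuclear norm (low-rank) or l1 norm (sparse)
    if reg == "lowrank":
        rho = np.linalg.norm(Z, ord="nuc")
    else:
        rho = np.abs(Z).sum()
    # Low-rank prior on the consensus kernel
    k_nuc = np.linalg.norm(K, ord="nuc")
    # Consensus kernel stays in the neighborhood of the combined kernel
    K_w = sum(wi * Ki for wi, Ki in zip(w, kernels))
    neigh = np.linalg.norm(K - K_w, "fro") ** 2
    return fit + alpha * rho + beta * k_nuc + gamma * neigh
```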

3.2 Optimization

We propose to solve problem (3) based on the alternating direction method of multipliers (ADMM) boyd2011distributed (). First, we introduce two auxiliary variables $W$ and $D$ to make the variables separable and rewrite problem (3) in the following equivalent form

$\min_{Z,\,K,\,w,\,W,\,D}\ Tr(K-2KZ+Z^\top KZ)+\alpha\,\rho(W)+\beta\|D\|_*+\gamma\big\|K-\sum_{i=1}^r w_iK^i\big\|_F^2\quad s.t.\ Z\ge 0,\ K\ge 0,\ \sum_i w_i=1,\ w_i\ge 0,\ Z=W,\ K=D$   (4)

The corresponding augmented Lagrangian function is

$\mathcal{L}=Tr(K-2KZ+Z^\top KZ)+\alpha\,\rho(W)+\beta\|D\|_*+\gamma\big\|K-\sum_i w_iK^i\big\|_F^2+\frac{\mu}{2}\Big(\big\|Z-W+\frac{Y_1}{\mu}\big\|_F^2+\big\|K-D+\frac{Y_2}{\mu}\big\|_F^2\Big)$   (5)

where $\mu>0$ is a penalty parameter and $Y_1$, $Y_2$ are Lagrangian multipliers. These variables can be updated alternatingly, one at each step, while keeping the others fixed.

To solve $Z$, the objective function (5) becomes

$\min_{Z}\ Tr(K-2KZ+Z^\top KZ)+\frac{\mu}{2}\big\|Z-W+\frac{Y_1}{\mu}\big\|_F^2$   (6)

It can be solved by setting its first derivative to zero. Then we have

$Z=(2K+\mu I)^{-1}(2K+\mu W-Y_1)$   (7)

Similarly, we can obtain the updating rule for $K$ as

$K=\dfrac{2\gamma\sum_i w_iK^i+\mu D-Y_2-(I-2Z^\top+ZZ^\top)}{2\gamma+\mu}$   (8)

To solve $W$, the sub-problem is

$\min_{W}\ \alpha\,\rho(W)+\frac{\mu}{2}\big\|W-\big(Z+\frac{Y_1}{\mu}\big)\big\|_F^2$   (9)

Depending on the regularization strategy, we obtain different closed-form solutions for $W$. Let us define $B=Z+\frac{Y_1}{\mu}$ and write the singular value decomposition (SVD) of $B$ as $U\Sigma V^\top$ with singular values $\sigma_i$. Then, for the low-rank representation $\rho(W)=\|W\|_*$, singular value thresholding yields

$W=U\,\mathrm{diag}\big(\max(\sigma_i-\tfrac{\alpha}{\mu},\,0)\big)V^\top$   (10)

To obtain a sparse representation $\rho(W)=\|W\|_1$, we can update $W$ element-wise as

$W_{ij}=\mathrm{sign}(B_{ij})\max\big(|B_{ij}|-\tfrac{\alpha}{\mu},\,0\big)$   (11)
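As a sketch (our own helper names, assuming the splitting described above), the closed-form updates in Eqs. (10) and (11) are the standard singular value thresholding and element-wise soft-thresholding operators applied to $B=Z+Y_1/\mu$:

```python
import numpy as np

def svt(B, tau):
    """Singular value thresholding: argmin_W tau*||W||_* + 0.5*||W - B||_F^2 (cf. Eq. (10))."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(B, tau):
    """Element-wise shrinkage: argmin_W tau*||W||_1 + 0.5*||W - B||_F^2 (cf. Eq. (11))."""
    return np.sign(B) * np.maximum(np.abs(B) - tau, 0.0)
```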

To solve $D$, we have

$\min_{D}\ \beta\|D\|_*+\frac{\mu}{2}\big\|D-\big(K+\frac{Y_2}{\mu}\big)\big\|_F^2$   (12)

By letting $J=K+\frac{Y_2}{\mu}$ and writing its SVD as $J=\hat{U}\hat{\Sigma}\hat{V}^\top$ with singular values $\hat{\sigma}_i$, we have

$D=\hat{U}\,\mathrm{diag}\big(\max(\hat{\sigma}_i-\tfrac{\beta}{\mu},\,0)\big)\hat{V}^\top$   (13)

To solve $w$, the optimization problem (3) becomes

$\min_{w}\ w^\top Mw-2z^\top w\quad s.t.\ \sum_{i=1}^r w_i=1,\ w_i\ge 0$   (14)

where $M_{ij}=Tr(K^iK^j)$ and $z_i=Tr(KK^i)$. It is a quadratic programming problem with linear constraints, which can be easily solved with existing packages. In sum, our algorithm for solving problem (3) is outlined in Algorithm 1.
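Before turning to Algorithm 1, here is a small sketch of this sub-problem using scipy's SLSQP solver; the names M and z follow our reconstruction of Eq. (14) and are not taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def update_weights(K, kernels):
    """Solve min_w ||K - sum_i w_i K^i||_F^2 s.t. w >= 0, sum(w) = 1 (cf. Eq. (14))."""
    r = len(kernels)
    M = np.array([[np.trace(Ki @ Kj) for Kj in kernels] for Ki in kernels])
    z = np.array([np.trace(K @ Ki) for Ki in kernels])
    obj = lambda w: w @ M @ w - 2 * z @ w          # constant Tr(K K) dropped
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    bounds = [(0.0, None)] * r
    w0 = np.full(r, 1.0 / r)
    res = minimize(obj, w0, bounds=bounds, constraints=cons, method='SLSQP')
    return res.x
```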

After obtaining the graph $Z$, we can use it for clustering, semi-supervised classification, and so on. In this work, we focus on the clustering task. Specifically, we run the spectral clustering ng2002spectral () algorithm on $Z$ to obtain the final results.
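One way to carry out this final step is scikit-learn's spectral clustering with a precomputed affinity; the symmetrization below is our own choice rather than a step prescribed by the paper:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_graph(Z, n_clusters):
    """Run spectral clustering on the learned similarity graph Z."""
    affinity = (np.abs(Z) + np.abs(Z.T)) / 2   # make the graph symmetric and nonnegative
    sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return sc.fit_predict(affinity)
```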

Input: Kernel matrices $\{K^i\}_{i=1}^r$, parameters $\alpha$, $\beta$, $\gamma$, $\mu$.
Initialize: Random matrix $Z$, $W=D=Z$, $Y_1=Y_2=0$, $w_i=1/r$.
REPEAT
1:  Calculate $Z$ by (7).
2:  $Z$=max($Z$, 0).
3:  Update $K$ according to (8).
4:  $K$=max($K$, 0).
5:  Calculate $W$ using (11) or (10).
6:  $W$=max($W$, 0).
7:  Calculate $D$ using (13).
8:  $D$=max($D$, 0).
9:  Solve $w$ using (14).
10:  Update Lagrange multipliers $Y_1$ and $Y_2$ as $Y_1=Y_1+\mu(Z-W)$, $Y_2=Y_2+\mu(K-D)$.
UNTIL stopping criterion is met.
ALGORITHM 1 The algorithm to solve (3)
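Putting the pieces together, the following numpy sketch mirrors Algorithm 1 under our reconstruction of Eqs. (4)-(14), reusing the svt, soft_threshold, and update_weights helpers sketched above. It is illustrative (fixed penalty $\mu$, zero initialization, a simple stopping rule), not the authors' reference implementation.

```python
import numpy as np

def lkg(kernels, alpha, beta, gamma, reg="lowrank", mu=1.0, max_iter=100, tol=1e-4):
    """ADMM sketch of Algorithm 1 (illustrative reconstruction)."""
    r, n = len(kernels), kernels[0].shape[0]
    w = np.full(r, 1.0 / r)
    K = sum(wi * Ki for wi, Ki in zip(w, kernels))
    Z = W = D = np.zeros((n, n))      # zeros for simplicity; the paper initializes Z randomly
    Y1 = Y2 = np.zeros((n, n))
    I = np.eye(n)
    for _ in range(max_iter):
        Z_old = Z
        # Steps 1-2: update Z (Eq. (7)) and project onto the nonnegative orthant
        Z = np.linalg.solve(2 * K + mu * I, 2 * K + mu * W - Y1)
        Z = np.maximum(Z, 0)
        # Steps 3-4: update the consensus kernel K (Eq. (8))
        K_w = sum(wi * Ki for wi, Ki in zip(w, kernels))
        grad_const = I - 2 * Z.T + Z @ Z.T
        K = (2 * gamma * K_w + mu * D - Y2 - grad_const) / (2 * gamma + mu)
        K = np.maximum(K, 0)
        # Steps 5-6: update the auxiliary variable W (Eq. (10) or (11))
        B = Z + Y1 / mu
        W = svt(B, alpha / mu) if reg == "lowrank" else soft_threshold(B, alpha / mu)
        W = np.maximum(W, 0)
        # Steps 7-8: update the auxiliary variable D via SVT (Eq. (13))
        D = np.maximum(svt(K + Y2 / mu, beta / mu), 0)
        # Step 9: update the kernel weights w (Eq. (14))
        w = update_weights(K, kernels)
        # Step 10: update the Lagrange multipliers
        Y1 = Y1 + mu * (Z - W)
        Y2 = Y2 + mu * (K - D)
        if np.linalg.norm(Z - Z_old, "fro") < tol:
            break
    return Z, K, w
```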

3.3 Complexity Analysis

The time complexity of constructing each kernel is $O(mn^2)$. The computational cost of updating $Z$ and $K$ is $O(n^3)$ due to the matrix inversion and multiplications. For $D$, an SVD is required in every iteration and its complexity is $O(n^3)$, which can be reduced to $O(cn^2)$ if we employ a partial SVD ($c$ is the lowest rank we can find) based on the package PROPACK larsen2004propack (). For $W$, the complexity depends on the choice of regularizer: for the low-rank representation it is the same as for $D$, while the complexity of obtaining a sparse solution is $O(n^2)$. Solving $w$ is a quadratic programming problem, which can be solved in polynomial time; fortunately, the size of $w$ is a small number $r$. The updating of $Y_1$ and $Y_2$ costs $O(n^2)$.
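For instance, when only the leading $c$ singular triplets are needed, a truncated SVD (here scipy's svds, as one possible stand-in for PROPACK) can replace the full decomposition in the thresholding step:

```python
import numpy as np
from scipy.sparse.linalg import svds

def svt_partial(B, tau, c):
    """Singular value thresholding using only the top-c singular triplets (c < n)."""
    U, s, Vt = svds(B, k=c)               # truncated SVD, c << n
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```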

Figure 1: Sample images of the BA, YALE, and JAFFE data sets ((a) BA, (b) YALE, (c) JAFFE).

4 Experiments

4.1 Data Sets

We examine the effectiveness of our method on nine real-world benchmark data sets, which are commonly used in the literature. The basic information of the data sets is shown in Table 1. Specifically, the first five data sets are images, and the other four are text corpora (available at http://www-users.cs.umn.edu/~han/data/tmdata.tar.gz and http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html).

The five image data sets include four famous face databases (ORL (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html), YALE (http://vision.ucsd.edu/content/yale-face-database), AR (http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html), and JAFFE (http://www.kasrl.org/jaffe.html)), and a binary alpha digits data set BA (http://www.cs.nyu.edu/~roweis/data.html). As shown in Figure 1(a), BA consists of digits "0" through "9" and capital letters "A" through "Z". In YALE, ORL, AR, and JAFFE, each image has different facial expressions or configurations due to capture times, illumination conditions, and glasses/no glasses. Hence, these data sets are contaminated at different levels. Figures 1(b) and 1(c) show some example images from the YALE and JAFFE databases.

Following the setting in kang2017twin (), we manually construct 12 kernels. They consist of seven Gaussian kernels of the form $K(x,y)=\exp(-\|x-y\|_2^2/(t\,d_{max}^2))$ with $t\in\{0.01, 0.05, 0.1, 1, 10, 50, 100\}$, where $d_{max}$ denotes the maximal distance between data points; a linear kernel $K(x,y)=x^\top y$; and four polynomial kernels of the form $K(x,y)=(a+x^\top y)^b$ with $a\in\{0,1\}$ and $b\in\{2,4\}$. In addition, all kernel matrices are normalized to the $[0,1]$ range by dividing each element by the largest element in its corresponding kernel matrix.
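A hedged sketch of this kernel construction (the parameter values follow the description above, which is our reading of the setting in kang2017twin ()):

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_kernels(X):
    """Construct the 12 candidate kernels for data X of shape (n_samples, n_features)."""
    D = cdist(X, X, metric='euclidean')          # pairwise Euclidean distances
    d_max = D.max()
    kernels = []
    # Seven Gaussian kernels exp(-||x-y||^2 / (t * d_max^2))
    for t in [0.01, 0.05, 0.1, 1, 10, 50, 100]:
        kernels.append(np.exp(-D ** 2 / (t * d_max ** 2)))
    # One linear kernel
    kernels.append(X @ X.T)
    # Four polynomial kernels (a + x^T y)^b
    for a in [0, 1]:
        for b in [2, 4]:
            kernels.append((a + X @ X.T) ** b)
    # Normalize each kernel to [0, 1] by its largest entry
    return [K / K.max() for K in kernels]
```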

# instances # features # classes (by data set)
YALE 165 1024 15
JAFFE 213 676 10
ORL 400 1024 40
AR 840 768 120
BA 1404 320 36
TR11 414 6429 9
TR41 878 7454 10
TR45 690 8261 10
TDT2 9394 36771 30
Table 1: Description of the data sets

4.2 Evaluation Metrics

To quantitatively assess our algorithm’s performance on the clustering task, we use the popular measures, i.e., accuracy (Acc) and normalized mutual information (NMI).

Acc discovers the one-to-one relationship between clusters and classes. Let $l_i$ and $g_i$ be the clustering result and the ground-truth label of sample $x_i$, respectively. Then Acc is defined by

$Acc=\dfrac{\sum_{i=1}^{n}\delta\big(g_i,\,map(l_i)\big)}{n},$

where $n$ is the total number of samples, the delta function $\delta(a,b)$ equals one if and only if $a=b$ and zero otherwise, and $map(\cdot)$ is the best permutation mapping function that maps each cluster index to a true class label based on the Kuhn-Munkres algorithm.

The NMI measures the quality of clustering. Given two sets of clusters $C$ and $C'$,

$NMI(C,C')=\dfrac{\sum_{c\in C,\,c'\in C'}p(c,c')\log\frac{p(c,c')}{p(c)p(c')}}{\max\big(H(C),H(C')\big)},$

where $p(c)$ and $p(c')$ represent the marginal probability distribution functions of $C$ and $C'$, respectively, induced from the joint distribution $p(c,c')$ of $C$ and $C'$, and $H(\cdot)$ is the entropy function. A greater NMI means better clustering performance.
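Both measures can be computed with standard tooling; the sketch below assumes integer cluster and class labels starting at zero and uses scipy's Kuhn-Munkres (Hungarian) routine for the mapping:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Acc: best one-to-one mapping between predicted clusters and true classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                           # co-occurrence of cluster p and class t
    row, col = linear_sum_assignment(-count)       # maximize matched pairs
    return count[row, col].sum() / len(y_true)

def clustering_nmi(y_true, y_pred):
    """NMI between the clustering result and the ground-truth labels."""
    return normalized_mutual_info_score(y_true, y_pred)
```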

4.3 Comparison Methods

To fully examine the effectiveness of our proposed algorithm, we compare with both graph-based clustering methods and kernel methods. More concretely, we compare with Kernel K-means (KKM) scholkopf1998nonlinear (), Spectral Clustering (SC) ng2002spectral (), Robust Kernel K-means (RKKM) du2015robust (), Simplex Sparse Representation (SSR) huang2015new (), and SCSK kang2017twin (). Among them, SC, SSR, and SCSK are graph-based clustering methods. Since SSR operates in the feature space, we only need to run it once. For the other techniques, we run them on each kernel and report their best as well as their average performance over those kernels.

We also compare with a number of multiple kernel learning methods. We directly implement the downloaded programs of the comparison methods on those 12 kernels:

Multiple Kernel K-means (MKKM) (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/mkfc/code). The MKKM huang2012multiple () is an extension of K-means to the situation when multiple kernels exist.

Affinity Aggregation for Spectral Clustering (AASC) (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/aasc/code). The AASC huang2012affinity () extends spectral clustering to deal with multiple affinities.

Robust Multiple Kernel K-means (RMKKM) (https://github.com/csliangdu/RMKKM). The RMKKM du2015robust () extends K-means to deal with noise and outliers in a multiple kernel setting.

Twin learning for Similarity and Clustering with Multiple Kernel (SCMK) kang2017twin (). A recently proposed graph-based clustering method with multiple kernel learning capability. Both RMKKM and SCMK strictly require that the consensus kernel is a linear combination of the base kernels.

Low-rank Kernel learning for Graph matrix (LKG). Our proposed low-rank kernel learning method for graph-based clustering. After obtaining the similarity graph matrix $Z$, we run the spectral clustering algorithm to finish the clustering task. We examine both the low-rank and the sparse regularizer and denote the corresponding methods as LKGr and LKGs, respectively.

4.4 Results

Data KKM SC RKKM SSR SCSK MKKM AASC RMKKM SCMK LKGs LKGr
YALE 47.12(38.97) 49.42(40.52) 48.09(39.71) 54.55 55.85(45.35) 45.70 40.64 52.18 56.97 62.42 66.06
JAFFE 74.39(67.09) 74.88(54.03) 75.61(67.98) 87.32 99.83(86.64) 74.55 30.35 87.07 100.00 98.12 98.60
ORL 53.53(45.93) 57.96(46.65) 54.96(46.88) 69.00 62.35(50.50) 47.51 27.20 55.60 65.25 71.5 73.50
AR 33.02(30.89) 28.83(22.22) 33.43(31.20) 65.00 56.79(41.35) 28.61 33.23 34.37 62.38 65.83 60.47
BA 41.20(33.66) 31.07(26.25) 42.17(34.35) 23.97 47.72(39.50) 40.52 27.07 43.42 47.34 47.93 50.50
TR11 51.91(44.65) 50.98(43.32) 53.03(45.04) 41.06 71.26(54.79) 50.13 47.15 57.71 73.43 67.63 65.70
TR41 55.64(46.34) 63.52(44.80) 56.76(46.80) 63.78 67.43(53.13) 56.10 45.90 62.65 67.31 62.64 63.44
TR45 58.79(45.58) 57.39(45.96) 58.13(45.69) 71.45 74.02(53.38) 58.46 52.64 64.00 74.35 75.94 77.39
TDT2 47.05(35.58) 52.63(45.26) 48.35(36.67) 20.86 55.74(44.82) 34.36 19.82 37.57 56.42 58.77 60.48
(a) Accuracy(%)
Data KKM SC RKKM SSR SCSK MKKM AASC RMKKM SCMK LKGs LKGr
YALE 51.34(42.07) 52.92(44.79) 52.29(42.87) 57.26 56.50(45.07) 50.06 46.83 55.58 56.52 61.72 64.57
JAFFE 80.13(71.48) 82.08(59.35) 83.47(74.01) 92.93 99.35(84.67) 79.79 27.22 89.37 100.00 97.00 98.73
ORL 73.43(63.36) 75.16(66.74) 74.23(63.91) 84.23 78.96(63.55) 68.86 43.77 74.83 80.04 83.93 85.10
AR 65.21(60.64) 58.37(56.05) 65.44(60.81) 84.16 76.02(59.70) 59.17 65.06 65.49 81.51 84.69 81.05
BA 57.25(46.49) 50.76(40.09) 57.82(46.91) 30.29 63.04(52.17) 56.88 42.34 58.47 62.94 60.12 63.20
TR11 48.88(33.22) 43.11(31.39) 49.69(33.48) 27.60 58.60(37.58) 44.56 39.39 56.08 60.15 62.30 63.50
TR41 59.88(40.37) 61.33(36.60) 60.77(40.86) 59.56 65.50(43.18) 57.75 43.05 63.47 65.11 66.23 61.78
TR45 57.87(38.69) 48.03(33.22) 57.86(38.96) 67.82 74.24(44.36) 56.17 41.94 62.73 74.97 70.97 75.22
TDT2 55.28(38.47) 52.23(27.16) 54.46(42.19) 02.44 58.35(46.37) 41.36 02.14 47.13 59.84 60.75 62.85
(b) NMI(%)
Table 2: Performance of various clustering methods on benchmark data sets. For single kernel methods (the 1st, 2nd, 3rd, and 5th columns), the average performance over the 12 kernels is given in parentheses. The best results are highlighted in bold.
Figure 2: The clustering accuracy of LKGr on the YALE data w.r.t. $\beta$ and $\gamma$.
Figure 3: The clustering accuracy of LKGr on the JAFFE data w.r.t. $\beta$ and $\gamma$.
Figure 4: The clustering accuracy of LKGr on the ORL data w.r.t. $\beta$ and $\gamma$.

For the compared methods, we either use their existing parameter settings or tune them to obtain the best performance. In particular, we can directly obtain the optimal results for the KKM, SC, RKKM, MKKM, AASC, and RMKKM methods by running the package provided by du2015robust (). SSR is a parameter-free model, so we only need to tune the parameters for SCSK and SCMK. The experimental results are presented in Table 2. In most cases, our proposed method LKG achieves the best performance among all state-of-the-art algorithms. In particular, we have the following observations.

  1. For non-multiple kernel based techniques, we see big differences between the best and average results. This validates the fact that the selection of kernel has a big impact on the final results. Therefore, it is imperative to develop multiple kernel learning method.

  2. As expected, multiple kernel methods work better than single kernel approaches. This is consistent with our belief that multiple kernel methods often exploit complementary information.

  3. Graph-based clustering methods often perform much better than K-means and its extensions. As can be seen, SSR, SCSK, SCMK, and LKG improve clustering performance considerably.

  4. By comparing the performance of SCMK and LKG, we can clearly see the advantage of our low-rank kernel learning approach. This demonstrates that it is beneficial to adopt our proposed kernel learning method.

Method Metric KKM SC RKKM SSR SCSK MKKM AASC RMKKM SCMK
LKGs Acc .0039 .0078 .0039 .0117 .3008 .0039 .0039 .0078 .5703
NMI .0039 .0039 .0039 .0078 .2031 .0039 .0039 .0039 .5703
LKGr Acc .0039 .0078 .0039 .0273 .3008 .0039 .0039 .0039 .4268
NMI .0039 .0039 .0039 .0195 .0547 .0039 .0039 .0078 .3008
Table 3: Wilcoxon Signed Rank Test on all Data sets.

To see the significance of the improvements, we further apply the Wilcoxon signed rank test to Table 2 and show the $p$-values in Table 3. We note that the $p$-values are below 0.05 when comparing LKGs and LKGr to all other methods except SCSK and SCMK, which were proposed in 2017. Therefore, LKGs and LKGr outperform KKM, SC, RKKM, SSR, MKKM, AASC, and RMKKM with statistical significance.
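As an illustration, the LKGr-versus-RMKKM comparison can be reproduced with scipy's paired signed-rank test on the per-data-set accuracies taken from Table 2; with nine pairs that all favor LKGr, the exact two-sided $p$-value is $2/2^9\approx 0.0039$, matching the corresponding entry in Table 3.

```python
from scipy.stats import wilcoxon

# Paired per-data-set accuracies (%) copied from the Acc columns of Table 2
acc_lkgr  = [66.06, 98.60, 73.50, 60.47, 50.50, 65.70, 63.44, 77.39, 60.48]
acc_rmkkm = [52.18, 87.07, 55.60, 34.37, 43.42, 57.71, 62.65, 64.00, 37.57]

stat, p_value = wilcoxon(acc_lkgr, acc_rmkkm)
print(f"Wilcoxon signed-rank test: statistic={stat:.3f}, p={p_value:.4f}")
```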

4.5 Parameter Sensitivity

There are three hyper-parameters in our model: $\alpha$, $\beta$, and $\gamma$. To better see the effects of $\beta$ and $\gamma$, we fix $\alpha$ at two representative values and search $\beta$ and $\gamma$ over a wide range. We analyze the sensitivity of our model LKGr to them by using the YALE, JAFFE, and ORL data sets as examples, in terms of accuracy. Figures 2 to 4 show that our model gives reasonable results over a wide range of parameter values.
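The sensitivity study amounts to a grid search; below is a sketch with synthetic stand-in data and a hypothetical grid (not the paper's exact values), reusing the helpers sketched in Section 3:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 20))                 # synthetic stand-in for a real data set
y_true = np.repeat(np.arange(3), 20)
kernels = build_kernels(X)                        # helpers from the earlier sketches

results = {}
for beta in [0.01, 0.1, 1, 10, 100]:              # hypothetical grid; alpha kept fixed
    for gamma in [0.01, 0.1, 1, 10, 100]:
        Z, K, w = lkg(kernels, alpha=1.0, beta=beta, gamma=gamma)
        labels = cluster_from_graph(Z, n_clusters=3)
        results[(beta, gamma)] = clustering_accuracy(y_true, labels)

best = max(results, key=results.get)
print("best (beta, gamma):", best, "accuracy:", results[best])
```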

4.6 Examination on Multi-view Data

Nowadays, multi-view data are prevalent. Hence, we test our model on multi-view data in this subsection. We employ two widely used multi-view data sets for performance evaluation, namely, Cora sen2008collective () and NUS-WIDE chua2009nus (). Note that most of the data sets used in this paper have imbalanced clusters. For example, the seven clusters of Cora contain 818, 180, 217, 426, 351, 418, and 298 samples, respectively. The imbalance issue is seldom discussed in clustering zhu2017entropy (). Hence we expect our method to work well in general circumstances. To conduct a comprehensive evaluation, more measures, including F-score, Precision, Recall, Adjusted Rand Index (ARI), Entropy, and Purity, are used here. Each metric characterizes different properties of the clustering. Except for Entropy, a larger value of each metric means better performance.

We run the algorithms on each view of the two data sets and report the clustering results in Tables 4 and 5. For our algorithms LKGs and LKGr, we repeat the experiments 20 times and show the mean values and standard deviations. As can be seen, our approach performs better than all the other baselines on most measures. It is unsurprising that different views give different performances. Our proposed method works well in general.

Methods F-score Precision Recall NMI ARI Entropy Acc Purity
KKM .306(.302) .186(.183) .993(.891) .108(.070) .015(.008) .863(.419) .341(.322) .357(.335)
SC .304(.289) .192(.181) .995(.772) .128(.026) .028(.004) 2.269(.707) .344(.295) .370(.312)
MKKM .282 .194 .525 .172 .029 1.688 .349 .402
AASC .293 .178 .836 .044 -.004 .614 .290 .312
RMKKM .311 .190 .859 .141 .025 .634 .361 .376
LKGs .303(0) .179(0) .989(.001) .005(.001) 0(0) .062(.003) .302(0) .303(0)
LKGr .335(.012) .326(.015) .346(.027) .298(.008) .184(.014) 2.536(.103) .405(.016) .499(.013)
(a) 1st view
Methods F-score Precision Recall NMI ARI Entropy Acc Purity
KKM .304(.268) .264(.215) .996(.515) .169(.080) .103(.045) 2.686(1.614) .359(.313) .416(.350)
SC .301(.271) .183(.180) .930(.641) .045(.017) .006(.001) 2.025(1.041) .300(.269) .323(.308)
MKKM .246 .259 .235 .147 .091 2.707 .330 .392
AASC .301 .180 .922 .006 .002 .299 .300 .305
RMKKM .264 .271 .259 .171 .108 2.636 .358 .415
LKGs .304(0) .180(0) .997(0) .005(0) .001(0) .028(0) .304(0) .304(0)
LKGr .340(.006) .351(.010) .330(.009) .280(.004) .201(.009) 2.675(.033) .452(.009) .517(.006)
(b) 2nd view
Table 4: Performance of various clustering methods on the Cora data set.
Methods F-score Precision Recall NMI ARI Entropy Acc Purity
KKM .416(.399) .393(.338) .939(.524) .242(.195) .202(.133) 1.898(1.558) .501(.438) .533(.483)
SC .459(.407) .391(.287) .992(.796) .202(.073) .212(.059) 1.821(.665) .529(.368) .530(.375)
MKKM .401 .351 .475 .231 .155 1.719 .431 .508
AASC .381 .263 .692 .101 .020 .825 .351 .354
RMKKM .408 .378 .448 .256 .185 1.845 .450 .552
LKGs .460(.036) .408(.060) .523(.024) .303(.580) .234(.075) 1.743(.192) .518(.039) .556(.050)
LKGr .449(.003) .421(.009) .480(.007) .224(.009) .244(.009) 1.881(.032) .543(.014) .544(.010)
(a) 1st view
Methods F-score Precision Recall NMI ARI Entropy Acc Purity
KKM .428(.397) .425(.335) .986(.532) .269(.203) .231(.126) 1.973(1.533) .497(.439) .547(.477)
SC .428(.398) .359(.270) .985(.817) .178(.057) .179(.030) 1.658(.612) .490(.354) .490(.358)
MKKM .404 .382 .429 .248 .184 1.885 .479 .521
AASC .368 .265 .605 .087 .023 1.184 .340 .363
RMKKM .439 .416 .465 .287 .233 1.892 .482 .571
LKGs .478(.013) .449(.020) .512(.016) .315(.028) .283(.021) 1.850(.067) .548(.014) .568(.011)
LKGr .439(.025) .434(.035) .445(.014) .286(.021) .244(.040) 1.958(.047) .548(.030) .558(.025)
(b) 2nd view
Table 5: Performance of various clustering methods on the NUS-WIDE data set.

5 Conclusion

In this paper, we propose a multiple kernel learning based graph clustering method. Different from the existing multiple kernel learning methods, our method explicitly assumes that the consensus kernel matrix should be low-rank and lies in the neighborhood of the combined kernel. As a result, the learned graph is more informative and discriminative, especially when the data is subject to noise and outliers. Experimental results on both image clustering and document clustering demonstrate that our method indeed improves clustering performance compared to existing clustering techniques.

Acknowledgments

This paper was in part supported by Grants from the Natural Science Foundation of China (Nos. 61806045, 61572111, and 61772115), a 985 Project of UESTC (No. A1098531023601041), three Fundamental Research Funds for the Central Universities of China (Nos. A03017023701012, ZYGX2017KYQD177, and ZYGX2016J086), and the China Postdoctoral Science Foundation (No. 2016M602677).

References

  • (1) A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM computing surveys (CSUR) 31 (3) (1999) 264–323.
  • (2) H. Zhu, R. Vial, S. Lu, X. Peng, H. Fu, Y. Tian, X. Cao, Yotube: Searching action proposal via recurrent and static regression networks, IEEE Transactions on Image Processing 27 (6) (2018) 2609–2622.
  • (3) S. Huang, Y. Ren, Z. Xu, Robust multi-view data clustering with multi-view capped-norm k-means, Neurocomputing.
  • (4) P. F. Felzenszwalb, D. P. Huttenlocher, Efficient graph-based image segmentation, International journal of computer vision 59 (2) (2004) 167–181.
  • (5) D. Jiang, C. Tang, A. Zhang, Cluster analysis for gene expression data: a survey, IEEE Transactions on knowledge and data engineering 16 (11) (2004) 1370–1386.
  • (6) E. Elhamifar, R. Vidal, Sparse subspace clustering, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 2790–2797.
  • (7) S. Yang, Z. Yi, X. He, X. Li, A class of manifold regularized multiplicative update algorithms for image clustering, IEEE Transactions on Image Processing 24 (12) (2015) 5302–5314.
  • (8) H. Liu, T. Liu, J. Wu, D. Tao, Y. Fu, Spectral ensemble clustering, in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, 2015, pp. 715–724.
  • (9) W. Yan, B. Zhang, S. Ma, Z. Yang, A novel regularized concept factorization for document clustering, Knowledge-Based Systems 135 (2017) 147–158.
  • (10) S. Huang, Z. Xu, J. Lv, Adaptive local structure learning for document co-clustering, Knowledge-Based Systems 148 (2018) 74–84.
  • (11) X. He, M.-Y. Kan, P. Xie, X. Chen, Comment-based multi-view clustering of web 2.0 items, in: Proceedings of the 23rd international conference on World wide web, ACM, 2014, pp. 771–782.
  • (12) H. Gao, F. Nie, X. Li, H. Huang, Multi-view subspace clustering, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4238–4246.
  • (13) X. Chen, Y. Ye, X. Xu, J. Z. Huang, A feature group weighting method for subspace clustering of high-dimensional data, Pattern Recognition 45 (1) (2012) 434–446.
  • (14) J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, Oakland, CA, USA., 1967, pp. 281–297.
  • (15) X. Chen, X. Xu, Y. Ye, J. Z. Huang, TW-k-means: Automated Two-level Variable Weighting Clustering Algorithm for Multi-view Data, IEEE Transactions on Knowledge and Data Engineering 25 (4) (2013) 932–944.
  • (16) U. Von Luxburg, A tutorial on spectral clustering, Statistics and computing 17 (4) (2007) 395–416.
  • (17) Z. Kang, C. Peng, Q. Cheng, Z. Xu, Unified spectral clustering with optimal graph., in: AAAI, 2018, pp. 3366–3373.
  • (18) G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 663–670.
  • (19) X. Peng, J. Feng, S. Xiao, W.-Y. Yau, J. T. Zhou, S. Yang, Structured autoencoders for subspace clustering, IEEE transactions on image processing 27 (10) (2018) 5076–5086.
  • (20) S. C. Johnson, Hierarchical clustering schemes, Psychometrika 32 (3) (1967) 241–254.
  • (21) C. H. Ding, T. Li, M. I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE transactions on pattern analysis and machine intelligence 32 (1) (2010) 45–55.
  • (22) S. Huang, Z. Kang, Z. Xu, Self-weighted multi-view clustering with soft capped norm, Knowledge-Based Systems 158 (2018) 1–8.
  • (23) S. Huang, H. Wang, T. Li, T. Li, Z. Xu, Robust graph regularized nonnegative matrix factorization for clustering, Data Mining and Knowledge Discovery 32 (2) (2018) 483–503.
  • (24) J. Huang, F. Nie, H. Huang, A new simplex sparse learning model to measure data similarity for clustering, in: Proceedings of the 24th International Conference on Artificial Intelligence, AAAI Press, 2015, pp. 3569–3575.
  • (25) J. Xuan, J. Lu, G. Zhang, X. Luo, Topic model for graph mining., IEEE Trans. Cybernetics 45 (12) (2015) 2792–2803.
  • (26) Z. Kang, C. Peng, Q. Cheng, Twin learning for similarity and clustering: A unified kernel approach., in: AAAI, 2017, pp. 2080–2086.
  • (27) A. Y. Ng, M. I. Jordan, Y. Weiss, et al., On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems 2 (2002) 849–856.
  • (28) B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural computation 10 (5) (1998) 1299–1319.
  • (29) X. Liu, S. Zhou, Y. Wang, M. Li, Y. Dou, E. Zhu, J. Yin, H. Li, Optimal neighborhood kernel clustering with multiple kernels., in: AAAI, 2017, pp. 2266–2272.
  • (30) P. Zhou, L. Du, L. Shi, H. Wang, Y.-D. Shen, Recovery of corrupted multiple kernels for clustering., in: IJCAI, 2015, pp. 4105–4111.
  • (31) M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, Journal of Machine Learning Research 12 (Jul) (2011) 2211–2268.
  • (32) Z. Kang, X. Lu, J. Yi, Z. Xu, Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification, in: IJCAI, 2018, pp. 2312–2318.
  • (33) Y. Yang, Z. Ma, Y. Yang, F. Nie, H. T. Shen, Multitask spectral clustering by exploring intertask correlation, IEEE transactions on cybernetics 45 (5) (2015) 1083–1094.
  • (34) Y. Yang, F. Shen, Z. Huang, H. T. Shen, X. Li, Discrete nonnegative spectral clustering, IEEE Transactions on Knowledge and Data Engineering 29 (9) (2017) 1834–1845.
  • (35) L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering., in: NIPS, Vol. 17, 2004, p. 16.
  • (36) X. Zhu, C. Change Loy, S. Gong, Constructing robust affinity graphs for spectral clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1450–1457.
  • (37) F. Nie, X. Wang, H. Huang, Clustering and projected clustering with adaptive neighbors, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014, pp. 977–986.
  • (38) Z. Kang, C. Peng, Q. Cheng, Kernel-driven similarity learning, Neurocomputing 267 (2017) 210–219.
  • (39) V. M. Patel, R. Vidal, Kernel sparse subspace clustering, in: 2014 IEEE International Conference on Image Processing (ICIP), IEEE, 2014, pp. 2849–2853.
  • (40) L. Zhuang, H. Gao, Z. Lin, Y. Ma, X. Zhang, N. Yu, Non-negative low rank and sparse graph for semi-supervised learning, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2328–2335.
  • (41) C.-G. Li, Z. Lin, H. Zhang, J. Guo, Learning semi-supervised representation towards a unified optimization framework for semi-supervised learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2767–2775.
  • (42) L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, Y.-D. Shen, Robust multiple kernel k-means using ℓ2,1-norm, in: Proceedings of the 24th International Conference on Artificial Intelligence, AAAI Press, 2015, pp. 3476–3482.
  • (43) S. Yu, L. Tranchevent, X. Liu, W. Glanzel, J. A. Suykens, B. De Moor, Y. Moreau, Optimized data fusion for kernel k-means clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (5) (2012) 1031–1039.
  • (44) R. Xia, Y. Pan, L. Du, J. Yin, Robust multi-view spectral clustering via low-rank and sparse decomposition., in: AAAI, 2014, pp. 2149–2155.
  • (45) Z. Kang, C. Peng, Q. Cheng, Robust subspace clustering via smoothed rank approximation, IEEE Signal Processing Letters 22 (11) (2015) 2088–2092.
  • (46) E. J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, Journal of the ACM (JACM) 58 (3) (2011) 11.
  • (47) Z. Kang, C. Peng, Q. Cheng, Robust pca via nonconvex rank approximation, in: Proceedings of the 2015 IEEE International Conference on Data Mining (ICDM), IEEE Computer Society, 2015, pp. 211–220.
  • (48) S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends® in Machine Learning 3 (1) (2011) 1–122.
  • (49) R. M. Larsen, PROPACK - software for large and sparse SVD calculations, available online at http://sun.stanford.edu/~rmunk/PROPACK (2004).
  • (50) H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Multiple kernel fuzzy clustering, IEEE Transactions on Fuzzy Systems 20 (1) (2012) 120–134.
  • (51) H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 773–780.
  • (52) P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI magazine 29 (3) (2008) 93.
  • (53) T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, Nus-wide: a real-world web image database from national university of singapore, in: Proceedings of the ACM international conference on image and video retrieval, ACM, 2009, p. 48.
  • (54) C. Zhu, Z. Wang, Entropy-based matrix learning machine for imbalanced data sets, Pattern Recognition Letters 88 (2017) 72–80.