Self-weighted Multiple Kernel Learning for Graph-based Clustering and Semi-supervised Classification
Multiple kernel learning (MKL) methods are generally believed to perform better than single kernel methods. However, some empirical studies show that this is not always true: a combination of multiple kernels may even yield worse performance than a single kernel. There are two possible reasons for the failure: (i) most existing MKL methods assume that the optimal kernel is a linear combination of base kernels, which may not hold true; and (ii) some kernel weights are inappropriately assigned due to noise and carelessly designed algorithms. In this paper, we propose a novel MKL framework by following two intuitive assumptions: (i) each kernel is a perturbation of the consensus kernel; and (ii) the kernel that is close to the consensus kernel should be assigned a large weight. Notably, the proposed method can automatically assign an appropriate weight to each kernel without introducing the additional parameters that existing methods require. The proposed framework is integrated into a unified framework for graph-based clustering and semi-supervised classification. We have conducted experiments on multiple benchmark datasets, and our empirical results verify the superiority of the proposed framework.
1 Introduction
As a principled way of introducing non-linearity into linear models, kernel methods have been widely applied in many machine learning tasks [Hofmann et al., 2008; Xu et al., 2010]. Although improved performance has been reported on a wide variety of problems, kernel methods require the user to select and tune a single pre-defined kernel. This is not user-friendly, since the most suitable kernel for a specific task is usually hard to determine. Moreover, it is time-consuming and impractical to search exhaustively over a large pool of candidate kernels. Multiple kernel learning (MKL) was proposed to address this issue, as it offers an automatic way of learning an optimal combination of distinct base kernels [Xu et al., 2009]. Generally speaking, an MKL method is expected to perform better than a single kernel approach.
A key step in MKL is to assign a reasonable weight to each kernel according to its importance. One popular approach considers a weighted combination of candidate kernel matrices, leading to a convex quadratically constrained quadratic program. However, this method over-reduces the feasible set of the optimal kernel, which may lead to a less representative solution. In fact, these MKL algorithms sometimes fail to outperform single kernel methods or traditional non-weighted kernel approaches [Yu et al., 2010; Gehler and Nowozin, 2009]. Another issue is inappropriate weight assignment. Some attempts aim to learn the local importance of features by assuming that samples may vary locally [Gönen and Alpaydin, 2008]. However, they induce more complex computational problems.
To address these issues, in this paper, we model the differences among kernels by following two intuitive assumptions: (i) each kernel is a perturbation of the consensus kernel; and (ii) the kernel that is close to the consensus kernel should receive a large weight. As a result, instead of forcing the optimal kernel to be a linear combination of predefined kernels, this approach allows the most suitable kernel to reside in the neighborhood of some kernels. Moreover, our proposed method assigns a weight to each kernel automatically, without introducing an additional parameter as existing methods do.
We then combine this novel weighting scheme with graph-based clustering and semi-supervised learning (SSL). Owing to their effectiveness in similarity graph construction, graph-based clustering and SSL have shown impressive performance [Nie et al., 2017a; Kang et al., 2017b]. Finally, a novel multiple kernel learning framework for clustering and semi-supervised learning is developed.
In summary, our main contributions are two-fold:
We propose a novel way to construct the optimal kernel and assign weights to base kernels. Notably, our proposed method can find a better kernel in the neighborhood of candidate kernels. The weight is a function of the kernel matrices, so we do not need to introduce an additional parameter as existing methods do. This also eases the burden of solving the constrained quadratic program.
A unified framework for clustering and SSL is developed. It seamlessly integrates the components of graph construction, label learning, and kernel learning by incorporating the graph structure constraint. This allows them to negotiate with each other to achieve overall optimality. Our experiments on multiple real-world datasets verify the effectiveness of the proposed framework.
2 Related Work
In this section, we divide the related work into two categories, namely graph-based clustering and SSL, and parameter-weighted multiple kernel learning.
2.1 Graph-based Clustering and SSL
Graph-based clustering [Ng et al., 2002; Yang et al., 2017] and SSL [Zhu et al., 2003] have been popular for their simplicity and impressive performance. The graph matrix that measures the similarity of data points is crucial to their performance, and there is no satisfying solution for this problem. Recently, automatically learning the graph from data has achieved promising results. One approach is based on the adaptive neighbor idea, i.e., $s_{ij}$ is learned as a measure of the probability that $x_i$ is a neighbor of $x_j$. Then $S$ is treated as the graph input for clustering [Nie et al., 2014; Huang et al., 2018b] and SSL [Nie et al., 2017a]. Another approach uses the so-called self-expressiveness property, i.e., each data point is expressed as a weighted combination of other points, and this learned weight matrix behaves like the graph matrix. Representative works in this category include [Huang et al., 2015; Li et al., 2015; Kang et al., 2018]. These methods are all developed in the original feature space. To make it more general, we develop our model in kernel space in this paper. Our purpose is to learn a graph with exactly $c$ connected components if the data contain $c$ clusters or classes. In this work, we consider this condition explicitly.
2.2 Parameter-weighted Multiple Kernel Learning
It is well known that the performance of a kernel method crucially depends on the kernel function, as it intrinsically specifies the feature space. MKL is an efficient way for automatic kernel selection and for embedding different notions of similarity [Kang et al., 2017a]. It is generally formulated as follows:
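As a sketch of the generic parameter-weighted formulation referred to as Eq. (1), the problem can be written as below. The symbols follow the definitions given right after the equation; the number of base kernels $r$ and the exact form of the normalization constraint on the weights are assumptions made here for illustration.

```latex
\min_{\theta}\; f(K)
\quad \text{s.t.} \quad
K = \sum_{i=1}^{r} \theta_i H^i, \qquad
\sum_{i=1}^{r} (\theta_i)^{q} = 1, \qquad
\theta_i \ge 0 .
\tag{1}
```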
where $f$ is the objective function, $K$ is the consensus kernel, $H^i$ is the $i$-th artificially constructed kernel, $\theta_i$ represents the weight for kernel $H^i$, and $q > 0$ is used to smooth the weight distribution. Therefore, we frequently need to solve for $\theta$ and tune $q$. Though this approach is widely used, it still suffers from the following problems. First, the linear combination of base kernels over-reduces the feasible set of optimal kernels, which could result in a learned kernel with limited representation ability. Second, the optimization of kernel weights may lead to inappropriate assignments due to noise and carelessly designed algorithms. Indeed, contrary to the original intention of MKL, this approach sometimes obtains lower accuracy than using equally weighted kernels or merely a single kernel. This hinders the practical use of MKL. This phenomenon has been observed for many years [Cortes, 2009] but rarely studied. Thus, it is vital to develop new approaches.
3 Methodology
3.1 Notations
Throughout the paper, all matrices are written in uppercase. For a matrix $X$, its $ij$-th element and $i$-th column are denoted as $x_{ij}$ and $X_i$, respectively. The trace of $X$ is denoted by $\mathrm{Tr}(X)$. The $i$-th kernel of $X$ is written as $H^i$. The $\ell_2$-norm of a vector $x$ is represented by $\|x\|$. The Frobenius norm of a matrix $X$ is denoted by $\|X\|_F$. $I$ is an identity matrix of proper size. $S \ge 0$ means that all entries of $S$ are nonnegative.
3.2 Self-weighted Multiple Kernel Learning
The aforementioned self-expressiveness based graph learning method can be formulated as:
where $\gamma$ is the trade-off parameter. To exploit the power of kernel methods, we extend Eq. (2) to its kernel version by using a kernel mapping $\phi$. According to the kernel trick $K(x, z) = \phi(x)^\top \phi(z)$, we have
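The step from Eq. (2) to Eq. (3) can be spelled out as follows. This is a sketch under the assumption that the self-expressive objective in Eq. (2) uses a squared Frobenius reconstruction error and a Frobenius-norm regularizer on the graph matrix $S$ with trade-off parameter $\gamma$, and that $K = \phi(X)^\top \phi(X)$ is symmetric, so that $\mathrm{Tr}(S^\top K) = \mathrm{Tr}(KS)$.

```latex
\begin{aligned}
&\min_{S}\; \|X - XS\|_F^2 + \gamma \|S\|_F^2
  \quad \text{s.t. } S \ge 0, && (2)\\[4pt]
&\|\phi(X) - \phi(X)S\|_F^2
  = \mathrm{Tr}\!\left(K - KS - S^\top K + S^\top K S\right)
  = \mathrm{Tr}\!\left(K - 2KS + S^\top K S\right),\\[4pt]
&\min_{S}\; \mathrm{Tr}\!\left(K - 2KS + S^\top K S\right) + \gamma \|S\|_F^2
  \quad \text{s.t. } S \ge 0. && (3)
\end{aligned}
```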
Ideally, we hope to obtain a graph with exactly $c$ connected components if the data contain $c$ clusters or classes. In other words, the graph $S$ is block diagonal under a proper permutation. It is straightforward to check that $S$ in Eq. (3) can hardly satisfy such a constraint.
If the similarity graph matrix $S$ is nonnegative, then the Laplacian matrix $L = D - S$ associated with $S$, where $D$ is the diagonal degree matrix defined as $d_{ii} = \sum_j s_{ij}$, has the following important property [Mohar et al., 1991]:
Theorem 1. The multiplicity $c$ of the eigenvalue 0 of the Laplacian matrix $L$ is equal to the number of connected components in the graph associated with $S$.
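The property above is easy to check numerically on a toy graph. The following is a minimal sketch, assuming the similarity matrix is symmetrized before the Laplacian is formed; the helper names `laplacian` and `count_zero_eigenvalues` are illustrative and not part of the original method.

```python
import numpy as np


def laplacian(S):
    """Return L = D - W, where W = (S + S^T) / 2 and D is the degree matrix of W."""
    W = (S + S.T) / 2.0
    return np.diag(W.sum(axis=1)) - W


def count_zero_eigenvalues(L, tol=1e-10):
    """Count eigenvalues of the symmetric matrix L that are numerically zero."""
    return int(np.sum(np.abs(np.linalg.eigvalsh(L)) < tol))


# Toy similarity graph with two connected components: {0, 1, 2} and {3, 4}.
S = np.zeros((5, 5))
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
np.fill_diagonal(S, 0.0)

L = laplacian(S)
n_components = count_zero_eigenvalues(L)
rank_L = int(np.linalg.matrix_rank(L))
print(n_components, rank_L)
```

For this graph with $n = 5$ nodes and $c = 2$ components, the Laplacian has two zero eigenvalues and rank $n - c = 3$, which is the rank condition used in the derivation that follows.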
Theorem 1 indicates that if $\mathrm{rank}(L) = n - c$, where $n$ is the number of samples, then the constraint on $S$ holds. Therefore, problem (3) can be rewritten as:
Suppose $\sigma_i(L)$ is the $i$-th smallest eigenvalue of $L$. Note that $\sigma_i(L) \ge 0$ because $L$ is positive semi-definite. Problem (4) is equivalent to the following problem for a large enough $\alpha$:
According to Ky Fan's theorem [Fan, 1949], we have:
Here $P \in \mathbb{R}^{n \times c}$ can be the cluster indicator matrix or the label matrix. Therefore, problem (5) is further equivalent to the following problem:
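Collecting the steps above, the chain from the rank-constrained problem to its relaxation can be sketched as follows, with $L$ the Laplacian of $S$, $\sigma_i(L)$ its $i$-th smallest eigenvalue, $n$ the number of samples, and $c$ the number of clusters. The equation numbers follow the references in the text; the exact typesetting is an assumption reconstructed from the surrounding definitions.

```latex
\begin{aligned}
&\min_{S}\; \mathrm{Tr}\!\left(K - 2KS + S^\top K S\right) + \gamma \|S\|_F^2
  \quad \text{s.t. } S \ge 0,\ \mathrm{rank}(L) = n - c, && (4)\\[4pt]
&\min_{S}\; \mathrm{Tr}\!\left(K - 2KS + S^\top K S\right) + \gamma \|S\|_F^2
  + \alpha \sum_{i=1}^{c} \sigma_i(L)
  \quad \text{s.t. } S \ge 0, && (5)\\[4pt]
&\sum_{i=1}^{c} \sigma_i(L)
  = \min_{P \in \mathbb{R}^{n \times c},\; P^\top P = I} \mathrm{Tr}\!\left(P^\top L P\right), && (6)\\[4pt]
&\min_{S,\,P}\; \mathrm{Tr}\!\left(K - 2KS + S^\top K S\right) + \gamma \|S\|_F^2
  + \alpha\, \mathrm{Tr}\!\left(P^\top L P\right)
  \quad \text{s.t. } S \ge 0,\ P^\top P = I. && (7)
\end{aligned}
```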
Problem (7) is much easier to solve than the rank-constrained problem (4). We name this model Kernel-based Graph Learning (KGL). Note that the input of this model is the kernel matrix $K$. It is generally recognized that its performance is largely determined by the choice of kernel. Unfortunately, the most suitable kernel for a particular task is often unknown in advance. Although MKL as in Eq. (1) can be applied to resolve this issue, it is still not satisfactory, as discussed in subsection 2.2.
In this work, we design a novel MKL strategy. It is based on the following two intuitive assumptions: 1) each kernel is a perturbation of the consensus kernel, and 2) the kernel that is close to the consensus kernel should receive a large weight. Motivated by these, we can have the following MKL form:
We can see that $w_i$ depends on the target variable $K$, so it is not directly available. However, $w_i$ can be treated as stationary, i.e., after obtaining $K$, we update $w_i$ correspondingly [Nie et al., 2017b]. Instead of forcing the optimal kernel to be a linear combination of candidate kernels as in Eq. (1), Eq. (8) allows the most suitable kernel to reside in the neighborhood of some kernels [Liu et al., 2009]. This enhances the representation ability of the learned optimal kernel [Liu et al., 2017; Liu et al., 2013]. Furthermore, we do not introduce an additional hyperparameter $\theta$, which often leads to a quadratic program. The optimal weight for each kernel is directly calculated from the kernel matrices. Then our Self-weighted Multiple Kernel Learning (SMKL) framework can be formulated as:
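To make the self-weighting step concrete, the sketch below assumes the weighted objective $\min_K \sum_i w_i \|H^i - K\|_F^2$ with the inverse-distance weight $w_i = 1/(2\|H^i - K\|_F)$, in the spirit of the self-weighted scheme of [Nie et al., 2017b]. Both the exact weight formula and the helper names `self_weights` and `weighted_consensus` are assumptions made for illustration; the sketch iterates only the kernel-learning part, not the full SMKL model.

```python
import numpy as np


def self_weights(kernels, K, eps=1e-12):
    """Assumed weight rule w_i = 1 / (2 * ||H^i - K||_F); eps avoids division by zero."""
    dists = np.array([np.linalg.norm(H - K, ord="fro") for H in kernels])
    return 1.0 / (2.0 * np.maximum(dists, eps))


def weighted_consensus(kernels, w):
    """Minimizer over K of sum_i w_i * ||H^i - K||_F^2 for fixed weights w."""
    w = np.asarray(w, dtype=float)
    return np.tensordot(w, np.stack(kernels), axes=1) / w.sum()


# Three toy kernels: two close to each other and one far away.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
base = A @ A.T
kernels = [base + 0.01 * np.eye(4), base + 0.02 * np.eye(4), base + 5.0 * np.eye(4)]

# Alternate between updating the weights and the consensus kernel.
K = sum(kernels) / len(kernels)
for _ in range(20):
    w = self_weights(kernels, K)
    K = weighted_consensus(kernels, w)
```

In this toy setting the kernel that lies far from the other two ends up with the smallest weight, which is the behaviour the second assumption asks for. No weight hyperparameter has to be tuned; `eps` is only a numerical guard.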
This model enjoys the following properties:
This unified framework fully accounts for the negotiation between learning the optimal kernel and graph/label learning. By iteratively updating $S$, $P$, and $K$, they can be repeatedly improved.
By treating the optimal kernel as a perturbation of base kernels, it effectively enlarges the region from which an optimal kernel can be chosen, and therefore is in a better position than the traditional ones to identify a more suitable kernel.
The kernel weight is directly calculated from the kernel matrices. Therefore, we avoid solving a quadratic program.
To see the effect of our proposed MKL method, we need to examine the approach with traditional kernel learning. For convenience, we denote it as Parameterized MKL (PMKL). It can be written as:
3.3 Optimization
We divide the problem in Eq. (10) into three subproblems and develop an alternating iterative algorithm to solve them.
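For $S$, we fix $P$ and $K$. The problem in Eq. (10) becomes:
Based on $\sum_{i,j} \frac{1}{2}\|P_{i,:} - P_{j,:}\|_2^2\, s_{ij} = \mathrm{Tr}(P^\top L P)$, we can equivalently solve the following problem for each sample:
where $d_i \in \mathbb{R}^{n \times 1}$ with $d_{ij} = \|P_{i,:} - P_{j,:}\|_2^2$. By setting its first derivative w.r.t. $S_i$ to zero, we obtain:
Thus the columns of $S$ can be computed in parallel.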
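The column-wise update can be sketched as below. The sketch assumes that the per-sample objective is $-2K_i^\top S_i + S_i^\top K S_i + \gamma S_i^\top S_i + \frac{\alpha}{2} d_i^\top S_i$, where $K_i$ and $S_i$ are the $i$-th columns of $K$ and $S$ and $d_{ij} = \|P_{i,:} - P_{j,:}\|_2^2$, so that setting the gradient to zero gives $S_i = (K + \gamma I)^{-1}(K_i - \frac{\alpha}{4} d_i)$. This form is reconstructed from the surrounding description and should be read as an assumption; `update_S` is a hypothetical helper name.

```python
import numpy as np


def update_S(K, P, alpha, gamma):
    """Solve (K + gamma*I) S_i = K_i - (alpha/4) d_i for all columns i at once."""
    n = K.shape[0]
    # D[i, j] = ||P_{i,:} - P_{j,:}||^2, so column i of D is the vector d_i.
    sq = np.sum(P ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (P @ P.T)
    rhs = K - (alpha / 4.0) * D
    S = np.linalg.solve(K + gamma * np.eye(n), rhs)
    return S, D


# Small random instance to check the stationarity condition.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
K = X @ X.T                                  # symmetric PSD kernel matrix
P = np.linalg.qr(rng.standard_normal((6, 2)))[0]
alpha, gamma = 0.5, 1.0

S, D = update_S(K, P, alpha, gamma)
# Gradient of the assumed per-column objective, stacked column by column.
grad = -2.0 * K + 2.0 * (K + gamma * np.eye(6)) @ S + (alpha / 2.0) * D
```

Because every column shares the same system matrix $K + \gamma I$, a single call to `np.linalg.solve` handles all samples. The closed form does not enforce $S \ge 0$ by itself; clipping negative entries afterwards, for example with `np.maximum(S, 0)`, is one simple heuristic and is not taken from the original text.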
For $K$, we fix $S$ and $P$. The problem in Eq. (10) becomes:
Similar to (14), it yields:
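A sketch of the resulting kernel update is given below. It assumes that the $K$-subproblem is $\min_K \mathrm{Tr}(K - 2KS + S^\top K S) + \beta \sum_i w_i \|H^i - K\|_F^2$ with the weights $w_i$ held fixed, and that $K$ is treated as an unconstrained matrix. Setting the gradient $I - 2S^\top + SS^\top + 2\beta \sum_i w_i (K - H^i)$ to zero then gives the closed form used in the code. The function name `update_K` is hypothetical.

```python
import numpy as np


def update_K(S, kernels, w, beta):
    """K = (2 S^T - S S^T - I + 2*beta*sum_i w_i H^i) / (2*beta*sum_i w_i)."""
    n = S.shape[0]
    w = np.asarray(w, dtype=float)
    weighted_sum = np.tensordot(w, np.stack(kernels), axes=1)
    numerator = 2.0 * S.T - S @ S.T - np.eye(n) + 2.0 * beta * weighted_sum
    return numerator / (2.0 * beta * w.sum())


# Small random instance to check that the gradient vanishes at the update.
rng = np.random.default_rng(0)
n = 5
S = rng.random((n, n))
kernels = []
for _ in range(3):
    A = rng.standard_normal((n, n))
    kernels.append(A @ A.T)
w = np.array([0.5, 0.3, 0.2])
beta = 2.0

K = update_K(S, kernels, w, beta)
grad = (np.eye(n) - 2.0 * S.T + S @ S.T
        + 2.0 * beta * sum(wi * (K - H) for wi, H in zip(w, kernels)))
```

The update is a weighted average of the base kernels, shifted by a term that depends on the current graph $S$, which is how the graph and the kernel influence each other across iterations.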
For $P$, we fix $K$ and $S$. The problem in Eq. (10) becomes:
The optimal solution consists of the $c$ eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues.
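The eigenvector step can be sketched as follows. The sketch assumes the Laplacian is built from the symmetrized graph $(S + S^\top)/2$ so that a symmetric eigensolver applies; `update_P` is an illustrative name. By Ky Fan's theorem, the attained value of $\mathrm{Tr}(P^\top L P)$ equals the sum of the $c$ smallest eigenvalues, which the last lines verify.

```python
import numpy as np


def update_P(S, c):
    """Return the c eigenvectors of the graph Laplacian with the smallest eigenvalues."""
    W = (S + S.T) / 2.0
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvecs[:, :c], eigvals[:c], L


rng = np.random.default_rng(0)
S = rng.random((6, 6))                       # nonnegative toy graph
c = 2
P, smallest, L = update_P(S, c)

objective = np.trace(P.T @ L @ P)            # value of Tr(P^T L P) at the solution
ky_fan_value = smallest.sum()                # sum of the c smallest eigenvalues
```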
3.4 Extend to Semi-supervised Classification
Model (10) also lends itself to semi-supervised classification. Graph construction and label inference are two fundamental stages in SSL. Solving two separate problems only once is suboptimal since label information is not exploited when it learns the graph. SMKL unifies these two fundamental components into a unified framework. Then the given labels and estimated labels will be utilized to build the graph and to predict the unknown labels.
Based on a similar approach, we can reformulate SMKL for semi-supervised classification as:
where $Y_l = [y_1, \ldots, y_l]^\top$ denotes the label matrix and $l$ is the number of labeled points. $y_i \in \mathbb{R}^{c \times 1}$ is one-hot, and $y_{ij} = 1$ indicates that the $i$-th sample belongs to the $j$-th class. $Q = [Q_l; Q_u] = [Y_l; Q_u]$, where the unlabeled points are placed at the back. Problem (18) can be solved with the same procedure as (10); the difference lies in updating $Q$.
To solve for $Q$, we take the derivative of (18) with respect to $Q$ and obtain $LQ = 0$, i.e.,
Finally, the class label for unlabeled points could be assigned according to the following decision rule:
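The label update and the decision rule can be sketched as follows. The sketch assumes that the labeled points come first, that the Laplacian is partitioned into labeled and unlabeled blocks, and that the condition $LQ = 0$ on the unlabeled rows gives the harmonic solution $Q_u = -L_{uu}^{-1} L_{ul} Y_l$, with each unlabeled point assigned to $\arg\max_j Q_{ij}$. These forms follow the standard harmonic-function construction and are an assumption about the omitted equations; `propagate_labels` is a hypothetical helper.

```python
import numpy as np


def propagate_labels(L, Y_l):
    """Solve L_uu Q_u = -L_ul Y_l and assign each unlabeled point to its argmax class."""
    l = Y_l.shape[0]                  # number of labeled points (ordered first)
    L_uu = L[l:, l:]
    L_ul = L[l:, :l]
    Q_u = -np.linalg.solve(L_uu, L_ul @ Y_l)
    return Q_u, np.argmax(Q_u, axis=1)


# Toy graph: node 0 has class 0, node 1 has class 1, nodes 2..5 are unlabeled.
W = np.zeros((6, 6))
for i, j, weight in [(0, 2, 1.0), (2, 3, 1.0), (1, 4, 1.0), (4, 5, 1.0), (3, 4, 0.1)]:
    W[i, j] = W[j, i] = weight
L = np.diag(W.sum(axis=1)) - W
Y_l = np.array([[1.0, 0.0],
                [0.0, 1.0]])

Q_u, pred = propagate_labels(L, Y_l)
```

In this example, nodes 2 and 3 are strongly tied to the class-0 seed and nodes 4 and 5 to the class-1 seed, so the predicted classes are 0, 0, 1, 1. The linear system is solvable as long as every unlabeled node is connected, directly or indirectly, to some labeled node.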
4 Clustering
Table 1: Description of the data sets (columns: # instances, # features, # classes).
In this section, we conduct clustering experiments to demonstrate the efficacy of our method.
4.1 Data Sets
We conduct experiments on six publicly available data sets. We summarize the information of these data sets in Table 1. Specifically, the first two data sets, YALE and JAFFE, consist of face images. YEAST is a microarray data set. Tr11, Tr41, and Tr45 are derived from the NIST TREC Document Database.
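We design 12 kernels. They are: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t\, d_{\max}^2))$, where $d_{\max}$ is the maximal distance between samples and $t$ varies over a set of seven values; a linear kernel $K(x, y) = x^\top y$; and four polynomial kernels $K(x, y) = (a + x^\top y)^b$ obtained by varying $a$ and $b$. Besides, all kernels are rescaled to $[0, 1]$ by dividing each element by the largest pairwise squared distance.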
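A pool of this shape can be built as in the sketch below. It assumes that rows of `X` are samples and uses the Gaussian, linear, and polynomial forms $\exp(-\|x - y\|_2^2 / (t\, d_{\max}^2))$, $x^\top y$, and $(a + x^\top y)^b$. The grids `T_VALUES` and `POLY_PARAMS` are hypothetical placeholders chosen only to produce seven Gaussian and four polynomial kernels; they are not the values used in the experiments, and the final rescaling step is omitted.

```python
import numpy as np

# Hypothetical parameter grids (illustrative only).
T_VALUES = (0.01, 0.05, 0.1, 1.0, 10.0, 50.0, 100.0)
POLY_PARAMS = ((0, 2), (0, 4), (1, 2), (1, 4))


def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(D2, 0.0)        # clip tiny negatives caused by round-off


def build_kernels(X, t_values, poly_params):
    """Return a list of Gaussian, linear, and polynomial kernel matrices."""
    D2 = pairwise_sq_dists(X)
    d_max_sq = D2.max()               # squared maximal distance between samples
    G = X @ X.T                       # linear kernel (Gram matrix)
    kernels = [np.exp(-D2 / (t * d_max_sq)) for t in t_values]
    kernels.append(G)
    kernels.extend((a + G) ** b for a, b in poly_params)
    return kernels


rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
kernels = build_kernels(X, T_VALUES, POLY_PARAMS)
```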
4.2 Comparison Methods
We compare with a number of single kernel and multiple kernel learning based clustering methods. They include: Spectral Clustering (SC) [Ng et al., 2002], Simplex Sparse Representation (SSR) [Huang et al., 2015], Multiple Kernel k-means (MKKM) [Huang et al., 2012b], Affinity Aggregation for Spectral Clustering (AASC) [Huang et al., 2012a], Robust Multiple Kernel k-means (RMKKM)
4.3 Performance Evaluation
SMKL is compared with other techniques. We show the clustering results in terms of accuracy (Acc) and normalized mutual information (NMI) in Table 2. For SC and KGL, we report the best performance achieved among the 12 kernels. It can clearly be seen that SMKL achieves the best performance in most cases. Compared to PMKL, SMKL
4.4 Parameter Analysis
There are three parameters in our model (10). Figure 1 shows the clustering accuracy on the YALE data set with varying $\alpha$, $\beta$, and $\gamma$. We can observe that the performance is not very sensitive to these parameters. The same conclusion holds for NMI.
5 Semi-supervised Classification
In this section, we assess the effectiveness of SMKL on semi-supervised classification task.
5.1 Data Sets
1) Evaluation on Face Recognition: We examine the effectiveness of our graph learning for face recognition on two frequently used face databases: YALE and JAFFE. The YALE face data set contains 15 individuals, and each person has 11 near-frontal images taken under different illuminations. Each image is resized to 32×32 pixels. The JAFFE face database consists of 10 individuals, and each subject has 7 different facial expressions (6 basic facial expressions + 1 neutral). The images are resized to 26×26 pixels.
2) Evaluation on Digit/Letter Recognition: In this experiment, we address the digit/letter recognition problem on the BA database. The data set consists of the digits "0" through "9" and the capital letters "A" to "Z". Therefore, there are 36 classes and each class has 39 samples.
3) Evaluation on Visual Object Recognition: We conduct a visual object recognition experiment on the COIL20 database. The database consists of 20 objects and 72 images for each object. For each object, the images were taken 5 degrees apart as the object rotated on a turntable. The size of each image is 32×32 pixels.
Similar to the clustering experiment, we construct 7 kernels for each data set: four Gaussian kernels with different values of $t$, a linear kernel, and two polynomial kernels.
5.2 Comparison Methods
We compare our method with several other state-of-the-art algorithms.
Local and Global Consistency (LGC) [Zhou et al., 2004]: LGC is a popular label propagation method. For this method, the kernel matrix is used to compute $L$.
Gaussian Field and Harmonic function (GFHF) [Zhu et al., 2003]: Different from LGC, GFHF is another mechanism that infers the unknown labels by propagating labels through the pairwise similarity.
Semi-supervised Classification with Adaptive Neighbours (SCAN) [Nie et al., 2017a]: Based on the adaptive neighbors method, SCAN shows much better performance than many other techniques.
A Unified Optimization Framework for SSL [Li et al., 2015]: Li et al. propose a unified framework based on the self-expressiveness approach. By using a low-rank and a sparse regularizer, they obtain the SLRR and SR methods, respectively.
5.3 Performance Evaluation
We randomly choose 10%, 30%, and 50% of the samples as labeled data and repeat each setting 20 times. Classification accuracy and deviation are shown in Table 3. More concretely, for GFHF and LGC, the seven constructed kernels are tested and the best performance is reported. In contrast, for SCAN, SLRR, SR, and SMKL, label prediction and graph learning are conducted in a unified framework.
As expected, the classification accuracy of all methods increases monotonically with the percentage of labeled samples. As can be observed, our SMKL method outperforms the other state-of-the-art methods in most cases. This confirms the effectiveness of our proposed method on the SSL task.
6 Conclusion
This paper proposes a novel multiple kernel learning framework for clustering and semi-supervised classification. In this model, a more flexible kernel learning strategy is developed to enhance the representation ability of the learned optimal kernel and to assign a weight to each base kernel. An iterative algorithm is designed to solve the resultant optimization problem, so that graph construction, label learning, and kernel learning are boosted by each other. Comprehensive experimental results clearly demonstrate the superiority of our method.
Acknowledgments
This paper was in part supported by two Fundamental Research Funds for the Central Universities of China (Nos. ZYGX2017KYQD177, ZYGX2016Z003), Grants from the Natural Science Foundation of China (No. 61572111) and a 985 Project of UESTC (No. A1098531023601041).
References
- Corinna Cortes. Can learning kernels help performance. In Invited talk at International Conference on Machine Learning (ICML 2009). Montréal, Canada, 2009.
- Liang Du, Peng Zhou, Lei Shi, Hanmo Wang, Mingyu Fan, Wenjian Wang, and Yi-Dong Shen. Robust multiple kernel k-means using l21-norm. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3476–3482. AAAI Press, 2015.
- Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences, 35(11):652–655, 1949.
- Peter Gehler and Sebastian Nowozin. On feature combination for multiclass object classification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 221–228. IEEE, 2009.
- Mehmet Gönen and Ethem Alpaydin. Localized multiple kernel learning. In Proceedings of the 25th international conference on Machine learning, pages 352–359. ACM, 2008.
- Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods in machine learning. The annals of statistics, pages 1171–1220, 2008.
- Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Affinity aggregation for spectral clustering. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 773–780. IEEE, 2012.
- Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Multiple kernel fuzzy clustering. IEEE Transactions on Fuzzy Systems, 20(1):120–134, 2012.
- Jin Huang, Feiping Nie, and Heng Huang. A new simplex sparse learning model to measure data similarity for clustering. In IJCAI, pages 3569–3575, 2015.
- Shudong Huang, Zhao Kang, and Zenglin Xu. Self-weighted multi-view clustering with soft capped norm. Knowledge-Based Systems, 2018.
- Shudong Huang, Zenglin Xu, and Jiancheng Lv. Adaptive local structure learning for document co-clustering. Knowledge-Based Systems, 148:74–84, 2018.
- Zhao Kang, Chong Peng, and Qiang Cheng. Kernel-driven similarity learning. Neurocomputing, 267:210–219, 2017.
- Zhao Kang, Chong Peng, and Qiang Cheng. Twin learning for similarity and clustering: A unified kernel approach. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). AAAI Press, 2017.
- Zhao Kang, Chong Peng, Qiang Cheng, and Zenglin Xu. Unified spectral clustering with optimal graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press, 2018.
- Chun-Guang Li, Zhouchen Lin, Honggang Zhang, and Jun Guo. Learning semi-supervised representation towards a unified optimization framework for semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2767–2775, 2015.
- Jun Liu, Jianhui Chen, Songcan Chen, and Jieping Ye. Learning the optimal neighborhood kernel for classification. In IJCAI, pages 1144–1149, 2009.
- Xinwang Liu, Jianping Yin, Lei Wang, Lingqiao Liu, Jun Liu, Chenping Hou, and Jian Zhang. An adaptive approach to learning optimal neighborhood kernels. IEEE transactions on cybernetics, 43(1):371–384, 2013.
- Xinwang Liu, Sihang Zhou, Yueqing Wang, Miaomiao Li, Yong Dou, En Zhu, Jianping Yin, and Han Li. Optimal neighborhood kernel clustering with multiple kernels. In AAAI, pages 2266–2272, 2017.
- Bojan Mohar, Y Alavi, G Chartrand, and OR Oellermann. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2(871-898):12, 1991.
- Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849–856, 2002.
- Feiping Nie, Xiaoqian Wang, and Heng Huang. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 977–986. ACM, 2014.
- Feiping Nie, Guohao Cai, and Xuelong Li. Multi-view clustering and semi-supervised classification with adaptive neighbours. In AAAI, pages 2408–2414, 2017.
- Feiping Nie, Jing Li, and Xuelong Li. Self-weighted multiview clustering with multiple graphs. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 2564–2570, 2017.
- Xi Peng, Shijie Xiao, Jiashi Feng, Wei-Yun Yau, and Zhang Yi. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.
- Zenglin Xu, Rong Jin, Irwin King, and Michael Lyu. An extended level method for efficient multiple kernel learning. In Advances in neural information processing systems, pages 1825–1832, 2009.
- Zenglin Xu, Rong Jin, Haiqin Yang, Irwin King, and Michael R Lyu. Simple and efficient multiple kernel learning by group lasso. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 1175–1182. Citeseer, 2010.
- Yang Yang, Fumin Shen, Zi Huang, Heng Tao Shen, and Xuelong Li. Discrete nonnegative spectral clustering. IEEE Transactions on Knowledge and Data Engineering, 29(9):1834–1845, 2017.
- Shi Yu, Tillmann Falck, Anneleen Daemen, Leon-Charles Tranchevent, Johan AK Suykens, Bart De Moor, and Yves Moreau. L 2-norm multiple kernel learning and its application to biomedical data fusion. BMC bioinformatics, 11(1):309, 2010.
- Denny Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in neural information processing systems, pages 321–328, 2004.
- Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pages 912–919, 2003.