Self-weighted Multiple Kernel Learning for Graph-based Clustering and Semi-supervised Classification
Abstract
Multiple kernel learning (MKL) methods are generally believed to perform better than single kernel methods. However, some empirical studies show that this is not always true: combining multiple kernels may yield even worse performance than using a single kernel. There are two possible reasons for this failure: (i) most existing MKL methods assume that the optimal kernel is a linear combination of base kernels, which may not hold in practice; and (ii) some kernel weights are inappropriately assigned due to noise and carelessly designed algorithms. In this paper, we propose a novel MKL framework built on two intuitive assumptions: (i) each kernel is a perturbation of the consensus kernel; and (ii) a kernel that is close to the consensus kernel should be assigned a large weight. Impressively, the proposed method can automatically assign an appropriate weight to each kernel without introducing additional parameters, as existing methods do. The proposed scheme is integrated into a unified framework for graph-based clustering and semi-supervised classification. We have conducted experiments on multiple benchmark datasets, and the empirical results verify the superiority of the proposed framework.
1 Introduction
As a principled way of introducing nonlinearity into linear models, kernel methods have been widely applied in many machine learning tasks [\citeauthoryearHofmann et al.2008, \citeauthoryearXu et al.2010]. Although improved performance has been reported in a wide variety of problems, kernel methods require the user to select and tune a single predefined kernel. This is not user-friendly, since the most suitable kernel for a specific task is usually hard to decide. Moreover, it is time-consuming and impractical to exhaustively search a large pool of candidate kernels. Multiple kernel learning (MKL) was proposed to address this issue, as it offers an automatic way of learning an optimal combination of distinct base kernels [\citeauthoryearXu et al.2009]. Generally speaking, an MKL method should yield better performance than a single kernel approach.
A key step in MKL is to assign a reasonable weight to each kernel according to its importance. One popular approach considers a weighted combination of candidate kernel matrices, leading to a convex quadratically constrained quadratic program. However, this method over-reduces the feasible set of the optimal kernel, which may lead to a less representative solution. In fact, these MKL algorithms sometimes fail to outperform single kernel methods or traditional non-weighted kernel approaches [\citeauthoryearYu et al.2010, \citeauthoryearGehler and Nowozin2009]. Another issue is inappropriate weight assignment. Some attempts aim to learn the local importance of features by assuming that samples may vary locally [\citeauthoryearGönen and Alpaydin2008]. However, they induce more complex computational problems.
To address these issues, in this paper we model the differences among kernels by following two intuitive assumptions: (i) each kernel is a perturbation of the consensus kernel; and (ii) the kernel that is close to the consensus kernel should receive a large weight. As a result, instead of enforcing the optimal kernel to be a linear combination of predefined kernels, this approach allows the most suitable kernel to reside in the neighborhood of some kernels. Moreover, our proposed method can assign an optimal weight to each kernel automatically, without introducing an additional parameter as existing methods do.
Then we combine this novel weighting scheme with graph-based clustering and semi-supervised learning (SSL). Due to their effectiveness in similarity graph construction, graph-based clustering and SSL have shown impressive performance [\citeauthoryearNie et al.2017a, \citeauthoryearKang et al.2017b]. Finally, a novel multiple kernel learning framework for clustering and semi-supervised learning is developed.
In summary, our main contributions are twofold:

We propose a novel way to construct the optimal kernel and assign weights to base kernels. Notably, our method can find a better kernel in the neighborhood of the candidate kernels. Each weight is a function of the kernel matrices, so we do not need to introduce an additional parameter as existing methods do. This also eases the burden of solving a constrained quadratic program.

A unified framework for clustering and SSL is developed. It seamlessly integrates the components of graph construction, label learning, and kernel learning by incorporating the graph structure constraint. This allows them to negotiate with each other to achieve overall optimality. Our experiments on multiple realworld datasets verify the effectiveness of the proposed framework.
2 Related Work
In this section, we divide the related work into two categories, namely graph-based clustering and SSL, and parameter-weighted multiple kernel learning.
2.1 Graph-based Clustering and SSL
Graph-based clustering [\citeauthoryearNg et al.2002, \citeauthoryearYang et al.2017] and SSL [\citeauthoryearZhu et al.2003] have been popular for their simplicity and impressive performance. The graph matrix that measures the similarity of data points is crucial to their performance, and there is no satisfying solution to this problem. Recently, automatically learning the graph from data has achieved promising results. One approach is based on the adaptive neighbor idea, i.e., a similarity $s_{ij}$ is learned as a measure of the probability that $x_j$ is a neighbor of $x_i$. Then $S$ is treated as the graph input for clustering [\citeauthoryearNie et al.2014, \citeauthoryearHuang et al.2018b] and SSL [\citeauthoryearNie et al.2017a]. Another one uses the so-called self-expressiveness property, i.e., each data point is expressed as a weighted combination of the other points, and this learned weight matrix behaves like a graph matrix. Representative works in this category include [\citeauthoryearHuang et al.2015, \citeauthoryearLi et al.2015, \citeauthoryearKang et al.2018]. These methods are all developed in the original feature space. To be more general, we develop our model in kernel space in this paper. Our purpose is to learn a graph with exactly $c$ connected components if the data contain $c$ clusters or classes. In this work, we will consider this condition explicitly.
2.2 Parameter-weighted Multiple Kernel Learning
It is well-known that the performance of a kernel method crucially depends on the kernel function, as it intrinsically specifies the feature space. MKL is an efficient way to automate kernel selection and to embed different notions of similarity [\citeauthoryearKang et al.2017a]. It is generally formulated as follows:

$$\min_{K,\, w}\; J(K) \quad \text{s.t.}\quad K = \sum_{i=1}^{r} w_i^{\eta} H^i,\; \sum_{i=1}^{r} w_i = 1,\; w_i \ge 0, \qquad (1)$$

where $J$ is the objective function, $K$ is the consensus kernel, $H^i$ is the $i$-th artificially constructed base kernel, $w_i$ represents the weight for kernel $H^i$, and $\eta$ is used to smoothen the weight distribution. Therefore, we frequently have to solve for $w$ and tune $\eta$. Though this approach is widely used, it still suffers from the following problems. First, the linear combination of base kernels over-reduces the feasible set of optimal kernels, which could result in a learned kernel with limited representation ability. Second, the optimization of kernel weights may lead to inappropriate assignments due to noise and carelessly designed algorithms. Indeed, contrary to the original intention of MKL, this approach sometimes obtains lower accuracy than using equally weighted kernels or merely a single kernel method. This hinders the practical use of MKL. The phenomenon has been observed for many years [\citeauthoryearCortes2009] but rarely studied. Thus, it is vital to develop new approaches.
3 Methodology
3.1 Notations
Throughout the paper, all matrices are written in uppercase. For a matrix $X$, its $(i,j)$-th element and $i$-th column are denoted by $x_{ij}$ and $x_i$, respectively. The trace of $X$ is denoted by $\mathrm{Tr}(X)$. The $i$-th base kernel is written as $H^i$. The $\ell_2$-norm of a vector $x$ is represented by $\|x\|$. The Frobenius norm of a matrix $X$ is denoted by $\|X\|_F$. $I$ is an identity matrix of proper size. $Z \ge 0$ means all entries of $Z$ are nonnegative.
3.2 Self-weighted Multiple Kernel Learning
The aforementioned self-expressiveness based graph learning method can be formulated as:

$$\min_{Z}\; \|X - XZ\|_F^2 + \alpha \|Z\|_F^2 \quad \text{s.t.}\; Z \ge 0, \qquad (2)$$

where $\alpha > 0$ is the trade-off parameter. To retain the power of kernel methods, we extend Eq. (2) to its kernel version by using a kernel mapping $\phi$. According to the kernel trick $K_{ij} = \phi(x_i)^{\top}\phi(x_j)$, we have

$$\min_{Z}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \alpha \|Z\|_F^2 \quad \text{s.t.}\; Z \ge 0. \qquad (3)$$
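For intuition, the objective of the kernel self-expressive model can be evaluated directly. The sketch below (in NumPy; `kgl_objective` is a hypothetical helper name, not from the paper) computes $\mathrm{Tr}(K - 2KZ + Z^{\top}KZ) + \alpha\|Z\|_F^2$:

```python
import numpy as np

def kgl_objective(K, Z, alpha):
    """Objective of the kernel self-expressive model:
    Tr(K - 2 K Z + Z^T K Z) + alpha * ||Z||_F^2 (Z >= 0 assumed)."""
    return (np.trace(K - 2.0 * K @ Z + Z.T @ K @ Z)
            + alpha * np.linalg.norm(Z) ** 2)

# With Z = I the self-expression residual vanishes and only the
# regularizer alpha * ||I||_F^2 = alpha * n remains.
A = np.array([[2.0, 1.0], [1.0, 3.0]])  # a small PSD kernel matrix
print(kgl_objective(A, np.eye(2), 0.5))
```

With $Z = 0$ the objective reduces to $\mathrm{Tr}(K)$, which gives a quick sanity check.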
Ideally, we hope to obtain a graph with exactly $c$ connected components if the data contain $c$ clusters or classes. In other words, the graph is block diagonal under proper permutations. It is straightforward to check that $Z$ in Eq. (3) can hardly satisfy such a constraint.
If the similarity graph matrix $S$ is nonnegative, then the Laplacian matrix $L = D - S$, where $D$ is the diagonal degree matrix defined as $D_{ii} = \sum_j s_{ij}$, associated with $S$ has an important property as follows [\citeauthoryearMohar et al.1991]:
Theorem 1
The multiplicity $c$ of the eigenvalue 0 of the Laplacian matrix $L$ is equal to the number of connected components in the graph associated with $S$.
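Theorem 1 is easy to verify numerically: counting the near-zero eigenvalues of $L = D - S$ recovers the number of connected components. A minimal NumPy sketch (the helper name and tolerance are our own choices):

```python
import numpy as np

def zero_eig_multiplicity(S, tol=1e-9):
    """Count zero eigenvalues of the graph Laplacian L = D - S.

    For a nonnegative symmetric similarity matrix S, this equals the
    number of connected components of the graph (Theorem 1).
    """
    D = np.diag(S.sum(axis=1))
    L = D - S
    eigvals = np.linalg.eigvalsh(L)  # L is symmetric PSD
    return int(np.sum(eigvals < tol))

# A block-diagonal similarity graph with two components.
S = np.zeros((4, 4))
S[0, 1] = S[1, 0] = 1.0  # component {0, 1}
S[2, 3] = S[3, 2] = 1.0  # component {2, 3}
print(zero_eig_multiplicity(S))  # prints 2
```

A fully connected graph, by contrast, yields multiplicity 1.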
Theorem 1 indicates that if $\mathrm{rank}(L) = n - c$, then the desired constraint on $Z$ holds. Therefore, problem (3) can be rewritten as:

$$\min_{Z}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \alpha \|Z\|_F^2 \quad \text{s.t.}\; Z \ge 0,\; \mathrm{rank}(L) = n - c. \qquad (4)$$

Suppose $\sigma_i(L)$ is the $i$-th smallest eigenvalue of $L$. Note that $\sigma_i(L) \ge 0$ because $L$ is positive semidefinite. Problem (4) is equivalent to the following problem for a large enough $\beta$:

$$\min_{Z}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \alpha \|Z\|_F^2 + \beta \sum_{i=1}^{c} \sigma_i(L) \quad \text{s.t.}\; Z \ge 0. \qquad (5)$$
According to Ky Fan's Theorem [\citeauthoryearFan1949], we have:

$$\sum_{i=1}^{c} \sigma_i(L) = \min_{F \in \mathbb{R}^{n \times c},\; F^{\top} F = I} \mathrm{Tr}(F^{\top} L F). \qquad (6)$$
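The Ky Fan identity in Eq. (6) can be checked numerically: for a symmetric PSD matrix, the minimum of $\mathrm{Tr}(F^{\top} L F)$ over orthonormal $F$ is attained by the eigenvectors of the $c$ smallest eigenvalues. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
L = A @ A.T            # symmetric PSD, playing the role of the Laplacian
c = 2

eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
F = eigvecs[:, :c]                    # c smallest eigenvectors, F^T F = I
lhs = eigvals[:c].sum()               # sum of the c smallest eigenvalues
rhs = np.trace(F.T @ L @ F)           # Tr(F^T L F) at the minimizer
assert np.isclose(lhs, rhs)
```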
$F$ can be the cluster indicator matrix or the label matrix. Therefore, problem (5) is further equivalent to the following problem:

$$\min_{Z,\, F}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(F^{\top} L F) \quad \text{s.t.}\; Z \ge 0,\; F^{\top} F = I. \qquad (7)$$
Problem (7) is much easier to solve than the rank-constrained problem (4). We name this model Kernel-based Graph Learning (KGL). Note that the input of this model is the kernel matrix $K$. It is generally recognized that its performance is largely determined by the choice of kernel. Unfortunately, the most suitable kernel for a particular task is often unknown in advance. Although MKL as in Eq. (1) can be applied to resolve this issue, it is still not satisfying, as discussed in subsection 2.2.
In this work, we design a novel MKL strategy based on the following two intuitive assumptions: 1) each kernel is a perturbation of the consensus kernel, and 2) the kernel that is close to the consensus kernel should receive a large weight. Motivated by these, we arrive at the following MKL form:

$$\min_{K}\; \sum_{i=1}^{r} w_i \|K - H^i\|_F^2 \qquad (8)$$

and

$$w_i = \frac{1}{2\|K - H^i\|_F}. \qquad (9)$$
We can see that $w_i$ depends on the target variable $K$, so it is not directly available. But $w_i$ can be set to be stationary, i.e., after obtaining $K$, we update $w_i$ correspondingly [\citeauthoryearNie et al.2017b]. Instead of enforcing the optimal kernel to be a linear combination of candidate kernels as in Eq. (1), Eq. (8) allows the most suitable kernel to reside in the neighborhood of some kernels [\citeauthoryearLiu et al.2009]. This enhances the representation ability of the learned optimal kernel [\citeauthoryearLiu et al.2017, \citeauthoryearLiu et al.2013]. Furthermore, we do not introduce an additional hyperparameter, which often leads to a quadratic program. The optimal weight for each kernel is directly calculated from the kernel matrices. Then our Self-weighted Multiple Kernel Learning (SMKL) framework can be formulated as:
$$\min_{Z,\, F,\, K}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(F^{\top} L F) + \gamma \sum_{i=1}^{r} w_i \|K - H^i\|_F^2 \quad \text{s.t.}\; Z \ge 0,\; F^{\top} F = I. \qquad (10)$$
This model enjoys the following properties:

This unified framework sufficiently considers the negotiation between the process of learning the optimal kernel and that of graph/label learning. By iteratively updating $Z$, $F$, and $K$, they can be repeatedly improved.

By treating the optimal kernel as a perturbation of base kernels, it effectively enlarges the region from which an optimal kernel can be chosen, and therefore is in a better position than the traditional ones to identify a more suitable kernel.

The kernel weight is directly calculated from kernel matrices. Therefore, we avoid solving quadratic program.
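The self-tuned weighting rule of Eq. (9) takes only a few lines. In the sketch below, `kernel_weights` is a hypothetical helper name, and the small epsilon guards against division by zero when a base kernel coincides with the consensus kernel:

```python
import numpy as np

def kernel_weights(K, kernels, eps=1e-12):
    """Self-tuned weights w_i = 1 / (2 * ||K - H_i||_F), cf. Eq. (9).

    Base kernels close to the current consensus kernel K get large
    weights; no extra hyperparameter is introduced.
    """
    return np.array([1.0 / (2.0 * max(np.linalg.norm(K - H), eps))
                     for H in kernels])

# Toy check: the base kernel closest to K receives the largest weight.
K = np.eye(3)
H1 = np.eye(3) + 0.01  # small perturbation of K
H2 = np.eye(3) + 1.0   # large perturbation of K
w = kernel_weights(K, [H1, H2])
assert w[0] > w[1]
```

In the full algorithm, these weights are recomputed after every update of $K$, which keeps them stationary within each subproblem.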
To see the effect of our proposed MKL method, we compare it against the traditional parameterized kernel learning approach, which we denote Parameterized MKL (PMKL) for convenience. It can be written as:

$$\min_{Z,\, F,\, w}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(F^{\top} L F) \quad \text{s.t.}\; Z \ge 0,\; F^{\top} F = I,\; K = \sum_{i=1}^{r} w_i^{\eta} H^i,\; \sum_{i=1}^{r} w_i = 1,\; w_i \ge 0. \qquad (11)$$
3.3 Optimization
We divide the problem in Eq. (10) into three subproblems and develop an alternating, iterative algorithm to solve them.
For $Z$, we fix $F$ and $K$. The problem in Eq. (10) becomes:

$$\min_{Z \ge 0}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(F^{\top} L F). \qquad (12)$$

Based on $\mathrm{Tr}(F^{\top} L F) = \frac{1}{2} \sum_{i,j} \|f_i - f_j\|^2 z_{ij}$, we can equivalently solve the following problem for each sample:

$$\min_{z_i \ge 0}\; z_i^{\top} (K + \alpha I)\, z_i - 2\Big(k_i - \frac{\beta}{4} d_i\Big)^{\top} z_i, \qquad (13)$$

where $d_i \in \mathbb{R}^{n}$ with $d_{ij} = \|f_i - f_j\|^2$, and $k_i$ is the $i$-th column of $K$. By setting the first derivative w.r.t. $z_i$ to zero, we obtain:

$$z_i = (K + \alpha I)^{-1}\Big(k_i - \frac{\beta}{4} d_i\Big). \qquad (14)$$

Thus the columns of $Z$ can be obtained in parallel.
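The $Z$-step can be sketched as follows. This is a simplification: the closed form of Eq. (14) ignores the nonnegativity constraint, which we restore here by projection, and `update_Z` is our own name:

```python
import numpy as np

def update_Z(K, F, alpha, beta):
    """One Z-step: column-wise closed-form update (a sketch of Eq. (14)).

    Each column solves a small unconstrained quadratic; nonnegativity
    is then enforced by projection, a common simplification.
    """
    n = K.shape[0]
    # d[i, j] = ||f_i - f_j||^2, squared distances between indicator rows
    sq = np.sum(F**2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * F @ F.T
    M = np.linalg.inv(K + alpha * np.eye(n))
    Z = M @ (K - (beta / 4.0) * d)  # column i: (K + aI)^{-1}(k_i - b/4 d_i)
    return np.maximum(Z, 0.0)       # project onto Z >= 0
```

Since all columns share the same matrix inverse, the update is computed for the whole matrix at once.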
For $K$, we fix $Z$ and $F$. The problem in Eq. (10) becomes:

$$\min_{K}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \gamma \sum_{i=1}^{r} w_i \|K - H^i\|_F^2. \qquad (15)$$

Similar to (14), setting the first derivative w.r.t. $K$ to zero yields:

$$K = \frac{2\gamma \sum_{i} w_i H^i - \big(I - 2Z^{\top} + Z Z^{\top}\big)}{2\gamma \sum_{i} w_i}. \qquad (16)$$
From Eq. (16) and Eq. (14), we can observe that $K$ and $Z$ are seamlessly coupled; hence they are allowed to negotiate with each other to achieve better results.
For $F$, we fix $Z$ and $K$. The problem in Eq. (10) becomes:

$$\min_{F}\; \mathrm{Tr}(F^{\top} L F) \quad \text{s.t.}\; F^{\top} F = I. \qquad (17)$$

The optimal solution $F$ consists of the eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues.
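The $F$-step therefore reduces to an eigendecomposition of the Laplacian built from the current graph. A sketch, assuming $L$ is formed from the symmetrized graph $(Z + Z^{\top})/2$ and `update_F` is a hypothetical helper:

```python
import numpy as np

def update_F(Z, c):
    """F-step: stack the eigenvectors of L = D - (Z + Z^T)/2 that
    correspond to the c smallest eigenvalues (a sketch of Eq. (17))."""
    S = (Z + Z.T) / 2.0                    # symmetrize the learned graph
    L = np.diag(S.sum(axis=1)) - S         # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)         # ascending eigenvalue order
    return eigvecs[:, :c]                  # columns satisfy F^T F = I
```

Alternating the three steps (Eq. (14), Eq. (16), Eq. (17)), with the weights $w_i$ refreshed after each $K$-update, gives the full SMKL procedure.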
3.4 Extension to Semi-supervised Classification
Model (10) also lends itself to semi-supervised classification. Graph construction and label inference are two fundamental stages in SSL. Solving the two problems separately, each only once, is suboptimal, since the label information is not exploited while the graph is learned. SMKL unifies these two fundamental components into a single framework, so that both the given labels and the estimated labels are utilized to build the graph and to predict the unknown labels.
Following a similar approach, we can reformulate SMKL for semi-supervised classification as:

$$\min_{Z,\, F,\, K}\; \mathrm{Tr}(K - 2KZ + Z^{\top} K Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(F^{\top} L F) + \gamma \sum_{i=1}^{r} w_i \|K - H^i\|_F^2 \quad \text{s.t.}\; Z \ge 0,\; F_l = Y_l, \qquad (18)$$

where $Y_l \in \{0, 1\}^{l \times c}$ denotes the label matrix of the $l$ labeled points. $Y_l$ is one-hot: $(Y_l)_{ij} = 1$ indicates that the $i$-th sample belongs to the $j$-th class. $F = [F_l;\, F_u]$, with the unlabeled points placed in the back. Problem (18) can be solved by the same procedure as (10); the difference lies in updating $F$.
Finally, the class label of each unlabeled point is assigned according to the following decision rule:

$$y_i = \arg\max_{j} F_{ij}. \qquad (20)$$
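The decision rule is simply a row-wise arg-max over the learned soft label matrix; for instance:

```python
import numpy as np

def predict_labels(F_u):
    """Assign each unlabeled point to the class whose entry in its row
    of the learned label matrix is largest (the arg-max decision rule)."""
    return np.argmax(F_u, axis=1)

F_u = np.array([[0.1, 0.8, 0.1],
                [0.7, 0.2, 0.1]])
print(predict_labels(F_u))  # prints [1 0]
```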
Table 1: Statistics of the data sets.

Data    # instances   # features   # classes
YALE        165           1024         15
JAFFE       213            676         10
YEAST      1484           1470         10
TR11        414           6429          9
TR41        878           7454         10
TR45        690           8261         10
4 Clustering
In this section, we conduct clustering experiments to demonstrate the efficacy of our method.
4.1 Data Sets
We conduct experiments on six publicly available data sets, summarized in Table 1. Specifically, the first two data sets, YALE and JAFFE, consist of face images. YEAST is a microarray data set. TR11, TR41, and TR45 are derived from the NIST TREC Document Database.
We design 12 kernels: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|^2 / (t\, d_{\max}^2))$, where $d_{\max}$ is the maximal distance between samples and $t$ varies over a predefined set; a linear kernel $K(x, y) = x^{\top} y$; and four polynomial kernels $K(x, y) = (a + x^{\top} y)^{b}$ with varying $a$ and $b$. Besides, all kernels are rescaled to $[0, 1]$ by dividing each element by the largest pairwise squared distance.
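As an illustration, such a kernel pool can be generated as below. The parameter grids here are placeholders, not the paper's exact sets, and each kernel is rescaled by its largest absolute entry as a simple stand-in for the paper's normalization:

```python
import numpy as np

def build_kernels(X, ts=(0.1, 1.0), poly=((0, 2), (1, 4))):
    """Construct a pool of base kernels from data X (n x d).

    Hypothetical parameter grids: Gaussian kernels scaled by the maximal
    pairwise squared distance d_max, plus a linear kernel and elementwise
    polynomial kernels (a + x^T y)^b.
    """
    G = X @ X.T
    sq = np.diag(G)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * G  # pairwise squared distances
    d_max = dist2.max()
    kernels = [np.exp(-dist2 / (t * d_max)) for t in ts]  # Gaussian kernels
    kernels.append(G)                                     # linear kernel
    kernels += [(a + G) ** b for a, b in poly]            # polynomial kernels
    # rescale each kernel by its largest absolute entry
    return [K / np.abs(K).max() for K in kernels]
```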
Table 2: Clustering results on benchmark data sets.

Data   | Metric | SC     | SSR    | MKKM   | RMKKM  | AASC   | KGL    | PMKL   | SMKL
YALE   | Acc    | 0.4942 | 0.5455 | 0.4570 | 0.5218 | 0.4064 | 0.5549 | 0.5605 | 0.6000
       | NMI    | 0.5292 | 0.5726 | 0.5006 | 0.5558 | 0.4683 | 0.5498 | 0.5643 | 0.6029
JAFFE  | Acc    | 0.7488 | 0.8732 | 0.7455 | 0.8707 | 0.3035 | 0.9877 | 0.9802 | 0.9906
       | NMI    | 0.8208 | 0.9293 | 0.7979 | 0.8937 | 0.2722 | 0.9825 | 0.9806 | 0.9834
YEAST  | Acc    | 0.3555 | 0.2999 | 0.1304 | 0.3163 | 0.3538 | 0.3892 | 0.3952 | 0.4326
       | NMI    | 0.2138 | 0.1585 | 0.1029 | 0.2071 | 0.2119 | 0.2315 | 0.2361 | 0.2652
TR11   | Acc    | 0.5098 | 0.4106 | 0.5013 | 0.5771 | 0.4715 | 0.7425 | 0.7485 | 0.8309
       | NMI    | 0.4311 | 0.2760 | 0.4456 | 0.5608 | 0.3939 | 0.6000 | 0.6137 | 0.7167
TR41   | Acc    | 0.6352 | 0.6378 | 0.5610 | 0.6265 | 0.4590 | 0.6942 | 0.6724 | 0.7631
       | NMI    | 0.6133 | 0.5956 | 0.5775 | 0.6347 | 0.4305 | 0.6008 | 0.6500 | 0.6148
TR45   | Acc    | 0.5739 | 0.7145 | 0.5846 | 0.6400 | 0.5264 | 0.7425 | 0.7468 | 0.7536
       | NMI    | 0.4803 | 0.6782 | 0.5617 | 0.6273 | 0.4190 | 0.6824 | 0.7523 | 0.6965
4.2 Comparison Methods
We compare with a number of single kernel and multiple kernel learning based clustering methods, including: Spectral Clustering (SC) [\citeauthoryearNg et al.2002], Simplex Sparse Representation (SSR) [\citeauthoryearHuang et al.2015], Multiple Kernel k-means (MKKM) [\citeauthoryearHuang et al.2012b], Affinity Aggregation for Spectral Clustering (AASC) [\citeauthoryearHuang et al.2012a], and Robust Multiple Kernel k-means (RMKKM) [\citeauthoryearDu et al.2015].
4.3 Performance Evaluation
SMKL is compared with the other techniques in terms of accuracy (Acc) and NMI in Table 2. For SC and KGL, we report the best performance achieved over the 12 kernels. It can clearly be seen that SMKL achieves the best performance in most cases. Compared to PMKL, SMKL obtains higher accuracy on all six data sets, which validates our self-weighting strategy.
4.4 Parameter Analysis
There are three parameters, $\alpha$, $\beta$, and $\gamma$, in our model (10). Figure 1 shows the clustering accuracy on the YALE data set with varying $\alpha$, $\beta$, and $\gamma$. We can observe that the performance is not very sensitive to these parameters. This conclusion also holds for NMI.
Table 3: Classification accuracy (%) on benchmark data sets (mean±standard deviation).

Data   | Labeled (%) | GFHF        | LGC         | SR          | SLRR        | SCAN        | SMKL
YALE   | 10          | 38.01±1.91  | 47.33±13.96 | 38.83±8.60  | 28.77±9.59  | 45.07±1.30  | 55.87±12.26
       | 30          | 54.13±9.47  | 63.08±2.20  | 58.25±4.25  | 42.58±5.93  | 60.92±4.03  | 74.08±1.92
       | 50          | 60.28±5.16  | 69.56±5.42  | 69.00±6.57  | 51.22±6.78  | 68.94±4.57  | 82.44±3.61
JAFFE  | 10          | 92.85±7.76  | 96.68±2.76  | 97.33±1.51  | 94.38±6.23  | 96.92±1.68  | 97.57±1.55
       | 30          | 98.50±1.01  | 98.86±1.14  | 99.25±0.81  | 98.82±1.05  | 98.20±1.22  | 99.67±0.33
       | 50          | 98.94±1.11  | 99.29±0.94  | 99.82±0.60  | 99.47±0.59  | 99.25±5.79  | 99.91±0.27
BA     | 10          | 45.09±3.09  | 48.37±1.98  | 25.32±1.14  | 20.10±2.51  | 55.05±1.67  | 46.62±1.98
       | 30          | 62.74±0.92  | 63.31±1.03  | 44.16±1.03  | 43.84±1.54  | 68.84±1.09  | 68.99±0.93
       | 50          | 68.30±1.31  | 68.45±1.32  | 54.10±1.55  | 52.49±1.27  | 72.20±1.44  | 84.67±1.06
COIL20 | 10          | 87.74±2.26  | 85.43±1.40  | 93.57±1.59  | 81.10±1.69  | 90.09±1.15  | 91.05±2.03
       | 30          | 95.48±1.40  | 87.82±1.03  | 96.52±0.68  | 87.69±1.39  | 95.27±0.93  | 97.89±2.00
       | 50          | 96.27±0.71  | 88.47±0.45  | 97.87±0.10  | 90.92±1.19  | 97.53±0.82  | 99.97±0.04
5 Semi-supervised Classification
In this section, we assess the effectiveness of SMKL on the semi-supervised classification task.
5.1 Data Sets
1) Evaluation on Face Recognition: We examine the effectiveness of our graph learning for face recognition on two frequently used face databases: YALE and JAFFE. The YALE face data set contains 15 individuals, each with 11 near-frontal images taken under different illuminations. Each image is resized to 32×32 pixels. The JAFFE face database consists of 10 individuals, each with 7 different facial expressions (6 basic facial expressions + 1 neutral). The images are resized to 26×26 pixels.
2) Evaluation on Digit/Letter Recognition: In this experiment, we address the digit/letter recognition problem on the BA database. The data set consists of the digits "0" through "9" and the capital letters "A" to "Z". Therefore, there are 36 classes, and each class has 39 samples.
3) Evaluation on Visual Object Recognition: We conduct the visual object recognition experiment on the COIL20 database. The database consists of 20 objects, with 72 images per object. For each object, the images were taken 5 degrees apart as the object rotated on a turntable. The size of each image is 32×32 pixels.
Similar to the clustering experiments, we construct 7 kernels for each data set: four Gaussian kernels with $t$ varying over a predefined set; a linear kernel $K(x, y) = x^{\top} y$; and two polynomial kernels $K(x, y) = (a + x^{\top} y)^{b}$.
5.2 Comparison Methods
We compare our method with several other state-of-the-art algorithms.

Local and Global Consistency (LGC) [\citeauthoryearZhou et al.2004]: LGC is a popular label propagation method. For this method, the kernel matrix is used to compute the similarity graph.

Gaussian Field and Harmonic Function (GFHF) [\citeauthoryearZhu et al.2003]: Different from LGC, GFHF is another mechanism that infers unknown labels as a process of propagating label information through the pairwise similarity.

Semi-supervised Classification with Adaptive Neighbours (SCAN) [\citeauthoryearNie et al.2017a]: Based on the adaptive neighbors method, SCAN shows much better performance than many other techniques.

A Unified Optimization Framework for SSL [\citeauthoryearLi et al.2015]: Li et al. propose a unified framework based on the self-expressiveness approach. Using a low-rank regularizer and a sparse regularizer, they obtain the SLRR and SR methods, respectively.
5.3 Performance Evaluation
We randomly choose 10%, 30%, and 50% of the samples as labeled data and repeat each setting 20 times. Classification accuracy and standard deviation are shown in Table 3. More concretely, for GFHF and LGC, the seven constructed kernels are tested and the best performance is reported. Unlike them, for SCAN, SLRR, SR, and SMKL, label prediction and graph learning are conducted in a unified framework.
As expected, the classification accuracy of all methods monotonically increases as the percentage of labeled samples grows. As can be observed, our SMKL method outperforms the other state-of-the-art methods in most cases. This confirms the effectiveness of the proposed method on the SSL task.
6 Conclusion
This paper proposes a novel multiple kernel learning framework for clustering and semi-supervised classification. In this model, a more flexible kernel learning strategy is developed to enhance the representation ability of the learned optimal kernel and to assign a weight to each base kernel. An iterative algorithm is designed to solve the resultant optimization problem, so that graph construction, label learning, and kernel learning are boosted by each other. Comprehensive experimental results clearly demonstrate the superiority of our method.
Acknowledgments
This paper was in part supported by two Fundamental Research Funds for the Central Universities of China (Nos. ZYGX2017KYQD177, ZYGX2016Z003), Grants from the Natural Science Foundation of China (No. 61572111) and a 985 Project of UESTC (No. A1098531023601041) .
Footnotes
 https://github.com/csliangdu/RMKKM
 https://github.com/sckangz/IJCAI2018
References
 Corinna Cortes. Can learning kernels help performance? Invited talk at the International Conference on Machine Learning (ICML 2009), Montréal, Canada, 2009.
 Liang Du, Peng Zhou, Lei Shi, Hanmo Wang, Mingyu Fan, Wenjian Wang, and YiDong Shen. Robust multiple kernel kmeans using l21norm. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3476–3482. AAAI Press, 2015.
 Ky Fan. On a theorem of weyl concerning eigenvalues of linear transformations i. Proceedings of the National Academy of Sciences, 35(11):652–655, 1949.
 Peter Gehler and Sebastian Nowozin. On feature combination for multiclass object classification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 221–228. IEEE, 2009.
 Mehmet Gönen and Ethem Alpaydin. Localized multiple kernel learning. In Proceedings of the 25th international conference on Machine learning, pages 352–359. ACM, 2008.
 Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods in machine learning. The annals of statistics, pages 1171–1220, 2008.
 HsinChien Huang, YungYu Chuang, and ChuSong Chen. Affinity aggregation for spectral clustering. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 773–780. IEEE, 2012.
 HsinChien Huang, YungYu Chuang, and ChuSong Chen. Multiple kernel fuzzy clustering. IEEE Transactions on Fuzzy Systems, 20(1):120–134, 2012.
 Jin Huang, Feiping Nie, and Heng Huang. A new simplex sparse learning model to measure data similarity for clustering. In IJCAI, pages 3569–3575, 2015.
 Shudong Huang, Zhao Kang, and Zenglin Xu. Selfweighted multiview clustering with soft capped norm. KnowledgeBased Systems, 2018.
 Shudong Huang, Zenglin Xu, and Jiancheng Lv. Adaptive local structure learning for document coclustering. KnowledgeBased Systems, 148:74–84, 2018.
 Zhao Kang, Chong Peng, and Qiang Cheng. Kerneldriven similarity learning. Neurocomputing, 267:210–219, 2017.
 Zhao Kang, Chong Peng, and Qiang Cheng. Twin learning for similarity and clustering: A unified kernel approach. In Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence (AAAI17). AAAI Press, 2017.
 Zhao Kang, Chong Peng, Qiang Cheng, and Zenglin Xu. Unified spectral clustering with optimal graph. In Proceedings of the ThirtySecond AAAI Conference on Artificial Intelligence (AAAI18). AAAI Press, 2018.
 ChunGuang Li, Zhouchen Lin, Honggang Zhang, and Jun Guo. Learning semisupervised representation towards a unified optimization framework for semisupervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2767–2775, 2015.
 Jun Liu, Jianhui Chen, Songcan Chen, and Jieping Ye. Learning the optimal neighborhood kernel for classification. In IJCAI, pages 1144–1149, 2009.
 Xinwang Liu, Jianping Yin, Lei Wang, Lingqiao Liu, Jun Liu, Chenping Hou, and Jian Zhang. An adaptive approach to learning optimal neighborhood kernels. IEEE transactions on cybernetics, 43(1):371–384, 2013.
 Xinwang Liu, Sihang Zhou, Yueqing Wang, Miaomiao Li, Yong Dou, En Zhu, Jianping Yin, and Han Li. Optimal neighborhood kernel clustering with multiple kernels. In AAAI, pages 2266–2272, 2017.
 Bojan Mohar, Y Alavi, G Chartrand, and OR Oellermann. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2(871898):12, 1991.
 Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849–856, 2002.
 Feiping Nie, Xiaoqian Wang, and Heng Huang. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 977–986. ACM, 2014.
 Feiping Nie, Guohao Cai, and Xuelong Li. Multiview clustering and semisupervised classification with adaptive neighbours. In AAAI, pages 2408–2414, 2017.
 Feiping Nie, Jing Li, and Xuelong Li. Selfweighted multiview clustering with multiple graphs. In Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, pages 2564–2570, 2017.
 Xi Peng, Shijie Xiao, Jiashi Feng, WeiYun Yau, and Zhang Yi. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.
 Zenglin Xu, Rong Jin, Irwin King, and Michael Lyu. An extended level method for efficient multiple kernel learning. In Advances in neural information processing systems, pages 1825–1832, 2009.
 Zenglin Xu, Rong Jin, Haiqin Yang, Irwin King, and Michael R Lyu. Simple and efficient multiple kernel learning by group lasso. In Proceedings of the 27th international conference on machine learning (ICML10), pages 1175–1182. Citeseer, 2010.
 Yang Yang, Fumin Shen, Zi Huang, Heng Tao Shen, and Xuelong Li. Discrete nonnegative spectral clustering. IEEE Transactions on Knowledge and Data Engineering, 29(9):1834–1845, 2017.
 Shi Yu, Tillmann Falck, Anneleen Daemen, LeonCharles Tranchevent, Johan AK Suykens, Bart De Moor, and Yves Moreau. L2-norm multiple kernel learning and its application to biomedical data fusion. BMC bioinformatics, 11(1):309, 2010.
 Denny Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in neural information processing systems, pages 321–328, 2004.
 Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semisupervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML03), pages 912–919, 2003.