Beyond Supervised Classification: Extreme Minimal Supervision with the Graph 1-Laplacian
We consider the task of classification when only an extremely small amount of labelled data is available. This problem is of great interest in several real-world applications, as obtaining large amounts of labelled data is expensive and time consuming. We present a novel semi-supervised framework for multi-class classification that is based on the non-smooth $\ell_1$ norm of a normalised Dirichlet energy based on the graph Laplacian. Our transductive framework is built on a novel functional with carefully selected class priors, which enforces a sufficiently smooth solution and strengthens the intrinsic relation between the labelled and unlabelled data. We demonstrate through extensive experimental results on the large datasets CIFAR-10 and ChestX-ray14 that our method outperforms classic methods and readily competes with recent deep-learning approaches.
In this era of big data, deep learning (DL) has reported astonishing results for different tasks in computer vision, including image classification e.g. krizhevsky2012imagenet ; hu2018squeeze , detection and segmentation, to name just a few. In particular, for the task of image classification, a major breakthrough has been reported in the setting of supervised learning. In this context, the majority of methods are based on deep convolutional neural networks including ResNet he2016deep , VGG simonyan2014very and SE-Net hu2018squeeze , for which pre-trained, fine-tuned and trained-from-scratch solutions have been considered. A key factor behind these impressive results is the assumption of a large corpus of labelled data. These labels can be generated either by humans or automatically on proxy tasks. However, obtaining well-annotated labels is expensive and time consuming, and one should account for human bias and uncertainty, which adversely affect the classification output. These drawbacks have made semi-supervised learning (SSL) a focus of great interest in the community. The key idea of SSL is to exploit both labelled and unlabelled data to produce a good classification output. The advantage of this setting is that one decreases the dependency on large amounts of well-annotated data whilst gaining further understanding of the relationships in the data. A comprehensive review of SSL can be found in chapelle2006semi . In the transductive setting, several algorithmic approaches have been proposed, such as zhu2002learning ; zhou2004learning ; wang2008graph ; zhu2002learning5 ; joachims2003transductive ; zhang2011fast , whilst in the inductive setting promising results have also been reported, including weston2008deep ; tarvainen2017mean . More recently, DL for semi-supervised learning has been explored in both settings, such as in laine2016temporal ; tarvainen2017mean ; iscen2019label .
We refer the reader to fergus2009semi ; dai2013ensemble for a detailed review of SSL for image classification. In this work, we focus on the transductive setting for image classification with the normalised Dirichlet energy (1) based on the graph Laplacian. Promising results have been shown in this context: for example, the seminal algorithm of zhou2004learning was introduced to perform such a graph transduction through the propagation of a few labels by the minimisation of energy (1) for $p=2$. Later machine learning studies nevertheless showed that the use of non-smooth energies with the $p=1$ norm, related to non-local total variation, can achieve better clustering performance HH , but the original algorithms only approximated $p=1$. More advanced optimisation tools were therefore proposed to consider the exact $p=1$ norm for binary hein2013total or multi-class bresson2013multiclass graph transduction. As underlined in vonLuxburg2007 , the normalisation of the operator is nevertheless crucial to ensure within-cluster similarity when the degrees of the nodes are broadly distributed in the graph. Contributions. In order to address these different issues, we propose a new graph-based semi-supervised framework called EMS-1L. The novelty of our framework largely relies on:
A new multi-class classification functional based on the normalised and non-smooth ($p=1$) energy (1), where the selection of carefully chosen class priors enforces a sufficiently smooth solution that strengthens the intrinsic relation between the labelled and unlabelled data.
We demonstrate that our framework accurately learns to classify different challenging datasets such as ChestX-ray14, with a performance comparable to state-of-the-art DL techniques, whilst using a far smaller amount of labelled data.
We show that our framework can be extended to deep SSL, and that it achieves the lowest error rate in comparison with state-of-the-art SSL approaches on the CIFAR-10 dataset.
2 Extreme Minimal Supervision with the Normalised Dirichlet Energy: Preliminaries
Formally speaking, we aim at solving the following problem. Given a small amount of labelled data with provided labels and and a large amount of unlabelled data , we seek to infer a function such that gets a good estimate for . This problem is illustrated in Figure 1, where visualisations were obtained from one of our experiments. To address this problem, we consider functions defined over a set of nodes. In this work, we focus on convex and absolutely $p$-homogeneous (i.e. ) non-local functionals of the form:
with weights taken such that the vector has non null entries satisfying: . This energy acts on the graph defined by nodes and weights . With respect to classical Dirichlet energies associated to the graph $p$-Laplacian andreu2008nonlocal ; elmoataz2008nonlocal ; hein2013total ; bresson2013multiclass , it includes a normalisation through the rescaling with the degree of the node. We will focus our attention on the non-smooth case with the absolutely one homogeneous energy defined by the function that can be rewritten as:
with a diagonal matrix , containing the node degrees so that , and a matrix that encodes the edges in the graph. Each of these edges is represented on a different row of the sparse matrix , with the value (resp. ) on the column (resp. ).
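As a minimal illustration, the following Python sketch evaluates a degree-normalised, non-smooth energy on a toy graph. It assumes that the energy takes the form $\sum_{(i,j)} w_{ij}\,|u_i/d_i - u_j/d_j|$, i.e. the $\ell_1$ norm of the edge matrix applied to the degree-rescaled vector; this form, the function names and the toy weights are assumptions made for the sketch and are not taken from our implementation.

```python
def degrees(n_nodes, edges):
    """Degree of each node: d_i = sum of the weights of its incident edges."""
    d = [0.0] * n_nodes
    for i, j, w in edges:
        d[i] += w
        d[j] += w
    return d


def normalised_tv(u, edges, d):
    """Assumed form of the energy: sum over edges of w_ij * |u_i/d_i - u_j/d_j|.

    Each edge (i, j, w) plays the role of one row of the sparse edge matrix,
    applied to the degree-rescaled vector u / d.
    """
    return sum(w * abs(u[i] / d[i] - u[j] / d[j]) for i, j, w in edges)


# Toy path graph 0 - 1 - 2 - 3 with hypothetical weights.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0)]
d = degrees(4, edges)
u = [1.0, 0.0, 0.0, -1.0]

energy = normalised_tv(u, edges, d)

# Adding a multiple of the degree vector leaves this energy unchanged.
shifted = [ui + 0.5 * di for ui, di in zip(u, d)]
energy_shifted = normalised_tv(shifted, edges, d)

# Absolute one-homogeneity: J(a * u) = |a| * J(u).
energy_scaled = normalised_tv([-3.0 * ui for ui in u], edges, d)
```

Under the assumed form, the invariance of the energy along the degree vector is what makes a degree-weighted zero mean condition natural for its subgradients.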
Let us first define as the set of possible subdifferentials of : . Any absolutely one homogeneous function checks:
so that . For the particular function defined in (2), we can observe that
Considering the finite dimension setting, there exists such that , . We also have the following property.
For all , with defined in (2), one has
Observing that and using (4) we have that such that
Since the weights satisfy , then for all :
Eigenfunctions of any functional satisfy . For being the nonlocal total variation, (i.e. when is constant), eigenfunctions are known to be essential tools to provide a relevant clustering of the graph vonLuxburg2007 . Methods HH ; NIPS2012_4726 ; bresson2013adaptive ; AGP18 ; Feld have thus been designed to estimate such eigenfunctions through the local minimisation of the Rayleigh quotient, which reads:
with another absolutely one homogeneous function , that is typically a norm. Taking as the norm, one can recover eigenfunctions of AGP18 . For being the norm, these approaches can compute bi-valued functions that are local minima of (5) and eigenfunctions of Feld . Being bivalued, these estimations can easily be used to realise a partition of the domain. Such schemes also relate to the Cheeger cut of the graph induced by nodes and edges . Balanced cuts can also be obtained by considering bresson2013multiclass ; Feld . A last point to underline comes from Proposition 1, that states that eigenfunctions should be orthogonal to . It is thus important to design schemes that ensure this property.
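To make the ratio concrete, the sketch below evaluates a Rayleigh quotient whose numerator is the assumed degree-normalised energy $\sum_{(i,j)} w_{ij}\,|u_i/d_i - u_j/d_j|$ and whose denominator is either the $\ell_1$ or the $\ell_2$ norm. The function names, the assumed numerator and the toy graph are hypothetical choices for illustration only; the snippet evaluates the quotient and does not minimise it.

```python
import math


def normalised_tv(u, edges, d):
    """Assumed numerator: sum over edges of w_ij * |u_i/d_i - u_j/d_j|."""
    return sum(w * abs(u[i] / d[i] - u[j] / d[j]) for i, j, w in edges)


def rayleigh_quotient(u, edges, d, norm="l1"):
    """Ratio of the energy to the l1 or l2 norm of u."""
    if norm == "l1":
        h = sum(abs(x) for x in u)
    elif norm == "l2":
        h = math.sqrt(sum(x * x for x in u))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if h == 0.0:
        raise ValueError("the quotient is undefined for u = 0")
    return normalised_tv(u, edges, d) / h


# Toy path graph 0 - 1 - 2 - 3 with hypothetical weights and its degrees.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0)]
d = [1.0, 3.0, 3.0, 1.0]
u = [1.0, 0.0, 0.0, -1.0]

r1 = rayleigh_quotient(u, edges, d, norm="l1")
r2 = rayleigh_quotient(u, edges, d, norm="l2")
# Both terms are absolutely one homogeneous, so the ratio ignores rescaling.
r1_scaled = rayleigh_quotient([5.0 * x for x in u], edges, d, norm="l1")
```

Since the quotient is invariant to rescaling, iterative schemes of this kind can renormalise the current estimate at each step without changing its value.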
3 Classifying under Extreme Minimal Supervision with the Normalised Dirichlet Energy
In the following, instead of , we will denote by the value of function at node . In order to realise a binary partition of the domain of the graph through the minimisation of the quotient , we adapt the method of Feld to incorporate the scaling of (2) and consider the semi-explicit PDE:
with , , . We recall that both and are absolutely one homogeneous and satisfy (3). Since , , the shift with is necessary to show the convergence of the PDE as we have , for and . Such a sequence satisfies the following properties.
For , the trajectory given by (6) satisfies
is non increasing,
The proof is given in the Supplementary Material. It namely uses the fact that is the unique minimiser of:
Hence, we can show the convergence of the trajectory.
The sequence defined in (6) converges to a non constant steady point .
As is the unique minimizer of in (7) that checks , and as we have , we get
so that converges to . Since all the quantities are bounded, we can show (see Feld , Theorem 2.1) that up to a subsequence . From Proposition 2, the points being of constant norm and being zero (with positive weights ), the limit point of the trajectory (6) necessarily has negative and positive entries. ∎
In practice, to realise a partition of the graph with the scheme (6), we minimise the functional (7) at each iteration with the primal-dual algorithm in CP11 to obtain , and then normalise this estimate. As it is non-constant and satisfies , the limit of the scheme can be used for partitioning with the simple criterion .
We now aim at finding coupled functions that are all local minima of the ratio . The issue is to define a good coupling constraint between the ’s such that it is easy to project on. Let , we here consider the simple linear coupling :
Projection on this linear constraint is explicit with a simple shift of the vector for each node . On the other hand, simplex constraint (, , ) requires more expensive projections of the vectors on the simplex. Last, projection on the orthogonal constraint of the ’s is a non convex problem.
Contrary to the simplex constraint, it is compatible with the weighted zero mean condition that any eigenfunction of should satisfy, as shown in Proposition 1.
The characteristic function of a linear constraint is absolutely one homogeneous. This leads to a natural extension of the binary case.
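The explicit projection mentioned above can be sketched as follows, assuming that the linear coupling requires the class values to sum to zero at every node. That reading of the constraint, the function name and the array layout `U[k][i]` (class `k`, node `i`) are assumptions made for this illustration.

```python
def project_zero_sum(U):
    """Project class scores onto {sum over classes of U[k][i] = 0 for every node i}.

    The Euclidean projection onto this linear constraint is a per-node shift:
    the mean over classes is subtracted from each class value.
    """
    n_classes, n_nodes = len(U), len(U[0])
    projected = [row[:] for row in U]
    for i in range(n_nodes):
        shift = sum(U[k][i] for k in range(n_classes)) / n_classes
        for k in range(n_classes):
            projected[k][i] -= shift
    return projected


# Three classes, two nodes (hypothetical values).
U = [[1.0, 2.0],
     [3.0, 0.0],
     [2.0, 1.0]]
P = project_zero_sum(U)
node_sums = [sum(P[k][i] for k in range(3)) for i in range(2)]
```

The cost is linear in the number of entries, whereas a projection onto the simplex requires sorting or thresholding the values at each node.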
We now consider the problem:
To find a local minimum of (10), we define the iterative multi-class functional, which reads:
where and is the characteristic function of the constraints (9). Starting from an initial point that satisfies the constraint () and has been normalised (), the scheme we consider reads:
where and , and the point in the above PDE corresponds to the global minimiser of (11). Notice that the subgradient of the one homogeneous functional can be characterised with:
In practice, if for some , vanishes, then we define for the next iteration. With such assumptions, the sequence has the following properties, which are shown in the Supplementary Material.
For , , the trajectory given by (12) satisfies
Point 3 of Proposition 4 contains weights that prevent us from showing the exact decrease of the sum of ratios. This is thus similar to the approach in bresson2013multiclass . To ensure the decrease of the sum of ratios , it is possible to introduce auxiliary variables dealing with individual ratio decrease, as in rangapuram2014tight . The sub-problem involved at each iteration is nevertheless more complex to solve. Also notice that, as there is no prior information on the nodes’ labels, clusters can vanish or become proportional to one another. Such issues cannot, however, occur in the transductive setting we now consider.
Label Propagation: Multi-Class Classification.
The previous settings are unsupervised. We now consider a semi-supervised setting where we know small subsets of labeled nodes (with ) belonging to each cluster , with . Denoting , the objective is to propagate the prior information in the graph in order to predict the labels of the remaining nodes . To that end, we simply have to modify the coupling constraint in (9) as
With such a constraint, clusters can no longer vanish or merge since they all contain different active nodes satisfying . The same PDE (12) can be applied to propagate these labels. Once it has converged, the label of each node is taken as:
Soft labelling can also be obtained by considering all the clusters with non-negative weights with relative weights and the convention that , in the case (that has never been observed in our experiments) that for all . The parameter in (14) is set to a small numerical value. Indeed, even if by construction, a small is required to ensure that, after the rescaling, . One can consider different values for each class. In the case where , is constant and , is expected to be bivalued Feld and the value of has a clear meaning. In that framework, corresponds to no prior on the size of the clusters, whereas encourages the clusters to be of homogeneous size.
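The labelling step can be sketched as follows. We assume here that the hard label of a node is the class with the largest value, and that soft labels are the non-negative parts of the class values rescaled to sum to one. The uniform fallback used when no class value is positive is an assumption of this sketch, as are the function names and the layout `U[k][i]` (class `k`, node `i`).

```python
def hard_labels(U):
    """Index of the largest class value at each node (assumed labelling rule)."""
    n_classes, n_nodes = len(U), len(U[0])
    return [max(range(n_classes), key=lambda k: U[k][i]) for i in range(n_nodes)]


def soft_labels(U):
    """Relative weights of the classes with non-negative values at each node.

    Returns soft[i][k]. When no class value is positive at a node, a uniform
    distribution is returned; this fallback is an assumption of the sketch.
    """
    n_classes, n_nodes = len(U), len(U[0])
    soft = []
    for i in range(n_nodes):
        positive = [max(U[k][i], 0.0) for k in range(n_classes)]
        total = sum(positive)
        if total > 0.0:
            soft.append([p / total for p in positive])
        else:
            soft.append([1.0 / n_classes] * n_classes)
    return soft


# Three classes, three nodes (hypothetical converged values).
U = [[3.0, -1.0, -0.2],
     [1.0, 2.0, -0.1],
     [-4.0, -1.0, -0.3]]
labels = hard_labels(U)
soft = soft_labels(U)
```

In this toy example the third node has only negative values, so its hard label is still well defined while its soft label falls back to the assumed uniform convention.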
4 Experimental Results
This section describes in detail the experiments that we conducted to evaluate our proposed approach.
4.1 Implementation Details
We here describe the specifics of our experimental setting, including the data description and the evaluation methodology. Data Description. We validate our approach using three datasets: one small-scale and two large-scale datasets. 1) For the UCI ML hand-written digits dataset, we use the test set composed of images of size , and 10 classes. 2) The ChestX-ray14 dataset wang2017chestx is composed of 112,120 frontal chest view X-rays of size $1024\times1024$, with 14 classes. 3) The CIFAR-10 dataset contains 60,000 color images of size $32\times32$ in 10 different classes. All classification experiments were performed using these datasets. Evaluation Protocol. We design the following evaluation scheme to validate our theory. Firstly, we evaluate our proposed EMS-1L approach against two classic methods: Label Propagation (LP) zhu2002learning and Local and Global Consistency (LCG) zhou2004learning . For output quality evaluation, we computed the error rate and F1-score. Secondly, using the ChestX-ray14 dataset wang2017chestx , we compared our approach against two deep learning approaches, WANG17 wang2017chestx and YAO18 yao2018weakly . The quality of the classification was assessed by a ROC analysis using the area under the curve (AUC). Finally, we demonstrate that our method can be extended to deep SSL; this extension is evaluated on the CIFAR-10 dataset and compared against state-of-the-art deep SSL methods salimans2016improved ; shi2018transductive ; tarvainen2017mean ; iscen2019label and a fully supervised technique luo2018smooth . For this part, we evaluate the quality of the classifiers by reporting the error rate for a range of numbers of labelled samples. Each experiment has been repeated 10 times and the average and standard deviation are reported. For the compared methods, the parameters were set using the default values provided in the demo code or referenced in the papers themselves.
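For reference, the error rate and F1-score used in this protocol can be computed as in the following sketch. The averaging mode of the F1-score and the type of standard deviation are not fixed by the description above, so the macro average and the sample standard deviation used here are assumptions; the function names and the toy labels are hypothetical.

```python
from statistics import mean, stdev


def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarise(values):
    """Average and sample standard deviation over repeated runs."""
    return mean(values), stdev(values)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
err = error_rate(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
avg, std = summarise([0.10, 0.12, 0.14])  # e.g. error rates of three runs
```

With repeated runs, `summarise` is applied to the list of per-run scores to obtain the reported average and deviation.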
4.2 How good is EMS-1L?
We start by giving some insight into the performance of our approach with a comparison against two classic methods, LP zhu2002learning and LCG zhou2004learning , whose results on the digits dataset are reported in Table 1. One can see that for all metrics and percentages of labelled samples, our approach outperforms the compared methods by a significant margin. In particular, one can observe that even with 1% of labelled data, the error rate of our EMS-1L approach is almost half that of the second best method, and this trend carries over to the remaining percentages of labelled samples and evaluation metrics. This shows that our EMS-1L approach outperforms the compared methods even with extremely few labelled samples.
Table 1: Error rate and F1-score of LP, LCG and EMS-1L on the digits dataset, by percentage of labelled samples.
To further evaluate the results of our approach, we move to a large-scale dataset, ChestX-ray14. Our motivation for using this dataset comes from a central problem in medical imaging, namely the lack of reliable, high-quality annotated data. In particular, the interpretation of X-ray data heavily relies on the radiologist’s expertise and there is still a substantial clinical error on the outcome bruno2015understanding . We ran our approach and compared it against two state-of-the-art works on X-ray classification, WANG17 wang2017chestx and YAO18 yao2018weakly , which are supervised methods and, therefore, assume a large corpus of annotated data.
|SNTG luo2018smooth (Fully Supervised)||46.43±1.21||33.94±0.73||20.66±0.57|
|DSSL iscen2019label (diffusion+W)||22.02±0.88||15.66±0.35||12.69±0.29|
In Figure 2, we show a few sample outputs that were correctly classified by our approach. Table 2 shows the AUC averaged over all classes for our approach compared against WANG17 wang2017chestx , YAO18 yao2018weakly , and MT tarvainen2017mean using the official data partition. From the table, one can see that our EMS-1L approach outperformed the compared methods with only 20% of the data, whilst the compared approaches rely on 70% of annotated data. Moreover, we noticed that the classification output is very stable with respect to changes in the partition of the dataset, which is due to the semi-supervised nature of our EMS-1L approach. This is well reflected in Figure 3, where we show the AUC results of both EMS-1L and WANG17 wang2017chestx using three different random data partitions, including the partition suggested by WANG17 wang2017chestx . The plot shows that WANG17 is sensitive to changes in the partition, which can be explained by the fact that supervised methods heavily rely on the training set being representative. On the other hand, EMS-1L showed minimal change in performance over the three different partitions, as the underlying graphical representation is invariant to the partition. To further analyse the dependency on the partitioning and show the advantage of EMS-1L, we compare the AUC results of EMS-1L against WANG17 and MT using random data partitions. The results are reported in Figure 4, which shows that EMS-1L produces a more accurate classification using only 2% of the data labels than WANG17 or MT do using 70% of the data labels. The plot also shows that as we feed EMS-1L more data labels, the classification accuracy increases and significantly outperforms the compared approaches whilst still using a far smaller amount of data labels.
4.3 Deep EMS-1L: An Alternative View
One interesting observation about our proposed framework is the fact that it can be adapted to DL for semi-supervised learning (SSL). To show this ability, we followed the philosophy of iscen2019label , who considered the seminal work LCG zhou2004learning . We used their pseudo-labelling approach and plugged in our EMS-1L (i.e. we replaced LCG with our approach). We then performed the image classification task on the CIFAR-10 dataset for different numbers of labelled samples. The results of this experiment can be seen in Table 2, in which we show as a baseline a fully supervised approach luo2018smooth followed by four state-of-the-art DL semi-supervised approaches salimans2016improved ; shi2018transductive ; tarvainen2017mean ; iscen2019label . One can observe that the lowest error rate across different counts of labelled samples is achieved by our extension Deep EMS-1L. After a detailed inspection of the table, we observe that even though the outputs generated with SSL-GAN salimans2016improved started close to our score, they did not improve significantly as the number of samples increased.
In this work, we addressed the problem of classifying under minimal supervision (i.e. SSL), in particular in the transductive setting. We proposed a new semi-supervised framework built on a novel optimisation model for the task of image classification. From extensive experimental results, we found the following. Firstly, we showed that our approach significantly outperformed the classic SSL methods. Secondly, we evaluated our EMS-1L method on the task of X-ray classification and demonstrated that our approach competes with the state-of-the-art results in this context whilst requiring an extremely small amount of labelled data. Finally, to demonstrate the capabilities of our approach, we showed that it can be extended to a deep SSL framework. In this context we observed the lowest error rates on CIFAR-10 with respect to the state-of-the-art SSL methods. Future work will include investigating our approach in terms of data aggregation and how to handle unseen classes.
This work was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 777826. Support from the CMIH, University of Cambridge is gratefully acknowledged.
-  F. Andreu, J. Mazón, J. Rossi, and J. Toledo. A nonlocal p-laplacian evolution equation with neumann boundary conditions. Journal de mathématiques pures et appliquées, 90(2):201–227, 2008.
-  J. Aujol, G. Gilboa, and N. Papadakis. Theoretical analysis of flows estimating eigenfunctions of one-homogeneous functionals. SIAM Journal on Imaging Sciences, 11(2):1416–1440, 2018.
-  X. Bresson, T. Laurent, D. Uminsky, and J. Von Brecht. Convergence and energy landscape for Cheeger Cut clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1385–1393, 2012.
-  X. Bresson, T. Laurent, D. Uminsky, and J. Von Brecht. Multiclass total variation clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1421–1429, 2013.
-  X. Bresson, T. Laurent, D. Uminsky, and J. H. Von Brecht. An adaptive total variation algorithm for computing the balanced cut of a graph. arXiv preprint arXiv:1302.2717, 2013.
-  M. A. Bruno, E. A. Walker, and H. H. Abujudeh. Understanding and confronting our mistakes: the epidemiology of error in radiology and strategies for error reduction. Radiographics, 35(6):1668–1676, 2015.
-  T. Bühler and M. Hein. Spectral clustering based on the graph p-laplacian. International Conference on Machine Learning (ICML), 2009.
-  A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40:120–145, 2011.
-  O. Chapelle, B. Scholkopf, and A. Zien. Semi-supervised learning. MIT Press, 20(3):542–542, 2006.
-  D. Dai and L. Van Gool. Ensemble projection for semi-supervised image classification. In IEEE International Conference on Computer Vision (ICCV), pages 2072–2079, 2013.
-  L. Dodero, A. Gozzi, A. Liska, V. Murino, and D. Sona. Group-wise functional community detection through joint laplacian diagonalization. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 708–715. Springer, 2014.
-  A. Elmoataz, O. Lezoray, and S. Bougleux. Nonlocal discrete regularization on weighted graphs: a framework for image and manifold processing. IEEE transactions on Image Processing, 17(7):1047–1060, 2008.
-  T. Feld, J. Aujol, G. Gilboa, and N. Papadakis. Rayleigh quotient minimization for absolutely one-homogeneous functionals. Inverse Problems, 2019.
-  R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In Advances in neural information processing systems (NIPS), pages 522–530, 2009.
-  Y. Gao, E. Adeli-M, M. Kim, P. Giannakopoulos, S. Haller, and D. Shen. Medical image retrieval using multi-graph learning for MCI diagnostic assistance. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 86–93, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
-  M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram. The total variation on hypergraphs-learning on hypergraphs revisited. In Advances in Neural Information Processing Systems (NIPS), pages 2427–2435, 2013.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In IEEE conference on computer vision and pattern recognition (CVPR), pages 7132–7141, 2018.
-  A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Label propagation for deep semi-supervised learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  T. Joachims. Transductive learning via spectral graph partitioning. In International Conference on Machine Learning (ICML), pages 290–297, 2003.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
-  S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. International Conference on Learning Representations (ICLR), 2017.
-  Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8896–8905, 2018.
-  S. S. Rangapuram, P. K. Mudrakarta, and M. Hein. Tight continuous relaxation of the balanced k-cut problem. In Advances in Neural Information Processing Systems (NIPS), pages 3131–3139, 2014.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in neural information processing systems (NIPS), pages 2234–2242, 2016.
-  W. Shi, Y. Gong, C. Ding, Z. Ma, X. Tao, and N. Zheng. Transductive semi-supervised deep learning using min-max features. In European Conference on Computer Vision (ECCV), pages 299–315, 2018.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.
-  A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems (NIPS), pages 1195–1204, 2017.
-  U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
-  J. Wang, T. Jebara, and S.-F. Chang. Graph transduction via alternating minimization. In International conference on Machine learning (ICML), pages 1144–1151. ACM, 2008.
-  X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In IEEE conference on computer vision and pattern recognition (CVPR), pages 2097–2106, 2017.
-  J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In International conference on Machine learning (ICML), pages 1168–1175, 2008.
-  L. Yao, J. Prosky, E. Poblenz, B. Covington, and K. Lyman. Weakly supervised medical diagnosis and localization from multiple resolutions. arXiv preprint arXiv:1803.07703, 2018.
-  Y.-M. Zhang, K. Huang, and C.-L. Liu. Fast and robust graph-based transductive learning via minimum tree cut. In IEEE International Conference on Data Mining, pages 952–961, 2011.
-  D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in neural information processing systems, pages 321–328, 2004.
-  X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
-  X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In International conference on Machine learning (ICML’03), pages 912–919, 2003.
Supplementary Material for:
Beyond Supervised Classification: Extreme Minimal Supervision with the Graph 1-Laplacian
This supplementary material provides further details and proofs that support the content of the main paper, in particular the proofs of Proposition 2 and Proposition 3.
Appendix A Proofs
A.1 Proof of Proposition 2
For , we have
where we used Proposition 1 in the right part of the previous relation to get . We conclude with the fact that is a rescaling of .
Since is a norm, it is absolutely one homogeneous and . Next, we observe that and we get
We then conclude with the fact that .
Since for all and , then . Next, we recall that . Hence we have
where the final rescaling with is possible since and are absolutely one homogeneous functions.
In the finite dimension setting, there exists such that and for an absolutely one homogeneous functional defined in (2) and a norm . Then one has
Hence from the equivalence of norms in finite dimensions, there exists such that .
A.2 Proof of Proposition 3
We follow the point 2 of Proposition 2 to first get: , for . Then, as , we deduce that . Next we have
Summing on , we get
Notice that we defined for . As is a norm, the equivalence of norms in finite dimensions implies that is bounded by some constant . We then have .
Since is the global minimizer of (11), then: