Beyond Supervised Classification: Extreme Minimal Supervision with the Graph 1-Laplacian
Abstract
We consider the task of classification when only an extremely small amount of labelled data is available. This problem is of great interest in several real-world applications, as obtaining large amounts of labelled data is expensive and time-consuming. We present a novel semi-supervised framework for multi-class classification that is based on the non-smooth $\ell_1$ norm of a normalised Dirichlet energy based on the graph Laplacian. Our transductive framework is built on a novel functional with carefully selected class priors, which enforces a sufficiently smooth solution and strengthens the intrinsic relation between the labelled and unlabelled data. We demonstrate through extensive experiments on the large datasets CIFAR-10 and ChestX-ray14 that our method outperforms classic methods and readily competes with recent deep-learning approaches.
1 Introduction
In this era of big data, deep learning (DL) has reported astonishing results for different tasks in computer vision, including image classification (e.g. krizhevsky2012imagenet ; hu2018squeeze ), detection and segmentation, to name just a few. In particular, for the task of image classification, a major breakthrough has been reported in the setting of supervised learning. In this context, the majority of methods are based on deep convolutional neural networks, including ResNet he2016deep , VGG simonyan2014very and SENet hu2018squeeze , for which pretrained, fine-tuned and trained-from-scratch solutions have been considered. A key factor behind these impressive results is the assumption of a large corpus of labelled data. These labels can be generated either by humans or automatically on proxy tasks. However, obtaining well-annotated labels is expensive and time-consuming, and one should account for human bias and uncertainty, which adversely affect the classification output. These drawbacks have made semi-supervised learning (SSL) a focus of great interest in the community. The key idea of SSL is to exploit both labelled and unlabelled data to produce a good classification output. The advantage of this setting is that one decreases the dependency on large amounts of well-annotated data whilst gaining further understanding of the relationships in the data. A comprehensive review of SSL can be found in chapelle2006semi . In the transductive setting, several algorithmic approaches have been proposed, such as zhu2002learning ; zhou2004learning ; wang2008graph ; zhu2002learning5 ; joachims2003transductive ; zhang2011fast , whilst in the inductive setting promising results have also been reported, including weston2008deep ; tarvainen2017mean . More recently, DL for semi-supervised learning has been explored in both settings, for example in laine2016temporal ; tarvainen2017mean ; iscen2019label .
We refer the reader to fergus2009semi ; dai2013ensemble for a detailed review of SSL for image classification. In this work, we focus on the transductive setting for image classification with the normalised Dirichlet energy (1) based on the graph Laplacian. Promising results have been shown in this context. For example, the seminal algorithm of zhou2004learning performs such a graph transduction by propagating a few labels through the minimisation of energy (1) for $p=2$. Later machine learning studies nevertheless showed that non-smooth energies with the $\ell_1$ norm, related to non-local total variation, can achieve better clustering performance HH , but the original algorithms only approximated the $\ell_1$ norm. More advanced optimisation tools were therefore proposed to handle the exact $\ell_1$ norm for binary hein2013total or multi-class bresson2013multiclass graph transduction. As underlined in vonLuxburg2007 , the normalisation of the operator is nevertheless crucial to ensure within-cluster similarity when the degrees of the nodes are broadly distributed in the graph.
Contributions. In order to address these different issues, we propose a new graph-based semi-supervised framework called EMS1L. Our main contributions are as follows:

We propose a new multi-class classification functional based on the normalised and non-smooth ($p=1$) energy (1), where the selection of carefully chosen class priors enforces a sufficiently smooth solution that strengthens the intrinsic relation between the labelled and unlabelled data.

We demonstrate that our framework accurately learns to classify different challenging datasets such as ChestX-ray14, with a performance comparable to state-of-the-art DL techniques, whilst using a far smaller amount of labelled data.

We show that our framework can be extended to deep SSL, and that it achieves the lowest error rate in comparison with state-of-the-art SSL approaches on the CIFAR-10 dataset.
2 Extreme Minimal Supervision with the Normalised Dirichlet Energy: Preliminaries
Formally speaking, we aim at solving the following problem. Given a small amount of labeled data with provided labels and and a large amount of unlabelled data , we seek to infer a function such that gets a good estimate for . This problem is illustrated in Figure 1, where visualisations were obtained from one of our experiments. For addressing this problem, we consider functions defined over a set of nodes. The main focus of interest in this work are convex and absolutely homogeneous (i.e. ) nonlocal functionals of the form:
(1) 
with weights taken such that the vector has non-null entries satisfying: . This energy acts on the graph defined by nodes and weights . Compared with classical Dirichlet energies associated with the graph Laplacian andreu2008nonlocal ; elmoataz2008nonlocal ; hein2013total ; bresson2013multiclass , it includes a normalisation through the rescaling with the degree of the node. We will focus our attention on the non-smooth case with the absolutely one-homogeneous energy defined by the function that can be rewritten as:
(2) 
with a diagonal matrix , containing the node degrees so that , and a matrix that encodes the edges of the graph. Each of these edges is represented on a different row of the sparse matrix , with the value (resp. ) in column (resp. ).
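As a minimal numerical sketch of the energy described above, the snippet below assumes that (2) takes the form $J(u) = \|K D^{-1} u\|_1$, where each row of $K$ carries $w_{ij}$ in column $i$ and $-w_{ij}$ in column $j$, and $D$ is the diagonal degree matrix. The symbols $K$ and $D$, the function names `edge_matrix` and `normalised_energy`, and the toy graph are illustrative assumptions rather than the implementation used in the experiments.

```python
import numpy as np


def edge_matrix(W):
    """Edge matrix K (assumed form): one row per edge (i, j) with i < j,
    holding w_ij in column i and -w_ij in column j."""
    n = W.shape[0]
    rows = []
    for i in range(n):
        for j in range(i + 1, n):
            if W[i, j] != 0:
                row = np.zeros(n)
                row[i], row[j] = W[i, j], -W[i, j]
                rows.append(row)
    return np.array(rows)


def normalised_energy(u, W):
    """Assumed energy J(u) = || K D^{-1} u ||_1 with degrees d_i = sum_j w_ij."""
    d = W.sum(axis=1)  # node degrees, assumed non-zero
    return np.abs(edge_matrix(W) @ (u / d)).sum()


# Toy path graph 0 - 1 - 2 with edge weights 1 and 2 (hypothetical example).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
d = W.sum(axis=1)
u = np.array([1.0, 0.0, -1.0])

J_u = normalised_energy(u, W)
J_scaled = normalised_energy(-3.0 * u, W)      # absolute one-homogeneity
J_shifted = normalised_energy(u + 5.0 * d, W)  # invariance along the degree vector
```

Under this assumed form, scaling `u` by a factor multiplies the energy by the absolute value of that factor, and adding any multiple of the degree vector leaves the energy unchanged.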
Subdifferential
Let us first define as the set of possible subgradients of : . Any absolutely one-homogeneous function satisfies:
(3) 
so that . For the particular function defined in (2), we can observe that
(4) 
Considering the finite dimension setting, there exists such that , . We also have the following property.
Proposition 1.
For all , with defined in (2), one has
Proof.
Eigenfunction.
Eigenfunctions of any functional satisfy . For being the nonlocal total variation, (i.e. when is constant), eigenfunctions are known to be essential tools to provide a relevant clustering of the graph vonLuxburg2007 . Methods HH ; NIPS2012_4726 ; bresson2013adaptive ; AGP18 ; Feld have thus been designed to estimate such eigenfunctions through the local minimisation of the Rayleigh quotient, which reads:
(5) 
with another absolutely one homogeneous function , that is typically a norm. Taking as the norm, one can recover eigenfunctions of AGP18 . For being the norm, these approaches can compute bivalued functions that are local minima of (5) and eigenfunctions of Feld . Being bivalued, these estimations can easily be used to realise a partition of the domain. Such schemes also relate to the Cheeger cut of the graph induced by nodes and edges . Balanced cuts can also be obtained by considering bresson2013multiclass ; Feld . A last point to underline comes from Proposition 1, that states that eigenfunctions should be orthogonal to . It is thus important to design schemes that ensure this property.
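The orthogonality property can be checked in a few lines. The sketch below assumes, as the description of (2) suggests, that $J(u) = \|K D^{-1} u\|_1$ with every row of $K$ summing to zero, and writes $d = D\mathbf{1}$ for the vector of degrees; the symbols $K$, $D$ and $d$ are notation chosen for this sketch.

```latex
% Assumption: J(u) = \|K D^{-1} u\|_1, each row of K sums to zero, d = D\mathbf{1}.
\begin{align*}
J(\pm d) &= \|K D^{-1} d\|_1 = \|K \mathbf{1}\|_1 = 0, \\
p \in \partial J(u) &\;\Longrightarrow\; J(v) \ge \langle p, v\rangle \quad \forall v
  \quad \text{(convexity and absolute one-homogeneity)}, \\
v = \pm d &\;\Longrightarrow\; 0 \ge \pm \langle p, d\rangle
  \;\Longrightarrow\; \langle p, d\rangle = 0 .
\end{align*}
```

Under the usual definition $\lambda u \in \partial J(u)$ with $\lambda \neq 0$, an eigenfunction is a rescaled subgradient and therefore inherits this weighted zero-mean condition.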
3 Classifying under Extreme Minimal Supervision with the Normalised Dirichlet Energy
In the following, instead of , we will denote by the value of function at node . In order to realise a binary partition of the domain of the graph through the minimisation of the quotient , we adapt the method of Feld to incorporate the scaling of (2) and consider the semiexplicit PDE:
(6) 
with , , . We recall that both and are absolutely one homogeneous and satisfy (3). Since , , the shift with is necessary to show the convergence of the PDE as we have , for and . Such sequence satisfies the following properties.
Proposition 2.
For , the trajectory given by (6) satisfies

,

,

is non increasing,

.
The proof is given in the Supplementary Material. It relies in particular on the fact that is the unique minimiser of:
(7) 
Hence, we can show the convergence of the trajectory.
Proposition 3.
The sequence defined in (6) converges to a non constant steady point .
Proof.
As is the unique minimizer of in (7) that checks , and as we have , we get
(8) 
Since is the orthogonal projection of on the ball then . Finally, from point 4 of Proposition 2, we have that . We then sum relation (8) from to and deduce that:
so that converges to . Since all the quantities are bounded, we can show (see Feld , Theorem 2.1) that up to a subsequence . From Proposition 2, the points being of constant norm and being zero (with positive weights ), the limit point of the trajectory (6) necessarily has negative and positive entries. ∎
In practice, to realise a partition of the graph with the scheme (6), we minimise the functional (7) at each iteration with the primal-dual algorithm of CP11 to obtain , and then normalise this estimate. As it is non-constant and satisfies , the limit of the scheme can be used for partitioning with the simple criterion .
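The final thresholding step can be sketched as follows. The snippet assumes that the partition criterion is the sign of the limit point and that the zero-mean condition is weighted by a degree vector `d` with positive entries. The helper names `weighted_centre` and `partition_by_sign` are hypothetical, and the random vector merely stands in for the output of scheme (6), which is not implemented here.

```python
import numpy as np


def weighted_centre(u, d):
    """Remove the component of u along d so that <u, d> = 0."""
    return u - (u @ d) / (d @ d) * d


def partition_by_sign(u):
    """Binary partition (assumed criterion): True where u > 0."""
    return u > 0


rng = np.random.default_rng(0)
d = np.array([1.0, 3.0, 2.0, 2.0])          # positive degrees (toy values)
u = weighted_centre(rng.normal(size=4), d)  # stand-in for the limit of the scheme
u /= np.linalg.norm(u)                      # normalise the estimate
mask = partition_by_sign(u)
```

Because `d` has positive entries, a non-zero vector with a vanishing weighted mean must take both signs, so neither side of the partition is empty.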
Multiclass clustering.
We now aim at finding coupled functions that are all local minima of the ratio . The issue is to define a good coupling constraint between the ’s such that it is easy to project on. Let , we here consider the simple linear coupling :
(9) 
There are three main reasons for considering such coupling instead of classical simplex bresson2013multiclass ; rangapuram2014tight ; gao2015medical or orthogonality dodero2014group constraints:

Projection onto this linear constraint is explicit, with a simple shift of the vector for each node . On the other hand, the simplex constraint (, , ) requires more expensive projections of the vectors onto the simplex. Last, projection onto the orthogonality constraint of the ’s is a non-convex problem.

Contrary to the simplex constraint, it is compatible with the weighted zero mean condition that any eigenfunction of should satisfy, as shown in Proposition 1.

The characteristic function of a linear constraint is absolutely one homogeneous. This leads to a natural extension of the binary case.
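To illustrate the difference in projection cost, the sketch below assumes that the linear coupling (9) is the per-node zero-sum condition $\sum_l u_l(x) = 0$ and compares it with a standard sort-based Euclidean projection onto the simplex. The function names and the array layout (nodes in rows, classes in columns) are illustrative choices.

```python
import numpy as np


def project_zero_sum(U):
    """Projection onto {sum_l U[x, l] = 0 for every node x} (assumed form of (9)):
    each row is shifted by its mean."""
    return U - U.mean(axis=1, keepdims=True)


def project_simplex(v):
    """Euclidean projection of one vector onto {v >= 0, sum(v) = 1},
    using the classical sort-based algorithm."""
    s = np.sort(v)[::-1]                 # entries in decreasing order
    css = np.cumsum(s) - 1.0
    k = np.arange(1, v.size + 1)
    valid = s - css / k > 0              # indices that stay active
    tau = css[valid][-1] / k[valid][-1]  # shift applied before clipping
    return np.maximum(v - tau, 0.0)


U = np.array([[2.0, 0.0, 1.0],
              [0.2, 0.5, -0.1]])
U_lin = project_zero_sum(U)
U_simplex = np.apply_along_axis(project_simplex, 1, U)
```

The linear projection is a single vectorised shift, whereas the simplex projection needs a sort for every node.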
Multiclass flow.
We now consider the problem:
(10) 
To find a local minimum of (10), we define the iterative multi-class functional, which reads:
(11) 
where and is the characteristic function of the constraints (9). Starting from an initial point that satisfies the constraint () and has been normalised (), the scheme we consider reads:
(12) 
where and , and the point in the above PDE corresponds to the global minimiser of (11). Notice that the subgradient of the one homogeneous functional can be characterised with:
(13) 
In practice, if for some , vanishes, then we define for the next iteration. With such assumptions, the sequence has the following properties, which are shown in the Supplementary Material.
Proposition 4.
For , , the trajectory given by (12) satisfies

,

,

.
Point 3 of Proposition 4 contains weights that prevent us from showing the exact decrease of the sum of ratios. This is thus similar to the approach in bresson2013multiclass . To ensure the decrease of the sum of ratios , it is possible to introduce auxiliary variables dealing with individual ratio decrease, as in rangapuram2014tight . The subproblem involved at each iteration is nevertheless more complex to solve. Also notice that, as there is no prior information on the nodes’ labels, clusters can vanish or become proportional to one another. Such issues nevertheless cannot happen in the transductive setting we now consider.
Label Propagation: MultiClass Classification.
The previous settings are unsupervised. We now consider a semisupervised setting where we know small subsets of labeled nodes (with ) belonging to each cluster , with . Denoting , the objective is to propagate the prior information in the graph in order to predict the labels of the remaining nodes . To that end, we simply have to modify the coupling constraint in (9) as
(14) 
With such a constraint, clusters can no longer vanish or merge, since they all contain different active nodes satisfying . The same PDE (12) can be applied to propagate these labels. Once it has converged, the label of each node is taken as:
Soft labelling can also be obtained by considering all the clusters with non-negative weights with relative weights and the convention that , in the case (which has never been observed in our experiments) that for all . The parameter in (14) is set to a small numerical value. Indeed, even if by construction, a small is required to ensure that, after the rescaling, . One can consider different values for each class. In the case where , is constant and , is expected to be bivalued Feld and the value of has a clear meaning. In that framework, corresponds to no prior on the size of the clusters, whereas encourages the clusters to be of homogeneous size.
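A minimal sketch of this final assignment step is given below. It assumes that the hard label is the index of the largest class score $u_l(x)$, that the soft weights are the positive parts of the scores normalised to sum to one, and that a uniform distribution is used when no score is positive. The uniform fallback and the function names are assumptions made for this sketch.

```python
import numpy as np


def hard_labels(U):
    """Label of each node: index of the largest class score (rows = nodes)."""
    return U.argmax(axis=1)


def soft_labels(U):
    """Relative weights of the classes with positive score.
    Assumed fallback: uniform weights when no score is positive."""
    P = np.maximum(U, 0.0)
    total = P.sum(axis=1, keepdims=True)
    uniform = np.full_like(P, 1.0 / U.shape[1])
    safe_total = np.where(total > 0, total, 1.0)  # avoid division by zero
    return np.where(total > 0, P / safe_total, uniform)


# Toy scores for three nodes and three classes (hypothetical values).
U = np.array([[0.6, -0.2, -0.4],
              [0.3, 0.1, -0.4],
              [-0.1, -0.2, -0.3]])
labels = hard_labels(U)
weights = soft_labels(U)
```

The last row of the toy example has no positive score and therefore falls back to the assumed uniform weights.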
4 Experimental Results
This section is focused on describing in detail the experiments that we conducted to evaluate our proposed approach.
4.1 Implementation Details
We here describe the specifics of our experimental setting, including the data description and the evaluation methodology.
Data Description. We validate our approach using three datasets: one small-scale and two large-scale. 1) The UCI ML handwritten digits dataset, of which we use the test set composed of images of size , and 10 classes. 2) The ChestX-ray14 dataset wang2017chestx , which is composed of 112,120 frontal-view chest X-rays of size 1024×1024 and 14 classes. 3) The CIFAR-10 dataset, which contains 60,000 colour images of size 32×32 in 10 different classes. All classification experiments were performed on these datasets.
Evaluation Protocol. We design the following evaluation scheme to validate our theory. Firstly, we evaluate our proposed EMS1L approach against two classic methods: Label Propagation (LP) zhu2002learning and Local and Global Consistency (LCG) zhou2004learning . To evaluate the output quality, we computed the error rate and the F1-score. Secondly, using the ChestX-ray14 dataset wang2017chestx , we compared our approach against two deep learning approaches: WANG17 wang2017chestx and YAO18 yao2018weakly . The quality of the classification was assessed by a ROC analysis using the area under the curve (AUC). Finally, we demonstrate that our method can be extended to deep SSL; this extension is evaluated on the CIFAR-10 dataset and compared against state-of-the-art deep SSL methods salimans2016improved ; shi2018transductive ; tarvainen2017mean ; iscen2019label and a fully supervised technique luo2018smooth . For this part, we evaluate the quality of the classifiers by reporting the error rate for a range of numbers of labelled samples. Each experiment was repeated 10 times, and the average and standard deviation are reported. For the compared methods, the parameters were set using the default values provided in the demo code or referenced in the papers themselves.
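The reported quantities can be computed as in the sketch below. The macro-averaging of the F1-score and the use of the population standard deviation are assumptions, as are the function names and the toy predictions.

```python
import numpy as np


def error_rate(y_true, y_pred):
    """Percentage of misclassified samples."""
    return 100.0 * np.mean(y_true != y_pred)


def macro_f1(y_true, y_pred, n_classes):
    """F1-score averaged over classes (macro-averaging is an assumption)."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return 100.0 * float(np.mean(scores))


def mean_and_std(values):
    """Average and (population) standard deviation over repeated runs."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std()


# Toy predictions for a two-class problem (hypothetical values).
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
err = error_rate(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, n_classes=2)
avg, std = mean_and_std([err, 30.0, 20.0])  # e.g. error rates of repeated runs
```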
4.2 How good is EMS1L?
We start by giving some insight into the performance of our approach through a comparison against two classic methods, LP zhu2002learning and LCG zhou2004learning , whose results on the digits dataset are reported in Table 1. One can see that, for all metrics and percentages of labelled samples, our approach outperforms the compared methods by a significant margin. In particular, with only 1% of labelled data, the error rate of our EMS1L approach (14.21) is about half that of the second-best method (29.57), and a similar gap is observed for the remaining percentages of labelled samples and evaluation metrics. This shows that our EMS1L approach outperforms the compared methods even with extremely few labelled samples.
                                   Percentage of Labelled Samples
Metric      Method                 1%             2%             5%             10%            20%
Error rate  LP zhu2002learning     40.53 ± 5.38   28.91 ± 4.01   22.70 ± 3.23   10.04 ± 1.49   5.83 ± 1.38
            LCG zhou2004learning   29.57 ± 8.22   11.00 ± 3.09   9.63 ± 2.41    5.16 ± 1.45    3.44 ± 1.28
            EMS1L                  14.21 ± 5.63   6.51 ± 1.86    3.46 ± 0.91    1.80 ± 0.54    1.09 ± 0.24

            LP zhu2002learning     59.48 ± 6.99   67.66 ± 5.15   76.08 ± 3.67   89.86 ± 1.53   94.12 ± 1.46
            LCG zhou2004learning   63.80 ± 10.74  88.23 ± 4.16   89.95 ± 2.89   94.80 ± 1.49   95.55 ± 1.28
            EMS1L                  84.50 ± 7.48   93.40 ± 1.98   96.52 ± 0.93   98.20 ± 0.11   98.91 ± 0.24

            LP zhu2002learning     56.48 ± 5.38   71.09 ± 4.01   77.30 ± 3.22   89.96 ± 1.49   94.17 ± 1.38
            LCG zhou2004learning   70.43 ± 8.22   89.00 ± 3.09   90.37 ± 2.41   94.84 ± 1.45   95.56 ± 1.28
            EMS1L                  85.79 ± 5.63   93.49 ± 1.86   96.54 ± 0.91   98.54 ± 0.91   98.91 ± 0.24
Approach                  Average AUC
WANG17 wang2017chestx     0.7451
YAO18 yao2018weakly       0.7614
MT tarvainen2017mean      0.5
EMS1L (20%)               0.7888
To further evaluate our approach, we move to the large-scale dataset ChestX-ray14. Our motivation for using this dataset comes from a central problem in medical imaging, namely the lack of reliable, high-quality annotated data. In particular, the interpretation of X-ray data heavily relies on the radiologist’s expertise, and there is still a substantial clinical error in the outcome bruno2015understanding . We ran our approach and compared it against two state-of-the-art works on X-ray classification, WANG17 wang2017chestx and YAO18 yao2018weakly , which are supervised methods and therefore assume a large corpus of annotated data.
Method                                   Labelled samples
                                         1000           2000           4000
SNGT luo2018smooth (Fully Supervised)    46.43 ± 1.21   33.94 ± 0.73   20.66 ± 0.57
SSLGAN salimans2016improved              21.83 ± 2.01   19.61 ± 2.09   18.63 ± 2.32
TDCNN shi2018transductive                32.67 ± 1.93   22.99 ± 0.79   16.17 ± 0.37
MT tarvainen2017mean                     21.55 ± 1.48   15.73 ± 0.31   12.31 ± 0.28
DSSL iscen2019label (diffusion+W)        22.02 ± 0.88   15.66 ± 0.35   12.69 ± 0.29
Deep EMS1L                               20.45 ± 1.08   13.91 ± 0.23   11.08 ± 0.24
In Figure 2, we show a few sample outputs that were correctly classified by our approach. Table 2 shows the AUC averaged over all classes for our approach compared against WANG17 wang2017chestx , YAO18 yao2018weakly , and MT tarvainen2017mean using the official data partition. From the table, one can see that our EMS1L approach outperformed the compared methods with only 20% of the data, whilst the compared approaches rely on 70% of annotated data. Moreover, we noticed that the classification output is very stable with respect to changes in the partition of the dataset, which is due to the semi-supervised nature of our EMS1L approach. This is well reflected in Figure 3, where we show the AUC results of both EMS1L and WANG17 wang2017chestx using three different random data partitions, including the partition suggested by WANG17 wang2017chestx . The plot shows that WANG17 is sensitive to changes in the partition, which can be explained by the fact that supervised methods heavily rely on the training set being representative. On the other hand, EMS1L showed minimal change in performance over the three different partitions, as the underlying graphical representation is invariant to the partition. To further analyse the dependency on the partitioning and show the advantage of EMS1L, we compare the AUC results of EMS1L against WANG17 and MT17 using random data partitions. The results are reported in Figure 4, which shows that EMS1L produces a more accurate classification using only 2% of the data labels than the WANG17 or MT17 methods do using 70% of the data labels. The plot also shows that, as we feed EMS1L more data labels, the classification accuracy increases and significantly outperforms the compared approaches whilst still using a far smaller amount of data labels.
4.3 Deep EMS1L: An Alternative View
One interesting observation about our proposed framework is that it can be adapted to DL for semi-supervised learning (SSL). To show this ability, we followed the philosophy of iscen2019label , which builds on the seminal work LCG zhou2004learning . We used their pseudo-labelling approach and plugged in our EMS1L (i.e. we replaced LCG with our approach). We then performed the image classification task on the CIFAR-10 dataset for different numbers of labelled samples. The results of this experiment are reported in Table 3, which shows as a baseline a fully supervised approach luo2018smooth followed by four state-of-the-art DL semi-supervised approaches salimans2016improved ; shi2018transductive ; tarvainen2017mean ; iscen2019label . One can observe that the lowest error rate across the different numbers of labelled samples is achieved by our extension, Deep EMS1L. A closer inspection of the table shows that, although the results of SSLGAN salimans2016improved start close to ours, they improve only marginally as the number of labelled samples increases.
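For reference, the diffusion step being replaced can be sketched with the closed form of LCG zhou2004learning , $F = (I - \alpha S)^{-1} Y$ with $S = D^{-1/2} W D^{-1/2}$. The toy graph, the value of $\alpha$ and the function name `lgc_propagate` are illustrative assumptions; the pseudo-labelling pipeline of iscen2019label and the EMS1L solver are not reproduced here.

```python
import numpy as np


def lgc_propagate(W, Y, alpha=0.9):
    """Closed-form local and global consistency diffusion.

    W: symmetric non-negative weight matrix (n x n), no isolated nodes.
    Y: one-hot rows for labelled nodes, zero rows for unlabelled ones (n x L).
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))  # D^{-1/2} W D^{-1/2}
    F = np.linalg.solve(np.eye(W.shape[0]) - alpha * S, Y)
    return F.argmax(axis=1)


# Toy graph: two triangles joined by one weak edge (hypothetical example).
W = np.zeros((6, 6))
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.1)]
for i, j, w in edges:
    W[i, j] = W[j, i] = w

Y = np.zeros((6, 2))
Y[0, 0] = 1.0  # node 0 is labelled with class 0
Y[5, 1] = 1.0  # node 5 is labelled with class 1
pred = lgc_propagate(W, Y)
```

In the deep variant, labels inferred in this way serve as pseudo-labels for training the network, and it is this propagation step that EMS1L replaces.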
5 Conclusion
In this work, we addressed the problem of classifying under minimal supervision (i.e. SSL), in particular in the transductive setting. We proposed a new semi-supervised framework built on a novel optimisation model for the task of image classification. From extensive experimental results, we found the following. Firstly, we showed that our approach significantly outperformed the classic SSL methods. Secondly, we evaluated our EMS1L method on the task of X-ray classification and demonstrated that our approach competes against the state-of-the-art results in this context whilst requiring an extremely small amount of labelled data. Finally, to demonstrate the capabilities of our approach, we showed that it can be extended to a deep SSL framework. In this context, we observed the lowest error rates on CIFAR-10 with respect to the state-of-the-art SSL methods. Future work will include the investigation of our approach in terms of data aggregation and of how to handle unseen classes.
Acknowledgments
This work was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 777826. Support from the CMIH, University of Cambridge is gratefully acknowledged.
References
 [1] F. Andreu, J. Mazón, J. Rossi, and J. Toledo. A nonlocal p-Laplacian evolution equation with Neumann boundary conditions. Journal de mathématiques pures et appliquées, 90(2):201–227, 2008.
 [2] J. Aujol, G. Gilboa, and N. Papadakis. Theoretical analysis of flows estimating eigenfunctions of one-homogeneous functionals. SIAM Journal on Imaging Sciences, 11(2):1416–1440, 2018.
 [3] X. Bresson, T. Laurent, D. Uminsky, and J. Von Brecht. Convergence and energy landscape for Cheeger Cut clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1385–1393, 2012.
 [4] X. Bresson, T. Laurent, D. Uminsky, and J. Von Brecht. Multiclass total variation clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1421–1429, 2013.
 [5] X. Bresson, T. Laurent, D. Uminsky, and J. H. Von Brecht. An adaptive total variation algorithm for computing the balanced cut of a graph. arXiv preprint arXiv:1302.2717, 2013.
 [6] M. A. Bruno, E. A. Walker, and H. H. Abujudeh. Understanding and confronting our mistakes: the epidemiology of error in radiology and strategies for error reduction. Radiographics, 35(6):1668–1676, 2015.
 [7] T. Bühler and M. Hein. Spectral clustering based on the graph p-Laplacian. International Conference on Machine Learning (ICML), 2009.
 [8] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40:120–145, 2011.
 [9] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. MIT Press, 20(3):542–542, 2006.
 [10] D. Dai and L. Van Gool. Ensemble projection for semi-supervised image classification. In IEEE International Conference on Computer Vision (ICCV), pages 2072–2079, 2013.
 [11] L. Dodero, A. Gozzi, A. Liska, V. Murino, and D. Sona. Group-wise functional community detection through joint Laplacian diagonalization. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 708–715. Springer, 2014.
 [12] A. Elmoataz, O. Lezoray, and S. Bougleux. Nonlocal discrete regularization on weighted graphs: a framework for image and manifold processing. IEEE Transactions on Image Processing, 17(7):1047–1060, 2008.
 [13] T. Feld, J. Aujol, G. Gilboa, and N. Papadakis. Rayleigh quotient minimization for absolutely one-homogeneous functionals. Inverse Problems, 2019.
 [14] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In Advances in Neural Information Processing Systems (NIPS), pages 522–530, 2009.
 [15] Y. Gao, E. Adeli-M, M. Kim, P. Giannakopoulos, S. Haller, and D. Shen. Medical image retrieval using multi-graph learning for MCI diagnostic assistance. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 86–93, 2015.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [17] M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram. The total variation on hypergraphs - learning on hypergraphs revisited. In Advances in Neural Information Processing Systems (NIPS), pages 2427–2435, 2013.
 [18] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7132–7141, 2018.
 [19] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Label propagation for deep semi-supervised learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 [20] T. Joachims. Transductive learning via spectral graph partitioning. In International Conference on Machine Learning (ICML), pages 290–297, 2003.
 [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
 [22] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. International Conference on Learning Representations (ICLR), 2017.
 [23] Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8896–8905, 2018.
 [24] S. S. Rangapuram, P. K. Mudrakarta, and M. Hein. Tight continuous relaxation of the balanced k-cut problem. In Advances in Neural Information Processing Systems (NIPS), pages 3131–3139, 2014.
 [25] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NIPS), pages 2234–2242, 2016.
 [26] W. Shi, Y. Gong, C. Ding, Z. Ma, X. Tao, and N. Zheng. Transductive semi-supervised deep learning using min-max features. In European Conference on Computer Vision (ECCV), pages 299–315, 2018.
 [27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.
 [28] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NIPS), pages 1195–1204, 2017.
 [29] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
 [30] J. Wang, T. Jebara, and S.-F. Chang. Graph transduction via alternating minimization. In International Conference on Machine Learning (ICML), pages 1144–1151. ACM, 2008.
 [31] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2097–2106, 2017.
 [32] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In International Conference on Machine Learning (ICML), pages 1168–1175, 2008.
 [33] L. Yao, J. Prosky, E. Poblenz, B. Covington, and K. Lyman. Weakly supervised medical diagnosis and localization from multiple resolutions. arXiv preprint arXiv:1803.07703, 2018.
 [34] Y.-M. Zhang, K. Huang, and C.-L. Liu. Fast and robust graph-based transductive learning via minimum tree cut. In IEEE International Conference on Data Mining, pages 952–961, 2011.
 [35] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems (NIPS), pages 321–328, 2004.
 [36] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
 [37] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning (ICML’03), pages 912–919, 2003.
Supplementary Material for:
Beyond Supervised Classification: Extreme Minimal Supervision with the Graph 1-Laplacian
This supplementary material provides further details and proofs that support the content of the main paper, in particular the proofs of Proposition 2 and Proposition 4.
Appendix A Proofs
A.1 Proof of Proposition 2

For , we have
where we used Proposition 1 in the right part of the previous relation to get . We conclude with the fact that is a rescaling of .

Since is a norm, it is absolutely one homogeneous and . Next, we observe that and we get
We then conclude with the fact that .

Since for all and , then . Next, we recall that . Hence we have
(15) where the final rescaling with is possible since and are absolutely one homogeneous functions.

In the finite-dimensional setting, there exists such that and for an absolutely one-homogeneous functional defined in (2) and a norm . Then one has
Hence from the equivalence of norms in finite dimensions, there exists such that .
A.2 Proof of Proposition 4
Proof.

We have
We follow the point 2 of Proposition 2 to first get: , for . Then, as , we deduce that . Next we have
Summing on , we get
Notice that we defined for . As is a norm, the equivalence of norms in finite dimensions implies that is bounded by some constant . We then have .

Since is the global minimizer of (11), then:
∎