Semi-Supervised Learning by Augmented Distribution Alignment
Abstract
In this work, we propose a simple yet effective semi-supervised learning approach called Augmented Distribution Alignment. We reveal that an essential sampling bias exists in semi-supervised learning due to the limited number of labeled samples, which often leads to a considerable empirical distribution mismatch between labeled and unlabeled data. To this end, we propose to align the empirical distributions of labeled and unlabeled data to alleviate the bias. On one hand, we adopt an adversarial training strategy to minimize the distribution distance between labeled and unlabeled data, as inspired by domain adaptation works. On the other hand, to deal with the small sample size of labeled data, we propose a simple interpolation strategy to generate pseudo training samples. These two strategies can be easily implemented into existing deep neural networks. We demonstrate the effectiveness of our proposed approach on the benchmark SVHN and CIFAR-10 datasets, on which we achieve new state-of-the-art error rates. Our code will be available at https://github.com/qinenergy/adanet.
1 Introduction
Semi-Supervised Learning (SSL) aims to learn a robust model with a limited number of labeled samples and an abundant number of unlabeled samples. As a classical learning paradigm, it has attracted much interest from both the machine learning and computer vision communities. Many approaches have been proposed in recent decades, including label propagation, graph regularization, etc. [6, 5, 2, 17, 4, 47]. Recently, there has been increasing interest in training deep neural networks in the semi-supervised learning scenario [27, 40, 26, 30, 32, 8, 7]. This is partially due to the data-intensive nature of conventional deep learning techniques, which often impose heavy demands on data annotation and bring high cost.
While many strategies have been proposed to utilize unlabeled data for boosting model performance, the essential sampling bias issue in SSL has rarely been discussed in the literature. That is, the empirical distribution of labeled data often deviates from the true sample distribution, due to the limited sample size of labeled data. We illustrate this issue with the classical two-moon data in Figure 1, in which we plot labeled samples (bottom left) and unlabeled samples (bottom middle). It can be observed that the two-moon structure is well depicted by the unlabeled samples. However, due to the randomness in sampling and the small sample size, the labeled data can hardly reveal the underlying distribution, even though it is also sampled from the same two-moon distribution. In terms of empirical distribution, this also leads to a considerable difference between labeled and unlabeled data, as shown by the density estimation results of their x-axis projections (top left and top middle).
A similar empirical distribution mismatch is also observed in real-world datasets for SSL (see Section 5.3). As observed in domain adaptation works, model performance can often be significantly degraded when a model is applied to a sample set with a considerable empirical distribution difference. Therefore, SSL models could also be affected by the empirical distribution mismatch between labeled and unlabeled data when exploiting different SSL strategies, e.g., label propagation from labeled data to unlabeled data.
To tackle this issue, we propose to explicitly reduce the empirical distribution mismatch in SSL. Specifically, we develop a simple yet effective approach called Augmented Distribution Alignment. On one hand, we adopt an adversarial training strategy to minimize the distribution distance between labeled and unlabeled data, such that the feature distributions are well aligned in the latent space, as illustrated in the top right of Figure 1. On the other hand, to alleviate the small sample size issue and enhance the distribution alignment, we also propose a data augmentation strategy that generates pseudo samples by interpolating between the labeled and unlabeled training sets, as illustrated in the bottom right of Figure 1. It is also worth mentioning that both strategies can be implemented easily: the adversarial training can be achieved with a simple gradient reversal layer, and the data augmentation can be implemented by interpolation. Thus, they can be readily incorporated into existing neural networks for SSL with little effort. We demonstrate the effectiveness of our proposed approach on the benchmark SVHN and CIFAR-10 datasets, on which we achieve new state-of-the-art classification performance.
Our contributions are summarized as follows:

We offer a new perspective of empirical distribution mismatch for understanding semi-supervised learning. The empirical distribution mismatch problem commonly exists in SSL scenarios, but has not been revealed by existing semi-supervised learning works.

We propose an augmented distribution alignment approach to explicitly address the empirical distribution mismatch for SSL.

Our approach can be easily implemented into existing neural networks for SSL with little effort.

Despite its simplicity, our proposed approach achieves new state-of-the-art classification performance on the benchmark SVHN and CIFAR-10 datasets for the SSL task.
2 Related Work
Semi-supervised learning: As a classical learning paradigm, semi-supervised learning has been studied with various methods, including label propagation, graph regularization, co-training, etc. [44, 35, 5, 29, 17, 4, 24, 1]. We refer interested readers to [47] for a comprehensive survey. Recently, there has been increasing interest in training deep neural networks in the semi-supervised learning scenario [40, 26, 30, 32, 8, 7]. This is partially due to the data-intensive nature of conventional deep learning techniques, which often impose heavy demands on data annotation and bring high cost. Different models have been designed for deep semi-supervised learning. For example, [26, 40, 30] proposed to add small perturbations to unlabeled data and enforce a consistency regularization [32] on the output of the model. Other works [7, 8] adopt the idea of self-training and use propagated labels, either with a memory module or regularized by training speed. The ensemble approach has also been explored: [26] used predictions of the network-in-training averaged over time to regularize the model, while [40] instead used accumulated parameters for prediction.
Different from the above works, we tackle the SSL problem from a new perspective of empirical distribution mismatch, which has rarely been discussed in the literature. By simply dealing with the distribution mismatch, we show that our newly proposed augmented distribution alignment with vanilla neural networks performs competitively with state-of-the-art SSL methods. Moreover, since we address the SSL problem in a new way, our approach is potentially complementary to those methods, and is shown to further boost their performance.
Sampling bias problem: Sampling bias has usually been discussed in the literature under the supervised learning and domain adaptation scenarios [37, 10, 22]. Many works have been proposed to measure or address sampling bias in the learning process [11, 12, 28, 38]. Recently, following generative adversarial networks [15], the adversarial training strategy has been widely used to address the empirical distribution mismatch in domain adaptation [12, 41, 46]. Although domain adaptation generally assumes that samples in the two domains are drawn from two different distributions, whereas in SSL the labeled and unlabeled samples come from an identical distribution, the techniques for reducing domain distribution mismatch can be readily used to reduce the empirical distribution mismatch in SSL. In this work, we employ the adversarial training strategy proposed in [12]. A potential challenge, as discussed in this paper, is that the small sample size of labeled data might lead to a lack-of-support problem when aligning distributions, for which we additionally employ a sample augmentation strategy.
Other related works: Our work is also related to the recently proposed interpolation-based data augmentation methods for training neural networks [45, 23, 42]. In particular, the Mixup method [45] generates new training samples using convex combinations of pairs of training samples and their labels. In order to address the small sample size issue when aligning distributions, we generalize this approach to semi-supervised learning by using pseudo-labels for unlabeled samples in the interpolation process. Moreover, we also show that by interpolating between labeled and unlabeled data, the empirical distribution of the generated data actually gets closer to that of the unlabeled samples.
3 Problem Statement and Motivations
In semi-supervised learning, we are given a small amount of labeled training samples and a large set of unlabeled training samples. Formally, let us denote by D_l = {(x_i, y_i)}_{i=1}^{n_l} the set of labeled training data, where x_i is the i-th sample, y_i is its corresponding label, and n_l is the total number of labeled samples. Similarly, the set of unlabeled training data can be represented as D_u = {x_j}_{j=1}^{n_u}, where x_j is the j-th unlabeled training sample and n_u is the number of unlabeled samples. Usually n_l is a small number, and we have n_l ≪ n_u. The task of semi-supervised learning is to train a classifier which performs well on test data drawn from the same distribution as the training data.
3.1 Empirical Distribution Mismatch in SSL
In semisupervised learning, the labeled training samples and unlabeled training samples are assumed to be drawn from an identical distribution. However, due to the limited number of labeled training samples, a considerable difference of empirical distributions can often be observed between the labeled and unlabeled training samples.
More concretely, we take the two-moon data as an example to illustrate the empirical distribution mismatch problem in Figure 1. In particular, the unlabeled samples describe the underlying distribution well (bottom middle), while the labeled samples can hardly represent the two-moon distribution (bottom left). This can be further verified by their distributions when projected onto the x-axis (upper left and upper middle), from which we observe an obvious distribution difference. Moreover, when performing multiple rounds of sampling of labeled samples, the empirical distribution of labeled data varies significantly, due to the small sample size.
This phenomenon has also been discussed as the sampling bias problem in the literature [18, 19]. In particular, Gretton et al. [18] pointed out that the difference between two samples measured by the Maximum Mean Discrepancy (MMD) depends on their sample sizes. In semi-supervised learning, where the underlying distributions of labeled and unlabeled data are assumed identical, the MMD between labeled and unlabeled data tends to vanish only when both sample sizes are large, which is described as follows,
Proposition 1.
Let us denote by F a class of witness functions f : X → R in the unit ball of the reproducing kernel Hilbert space (RKHS) induced by a kernel function k, and assume 0 ≤ k(x, x′) ≤ K. If the labeled and unlabeled samples are drawn from an identical distribution, then with probability at least 1 − exp(−ε² n_l n_u / (2K(n_l + n_u))), the MMD between the labeled sample (of size n_l) and the unlabeled sample (of size n_u) can be bounded by

    MMD[F, D_l, D_u] ≤ sqrt(K / n_l) + sqrt(K / n_u) + ε.
Proof.
The proof can be derived from Theorem 7 in [18] by assuming that the two distributions are identical. ∎
In semi-supervised learning, the number of labeled samples is usually small, which leads to a notable empirical distribution difference from the unlabeled samples, as stated in the above proposition. Specifically, we illustrate the sampling bias problem with the two-moon data in the semi-supervised learning scenario in Figure 2. We plot the MMD between labeled and unlabeled samples with respect to different numbers of labeled samples n_l. As shown in the figure, when the sample size of labeled data is small, both the mean and variance of the MMD are large, and the MMD becomes small only when n_l is sufficiently large.
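To make the trend in Figure 2 concrete, the following sketch estimates the (biased, squared) MMD under an RBF kernel between a large fixed sample and labeled samples of varying size drawn from the same Gaussian; all names, the toy data, and the kernel bandwidth are our own choices, not from the paper.

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between sample sets x and y
    under an RBF kernel with bandwidth parameter gamma."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
unlabeled = rng.normal(size=(1000, 2))   # stand-in for a large unlabeled set
mean_mmd = {}
for n_l in (10, 100, 1000):
    # several labeled draws of size n_l from the SAME distribution
    draws = [mmd_rbf(rng.normal(size=(n_l, 2)), unlabeled) for _ in range(10)]
    mean_mmd[n_l] = float(np.mean(draws))
    print(n_l, mean_mmd[n_l])
```

Even though every draw comes from the identical distribution, the estimated discrepancy is large for small labeled samples and shrinks as the labeled sample size grows, mirroring the proposition above.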
This implies that in SSL the small sample size often causes the empirical distribution of labeled data to deviate from the true sample distribution. Consequently, a model trained on this empirical distribution is unlikely to generalize well on the test data. While various strategies have been exploited for utilizing the unlabeled data in conventional SSL methods [27, 8, 7], the empirical distribution mismatch issue has rarely been discussed, and it is one of the hidden factors behind the potential instability of conventional SSL methods. This is also verified by the recent work [32], which shows that the performance of SSL methods can degrade when the size of the labeled dataset is decreased.
3.2 Healing the Empirical Distribution Mismatch
To overcome the empirical distribution mismatch issue in SSL, in this work, we propose an augmented distribution alignment approach. In addition to training the classifier with supervision from labeled data, we simultaneously minimize the distribution divergence between labeled and unlabeled data, such that the empirical distributions of labeled and unlabeled samples are well aligned in the latent space (as illustrated in the upper right of Figure 1).
Formally, let us denote the loss function as ℓ(f(x), y), where f is the classifier to be learnt. We also define d(D_l, D_u) as the distribution divergence between labeled and unlabeled data, measured with a certain metric. Then, our main idea can be formulated as the following objective,

    min_f  Σ_{i=1}^{n_l} ℓ(f(x_i), y_i) + λ d(D_l, D_u),    (1)

where λ is a trade-off parameter to balance the two terms.
An issue with the above formulation is that the small number of labeled samples (i.e., small n_l) potentially makes the optimization of (1) unstable. To address this issue, we further propose a simple yet effective data augmentation strategy. Inspired by the recent Mixup approach for supervised learning, we iteratively generate new training samples by interpolating between labeled and unlabeled samples, and feed them both for learning the classifier and for reducing the empirical distribution divergence. We refer to our approach as Augmented Distribution Alignment, and detail it in the following section.
4 Augmented Distribution Alignment for SSL
In this section, we introduce our augmented distribution alignment method for SSL, in which we propose two strategies, adversarial distribution alignment and cross-set sample augmentation, to tackle the empirical distribution mismatch and small sample size issues, respectively.
4.1 Adversarial Distribution Alignment
We employ the H-divergence [3, 9] to measure the distribution divergence, as inspired by recent domain adaptation works.
In particular, let us denote by g a feature extractor (e.g., convolutional layers) which maps a sample x into a latent feature space. Moreover, let D be a binary discriminator which predicts 0 for labeled samples and 1 for unlabeled samples. The H-divergence between labeled and unlabeled samples can be written as:

    d_H(D_l, D_u) = 2 ( 1 − min_D [ err_l(D) + err_u(D) ] ),

where err_l(D) is the prediction error of the discriminator on labeled samples, and err_u(D) is similarly defined for unlabeled samples.
Intuitively, when the empirical distribution mismatch is large, the discriminator can easily distinguish the labeled from the unlabeled samples; its prediction errors are then small and the divergence is high, and vice versa. Therefore, to reduce the empirical distribution mismatch between labeled and unlabeled samples, we minimize this distribution distance to enforce the feature extractor g to produce a latent space in which the two sets of features are well aligned. This is achieved by solving the following problem:

    max_g min_D [ err_l(D ∘ g) + err_u(D ∘ g) ].
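Given a trained discriminator, the divergence estimate above reduces to a one-line computation from its two error rates. This tiny helper (our own naming, a sketch rather than the paper's code) illustrates the two extremes:

```python
def h_divergence_estimate(err_l, err_u):
    """Plug-in estimate of the H-divergence, given the discriminator's
    error rate on labeled samples (err_l) and unlabeled samples (err_u)."""
    return 2.0 * (1.0 - (err_l + err_u))

# A discriminator at chance on both sets suggests aligned distributions:
print(h_divergence_estimate(0.5, 0.5))  # -> 0.0
# A perfect discriminator implies maximal divergence:
print(h_divergence_estimate(0.0, 0.0))  # -> 2.0
```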
The above max-min problem can be optimized with adversarial training methods. In [13], Ganin and Lempitsky showed that it can be implemented with a simple gradient reversal layer (GRL) which automatically reverses the gradient behind the discriminator, so one can directly minimize the classification loss of the discriminator with a standard backpropagation library.
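A minimal sketch of the GRL idea, outside of any specific framework: the layer is the identity in the forward pass and flips (and scales) the gradient in the backward pass. The reversal strength GRL_LAMBDA is an assumed value, not taken from the paper.

```python
import numpy as np

GRL_LAMBDA = 0.1  # reversal strength; an assumed value

def grl_forward(x):
    # the GRL is the identity mapping in the forward pass
    return x

def grl_backward(grad_output):
    # in the backward pass, the gradient flowing from the discriminator
    # into the feature extractor is sign-flipped and scaled
    return -GRL_LAMBDA * grad_output

g = np.array([0.5, -2.0])   # gradient of the discriminator loss
print(grl_backward(g))      # the feature extractor receives the reversed gradient
```

With this layer in place, a single descent step on the discriminator loss simultaneously improves the discriminator and pushes the feature extractor toward features the discriminator cannot separate.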
4.2 Cross-set Sample Augmentation
As discussed in Section 3, in SSL, the limited sample size of labeled data often causes instability in optimization and leads to performance degradation. In order to reinforce the alignment, inspired by [45], we propose to generate new training samples by interpolating between labeled and unlabeled samples. In particular, for each unlabeled sample x_j, we assign it a pseudo-label ŷ_j, which in this work is generated using the prediction of the model trained in the previous iteration. Then, given a labeled sample (x_i, y_i) and an unlabeled sample (x_j, ŷ_j), the interpolated sample can be represented as,
    x̃ = λ x_i + (1 − λ) x_j,    (2)
    ỹ = λ y_i + (1 − λ) ŷ_j,    (3)
    z̃ = λ · 0 + (1 − λ) · 1 = 1 − λ,    (4)

where λ is a random variable generated from a Beta prior distribution, i.e., λ ~ Beta(α, α), with α being a hyper-parameter to control the shape of the distribution; x̃ is the interpolated sample, ỹ is its class label, and z̃ is its label for the distribution discriminator.
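Equations (2)-(4) amount to a few lines of array code. The sketch below assumes one-hot pseudo-labels and one Beta(α, α) draw per sample pair; the function name and the value of α are ours, not from the paper.

```python
import numpy as np

ALPHA = 1.0  # Beta shape hyper-parameter; an assumed value

def cross_set_mixup(x_l, y_l, x_u, y_u_pseudo, alpha=ALPHA, rng=None):
    """Interpolate a labeled batch with an unlabeled batch (Eqs. 2-4).

    y_u_pseudo holds one-hot pseudo-labels; the returned z_t is the
    soft domain label (0 = labeled, 1 = unlabeled)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha, size=(len(x_l), 1))  # one lambda per pair
    x_t = lam * x_l + (1.0 - lam) * x_u               # Eq. (2)
    y_t = lam * y_l + (1.0 - lam) * y_u_pseudo        # Eq. (3)
    z_t = 1.0 - lam                                   # Eq. (4)
    return x_t, y_t, z_t
```

Each mini-batch pairs labeled with unlabeled samples one-to-one; with α = 1 the mixing coefficient is uniform on [0, 1].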
The benefits of such cross-set sample augmentation are two-fold. First, the interpolated samples greatly enlarge the training set, making the learning process more stable, especially for deep neural network models. It was also shown in [45] that such data augmentation helps to improve model robustness.
Second, each pseudo sample is generated by interpolating between a labeled sample and an unlabeled sample, so the distribution of pseudo samples is expected to be closer to the real distribution than that of the original labeled training samples. We prove this using the Euclidean generalized energy distance [39] below.
Let us denote by D_l and D_u the empirical distributions of labeled and unlabeled data; their Euclidean generalized energy distance [39] can be written as,

    D_E(D_l, D_u) = 2 E‖x_l − x_u‖ − E‖x_l − x_l′‖ − E‖x_u − x_u′‖,

where ‖·‖ is the Euclidean distance, and x_l and x_l′ (resp., x_u and x_u′) are two samples independently drawn from D_l (resp., D_u). Then, we show that cross-set sample augmentation helps to bridge the gap between the two distributions with the following proposition,
Proposition 2.
Let D̃ be the empirical distribution of the pseudo samples generated using (2); then we have D_E(D̃, D_u) ≤ D_E(D_l, D_u). In other words, the Euclidean generalized energy distance between the empirical distributions of the pseudo and unlabeled samples is smaller than or equal to that between the labeled and unlabeled samples.
Proof.
Using Proposition 2 from [39], we rewrite the energy distance in terms of expectations of pairwise distances. The terms involving the pseudo samples can then be bounded, because the expectation of λ is fixed, and the same applies to 1 − λ. Therefore, D_E(D̃, D_u) ≤ D_E(D_l, D_u).
Here we complete the proof. ∎
This implies that the newly generated pseudo samples can be deemed as being sampled from intermediate distributions between the empirical distributions of labeled and unlabeled samples. As shown in previous domain adaptation works [16, 14], such intermediate distributions are beneficial for bridging the gap between two distributions and learning more robust models.
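Proposition 2 can be checked numerically. The sketch below estimates the Euclidean generalized energy distance from samples and compares a deliberately shifted "labeled" sample against its cross-set interpolation with the "unlabeled" sample; all names and the toy data are our own.

```python
import numpy as np

def energy_distance(a, b):
    """Euclidean generalized energy distance between two sample sets."""
    def mean_dist(u, v):
        return np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1).mean()
    return 2.0 * mean_dist(a, b) - mean_dist(a, a) - mean_dist(b, b)

rng = np.random.default_rng(0)
x_l = rng.normal(size=(50, 2))                # small "labeled" sample
x_u = rng.normal(loc=0.5, size=(500, 2))      # large, shifted "unlabeled" sample
lam = rng.beta(1.0, 1.0, size=(50, 1))
x_mix = lam * x_l + (1.0 - lam) * x_u[:50]    # cross-set interpolation, Eq. (2)

d_labeled = energy_distance(x_l, x_u)
d_mixed = energy_distance(x_mix, x_u)
print(d_mixed <= d_labeled)
```

The interpolated set sits between the two original sets, so its estimated distance to the unlabeled sample is no larger, consistent with the proposition.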
4.3 Summary
We unify the adversarial distribution alignment and crossset sample augmentation strategies into one framework, finally leading to our augmented distribution alignment approach.
In Figure 3, we show an example of incorporating our augmented distribution alignment approach into a vanilla convolutional neural network, which is referred to as ADA-Net. Specifically, in addition to the classification branch, we add several fully connected layers as the discriminator to distinguish labeled from unlabeled samples (i.e., the discriminator D discussed in Section 4.1). A gradient reversal layer is added before the discriminator, which automatically reverses the sign of the gradient from the discriminator during backpropagation. Then, for each mini-batch, we use the cross-set sample augmentation strategy in (2), (3), (4) to generate interpolated samples and labels, and use them as training data for our ADA-Net. The objective for training the network can be obtained by replacing the training samples and the divergence term in (1), i.e.,
    min_{f,g}  Σ_i ℓ(f(g(x̃_i)), ỹ_i) + λ d_H(g(D̃), g(D_u)),    (5)

where g, f, and D are respectively the feature extractor, the classifier, and the discriminator; d_H is estimated through the discriminator D using the soft domain labels z̃_i, and ℓ is the loss function, for which we use the cross-entropy in this work.
We depict the training pipeline in Algorithm 1. Aside from the simple sample interpolation, the network can be optimized with standard backpropagation. Therefore, our augmented distribution alignment can be easily incorporated into existing neural networks by appending a discriminator with a GRL, and adding the proposed cross-set sample augmentation during mini-batch data preparation.
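The per-mini-batch objective then combines a classification term on the interpolated samples with a discriminator term on the soft domain labels. The sketch below only assembles the two scalar losses (the trade-off weight lam is an assumed value); in a real implementation the GRL, rather than an explicit max, supplies the adversarial sign for the feature extractor.

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean cross-entropy between predicted distributions and (soft) targets."""
    return float(-(targets * np.log(probs + 1e-12)).sum(axis=1).mean())

def ada_net_losses(cls_probs, y_t, disc_probs, z_t, lam=0.1):
    """Assemble the two loss terms of the training objective (a sketch).

    cls_probs:  classifier softmax outputs on interpolated samples
    y_t:        interpolated class labels, Eq. (3)
    disc_probs: discriminator outputs, P(sample comes from the unlabeled set)
    z_t:        soft domain labels, Eq. (4)
    The discriminator descends on l_disc; the GRL makes the feature
    extractor effectively ascend on it."""
    l_cls = cross_entropy(cls_probs, y_t)
    # binary cross-entropy of the discriminator against the soft domain labels
    l_disc = float(-(z_t * np.log(disc_probs + 1e-12)
                     + (1.0 - z_t) * np.log(1.0 - disc_probs + 1e-12)).mean())
    return l_cls + lam * l_disc, l_cls, l_disc
```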
5 Experiments
In this section, we evaluate our proposed ADA-Net for semi-supervised learning on benchmark datasets including SVHN and CIFAR-10.
5.1 Experimental Setup
SVHN:
The Street View House Numbers (SVHN) dataset [31] consists of real-world digit photos. It includes ten classes and 73,257 training images of size 32×32. Following [30], 1,000 images out of the full training set are used with labels for supervised learning. The remaining training images are provided without labels. Random translation is the only augmentation used for this dataset.
CIFAR-10:
The CIFAR-10 dataset [25] contains 10 classes, and consists of 50,000 training images as well as 10,000 testing images. All images are of size 32×32. 4,000 samples from the training images are used as the labeled set in our experiments; the remaining training images are used as unlabeled samples.
We use PreAct-ResNet-18 [21] as the backbone network, and implement our ADA-Net in TensorFlow based on the open-source TensorPack library [43]. For the class classifier, a single fully connected layer is used to map the features to logits. For the domain classifier, two fully connected layers, each with 1,024 units, followed by another fully connected layer, are used to produce two channels of soft domain labels.
The batch size is set to 128. The learning rate starts from 0.1, and is divided by 10 when 50% and 75% of the epochs are reached. The network is trained for 100 epochs in total on SVHN, and 300 epochs on CIFAR-10, where one epoch is defined as one pass over all unlabeled data. We use a momentum optimizer with the momentum set to 0.9. The following hyper-parameters are used for our reported results: the weight decay, and the interpolation parameter α for SVHN and for CIFAR-10. The experiments on SVHN and CIFAR-10 share the exact same network and protocol. Source code will be released for reproducing our experiments.
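The learning-rate schedule described above can be written as a small step function (the function name, and whether the drop happens exactly at the boundary epoch, are our assumptions):

```python
def learning_rate(epoch, total_epochs, base_lr=0.1):
    """Step schedule: start at base_lr and divide by 10 when 50% and
    75% of the total epochs are reached."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10.0
    if epoch >= 0.75 * total_epochs:
        lr /= 10.0
    return lr
```

For the 100-epoch SVHN run this gives 0.1 for epochs 0-49, 0.01 for 50-74, and 0.001 afterwards.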
5.2 Experimental Results
We summarize the classification error rates on the SVHN and CIFAR-10 datasets in Table 1. We include the baseline CNN model trained with labeled data only as a reference. To validate the effectiveness of the two modules in our ADA-Net, we also report two variants of our proposed approach. In the first variant, we do not use cross-set sample augmentation and apply the distribution alignment on the original labeled and unlabeled samples. In the second variant, we remove the discriminator and perform only cross-set sample augmentation for learning the classifier.
As shown in Table 1, our ADA-Net significantly improves the classification performance on both datasets. We also observe that both distribution alignment and cross-set sample augmentation are important for improving the classification performance. The distribution alignment module brings 1.30% and 3.04% improvement on CIFAR-10 and SVHN, and the cross-set sample augmentation module gives 6.18% and 3.06% improvement, respectively. By integrating both modules, our ADA-Net reduces the classification error rates from 19.97% and 13.80% to 8.87% and 5.90% on the CIFAR-10 and SVHN datasets, respectively. These experimental results clearly validate our motivation, and also demonstrate the effectiveness of our proposed approach.
5.3 Experimental Analysis
Feature visualization:
To better understand how our ADA-Net works, we use the base CNN block as a feature extractor, and visualize the labeled samples, the unlabeled samples, and the generated pseudo samples on the SVHN dataset with the t-SNE approach in Figure 4. The features extracted using the baseline CNN trained with only labeled data are also visualized for comparison. As shown in Figure 4, a considerable distribution difference between labeled and unlabeled samples can be observed for the baseline CNN model, and the generated pseudo samples lie in between those two sets. In contrast, with our ADA-Net, the distributions of the three types of samples are similar, since we explicitly align the distributions of labeled and unlabeled samples in the training procedure.



Table 1: Classification error rates on CIFAR-10 and SVHN for the ablation study ("dist": adversarial distribution alignment; "aug": cross-set sample augmentation).

| Method   | dist | aug | CIFAR-10 | SVHN   |
| Baseline |      |     | 19.97%   | 13.80% |
| Ours     | ✓    |     | 18.67%   | 10.76% |
| Ours     |      | ✓   | 13.79%   | 10.74% |
| Ours     | ✓    | ✓   | 8.87%    | 5.90%  |
Feature distribution:
To further show the effectiveness of our ADA-Net in reducing the distribution mismatch, we take the first five feature activations of the baseline CNN model and of our ADA-Net as examples, and plot the distributions of labeled and unlabeled samples on each dimension individually. Each distribution is obtained by performing kernel density estimation [33, 36] on each type of samples and each dimension separately. As shown in Figure 6, we again observe a considerable mismatch between the estimated empirical distributions of labeled and unlabeled samples for the baseline CNN model. This distribution mismatch is well reduced in our ADA-Net model. We have similar observations for the other feature activations.
Varying number of labeled samples:
As discussed in Section 3.1, the distribution mismatch in semi-supervised learning is correlated with the number of labeled samples, and it often becomes more serious when fewer labeled samples are available. To validate the effectiveness of our ADA-Net under different sample sizes, we conduct experiments on the SVHN dataset by training models with varying numbers of labeled samples, while all other experimental settings remain the same. The error rates of our ADA-Net and the baseline CNN are plotted in Figure 5. We observe that the error rate of the baseline CNN model increases dramatically when the number of labeled samples is reduced, which indicates that the sampling bias makes the learning problem more challenging. Nevertheless, our ADA-Net consistently improves the classification performance by alleviating such sampling bias with the augmented distribution alignment, and the relative improvement is more obvious when labeled samples are rare.
5.4 Comparison with State-of-the-Arts



Table 2: Classification error rates of different SSL methods on CIFAR-10 and SVHN.

| Method                   | CIFAR-10 | SVHN  |
| Π Model [26]             | 12.36%   | 4.82% |
| Temporal Ensembling [26] | 12.16%   | 4.42% |
| Mean Teacher [40]        | 12.31%   | 3.95% |
| VAT [30]                 | 11.36%   | 5.42% |
| VAT + Ent [30]           | 10.55%   | 3.86% |
| SaaS [8]                 | 13.22%   | 4.77% |
| MA-DNN [7]               | 11.91%   | 4.21% |
| ADA-Net (Ours)           | 10.30%   | 4.62% |
| ADA-Net+ (Ours)          | 10.09%   | 3.54% |



Table 3: Top-1 and Top-5 error rates on ImageNet with 10% labeled data.

| Method                           | Top-1  | Top-5  |
| 100% Supervised                  | 30.43% | 10.76% |
| 10% Supervised                   | 52.23% | 27.54% |
| Mean Teacher [40]                | 49.07% | 23.59% |
| Dual-View Deep Co-Training [34]  | 46.50% | 22.73% |
| ADA-Net (Ours)                   | 44.91% | 21.18% |
We further compare our ADA-Net with recently proposed state-of-the-art SSL approaches, including the Π Model [26], Temporal Ensembling [26], Mean Teacher [40], VAT [30], VAT + Ent [30], SaaS [8], and MA-DNN [7].
As discussed in [32], minor modifications in the network structure and data processing method can often lead to different results. To ensure a fair comparison, we take the VAT method [30] as a reference, and strictly follow its experimental setup. In particular, we re-implement our ADA-Net based on the released code at https://github.com/takerum/vat_tf. The same Conv-Large architecture is used as the backbone network, and the hyper-parameters are set to be the same as in [30].
We report the results of the different methods on the CIFAR-10 and SVHN datasets in Table 2. Our ADA-Net achieves competitive results compared with the state-of-the-art SSL methods. Despite the simplicity of our augmented distribution alignment, the results clearly validate the importance of dealing with the empirical distribution mismatch in semi-supervised learning, and also demonstrate the effectiveness of our ADA-Net. More importantly, as we solve the SSL problem from a new perspective not revealed by previous works, our augmented distribution alignment strategy is generally complementary to other methods. Therefore, the performance of existing SSL methods can be boosted by incorporating the distribution alignment and cross-set sample augmentation modules proposed in this work. As shown in Table 2, by combining our ADA-Net with the VAT + Ent method (denoted as "ADA-Net+"), we push the envelope of SSL on these two benchmark datasets, and achieve new state-of-the-art error rates of 10.09% and 3.54%, respectively.
We additionally report our results on the 1000-class ImageNet dataset with 10% of the labels in Table 3. We compare our results with the previous state-of-the-art methods Mean Teacher [40] and Deep Co-Training [34]. The result of Deep Co-Training is quoted from their paper, and the performance of Mean Teacher comes from running their official implementation by [34]. Following [34], we train a ResNet-18 [20] for 600 epochs with a batch size of 256, and we set . ADA-Net performs better than both methods and outperforms Dual-View Deep Co-Training by 1.59% on the Top-1 error rate and 1.55% on the Top-5 error rate.
6 Conclusions
In this work, we have proposed a new semi-supervised learning method called augmented distribution alignment. In particular, we tackle the semi-supervised learning problem from a new perspective: labeled and unlabeled data often exhibit a considerable difference in terms of their empirical distributions. We therefore employed an adversarial training strategy to align the distributions of labeled and unlabeled samples when training the neural networks. A cross-set sample augmentation strategy was further proposed to deal with the limited sample size and bridge the distribution gap. These two strategies can be readily unified into existing deep neural networks, leading to our ADA-Net. Experiments on the benchmark CIFAR-10 and SVHN datasets have validated the effectiveness of our approach.
References
 [1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
 [2] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209–239, 2004.
 [3] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
 [4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th International Conference on Machine Learning (ICML), 2001.
 [5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92–100. ACM, 1998.
 [6] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Advances in neural information processing systems, pages 601–608, 2003.
 [7] Y. Chen, X. Zhu, and S. Gong. Semi-supervised deep learning with memory. In The European Conference on Computer Vision (ECCV), 2018.
 [8] S. Cicek, A. Fawzi, and S. Soatto. SaaS: Speed as a supervisor for semisupervised learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 149–163, 2018.
 [9] C. Cortes and M. Mohri. Domain adaptation in regression. In International Conference on Algorithmic Learning Theory, pages 308–323. Springer, 2011.
 [10] M. Dudik, S. J. Phillips, and R. E. Schapire. Correcting sample selection bias in maximum entropy density estimation. In Advances in neural information processing systems, pages 323–330, 2006.
 [11] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE international conference on computer vision, pages 2960–2967, 2013.
 [12] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
 [13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
 [14] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.
 [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [16] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In 2011 international conference on computer vision, pages 999–1006. IEEE, 2011.
 [17] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
 [18] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
 [19] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. In Advances in neural information processing systems, pages 673–681, 2009.
 [20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [21] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
 [22] J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
 [23] H. Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
 [24] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 290–297, 2003.
 [25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [26] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
 [27] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
 [28] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.
 [29] T. M. Mitchell. The role of unlabeled data in supervised learning. In Language, Knowledge, and Representation, pages 103–111. Springer, 2004.
 [30] T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [31] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [32] A. Odena, A. Oliver, C. Raffel, E. D. Cubuk, and I. Goodfellow. Realistic evaluation of semi-supervised learning algorithms. In Advances in neural information processing systems, 2018.
 [33] E. Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
 [34] S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille. Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–152, 2018.
 [35] C. Rosenberg, M. Hebert, and H. Schneiderman. Semi-supervised self-training of object detection models. 2005.
 [36] M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pages 832–837, 1956.
 [37] S. Rosset, J. Zhu, H. Zou, and T. J. Hastie. A method for inferring label sampling mechanisms in semi-supervised learning. In Advances in neural information processing systems, pages 1161–1168, 2005.
 [38] B. Sun and K. Saenko. Subspace distribution alignment for unsupervised domain adaptation. In BMVC, pages 24–1, 2015.
 [39] G. J. Székely and M. L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249–1272, 2013.
 [40] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
 [41] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
 [42] V. Verma, A. Lamb, C. Beckham, A. Najafi, A. Courville, I. Mitliagkas, and Y. Bengio. Manifold mixup: Learning better representations by interpolating hidden states. 2018.
 [43] Y. Wu et al. Tensorpack, 2016.
 [44] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, 1995.
 [45] H. Zhang, M. Cisse, Y. N. Dauphin, and D. LopezPaz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
 [46] W. Zhang, W. Ouyang, W. Li, and D. Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.
 [47] X. J. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.