Domain Confusion with Self Ensembling for Unsupervised Adaptation
Abstract
Data collection and annotation are time-consuming in machine learning, especially for large-scale problems. A common approach to this problem is to transfer knowledge from a related labeled domain to a target one. There are two popular ways to achieve this goal: adversarial learning and self-training. In this article, we first analyze the training instability problem and the mistaken confusion issue in the adversarial learning process. Then, inspired by domain confusion and self-ensembling methods, we propose a combined model that learns a representation jointly invariant in features and classes, namely Domain Confusion with Self Ensembling (DCSE). Experiments show that the proposed approach outperforms prior art on a variety of unsupervised domain adaptation benchmarks.
1 Introduction
An essential task in visual recognition is to design a model that can adapt to dataset distribution bias Du et al. (2013); Yuan et al. (2018); Sun et al. (2018), in which one attempts to transfer knowledge from a labeled source domain to an unlabeled target domain. For example, we sometimes have a real-world recognition task in one domain of interest but only limited training data in this domain. If we could train a recognition model on the almost unlimited supply of labeled simulation images from a 3D virtual world and then generalize it to the real world, the cost of manual labelling would be greatly reduced Rusu et al. (2016); Tan et al. (2018). To obtain satisfactory generalization capability, we turn to deep learning, which is known for its robust generalization performance Simonyan and Zisserman (2014); Krizhevsky et al. (2012); He et al. (2016); Luong et al. (2015); Szegedy et al. (2014); Russakovsky et al. (2014). However, deep learning models often need millions of labeled samples to fit millions of parameters, and it is hard to obtain enough data for supervised training when labeled data is hard to collect and annotate. As alternatives, semi-supervised and unsupervised learning methods can reduce this requirement Pan and Yang (2010); Pan (2016); Pan et al. (2016). Semi-supervised learning combines labeled data with unlabeled data from the target to perform the adaptation Wang et al. (2017); Tan et al. (2017); Tang et al. (2018). Unsupervised domain adaptation is a similar problem, in which the model attempts to exploit knowledge from the source domain to classify unlabeled data in the target domain Tzeng et al. (2014); Ganin and Lempitsky (2014); Tzeng et al. (2017); Gheisari and Baghshah (2015); Tonioni et al. (2017).
There has been extensive work on domain adaptation Ganin and Lempitsky (2014); Ghifary et al. (2016); Tzeng et al. (2014, 2017), which focuses on building a unified representation between the source and target domains. One such method is Domain Confusion (DC) Tzeng et al. (2014). We analyzed DC methods based on the conventional GAN-form loss function, which is approximately equivalent to the RevGrad approach Ganin and Lempitsky (2014), and found that the model is hard to train stably. We call this the training instability problem. Additionally, we visualized the prediction results with a confusion matrix (see Figure 1), which shows that DC methods can only align the overall feature distribution rather than the feature distribution conditioned on the images of each class. Figure 2 shows what the confused feature distribution looks like. We call this the feature conditional distribution misalignment problem. Apart from this, recent work presented another domain adaptation technique based on the mean-teacher framework French et al. (2017); Tarvainen and Valpola (2017), which achieved good results in some relatively simple domain adaptation settings, such as USPS→MNIST and SVHN→MNIST. Transferring knowledge from MNIST to SVHN, however, remains difficult, because the model must learn from grayscale images and generalize to RGB images.
In this paper, we aim at developing a “Domain Confusion with Self Ensembling (DCSE)” method for domain adaptation. The contributions are as follows (see Figure 4 for a schematic overview):

To address the training instability problem in the unsupervised domain adaptation task, we introduce the Wasserstein GAN (WGAN) algorithm, which Arjovsky et al. (2017) showed theoretically to be more stable than the conventional GAN. With it, we no longer need to balance the discriminator and the feature extractor in our model, and we can build further improvements on this stable WGAN model.

To address the feature conditional distribution misalignment problem, we aim to form a model that finds a joint representation of classes and features. Thus we need not only the feature information within the source domain but also the label information in the target domain. The main intuition is that a self-ensembling method can provide pseudo labels in the target domain Tarvainen and Valpola (2017); French et al. (2017).

Based on these ideas, we propose an improved unsupervised domain adaptation method that combines the two domain adaptation methods mentioned above. We simultaneously use the domain confusion method and the self-ensembling method to guide the target domain representation, so that the target representation distribution not only aligns as a whole with the source representation but also aligns with the feature conditional distribution of the source domain. As an additional benefit, we found that combining the two methods improves classification accuracy. For example, the self-ensembling framework can use unlabeled data to improve accuracy in domain adaptation, but it often gets stuck at low accuracy because of the large appearance gap between the source and target datasets French et al. (2017). In our combined model, the DC method helps it improve by finding a domain-invariant representation despite the dataset bias. Details are described in Section 4.2. As a result of this combination, we improve the state of the art on the cross-domain tasks MNIST↔SVHN and CIFAR↔STL under unsupervised domain adaptation settings.
The rest of this paper is organized as follows. Section 2 reviews related work on domain confusion and self-ensembling techniques. Section 3 analyzes the limitations of existing methods to motivate our work. Section 4 describes our approach. Section 5 gives experimental details and results, and Section 6 presents our conclusions.
2 Related Works
We first review image-based approaches to visual domain adaptation. Russo et al. (2017) presented SBADA-GAN, which simultaneously transforms source images into target images and vice versa. It jointly optimizes the bidirectional image mappings with a classification loss, an adversarial loss, and a class consistency loss that aligns the generators in the two directions. The approach can be viewed as data augmentation by style transfer, which avoids having to decide a priori which augmentation strategy is best. SBADA-GAN greatly improves performance on MNIST→SVHN, from the prior art of 52.8% to 61.1%.
Recent work has also focused on feature-based methods, which transfer the deep feature representation of a neural network from a labeled source domain to an unlabeled target domain. Ganin and Lempitsky (2014) proposed the Gradient Reversal Layer, which achieves unsupervised domain adaptation by aligning the feature distributions of the source and target domains, treating the classification task as finding a domain-invariant representation in feature space. Furthermore, Tzeng et al. (2017) proposed a unified framework, ADDA, for unsupervised domain adaptation. It offers a simple and easily understood view of recently proposed domain adaptation methods, combining discriminative modeling, untied weight sharing, and a GAN loss into a general framework. ADDA first pretrains a source encoder CNN on labeled source data and then learns a target encoder CNN by confusing domain features through adversarial learning. As a result, the discriminator cannot reliably distinguish the feature representations of source and target. In other words, ADDA learns a jointly invariant representation between domains. During testing, target domain image features are computed with the target encoder CNN and classified by the source domain classifier.
In contrast, the Self-Ensembling (SE) method French et al. (2017) is completely different from the previous approaches. It is derived from the mean teacher Tarvainen and Valpola (2017) and applied to unsupervised domain adaptation. The model consists of a student network and a teacher network. The student is trained with the cross-entropy loss as usual, and the teacher’s weights are the exponential moving average of the student’s weights. Meanwhile, when the maximum predicted probability of a sample exceeds a predefined confidence threshold of 0.968, the teacher’s self-ensembling predicted probability serves as the consistency target for the student. Their approach achieved state-of-the-art results on a variety of benchmarks.
3 Limitations of existing methods
Before introducing our domain confusion with self-ensembling approach, we first give the intuition behind it. In the domain adaptation task, labels are available only in the source domain. We aim to train a feature extractor that learns a joint representation of the source and target domains, so that the classifier can also classify target domain images. One of the most popular methods is DC Tzeng et al. (2014), which uses a domain classifier to predict the domain of a feature representation and a Gradient Reversal Layer (GRL) to minimize the domain discrepancy. Reversing the gradient from the domain classifier in the GRL can be regarded as backpropagation with a loss function multiplied by a negative constant; consequently, the model is equivalent to a conventional GAN. In the following we discuss GAN-form DC methods and show their two main limitations in domain adaptation: the training instability problem and the feature conditional distribution misalignment problem.
3.1 Training instability problem of conventional GAN-form techniques
First we construct a feature extractor, a classifier, and a discriminator (see Figure 4, excluding the teacher part, for an overview). When source and target domain images are fed into the feature extractor together, the classifier tries to recognize the category of the source images using the cross-entropy loss as usual, and the discriminator tries to determine which domain each feature representation comes from. Meanwhile, the feature extractor tries to use the feature representations from the target domain to fool the discriminator into predicting the label “1”. We thus formalize a conventional GAN loss as described in Goodfellow et al. (2014). The GAN objective is defined as follows:
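(1) $\mathcal{L}_D = -\mathbb{E}_{x_s \sim \mathbb{P}_s}\left[\log D(F(x_s))\right] - \mathbb{E}_{x_t \sim \mathbb{P}_t}\left[\log\left(1 - D(F(x_t))\right)\right]$
(2) $\mathcal{L}_F = -\mathbb{E}_{x_t \sim \mathbb{P}_t}\left[\log D(F(x_t))\right]$
where $F$ is the feature extractor mapping function. It takes images $x_s$ sampled from the source domain distribution $\mathbb{P}_s$ and images $x_t$ sampled from the target domain distribution $\mathbb{P}_t$ as input, and outputs their feature representations, whose distributions we denote by $\mathbb{P}_{f_s}$ and $\mathbb{P}_{f_t}$. $D$ is the discriminator, which maximizes the probability of assigning the correct domain label to all training examples.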
After evaluating this method on the MNIST→SVHN domain adaptation task, we find that the model often produces unstable results even after we carefully tune the hyperparameters and add regularization. We follow the discussion in Gulrajani et al. (2017) to explain this instability. In unsupervised domain adaptation with the GAN-form DC technique, if the discriminator is trained to optimality, optimizing the feature extractor becomes the same as minimizing the Jensen-Shannon divergence between the source and target feature distributions Goodfellow et al. (2014). In addition, we can assume that the supports of natural images lie on low-dimensional manifolds, so their feature maps cannot fill the whole feature space. Therefore, the probability measure of the intersection of the source and target manifolds tends to zero. When the manifolds of the two feature distributions do not overlap, the Jensen-Shannon divergence is a constant ($\log 2$), so the feature extractor receives no useful gradient to update itself.
Therefore, when we optimize the aforementioned GAN loss functions for domain adaptation, the target feature distribution can remain far from that of the source domain, so target domain images are hard for the classifier to classify correctly, and the model obtains poor accuracy on the target domain recognition task. According to this analysis, it is hard to improve model performance on the basis of GAN-form domain confusion methods.
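The saturation of the Jensen-Shannon divergence can be checked numerically. The sketch below is only an illustration on a hypothetical 1-D grid of ten bins (the helper names `kl`, `js`, and `point_mass` are ours, not from the paper's code): two distributions with disjoint supports have the same divergence, log 2, whether they are close together or far apart.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(position, size=10):
    """Distribution with all of its mass on one bin of a 1-D grid."""
    return [1.0 if i == position else 0.0 for i in range(size)]

source = point_mass(0)
near_target = point_mass(1)   # disjoint support, one bin away
far_target = point_mass(9)    # disjoint support, nine bins away

js_near = js(source, near_target)
js_far = js(source, far_target)
print(js_near, js_far, math.log(2))
```

Because the value does not change as the target moves closer to the source, a loss built on it gives the feature extractor no signal about which direction reduces the gap.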
Table 1: Feature extractor architecture.
Description                                            Shape
Conv 3×3×128, stride 1, pad 1, instance norm           32×32×128
Conv 3×3×128, stride 1, pad 1, instance norm           32×32×128
Conv 3×3×128, stride 1, pad 1, instance norm           32×32×128
Dropout, 50%                                           32×32×128
Conv 3×3×256, stride 2, pad 1, instance norm           16×16×256
Conv 3×3×256, stride 1, pad 1, instance norm           16×16×256
Conv 3×3×256, stride 1, pad 1, instance norm           16×16×256

Table 2: Classifier/critic architecture.
Description                                            Shape
Dropout, 50%                                           16×16×256
Conv 3×3×512, stride 2, pad 1, instance norm           8×8×512
Conv 3×3×512, stride 1, pad 1, instance norm           8×8×512
Conv 3×3×512, stride 1, pad 1, instance norm           8×8×512
Global pooling layer                                   1×1×512
Fully connected layer, 10 (critic: 1) units, softmax   10 (critic: 1)
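The output shapes listed in the architecture tables follow from the standard convolution size formula, floor((n + 2·pad − kernel) / stride) + 1. As a quick consistency check, the sketch below traces the spatial size through the convolution layers for a 32×32 input; the function names and the `(channels, stride)` layer encoding are illustrative assumptions, not part of the original implementation.

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1

def trace_shapes(input_size, layers):
    """Return (height, width, channels) after each conv layer.

    `layers` is a list of (out_channels, stride) pairs; every layer is
    assumed to use a 3x3 kernel with padding 1, as in the tables.
    """
    shapes = []
    size = input_size
    for channels, stride in layers:
        size = conv_out(size, stride=stride)
        shapes.append((size, size, channels))
    return shapes

# Convolution layers of the feature extractor and of the classifier/critic.
feature_extractor = [(128, 1), (128, 1), (128, 1), (256, 2), (256, 1), (256, 1)]
classifier = [(512, 2), (512, 1), (512, 1)]

fe_shapes = trace_shapes(32, feature_extractor)
cls_shapes = trace_shapes(fe_shapes[-1][0], classifier)
print(fe_shapes[-1], cls_shapes[-1])
```

Dropout and instance normalization leave the shape unchanged, so they are omitted from the trace.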
3.2 Feature conditional distribution misalignment problem
Some unsupervised domain adaptation tasks remain unsolved because of the large gaps between domains. They are still challenging even with the adversarial training paradigm because of the feature conditional distribution misalignment problem, which we analyze as follows. First we build the model shown in Tables 1 and 2, which is jointly trained with a cross-entropy classification loss and an adversarial loss.
In repeated experiments with this same model, we find that the results shown in the confusion matrices (Figure 1) differ greatly from run to run, although the architecture and the hyperparameters are exactly the same. One possible explanation is that the adversarial loss can only push the target feature distribution as a whole toward the source feature distribution, instead of aligning the feature distributions conditioned on each image category between the source and target domains, as shown in Figure 2.
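Formally, let $X_s^i$ denote the images of class $i$ in the source domain, and $X_t^i$ those in the target domain. The goal of domain adaptation is to make the source domain conditional distribution $P(Y_s \mid X_s)$ as similar as possible to the target domain conditional distribution $P(Y_t \mid X_t)$, where $Y_s$ and $Y_t$ are the class random variables over the label space. Let $Z_s = F(X_s; \theta)$ be the feature variable, where $X_s$ is a random variable of source images with density $p_s$ and $\theta$ denotes the parameters of the feature extractor $F$; $Z_t$ is defined likewise. Decomposing the conditional distributions gives:
(3) $P(Y_s \mid X_s) = \sum_{z} P(Y_s \mid Z_s = z)\, P(Z_s = z \mid X_s)$
(4) $P(Y_t \mid X_t) = \sum_{z} P(Y_t \mid Z_t = z)\, P(Z_t = z \mid X_t)$
where $P(Z \mid X)$ is the feature distribution conditioned on the class images, and $P(Y \mid Z)$ is the class distribution conditioned on features, which is parameterized by the classifier. When $P(Z_s \mid X_s)$ is similar to $P(Z_t \mid X_t)$ and $P(Y_s \mid Z_s)$ is similar to $P(Y_t \mid Z_t)$, the goal of domain adaptation is achieved. In the conventional GAN-form domain adaptation model, the source domain classification loss determines the conditional distribution $P(Y \mid Z)$, which fixes it for both domains. The GAN loss then constrains the overall feature distributions so that $P(Z_s) \approx P(Z_t)$. The remaining problem is that, although $P(Z_s) \approx P(Z_t)$, the per-class relation $P(Z_s \mid X_s^i) \approx P(Z_t \mid X_t^i)$ holds only partly, because each overall feature distribution is a mixture of the per-class feature distributions, and aligning the mixtures does not force the per-class components to align. This means that $P(Z_t \mid X_t^i)$ would sometimes align mistakenly with $P(Z_s \mid X_s^j)$ for some $j \neq i$. Therefore, it is difficult for $P(Y_t \mid X_t)$ to approximate $P(Y_s \mid X_s)$.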
The above analysis suggests that the adversarial learning method alone is inherently limited, which makes it hard to substantially surpass the best results on unsupervised domain adaptation benchmarks.
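The misalignment can be made concrete with a deliberately extreme toy construction (our own illustration, not one of the experiments reported here). In the sketch below, 1-D source and target features have exactly the same overall distribution, so a domain discriminator could not separate them, yet the classes are swapped and a classifier that is perfect on the source is wrong on every target sample. All names are hypothetical.

```python
# Each sample is a (feature, label) pair with a 1-D feature.
# Source: class 0 sits at -1, class 1 sits at +1.
source = [(-1.0, 0), (1.0, 1)] * 50
# Target: same overall feature distribution, but the classes are swapped.
target = [(1.0, 0), (-1.0, 1)] * 50

def classify(feature):
    """Decision rule that is optimal on the source domain."""
    return 1 if feature > 0 else 0

def accuracy(data):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(classify(f) == y for f, y in data) / len(data)

def overall_features(data):
    """Sorted features with labels dropped: the marginal the critic sees."""
    return sorted(f for f, _ in data)

marginals_match = overall_features(source) == overall_features(target)
source_acc = accuracy(source)
target_acc = accuracy(target)
print(marginals_match, source_acc, target_acc)
```

Real runs are less extreme, but the same mechanism would explain why identical training configurations can yield very different confusion matrices.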
4 Method
As described in the previous section, two problems arise when the GAN adversarial learning method is used for domain adaptation. In this section, we first introduce the Wasserstein GAN technique to address the training instability problem. Then we use the self-ensembling method to tackle the feature conditional distribution misalignment problem. The following subsections describe our method in detail.
4.1 Domain confusion using Wasserstein GAN
Instead of using the GAN-form domain confusion method, which suffers from the training instability problem, we follow Arjovsky et al. (2017) and introduce the advantages of the Earth-Mover distance (Wasserstein distance) for domain adaptation.
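Let $\Pi(\mathbb{P}_{f_s}, \mathbb{P}_{f_t})$ be the set of all joint distributions whose marginal distributions are the source feature distribution $\mathbb{P}_{f_s}$ and the target feature distribution $\mathbb{P}_{f_t}$. The Wasserstein distance is:
(5) $W(\mathbb{P}_{f_s}, \mathbb{P}_{f_t}) = \inf_{\gamma \in \Pi(\mathbb{P}_{f_s}, \mathbb{P}_{f_t})} \mathbb{E}_{(u, v) \sim \gamma}\left[\lVert u - v \rVert\right]$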
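For intuition, the Wasserstein distance is easy to evaluate in one dimension: for two empirical distributions with the same number of samples, the optimal coupling pairs the sorted samples, so the distance is the mean absolute difference between them. The sketch below relies on that 1-D special case (the function name and the sample values are illustrative assumptions) and shows that, unlike a saturating divergence, the distance keeps growing as two non-overlapping sample sets move apart.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(a) != len(b):
        raise ValueError("this shortcut assumes equally many samples")
    # In 1-D the optimal transport plan matches samples in sorted order.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

source = [0.0, 0.1, 0.2, 0.3]
near_target = [x + 1.0 for x in source]   # shifted by 1, no overlap
far_target = [x + 5.0 for x in source]    # shifted by 5, no overlap

w_near = wasserstein_1d(source, near_target)
w_far = wasserstein_1d(source, far_target)
print(w_near, w_far)
```

The distance scales with the shift, which is the property that lets it provide a usable training signal when the supports are disjoint.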
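Compared with the Jensen-Shannon (JS) divergence discussed in Section 3.1, the Wasserstein distance still reflects the distance between two distributions even when their manifolds do not overlap Arjovsky et al. (2017). For this reason, we choose the Wasserstein distance to measure the discrepancy between the source and target feature distributions $\mathbb{P}_{f_s}$ and $\mathbb{P}_{f_t}$. This property allows the feature extractor to be trained stably. By minimizing the Wasserstein distance between the feature distributions, we can therefore reduce the dataset shift. The WGAN loss function Gulrajani et al. (2017) is:
(6) $\mathcal{L}_D = \mathbb{E}_{x_t \sim \mathbb{P}_t}\left[D(F(x_t))\right] - \mathbb{E}_{x_s \sim \mathbb{P}_s}\left[D(F(x_s))\right] + \lambda\, \mathbb{E}_{\hat{f} \sim \mathbb{P}_{\hat{f}}}\left[\left(\lVert \nabla_{\hat{f}} D(\hat{f}) \rVert_2 - K\right)^2\right]$
(7) $\mathcal{L}_F = -\mathbb{E}_{x_t \sim \mathbb{P}_t}\left[D(F(x_t))\right]$
where $\mathbb{P}_s$ and $\mathbb{P}_t$ are the source and target image distributions, $F$ is the feature extractor, and $\hat{f}$ is sampled uniformly along straight lines between pairs of source and target features, as in Gulrajani et al. (2017).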
When using WGAN, we call the network a critic instead of a discriminator, but for convenience the critic is still written as $D$ in the formulas. The last term of Formula (6) enforces that the critic satisfies $K$-Lipschitz continuity. The two hyperparameters of this loss, the penalty weight $\lambda$ and the Lipschitz constant $K$, are empirically set to 10 and 1, respectively, simply following the setting of Gulrajani et al. (2017). After experimenting with this WGAN loss in many different settings, we find it very robust; see Figure 3 for an example. We can now use the domain confusion technique without worrying about stability, which allows us to build improvements on this WGAN model.
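To make the structure of the critic and feature extractor losses concrete, the sketch below evaluates a WGAN loss with gradient penalty for a linear critic D(f) = w·f + b. This is a simplifying assumption made only for illustration: for a linear critic the gradient with respect to the features equals w everywhere, so the penalty reduces to (‖w‖₂ − K)² and no automatic differentiation is needed. The sign convention (target features play the role of generated samples) and all names are our assumptions; the actual model uses the convolutional critic of Table 2.

```python
import math

def critic(w, b, f):
    """Linear critic score for one feature vector."""
    return sum(wi * fi for wi, fi in zip(w, f)) + b

def mean(values):
    return sum(values) / len(values)

def critic_loss(w, b, source_feats, target_feats, lam=10.0, k=1.0):
    """Critic loss: score gap between domains plus gradient penalty."""
    gap = (mean([critic(w, b, f) for f in target_feats])
           - mean([critic(w, b, f) for f in source_feats]))
    # For a linear critic the gradient w.r.t. the features is w everywhere.
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return gap + lam * (grad_norm - k) ** 2

def extractor_loss(w, b, target_feats):
    """Feature extractor loss: raise the critic score of target features."""
    return -mean([critic(w, b, f) for f in target_feats])

source_feats = [[2.0, 0.0], [2.0, 1.0]]
target_feats = [[0.0, 0.0], [0.0, 1.0]]

unit_critic = critic_loss([1.0, 0.0], 0.0, source_feats, target_feats)
steep_critic = critic_loss([2.0, 0.0], 0.0, source_feats, target_feats)
ext = extractor_loss([1.0, 0.0], 0.0, target_feats)
print(unit_critic, steep_critic, ext)
```

The critic with gradient norm 2 separates the domains more strongly but pays a penalty of 10, which illustrates how the last term keeps the critic close to the Lipschitz constraint.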
4.2 Domain confusion with self-ensembling
The self-ensembling model builds upon the WGAN domain confusion model described above. To solve the feature conditional distribution misalignment problem, we add a self-ensembling loss to the classification loss and the WGAN loss, which is obtained by adding a teacher structure. The detailed model is shown in Figure 4.
Self-ensembling technique
Following French et al. (2017), we first construct a student network and a teacher network that share the same architecture. After every training batch, the teacher’s parameters $\theta'$ are updated as an exponential moving average of the student’s parameters $\theta$, using $\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha)\theta_t$ with $\alpha$ set to 0.99. We use the self-ensembling labels given by the teacher network to construct a class label consistency loss, which guides the target domain feature conditional distribution to align with that of the source domain. The self-ensembling loss is calculated as follows. Each unlabeled target domain sample $x_t$ is passed through the student network with data augmentation and dropout, and through the teacher network without any modification. This yields the student’s predicted vector $p$ and the teacher’s pseudo predicted vector $p'$, where $p_i$ indicates the predicted probability of class $i$. We then compute the self-ensembling loss of each sample as the mean squared error between $p$ and $p'$. Finally, we compute the confidence of the sample as $\max_i p'_i$. If it is below the confidence threshold of 0.968, the self-ensembling loss for the sample is masked to 0, following the setting of French et al. (2017).
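The two ingredients described above, the exponential moving average update of the teacher and the confidence-masked consistency loss, can be sketched in a few lines. This is a minimal illustration on plain Python lists rather than network tensors; the function names are hypothetical, and averaging the masked loss over the whole batch (rather than over the confident samples only) is an assumption.

```python
def ema_update(teacher, student, alpha=0.99):
    """Move each teacher parameter toward the matching student parameter."""
    return [alpha * t + (1 - alpha) * s for t, s in zip(teacher, student)]

def self_ensembling_loss(student_probs, teacher_probs, threshold=0.968):
    """Masked mean squared error between student and teacher predictions.

    Returns the batch-averaged loss and the number of confident samples.
    Samples whose teacher confidence does not exceed the threshold
    contribute zero loss.
    """
    total = 0.0
    confident = 0
    for p_student, p_teacher in zip(student_probs, teacher_probs):
        if max(p_teacher) > threshold:
            confident += 1
            squared = [(a - b) ** 2 for a, b in zip(p_student, p_teacher)]
            total += sum(squared) / len(squared)
    return total / len(student_probs), confident

teacher_params = ema_update([1.0, 0.0], [0.0, 1.0])

student_probs = [[0.90, 0.10], [0.50, 0.50]]
teacher_probs = [[0.98, 0.02], [0.60, 0.40]]   # only the first is confident
loss, n_confident = self_ensembling_loss(student_probs, teacher_probs)
print(teacher_params, loss, n_confident)
```

With alpha = 0.99 the teacher changes slowly, so its predictions act as a temporally smoothed target for the noisy student.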
The self-ensembling technique used here can be viewed as a soft-clustering algorithm. The teacher in our model can be seen as an ensemble model weighted over time, so samples that receive high-confidence predictions from the teacher are more likely to be classified correctly, and their predictions can serve as pseudo labels for the student. Moreover, the noisy student acts as a regularizer that smooths the decision boundary in feature space. For example, when target domain images pass through the teacher network and receive high-confidence outputs, we expect the student’s features to lie, with high probability, near the teacher’s on the feature manifold, because data augmentation and dropout do not change the category of an image. When we push the noisy student’s outputs toward the teacher’s, features of the same category are clustered together. This follows the manifold assumption, under which unlabeled data makes the feature space denser. It facilitates local category division in the feature space and helps the decision boundary fit the data better.
Class balance
As French et al. (2017) demonstrated, the challenging MNIST→SVHN benchmark remained unsolved because of the training instability of the plain self-ensembling technique. When the model is trained without class balance, the error rate first decreases and then rises to high values before training stops. The authors hypothesize that class imbalance in the SVHN dataset causes the self-ensembling pseudo labels to predict the “1” class more often than the others, which makes the model degenerate into a local minimum.
They addressed this problem with a class balance technique that balances the class predictions within each minibatch. Let $p$ denote the class probability vector predicted for a target domain sample; its mean $\bar{p}$ is computed by averaging over the batch dimension. The class balance loss is the binary cross-entropy between $\bar{p}$ and a uniform probability vector. Finally, because the self-ensembling loss is applied only to confident training samples, the class balance loss is weighted by the ratio of confident samples to all training samples. We additionally weight it by 0.005 so that it does not overwhelm the other losses. We use this technique in the MNIST→SVHN experiments.
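A minimal sketch of the class balance term described above is given below. It assumes that the uniform vector is the target of the binary cross-entropy and the batch-mean prediction is the input, and that the per-class terms are averaged; the function name and the clipping constant `eps` are our own choices rather than details from the paper.

```python
import math

def class_balance_loss(target_probs, confident_ratio, weight=0.005, eps=1e-7):
    """Binary cross-entropy between the batch-mean prediction and uniform.

    `target_probs` is a batch of class probability vectors and
    `confident_ratio` is the fraction of confident samples in the batch.
    """
    n_classes = len(target_probs[0])
    batch = len(target_probs)
    mean_probs = [sum(p[c] for p in target_probs) / batch
                  for c in range(n_classes)]
    uniform = 1.0 / n_classes
    bce = 0.0
    for m in mean_probs:
        m = min(max(m, eps), 1.0 - eps)   # avoid log(0)
        bce -= uniform * math.log(m) + (1.0 - uniform) * math.log(1.0 - m)
    bce /= n_classes
    return weight * confident_ratio * bce

balanced = [[0.5, 0.5], [0.5, 0.5]]
skewed = [[0.9, 0.1], [0.9, 0.1]]     # the batch favors class 0

loss_balanced = class_balance_loss(balanced, confident_ratio=1.0)
loss_skewed = class_balance_loss(skewed, confident_ratio=1.0)
loss_unconfident = class_balance_loss(skewed, confident_ratio=0.0)
print(loss_balanced, loss_skewed, loss_unconfident)
```

The loss is smallest when the batch-mean prediction is uniform, so a batch in which the pseudo labels collapse onto one class is penalized.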
Noise & Data augmentation
Recent work has shown that using noise and data augmentation in a self-ensembling model can significantly improve its generalization capability Laine and Aila (2016); Tarvainen and Valpola (2017). Our goal, then, is to add more priors that expand the manifolds of data points in the input space or the feature space, so that the model forces decision boundaries to pass through sparse regions of the data manifolds rather than dense ones. This prevents wrong predictions from degrading the model and helps the teacher model reduce misclassification on the target domain. We therefore apply five augmentation strategies in our DCSE algorithm: 1) Dropout on the student model with a drop rate of 0.5. 2) Random brightness in the range [0.7, 1.3]. 3) Random contrast in the range [0.7, 1.3]. 4) Random saturation in the range [0.7, 1.3]. 5) Random affine augmentation using the affine transformation matrix shown as follows French et al. (2017).
(8) 
4.3 Functions of different losses
To stabilize the training process, we analyze the effects of the different losses in detail and propose an adaptive weighting technique. In our domain adaptation problem, we have labeled images in the source domain and unlabeled images in the target domain. For training, we design three losses: the cross-entropy classification loss in the usual supervised setting, the domain confusion loss for aligning the feature representations of the source and target domains, and the self-ensembling loss for aligning each feature conditional distribution between domains. We believe our self-ensembling model relies on the model assumption: when the model assumption is correct, unlabeled samples help improve model performance Sajjadi et al. (2016); Laine and Aila (2016). In our case, when the supervised term leads the model to a good model assumption in the source domain, the first unsupervised term, the domain confusion loss, makes it much easier for the classifier to classify target domain images by aligning the feature distributions. The second unsupervised term, the self-ensembling loss, lets the model reach a much better result by aligning the feature conditional distributions.
Based on the above analysis, we find that these losses play different roles at different training stages. First, following the earlier analysis, we use the source domain classification loss to construct the class distribution conditioned on features throughout the whole training phase. Second, the domain confusion loss should lead the model to a feasible assumption in the early stage of training, so that it can help the self-ensembling loss lead the model to better performance. In addition, because different datasets suit different weights of the domain confusion loss, we apply a weight factor that decays with the cosine ramp-down method Loshchilov and Hutter (2016), so that the factor gradually decreases to zero during training. Then, to tackle the feature conditional distribution misalignment problem, we introduce the self-ensembling loss. Because it relies strongly on the self-ensembling pseudo labels, which must be reasonably accurate, we apply it to align the feature conditional distributions only when the pseudo predicted probability is larger than 0.968, which makes it zero in the early unstable training stage. Furthermore, we also set a factor to weight the self-ensembling loss to fit different datasets. The program flow of our proposed approach DCSE is described in Algorithm 1.
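The weighting scheme described above can be sketched as follows. The cosine ramp-down is written in its common half-cosine form, which is an assumption about the exact schedule, and `total_loss` with its argument names is a hypothetical helper that only shows how the three terms are combined; it does not reproduce Algorithm 1.

```python
import math

def cosine_rampdown(step, total_steps, initial_weight):
    """Weight that decays from `initial_weight` to zero along a half cosine."""
    step = min(max(step, 0), total_steps)
    return initial_weight * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def total_loss(cls_loss, dc_loss, se_loss, step, total_steps,
               dc_weight, se_weight):
    """Supervised term plus the two weighted unsupervised terms."""
    dc_factor = cosine_rampdown(step, total_steps, dc_weight)
    return cls_loss + dc_factor * dc_loss + se_weight * se_loss

start = cosine_rampdown(0, 100, 3.0)
middle = cosine_rampdown(50, 100, 3.0)
end = cosine_rampdown(100, 100, 3.0)

# Early in training the domain confusion term dominates the unsupervised
# part; at the end only classification and self-ensembling remain.
early = total_loss(1.0, 2.0, 0.0, step=0, total_steps=100,
                   dc_weight=3.0, se_weight=5.0)
late = total_loss(1.0, 2.0, 0.1, step=100, total_steps=100,
                  dc_weight=3.0, se_weight=5.0)
print(start, middle, end, early, late)
```

In this sketch the self-ensembling term needs no schedule of its own, because the confidence mask already keeps it at zero while the teacher is unreliable.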
5 Experiments
Our implementation was developed using PyTorch Paszke et al. (2017).
5.1 Setting
Figure 4 shows a student network and a teacher network, which receive source and target data as input and output class probability predictions. The training loss is the sum of a supervised term and an unsupervised term. The supervised term is the classification loss, which uses cross-entropy on labeled source images. The unsupervised term consists of the WGAN loss and the self-ensembling loss. The WGAN loss is used to confuse the source and target domain feature representations. The self-ensembling loss is computed as the mean squared error between the student’s and the teacher’s predictions when the teacher’s confidence exceeds the threshold 0.968. Note that gradient descent is applied only to the student network, while the teacher is the exponential moving average of the student’s weights over training batches. We report test results using the teacher network. For more details of the model architecture, see Tables 1 and 2.
5.2 Hyperparameters
In all experiments, we used the RMSprop optimizer with a learning rate of 0.001. Each minibatch consisted of 128 samples. For the two loss-weighting hyperparameters, we have not found a practical way to optimize them in unsupervised domain adaptation, because no target domain labels are available for evaluation; this remains an open research question and is outside the scope of our study. We therefore use the test set examples for evaluation, so our experimental accuracies can be seen as upper bounds for our algorithm. Because other studies face the same problem, our results are fair and comparable. The hyperparameters are shown in Table 3.
Table 3: Hyperparameters for each adaptation task.
MNIST→SVHN    3       5
SVHN→MNIST    0.3     3
CIFAR→STL     0.05    1
STL→CIFAR     0.1     1
5.3 Datasets
Table 4: Accuracy (%) on the four adaptation tasks.
Method                                MNIST→SVHN    SVHN→MNIST    CIFAR→STL    STL→CIFAR
RevGrad Ganin and Lempitsky (2014)    35.67         73.91         66.12        56.91
DRCN Ghifary et al. (2016)            40.05         81.97         66.37        58.65
SE French et al. (2017)               41.98         99.22         75.51        69.15
SBADA-GAN Russo et al. (2017)         61.1          76.1          -            -
DC (ours)                             48.65         66.10         71.38        59.83
SE (ours)                             32.80         99.23         77.68        66.73
DCSE (ours)                           83.65         99.53         78.64        72.98
All results are shown in Table 4. We evaluated our algorithm on four cross-domain task pairs: 1) MNIST→SVHN, 2) SVHN→MNIST, 3) CIFAR→STL, 4) STL→CIFAR.
MNIST LeCun et al. (1998) is a grayscale handwritten digit dataset (see Figure (a)). All images were converted to three RGB channels to match the color dataset SVHN (see Figure (b)) Netzer et al. (2011). They were rescaled to 32×32 RGB images and normalized to [0,1]. This yields 60,000 MNIST images for training and 10,000 for testing, and 73,257 labeled SVHN images for training and 26,032 for testing.
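The channel conversion and normalization steps can be sketched as below. The sketch assumes 8-bit pixel values in 0..255 and converts grayscale to RGB by replicating the single channel, which is the usual choice but is not stated explicitly in the text; resizing to 32×32 is omitted, and the function names are illustrative.

```python
def to_unit_range(image):
    """Scale 8-bit pixel values (0..255) to the range [0, 1]."""
    return [[value / 255.0 for value in row] for row in image]

def gray_to_rgb(image):
    """Replicate the single gray channel into three identical RGB channels."""
    return [[(value, value, value) for value in row] for row in image]

# A tiny 2x2 grayscale "image" standing in for an MNIST digit.
gray = [[0, 255],
        [51, 102]]

rgb = gray_to_rgb(to_unit_range(gray))
print(rgb[0])
```

After this step a grayscale image has the same tensor layout as an SVHN image, so both domains can share one feature extractor.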
CIFAR (see Figure (c)) Krizhevsky (2009) and STL (see Figure (d)) Coates et al. (2011) consist of RGB images that share nine object classes: plane, car, bird, cat, deer, dog, horse, ship, and truck. CIFAR has 45,000 samples for training and 9,000 for testing, while STL has only 4,500 for training and 7,200 for testing. All images were scaled to 32×32 and normalized to [0,1].
Our experiments target the most challenging datasets in small-image domain adaptation. For ease of comparison, in each case we applied inductive unsupervised learning for domain adaptation, as did French et al. (2017). Only the training sets were used during training, while the test sets were used only for reporting accuracy. All results were obtained with the same architecture, shown in Tables 1 and 2.
MNIST→SVHN. This adaptation direction is difficult because the model must learn from grayscale images and generalize to color images. We first used the Wasserstein GAN as the domain confusion technique to mitigate the training instability problem, achieving 48.65% accuracy. To solve the class misalignment problem, we introduced the self-ensembling model to guide the target class feature representation to be as similar as possible to the source. With this modification, we strongly outperform prior art, improving accuracy from 61.1% to 83.65%. For comparison with the self-ensembling-based model, we used only the classification loss and the self-ensembling loss, which resulted in only 32.80% accuracy. It is worth noting that our DCSE approach far surpasses the domain confusion method alone. This also confirms that the feature misalignment problem in unsupervised domain adaptation has a large effect on model performance, and that the self-ensembling loss can partly solve it by generating pseudo predictions that correct the feature conditional distribution.
SVHN→MNIST. In this adaptation direction, we found that our DCSE algorithm is sometimes unstable at the beginning of training. We diagnosed the problem by removing the domain confusion loss and observed that the model could then be trained stably, reaching 99.23% accuracy. We hypothesized that the domain confusion loss may lead the model to a poor model assumption in the initial stage, so that the self-ensembling pseudo labels are very likely to be incorrect, which causes the model to degenerate. We overcame this problem by linearly ramping up the loss weight over the first 30 epochs in increments of 0.1. With this modification, we reached 99.53% accuracy, significantly surpassing domain-alignment techniques such as DC (ours) and RevGrad Ganin and Lempitsky (2014) as well as self-ensembling techniques. To the best of our knowledge, this is a state-of-the-art result in the domain adaptation setting, and it is close to the supervised results.
CIFAR→STL. The gaps in the CIFAR→STL direction are greater than those in the MNIST→SVHN direction, since the former exhibits much larger morphological changes than the latter. Furthermore, the source domain CIFAR has 45,000 images for supervised training, while the unlabeled target domain STL has only 4,500 images. We obtained a strong result of 78.64% in this experiment, which surpasses the prior art. For a fair comparison, we removed the WGAN loss and obtained 77.68% accuracy. This shows that our approach consistently outperforms SE-only approaches. Furthermore, in line with the analysis of the model assumption, it supports the view that the domain-alignment technique helps the model reach a better model assumption, so that the self-ensembling technique can lead the model to higher performance.
STL CIFAR In this experiment, we achieved 72.98% accuracy. This result significantly exceeds the accuracy of SE model using selfensembling method only. We consider that WGAN loss can gain much more informations by aligning feature distribution since STL has only 4,500 images be trained supervisely, which leads the model arrive in a better model assumption. Thus the unlabeled images are able to help the model become more accurate by selfensembling.
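As a reminder of how unlabeled target images can refine the model through self-ensembling, the following sketch shows the weight-averaged teacher update in the style of Tarvainen and Valpola (2017), on which self-ensembling methods for domain adaptation build. The flat parameter lists and the smoothing factor `alpha` are illustrative assumptions, not our training configuration.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               alpha: float = 0.99) -> List[float]:
    """Return teacher weights moved toward the student weights by an
    exponential moving average: t <- alpha * t + (1 - alpha) * s."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    teacher = [1.0, 0.0]
    student = [0.0, 1.0]
    # The teacher drifts slowly toward the student, which smooths the
    # pseudo predictions used as consistency targets on target images.
    for _ in range(3):
        teacher = ema_update(teacher, student, alpha=0.5)
        print(teacher)
```

A larger `alpha` makes the teacher change more slowly, which tends to give more stable targets at the cost of slower adaptation.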
6 Conclusions
In this paper, we have proposed an algorithm named DCSE that combines domain confusion and self-ensembling to address the training instability problem and the feature conditional distribution misalignment problem. Our model works by finding a unified class representation between domains, and it achieves better results on the MNIST/SVHN and CIFAR/STL cross-domain adaptation benchmarks.
Acknowledgements
The authors are grateful to all reviewers for their very insightful comments and suggestions. This work was supported in part by the National Natural Science Foundation of China under Grants 61773127 and 61727810, the Ten Thousand Talent Program approved in 2018, the Scientific Funds approved in 2016 for Higher Level Talents by Guangdong Provincial Universities under Grant 2015TX01X232, the Project supported by GDHVPS 2014, the Guangdong Province Science Foundation for Program of Research Team under Grant 2018A030313306, and the Guangzhou Science and Technology Foundation under Grants 201802010037 and 20150810007.
References
 Adam Coates, Honglak Lee, and Andrew Y. Ng. An analysis of single-layer networks in unsupervised feature learning. AISTATS, 2011.
 M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. ArXiv e-prints, January 2017.
 Bo Du, Liangpei Zhang, Dacheng Tao, and Dengyi Zhang. Unsupervised transfer learning for target detection from hyperspectral images. Neurocomputing, 120:72–82, 2013. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2012.08.056. URL http://www.sciencedirect.com/science/article/pii/S092523121300297X.
 Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. ArXiv e-prints 1706.05208, 2017.
 Y. Ganin and V. Lempitsky. Unsupervised Domain Adaptation by Backpropagation. ArXiv e-prints, September 2014.
 Marzieh Gheisari and Mahdieh Soleymani Baghshah. Unsupervised domain adaptation via representation learning and adaptive classifier learning. Neurocomputing, 165:300–311, 2015. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2015.03.020. URL http://www.sciencedirect.com/science/article/pii/S0925231215002921.
 Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In Proceedings of European Conference on Computer Vision, pages 597–613, 2016.
 Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
 Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. Computing Research Repository, abs/1704.00028, 2017. URL http://arxiv.org/abs/1704.00028.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
 Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. Computing Research Repository, abs/1610.02242, 2016. URL http://arxiv.org/abs/1610.02242.
 Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. Computing Research Repository, abs/1608.03983, 2016. URL http://arxiv.org/abs/1608.03983.
 M.-T. Luong, H. Pham, and C. D. Manning. Effective Approaches to Attention-based Neural Machine Translation. ArXiv e-prints, August 2015.
 Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324, 1998.
 Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 Jianhan Pan, Xuegang Hu, Peipei Li, Huizong Li, Wei He, Yuhong Zhang, and Yaojin Lin. Domain adaptation via multi-layer transfer learning. Neurocomputing, 190:10–24, 2016. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2015.12.097. URL http://www.sciencedirect.com/science/article/pii/S0925231216000096.
 Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
 Weike Pan. A survey of transfer learning for collaborative recommendation with auxiliary data. Neurocomputing, 177:447–453, 2016. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2015.11.059. URL http://www.sciencedirect.com/science/article/pii/S0925231215018640.
 Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, and Trevor Darrell. Pytorch. https://github.com/pytorch/pytorch, 2017.
 Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2014.
 Paolo Russo, Fabio Maria Carlucci, Tatiana Tommasi, and Barbara Caputo. From source to target and back: symmetric bidirectional adaptive GAN. Computing Research Repository, abs/1705.08824, 2017. URL http://arxiv.org/abs/1705.08824.
 Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. Computing Research Repository, abs/1606.04671, 2016. URL http://arxiv.org/abs/1606.04671.
 Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Computing Research Repository, abs/1606.04586, 2016. URL http://arxiv.org/abs/1606.04586.
 K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv e-prints, September 2014.
 Xin Sun, Junyu Shi, Lipeng Liu, Junyu Dong, Claudia Plant, Xinhua Wang, and Huiyu Zhou. Transferring deep knowledge for object recognition in low-quality underwater videos. Neurocomputing, 275:897–908, 2018. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2017.09.044. URL http://www.sciencedirect.com/science/article/pii/S0925231217315631.
 Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. Computing Research Repository, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.
 Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. Computing Research Repository, abs/1804.10332, 2018. URL http://arxiv.org/abs/1804.10332.
 Qiaoyu Tan, Yanming Yu, Guoxian Yu, and Jun Wang. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, 260:192–202, 2017. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2017.04.033. URL http://www.sciencedirect.com/science/article/pii/S092523121730704X.
 Xin Tang, Fang Guo, Jianbing Shen, and Tianyuan Du. Facial landmark detection by semi-supervised deep learning. Neurocomputing, 2018. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2018.01.080. URL http://www.sciencedirect.com/science/article/pii/S0925231218301139.
 Antti Tarvainen and Harri Valpola. Weight-averaged consistency targets improve semi-supervised deep learning results. Computing Research Repository, abs/1703.01780, 2017. URL http://arxiv.org/abs/1703.01780.
 Alessio Tonioni, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. Unsupervised adaptation for deep stereo. In Proceedings of the IEEE International Conference on Computer Vision, pages 1614–1622, 2017.
 Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. Computing Research Repository, abs/1412.3474, 2014. URL http://arxiv.org/abs/1412.3474.
 Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. Computing Research Repository, abs/1702.05464, 2017. URL http://arxiv.org/abs/1702.05464.
 Jing Wang, Xin Zhang, Xueqing Li, and Jixiang Du. Semi-supervised manifold alignment with few correspondences. Neurocomputing, 230:322–331, 2017. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2016.12.010. URL http://www.sciencedirect.com/science/article/pii/S0925231216314850.
 Baodi Yuan, Jian Tu, Rui-Wei Zhao, Yingbin Zheng, and Yu-Gang Jiang. Learning part-based mid-level representation for visual recognition. Neurocomputing, 275:2126–2136, 2018. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2017.10.062. URL http://www.sciencedirect.com/science/article/pii/S0925231217317137.