Surprisingly Simple Semi-Supervised Domain Adaptation with Pretraining and Consistency

Visual domain adaptation involves learning to classify images from a target visual domain using labels available in a different source domain. A range of prior work uses adversarial domain alignment to try to learn a domain-invariant feature space, where a good source classifier can perform well on target data. This, however, can lead to errors where class A features in the target domain get aligned to class B features in the source domain. We show that in the presence of a few target labels, simple techniques like self-supervision (via rotation prediction) and consistency regularization can be effective without any adversarial alignment to learn a good target classifier. Our Pretraining and Consistency (PAC) approach can achieve state-of-the-art accuracy on this semi-supervised domain adaptation task, surpassing multiple adversarial domain alignment methods, across multiple datasets. Notably, it outperforms all recent approaches by 3-5% on the large and challenging DomainNet benchmark, showing the strength of these simple techniques in avoiding errors that adversarial alignment can make.

1 Introduction

The problem of visual domain adaptation arises when a learner must leverage labelled source domain data to classify instances in the target domain, where it has limited access to ground-truth annotations. An example of this is the problem of learning to classify real-world images based on hand-sketched depictions. The problem is challenging because discriminative features learnt while training to classify source domain instances may not be meaningful or sufficiently discriminative in the target domain. As described in prior works, this situation can be viewed as arising from a “domain shift”, where the joint distribution of features and labels in the source domain differs from that in the target domain.

Figure 1: Feature space illustration of our approach. Shapes of points represent classes and colors represent different domains. Target domain points with a black border are labelled. The initial feature space might not be discriminative on the target domain. Pretraining for rotation prediction can help fix initial features to some extent (a → b). Consistency regularization helps cluster points by changing features to keep the classifier decision boundary away from data-dense regions. Image space perturbations can help correctly cluster points that lie on the wrong side of the boundary (b → c).

We consider the problem of semi-supervised domain adaptation (SSDA). Namely, given ground-truth labelled source instances, a few labelled target examples, and unlabelled target-domain data, a learner’s goal is to classify unlabelled examples in the target domain. In this context, a number of prior approaches have proposed to address the domain-shift problem by aligning features of the source and target domains, so that a classifier learnt on source domain labels also correctly classifies target domain examples. In particular, a substantial set of these works propose methods rooted in adversarial domain alignment [1, 21, 32, 29, 37, 42], deriving from the theory presented by Ben-David et al. [2].

In this paper, we use simple label consistency [33] and rotation prediction for pretraining [12] to propose a method, PAC (Pretraining and Consistency), that is competitive with the current state of the art on semi-supervised domain adaptation over multiple datasets. Our method does not use any adversarial domain alignment but is able to outperform such methods in most cases. Notably, PAC achieves better target accuracy than all comparable state-of-the-art methods on the large and challenging DomainNet [26] benchmark by 3-5% in the 1-shot and 3-shot (1 and 3 labelled target examples per class, respectively) SSDA scenarios.

Fig 1 (a) shows the feature space distribution of points in a hypothetical binary classification problem, where simply aligning features can lead to many target domain features being mapped close to source features of a different class. A classifier learnt with source labels and possibly a few target labels can make errors here. Starting with an initial feature extractor that generates features somewhat meaningful to the image categories can remedy this situation to some extent.

Most recent domain adaptation approaches use an Imagenet [9] pretrained backbone as the starting point for their feature extractor. While adopted as universally meaningful, these features are still limited by the kinds of images in the Imagenet dataset. We use self-supervision via rotation prediction to enhance this Imagenet pretrained backbone for the particular domain adaptation task. Rotation prediction was first proposed by Gidaris et al. [12] for learning features from unlabelled images, and a recent study by Wallace et al. [40] found it to produce more semantically meaningful features than an array of other self-supervision objectives.

Also key to our approach is label consistency using image space perturbations. This relates to the cluster assumption for an ideal classifier [6], which states that points in the same cluster in input space should be classified similarly. It is equivalent to the low-density separation assumption: a classifier’s decision boundaries should lie in regions of low data density, since that makes the classifier less likely to change its output on nearby/perturbed data [39]. Label consistency, or consistency regularization, enforces this by making a classifier invariant to small input perturbations and thus to the neighborhoods that may form clusters. In our approach, we do this using image augmentation methods like RandAugment [8] and color jittering, and the model is trained to produce the same output for both a perturbed and an unperturbed version of an image.

In Fig 1 (b) we motivate a scenario where consistency regularization can help fix errors that adversarial domain alignment might make. By perturbing data, the classifier is encouraged not to place decision boundaries close to these data points, allowing them to cluster. We note here that data augmentation is a powerful perturbation technique since it indicates a meaningful neighborhood of images. In other words, manipulating an image via small translations, rotations, color manipulations, etc. does not change its category, at least for common objects where fine image detail does not play a big role in recognition. Via these perturbations, consistency regularization can help correctly cluster points that initially lie on the wrong side of the decision boundary.

Our contributions in this paper are two-fold:

  • We propose a simple semi-supervised domain adaptation method PAC, based on label consistency and rotation prediction for pretraining, which achieves state of the art accuracy on SSDA across multiple datasets.

  • We thoroughly analyze the individual components of our method and how they affect performance, providing an understanding of these components and thus of how they might be combined with other techniques.

2 Background and Related Work

Adversarial Domain Alignment. Ben-David et al. [2] presented a theory of domain adaptation where they introduced an upper bound on the target domain error of a classifier. Given a classifier $h$ in a hypothesis space $\mathcal{H}$, they showed that

$$\epsilon_T(h) \le \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda,$$

where $\epsilon_S(h)$ and $\epsilon_T(h)$ are the source and target domain errors of the classifier $h$, $\lambda$ is the error of the ideal joint classifier, and

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h, h' \in \mathcal{H}} \Big| \Pr_{x \sim \mathcal{D}_S}[h(x) \neq h'(x)] - \Pr_{x \sim \mathcal{D}_T}[h(x) \neq h'(x)] \Big|,$$

where $\mathcal{D}_S$ and $\mathcal{D}_T$ represent the source and target domain data distributions. Intuitively, this domain divergence is a measure of the maximum change in a classifier’s outputs on the target domain when it changes only a little on the source. Note that the $d_{\mathcal{H}\Delta\mathcal{H}}$ divergence is equivalently defined using a single classifier from the symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$, as opposed to using two from $\mathcal{H}$ (readers are recommended to refer to Definition 3 of [2] for a more elaborate description).

A range of visual domain adaptation approaches tries to find a feature space in which the source and target feature distributions have low divergence. Computing the divergence requires a domain-discriminating classifier, which may or may not have any relation to the classifier for categories in the final classification task. Since finding the divergence involves a supremum over the classifier parameters, reducing it results in a minimax objective for the domain divergence. This, along with minimizing the classification error on labelled source data, is broadly the approach adopted by a range of unsupervised domain adaptation methods [1, 10, 21, 32, 37, 42].

Adversarial domain alignment makes the assumption that aligning source and target features, along with learning a good classifier on source features, is sufficient for learning a good target domain classifier. Shu et al. [34] showed, using the example of a domain discriminator [10], that this may not necessarily be true, especially when feature extractors have very high capacity—which is the case for commonly used deep convolutional networks. They also showed that the cluster assumption [6] for an ideal classifier can have a role to play in domain adaptation.

Semi-supervised learning and Cluster Assumption. Semi-supervised learning, where a classifier learns using a mixture of labelled and unlabelled examples, has a long-running literature. The cluster assumption [6] is widely accepted for robust image classification, and its enforcement has been shown to lead to good performance in a range of semi-supervised learning methods. Conditional entropy minimization is one way of enforcing the cluster assumption [13]. Consistency regularization, as discussed in the introduction, is another method, enforcing that a classifier’s labelling be consistent between input examples and their perturbed versions. Making a classifier invariant to these perturbations makes its decision boundaries lie in regions of low data density. Both random [33] and adversarial perturbations [24] have been proposed in this context. More recently, Fixmatch [19], with a simple consistency regularization approach using image-based augmentations with RandAugment [8], achieved remarkable performance on semi-supervised classification on CIFAR-10 with very few labelled examples per class. PAC, like Fixmatch, uses RandAugment for consistency regularization; we additionally use color jittering. In our experiments we found these image augmentations to work better than adversarial perturbations, possibly because the latter may not guarantee preserving an image’s semantic meaning.

Semi-supervised Domain Adaptation. As mentioned above, conditional entropy minimization has been used to enforce the cluster assumption. Saito et al. [29] cleverly built this into a minimax optimization problem to propose a domain alignment method, called minimax entropy (MME), that also respects the cluster assumption. They evaluated MME on semi-supervised domain adaptation and found that it performed better than other domain alignment approaches without conditional entropy minimization. Their approach uses entropy maximization with the classifier in an attempt to move the boundary close to and in between target unlabelled examples. Subsequent conditional entropy minimization using the feature extractor clusters the target unlabelled examples largely according to the “split” created by the classifier. Note that with this approach the neighborhood gets defined by the classifier, and if errors are made here, they are harder to fix. So, while PAC can fix errors like the ones in Fig. 1 (b), MME may find it harder to do so. We demonstrate this in Sec 4.3 by comparing the performance of both methods from a randomly initialized backbone, and find that our method performs better.

Saito et al. [29] also used a benchmark that has been adopted by subsequent approaches for evaluation of SSDA methods. Here we describe some of these approaches. APE [17] uses different feature alignment objectives, within and across domains, along with a perturbation consistency objective. BiAT [16] uses multiple adversarial perturbation strategies and consistency losses alongside MME. Li et al. [20] propose an online meta-learning framework using target domain labelled data for meta-testing. They evaluate this approach with multiple domain alignment methods; we use their meta-MME model on semi-supervised domain adaptation for comparison.

Self Supervision and Domain Adaptation. In the absence of any labelled training data, different self-supervision objectives [5, 7, 14, 12, 23] have been proposed that can learn semantically meaningful image features for tasks like image classification and object detection. In domain adaptation, some recent approaches [5, 36, 41] have used self-supervision tasks as an auxiliary objective to regularize their model. Saito et al. [30] used a self-supervised feature space clustering objective for universal domain adaptation. PAC differs from these approaches in that we use rotation prediction to pretrain our feature extractor. This helps our initial feature extractor output more relevant and semantically meaningful features for the classification task at hand. In our experiments, we compared this to MoCo [14], a recent contrastive self-supervision approach, and found features learnt using rotation prediction to have better properties for our task. This is also in line with the findings of Wallace et al. [40], where rotation prediction was found to be more semantically meaningful than other self-supervision objectives for a range of classification tasks across different datasets.

3 Pretraining and Consistency (PAC)

Figure 2: A diagram of our Pretraining and Consistency (PAC) approach. We first train our backbone for the self-supervised task of predicting rotations (left). This backbone is then used as a warm start for our classification model, which uses labelled data with the cross entropy criterion and consistency regularization for the unlabelled data (right).

Before describing our approach, we introduce notation for the problem and our model’s components. Available to the model are two sets of labelled images: $\mathcal{S} = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, the labelled source images, and $\mathcal{T}_l = \{(x_i^t, y_i^t)\}_{i=1}^{N_l}$, the few labelled target images, and additionally the set of unlabelled target images $\mathcal{T}_u = \{x_i^u\}_{i=1}^{N_u}$. The goal is to predict labels for the images in $\mathcal{T}_u$. The final classification model consists of two components: the feature extractor $f$ and the classifier $h$. $f$ generates features $f(x)$ for an input image $x$, on which the classifier operates to produce output class scores $h(f(x)) \in \mathbb{R}^K$, where $K$ is the number of categories that the images in the dataset could belong to. In our experiments, $f$ is a convolutional network and produces features with unit $\ell_2$-norm, i.e., $\|f(x)\|_2 = 1$ (following [29]). $h$ consists of one or two fully connected layers.
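As a concrete sketch, the two components can be set up in PyTorch as below. This is illustrative only: the tiny convolutional stack stands in for the Alexnet/Resnet-34 backbones actually used, and the feature dimension and class count are placeholder values. The one detail taken from the paper is that features are scaled to unit L2-norm before classification (following [29]).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedFeatureExtractor(nn.Module):
    """Backbone whose output features are scaled to unit L2 norm.
    The tiny conv stack here is a stand-in for the Alexnet/Resnet-34
    backbones used in the paper."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        f = self.fc(h)
        return F.normalize(f, dim=1)  # unit L2 norm per example

class Classifier(nn.Module):
    """One fully connected layer mapping features to K class scores."""
    def __init__(self, feat_dim=512, num_classes=126):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, f):
        return self.fc(f)
```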

An overview of PAC is shown in Fig 2. Our final model is trained in two stages:

3.1 Rotation Pretraining

We first train our feature extractor with the self-supervised task of predicting image rotations (Fig 2 (left)) on both the source and target datasets, i.e., all images in $\mathcal{S}$, $\mathcal{T}_l$ and $\mathcal{T}_u$. Without using image category labels, we train a 4-way classifier to predict which of 4 possible angles (0°, 90°, 180°, 270°) an input image has been rotated by. We follow the procedure of Gidaris et al. [12] and, in each minibatch, introduce all 4 rotations of a given image to the classifier. This backbone is then used as the initialization for semi-supervised domain adaptation training.
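A minimal sketch of how such a rotation-prediction minibatch can be constructed in PyTorch (the function name and batch layout are our own; the paper only specifies that all 4 rotations of each image appear in the minibatch):

```python
import torch

def make_rotation_batch(images):
    """Given a batch of unlabelled images (B, C, H, W), return all four
    rotated copies (4B, C, H, W) and their 4-way rotation labels
    (0: 0 deg, 1: 90, 2: 180, 3: 270), as in Gidaris et al.
    torch.rot90 rotates in the spatial (H, W) plane."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels
```

The 4-way rotation classifier is then trained with ordinary cross-entropy on `(rotated, labels)`.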

3.2 Consistency Regularization

Consistency regularization promotes the final model to produce the same output for both an input image $x$ and a perturbed version $\tilde{x}$. We introduce these perturbations using image-level augmentations: RandAugment [8] along with additional color jittering. Given an unlabelled image $x \in \mathcal{T}_u$, we first compute the model’s predicted class distributions

$$q = h(f(x)), \qquad p = h(f(\tilde{x})).$$

$q$ is then confidence thresholded using a threshold $\tau$, and the following is used as the consistency regularization loss:

$$\mathcal{L}_{cr}(x) = \mathbb{1}\left[q_{\hat{y}} \ge \tau\right] H(q, p), \qquad \hat{y} = \arg\max_c q_c,$$

where $\mathbb{1}[\cdot]$ is an indicator function and $H$ is cross-entropy. Note that $\hat{y}$ has been used to index into the elements of $q$. Intuitively, an unperturbed version of image $x$ is used to compute pseudo-targets for the perturbed version $\tilde{x}$, which are only used when the pseudo-target has high confidence ($q_{\hat{y}} \ge \tau$). We also note here that the target $q$ is not used in computing gradients for the parameters of the network. For the labelled examples from $\mathcal{S}$ and $\mathcal{T}_l$, we use the same perturbations but with ground-truth labels as targets.
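A hedged PyTorch sketch of this thresholded consistency loss (function and argument names are ours; `logits_weak` are the model's outputs on the unperturbed image, `logits_strong` on the perturbed one):

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_weak, logits_strong, tau=0.9):
    """Thresholded consistency loss: the model's soft prediction q on the
    unperturbed image is the (detached) target for the perturbed image,
    kept only where q's top probability reaches tau."""
    q = F.softmax(logits_weak, dim=1).detach()      # pseudo-target, no gradient
    mask = (q.max(dim=1).values >= tau).float()     # confidence indicator
    log_p = F.log_softmax(logits_strong, dim=1)
    per_example_ce = -(q * log_p).sum(dim=1)        # cross-entropy H(q, p)
    return (mask * per_example_ce).mean()
```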

The model is optimized using minibatch-SGD, with minibatches $\mathcal{B}_l$ and $\mathcal{B}_u$ sampled from $\mathcal{S} \cup \mathcal{T}_l$ and $\mathcal{T}_u$ respectively. The final optimization criterion used is

$$\mathcal{L} = \frac{1}{|\mathcal{B}_l|} \sum_{(x, y) \in \mathcal{B}_l} H\big(\mathbf{y}, h(f(\tilde{x}))\big) + \frac{1}{|\mathcal{B}_u|} \sum_{x \in \mathcal{B}_u} \mathcal{L}_{cr}(x),$$

where $\mathbf{y}$ is the one-hot representation of the label $y$ of $x \in \mathcal{S}$ or $\mathcal{T}_l$.
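Putting the pieces together, one SGD step might look like the sketch below. This is an illustrative reading of the criterion above, not the authors' code: `model` stands for the classifier composed with the feature extractor, and the labelled batch mixes source and target examples.

```python
import torch
import torch.nn.functional as F

def training_step(model, opt, xl_aug, yl, xu, xu_aug, tau=0.9):
    """One optimization step (illustrative): cross-entropy on perturbed
    labelled images (source + target), plus the thresholded consistency
    term on unlabelled target images."""
    sup_loss = F.cross_entropy(model(xl_aug), yl)

    with torch.no_grad():                       # pseudo-targets carry no gradient
        q = F.softmax(model(xu), dim=1)
    mask = (q.max(dim=1).values >= tau).float()
    log_p = F.log_softmax(model(xu_aug), dim=1)
    cr_loss = (mask * -(q * log_p).sum(dim=1)).mean()

    loss = sup_loss + cr_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```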

4 Experiments

4.1 Datasets

We evaluate our method, PAC, on four different datasets: DomainNet [26], VisDA-17 [27], Office-Home [38] and Office [28]. DomainNet [26] is a recent large scale domain adaptation benchmark with 6 different visual domains and 345 classes. We use a subset of four domains (Clipart, Painting, Real, and Sketch) and 126 classes for our experiments. This subset has close to 36500 images per domain. A total of 7 different scenarios out of the possible 12 were used for evaluation. VisDA-17 is another large scale adaptation benchmark with a single adaptation scenario: the source domain consists of 152,398 synthetic images from 12 categories, and the target domain consists of 55,388 real images. Office-Home [38] is a dataset with 65 categories of objects found in typical office and home environments. It has 4 different visual domains (Art, Clipart, Product, and Real), and we evaluate our methods on all 12 different adaptation scenarios. The 4 domains have close to 3800 images on average. Office [28] is a dataset of objects of 31 different categories commonly found in an office. It has 3 different domains—amazon, webcam and dslr, with approx. 2800, 800 and 500 images respectively. Following [29], we evaluated only on the 2 cases with amazon as the target domain, since the other two domains have a lot fewer images.

For each dataset and adaptation scenario, following [29], we use one-shot and three-shot settings for evaluation, where one and three target labels per class are available to the learner, respectively. For each scenario, 3 examples per class in the target domain are held out for validation, except in VisDA-17, where 20 examples per class were held out because it has fewer categories.

4.2 Implementation Details

All our experiments were implemented in PyTorch [25] using W&B [4] for experiment tracking. On the Office and Office-Home datasets, we evaluated PAC using both an Alexnet [18] and a VGG-16 [35] backbone. The experiments on the DomainNet dataset used an Alexnet and a Resnet-34 [15] backbone, while on VisDA-17 we evaluated our method with a Resnet-34 backbone.

While using an Alexnet or VGG-16 feature extractor, we use 1 fully connected layer as the classifier, and while using the Resnet-34 backbone, we use a 2-layer MLP with 512 intermediate nodes. Our backbones, before being trained on the rotation prediction task, are pretrained on the Imagenet [9] dataset, the same as the other methods we compare against. Similar to [29], in every training minibatch we sampled an equal number of labelled and unlabelled examples, with labelled examples coming in equal numbers from the source and target domains. We used an SGD optimizer with momentum 0.9 and a learning rate decay schedule similar to that used by [10]. For consistency regularization, the confidence threshold was set to 0.9 across all experiments, having been validated on the real to sketch scenario of DomainNet. Complete details of all experiments are included in Appendix D.
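The decay schedule of [10] has the form lr(p) = lr_0 / (1 + α·p)^β over training progress p ∈ [0, 1]. Since the text says only “similar to”, the constants below (α = 10, β = 0.75, taken from [10]) should be treated as assumptions:

```python
def decayed_lr(base_lr, step, total_steps, alpha=10.0, beta=0.75):
    """Learning-rate decay of the form used by Ganin & Lempitsky [10]:
    lr(p) = base_lr / (1 + alpha * p) ** beta, where p in [0, 1] is
    training progress. alpha and beta are the constants from [10],
    assumed here rather than taken from this paper."""
    p = step / float(total_steps)
    return base_lr / (1.0 + alpha * p) ** beta
```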

4.3 Results

Net Method R to C R to P P to C C to S S to P R to S P to R MEAN
1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot
Alexnet S+T 43.3 47.1 42.4 45.0 40.1 44.9 33.6 36.4 35.7 38.4 29.1 33.3 55.8 58.7 40.0 43.4
DANN 43.3 46.1 41.6 43.8 39.1 41.0 35.9 36.5 36.9 38.9 32.5 33.4 53.6 57.3 40.4 42.4
ADR 43.1 46.2 41.4 44.4 39.3 43.6 32.8 36.4 33.1 38.9 29.1 32.4 55.9 57.3 39.2 42.7
CDAN 46.3 46.8 45.7 45.0 38.3 42.3 27.5 29.5 30.2 33.7 28.8 31.3 56.7 58.7 39.1 41.0
MME 48.9 55.6 48.0 49.0 46.7 51.7 36.3 39.4 39.4 43.0 33.3 37.9 56.8 60.7 44.2 48.2
Meta-MME - 56.4 - 50.2 - 51.9 - 39.6 - 43.7 - 38.7 - 60.7 - 48.7
APE 47.7 54.6 49.0 50.5 46.9 52.1 38.5 42.6 38.5 42.2 33.8 38.7 57.5 61.4 44.6 48.9
BiAT 54.2 58.6 49.2 50.6 44.0 52.0 37.7 41.9 39.6 42.1 37.2 42.0 56.9 58.8 45.5 49.4
PAC 55.4 61.7 54.6 56.9 47.0 59.8 46.9 52.9 38.6 43.9 38.7 48.2 56.7 59.7 48.3 54.7

Resnet-34 S+T 55.6 60.0 60.6 62.2 56.8 59.4 50.8 55.0 56.0 59.5 46.3 50.1 71.8 73.9 56.8 60.0
DANN 58.2 59.8 61.4 62.8 56.3 59.6 52.8 55.4 57.4 59.9 52.2 54.9 70.3 72.2 58.4 60.7
ADR 57.1 60.7 61.3 61.9 57.0 60.7 51.0 54.4 56.0 59.9 49.0 51.1 72.0 74.2 57.6 60.4
CDAN 65.0 69.0 64.9 67.3 63.7 68.4 53.1 57.8 63.4 65.3 54.5 59.0 73.2 78.5 62.5 66.5
MME 70.0 72.2 67.7 69.7 69.0 71.7 56.3 61.8 64.8 66.8 61.0 61.9 76.1 78.5 66.4 68.9
Meta-MME - 73.5 - 70.3 - 72.8 - 62.8 - 68.0 - 63.8 - 79.2 - 70.1
APE 70.4 76.6 70.8 72.1 72.9 76.7 56.7 63.1 64.5 66.1 63.0 67.8 76.6 79.4 67.8 71.7
BiAT 73.0 74.9 68.0 68.8 71.6 74.6 57.9 61.5 63.9 67.5 58.5 62.1 77.0 78.6 67.1 69.7
PAC 74.9 78.6 73.0 74.3 72.6 76.0 65.8 69.6 67.9 69.4 68.7 70.2 76.7 79.3 71.4 73.9
Table 1: Accuracy on the DomainNet dataset (%) for one-shot and three-shot settings on 4 domains, R: Real, C: Clipart, P: Painting, S: Sketch. PAC, though simple, is strong enough to outperform other state of the art approaches on most scenarios.
Method Overall Accuracy
1-shot 3-shot
S+T 57.7 59.9
MME 69.7 70.7
PAC 75.2 80.4
Table 2: Results on VisDA-17. Our method outperforms MME, a current state of the art method, by a sizeable margin.

Comparison to other approaches. We compare PAC with different recent semi-supervised domain adaptation approaches : MME [29], BiAT [16], Meta-MME [20], APE [17], using results reported by these papers. Besides this, we also include in the tables, baseline approaches using adversarial domain alignment—DANN [11], ADR [31] and CDAN [21], that were evaluated by Saito et al[29]. The baseline “S+T” is a method that simply uses all labelled data available to it to train the network using cross-entropy loss. Note that PAC can be construed as “S+T” along with additional consistency regularization and with a warm start using rotation prediction for pretraining.

In Table 1, we compare the accuracy of PAC with different recent approaches on DomainNet. Remarkably, our simple approach outperforms the current state of the art by 3-5% on this benchmark with different backbones. In Table 2, which holds the VisDA-17 results, besides our method we report results of S+T and MME that we replicated using the implementation of [29]. We see that PAC shows strong performance, with close to 10% improvement in accuracy over MME in the 3-shot scenario. On the smaller Office-Home dataset, as seen from the average accuracies in Table 3, our method is comparable to the state of the art in the 3-shot scenarios, but starts lagging a little in the 1-shot scenario. This is an effect seen in our results across Tables 1, 2 and 3, where our improvements over the state of the art are larger in the 3-shot scenario than in the 1-shot one. We delve deeper into this and report an analysis of our method with different numbers of labelled target examples in Appendix A. Complete results on the different scenarios of Office-Home and results on the Office dataset can be found in Appendix C.

Method Alexnet VGG
1-shot 3-shot 1-shot 3-shot
S+T 44.1 50.0 57.4 62.9
DANN 45.1 50.3 60.0 63.9
ADR 44.5 49.5 57.3 63.1
CDAN 41.2 46.2 55.9 61.8
ENT 38.8 50.9 51.6 64.8
MME 49.2 55.2 62.7 67.6
APE - 55.6 - -
BiAT 49.6 56.4 - -
PAC 47.4 55.2 62.2 67.7
Table 3: Results on the Office-Home dataset (%), using both the Alexnet and VGG backbones. Accuracy reported is averaged over all adaptation scenarios. Performance on each setting is included in supplementary material. Our approach is competitive with state of the art on this benchmark.

Ablation analysis. In Table 4, we see what rotation prediction pretraining and consistency regularization do for final target classification performance separately. The two components provide boosts to the final performance individually, with the combination of both performing best. We see that in most cases consistency regularization helps performance by a lot, especially in the 3-shot scenarios.

Rot CR Target Accuracy
Alexnet Resnet-34
1-shot 3-shot 1-shot 3-shot
–   –   29.1 33.3 46.3 50.1
✓   –   35.1 37.9 54.1 56.1
–   ✓   32.5 45.9 64.3 68.9
✓   ✓   38.7 48.2 68.7 70.2
Table 4: Ablation study for pretraining predicting rotations (Rot) and consistency regularization (CR) on the real to sketch scenario of Domainnet using both Alexnet and Resnet-34 backbones.

Feature space analysis. In Fig 3 we plot the 2-D TSNE [22] embeddings of features generated by 5 differently trained Alexnet backbones. The embeddings are plotted for all points from 5 randomly picked classes. The source domain points, which are light-colored circles, come from real images of Office-Home, and the target domain points, which are dark-colored markers, come from clipart images. The labelled target examples are marked with X’s. The two plots on the left compare differently pretrained backbones, and the three on the right use backbones at the end of different SSDA training processes. In the plots we can see that pretraining the backbone for rotation prediction starts to align and cluster points according to their classes a little better than a backbone pretrained just on Imagenet can. Among the methods trained on the SSDA task on the right, we see that both PAC and MME create well separated classes in feature space, allowing the classifier to have decision boundaries in low-density regions. MME explicitly minimizes conditional entropy, which may draw samples even further from the classifier boundaries, as compared to our method, which simply tries to ensure that the classifier does not separate an example and its perturbed version.

Figure 3: 2-D TSNE embeddings of features from 5 randomly chosen classes. Lighter colors represent source domain points and darker ones represent corresponding target domain points. The dark colored X’s are labelled target domain points (3 per class). Pretraining with rotation prediction begins to cluster points of the same class a little better over a backbone pretrained just on Imagenet. Our method on the right, separates classes just as well as MME. Refer to Section 4.3 for discussion. (Best viewed with color and under zoom)

In Table 5, we quantitatively analyze the features via three different metrics. The $\mathcal{A}$-distance is a distance metric between the two domains in feature space, computed using an SVM trained to classify domains, as done in [3]; the higher the error of the SVM domain classifier, the lower the $\mathcal{A}$-distance. The other two metrics are accuracies of distance-based classifiers in feature space. The first one, “Dist. Acc. (Target)”, is the accuracy of a classifier that assigns to any unlabelled target example the class label of the target labelled examples closest to it on average in the feature space. “Dist. Acc. (Source)” similarly uses only the source examples, all of which are labelled, to compute the class label for an unlabelled target example. Comparing the pretrained backbones, we see that rotation pretraining improves the feature space both by bringing the features of the two domains closer (as indicated by the low $\mathcal{A}$-distance) and by aligning them so that features from the same class are closer to one another (indicated by the higher accuracies). When it comes to the final feature spaces of the SSDA methods, we see that MME, being a domain alignment method, reduces the $\mathcal{A}$-distance more than PAC. However, PAC is able to better maintain the class-defined neighborhood of features, as indicated by the higher accuracies. This also indicates that metrics like domain discrepancy may be secondary to the performance of a good classifier that maintains a class-defined feature space neighborhood across both source and target domains.

Backbone $\mathcal{A}$-dist Dist. Acc. (Target) Dist. Acc. (Source)
Imagenet pretrained 1.57 26.8 26.0
Rotation pretrained 1.28 36.2 34.6
S+T 1.49 43.0 43.3
MME 1.24 51.2 51.5
PAC 1.45 56.4 56.8
Table 5: Distance-based classifier accuracy and $\mathcal{A}$-distance for different Alexnet backbones on the real to clipart adaptation scenario of Office-Home. We see that rotation prediction helps improve the initial feature space. Also, amongst the SSDA methods, PAC maintains a better class-defined neighborhood both within and across domains, even though the two domains are not aligned as closely as in the case of MME.
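The $\mathcal{A}$-distance above is typically estimated via the proxy of [3]: train a domain classifier and map its held-out error ε to 2(1 − 2ε). A sketch with scikit-learn (the SVM type and train/test split are our choices, not necessarily those used in the paper):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def proxy_a_distance(feats_src, feats_tgt, C=1.0, seed=0):
    """Proxy A-distance: train a linear SVM to tell source features from
    target features; d_A = 2 * (1 - 2 * err), where err is the held-out
    domain-classification error. Easily separable domains give a value
    near 2; indistinguishable domains (err near 0.5) give a value near 0."""
    X = np.vstack([feats_src, feats_tgt])
    y = np.concatenate([np.zeros(len(feats_src)), np.ones(len(feats_tgt))])
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    clf = LinearSVC(C=C).fit(Xtr, ytr)
    err = 1.0 - clf.score(Xte, yte)
    return 2.0 * (1.0 - 2.0 * err)
```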
Figure 4: Performance of our method with different augmentation/perturbation methods on real to clipart adaptation of Office-Home. Adversarial perturbation helps, but not as much as image augmentation approaches do. A combination of color jittering and RandAugment performed the best.

Which perturbation technique is best? We compared three different image augmentation approaches. RandAugment [8] draws from a list of 14 different augmentation schemes like translations, rotations, shears, color/brightness enhancements, etc., from which 2 are chosen at random each time an image is augmented. We also evaluated color jittering, since common objects in our datasets are largely invariant to small changes in color. Finally, we tried a combination of both and found that this performed best for our method. Fig 4 shows the comparison of the final target accuracies achieved using an Alexnet backbone on the real to clipart adaptation scenario of Office-Home. Besides augmentation-based perturbations, we also evaluated adversarial image perturbation via virtual adversarial training (VAT) [24]. When using VAT, we found improvements over the simple “S+T” method (48.3% using VAT vs 44.6% without), but as seen from Fig 4, these were much smaller than with the image augmentation approaches. This is quite likely because image augmentation imposes a meaningful neighborhood on images within which class labels do not change, while adversarial perturbation does not have this guarantee.

Can consistency regularization fix more errors than MME? Short answer: yes. In Sec. 2, we mentioned that consistency regularization, because of the perturbations it makes in image space, can fix errors of the kind that simple conditional entropy minimization, the way it is done in MME, cannot. We validate this hypothesis by training both methods from a randomly initialized feature extractor, where we expect the initial features to have a much less meaningful neighborhood structure. In Table 6, we see a much larger drop in the performance of MME when starting from a randomly initialized rather than a pretrained backbone, which tells us that consistency regularization can fix many more errors in the initial feature space than MME. Note that the “Ours (CR)” method here does not include any rotation pretraining, for a fair comparison.

Method Imagenet pt. Random init.
MME 51.2 26.9
Ours (CR) 54.1 40.0
Table 6: Comparison of MME and our consistency regularization approach on Imagenet pretrained backbone and randomly initialized backbone. Consistency regularization can fix more initial feature space errors than MME.

Sensitivity to confidence threshold. Our consistency regularization approach uses soft targets based on outputs of the classifier only in cases where the confidence of labelling is high. In Fig 5, we compare the sensitivity of our method to this threshold. We see that higher confidence thresholds up to 0.9 help final target classification performance.

Figure 5: Sensitivity of our method to different thresholds used for consistency regularization. Accuracies reported are on the real to sketch split of DomainNet using a Resnet-34 backbone.

How does pretraining with rotation prediction compare to a contrastive method? Contrastive pretraining methods [14, 7] have been shown to attain remarkable performance in learning features from unlabelled images that are useful for tasks like image recognition and object detection. We evaluate how momentum contrast (MoCo) [14] performs for pretraining our feature extractor on both source and target images, compared with rotation prediction. Table 7 compares the same metrics as Table 5, with the addition of the final method performance (training with labels and consistency regularization) on target classification. We see that, like rotation prediction, MoCo improves the Imagenet pretrained features to some extent. It has a marginally better class-defined structure across domains, but a poorer structure within the target domain, as indicated by the accuracies of the distance-based classifiers. Finally, as seen under “Final Acc.” in the table, when training our method from different initializations, a MoCo pretrained backbone gives better results than an Imagenet pretrained one, but poorer than one pretrained on the rotation prediction task.

Method    𝒜-distance  Dist. Acc.  Dist. Acc.  Final Acc.
Imagenet  1.57        26.8        26.0        54.1
MoCo      1.31        31.4        34.8        56.3
Rotation  1.28        36.2        34.6        58.8
Table 7: Distance-based classifier accuracy, 𝒜-distance and final method performance for different Alexnet backbones on the real to clipart adaptation scenario of Office-Home. MoCo helps over Imagenet pretraining, but not as much as rotation prediction.

5 Conclusion

We showed that consistency regularization and pretraining using rotation prediction are powerful techniques in semi-supervised domain adaptation. Our method, using simply a combination of these without any adversarial domain alignment, outperforms recent state-of-the-art methods on this task, most of which rely on adversarial alignment. We presented a thorough analysis of both model components, showing why they outperform alternative choices in similar approaches. We hope this can encourage their use in combination with other methods on the same or related tasks.

Acknowledgements. This work was supported by the Hariri Institute at Boston University.



Appendix A PAC Performance with Different Shots

In Figure 6, we plot the target accuracy of 4 methods on the real to clipart adaptation scenario of Office-Home, for different numbers of labelled target examples. The method "CR" represents the consistency regularization part of PAC, i.e., it starts from an Imagenet pretrained backbone, the same as S+T and MME [29]. MME, being designed for unsupervised domain adaptation, performs best at 0 shots, though PAC does not lag far behind. With a few labelled target examples, PAC and CR start performing better. As the number of labelled target examples increases further, MME and S+T start closing this gap, possibly simply due to greater supervision from labels. It is less evident in Figure 6, but still discernible, that the advantage provided by rotation pretraining is greater with fewer shots than when there are more labels in the target domain.

Figure 6: Performance with different number of labelled target examples.

Appendix B More questions

Can pretraining and consistency help other methods? An indication towards the affirmative comes from training MME with pretraining and consistency on the 3-shot real to sketch scenario of DomainNet using a Resnet-34 backbone. The results are shown in Table 8, where we can see that pretraining and consistency each individually improve MME's performance, and their combination helps it the most.

Rot CR Accuracy
Table 8: Pretraining and consistency with MME.

What if pretraining uses rotation prediction only on target? We train the backbone for rotation prediction on target domain data only, and then train it like PAC using consistency regularization. On the 3-shot real to clipart SSDA scenario of Office-Home using an Alexnet backbone, this achieves a lower final target accuracy than PAC (Table 9). This indicates that target-only rotation prediction helps the initial feature extractor, but not as much as when source domain data is used along with it.

Rot pretraining data   Final acc. (with source)   Final acc. (only target)
Target only            56.6                       35.5
Source + target        58.9                       36.7
Table 9: Ablating source domain information.

How big is the role of source domain data in final target performance? To assess this, we train our method with no access to source domain data, which resembles the semi-supervised learning problem. Target accuracies with only 3 labelled target examples and access to all other unlabelled examples, on the clipart domain of Office-Home using an Alexnet backbone, are reported in Table 9. For reference, Table 9 also provides the accuracies of our method with source domain data from the real domain (i.e. the R2C adaptation scenario).

Appendix C Results on Office and Office-Home

Table 11 shows the results of PAC on the different scenarios of Office-Home; the average accuracy over all these scenarios was also reported in Table 3. Table 10 shows the accuracy of PAC on two scenarios of Office. We see that PAC performs comparably to the state of the art, lagging slightly more in the 1-shot scenarios than in the 3-shot ones.

Network Method D to A W to A
1-shot 3-shot 1-shot 3-shot
Alexnet S+T 50.0 62.4 50.4 61.2
DANN 54.5 65.2 57.0 64.4
ADR 50.9 61.4 50.2 61.2
CDAN 48.5 61.4 50.2 60.3
ENT 50.0 66.2 50.7 64.0
MME 55.8 67.8 57.2 67.3
APE - 69.0 - 67.6
BiAT 54.6 68.5 57.9 68.2
PAC 54.7 66.3 53.6 65.1
VGG S+T 68.2 73.3 69.2 73.2
DANN 70.4 74.6 69.3 75.4
ADR 69.2 74.1 69.7 73.3
CDAN 64.4 71.4 65.9 74.4
ENT 72.1 75.1 69.1 75.4
MME 73.6 77.6 73.1 76.3
PAC 72.4 75.6 70.2 76.0
Table 10: Results on Office. We evaluate using the two scenarios where the target domain is amazon.
Network Method R to C R to P R to A P to R P to C P to A A to P A to C A to R C to R C to A C to P Mean
1-shot:
Alexnet S+T 37.5 63.1 44.8 54.3 31.7 31.5 48.8 31.1 53.3 48.5 33.9 50.8 44.1
DANN 42.5 64.2 45.1 56.4 36.6 32.7 43.5 34.4 51.9 51.0 33.8 49.4 45.1
ADR 37.8 63.5 45.4 53.5 32.5 32.2 49.5 31.8 53.4 49.7 34.2 50.4 44.5
CDAN 36.1 62.3 42.2 52.7 28.0 27.8 48.7 28.0 51.3 41.0 26.8 49.9 41.2
ENT 26.8 65.8 45.8 56.3 23.5 21.9 47.4 22.1 53.4 30.8 18.1 53.6 38.8
MME 42.0 69.6 48.3 58.7 37.8 34.9 52.5 36.4 57.0 54.1 39.5 59.1 49.2
BiAT - - - - - - - - - - - - 49.6
PAC 49.6 69.8 45.9 57.5 42.5 30.4 53.1 35.8 51.9 48.2 26.0 57.6 47.4
VGG S+T 39.5 75.3 61.2 71.6 37.0 52.0 63.6 37.5 69.5 64.5 51.4 65.9 57.4
DANN 52.0 75.7 62.7 72.7 45.9 51.3 64.3 44.4 68.9 64.2 52.3 65.3 60.0
ADR 39.7 76.2 60.2 71.8 37.2 51.4 63.9 39.0 68.7 64.8 50.0 65.2 57.3
CDAN 43.3 75.7 60.9 69.6 37.4 44.5 67.7 39.8 64.8 58.7 41.6 66.2 55.9
ENT 23.7 77.5 64.0 74.6 21.3 44.6 66.0 22.4 70.6 62.1 25.1 67.7 51.6
MME 49.1 78.7 65.1 74.4 46.2 56.0 68.6 45.8 72.2 68.0 57.5 71.3 62.7
PAC 56.4 78.8 64.6 73.1 54.7 55.3 69.8 43.5 69.5 65.3 45.3 69.6 62.2
3-shot:
Alexnet S+T 44.6 66.7 47.7 57.8 44.4 36.1 57.6 38.8 57.0 54.3 37.5 57.9 50.0
DANN 47.2 66.7 46.6 58.1 44.4 36.1 57.2 39.8 56.6 54.3 38.6 57.9 50.3
ADR 45.0 66.2 46.9 57.3 38.9 36.3 57.5 40.0 57.8 53.4 37.3 57.7 49.5
CDAN 41.8 69.9 43.2 53.6 35.8 32.0 56.3 34.5 53.5 49.3 27.9 56.2 46.2
ENT 44.9 70.4 47.1 60.3 41.2 34.6 60.7 37.8 60.5 58.0 31.8 63.4 50.9
MME 51.2 73.0 50.3 61.6 47.2 40.7 63.9 43.8 61.4 59.9 44.7 64.7 55.2
APE 51.9 74.6 51.2 61.6 47.9 42.1 65.5 44.5 60.9 58.1 44.3 64.8 55.6
BiAT - - - - - - - - - - - - 56.4
PAC 58.9 72.4 47.5 61.9 53.2 39.6 63.8 49.9 60.0 54.5 36.3 64.8 55.2
VGG S+T 49.6 78.6 63.6 72.7 47.2 55.9 69.4 47.5 73.4 69.7 56.2 70.4 62.9
DANN 56.1 77.9 63.7 73.6 52.4 56.3 69.5 50.0 72.3 68.7 56.4 69.8 63.9
ADR 49.0 78.1 62.8 73.6 47.8 55.8 69.9 49.3 73.3 69.3 56.3 71.4 63.1
CDAN 50.2 80.9 62.1 70.8 45.1 50.3 74.7 46.0 71.4 65.9 52.9 71.2 61.8
ENT 48.3 81.6 65.5 76.6 46.8 56.9 73.0 44.8 75.3 72.9 59.1 77.0 64.8
MME 56.9 82.9 65.7 76.7 53.6 59.2 75.7 54.9 75.3 72.9 61.1 76.3 67.6
PAC 63.5 82.3 66.8 75.8 58.6 57.1 75.9 56.7 72.2 70.5 57.7 75.3 67.7
Table 11: Results on all adaptation scenarios of Office-Home.

Appendix D Experiment details

As mentioned in Section 4.2, all our experiments were implemented in PyTorch [25] using W&B [4] for managing experiments.

D.1 PAC experiments

We used three different backbones for evaluation in different experiments: Alexnet [18], VGG-16 [35] and Resnet-34 [15]. As mentioned in Section 4.2, with an Alexnet or VGG-16 feature extractor we use a single fully connected layer as the classifier, while with the Resnet-34 backbone we use a 2-layer MLP with 512 intermediate nodes. The classifier uses a temperature parameter to sharpen the distribution it outputs through a softmax.
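A temperature-scaled classifier of this kind can be sketched as below. This is a hypothetical illustration: the cosine (normalized) form follows common practice in related work such as [29], and the temperature value 0.05 is a placeholder assumption, since the value used in the paper is not given in this text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureClassifier(nn.Module):
    # Single-layer classifier whose logits are divided by a temperature T
    # before the softmax; a small T sharpens the output distribution.
    # T = 0.05 is a placeholder, not the value used in the paper.
    def __init__(self, in_dim, num_classes, temperature=0.05):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes, bias=False)
        self.T = temperature

    def forward(self, feats):
        # Normalizing features and weights makes the logits cosine
        # similarities in [-1, 1], so the temperature fully controls
        # how peaked the softmax output is.
        feats = F.normalize(feats, dim=1)
        w = F.normalize(self.fc.weight, dim=1)
        return feats @ w.t() / self.T
```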

Same as [29], we train the models using minibatch SGD; at each training step, the learner "sees" a minibatch containing source examples, labelled target examples and unlabelled target examples, with batch sizes that differ between the VGG/Resnet backbones and Alexnet. The SGD optimizer uses momentum and weight decay (a coefficient on the regularizer on the parameter norm). For all experiments, the parameters of the backbone and the classifier are updated with different learning rates, both decayed as training progresses using a schedule similar to [10].
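The decay schedule of [10] can be sketched as follows. The constants alpha=10 and beta=0.75 are the ones used in [10]; the exact constants used for PAC are not given in this text, so treat them as assumptions.

```python
def decayed_lr(base_lr, step, total_steps, alpha=10.0, beta=0.75):
    # Learning-rate decay in the style of Ganin & Lempitsky [10]:
    #   lr(p) = base_lr / (1 + alpha * p) ** beta
    # where p in [0, 1] is the fraction of training completed.
    # alpha and beta mirror [10]; PAC's exact values are not stated here.
    p = step / float(total_steps)
    return base_lr / (1.0 + alpha * p) ** beta
```

In practice this is applied per parameter group, so the backbone and classifier keep their different base learning rates while sharing the same decay shape.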

For experiments on the Office and Office-Home datasets, we trained PAC using both Alexnet and VGG-16 backbones; the models were trained for 10000 steps, with the stopping point chosen by best validation accuracy.

For the experiments on DomainNet, we use both Alexnet and Resnet-34 backbones, while for VisDA-17, we use only Resnet-34. All models in these experiments were trained for 50000 steps, using validation accuracy for determining the best stopping point.

D.2 Pretraining

As mentioned in Section 4 of the main paper, we pretrain our models for rotation prediction starting from Imagenet pretrained weights. A comparison of PAC with a backbone trained for rotation prediction starting from Imagenet pretraining vs one that does not use any Imagenet pretraining revealed that there is important feature space information in Imagenet pretrained weights that rotation prediction could not capture on its own. This comparison was done using an Alexnet on the real to clipart adaptation scenario of Office-Home.

Following Gidaris et al. [12], we trained the model on all 4 rotations of each image in a minibatch. Each minibatch contained an equal number of images from the source and target domains, quadrupled once all rotations are considered. Learning rates and batch sizes were set per backbone. We found that, beyond a certain point early in training, the number of rotation prediction training steps did not make a big difference to the final task accuracy; the chosen numbers of training steps were 4000 for Alexnet, 2000 for VGG-16 and 5000 for Resnet-34.
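Building the rotation prediction batch of [12] can be sketched as below; the function name is ours, but the construction (every image in all four rotations, with the rotation index as the label) follows the description above.

```python
import torch

def rotation_batch(images):
    # Build the self-supervised rotation-prediction batch of [12]:
    # each image appears in all four rotations (0, 90, 180, 270 degrees)
    # and the rotation index serves as a 4-way classification label.
    views, labels = [], []
    for k in range(4):
        views.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)
```

The backbone plus a small rotation head is then trained with standard cross-entropy on these labels.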

D.3 Other Experiments

MoCo pretraining. Using the Alexnet backbone, we trained momentum contrast [14] for 5000 training steps; at each step the model saw 32 images each from the real and the clipart domains of Office-Home. The queue length used for MoCo was 4096.
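The core of MoCo's key encoder update is a momentum (exponential moving average) step, sketched below. The value m=0.999 is the default from [14]; the momentum value used in this paper's experiments is not given in this text.

```python
import torch

@torch.no_grad()
def momentum_update(query_params, key_params, m=0.999):
    # MoCo's key encoder is an exponential moving average of the query
    # encoder: k <- m * k + (1 - m) * q. m = 0.999 follows [14]; the
    # value used in this paper's experiments is not stated here.
    for q, k in zip(query_params, key_params):
        k.mul_(m).add_(q, alpha=1.0 - m)
```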

Virtual Adversarial Training. For adding a VAT criterion to our model, we closely followed the VAT criterion in VADA [34]: the KL divergence between the classifier outputs for a perturbed and an unperturbed input from the target domain, using a fixed radius for the adversarial perturbations and a fixed weighting coefficient for the criterion.
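A VAT criterion in the style of [24, 34] can be sketched as below using one power-iteration step; this is an illustrative simplification, and the radius and xi values are placeholders, since the values used in the paper are not given in this text.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, radius=1.0, xi=1e-6):
    # Virtual adversarial training [24, 34]: approximate the perturbation
    # within a radius-ball that most changes the model's output, then
    # penalize the KL divergence between perturbed and unperturbed
    # predictions. radius and xi are placeholder values.
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)
    # One power-iteration step to estimate the adversarial direction.
    d = torch.randn_like(x)
    d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=1), p, reduction="batchmean")
    grad = torch.autograd.grad(kl, d)[0]
    r_adv = radius * F.normalize(grad.flatten(1), dim=1).view_as(x)
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=1), p,
                    reduction="batchmean")
```

The returned loss would then be added to the training objective with the chosen coefficient.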


  1. H. Ajakan, P. Germain, H. Larochelle, F. Laviolette and M. Marchand (2014) Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446.
  2. S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira and J. W. Vaughan (2010) A theory of learning from different domains. Machine Learning 79 (1-2), pp. 151–175.
  3. S. Ben-David, J. Blitzer, K. Crammer and F. Pereira (2006) Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems 19, pp. 137–144.
  4. L. Biewald (2020) Experiment tracking with Weights and Biases. Software available online.
  5. F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo and T. Tommasi (2019) Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238.
  6. O. Chapelle and A. Zien (2005) Semi-supervised classification by low density separation. In AISTATS, Vol. 2005, pp. 57–64.
  7. T. Chen, S. Kornblith, M. Norouzi and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
  8. E. D. Cubuk, B. Zoph, J. Shlens and Q. V. Le (2020) RandAugment: practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703.
  9. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR 2009.
  10. Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pp. 1180–1189.
  11. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030.
  12. S. Gidaris, P. Singh and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations.
  13. Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529–536.
  14. K. He, H. Fan, Y. Wu, S. Xie and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
  15. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  16. P. Jiang, A. Wu, Y. Han, Y. Shao, M. Qi and B. Li. Bidirectional adversarial training for semi-supervised domain adaptation.
  17. T. Kim and C. Kim (2020) Attract, perturb, and explore: learning a feature alignment network for semi-supervised domain adaptation. arXiv preprint arXiv:2007.09375.
  18. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  19. A. Kurakin, C. Li, C. Raffel, D. Berthelot, E. D. Cubuk, H. Zhang, K. Sohn, N. Carlini and Z. Zhang (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. In NeurIPS.
  20. D. Li and T. Hospedales (2020) Online meta-learning for multi-source and semi-supervised domain adaptation. arXiv preprint arXiv:2004.04398.
  21. M. Long, Z. Cao, J. Wang and M. I. Jordan (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650.
  22. L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
  23. I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717.
  24. T. Miyato, S. Maeda, M. Koyama and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
  25. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett (Eds.), pp. 8024–8035.
  26. X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko and B. Wang (2019) Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1406–1415.
  27. X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang and K. Saenko (2017) VisDA: the visual domain adaptation challenge. arXiv preprint arXiv:1710.06924.
  28. K. Saenko, B. Kulis, M. Fritz and T. Darrell (2010) Adapting visual category models to new domains. In European Conference on Computer Vision, pp. 213–226.
  29. K. Saito, D. Kim, S. Sclaroff, T. Darrell and K. Saenko (2019) Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8050–8058.
  30. K. Saito, D. Kim, S. Sclaroff and K. Saenko (2020) Universal domain adaptation through self supervision. arXiv preprint arXiv:2002.07953.
  31. K. Saito, Y. Ushiku, T. Harada and K. Saenko (2018) Adversarial dropout regularization. In International Conference on Learning Representations.
  32. K. Saito, K. Watanabe, Y. Ushiku and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723–3732.
  33. M. Sajjadi, M. Javanmardi and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 1163–1171.
  34. R. Shu, H. Bui, H. Narui and S. Ermon (2018) A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations.
  35. K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations.
  36. Y. Sun, E. Tzeng, T. Darrell and A. A. Efros (2019) Unsupervised domain adaptation through self-supervision. arXiv preprint arXiv:1909.11825.
  37. E. Tzeng, J. Hoffman, K. Saenko and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176.
  38. H. Venkateswara, J. Eusebio, S. Chakraborty and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  39. V. Verma, A. Lamb, J. Kannala, Y. Bengio and D. Lopez-Paz (2019) Interpolation consistency training for semi-supervised learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3635–3641.
  40. B. Wallace and B. Hariharan (2020) Extending and analyzing self-supervised learning across domains. arXiv preprint arXiv:2004.11992.
  41. J. Xu, L. Xiao and A. M. López (2019) Self-supervised domain adaptation for computer vision tasks. IEEE Access 7, pp. 156694–156706.
  42. Y. Zhang, T. Liu, M. Long and M. I. Jordan (2019) Bridging theory and algorithm for domain adaptation. arXiv preprint arXiv:1904.05801.