Unsupervised Domain Adaptation
This paper addresses unsupervised domain adaptation, the setting where labeled training data is available on a source domain, but the goal is to have good performance on a target domain with only unlabeled data. Like much previous work, we seek to align the learned representations of the source and target domains while preserving discriminability. We accomplish alignment by learning to perform auxiliary self-supervised task(s) on both domains simultaneously. Each self-supervised task brings the two domains closer together along the direction relevant to that task. Training these tasks jointly with the main task classifier on the source domain is shown to successfully generalize to the unlabeled target domain. The presented objective is straightforward to implement and easy to optimize. We achieve state-of-the-art results on four out of seven standard benchmarks, and competitive results on segmentation adaptation. We also demonstrate that our method composes well with another popular pixel-level adaptation method.
Visual distribution shifts are fundamental to our constantly evolving world. We humans face them all the time, e.g. when we navigate a foreign city, read text in a new font, or recognize objects in an environment we have never encountered before. These real-world challenges to human visual perception have direct parallels in computer vision. Formally, a distribution shift happens when a model is trained on data from one distribution (source), but the goal is to make good predictions on some other distribution (target) that shares the label space with the source. Computational models often struggle even for pairs of distributions that humans find intuitively similar.
Our paper studies the setting of unsupervised domain adaptation, with labeled data in the source domain, but only unlabeled data in the target domain. The general philosophy of the field is to induce alignment of the source and target domains through some transformation. In the context of deep learning, a convolutional neural network maps images to learned representations in some feature space, so inducing alignment is done by making the distribution shifts small between the source and target in this shared feature space (csurka2017domain; wang2018deep; gopalan2011domain). If, in addition, such representations preserve discriminability on the source domain, then we can learn a good classifier on the source, which now generalizes to the target under the reasonable assumption that the representations of the two domains have the same ground truth.
Most existing approaches implement this philosophy of alignment by minimizing a measurement of distributional discrepancy in the feature space (details in section 2), often some form of maximum mean discrepancy (MMD) e.g. long2017deep, or a learned discriminator of the source and target as an approximation to the total variation distance e.g. ganin2016domain. Both measurements lead to the formulation of the training objective as a minimax optimization problem a.k.a. adversarial learning, which is known to be very difficult to solve. Unless carefully balanced, the push and pull in opposite directions can often cause wild fluctuations in the discrepancy loss and lead to sudden divergence (details in section 2). Therefore, we propose to avoid minimax optimization altogether through a very different approach.
Our main idea is to achieve alignment between the source and target domains by training a model on the same task in both domains simultaneously. Indeed, if we had labels in both domains, we could simply use our original classification task for this. However, since we lack labels in the target domain, we propose to use a self-supervised auxiliary task, which creates its own labels directly from the data (see section 3). In fact, we can use multiple self-supervised tasks, each one aligning the two domains along a direction of variation relevant to that task. Jointly training all the self-supervised tasks on both domains together with the original task on the source domain produces well-aligned representations as shown in Figure 1.
Like all of deep learning, we can only empirically verify that at least in our experiments, the model does not overfit by internally creating a different decision boundary for each domain along different dimensions, which would then yield bad results. Recent research suggests that stochastic gradient descent is indeed unlikely to find such overfitting solutions with a costly decision boundary of high complexity (implicit regularization), even though the models have enough capacity (zhang2016understanding; neyshabur2017geometry; arora2018stronger).
The key contribution of our work is to draw a connection between unsupervised domain adaptation and self-supervised learning. While we do not propose any fundamentally new self-supervised tasks, we offer insights in section 3 on how to select the right ones for adaptation, and propose in section 4 a novel training algorithm on those tasks, using batches of samples from both domains. Additionally, we demonstrate that domain alignment could be achieved with a simple and stable algorithm, without the need for adversarial learning. In section 5, we report state-of-the-art results on several standard benchmarks.
2 Related work
In this section we provide a brief overview of the two sub-fields that our work is aiming to bridge.
2.1 Unsupervised domain adaptation
Methods for unsupervised domain adaptation in computer vision can be divided into three broad classes. The dominant class, which our work belongs to, aims to induce alignment between the source and the target domains in some feature space. This has been done by optimizing for some measurement of distributional discrepancy. One popular measurement is the maximum mean discrepancy (MMD) – the distance between the means of the two domains in some reproducing kernel Hilbert space, where the kernel is chosen to maximize the distance (bousmalis2016domain; long2015learning; long2017deep). Another way to obtain a measurement of discrepancy is to train an adversarial discriminator that distinguishes between the two domains (ganin2014unsupervised; ganin2016domain; tzeng2017adversarial). However, both MMD and adversarial training are formulated as minimax optimization problems, which are widely known, both in theory and practice, to be very difficult (fedus2017many; duchi2016local; liang2017well; jin2019minmax). Since the optimization landscape is much more complex than in standard supervised learning, training often does not converge or converges to a bad local minimum (goodfellow2016nips; nagarajan2017gradient; li2017limitations), and requires carefully balancing the two sets of parameters (for minimization and maximization) so one does not dominate the other (salimans2016improved; neyshabur2017stabilizing).
To make minimax optimization easier, researchers have proposed numerous modifications to the loss function, network design, and training procedure (arjovsky2017wasserstein; gulrajani2017improved; karras2017progressive; courty2017optimal; sun2016deep; shu2018dirt; sener2016learning). Over the years, these modifications have yielded practical improvements on many standard benchmarks, but have also made the state-of-the-art algorithms very complicated. Often practitioners are not sure which tricks are necessary for which applications, and implementing these tricks can be bug-prone and frustrating. To make matters worse, since there is no labeled target data available for a validation set, practitioners have no way to perform hyper-parameter tuning or early stopping.
The second class of methods directly transforms the source images to resemble the target images with generative models (taigman2016unsupervised; hoffman2017cycada; bousmalis2017unsupervised). While similar to the first class in the philosophy of alignment, these methods operate on image pixels directly instead of an intermediate representation space, and therefore can benefit from an additional round of adaptation in some representation space. In subsection 5.2 we demonstrate that composing our method with a popular pixel-level method yields stronger performance than either alone.
The third class of methods uses a model trained on the labeled source data to estimate labels on the target data, then trains on some of those estimated pseudo-labels (e.g. the most confident ones), therefore bootstrapping through the unlabeled target data. Sometimes called self-ensembling (french2017self), this technique is borrowed from semi-supervised learning, where it is called co-training (saito2017asymmetric; zou2018unsupervised; chen2018adversarial; chen2011co). In contrast, our method uses joint training (of the main and self-supervised tasks), different from co-training in every aspect except the name.
2.2 Self-supervised Feature Learning
The emerging field of self-supervised learning aims to use the machinery of supervised learning to address problems where external supervision is not available. The idea is to come up with auxiliary (also called “pretext”) tasks, where data itself can be used as supervision, and use them to learn deep feature representations which will hopefully be useful for downstream “real” tasks. Many such auxiliary tasks have been proposed in the literature, including colorization (predicting the chrominance channels of an image given its luminance) (zhang2016colorful; larsson2017colorproxy; zhang2017split), image inpainting (pathak2016context), spatial context prediction (doersch2015unsupervised), solving jigsaw puzzles (noroozi2016unsupervised), image rotation prediction (gidaris2018unsupervised), predicting audio from video (owens2016ambient), contrastive predictive coding (oord2018representation), etc. Researchers have also experimented with combining multiple self-supervised tasks together (doersch2017multi), although with limited success.
Typically, self-supervision is used as a pre-training step on unlabeled data (e.g. the ImageNet training set without labels) to initialize a deep learning model, followed by fine-tuning on a labeled training set (e.g. PASCAL VOC) and evaluating on the corresponding test set. Instead, in this paper, we train the self-supervised tasks together with the main supervised task, encouraging a consistent representation that both aligns the two domains and does well on the main task. [Footnote 1: Note that it is possible to apply the standard pre-training followed by fine-tuning regime to unsupervised domain adaptation, doing self-supervised pre-training on the target domain (or both the source and the target) and then fine-tuning on the source. However, this gives almost no benefit over the no-adaptation baseline, and is far from being competitive.]
Recently, self-supervision has also been used for other problem settings, such as improving robustness (hendrycks2019using), domain generalization (carlucci2019domain) and few-shot learning (su2019boosting). The most relevant paper to us is ghifary2016deep, which uses self-supervision for unsupervised domain adaptation, but not through alignment. Their algorithm trains a denoising autoencoder (vincent2008extracting) only on the target data, together with the main classifier only on the labeled source data. They argue theoretically that this is better than training the autoencoder on both domains together. However, their theory is based on the critical assumption that the domains are already aligned, which is rarely true in practice. Consequently, their empirical results are much weaker than ours, as discussed in section 5. Please see Appendix A for more detailed comparisons with these works.
3 Designing self-supervised tasks for adaptation
[Table: Task | Images and self-supervised labels]
The design of auxiliary self-supervised tasks is an exciting area of research in itself, with many successful examples listed in section 2. However, not all of them are suitable for unsupervised domain adaptation. In order to induce alignment between the source and target, the labels created by self-supervision should not require capturing information on the very factors where the domains are meaninglessly different, i.e. the factors of variation that we are trying to eliminate through adaptation.
Particularly unsuitable are tasks that try to predict pixels of the original image, such as image inpainting (pathak2016context), colorization (zhang2016colorful; larsson2017colorproxy; zhang2017split) or the denoising autoencoder (vincent2008extracting). The success of pixel-wise reconstruction depends strongly on brightness information, or other factors of variation in overall appearance (e.g. sunny vs. cloudy) that are typically irrelevant to high-level visual concepts. Thus, instead of inducing alignment, learning a pixel reconstruction task would serve to further separate the domains. We have experimented with using the colorization task and the denoising autoencoder for our training algorithm, and found their performance little better than the source only baseline, and sometimes even worse. [Footnote 2: It is interesting to note that works borrowing from semi-supervised learning, e.g. ghifary2016deep, use a denoising autoencoder for their training algorithm and obtain performance better than source only (but still not competitive with ours). As explained in Appendix A, this difference is due to the difference in training algorithms: we use self-supervision on both domains while they only use it on the target. This further reflects the difference in philosophy, as we use target data for alignment whereas they use it simply as extra data as in semi-supervised learning.]
In general, classification tasks that predict structural labels seem better suited for our purpose than reconstruction tasks that predict pixels. Therefore, we have settled on three classification-based self-supervised tasks that combine simplicity with high performance:
Rotation Prediction (gidaris2018unsupervised): An input image is rotated in 90-degree increments, i.e. $0^\circ$, $90^\circ$, $180^\circ$, and $270^\circ$; the task is to predict the angle of rotation as a four-way classification problem.
Flip Prediction: An input image is randomly flipped vertically; the task is to predict whether it is flipped or not. [Footnote 3: We do not use horizontal flips, which is a common data augmentation technique, since it is typically desirable for natural scene features to be invariant to horizontal flips.]
Patch Location Prediction: Patches are randomly cropped out of an input image; the task is to predict where the patches come from. [Footnote 4: For large images (e.g. segmentation), the crop comes from a continuous set of coordinates, and the task is a regression problem in two dimensions, trained with the square loss. For small images (e.g. object recognition), the crop comes from one of the four quadrants, and the task is a four-way classification problem; this distinction exists only for ease of implementation.]
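The three tasks above can be read as transformations that produce their own labels. The following is a minimal, dependency-free Python sketch under stated assumptions, not the released implementation: images are assumed to be 2D lists of pixel values, the function names (`rotate90`, `rotation_sample`, `flip_sample`, `quadrant_sample`) are hypothetical, and only the four-quadrant classification variant of patch location is shown.

```python
import random

def rotate90(img, k):
    """Rotate a 2D grid counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        img = [list(row) for row in zip(*img)][::-1]
    return img

def rotation_sample(img, rng=random):
    """Return (rotated image, label in {0, 1, 2, 3}) for 0/90/180/270 degrees."""
    k = rng.randrange(4)
    return rotate90(img, k), k

def flip_sample(img, rng=random):
    """Return (possibly vertically flipped image, label in {0, 1})."""
    flipped = rng.randrange(2)
    rows = img[::-1] if flipped else img
    return [list(row) for row in rows], flipped

def quadrant_sample(img, rng=random):
    """Return (one of the four quadrant crops, label in {0, 1, 2, 3}).

    Label order is an arbitrary choice made for this sketch:
    top-left, top-right, bottom-left, bottom-right.
    """
    h, w = len(img) // 2, len(img[0]) // 2
    q = rng.randrange(4)
    top, left = (q // 2) * h, (q % 2) * w
    return [row[left:left + w] for row in img[top:top + h]], q
```

In this sketch the same functions would be applied to images from both domains, with any original labels discarded, so that each auxiliary task sees source and target samples labeled in exactly the same way.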
Consider a trivial illustrative example where the source and target are exactly the same except that the target pixels are all scaled down by a constant factor (e.g. daylight to dusk transition). All three of the aforementioned forms of self-supervision are suitable for this example, because pixel scaling i.e. brightness is “orthogonal” to the prediction of rotation, flip and location.
Our training algorithm is simple once a set of $K$ self-supervised tasks is selected. We already have the loss function on the main prediction task, denoted $\mathcal{L}_0$, that we do not have target labels for. Each self-supervised task corresponds to a loss function $\mathcal{L}_k$ for $k = 1, \dots, K$. So altogether, our optimization problem is set up as a combination of these loss functions, as is done for standard multi-task learning (see section 4). Implementation details are included in Appendix B.
Each loss function corresponds to a different “head” $h_k$ for $k = 0, \dots, K$, which produces predictions in the respective label space. All the task-specific heads (including $h_0$ for the actual prediction task) share a common feature extractor $\phi$. Altogether, the parameters of $\phi$ and $h_k$, $k = 0, \dots, K$, are the learned variables, i.e. the free parameters of our optimization problem.
In our paper, $\phi$ is a deep convolutional neural network, and every $h_k$ is simply a linear layer, i.e. multiplication by a matrix of size (output space dimension) $\times$ (feature space dimension). If the $k$th task is classification in nature, then $h_k$ also has a softmax or sigmoid (for multi-class or binary) following the linear layer. The output space dimension is only four for rotation and location classification, and two for flip and location regression. Depending on the network architecture used, the feature space dimension ranges between 64 and 512. The point is to make every $h_k$ low capacity, so the heads are forced to share high-level features, as is desirable for inducing alignment. A linear map from the highest-level features to the output is the smallest possible head and performs well empirically.
Let $S = \{(x_i, y_i),\ i = 1, \dots, m\}$ contain the labeled source data, and $T = \{x_i,\ i = 1, \dots, n\}$ contain the unlabeled target data. $\mathcal{L}_0$ for the main prediction task takes in the labeled source data, and produces the following term in our objective, where $\ell_0$ denotes the per-sample loss:

$$\mathcal{L}_0(S; \phi, h_0) = \sum_{(x, y) \in S} \ell_0\big(h_0(\phi(x)),\, y\big)$$

Each self-supervised task for $k = 1, \dots, K$ modifies the input samples with some transformation $f_k$ and creates labels $\tilde{y}$. Denote $S_k = \{(f_k(x_i), \tilde{y}_i),\ i = 1, \dots, m\}$ as the self-supervised samples generated from the source samples (with the original labels discarded), and $T_k = \{(f_k(x_i), \tilde{y}_i),\ i = 1, \dots, n\}$ from the target. Then the loss of each task produces the following term, where $\ell_k$ denotes the per-sample loss and $\tilde{x} = f_k(x)$ a transformed sample:

$$\mathcal{L}_k(S_k, T_k; \phi, h_k) = \sum_{(\tilde{x}, \tilde{y}) \in S_k} \ell_k\big(h_k(\phi(\tilde{x})),\, \tilde{y}\big) + \sum_{(\tilde{x}, \tilde{y}) \in T_k} \ell_k\big(h_k(\phi(\tilde{x})),\, \tilde{y}\big)$$

Note that $\mathcal{L}_k$ for $k = 1, \dots, K$, unlike $\mathcal{L}_0$, takes in both the source and target data; as we emphasize throughout the paper, this is critical for inducing alignment. Altogether, our optimization problem can be formalized as in multi-task learning:

$$\min_{\phi,\ h_k,\ k = 0, \dots, K} \ \mathcal{L}_0(S; \phi, h_0) + \sum_{k=1}^{K} \mathcal{L}_k(S_k, T_k; \phi, h_k)$$

[Footnote 5: We have experimented with trade-off hyper-parameters for the loss terms inside the sum and found them unnecessary. Since a labeled target validation set might not be available, it can be beneficial to reduce the number of hyper-parameters.]
At test-time, we discard the self-supervised heads and use $h_0(\phi(x))$ for prediction.
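To make the structure of the objective concrete, here is a small Python sketch of how the main loss and the self-supervised losses add up. It is an illustration under assumptions rather than the paper's code: every task is treated as classification with a softmax cross-entropy loss (the regression variant of patch location is left out), each head is a plain weight matrix without bias, and the names `cross_entropy`, `linear_head` and `total_objective` are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy of one sample, computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def linear_head(weights, features):
    """A head is a single linear map: one row of weights per output class."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def total_objective(phi, heads, source, ss_source, ss_target):
    """Main loss on labeled source plus each auxiliary loss on both domains.

    phi:        shared feature extractor, maps an input to a list of features
    heads:      heads[0] is the main head, heads[k] the head of auxiliary task k
    source:     list of (x, y) labeled source samples
    ss_source:  ss_source[k-1] is a list of (transformed x, self-label) pairs
                generated from the source
    ss_target:  the same, generated from the unlabeled target
    """
    loss = sum(cross_entropy(linear_head(heads[0], phi(x)), y) for x, y in source)
    for k, (s_k, t_k) in enumerate(zip(ss_source, ss_target), start=1):
        # Source and target samples of task k go through the same head.
        for x, y in s_k + t_k:
            loss += cross_entropy(linear_head(heads[k], phi(x)), y)
    return loss
```

Only `heads[0]` and `phi` would be kept at test time; the auxiliary heads exist solely to shape the shared features during training.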
4.1 A heuristic for hyper-parameter tuning and early stopping
As previously mentioned, in the unsupervised domain adaptation setting, there is no target label available and therefore no target validation set, so typical strategies for hyper-parameter tuning and early stopping that require a validation set cannot be applied. This problem remains underappreciated; in fact, it is often unclear how previous works select their hyper-parameters or determine when training is finished, both of which can be important factors that impact performance, especially for complex algorithms using adversarial learning. In this subsection we describe a simple heuristic, merely as a rule of thumb to make things work rather than as a technical innovation. Like almost all of deep learning, there is no statistical guarantee on this heuristic, but it is shown to be practically effective in our experiments (see Figure 3). We only hope that it serves as a first guess at a solution to this underappreciated problem. [Footnote 6: Appendix C contains some additional explanation of why this heuristic cannot be used for regular training. The short answer is given by the famous Goodhart’s law in economics and dynamical systems: “When a measurement becomes an objective, it ceases to be a good measurement.”]
The main idea is that, because our method never explicitly optimizes for any measurement of distributional discrepancy, these measurements can instead be used for hyper-parameter tuning and early stopping. Since it would be counterproductive to introduce additional parameters in order to perform hyper-parameter tuning, we simply use the distance between the mean of the source and target samples in the learned representation space, as produced by the feature extractor $\phi$. Formally, this can be expressed as

$$D(S, T; \phi) = \left\| \frac{1}{|S|} \sum_{x \in S} \phi(x) - \frac{1}{|T|} \sum_{x \in T} \phi(x) \right\|_2$$

where $S$ and $T$ are unlabeled source and target validation sets. [Footnote 7: $D$ above is also known as the discrepancy under the linear kernel from the perspective of kernel MMD. Appendix C contains some additional explanation of why the mean distance is suitable for our heuristic, even though it is a specific form of MMD, which we claim to be difficult to optimize. The short answer is that it does not require minimax optimization.]
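The mean distance is cheap to compute once features have been extracted. The sketch below assumes the features are given as lists of equal-length vectors and that the distance is Euclidean; the function names `mean_vector` and `mean_distance` are hypothetical and not taken from the released code.

```python
import math

def mean_vector(features):
    """Coordinate-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def mean_distance(source_features, target_features):
    """Euclidean distance between the source and target feature means."""
    mu_s = mean_vector(source_features)
    mu_t = mean_vector(target_features)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))
```

Because the measurement has no parameters of its own, it can be evaluated at the end of every epoch on held-out unlabeled data without any additional training.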
Our heuristic combines the mean distance $D$ and the main task error on the (labeled) source validation set. Denote $\mathbf{d}$ and $\mathbf{e}$ the measurement vectors of those two quantities respectively over epochs. The final measurement vector $\mathbf{v}$ is a normalized sum of the two vectors; the epoch at which we perform early stopping is then simply $\arg\min_t \mathbf{v}_t$. Intuitively, this heuristic roughly corresponds to our goal of inducing alignment while preserving discriminability.
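The selection rule can be sketched in a few lines of Python. The text only states that the two curves are combined as a normalized sum, so the min-max rescaling used here is an assumption made for illustration, and `normalize` and `select_epoch` are hypothetical names.

```python
def normalize(values):
    """Min-max rescale to [0, 1]; a constant sequence maps to all zeros.

    Min-max scaling is an assumption of this sketch, not a detail from the paper.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def select_epoch(mean_distances, source_errors):
    """Index of the epoch that minimizes the normalized sum of the two curves."""
    if len(mean_distances) != len(source_errors):
        raise ValueError("need one measurement of each kind per epoch")
    combined = [d + e for d, e in
                zip(normalize(mean_distances), normalize(source_errors))]
    return min(range(len(combined)), key=combined.__getitem__)
```

Rescaling each curve before adding keeps either quantity from dominating merely because of its units, which matches the stated intent of balancing alignment against discriminability.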
|VADA (IN) (shu2018dirt)||95.7||73.3||94.5||-||-||78.3||71.4|
|DIRT-T (IN) (shu2018dirt)||98.7||76.5||99.4||-||-||-||73.3|
|Source only VADA & DIRT-T||58.5||27.9||77.0||-||-||76.3||63.6|
|Source only VADA & DIRT-T (IN)||59.9||40.9||82.4||-||-||77.0||62.6|
|Source only our method||44.9||30.5||92.2||94.7||81.4||75.6||56.1|
|R||98.9||61.3||85.8*||96.5||90.2||81.2||65.6|
* This entry uses both rotation and flip.
Our code is available at https://github.com/yueatsprograms/uda_release for object recognition and https://github.com/erictzeng/ssa-segmentation-release for segmentation.
5.1 Seven benchmarks for object recognition
The seven benchmarks are based on the six datasets described in Appendix D, each with a predefined training set / test set split, and labels are available on both splits. Previous works (cited in Table 2) have created those seven benchmarks by picking pairs of datasets with the same label space, treating one as the source and the other as the target, and training on the training set of the source with labels revealed and of the target with labels hidden. Following the standard setup of the field, labels on the target test set should only be used for evaluation, not for hyper-parameter tuning or early stopping; therefore we apply the heuristic described in subsection 4.1.
For the two natural scene benchmarks, we use all three tasks described in section 3: rotation, location and flip prediction. For the five benchmarks on digits we do not use location because it yields trivial solutions that do not encourage the learning of semantic concepts. Given the image of a digit cropped into the four quadrants, location prediction can be trivially solved by looking at the four corners where a white stroke determines the category. Adding flip to the digits does not hurt performance, but does not improve significantly either, so we do not report those results separately.
As shown in Table 2, despite the simplicity of our method, we achieve state-of-the-art accuracy on four out of the seven benchmarks. In addition, we show the source only results from our closest competitor (VADA and DIRT-T), and note that our source only results are in fact lower than theirs on those very benchmarks that we perform the best on; this indicates that our success is indeed due to effectiveness in adaptation instead of the base architecture.
Our method fails on the pair of benchmarks with SVHN, on which rotation yields trivial solutions. Because SVHN digits are cropped from house numbers with multiple digits, the majority of the images have parts of the adjacent digits on the sides. The main task head needs to look at the center, but the rotation head learns to look at the periphery and cheat. This failure case shows that the success of our method is tied to how well the self-supervised task fits the application. Practitioners should use their domain knowledge to evaluate how well the task fits, instead of blithely applying it to everything.
Also seen in Table 2 is that our method excels at object recognition in natural scenes, especially with all three tasks together. For STL-10→CIFAR-10, our base model is much worse than that of VADA and DIRT-T, but our method still beats all the baselines. Adding location and flip gives an improvement of 9%, on top of the 9% already gained over source only. This is not surprising since those tasks were originally developed for ImageNet pre-training, i.e. object recognition in natural scenes; our method is very successful when the task fits the application well.
5.2 Benchmark for semantic segmentation
|Ours + CyCADA||86.6||37.8||80.8||29.7||16.4||28.9||30.9||22.2||83.8||37.1||76.9||60.1||7.8||84.1||30.8||32.1||1.2||23.2||13.3||41.2|
To experiment with our method in more diverse applications, we also evaluate on a challenging simulation-to-real benchmark for semantic segmentation: GTA5→Cityscapes. GTA5 (richter2016playing) contains 24,966 video frames taken from the computer game, where dense segmentation labels are automatically given by the game engine. Cityscapes (cordts_cityscapes) contains 5,000 video frames taken from real-world dash-cams. The main task is to classify every pixel in an image as one of the 19 classes shared across both datasets, and accuracy is measured by intersection over union (IoU). The best possible results are given as the oracle, when the labels on Cityscapes are available for training, so typical supervised learning methods are applicable.
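For readers unfamiliar with the metric, IoU for a class is the number of pixels that both the prediction and the ground truth assign to that class, divided by the number of pixels that either assigns to it. The Python sketch below computes it over flat label sequences; it is a generic illustration with hypothetical function names, and it does not reproduce the benchmark's official evaluation protocol (for example, how ignored pixels are handled or how counts are accumulated over the dataset).

```python
def per_class_iou(pred, target, num_classes):
    """IoU per class over flat label sequences; None for classes absent from both."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    return [i / u if u else None for i, u in zip(inter, union)]

def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that appear in the prediction or the target."""
    ious = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    if not ious:
        raise ValueError("no class is present in either sequence")
    return sum(ious) / len(ious)
```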
Our results are shown in Table 3, and implementation details in Appendix F. Our self-supervised tasks were designed for classification, where the label on an image depends on its global content, while in segmentation the labels tend to be highly local. Nevertheless, with very little modification, we see significant improvements over the source only baseline.
We also experiment with combining our method with another popular unsupervised domain adaptation method – CyCADA (hoffman2017cycada), designed specifically for segmentation. Surprisingly, when operating on top of images produced by this already very strong baseline, our method further improves performance. This demonstrates that pixel-level adaptation methods might still benefit from an additional round of adaptation by inducing alignment through self-supervision. We emphasize that these results are obtained with a very simple instantiation of our method, as a start towards the development of self-supervised tasks more suitable for semantic segmentation.
We hope that this work encourages future researchers in unsupervised domain adaptation to consider the study of self-supervision as an alternative to adversarial learning, and researchers in self-supervision to consider designing tasks and evaluating them in our problem setting. Most self-supervised tasks today were originally designed for pre-training and evaluated in terms of accuracy gains on downstream recognition, localization or detection tasks. It will be interesting to see if new self-supervised tasks can arise from the motivation of adaptation, for which alignment is the key objective. Moreover, domain experts could perhaps incorporate their dataset-specific knowledge into the design of a self-supervised task specifically for their application.
One additional advantage of our method, not considered in this paper, is that it might be particularly amenable to very small target sample sizes, for which methods based on adversarial learning cannot accurately estimate the target distribution. We leave this topic for future work.
Appendix A Additional discussion on specific papers related to ours
In this section we discuss the four papers mentioned in section 2 that are related to ours, but different in training algorithm, philosophy or problem setting.
Deep Reconstruction-Classification Networks (ghifary2016deep). This method works in unsupervised domain adaptation, the same problem setting as ours. It learns a denoising autoencoder on the target, together with the main classifier on the source. However, they use only the target domain for reconstruction, claiming both empirically and theoretically that it is better than using both domains together. In contrast, we use both the source and target for self-supervision. These two algorithmic differences really reflect our fundamental difference in philosophy. They take the target data as analogous to the unlabeled (source) data in semi-supervised learning, so any form of self-supervision suitable for semi-supervised learning is good enough; their theoretical analysis is also directly borrowed from that of semi-supervised learning, which concludes that it is necessary and sufficient to only use the target data for self-supervision. We take data from both domains for self-supervision in order to align their feature distributions; also, both conceptually and empirically, we cannot use reconstruction tasks because they are unsuitable for inducing alignment (see section 3).
Empirical experiments further support our arguments in three ways: 1) If we were to only use the target for our self-supervision tasks, we would perform barely better than the source only baseline, even on MNIST→MNIST-M, the easiest benchmark where we would otherwise observe a huge improvement. 2) If we were to use the reconstruction task, i.e. a denoising autoencoder, we would again perform barely better than the source only baseline, and sometimes even worse, as described in section 3. 3) As shown in Table 2, our results are much better than theirs on all of the benchmarks by a large margin. This shows that implementing the philosophy of alignment, developed for unsupervised domain adaptation, is much superior in our problem setting to blithely borrowing from semi-supervised learning.
Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty (hendrycks2019using). This paper studies the setting of robustness, where there is no sample provided, even unlabeled, from the target domain. Their method jointly trains for the main task and the self-supervised task only on the source domain, really because there is no target data of any kind. Because their problem setting is very challenging (with so little information provided on the target), their primary evaluation is on a dataset of CIFAR-10 images with the addition of 15 types of common corruptions (e.g. impulse noise, JPEG compression, snow weather, motion blur and pixelation), simulated by an algorithm. No idea on unsupervised domain adaptation is mentioned.
Domain Generalization by Solving Jigsaw Puzzles (carlucci2019domain). This paper studies two settings. The first is the robustness setting, exactly the same as in hendrycks2019using, except that evaluation is done on MNIST→MNIST-M and MNIST→SVHN. Their baseline is also a method from the robustness community that trains on adversarial examples and uses no target data. Again, because their problem setting is very challenging, the accuracy is low for both their proposed method and the baseline. The second setting is called domain generalization, which is very similar to meta-learning. The goal is to perform well simultaneously on multiple distributions, all labeled. Evaluation is done using the mean accuracy on all the domains. Besides the name, there is little similarity between the settings of unsupervised domain adaptation and domain generalization, which has no unsupervised component.
Boosting Supervision with Self-Supervision for Few-shot Learning (su2019boosting). As evident from the title, this paper studies the setting of few-shot learning, where the goal is to perform well on a very small dataset. Again, there is no unsupervised component and little connection to our setting.
Most of our results had already been produced when carlucci2019domain and su2019boosting were presented at a conference. hendrycks2019using was submitted to NeurIPS 2019 and should be considered concurrent work.
Appendix B Additional algorithmic details
In practice and for our experiments, the source and target datasets are often imbalanced in size. If we were to blithely solve for the objective in Equation 2, the domain with a larger dataset would carry more weight for every self-supervised loss $\mathcal{L}_k$ because we are summing over all samples in both datasets. We would like to keep the two sums inside $\mathcal{L}_k$ roughly balanced such that in terms of features produced by the shared feature extractor $\phi$, neither of the two domains would dominate the other. This is easy to achieve in our implementation through balanced batches. When we need to sample a batch for task $k$, we simply sample half the batch from the source, and another half from the target, then put them together. In the end, our implementation requires little change on top of an existing supervised learning codebase. Each self-supervised task is defined as a module, and adding a new one only amounts to defining the structural modifications and the loss function.
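Balanced batching as described above might be sketched as follows. This is a minimal illustration rather than the released code: `balanced_batch` is a hypothetical name, sampling within a batch is done without replacement, and each dataset is assumed to hold at least half a batch of samples.

```python
import random

def balanced_batch(source, target, batch_size, rng=random):
    """Draw half of the batch from each domain, regardless of dataset sizes."""
    half = batch_size // 2
    # Sampling without replacement inside one batch is a choice of this sketch.
    batch = rng.sample(source, half) + rng.sample(target, batch_size - half)
    rng.shuffle(batch)
    return batch
```

With this scheme a target set that is a hundred times smaller than the source still contributes half of every self-supervised batch, which is what keeps either domain from dominating the shared features.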
We optimize the objective in Equation 2 with stochastic gradient descent (SGD). One can simply take a joint step on Equation 2, which covers the losses for all the self-supervised tasks and the main task. However, the implementation would then have to store the gradients with respect to all losses together. For memory efficiency, our implementation instead loops over the self-supervised tasks; for each task, it samples a batch of combined source and target data (without their original labels), applies that task's structural modification to the images, creates new labels according to the modification, and obtains a loss for that task's head and the shared feature extractor on this batch. A gradient step is thus taken for each self-supervised task, before finally a batch of original source images and labels is sampled for a gradient step on the main-task head and the feature extractor. In terms of test accuracy, these two implementations make little difference (usually less than 1%), and the choice simply comes down to a time-memory trade-off.
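The memory-efficient variant can be sketched as follows. This only illustrates the order of the gradient steps and is not our training code: the two toy tasks, the 2D-list images, and the `sgd_step` callback (which here merely records calls instead of updating a network) are assumptions made for the example.

```python
import random


def rotate90(image, times):
    """Rotate a 2D list-of-lists image clockwise by 90 degrees `times` times."""
    for _ in range(times % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def rotation_task(image, rng):
    label = rng.randrange(4)  # 0, 90, 180, or 270 degrees
    return rotate90(image, label), label


def flip_task(image, rng):
    label = rng.randrange(2)  # 1 means the image is flipped upside down
    return (image[::-1] if label else image), label


def train_iteration(tasks, sample_mixed_batch, sample_source_batch, sgd_step, rng):
    """One gradient step per self-supervised task, then one on the main task."""
    for head_name, make_example in tasks:
        images = sample_mixed_batch()  # source + target, original labels dropped
        examples = [make_example(img, rng) for img in images]
        inputs = [x for x, _ in examples]
        labels = [y for _, y in examples]  # labels created by the modification
        sgd_step(head_name, inputs, labels)
    images, labels = sample_source_batch()  # labeled source data only
    sgd_step("main", images, labels)


log = []


def record_step(head_name, inputs, labels):
    # Stand-in for an SGD step on the named head and the shared extractor.
    log.append((head_name, len(inputs)))


toy = [[1, 2], [3, 4]]
train_iteration(
    tasks=[("rotation", rotation_task), ("flip", flip_task)],
    sample_mixed_batch=lambda: [toy] * 4,
    sample_source_batch=lambda: ([toy] * 4, [0, 1, 0, 1]),
    sgd_step=record_step,
    rng=random.Random(0),
)
```

Because only one loss is processed at a time, only one set of gradients has to be held in memory, at the cost of one forward and backward pass per task.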
Appendix C Additional discussion on the mean distance
In this section we answer some potential questions about our selection rule, which uses the mean distance, from the perspective of a possibly confused reader.
Q: The motivation of the paper is that methods based on minimax optimization, such as those using kernel-MMD, are difficult to optimize. Since you are using the mean distance (MMD under the linear kernel) for hyper-parameter tuning and early stopping, which is a form of model selection, how are you different from those other methods and why is optimization easy for you?
A: Our method is fundamentally different from those other methods based on MMD, and is more amenable to optimization for the following two reasons: 1) Even though we use the mean distance, which is a form of MMD, we never pose a minimax optimization problem. The model parameters minimize the loss functions of the tasks, which we hope makes the mean distance small (see Figure 3). Model selection using the rule in subsection 4.1 also minimizes the mean distance. In our method, all the loss functions, for model parameters and hyper-parameters, work towards the same goal of inducing alignment, so optimization is easier. 2) Even though hyper-parameter tuning and early stopping are forms of model selection just like training, there is a qualitative difference in the degrees of freedom involved. For the rule in subsection 4.1, when performing early stopping for example, optimization amounts to a grid search over the epochs after training is finished, and there is only one degree of freedom.
Q: Why is the mean distance suitable for model selection when it comes to hyper-parameter tuning and early stopping, but not regular training?
A: Again, we agree that hyper-parameter tuning and early stopping are forms of model selection. However, unlike model parameters, which number in the hundreds of thousands, there are at most a few hyper-parameters and only one parameter for early stopping. Model parameters can easily overfit to the mean distance, while the few degrees of freedom we use cannot. This is also precisely the reason why previous works using MMD resort to minimax optimization. Since they use MMD for training, they must also give MMD enough degrees of freedom to work in the opposite direction of optimization from the model parameters, so that it is not as easy to overfit.
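To make the single degree of freedom concrete, the sketch below computes the mean distance as the Euclidean distance between the mean feature vectors of the two domains, then picks the epoch where it is smallest. It is a simplified illustration: the function names are hypothetical, features are plain lists of floats, and any additional terms of the selection rule in subsection 4.1 are not reproduced here.

```python
import math


def mean_vector(features):
    """Coordinate-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def mean_distance(source_features, target_features):
    """Distance between the domain means (MMD under the linear kernel)."""
    mu_s = mean_vector(source_features)
    mu_t = mean_vector(target_features)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))


def select_epoch(distances_per_epoch):
    """Grid search over epochs: one degree of freedom, no gradients."""
    return min(range(len(distances_per_epoch)),
               key=distances_per_epoch.__getitem__)


# Toy features for one checkpoint, and toy per-epoch distances.
example_distance = mean_distance([[0.0, 0.0], [2.0, 0.0]],
                                 [[1.0, 3.0], [1.0, 5.0]])
best_epoch = select_epoch([0.9, 0.4, 0.6])
```

The search in `select_epoch` ranges over a handful of stored checkpoints only, which is why it cannot overfit to the mean distance the way hundreds of thousands of model parameters could.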
Appendix D Details of the six datasets used for object recognition
1) MNIST (lecun1998gradient): greyscale images of handwritten digits 0 to 9; 60,000 samples in the training set and 10,000 in the test set.
2) MNIST-M (ganin2016domain): constructed by blending MNIST digits with random color patches from the BSDS500 dataset (arbelaez2011contour); same training / test set size as MNIST.
3) SVHN (netzer2011reading): colored images of cropped out house numbers from Google Street View; the task is to classify the digit at the center; 73,257 samples in the training set, 26,032 in the test set and 531,131 easier samples for additional training.
4) USPS: greyscale images of handwritten digits only slightly different from MNIST; 7,291 samples in the training set and 2,007 in the test set.
5) CIFAR-10 (krizhevsky2009learning): colored images of 10 classes of centered natural scene objects; 50,000 samples in the training set and 10,000 in the test set.
6) STL-10 (coates2011analysis): colored images of objects only slightly different from CIFAR-10; 5,000 samples in the training set and 8,000 in the test set.
Because CIFAR-10 and STL-10 differ in one class category, we follow common practice (shu2018dirt; french2017self; ghifary2016deep) and delete the offending categories, so each of these two datasets actually only has 9 classes.
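The class alignment can be sketched as below. The class lists follow the standard orderings of the two datasets; the helper names and the alias treating "automobile" and "car" as the same category are assumptions made for this illustration.

```python
# Standard class orderings of the two datasets.
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]
STL10_CLASSES = ["airplane", "bird", "car", "cat", "deer",
                 "dog", "horse", "monkey", "ship", "truck"]

# Assumption: "automobile" and "car" name the same class.
ALIASES = {"automobile": "car"}


def canonical(name):
    return ALIASES.get(name, name)


def shared_classes(classes_a, classes_b):
    """Sorted class names that appear in both datasets."""
    names_b = {canonical(n) for n in classes_b}
    return sorted(canonical(n) for n in classes_a if canonical(n) in names_b)


def build_label_map(classes, shared):
    """Map each kept original label index to its index in the shared space."""
    return {i: shared.index(canonical(n))
            for i, n in enumerate(classes) if canonical(n) in shared}


def filter_and_relabel(samples, label_map):
    """Drop samples of removed classes and rewrite the remaining labels."""
    return [(x, label_map[y]) for x, y in samples if y in label_map]


SHARED = shared_classes(CIFAR10_CLASSES, STL10_CLASSES)
CIFAR_MAP = build_label_map(CIFAR10_CLASSES, SHARED)
STL_MAP = build_label_map(STL10_CLASSES, SHARED)
```

With these lists, "frog" is removed from CIFAR-10 and "monkey" from STL-10, and the remaining 9 classes receive the same label index in both datasets.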
Appendix E Implementation details on the object recognition benchmarks
We use a 26-layer pre-activation ResNet (he2016identity) as our test-time model, which consists of a last linear layer that makes the predictions and a feature extractor comprising everything before it. For unsupervised domain adaptation, there is no consensus among previous works on what base architecture to use. Our choice is simply based on the widespread adoption of the ResNet architecture and its ease of implementation. In Table 2 we provide the source-only results using our base architecture, and the ones from shu2018dirt, our closest competitor. Our source-only results are in fact worse than theirs, indicating that our improvements are indeed made through adaptation.
At training time, the self-supervised heads are simply linear layers connected to the end of the feature extractor, as discussed in section 4. There is no other modification to the standard ResNet. For all experiments on the object recognition benchmarks, we optimize our model with SGD using weight decay 5e-4 and momentum 0.9, with a batch size of 128. We use an initial learning rate of 0.1 and a two-milestone schedule, where the learning rate drops by a factor of 10 at each milestone. All these optimization hyper-parameters are taken directly from the standard literature (he2016identity; huang2016deep; guo2017calibration) without any modification for our problem setting. We select the total number of epochs and the two milestones based on convergence of the source classifier and the self-supervised classifiers. Early stopping is done using the selection heuristic discussed in subsection 4.1. For fair comparison with our baselines, we do not perform data augmentation, following previous works (hoffman2017cycada; sener2016learning; ghifary2016deep).
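The two-milestone schedule can be written as a short function. The initial learning rate of 0.1 and the drop by a factor of 10 are taken from the text; the milestone epochs 50 and 75 are hypothetical placeholders, since the actual milestones are chosen per experiment based on convergence.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(50, 75), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone passed.

    The default milestones are placeholders, not the values we used.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


# Rates before the first milestone, between the two, and after the second.
schedule = [learning_rate(e) for e in (0, 49, 50, 74, 75)]
```

In this sketch the rate stays at 0.1 until the first milestone, is about 0.01 until the second, and about 0.001 afterwards.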
Appendix F Implementation details on GTA5→Cityscapes
For our experiments, we initialize our model from the DeepLab-v3 architecture (deeplabv3), pre-trained on ImageNet, as commonly done in the field. Each self-supervised head consists of a global average pooling layer on the pre-logit layer, followed by a single linear layer. To take advantage of the large size of the natural scene images, we use the continuous (i.e., regression) version of location prediction. The self-supervised head is trained with the square loss to regress the two-dimensional coordinates that the patch is cropped from. As is natural for the regression version, instead of cropping from the quadrants as in the classification version on the small datasets, we crop 400×400 patches taken at random from the segmentation scenes. We optimize our model with SGD using a learning rate of 0.007 for 15,000 iterations, with a batch size of 48. Once again, we use the selection heuristic in subsection 4.1 for early stopping.
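The regression version of location prediction can be sketched as follows. Only the 400-pixel patch size and the square loss come from the text; the function names, the normalization of coordinates to [0, 1], the summation over the two coordinates, and the 1024 by 2048 image size are assumptions made for the example.

```python
import random

PATCH = 400  # patch side length in pixels, as stated in the text


def sample_patch(height, width, rng, patch=PATCH):
    """Pick a random patch and return its corner plus a regression target."""
    if height < patch or width < patch:
        raise ValueError("image is smaller than the patch")
    top = rng.randint(0, height - patch)   # randint is inclusive on both ends
    left = rng.randint(0, width - patch)
    # Assumption: coordinates are normalized to [0, 1] before regression.
    target = (top / max(height - patch, 1), left / max(width - patch, 1))
    return top, left, target


def square_loss(prediction, target):
    """Squared error summed over the two coordinates (an assumption)."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


rng = random.Random(0)
top, left, target = sample_patch(height=1024, width=2048, rng=rng)
example_loss = square_loss((0.5, 0.5), (0.5, 0.0))
```

Unlike the quadrant-based classification version, every crop position yields a distinct continuous label, which is what makes this variant suited to large images.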