Universal Domain Adaptation through Self Supervision
Unsupervised domain adaptation methods traditionally assume that all source categories are present in the target domain. In practice, little may be known about the category overlap between the two domains. While some methods address target settings with either partial or open-set categories, they assume that the particular setting is known a priori. We propose a more universally applicable domain adaptation approach that can handle arbitrary category shift, called Domain Adaptive Neighborhood Clustering via Entropy optimization (DANCE). DANCE combines two novel ideas: First, as we cannot fully rely on source categories to learn features discriminative for the target, we propose a novel neighborhood clustering technique to learn the structure of the target domain in a self-supervised way. Second, we use entropy-based feature alignment and rejection to align target features with the source, or reject them as unknown categories based on their entropy. We show through extensive experiments that DANCE outperforms baselines across open-set, open-partial and partial domain adaptation settings.
Deep neural networks can learn highly discriminative representations for image recognition tasks Deng et al. (2009); Simonyan and Zisserman (2014); Krizhevsky et al. (2012); Ren et al. (2015); He et al. (2017), but do not generalize well to domains that are not distributed identically to the training data. Domain adaptation (DA) aims to transfer representations of source categories to novel target domains without additional supervision. Recent deep DA methods primarily do this by minimizing the feature distribution shift between the source and target samples Ganin and Lempitsky (2014); Long et al. (2015); Sun et al. (2016). However, these methods make strong assumptions about the degree to which the source categories overlap with the target domain, which limits their applicability to many real-world settings.
In this paper, we address the problem of Universal DA. Suppose $L_s$ and $L_t$ are the label sets in the source and target domain. In Universal DA we want to handle all of the following potential “category shifts”: closed-set ($L_s = L_t$), open-set ($L_s \subset L_t$) Busto and Gall (2017); Saito et al. (2018c), partial ($L_t \subset L_s$) Cao et al. (2018), or a mix of open and partial You et al. (2019), see Fig. 1. Existing DA methods cannot address Universal DA well because they are each designed to handle just one of the above settings. However, since the target domain is unlabeled, we may not know in advance which of these situations will occur. Thus, an unexpected category shift could lead to catastrophic misalignment. For example, using a closed-set method when the target has novel (“unknown”) classes could incorrectly align them to source (“known”) classes. The underlying issue at play is that existing work heavily relies on prior knowledge about the category shift.
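The four category shifts listed above differ only in which of the two label sets has private classes. The following sketch makes that explicit with plain set operations. The function name `category_shift` and the toy class names are hypothetical, and in a real Universal DA problem the target label set is not observed, so this is an illustration of the definitions rather than something a method could run.

```python
def category_shift(source_labels, target_labels):
    """Name the category-shift setting implied by two label sets."""
    source, target = set(source_labels), set(target_labels)
    source_private = source - target   # classes that appear only in the source
    target_private = target - source   # "unknown" classes, only in the target
    if not source_private and not target_private:
        return "closed-set"
    if not source_private:
        return "open-set"
    if not target_private:
        return "partial"
    return "open-partial"


print(category_shift({"bike", "mug"}, {"bike", "mug"}))          # closed-set
print(category_shift({"bike", "mug"}, {"bike", "mug", "lamp"}))  # open-set
print(category_shift({"bike", "mug", "pen"}, {"bike", "mug"}))   # partial
print(category_shift({"bike", "mug"}, {"bike", "lamp"}))         # open-partial
```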
The second problem is that the over-reliance on supervision in the source domain also makes it challenging to obtain discriminative features on the target.
Prior methods focus on aligning target features with source, rather than on exploiting structure specific to the target domain. In the universal DA setting, this means that we may fail to learn features useful for discriminating “unknown” categories from the known categories, because such features may not exist in the source.
Self-supervision was proposed in Carlucci et al. (2019) to extract domain-generalizable features, but it is limited in extracting discriminative features on the target.
We propose to overcome these challenging problems by introducing Domain Adaptive Neighborhood Clustering via Entropy optimization (DANCE). An overview is shown in Fig. 2. Rather than relying only on the supervision of source categories to learn a discriminative representation, DANCE harnesses the cluster structure of the target domain using self-supervision. This is done with a “neighborhood clustering” technique that self-supervises feature learning in the target. At the same time, useful source features and class boundaries are preserved and adapted with a partial domain alignment loss that we refer to as “entropy separation loss.” This loss allows the model to either match each target example with the source, or reject it as an “unknown” category.
Our contributions are summarized as follows:
We propose DANCE, a universal domain adaptation method that can be applied out-of-the-box without prior knowledge of specific category shift,
We experimentally observe that DANCE is the only method that outperforms the source-only model in every setting,
We achieve state-of-the-art performance on all open-set and open-partial DA settings, and some partial DA settings, and
We learn discriminative features of “unknown” target samples even without any supervision.
2 Related Work
Closed-set Domain Adaptation (CDA). The main challenge in domain adaptation (DA) is the domain gap in feature distributions between domains, which degrades the source classifier’s performance. The basic approach of DA measures the distance between feature distributions in source and target, then trains a model to minimize this distance. Many DA methods utilize a domain classifier to measure the distance Ganin and Lempitsky (2014); Tzeng et al. (2014); Long et al. (2015, 2018), while others minimize classifier discrepancy Saito et al. (2018b, a); Zhang et al. (2019) to learn more discriminative features, or utilize pseudo-labels assigned to target examples Saito et al. (2017); Zou et al. (2018). Clustering-based methods are proposed by Deng et al. (2019); Sener et al. (2016); Haeusser et al. (2017). These and other mainstream methods assume that all target examples belong to source classes. In this sense, they rely heavily on the relationship between source and target.
Partial Domain Adaptation (PDA) handles the case where the target classes are a subset of source classes. This task is solved by performing importance-weighting on source examples that are similar to samples in the target Cao et al. (2018); Zhang et al. (2018); Cao (2019).
Open set Domain Adaptation (ODA) deals with target examples whose class is different from any of the source classes Panareda Busto and Gall (2017); Saito et al. (2018c); Liu et al. (2019). The drawback of ODA methods is that they assume we necessarily have unknown examples in the target domain, and can fail in closed or partial domain adaptation.
The idea of Universal Domain Adaptation (UniDA) was proposed in You et al. (2019). However, they applied their method to a mixture of PDA and ODA, which we call OPDA, where the target domain contains a subset of the source classes plus some unknown classes. Our goal is to propose a method that works well on CDA, ODA, PDA, and OPDA. We call the task UniDA in our paper.
Self-Supervised Learning. Self-supervised learning obtains features useful for various image recognition tasks by using a large number of unlabeled images Doersch et al. (2015). A model is trained to solve a pretext (surrogate) task such as solving a jigsaw puzzle Noroozi and Favaro (2016) or instance discrimination Wu et al. (2018). Huang et al. (2019); Zhuang et al. (2019) proposed to perform instance discrimination and trained a model to discover neighborhoods for each example. They calculate cross entropy loss on the probabilistic distribution of similarity between examples. Our work is similar in that we aim to perform unsupervised clustering of unlabeled examples, but different in that Huang et al. (2019); Zhuang et al. (2019) require specifying which examples are in the neighborhood for each example. We perform entropy minimization on the similarity distribution among unlabeled target examples and source prototypes. Carlucci et al. (2019) proposed to utilize the jigsaw puzzle pretext task for domain generalization with access to multiple source domains. We perform a comparison between DANCE and this method in our supplementary.
3 DANCE: Domain Adaptive Neighborhood Clustering via Entropy optimization
Our task is universal domain adaptation: given a labeled source domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ with “known” categories $L_s$ and an unlabeled target domain $D_t = \{x_i^t\}_{i=1}^{N_t}$ which contains all or some “known” categories and possible “unknown” categories, our goal is to label the target samples with either one of the $L_s$ labels or the “unknown” label. We train the model on $D_s \cup D_t$ and evaluate on $D_t$. We seek a truly universal method that can handle any possible category shift without prior knowledge of it. The key is not to force complete alignment between the entire source and target distributions, as this may result in catastrophic misalignment. Instead, the challenge is to extract well-clustered target features while performing a relaxed alignment to the source classes and potentially rejecting “unknown” points.
We adopt a prototype-based classifier that maps samples close to their true class centroid (prototype) and far from samples of other classes. We first propose to use self-supervision in the target domain to cluster target samples. We call this technique neighborhood clustering (NC). Each target point is aligned either to a “known” class prototype in the source or to its neighbor in the target. This allows the model to learn a discriminative metric that maps a point to its semantically close match, whether or not its class is “known”. This is achieved by minimizing the entropy of the distribution over point similarity.
Second, we propose an entropy separation loss to either align the target point with a source prototype or reject it as “unknown”. The loss is applied to the entropy of the “known” category classifier’s output to force it to be either low (the sample should belong to a “known” class) or high (the sample should be far from any “known” class). In addition, we utilize domain-specific batch normalization Chang et al. (2019); Li et al. (2016); Saito et al. (2019) to eliminate domain style information as a form of weak domain alignment.
3.1 Network Architecture
We adopt the architecture used in Saito et al. (2019), which has an L2 normalization layer before the last linear layer. We can regard the weight vectors in the last linear layer as prototype features of each class. This architecture is well-suited to our purpose of finding a clustering over both target features and source prototypes. Let $F_\theta$ be the feature extraction network which takes an input $x$ and outputs a feature vector $f$. Let $W$ be the classification network which consists of one linear layer without bias. The layer consists of weight vectors $[w_1, w_2, \ldots, w_K]$ where $K$ represents the number of classes in the source. $W$ takes L2 normalized features and outputs $K$ logits. $p$ denotes the output of $W$ after the softmax function.
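To make the role of the prototypes concrete, here is a minimal sketch of a bias-free linear classifier applied to an L2-normalized feature, so that each logit is a scaled cosine similarity to one class weight vector. It is not the authors' implementation: the function names, the toy 2-D prototypes, and the use of a temperature of 0.05 on the classifier logits are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_classifier(feature, prototypes, temperature=0.05):
    """Class probabilities from a bias-free linear layer on a unit-length feature.

    Each row of `prototypes` plays the role of one class weight vector.
    Dividing the logits by `temperature` is an assumption of this sketch.
    """
    f = l2_normalize(feature)
    logits = []
    for w in prototypes:
        w = l2_normalize(w)
        logits.append(sum(a * b for a, b in zip(w, f)) / temperature)
    return softmax(logits)


prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # K = 3 toy prototypes
probs = prototype_classifier([3.0, 0.5], prototypes)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted)  # index of the prototype with the highest cosine similarity
```

Because both the feature and the weight vectors have unit length, the prediction depends only on direction, which is what lets the weight vectors act as class centroids.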
3.2 Neighborhood Clustering (NC)
The principle behind our self-supervised clustering objective is to move each target point either to a “known” class prototype in the source or to its neighbor in the target. By making nearby points closer, the model learns well-clustered features. If “unknown” samples have similar characteristics with other “unknown” samples, then this clustering objective will help us extract discriminative features. This intuition is illustrated in Fig. 2. The important point is that we do not rely on strict distribution alignment with the source in order to extract discriminative target features. Instead we propose to minimize the entropy of each target point’s similarity distribution to other target samples and to prototypes. To minimize the entropy, the point will move closer to a nearby point (we assume a neighbor exists) or to a prototype. This approach is illustrated in Fig. 3.
Specifically, we calculate the similarity to all target samples and prototypes for each mini-batch of target features. Let $V \in \mathbb{R}^{N_t \times d}$ denote a memory bank which stores all target features and $F \in \mathbb{R}^{(N_t + K) \times d}$ denote the target features in the memory bank together with the prototype weight vectors, where $N_t$ is the number of target samples, $K$ is the number of source classes, and $d$ is the feature dimension in the last linear layer:
$$V = [V_1, V_2, \ldots, V_{N_t}], \tag{1}$$
$$F = [V_1, V_2, \ldots, V_{N_t}, w_1, w_2, \ldots, w_K], \tag{2}$$
where the $V_i$ and $w_k$ are L2-normalized. To consider target samples absent in the mini-batch, we employ a memory bank to store and use the features to calculate the similarity as done in Wu et al. (2018). In every iteration, $V$ is updated with the mini-batch features. Let $f_i$ denote features in the mini-batch and $B_t$ denote the set of target samples’ indices in the mini-batch. For all $i \in B_t$, we set
$$V_i = f_i. \tag{3}$$
Therefore, the memory bank contains both updated target features from the current mini-batch and the older target features absent in the mini-batch. Unlike Wu et al. (2018), we update the memory so that it simply stores features, without considering the momentum of features in previous epochs. Let $F_j$ denote the $j$-th item in $F$, then the probability that the feature $f_i$ is a neighbor of the feature or prototype $F_j$ is, for $i \neq j$ and $i \in B_t$,
$$p_{i,j} = \frac{\exp(F_j^\top f_i / \tau)}{Z_i}, \tag{4}$$
$$Z_i = \sum_{j=1, j \neq i}^{N_t + K} \exp(F_j^\top f_i / \tau), \tag{5}$$
and the temperature parameter $\tau$ controls the distribution concentration degree Hinton et al. (2015). Then, the entropy is calculated as follows,
$$L_{nc} = -\frac{1}{|B_t|} \sum_{i \in B_t} \sum_{j=1, j \neq i}^{N_t + K} p_{i,j} \log(p_{i,j}). \tag{6}$$
We minimize the above loss to align each target sample to either a target neighbor or a prototype, whichever is closer.
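The neighborhood clustering computation described in this section can be sketched in a few lines of plain Python. The sketch assumes that the similarity distribution is a temperature-scaled softmax over inner products between a mini-batch feature and every memory-bank entry and prototype except the sample's own entry, and that all vectors are already L2-normalized. The function name, argument layout, and toy vectors are hypothetical, and no gradients are computed here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neighborhood_clustering_loss(batch_feats, batch_idx, memory, prototypes, tau=0.05):
    """Mean entropy of each mini-batch feature's similarity distribution.

    All vectors are assumed to be L2-normalized already. `memory` is
    modified in place.
    """
    # Refresh the memory bank with the current mini-batch features (no momentum).
    for i, f in zip(batch_idx, batch_feats):
        memory[i] = list(f)

    candidates = memory + prototypes  # target features followed by prototypes
    total = 0.0
    for i, f in zip(batch_idx, batch_feats):
        # Scaled similarities to every candidate except the sample's own entry.
        sims = [dot(c, f) / tau for j, c in enumerate(candidates) if j != i]
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]
        z = sum(exps)
        probs = [e / z for e in exps]
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_idx)


memory = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
prototypes = [[1.0, 0.0], [0.0, -1.0]]
# Sample 0 coincides with a prototype, sample 2 sits between several candidates.
tight = neighborhood_clustering_loss([[1.0, 0.0]], [0], memory, prototypes)
loose = neighborhood_clustering_loss([[0.6, 0.8]], [2], memory, prototypes)
print(tight < loose)
```

In the toy data, the sample that coincides with a prototype has a sharply peaked distribution and therefore a lower entropy than the sample that lies between several candidates, which is the behavior the loss rewards.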
3.3 Entropy Separation Loss
The neighborhood clustering loss encourages the target samples to become well-clustered, but we still need to align some of them with “known” source categories while keeping the “unknown” target samples far from the source. In addition to the domain-specific batch normalization (see Sec. 3.4), which can work as a form of weak domain alignment, we need an explicit objective to encourage alignment or rejection of target samples. As pointed out in You et al. (2019), “unknown” target samples are likely to have a larger entropy of the source classifier’s output than “known” target samples. This is because “unknown” target samples do not share common features with “known” source classes.
Inspired by this, we propose to draw a boundary between “known” and “unknown” points using the entropy of a classifier’s output. We visually introduce the idea in Fig. 4. The distance between the entropy and threshold boundary, $\rho$, is defined as $|H(p) - \rho|$, where $p$ is the classification output for a target sample and $H(p)$ is its entropy. By maximizing the distance, we can make $H(p)$ far from $\rho$. We expect that the entropy of “unknown” target samples will be larger than $\rho$ whereas for the “known” ones it will be smaller. Tuning the parameter $\rho$ based on each adaptation setting requires a validation set. Instead, we define $\rho = \frac{\log(K)}{2}$, where $K$ is the number of source classes. Since $\log(K)$ is the maximum value of $H(p)$, we assume $\rho$ depends on it, and confirm that the defined value empirically works well. We perform an analysis of $\rho$ in the supplemental material. The above formulation assumes that “known” and “unknown” target samples can be separated with $\rho$. However, in many cases, the threshold can be ambiguous and can change due to domain shift. Therefore, we propose to introduce a confidence threshold parameter $m$ such that the final form of the loss is
$$L_{es}(p_i) = \begin{cases} -|H(p_i) - \rho| & \text{if } |H(p_i) - \rho| > m, \\ 0 & \text{otherwise,} \end{cases} \tag{7}$$
$$L_{es} = \frac{1}{|B_t|} \sum_{i \in B_t} L_{es}(p_i), \tag{8}$$
where $B_t$ is the set of target samples’ indices in the mini-batch.
The introduction of the confidence threshold $m$ allows us to give the separation loss only to confident samples. When $|H(p) - \rho|$ is sufficiently large, the network is confident about a decision of “known” or “unknown”. Thus, we train the network to make the sample far from the value $\rho$.
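A short numerical sketch may help show how the confidence threshold gates the entropy separation loss. It assumes the threshold is half of the maximum entropy, $\log(K)/2$, and a margin of 0.5. The function names are hypothetical, and the `is_unknown` rule (reject when the entropy exceeds the threshold) is an assumed test-time rule added for illustration, not something stated in this section.

```python
import math


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_separation_loss(batch_probs, margin=0.5):
    """Average of -|H(p) - rho| over samples whose distance exceeds `margin`."""
    num_classes = len(batch_probs[0])
    rho = math.log(num_classes) / 2.0  # assumed threshold: half the maximum entropy
    total = 0.0
    for probs in batch_probs:
        distance = abs(entropy(probs) - rho)
        if distance > margin:  # only confident samples contribute
            total += -distance
    return total / len(batch_probs)


def is_unknown(probs):
    """Hypothetical test-time rule: reject when the entropy exceeds rho."""
    return entropy(probs) > math.log(len(probs)) / 2.0


K = 10
confident_known = [0.91] + [0.01] * 9
ambiguous = [0.5, 0.5] + [0.0] * 8
uniform = [1.0 / K] * K
print(entropy_separation_loss([ambiguous]))            # 0.0, inside the margin
print(entropy_separation_loss([confident_known]) < 0)  # True
print(is_unknown(uniform), is_unknown(confident_known))
```

Minimizing the negative distance pushes confident samples further from the threshold in whichever direction they already lean, while ambiguous samples receive no gradient from this term.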
Domain specific batch normalization. To enhance alignment between source and target domain, we propose to utilize domain-specific batch normalization Chang et al. (2019); Li et al. (2016); Saito et al. (2019). The batch normalization layer whitens the feature activations, which contributes to a performance gain. As reported in Saito et al. (2019), simply splitting source and target samples into different mini-batches and forwarding them separately helps alignment. This kind of weak alignment matches our goal because strongly aligning feature distributions can harm the performance on non-closed set domain adaptation.
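The effect of forwarding source and target mini-batches separately can be illustrated without a deep learning framework. In the sketch below, each batch is normalized with its own per-dimension mean and variance, so a constant shift or rescaling that differs between domains is removed. The helper `batch_normalize` and the toy batches are hypothetical, and the learnable scale and shift of a real batch normalization layer are omitted.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize each feature dimension with the statistics of this batch only."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dim)]
    variances = [sum((row[d] - means[d]) ** 2 for row in batch) / n for d in range(dim)]
    return [
        [(row[d] - means[d]) / math.sqrt(variances[d] + eps) for d in range(dim)]
        for row in batch
    ]


source_batch = [[1.0, 10.0], [3.0, 14.0]]
target_batch = [[101.0, -5.0], [105.0, -1.0]]  # same structure, shifted "style"

# Forward the two domains separately so each uses its own statistics.
source_out = batch_normalize(source_batch)
target_out = batch_normalize(target_batch)
print(source_out)
print(target_out)
```

After normalization the two toy batches are nearly identical even though their raw values differ by a large offset, which is the weak form of alignment the paragraph refers to.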
|Method|CDA: Office|CDA: OH|CDA: VD|PDA: Office|PDA: OH|PDA: VD|ODA: Office|ODA: OH|ODA: VD|OPDA: Office|OPDA: OH|OPDA: VD|Avg Acc|Avg Rank|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Source Only|76.5|54.6|46.3|75.9|57.0|49.9|89.1|69.6|43.2|86.4|71.0|38.8|61.7|4.8 ± 1.2|
|DANN Ganin et al. (2016)|85.9|62.7|69.1|42.2|40.9|38.7|88.7|72.8|48.2|88.7|76.7|50.6|65.7|3.5 ± 1.7|
|ETN Cao (2019)| | | | | | | | | | | | | | |
|STA Liu et al. (2019)|73.6|44.7|48.1|69.8|47.9|48.2|89.9|69.3|48.8|89.8|72.6|47.4|61.2|4.5 ± 1.3|
|UAN You et al. (2019)|84.4|58.8|66.4|52.9|34.2|39.7|91.0|74.6|50.0|84.1|75.0|47.3|62.0|4.1 ± 1.3|
|DANCE (ours)|85.5|66.4|70.2|84.7|68.1|73.7|94.1|78.1|65.3|93.9|80.4|69.2|76.8|1.2 ± 0.4|
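Final Objective. The final objective is
$$L = L_{cls} + \lambda (L_{nc} + L_{es}), \tag{9}$$
where $L_{cls}$ denotes the cross-entropy loss on source samples, $L_{nc}$ the neighborhood clustering loss, and $L_{es}$ the entropy separation loss. The loss on source and target is calculated in a different mini-batch to achieve domain-specific batch normalization. To reduce the number of hyper-parameters, we used the same weighting hyper-parameter $\lambda$ for $L_{nc}$ and $L_{es}$.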
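As a small illustration of how the three terms combine, the sketch below adds the source cross-entropy to the two target losses with one shared weight. The function name and the loss values are made up for the example. The default weight of 0.05 is the value the implementation details report for the final objective.

```python
def total_loss(source_ce, nc_loss, es_loss, weight=0.05):
    """Source cross-entropy plus one shared weight on both target losses."""
    return source_ce + weight * (nc_loss + es_loss)


# Toy values: the entropy separation term is non-positive by construction.
loss = total_loss(source_ce=0.8, nc_loss=2.0, es_loss=-1.0)
print(loss)
```

Sharing one weight means the relative scale of the clustering and separation terms is fixed, which removes one hyper-parameter at the cost of some flexibility.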
The goal of the experiments is to compare DANCE with the baselines across all sub-cases of Universal DA (i.e., CDA, PDA, ODA, and OPDA) under the four object classification datasets and four settings for each dataset. We follow the settings of Long et al. (2018) for closed (CDA), Cao et al. (2018) for partial (PDA), Liu et al. (2019) for open-set (ODA), and You et al. (2019) for open-partial domain adaptation (OPDA) in our experiments.
Datasets. As the most prevalent benchmark dataset, we use Office Saenko et al. (2010), which has three domains (Amazon (A), DSLR (D), Webcam (W)) and 31 classes. The second benchmark dataset OfficeHome (OH) Venkateswara et al. (2017) contains four domains and 65 classes. The third dataset VisDA (VD) Peng et al. (2017) contains 12 classes from two domains: synthetic and real images. We provide an analysis of varying the number of classes using Caltech Griffin et al. (2007) and ImageNet Deng et al. (2009) because these datasets contain a large number of classes. Let $L_s$ denote the set of classes present in the source and $L_t$ the set of classes present in the target. Table 2 summarizes the number of classes in each setting. See supplementary material for details about each split, which follows the experimental settings of Long et al. (2018); Cao et al. (2018); Liu et al. (2019); You et al. (2019).
Class split ($|L_s \cap L_t|$ / $|L_s - L_t|$ / $|L_t - L_s|$) in each setting:
|Dataset|CDA|PDA|ODA|OPDA|
|---|---|---|---|---|
|Office|31 / 0 / 0|10 / 21 / 0|10 / 0 / 11|10 / 10 / 11|
|OH|65 / 0 / 0|25 / 40 / 0|15 / 0 / 50|10 / 5 / 50|
|VisDA|12 / 0 / 0|6 / 6 / 0|6 / 0 / 6|6 / 3 / 3|
Evaluation. We use the same evaluation metrics as the prior work whose setting we follow in each case. In CDA and PDA, we simply calculate the accuracy over all target samples. In ODA and OPDA, we average the accuracy over classes including “unknown”. For example, an average over 11 classes is reported in the Office ODA setting. We run each experiment three times and report the average result.
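The two metrics can differ noticeably when the “unknown” class is large, so a small sketch of both is given here. It treats all “unknown” target samples as one extra class, as in the 11-class Office ODA example. The function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of target samples labeled correctly (used for CDA and PDA)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_averaged_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, with "unknown" counted as one class."""
    correct, count = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        count[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / count[c] for c in count) / len(count)


y_true = ["mug", "mug", "bike", "unknown", "unknown", "unknown", "unknown"]
y_pred = ["mug", "bike", "bike", "unknown", "unknown", "unknown", "mug"]
print(overall_accuracy(y_true, y_pred))
print(class_averaged_accuracy(y_true, y_pred))  # mean of 0.5, 1.0 and 0.75
```

Averaging over classes keeps a model from scoring well simply by labeling most samples as “unknown” when unknown samples dominate the target set.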
Implementation Details. All experiments are implemented in PyTorch Paszke et al. (2017). We employ ResNet50 He et al. (2016) pre-trained on ImageNet Deng et al. (2009) as the feature extractor in all experiments. We remove the last linear layer of the network and add a new weight matrix to construct $W$. For baselines, we use their implementation. Hyper-parameters for each method are tuned on the “Amazon to DSLR” OPDA setting. We set $\lambda$ in Eq. 9 as 0.05 and $m$ in Eq. 7 as 0.5 for our method. For all comparisons, we use the same hyper-parameters, batch-size, learning rate, and checkpoint. The analysis of sensitivity to hyper-parameters is discussed in the supplementary.
Comparisons. We show two kinds of comparisons to provide better empirical insights. The first comparison is the universal comparison to the 5 baselines including state-of-the-art methods on CDA, PDA, ODA, and OPDA. As we assume that we do not have prior knowledge of the category shift in the target domain, all methods use fixed hyper-parameters, which are tuned on the “Amazon to DSLR” OPDA setting. The second comparison is the non-universal comparison. In addition to the 5 baselines, we report published state-of-the-art results on each setting and the results of DANCE tuned for each setting. Please note that the universal results should not be directly compared with the non-universal results, as the non-universal baselines are optimized for each setting with prior knowledge and do not have “unknown” example rejection in CDA and PDA. However, we can observe this gap in performance and analyze the performance of DANCE when optimized for each setting. See supplemental material for details of the optimization for each setting.
Universal comparison baselines:
Source-only (SO). The model is trained with source examples without using target samples. By comparing to this baseline, we can see how much gain we can obtain by performing adaptation.
Closed-set DA (CDA). Since this is the most popular setting of domain adaptation, we employ DANN Ganin and Lempitsky (2014), a standard approach of feature distribution matching between domains.
Partial DA (PDA). ETN Cao (2019) is the state-of-the-art method in PDA. This method utilizes importance weighting on source samples with adversarial learning.
Open-set DA (ODA). STA Liu et al. (2019) tries to align target “known” examples as well as rejecting “unknown” samples. This method assumes that there is a particular number of “unknown” samples and rejects them as “unknown”.
Open-Partial DA (OPDA). UAN You et al. (2019) tries to incorporate the value of entropy to reject “unknown” examples.
Overview (Table 1). As seen in Table 1, which summarizes the universal comparison, DANCE is the only method which improves the performance compared to SO, a model trained without adaptation, in all settings. In addition, DANCE performs the best on open-set and open-partial domain adaptation in all settings and on the partial domain adaptation setting for VisDA. Our average performance is much better than other baselines with respect to both accuracy and rank.
|Method||D to A||W to A||R to P||R to C||VisDA|
CDA (Table 3). DANCE significantly improves performance compared to the source-only model (SO), and shows comparable performance to some of the baseline methods. In the non-universal comparison, some baselines show much better performance. However, such methods designed for CDA fail in adaptation when there are “unknown” examples.
PDA (Table 4). DANCE significantly improves accuracy compared to SO and achieves a comparable performance to ETN, which is one of the state-of-the-art methods in PDA. Although ETN in the universal comparison shows better performance than DANCE, it does not perform well on ODA and OPDA. In the case of VisDA, DANCE outperforms all baselines. We found that if we do not utilize memory and do not use “unknown” example rejection (DANCE*), we achieve the best performance on Office-Home in the non-universal comparison. See our supplemental material for more detail.
ODA (Table 5). DANCE outperforms all the other baselines including the non-universal comparison. STA and UAN, which are designed for ODA and OPDA, achieve decent performance on these settings but show poor performance on some settings in CDA and PDA. One reason is that these methods assume that there is a particular number of “unknown” examples in the target domain and reject them as “unknown”.
OPDA (Table 6). The trend is similar to that of ODA. From the results of ODA and OPDA, we can see the importance of utilizing self-supervision in the target domain when there are “unknown” categories.
Feature Visualization. Fig. 5 shows the target feature visualization with t-SNE Maaten and Hinton (2008). We use the ODA setting of “DSLR to Amazon” on Office. The target “known” features (black plots) are well clustered with DANCE. In addition, most of the “unknown” features (the other colors) are kept far from “known” features and “unknown” examples in the same class are clustered together. Although we do not give supervision on the “unknown” classes, similar examples are clustered together. The visualization supports the results of the clustering performance (see below).
Evaluation on clustering of “unknown” examples. Here, we evaluate how well the learned features can cluster samples from “unknown” classes. First, we train a new linear classifier on top of the fixed features of “unknown” class samples. We use one labeled example per “unknown” category for training. Then, we evaluate the classification accuracy on the “unknown” samples. Since the learned feature is fixed, we can evaluate its own ability to cluster the samples. In this experiment, we employ the ODA setting. We use D to A and W to A of Office (11 “unknown” classes), R to P, R to C of OfficeHome (50 “unknown” classes) and VisDA (6 “unknown” classes). As we can see in Table 7, the features obtained by DANCE perform better than other methods. This result and the feature visualization indicate that the features learned by DANCE are better for clustering samples from “unknown” classes.
Analysis of Neighborhood Clustering (NC) and Entropy Separation (ES). Table 8 shows the ablation study of NC and ES, Eqs. 6 and 8, respectively. The experiments are done on CDA on Office. Using both NC and ES significantly improves the performance. These two are complementary and necessary for successful adaptation. Additionally, as shown in Fig. 5(a), the accuracy improves with the decrease of two losses as we expect.
Varying the number of “unknown” classes. We analyze the behavior of DANCE under different numbers of “unknown” classes. In this analysis, we use open set adaptation from Amazon in Office to Caltech, where there are 10 shared classes and many unshared classes.
Openness is defined as $1 - \frac{|L_s|}{|L_s \cup L_t|}$. $|L_s|$ corresponds to the shared 10 categories. We increased the number of “unknown” categories, i.e. $|L_t - L_s|$. Fig. 5(b) shows the accuracy of all classes whereas Fig. 5(c) shows area under the receiver operating characteristic curve on “unknown” classes. As we add more “unknown” classes, the performance of all methods decreases. However, DANCE consistently performs better than other methods and is robust to the number of “unknown” classes.
Varying the number of source private classes. We analyze the behavior under different numbers of source private classes in the OPDA setting. We varied the number of classes present only in the source (i.e., $|L_s - L_t|$). To conduct an extensive analysis, we use ImageNet-1K Deng et al. (2009) as the source domain and Caltech-256 as the target domain. They have 84 shared classes. We use all of the unshared classes of Caltech as “unknown” target while we increase the number of the classes of ImageNet (i.e., $|L_s - L_t|$). As we have more unshared source classes, the performance degrades, as seen in Fig. 5(d). However, DANCE consistently shows better performance. Since STA just tries to classify almost all target examples as “unknown,” the performance is significantly worse.
Ablation of memory. We evaluate how the memory features contribute to the performance in Table D. In this study, “w/o memory” indicates that we calculated similarity only within each mini-batch. Our observation is that memory helps when there are many target examples, but does not help much when there are few. When we have many classes and many unlabeled target examples (CDA on OH has 65 classes), it is clearly better to use memory. On the other hand, the model without memory performs better in OH PDA with 25 target classes. As shown in VisDA PDA (6 target classes), a smaller number of target classes does not degrade the performance with memory features. The main difference between OH and VisDA is the number of target examples (approx. 1,000 vs 30,000). As reported in He (2019), memory features do not necessarily improve the performance in self-supervised learning. We believe that improving the usage of a memory bank can further improve the performance.
In this paper, we introduce Domain Adaptive Neighborhood Clustering via Entropy optimization (DANCE), which performs well on universal domain adaptation. We propose two novel self-supervision based components, neighborhood clustering and entropy separation, which can handle arbitrary category shift. DANCE is the only model which outperforms the source-only model in all settings and the state-of-the-art baselines in many settings. In addition, we show that DANCE extracts discriminative feature representations for “unknown” class examples without any supervision on the target domain.
This work was supported by Honda, DARPA and NSF Award No. 1535797.
|Open Set||Open Partial|
|Method||A to W||D to W||W to D||A to D||D to A||W to A||A to W||D to W||W to D||A to D||D to A||W to A|
|w/o memory, w/o rej||56.2||84.6||87.1||74.0||74.8||82.4||77.5||56.5||86.7||80.4||60.2||84.1||75.4|
A Dataset Detail
In PDA, 10 classes in Caltech-256 are used as shared classes ($L_s \cap L_t$). The other 21 classes are used as source private classes ($L_s - L_t$). Since DSLR and Webcam do not have many examples, we conduct experiments on D to A, W to A, A to C (Caltech), D to C, and W to C shifts. In ODA, the same 10 classes are used as shared classes ($L_s \cap L_t$) and the selected 11 classes are used as unknown classes ($L_t - L_s$). The setting is the same as Saito et al. (2018c). In OPDA, the same 10 classes are used as shared classes ($L_s \cap L_t$) and then, in alphabetical order, the next 10 classes are used as source private classes ($L_s - L_t$), and the remaining 11 classes are used as unknown classes ($L_t - L_s$).
The second benchmark dataset is OfficeHome (OH) Venkateswara et al. (2017), which contains four domains and 65 classes. In PDA, in alphabetical order, the first 25 classes are selected as shared classes ($L_s \cap L_t$) and the remaining classes are source private classes ($L_s - L_t$). In ODA, the first 15 classes are used as shared classes ($L_s \cap L_t$) and the remaining classes are used as unknown classes ($L_t - L_s$). In OPDA, the first 10 classes are used as shared classes ($L_s \cap L_t$), the next 5 classes are source private classes ($L_s - L_t$) and the rest are unknown classes ($L_t - L_s$).
The third dataset is VisDA Peng et al. (2017), which contains 12 classes from the two domains, synthetic and real images. The synthetic domain consists of 152,397 synthetic 2D renderings of 3D objects and the real domain consists of 55,388 real images. In PDA, the first 6 classes are used as shared classes ($L_s \cap L_t$) and the rest are source private classes ($L_s - L_t$). In ODA, we follow Saito et al. (2018c) and use the 6 classes as shared classes ($L_s \cap L_t$) and the rest as unknown classes ($L_t - L_s$). In OPDA, the first 6 classes are shared classes ($L_s \cap L_t$), the next 3 are source private classes ($L_s - L_t$) and the other 3 classes are unknown classes ($L_t - L_s$).
We mainly perform experiments on these three datasets with four settings because it enables direct comparison with many state-of-the-art results. We provide an analysis of varying the number of classes using Caltech Griffin et al. (2007) and ImageNet Deng et al. (2009) because these datasets contain a large number of classes.
B Implementation Detail
We list the implementation details which are excluded from the main paper due to a limit of space.
DANCE (universal comparison). The batch-size is set as 36. The temperature parameter in Eq. 5 is set as 0.05 by following Saito et al. (2019). We train a model for 10,000 iterations with Nesterov momentum SGD and report the performance at the end of the iterations. The initial learning rate is set as , which is decayed with the factor of , where denotes the number of iterations and we set and . The learning rate of pre-trained layers is multiplied by . We follow Saito et al. (2019) for this scheduling method.
Baselines (universal comparison). We use the following released codes for ETN Cao (2019)(https://github.com/thuml/ETN), UAN You et al. (2019)(https://github.com/thuml/Universal-Domain-Adaptation), and STA Liu et al. (2019)(https://github.com/thuml/Separate_to_Adapt). We tune the hyper-parameter of these methods by validating the performance on OPDA, Amazon to DSLR, Office. Since we could not see improvements by changing the hyper-parameters from their codes, we employed the hyper-parameters provided in their codes. For ETN, we use the hyper-parameters for Office-Home. For UAN and STA, we use the hyper-parameters for Office. We implement DANN by ourselves and tuned the hyper-parameters by the performance on OPDA, Amazon to DSLR, Office. For all of these methods, we report the performance at the end of training for comparison. We observe that there is a gap in the performance between the best checkpoint and the final checkpoint. This can explain the gap between the reported performance in their paper and the performance in our universal comparisons.
Ours (non-universal comparison). In the non-universal comparison, we have prior knowledge about the class distribution in the target domain. For example, we know that there are no “unknown” samples in PDA and CDA. In these settings, we therefore do not need to perform “unknown” sample rejection in either the training or the testing phase. We will include the parameters used in this comparison when we release the code.
Baselines (non-universal comparison). We run experiments for STA (OfficeHome ODA) and ETN (A2C, W2C, D2C, PDA) since these results are not available in their papers. For STA, we tune the hyper-parameter that trades off the entropy term, which improves performance on average. For ETN, we report the performance obtained with the same hyper-parameters as in the universal comparison but without “unknown” sample rejection. For the other non-universal comparisons, we show the results reported in the original papers. “NA” indicates that the results are not available in the corresponding paper. We observe a gap between the performance in our universal comparison and the performance reported in each paper. For example, the performance of UAN in OPDA differs considerably between the universal comparison and the non-universal comparison (reported accuracy), although we use the same hyper-parameters. We could obtain performance similar to the reported numbers by picking the best checkpoint for each setting. However, for a fair comparison we report the performance of checkpoints at a fixed number of iterations, which can explain the gap.
C Supplemental Results
Detailed results of ODA and OPDA. Table A shows the detailed results of ODA and OPDA. OS* is the accuracy averaged over known classes, while OS is the averaged accuracy including the unknown class. DANCE performs well on both metrics. ETN achieves better OS* than DANCE in several scenarios. However, for ETN, OS* is much higher than OS, which means that ETN is poor at recognizing unknown samples as unknown. This is clearly shown in Figure 6 (c) in our main paper.
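The relationship between the two metrics can be made concrete with a small sketch. It assumes that both metrics are means of per-class accuracies and that the unknown category counts as one additional class in OS, as is common in open-set evaluation. The function names and the toy labels are hypothetical.

```python
def per_class_accuracy(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that are predicted as `label`.

    Assumes `label` occurs at least once in `y_true`.
    """
    preds = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds) / len(preds)


def open_set_scores(y_true, y_pred, known_classes, unknown_label):
    """Return (OS, OS*): mean per-class accuracy with and without the unknown class."""
    known_accs = [per_class_accuracy(y_true, y_pred, c) for c in known_classes]
    unknown_acc = per_class_accuracy(y_true, y_pred, unknown_label)
    os_star = sum(known_accs) / len(known_accs)
    os_all = (sum(known_accs) + unknown_acc) / (len(known_accs) + 1)
    return os_all, os_star


UNKNOWN = -1  # hypothetical label for the unknown category

# Toy example: known classes are recognized perfectly,
# but most unknown samples are assigned to a known class.
y_true = [0, 0, 1, 1, UNKNOWN, UNKNOWN, UNKNOWN, UNKNOWN]
y_pred = [0, 0, 1, 1, 0, 1, UNKNOWN, 0]

os_all, os_star = open_set_scores(y_true, y_pred, [0, 1], UNKNOWN)
print(f"OS = {os_all:.2f}, OS* = {os_star:.2f}")  # OS = 0.75, OS* = 1.00
```

In this toy case OS* is perfect while OS is pulled down by the unknown class, which is the pattern described above for a model that rarely rejects unknown samples.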
Comparison with Jigsaw Carlucci et al. (2019). Table E shows the comparison with jigsaw-puzzle-based self-supervised learning. To isolate the self-supervised learning part of DANCE, we replaced the neighborhood clustering loss with the jigsaw puzzle loss, computed on target samples. DANCE performs better in almost all settings, which confirms the effectiveness of clustering-based self-supervision for this task.
Results with standard deviations. Table B and show results of DANCE with standard deviations. In the main paper, we show only the accuracy averaged over three runs due to space limits; here we also report the standard deviations. DANCE shows reasonably small standard deviations.
Detailed results of non-universal comparison. Table D shows detailed results of the non-universal comparison in PDA OfficeHome and Table 9 in our main paper. As we explain in our main paper, ablating the memory features leads to a large change in performance in some settings. In addition, since the two settings do not include unknown target samples, ablating unknown sample rejection in the test phase improves performance (74.5 to 75.4 on average in Partial).
Sensitivity to hyper-parameters. In Fig. A, we show the sensitivity to hyper-parameters in the OPDA setting of Amazon to DSLR, which we used to tune the hyper-parameters. Although in Eq. 5 is decided based on the number of source classes, we show the behavior of our method when changing it in Fig. A(c). When we increase the value, more examples are classified as known, so the performance on unknown examples decreases.
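The effect of this value on rejection can be illustrated with a minimal sketch of entropy-based rejection. It assumes that a target sample is rejected as unknown when the entropy of its predicted class distribution exceeds a threshold, and that the default threshold is half of the maximum entropy, log(K) / 2 for K source classes. That specific form is an assumption for illustration; the text above only states that the value depends on the number of source classes. The names `predict_with_rejection` and `UNKNOWN` are hypothetical.

```python
import math

UNKNOWN = -1  # hypothetical label returned for rejected samples


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predict_with_rejection(probs, threshold):
    """Return the arg-max class, or UNKNOWN when the entropy exceeds the threshold."""
    if entropy(probs) > threshold:
        return UNKNOWN
    return max(range(len(probs)), key=probs.__getitem__)


num_source_classes = 4
# Assumed default: half of the maximum possible entropy log(K).
default_threshold = math.log(num_source_classes) / 2

confident = [0.97, 0.01, 0.01, 0.01]  # low entropy: kept as a known class
uncertain = [0.25, 0.25, 0.25, 0.25]  # maximum entropy: rejected by default

print(predict_with_rejection(confident, default_threshold))  # 0
print(predict_with_rejection(uncertain, default_threshold))  # -1

# With a larger threshold, the uncertain sample is accepted as known.
print(predict_with_rejection(uncertain, threshold=2.0))  # 0
```

Raising the threshold lets high-entropy samples through as known classes, which matches the drop in performance on unknown examples described above.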
References
- Open set domain adaptation. In ICCV, Cited by: §1.
- Partial adversarial domain adaptation. In ECCV, Cited by: §1, §2, §4.1, §4.1, Table 4.
- Learning to transfer examples for partial domain adaptation. In CVPR, Cited by: §2, §B, Table 1, §4.1.
- Domain generalization by solving jigsaw puzzles. In CVPR, Cited by: Table E, §1, §2, §C.
- Domain-specific batch normalization for unsupervised domain adaptation. In CVPR, Cited by: §3.4, §3.
- Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §1, §A, §4.1, §4.1, §4.3.
- Cluster alignment with a teacher for unsupervised domain adaptation. In ICCV, Cited by: §2.
- Unsupervised visual representation learning by context prediction. In ICCV, pp. 1422–1430. Cited by: §2.
- Unsupervised domain adaptation by backpropagation. In ICML, Cited by: §1, §2, §4.1.
- Domain-adversarial training of neural networks. JMLR 17 (59), pp. 1–35. Cited by: Table 1.
- Caltech-256 object category dataset. Cited by: §A, §4.1.
- Associative domain adaptation. In ICCV, Cited by: §2.
- Mask r-cnn. In ICCV, Cited by: §1.
- Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
- Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §4.3.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.2.
- Unsupervised deep learning by neighbourhood discovery. In ICML, Cited by: §2.
- Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
- Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779. Cited by: §3.4, §3.
- Separate to adapt: open set domain adaptation via progressive separation. In CVPR, Cited by: §2, §B, Table 1, §4.1, §4.1, §4.1, Table 5.
- Learning transferable features with deep adaptation networks. In ICML, Cited by: §1, §2.
- Conditional adversarial domain adaptation. In NIPS, Cited by: §2, §4.1, §4.1, Table 3.
- Visualizing data using t-sne. JMLR 9 (11), pp. 2579–2605. Cited by: Figure 5, §4.3.
- Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: Table E, §2.
- Open set domain adaptation. In ICCV, Cited by: §2.
- Automatic differentiation in pytorch. Cited by: §4.1.
- Visda: the visual domain adaptation challenge. arXiv preprint arXiv:1710.06924. Cited by: §A, §4.1.
- Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1.
- Adapting visual category models to new domains. In ECCV, Cited by: §4.1.
- Semi-supervised domain adaptation via minimax entropy. In ICCV, Cited by: §B, §3.1, §3.4, §3.
- Adversarial dropout regularization. In ICLR, Cited by: §2.
- Asymmetric tri-training for unsupervised domain adaptation. In ICML, Cited by: §2.
- Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, Cited by: §2.
- Open set domain adaptation by backpropagation. In ECCV, Cited by: §1, §A, §2.
- Learning transferrable representations for unsupervised domain adaptation. In NIPS, Cited by: §2.
- Very deep convolutional networks for large-scale image recognition. arXiv. Cited by: §1.
- Return of frustratingly easy domain adaptation. In AAAI, Cited by: §1.
- Deep domain confusion: maximizing for domain invariance. arXiv. Cited by: §2.
- Deep hashing network for unsupervised domain adaptation. In CVPR, Cited by: §A, §4.1.
- Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: §2, §3.2.
- Larger norm more transferable: an adaptive feature norm approach for unsupervised domain adaptation. In ICCV, Cited by: Table 3, Table 4.
- Universal domain adaptation. In CVPR, Cited by: §1, §2, §B, §3.3, Table 1, §4.1, §4.1, §4.1, Table 6.
- Importance weighted adversarial nets for partial domain adaptation. In CVPR, Cited by: §2.
- Bridging theory and algorithm for domain adaptation. In ICML, Cited by: §2, Table 3.
- Local aggregation for unsupervised learning of visual embeddings. In ICCV, Cited by: §2.
- Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, Cited by: §2.