Discriminative Adversarial Domain Adaptation
Given labeled instances on a source domain and unlabeled ones on a target domain, unsupervised domain adaptation aims to learn a task classifier that classifies target instances well. Recent advances rely on domain-adversarial training of deep networks to learn domain-invariant features. However, due to an issue of mode collapse induced by the separate design of task and domain classifiers, these methods are limited in aligning the joint distributions of feature and category across domains. To overcome this limitation, we propose a novel adversarial learning method termed Discriminative Adversarial Domain Adaptation (DADA). Based on an integrated category and domain classifier, DADA has a novel adversarial objective that encourages a mutually inhibitory relation between category and domain predictions for any input instance. We show that under practical conditions, it defines a minimax game that can promote the joint distribution alignment. In addition to the traditional closed set domain adaptation, we also extend DADA to the extremely challenging problem settings of partial and open set domain adaptation. Experiments show the efficacy of our proposed methods, and we achieve the new state of the art for all three settings on benchmark datasets.
Many machine learning tasks are advanced by large-scale learning of deep models, with image classification as one of the prominent examples. A key factor in achieving such advancements is the availability of massive labeled data on the domains of the tasks of interest. For many other tasks, however, training instances on the corresponding domains are either difficult to collect or prohibitively costly to label. To address the scarcity of labeled data for these target tasks/domains, a general strategy is to leverage the massively available labeled data on related source ones via domain adaptation. Even when the source and target tasks share the same label space (i.e. closed set domain adaptation), domain adaptation still suffers from the shift in data distributions. The main objective of domain adaptation is thus to learn domain-invariant features, so that task classifiers learned from the source data can be readily applied to the target domain. In this work, we focus on the unsupervised setting where training instances on the target domain are completely unlabeled.
Recent domain adaptation methods are largely built on modern deep architectures. They rely on the great model capacity of these networks to learn hierarchical features that are empirically shown to be more transferable across domains [65, 69]. Among them, those based on domain-adversarial training [13, 62] achieve the current state of the art. Based on the seminal work of DANN, they typically augment a classification network with an additional domain classifier. The domain classifier takes features from the feature extractor of the classification network as inputs and is trained to differentiate between instances from the two domains. By playing a minimax game, adversarial training aims to learn domain-invariant features.
Such domain-adversarial networks can largely reduce the domain discrepancy. However, the separate design of task and domain classifiers has the following shortcomings. Firstly, feature distributions can only be aligned to a certain level, since the model capacity of the feature extractor could be large enough to compensate for the less aligned feature distributions. More importantly, given the practical difficulties of aligning the source and target distributions with high granularity to the category level (especially for complex distributions with multi-mode structures), the task classifier obtained by minimizing the empirical source risk cannot generalize well to the target data due to an issue of mode collapse [27, 56], i.e., the joint distributions of feature and category are not well aligned across the source and target domains.
Recent methods [27, 56] take the first step to address the above shortcomings by jointly parameterizing the task and domain classifiers into an integrated one. To push this line further, based on such a classifier, we propose a novel adversarial learning method termed Discriminative Adversarial Domain Adaptation (DADA), which encourages a mutually inhibitory relation between its domain prediction and category prediction for any input instance, as illustrated in Figure 1. This discriminative interaction between category and domain predictions underlies the ability of DADA to reduce domain discrepancy at both the feature and category levels. Intuitively, the adversarial training of DADA mainly conducts competition between the domain neuron (output) and the true category neuron (output). Different from prior work whose mechanism to align the joint distributions is rather implicit, DADA enables explicit alignment between the joint distributions, thus improving the classification of target data. Besides closed set domain adaptation, we also extend DADA to partial domain adaptation, i.e. the target label space is subsumed by the source one, and open set domain adaptation, i.e. the source label space is subsumed by the target one. Our main contributions can be summarized as follows.
We propose in this work a novel adversarial learning method, termed DADA, for closed set domain adaptation. Based on an integrated category and domain classifier, DADA has a novel adversarial objective that encourages a mutually inhibitory relation between category and domain predictions for any input instance, which can promote the joint distribution alignment across domains.
For more realistic partial domain adaptation, we extend DADA by a reliable category-level weighting mechanism, termed DADA-P, which can significantly reduce the negative influence of outlier source instances.
For more challenging open set domain adaptation, we extend DADA by balancing the joint distribution alignment in the shared label space with the classification of outlier target instances, termed DADA-O.
Experiments show the efficacy of our proposed methods, and we achieve the new state of the art for all three adaptation settings on benchmark datasets.
Closed Set Domain Adaptation After the seminal work of DANN, ADDA proposes an untied weight sharing strategy to align the target feature distribution to a fixed source one. SimNet replaces the standard FC-based cross-entropy classifier by a similarity-based one. MADA and CDAN integrate the discriminative category information into domain-adversarial training. VADA reduces the cluster assumption violation to constrain domain-adversarial training. Some methods [62, 63] focus on transferable regions to learn domain-invariant features and task classifier. TAT enhances the discriminability of features to guarantee the adaptability. Some methods [51, 50, 29] utilize category predictions from two task classifiers to measure the domain discrepancy. The works most related to ours [27, 56] propose joint parameterization of the task and domain classifiers, which implicitly aligns the joint distributions. In contrast, our proposed DADA makes the joint distribution alignment more explicit, thus promoting classification on the target domain.
Partial Domain Adaptation The work weights each source instance by its importance to the target domain based on one domain classifier, and then trains another domain classifier on target and weighted source instances. The works [5, 6] reduce the contribution of outlier source instances to the task or domain classifiers by utilizing category predictions. Differently, DADA-P weights the proposed source discriminative adversarial loss by a reliable category confidence.
Open Set Domain Adaptation Previous research proposes to reject an instance as the unknown category by threshold filtering. The work proposes to utilize adversarial training for both domain adaptation and unknown outlier detection. Differently, DADA-O balances the joint distribution alignment in the shared label space with the outlier rejection.
Given labeled instances sampled from the source domain and unlabeled instances sampled from the target domain, the objective of unsupervised domain adaptation is to learn a feature extractor and a task classifier such that the expected target risk is low for a certain classification loss function. The two domains are assumed to have different distributions. To achieve a low target risk, a typical strategy is to learn the feature extractor and the task classifier by minimizing the sum of the source risk and some notion of distance between the source and target domain distributions, inspired by domain adaptation theories [2, 1]. This strategy is based on the simple rationale that the source risk becomes a good indicator of the target risk as the distance between the two distributions shrinks. While most existing methods use distance measures based on the marginal distributions, it is arguably better to use those based on the joint distributions.
The above strategy is generally implemented by domain-adversarial learning [13, 62], where a separate task classifier and domain classifier are typically stacked on top of the feature extractor. As discussed before, this type of design has the following shortcomings: (1) the model capacity of the feature extractor could be large enough to make source and target features hardly differentiable for any instance, even though the marginal feature distributions are not well aligned; (2) more importantly, it is difficult to align the source and target distributions with high granularity to the category level (especially for complex distributions with multi-mode structures), and thus the task classifier obtained by minimizing the empirical source risk cannot perfectly generalize to the target data due to an issue of mode collapse, i.e. the joint distributions are not well aligned.
To alleviate the above shortcomings, inspired by semi-supervised learning methods based on GANs [53, 12], recent work proposes joint parameterization of the task and domain classifiers into an integrated one. Supposing the classification task of interest has a given number of categories, the integrated classifier is formed simply by augmenting the last FC layer of the task classifier with one additional neuron.
Denote as the output vector of class probabilities of for an instance , and , , as its element. The element of the conditional probability vector is written as follows
For ease of subsequent notations, we also write and . Then, such a network is trained by the classification-aware adversarial learning objective
where balances category classification and domain adversarial losses. The mechanism of this objective to align the joint distributions across domains is rather implicit.
To make it more explicit, based on the integrated classifier , we propose a novel adversarial learning method termed Discriminative Adversarial Domain Adaptation (DADA), which explicitly enables a discriminative interplay of predictions among the domain and categories for any input instance, as illustrated in Figure 1. This discriminative interaction underlies the ability of DADA to promote the joint distribution alignment, as explained shortly.
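As a rough illustration of the integrated classifier, the sketch below assumes that the last output of the classifier is the additional domain neuron and that conditional category probabilities are obtained by renormalizing the category outputs by one minus the domain output. The function names, the output ordering, and the renormalization are assumptions made for this example; they are not taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_predictions(scores):
    """Split the outputs of an integrated classifier.

    Assumption: `scores` holds one raw score per category followed by
    one extra score for the domain neuron (last position).
    Returns (conditional category probabilities, domain probability).
    """
    probs = softmax(scores)
    domain_prob = probs[-1]
    category_probs = probs[:-1]
    # Condition on "not the domain output": renormalize the category part.
    cond = [p / (1.0 - domain_prob) for p in category_probs]
    return cond, domain_prob

# Three categories plus one domain neuron (hypothetical scores).
cond, dom = split_predictions([2.0, 0.5, -1.0, 0.0])
```

Under these assumptions the conditional probabilities form a proper distribution over the categories, whatever value the domain output takes.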
Discriminative Adversarial Learning
To establish a direct interaction between category and domain predictions, we propose a novel source discriminative adversarial loss that is tailored to the design of the integrated classifier . The proposed loss is inspired by the principle of binary cross-entropy loss. It is written as
Intuitively, the proposed loss (3) establishes a mutually inhibitory relation between of the prediction on the true category of , and of the prediction on the domain of . We first discuss how the proposed loss (3) works during adversarial training, and we show that under practical conditions, minimizing (3) over the classifier has the effects of discriminating among task categories while distinguishing the source domain from the target one, and maximizing (3) over the feature extractor can discriminatively align the source domain to the target one.
Discussion We first write the gradient formulas of on any source instance w.r.t. and as
Since both and are among the output probabilities of the classifier , we always have and , suggesting . When the loss (3) is minimized over via stochastic gradient descent (SGD), we have the update where is the learning rate, and since , increases; when it is maximized over via stochastic gradient ascent (SGA), we have the update , and since , decreases. Then, we discuss the change of in two cases: (1) in case of that guarantees , when minimizing the loss (3) over by SGD update , we have decreased , and when maximizing it over by SGA update , we have increased ; (2) in case of that guarantees , when minimizing the loss (3) over by SGD update, we have increased , and when maximizing it over by SGA update, we have decreased , as shown in Figure 2.
For discriminative adversarial domain adaptation, we expect that (1) when minimizing the proposed loss (3) over the classifier, task categories of the source domain are discriminative and the source domain is distinctive from the target one, which can be achieved when the prediction on the true category increases and the prediction on the domain decreases; (2) when maximizing it over the feature extractor, the source domain is aligned to the target one while retaining discriminability, which can be achieved when the prediction on the true category decreases and the prediction on the domain increases, as in case (1) above. To meet these expectations, the condition of case (1) should always be satisfied for all source instances. This is practically achieved by pre-training DADA on the labeled source data using a multi-way cross-entropy loss over the task categories, and maintaining the same supervision signal in the adversarial training of DADA. We present in the supplemental material empirical evidence on benchmark datasets that shows the efficacy of the scheme we use.
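The sign analysis in this discussion can be checked numerically. The sketch below assumes a binary cross-entropy style form of the source loss in which the domain prediction acts as a soft label for the true-category prediction. This form, the 0.5 threshold that follows from it, and all names are assumptions made for illustration; the block is not a reproduction of loss (3).

```python
import math

def source_disc_adv_loss(p_true, p_dom, eps=1e-12):
    """Assumed BCE-style loss for one source instance.

    p_true: predicted probability on the true category.
    p_dom:  predicted probability on the domain neuron.
    """
    return -((1.0 - p_dom) * math.log(p_true + eps)
             + p_dom * math.log(1.0 - p_true + eps))

def grad_wrt_p_true(p_true, p_dom):
    """Analytic derivative of the assumed loss w.r.t. p_true."""
    return -(1.0 - p_dom) / p_true + p_dom / (1.0 - p_true)

def grad_wrt_p_dom(p_true):
    """Analytic derivative of the assumed loss w.r.t. p_dom."""
    return math.log(p_true) - math.log(1.0 - p_true)

def finite_diff(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Both values come from one softmax, so p_true + p_dom <= 1.
g_true = grad_wrt_p_true(0.7, 0.2)   # non-positive
g_dom_hi = grad_wrt_p_dom(0.7)       # positive when p_true > 0.5
g_dom_lo = grad_wrt_p_dom(0.3)       # negative when p_true < 0.5
```

With this assumed form, a descent step on the classifier raises the true-category prediction, and the direction in which the domain prediction moves flips depending on whether the true-category prediction is above or below 0.5.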
To achieve the joint distribution alignment, the explicit interplay between category and domain predictions for any target instance should also be created. Motivated by recent works [44, 34] which alleviate the issue of mode collapse by aligning each instance to several most related categories, we propose a target discriminative adversarial loss based on the design of the integrated classifier , by using the conditional category probabilities to weight the domain predictions. It is written as
where the element of the domain prediction vector for the category is written as follows
An intuitive explanation for our proposed (4) is provided in the supplemental material.
Established knowledge from cluster analysis  indicates that we can estimate clusters with a low probability of error only if the conditional entropy is small. To this end, we adopt the entropy minimization principle , which is written as
where a hyper-parameter trades off the adversarial domain adaptation objective against the entropy minimization one in the unified optimization problem. Note that in the minimization problem of (7), the entropy minimization loss serves as a regularizer for learning the classifier to avoid the trivial solution (i.e. all instances are assigned to the same category), and in the maximization problem of (7), it helps learn more target-discriminative features, which can alleviate the negative effect of adversarial feature adaptation on the adaptability.
By optimizing (7), the joint distribution alignment can be enhanced. This ability comes from the better use of discriminative information from both the source and target domains. Concretely, DADA constrains the domain classifier so that it clearly/explicitly knows the classification boundary, thus reducing false alignment between different categories. By deceiving such a strong domain classifier, DADA can learn a feature extractor that better aligns the two domains. We also theoretically prove in the supplemental material that DADA can better bound the expected target error.
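For readers unfamiliar with the entropy minimization principle used in this section, the following sketch computes the average Shannon entropy of a batch of predicted probability vectors. Which probability vector the paper feeds into this term is not recoverable from the text here, so the inputs below are hypothetical, as are the function names.

```python
import math

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)

def mean_entropy(batch):
    """Average entropy over a batch of probability vectors."""
    return sum(entropy(p) for p in batch) / len(batch)

# A confident prediction has low entropy; a uniform one has the maximum.
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
batch_value = mean_entropy([confident, uniform])
```

Minimizing this quantity on unlabeled target instances pushes predictions towards confident, low-entropy outputs.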
Extension for Partial Domain Adaptation
Partial domain adaptation is a more realistic setting, where the target label space is subsumed by the source one. The false alignment between the outlier source categories and the target domain is unavoidable. To address it, existing methods [5, 66, 6] utilize the category or domain predictions, to decrease the contribution of source outliers to the training of task or domain classifiers. Inspired by these ideas, we extend DADA for partial domain adaptation by using a reliable category-level weighting mechanism, which is termed DADA-P.
Concretely, we average the conditional probability vectors over all target data and then normalize the averaged vector by dividing its largest element. The category weight vector with as its element is derived by a convex combination of the normalized vector and an all-ones vector , as follows
where is to suppress the detection noise of outlier source categories in the early stage of training. Then, we apply the category weight vector to the proposed discriminative adversarial loss for any source instance, leading to
Since predicted probabilities on the outlier source categories are more likely to increase when minimizing over , negative transfer can be incurred. To avoid it, we minimize over and the objective of DADA-P is
By optimizing it, DADA-P can simultaneously alleviate negative transfer and promote the joint distribution alignment across domains in the shared label space.
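The category-level weighting described in this section can be sketched as follows: average the conditional probability vectors over the target data, divide by the largest element, and mix the result with an all-ones vector. The name `mix` and the side of the convex combination it multiplies are assumptions made for this example; the text only states that the coefficient suppresses detection noise early in training.

```python
def category_weights(target_cond_probs, mix):
    """Category weight vector for partial domain adaptation (sketch).

    target_cond_probs: list of conditional probability vectors,
                       one per target instance.
    mix: coefficient in [0, 1]; 0 gives all-ones weights,
         1 gives the normalized average prediction.
    """
    n = len(target_cond_probs)
    k = len(target_cond_probs[0])
    # Average prediction per category over all target instances.
    avg = [sum(row[j] for row in target_cond_probs) / n for j in range(k)]
    # Normalize so that the most likely category gets weight 1.
    top = max(avg)
    normalized = [a / top for a in avg]
    # Convex combination with an all-ones vector.
    return [mix * v + (1.0 - mix) * 1.0 for v in normalized]

# Hypothetical target predictions that never select the third category.
probs = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]]
early = category_weights(probs, 0.0)   # all categories weighted equally
late = category_weights(probs, 1.0)    # outlier category weighted down
```

In this toy case the third category, which no target instance is assigned to, ends up with zero weight once the mixing coefficient reaches 1.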
Extension for Open Set Domain Adaptation
Open set domain adaptation is a very challenging setting, where the source label space is subsumed by the target one. We denominate the shared category and all unshared categories between the two domains as the “known category” and “unknown category” respectively. The goal of open set domain adaptation is to correctly classify any target instance as the known or unknown category. The false alignment between the known and unknown categories is inevitable. To this end, the work  proposes to make a pseudo decision boundary for the unknown category, which enables the feature extractor to reject some target instances as outliers. Inspired by this work, we extend DADA for open set domain adaptation by training the classifier to classify all target instances as the unknown category with a small probability , which is termed DADA-O. Assuming the predicted probability on the unknown category as the element of , i.e., , the modified target adversarial loss when minimized over the integrated classifier is
where . When maximized over the feature extractor , we still use the discriminative loss in (4). Replacing in (7) with (11) gives the overall adversarial objective of DADA-O, which can achieve a balance between domain adaptation and outlier rejection.
We utilize all target instances to obtain the concept of “unknown”, which is very helpful for the classification of unknown target instances as the unknown category but can cause the misclassification of known target instances as the unknown category. This issue can be alleviated by selecting an appropriate . If is too small, the unknown target instances cannot be correctly classified; if is too large, the known target instances can be misclassified. By choosing an appropriate , the feature extractor can separate the unknown target instances from the known ones while aligning the joint distributions in the shared label space.
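The effect of training the classifier to output the unknown category with a fixed small probability can be illustrated with a binary cross-entropy against a constant soft target. The form below follows the pseudo decision boundary idea referenced in this section; whether it matches the paper's modified target loss term by term cannot be confirmed from the text here, so treat it as an assumption. The names `unknown_bce` and `t` are hypothetical.

```python
import math

def unknown_bce(p_unknown, t, eps=1e-12):
    """Binary cross-entropy between the predicted unknown-category
    probability and a constant soft target t (assumed form)."""
    return -(t * math.log(p_unknown + eps)
             + (1.0 - t) * math.log(1.0 - p_unknown + eps))

def best_probability(t, steps=100):
    """Grid search for the unknown probability that minimizes the loss."""
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda p: unknown_bce(p, t))

# The classifier is pulled towards predicting "unknown" with probability t.
p_star = best_probability(0.3)
```

Because the minimizer sits at the chosen target value, a smaller target makes the classifier more reluctant to label target instances as unknown, and a larger one makes it more eager, which matches the trade-off discussed above.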
Datasets and Implementation Details
Office-31  is a popular benchmark domain adaptation dataset consisting of images of categories collected from three domains: Amazon (A), Webcam (W), and DSLR (D). We evaluate on six settings.
Syn2Real is the largest benchmark. Syn2Real-C has over images of 12 shared categories in the combined training, validation, and testing domains. The images on the training domain are synthetic ones generated by rendering 3D models. The validation and test domains comprise real images, and the validation one has images. We use the training domain as the source domain and the validation one as the target domain. For partial domain adaptation, we choose images of the first 6 categories (in alphabetical order) in the validation domain as the target domain and form the setting: Synthetic 12 → Real 6. For open set domain adaptation, we evaluate on Syn2Real-O, which includes two domains. The training/synthetic domain uses synthetic images from the categories of Syn2Real-C as “known”. The validation/real domain uses images of the categories from the validation domain of Syn2Real-C as “known”, and k images from other categories as “unknown”. We use the training and validation domains of Syn2Real-O as the source and target domains respectively.
Implementation Details We follow standard evaluation protocols for unsupervised domain adaptation [13, 62]: we use all labeled source and all unlabeled target instances as the training data. For all tasks of Office-31 and Synthetic 12 → Real 6, based on ResNet-50, we report the classification result on the target domain as mean ± standard deviation over three random trials. For other tasks of Syn2Real, we evaluate the accuracy of each category based on ResNet-101 and ResNet-152 (for closed and open set domain adaptation respectively). For each base network, we use all its layers up to the second last one as the feature extractor , and set the neuron number of its last FC layer as to have the integrated classifier . As an exception, we follow prior work and replace the last FC layer of ResNet-152 with three FC layers of 512 neurons. All base networks are pre-trained on ImageNet. We first pre-train them on the labeled source data, and then fine-tune them on both the labeled source data and unlabeled target data via adversarial training, where we maintain the same supervision signal as in the pre-training.
We follow DANN  to use the SGD training schedule: the learning rate is adjusted by , where denotes the process of training iterations that is normalized to be in , and we set , , and ; the hyper-parameter is initialized at and is gradually increased to by , where we set . We empirically set . We implement all our methods by PyTorch. The code will be available at https://github.com/huitangtang/DADA-AAAI2020.
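The DANN-style schedule mentioned above can be sketched as below: the learning rate decays polynomially with training progress, and the adaptation weight ramps up along a shifted sigmoid. The default constants are those commonly quoted from the DANN paper and are assumptions here; the values actually used in this work are not recoverable from the text above.

```python
import math

def dann_learning_rate(progress, eta0=0.01, alpha=10.0, beta=0.75):
    """Learning rate at normalized training progress in [0, 1].

    eta0, alpha, beta default to DANN's published values (assumed here).
    """
    return eta0 / (1.0 + alpha * progress) ** beta

def dann_adaptation_weight(progress, gamma=10.0):
    """Weight that grows smoothly from 0 towards 1 as training proceeds."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

# Sample the schedule at a few points of training.
schedule = [(p / 4, dann_learning_rate(p / 4), dann_adaptation_weight(p / 4))
            for p in range(5)]
```

Starting the adaptation weight at zero keeps the noisy adversarial signal from dominating early training, which is the motivation usually given for this ramp.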
|Methods||A → W||D → W||W → D||A → D||D → A||W → A||Avg|
|No Adaptation||79.9±0.3||96.8±0.4||99.5±0.1||84.1±0.4||64.5±0.3||66.4±0.4||81.9|
|DADA (w/o em + w/o td)||91.0±0.2||98.7±0.1||100.0±0.0||90.8±0.2||70.9±0.3||70.2±0.3||86.9|
Ablation Study We conduct ablation studies on Office-31 to investigate the effects of key components of our proposed DADA based on ResNet-50. Our ablation studies start with the basic baseline termed “No Adaptation” that simply fine-tunes a ResNet-50 on the source data. To validate the mutually inhibitory relation enabled by DADA, we use DANN and DANN-CA respectively as the second and third baselines. To investigate how the entropy minimization principle helps learn more target-discriminative features, we remove the entropy minimization loss (6) from our main minimax problem (7), denoted as “DADA (w/o em)”. To examine the effects of the proposed source and target discriminative adversarial losses (3) and (4), we remove both (6) and (4) from (7), denoted as “DADA (w/o em + w/o td)”.
Results in Table 1 show that although DANN improves over “No Adaptation”, its result is much worse than DANN-CA, verifying the efficacy of the design of the integrated classifier . “DADA (w/o em + w/o td)” improves over DANN-CA and “DADA (w/o em)” improves over “DADA (w/o em + w/o td)”, showing the efficacy of our proposed discriminative adversarial learning. DADA significantly outperforms DANN and DANN-CA, confirming the efficacy of the proposed mutually inhibitory relation between the category and domain predictions in aligning the joint distributions of feature and category across domains. Table 1 also confirms that entropy minimization is helpful to learn more target-discriminative features.
Quantitative Comparison To compare the efficacy of different methods in reducing domain discrepancy at the category level, we visualize the average probability on the true category over all target instances by task classifiers of No Adaptation, DANN, DANN-CA, and DADA on A → W in Figure 3. Note that here we use labels of the target data for the quantification of category-level domain discrepancy. Figure 3 shows that our proposed DADA gives the predicted probability on the true category of any target instance a better chance to approach 1, meaning that target instances are more likely to be correctly classified by DADA, i.e., a better category-level domain alignment.
Closed Set Domain Adaptation We compare in Tables 2 and 3 our proposed method with existing ones on Office-31 and Syn2Real-C based on ResNet-50 and ResNet-101 respectively. Whenever available, results of existing methods are quoted from their respective papers or the recent works [44, 34, 31, 51]. Our proposed DADA outperforms existing methods, testifying to the efficacy of DADA in aligning the joint distributions of feature and category across domains.
Partial Domain Adaptation We compare in Table 5 our proposed method to existing ones on Syn2Real-C based on ResNet-50. Results of existing methods are quoted from prior work. Our proposed DADA-P substantially outperforms all comparative methods by , showing the effectiveness of DADA-P in reducing the negative influence of source outliers while promoting the joint distribution alignment in the shared label space.
Open Set Domain Adaptation We compare in Table 4 our proposed method with existing ones on Syn2Real-O based on ResNet-152. Results of existing methods are quoted from the recent work . Our proposed DADA-O outperforms all comparative methods in both evaluation metrics of Known and Mean, showing the efficacy of DADA-O in both aligning joint distributions of the known instances and identifying the unknown target instances. It is noteworthy that DADA-O improves over the state-of-the-art method AODA by a large margin when the known-to-unknown ratio in the target domain is much smaller than , i.e. the false alignment between the known source and unknown target instances will be much more serious. This observation confirms the efficacy of DADA-O.
We provide more results and analysis for the three problem settings in the supplemental material.
We propose a novel adversarial learning method termed Discriminative Adversarial Domain Adaptation (DADA) to overcome the limitation in aligning the joint distributions of feature and category across domains, which is due to an issue of mode collapse induced by the separate design of task and domain classifiers. Based on an integrated task and domain classifier, DADA has a novel adversarial objective that encourages a mutually inhibitory relation between the category and domain predictions, which can promote the joint distribution alignment. Unlike previous methods, DADA explicitly enables a discriminative interaction between category and domain predictions. Besides closed set domain adaptation, we also extend DADA to the more challenging problem settings of partial and open set domain adaptation. Experiments on benchmark datasets testify to the efficacy of our proposed methods for all three settings.
This work is supported in part by the National Natural Science Foundation of China (Grant No.: 61771201), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No.: 2017ZT07X183), and the Guangdong R&D key project of China (Grant No.: 2019B010155001).
We provide an intuitive explanation for our proposed loss (4) in Section A. We theoretically prove in Section B that our proposed method can better bound the expected target error than existing ones. We provide more results and analysis on benchmark datasets of Digits, Office-31, Office-Home, and ImageNet-Caltech for closed set, partial, and open set domain adaptation in Section C. We present in Section D empirical evidence on benchmark datasets of digits that shows the efficacy of the training scheme we use. We will release the code soon.
Appendix A Intuitive Explanation for Our Proposed Loss (4)
We denote the output vector of class scores of before the final softmax operation for an instance as , and its element as , . We denote the output vector of class probabilities of after the final softmax operation for an instance as , and its element as , . We write , as
We always have for any instance . When maximized over the feature extractor , the adversarial loss on an unlabeled target instance (cf. objective (2) in Section Discriminative Adversarial Domain Adaptation in the paper) is written as
We write the gradient formulas of w.r.t. , as
where , differ in the term of , meaning that they are proportional to the class scores of . In other words, the higher the class score is (i.e., the higher the class probability is), the stronger gradient the corresponding category neuron back-propagates, suggesting that the target instance is aligned to several most confident/related categories on the source domain. Such a mechanism to align the joint distributions of feature and category across domains is rather implicit. To make it more explicit, our proposed target discriminative adversarial loss (cf. loss (4) in Section Discriminative Adversarial Learning in the paper) uses the conditional probabilities to weight the category-wise domain predictions. By such a design, the discriminative adversarial training on the target data explicitly conducts the competition between the domain neuron (output) and the most confident category neuron (output) as the discriminative adversarial training on the source data does, thus promoting the category-level domain alignment. This is what we mean by the mutually inhibitory relation between the category and domain predictions for any input instance.
This intuitive explanation manifests that the adversarial training of DADA clearly and explicitly utilizes the discriminative information of the target domain, thus improving the alignment of joint distributions of feature and category across domains.
Appendix B Generalization Error Analysis for Our Proposed DADA
We prove that our proposed DADA can better bound the expected target error than existing domain adaptation methods [13, 58, 44, 46, 67, 55, 34, 62, 63, 56], taking the similar formalism of theoretical results of domain adaptation [2, 1].
For all hypothesis spaces introduced below, we assume they are of finite effective size, i.e., finite VC dimension, so that the following distance measures defined over these spaces can be estimated from finite instances . We consider a fixed representation function from the instance set to the feature space , i.e., , and a hypothesis space for the -category task classifier from the feature space to the label space , i.e., . Note that is the -dimensional one-hot vector for any label . Denote the marginal feature distribution and the joint distribution of feature and category by and for the source domain , and similarly and for the target domain , respectively. Let be the expected source error of a hypothesis w.r.t. the joint distribution , where is the indicator function which is 1 if predicate is true, and 0 otherwise. Similarly, denotes the expected target error of w.r.t. the joint distribution . Let be the ideal joint hypothesis that explicitly embodies the notion of adaptability . Let and be the disagreement between hypotheses and w.r.t. the joint distributions and respectively. As specified by the two works [2, 1], the probabilistic bound of the expected target error of the hypothesis is given by the sum of the expected source error , the combined error of the ideal joint hypothesis , and the distribution discrepancy across data domains, as follows
For domain adaptation to be possible, a natural assumption is that there exists the ideal joint hypothesis so that the combined error is small. The ideal joint hypothesis may not be unique, since in practice we always have the same error obtained by two different machine learning models. Denote a set of ideal joint hypotheses by , which is a subset of , i.e., . Based on this assumption, domain adaptation aims to reduce the domain discrepancy . Let be the proxy of the label vector of , for every pair of . Denote the thus obtained proxies of the joint distributions and by and , respectively . Then, by definition, , and similarly . Based on the two joint distribution proxies, we have the domain discrepancy
Let be a (loss) difference hypothesis space over the joint variable of , where computes the empirical 0-1 classification loss of the task classifier for any input pair of . Then, the -distance between two distributions and , is defined as
Let be a (loss) difference hypothesis space, which contains a class of functions over the joint variable of . Then, the -distance between two distributions and , is defined as
Let be a (loss) difference hypothesis space over the joint variable of , where computes the empirical 0-1 classification loss of the task classifier for any input pair of . Then, the -distance between two distributions and , is defined as
Let be a (loss) difference hypothesis space, which contains a class of functions over . Then, the -distance between two distributions and , is defined as
We are now ready to give an upper bound on the domain discrepancy in terms of the distance measures we have defined.
The distribution discrepancy between the source and target domains can be upper bounded by the -distance, the -distance, the -distance, and the -distance as follows