Domain Agnostic Learning with Disentangled Representations




Unsupervised model transfer has the potential to greatly improve the generalizability of deep models to novel domains. Yet the current literature assumes that the separation of target data into distinct domains is known a priori. In this paper, we propose the task of Domain-Agnostic Learning (DAL): how can knowledge be transferred from a labeled source domain to unlabeled data from arbitrary target domains? To tackle this problem, we devise a novel Deep Adversarial Disentangled Autoencoder (DADA) capable of disentangling domain-specific features from class identity. We demonstrate experimentally that when the target domain labels are unknown, DADA leads to state-of-the-art performance on several image classification datasets.


Xingchao Peng (Boston University), Zijun Huang (Columbia University and MADO AI Research), Ximeng Sun (Boston University), Kate Saenko (Boston University)

Affiliations: Computer Science Department, Boston University, 111 Cummington Mall, Boston, MA 02215, USA; Columbia University and MADO AI Research, 116th St and Broadway, New York, NY 10027, USA. Corresponding author: Kate Saenko.


domain-agnostic learning, domain adaptation, feature disentanglement


1 Introduction

Supervised machine learning assumes that training and testing data are sampled i.i.d. from the same distribution. In practice, however, the training and testing data are typically collected from related domains with different distributions, a phenomenon known as domain shift Quiñonero-Candela et al. (2009). To avoid the cost of annotating each new test domain, Unsupervised Domain Adaptation (UDA) tackles domain shift by aligning the feature distribution of the source domain with that of the target domain, resulting in domain-invariant features. However, current methods assume that target samples have domain labels and therefore can be isolated into separate homogeneous domains. For many practical applications, this is an overly strong assumption. For example, a hand-written character recognition system could encounter characters written by different people, on different materials, and under different lighting conditions; an image recognition system applied to images scraped from the web must handle mixed-domain data (e.g. paintings, sketches, clipart) without their domain labels.

In this paper, we consider Domain-Agnostic Learning (DAL), a more difficult but practical problem of knowledge transfer from one labeled source domain to multiple unlabeled target domains. The main challenges of domain-agnostic learning are that: (1) the target data has mixed domains, which hampers the effectiveness of mainstream feature alignment methods Long et al. (2015); Sun & Saenko (2016); Saito et al. (2018), and (2) class-irrelevant information leads to negative transfer Pan & Yang (2010), especially when the target domain is highly heterogeneous.

Mainstream UDA methods align the source domain to the target domain by minimizing the Maximum Mean Discrepancy Long et al. (2015); Tzeng et al. (2014), aligning high-order moments Sun & Saenko (2016); Zellinger et al. (2017), or adversarial training Ganin & Lempitsky (2015); Tzeng et al. (2017). However, these methods are designed for one-to-one domain alignment and do not account for multiple latent domains in the target. Multi-source domain adaptation Peng et al. (2018); Xu et al. (2018); Mansour et al. (2009) considers adaptation between multiple sources and a single target domain and assumes domain labels on the source data. Continuous domain adaptation Hoffman et al. (2014) aims to transfer knowledge to a continuously changing domain (e.g. cars in different decades), but in their scenario the target data are temporally related. Recently, domain generalization approaches Li et al. (2018a); Carlucci et al. (2018b); Li et al. (2018b) have been introduced to adapt from multiple labeled source domains to an unseen target domain. All of the above models make a strong assumption that the target data are homogeneously sampled from the same distribution, unlike the scenario we consider here.

We postulate that a solution to domain-agnostic learning should not only learn invariance between source and target, but should also actively disentangle the class-specific features from the remaining information in the image. Deep neural networks are known to extract features in which multiple hidden factors are highly entangled Bengio et al. (2013). Recent work attempts to disentangle features in the latent space of autoencoders with adversarial training Cao et al. (2018); Liu et al. (2018b); Odena et al. (2017); Lee et al. (2018). However, the above models have limited capacity in transferring features learned from one domain to heterogeneous target domains.  Liu et al. (2018a) proposes a framework that takes samples from multiple domains as input, and derives a domain-invariant latent feature space via adversarial training. This model is limited by two factors when applied to the DAL task. First, it only disentangles the embeddings into domain-invariant features and domain-specific features such as weather conditions, and discards the latter, but does not explicitly try to separate class-relevant features from class-irrelevant features like background. Second, there is no guarantee that the domain-invariant features are fully disentangled from the domain-specific features.

To address the issues mentioned above, we propose a novel Deep Adversarial Disentangled Autoencoder (DADA), aiming to tackle domain-agnostic learning by disentangling the domain-invariant features from both domain-specific and class-irrelevant features simultaneously. First, in addition to domain disentanglement Liu et al. (2018a); Cao et al. (2018); Lee et al. (2018), we employ class disentanglement to remove class-irrelevant features, as shown in Figure 1. The class disentanglement is trained in an adversarial fashion: a class identifier is trained on the labeled source domain and the disentangler generates features to fool the class identifier. To the best of our knowledge, we are the first to show that class disentanglement boosts domain adaptation performance. Second, to enhance the disentanglement, we propose to minimize the mutual information between the disentangled features. We implement a neural network to estimate the mutual information between the disentangled feature distributions, inspired by a recently published theoretical work Belghazi et al. (2018). Comprehensive experiments on standard image recognition datasets demonstrate that our derived disentangled representation achieves significant improvements over the state-of-the-art methods on the task of domain-agnostic learning.

The main contributions of this paper are as follows: (1) we propose a novel learning paradigm of domain-agnostic learning; (2) we develop an end-to-end Deep Adversarial Disentangled Autoencoder (DADA) which learns a better disentangled feature representation to tackle the task; and (3) we propose class disentanglement to remove class-irrelevant features, and minimize the mutual information to enhance the disentanglement.

2 Related Work

Figure 1: Our DADA architecture learns to extract domain-invariant features of visual categories. In addition to domain disentanglement (blue lines), we employ class disentanglement (red lines) to remove class-irrelevant features, both trained adversarially. We further apply a mutual information minimizer to strengthen the disentanglement.

Domain Adaptation Unsupervised domain adaptation (UDA) aims to transfer the knowledge learned from one or more labeled source domains to an unlabeled target domain. Various methods have been proposed, including discrepancy-based UDA approaches Long et al. (2017); Tzeng et al. (2014); Ghifary et al. (2014); Peng & Saenko (2018), adversary-based approaches Liu & Tuzel (2016); Tzeng et al. (2017); Liu et al. (2018a), and reconstruction-based approaches Yi et al. (2017); Zhu et al. (2017); Hoffman et al. (2018); Kim et al. (2017). These models are typically designed to tackle single source to single target adaptation. Compared with single source adaptation, multi-source domain adaptation (MSDA) assumes that training data are collected from multiple sources. Originating from the theoretical analysis in Ben-David et al. (2010); Mansour et al. (2009); Crammer et al. (2008), MSDA has been applied to many practical applications Xu et al. (2018); Duan et al. (2012); Peng et al. (2018). Specifically, Ben-David et al. (2010) introduce an $\mathcal{H}\Delta\mathcal{H}$-divergence between the weighted combination of source domains and a target domain. We propose a new and more practical learning paradigm, not yet considered in the UDA literature, where labeled data come from a single source domain but the testing data contain multiple unknown domains.

Representation Disentanglement The goal of learning disentangled representations is to model the factors of data variation. Recent works Mathieu et al. (2016); Makhzani et al. (2016); Liu et al. (2018a); Odena et al. (2017) aim at learning an interpretable representation using generative adversarial networks (GANs) Goodfellow et al. (2014); Kingma et al. (2014) and variational autoencoders (VAEs) Rezende et al. (2014); Kingma & Welling (2013). In a fully supervised setting, Lee et al. (2018) proposes to disentangle the feature representation into a domain-invariant content space and a domain-specific attribute space, producing diverse outputs without paired training images. Another work Odena et al. (2017) proposes an auxiliary classifier GAN (AC-GAN) to achieve representation disentanglement. Despite promising performance, these methods focus on disentangling representation in a single domain.  Liu et al. (2018a) introduces a unified feature disentangler to learn a domain-invariant representation from data across multiple domains. However, their model assumes that multiple source domains are available during training, which limits its practical application. In contrast, our model disentangles the representation based on one source domain and multiple unknown target domains, and proposes an improved approach to disentanglement that considers the class label and mutual information between features.

Agnostic Learning There are several prior studies of agnostic learning that are related to our work. Model-Agnostic Meta-Learning (MAML) Finn et al. (2017) aims to train a model on a variety of learning tasks and solve a new task using only a few training examples. Unlike MAML, our method focuses on transferring knowledge to heterogeneous domains. Carlucci et al. (2018a) propose a learning framework that extends the knowledge from multiple source domains to an unseen target domain by pixel-adaptation in an incremental architecture. Romijnders et al. (2018) introduce a domain-agnostic normalization layer for adversarial UDA and improve the performance of deep models on an unseen domain. Though the results are promising, we argue that normalizing the feature representation alone is not enough for domain-agnostic learning, and that extracting disentangled domain-invariant and domain-specific features is also important.

3 DADA: Deep Adversarial Disentangled Autoencoder

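We define the domain-agnostic learning task as follows: Given a source domain $\mathcal{D}_S = \{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ with $n_s$ labeled examples, the goal is to minimize risk on $N$ target domains $\widehat{\mathcal{D}}_T = \{\mathcal{D}_{T_1}, \mathcal{D}_{T_2}, \ldots, \mathcal{D}_{T_N}\}$ without domain labels. We denote the target domains as $\widehat{\mathcal{D}}_T = \{\mathbf{x}_j^t\}_{j=1}^{n_t}$ with $n_t$ unlabeled examples. Empirically, we want to minimize the target risk $\epsilon_t(\theta) = \Pr_{(\mathbf{x}, y) \sim \widehat{\mathcal{D}}_T}[\theta(\mathbf{x}) \neq y]$, where $\theta(\mathbf{x})$ is the classifier.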

We propose to solve the task by learning domain-invariant features that are discriminative of the class. Figure 1 shows the proposed model. The feature generator $G$ maps the input image to a feature vector $f_G$, which has many highly entangled factors. The disentangler $D$ is responsible for disentangling the features ($f_G$) into domain-invariant features ($f_{di}$), domain-specific features ($f_{ds}$), and class-irrelevant features ($f_{ci}$). The feature reconstructor $R$ aims to recover $f_G$ from either ($f_{di}$, $f_{ds}$) or ($f_{di}$, $f_{ci}$). $D$ and $R$ are implemented as the encoder and decoder in a Variational Autoencoder. A mutual information minimizer is applied between $f_{di}$ and $f_{ci}$, as well as between $f_{di}$ and $f_{ds}$, to enhance the disentanglement. Adversarial training via a domain identifier aligns the source domain and the heterogeneous target domain in the $f_{di}$ space. A class identifier $C$ is trained on the labeled source domain to predict the class distribution $f_C$ and to adversarially extract class-irrelevant features $f_{ci}$. We next describe each component in detail.

Variational Autoencoders VAEs Kingma & Welling (2013) are a class of deep generative models that simultaneously train both a probabilistic encoder and decoder. The encoder is trained to generate latent vectors that roughly follow a Gaussian distribution. In our case, we learn each part of our disentangled representations by applying a VAE architecture with the following objective function:

$$\mathcal{L}_{vae} = \|\widehat{f}_G - f_G\|_F^2 + KL\big(q(z \mid f_G) \,\|\, p(z)\big) \tag{1}$$

where the first term aims at recovering the original features extracted by $G$, and the second term calculates the Kullback-Leibler divergence, which penalizes the deviation of latent features from the prior distribution $p(z)$ (with $z \sim \mathcal{N}(0, I)$). However, this property cannot guarantee that domain-invariant features are well disentangled from the domain-specific features or from class-irrelevant features, as the loss function in Equation 1 only aligns the latent features to a normal distribution.
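As a rough illustration of the two terms in Equation 1, the sketch below computes a squared-error reconstruction term and the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. The diagonal-Gaussian parameterization (`mu`, `log_var`) and the function names are assumptions made for this example; the paper does not specify them.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumes the encoder outputs a mean and a log-variance per latent
    dimension (a common VAE choice, not stated in the paper).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vae_loss(f_g, f_g_hat, mu, log_var):
    """Squared-error reconstruction of f_G plus the KL penalty."""
    recon = sum((a - b) ** 2 for a, b in zip(f_g_hat, f_g))
    return recon + kl_to_standard_normal(mu, log_var)

# Hypothetical two-dimensional features and a one-dimensional latent code.
loss = vae_loss(f_g=[1.0, 2.0], f_g_hat=[0.5, 2.0], mu=[1.0], log_var=[0.0])
```

Both terms are zero only when the reconstruction is exact and the posterior equals the prior, which is why this loss alone says nothing about which factors end up in which latent code.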

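Class Disentanglement To address the above problem, we employ class disentanglement to remove class-irrelevant features, such as background, in an adversarial way. First, we train the disentangler $D$ and the $K$-way class identifier $C$ to correctly predict the labels, supervised by the cross-entropy loss:

$$\mathcal{L}_{ce} = -\mathbb{E}_{(\mathbf{x}^s, y^s) \sim \mathcal{D}_S} \sum_{k=1}^{K} \mathbb{1}[k = y^s] \log C(f_D) \tag{2}$$

where $f_D \in \{f_{di}, f_{ci}\}$.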

In the second step, we fix the class identifier and train the disentangler to fool the class identifier by generating class-irrelevant features $f_{ci}$. This can be achieved by minimizing the negative entropy of the predicted class distribution:

$$\mathcal{L}_{ent} = -\frac{1}{n_s} \sum_{j=1}^{n_s} \log C(f_{ci}^{j}) - \frac{1}{n_t} \sum_{j=1}^{n_t} \log C(f_{ci}^{j}) \tag{3}$$

where the first term and the second term indicate minimizing the entropy on the source domain and on the heterogeneous target, respectively. The above adversarial training process forces the corresponding disentangler to extract class-irrelevant features.
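The two-step procedure can be sketched in plain Python. This example reads the second step literally as the negative Shannon entropy of the predicted class distribution; that reading, the softmax output layer, and all function names are assumptions for illustration rather than the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Step 1: supervised loss for the class identifier on a source sample."""
    return -math.log(softmax(logits)[label])

def negative_entropy(logits):
    """Step 2: quantity the disentangler minimizes on class-irrelevant features."""
    probs = softmax(logits)
    return sum(p * math.log(p) for p in probs if p > 0.0)

confident = negative_entropy([5.0, 0.0, 0.0, 0.0])  # peaked prediction
uniform = negative_entropy([0.0, 0.0, 0.0, 0.0])    # uninformative prediction
```

The negative entropy is smallest (equal to minus the log of the number of classes) for a uniform prediction, so minimizing it pushes the class-irrelevant features toward carrying no usable class information.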

Domain Disentanglement To tackle the domain-agnostic learning task, disentangling class-irrelevant features is not enough, as it fails to align the source domain with the target. To achieve better alignment, we further propose to disentangle the learned features into domain-specific and domain-invariant features, and thus to align the source with the target domain in the domain-invariant latent space. This is achieved by exploiting adversarial domain classification in the resulting latent space. Specifically, we leverage a domain identifier $DI$, which takes the disentangled feature ($f_{di}$ or $f_{ds}$) as input and outputs the domain label $l_f$ (source or target). The objective function of the domain identifier is as follows:

$$\mathcal{L}_{DI} = -\mathbb{E}\big[l_f \log P(l_f)\big] - \mathbb{E}\big[(1 - l_f) \log P(1 - l_f)\big] \tag{4}$$

Then the disentangler is trained to fool the domain identifier in order to extract domain-invariant features.
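A minimal sketch of a binary domain-classification loss of this kind is given below. The label encoding (1 for source, 0 for target), the clamping constant, and the function name are assumptions chosen for the example.

```python
import math

def domain_identifier_loss(p_source, labels):
    """Mean binary cross-entropy for source-vs-target prediction.

    p_source: predicted probability that each feature comes from the source.
    labels:   1 for source samples, 0 for target samples (assumed encoding).
    """
    eps = 1e-12  # clamp to keep the logarithm finite
    total = 0.0
    for p, label in zip(p_source, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(labels)

# A domain identifier that has been fooled outputs 0.5 everywhere.
fooled = domain_identifier_loss([0.5, 0.5], [1, 0])
accurate = domain_identifier_loss([0.9, 0.1], [1, 0])
```

When the disentangler succeeds, the identifier is reduced to chance and its loss approaches log 2, whereas an identifier that can still separate the domains achieves a lower loss.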

Figure 2: We demonstrate the effectiveness of DADA on three datasets: Digit-Five, Office-Caltech10 Gong et al. (2012), and DomainNet Peng et al. (2018). The Digit-Five dataset includes: MNIST (mt), MNIST-M (mm), SVHN (sv), Synthetic (syn), and USPS (up). The Office-Caltech10 dataset contains: Amazon (A), Caltech (C), DSLR (D), and Webcam (W). The DomainNet dataset includes: clipart (clp), infograph (inf), painting (pnt), quickdraw (qdr), real (rel), and sketch (skt).

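Mutual Information Minimization To better disentangle the features, we minimize the mutual information between domain-invariant and domain-specific features ($f_{di}$, $f_{ds}$), as well as domain-invariant and class-irrelevant features ($f_{di}$, $f_{ci}$):

$$I(\mathcal{D}_x; \mathcal{D}_{f_{di}}) = \int_{\mathcal{X} \times \mathcal{Z}} \log \frac{d\mathbb{P}_{XZ}}{d\mathbb{P}_X \otimes \mathbb{P}_Z} \, d\mathbb{P}_{XZ} \tag{5}$$

where $x \in \{f_{ds}, f_{ci}\}$, $\mathbb{P}_{XZ}$ is the joint probability distribution of ($\mathcal{D}_x$, $\mathcal{D}_{f_{di}}$), and $\mathbb{P}_X = \int_{\mathcal{Z}} d\mathbb{P}_{XZ}$ and $\mathbb{P}_Z = \int_{\mathcal{X}} d\mathbb{P}_{XZ}$ are the marginals. Despite being a pivotal measure across different domains, the mutual information is only tractable for discrete variables, or for a limited family of problems where the probability distributions are known Belghazi et al. (2018). The computation incurs a complexity of $O(n^2)$, which is undesirable for deep CNNs. In this paper, we adopt the Mutual Information Neural Estimator (MINE) Belghazi et al. (2018)

$$\widehat{I(X; Z)}_n = \sup_{\theta \in \Theta} \mathbb{E}_{\mathbb{P}^{(n)}_{XZ}}[T_\theta] - \log\big(\mathbb{E}_{\mathbb{P}^{(n)}_X \otimes \widehat{\mathbb{P}}^{(n)}_Z}[e^{T_\theta}]\big) \tag{6}$$

which provides a consistent estimate of mutual information from $n$ i.i.d. samples by leveraging a neural network $T_\theta$.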

Algorithm 1 Learning algorithm for DADA

Input: source labeled dataset $\{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$; heterogeneous target dataset $\{\mathbf{x}_j^t\}_{j=1}^{n_t}$; feature extractor $G$; disentangler $D$; class identifier $C$; domain identifier $DI$; mutual information estimator $M$; and reconstructor $R$.
Output: well-trained feature extractor $G^*$, well-trained disentangler $D^*$, and class identifier $C^*$.

while not converged do
     Sample a mini-batch from $\{(\mathbf{x}_i^s, y_i^s)\}$ and $\{\mathbf{x}_j^t\}$;
     Class Disentanglement:
     for 1:iter do
          Update $G$, $D$, $C$ by Eq. 2;
          Update $D$ by Eq. 3;
     end for
     Domain Disentanglement:
     Update $D$ and $DI$ by Eq. 4;
     Mutual Information Minimization:
     Calculate the mutual information between the disentangled feature pairs ($f_{di}$, $f_{ds}$) and ($f_{di}$, $f_{ci}$) with $M$;
     Update $D$, $M$ by Eq. 7;
     Reconstruction:
     Reconstruct $f_G$ from ($f_{di}$, $f_{ci}$) and ($f_{di}$, $f_{ds}$) with $R$;
     Update $D$, $R$ by Eq. 1;
end while
return $G^* = G$; $C^* = C$; $D^* = D$.

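Practically, MINE (6) can be computed as $I(X; Z) = \iint \mathbb{P}^{n}_{XZ}(x, z)\, T(x, z, \theta) - \log\big(\iint \mathbb{P}^{n}_X(x)\, \mathbb{P}^{n}_Z(z)\, e^{T(x, z, \theta)}\big)$. Additionally, to avoid computing the integrals, we leverage Monte-Carlo integration:

$$I(X, Z) = \frac{1}{n} \sum_{i=1}^{n} T(x, z, \theta) - \log\Big(\frac{1}{n} \sum_{i=1}^{n} e^{T(x, z', \theta)}\Big) \tag{7}$$

where $(x, z)$ are sampled from the joint distribution and $z'$ is sampled from the marginal distribution. We implement a neural network to perform the Monte-Carlo integration defined in Equation 7.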
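The Monte-Carlo estimate can be sketched as follows. For simplicity the statistic is passed in as a fixed Python function, whereas in the paper it is a neural network whose parameters are trained; the function names and the shuffling trick for drawing marginal samples are assumptions for this illustration.

```python
import math
import random

def mine_lower_bound(statistic, joint_pairs, marginal_pairs):
    """Mean of T over joint samples minus log-mean-exp of T over marginal samples."""
    joint_term = sum(statistic(x, z) for x, z in joint_pairs) / len(joint_pairs)
    exp_mean = sum(math.exp(statistic(x, z)) for x, z in marginal_pairs)
    exp_mean /= len(marginal_pairs)
    return joint_term - math.log(exp_mean)

def shuffled_marginals(joint_pairs, rng):
    """Approximate samples from the product of marginals by shuffling z."""
    zs = [z for _, z in joint_pairs]
    rng.shuffle(zs)
    return [(x, z) for (x, _), z in zip(joint_pairs, zs)]

# Toy data: x and z are perfectly dependent binary variables.
joint = [(1, 1), (-1, -1)]
independent = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
estimate = mine_lower_bound(lambda x, z: x * z, joint, independent)
```

With this hand-picked statistic the estimate is positive but stays below the true mutual information of log 2, which reflects that the expression is a lower bound that becomes tight only after maximizing over the statistic.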

Ring-style Normalization Conventional batch normalization Ioffe & Szegedy (2015) diminishes internal covariate shift by subtracting the batch mean and dividing by the batch standard deviation. Despite promising results on domain adaptation, batch normalization alone is not enough to guarantee that the embedded features are well normalized in the scenario of heterogeneous domains. The target data are sampled from multiple domains and their embedded features are scattered irregularly in the latent space. Zheng et al. (2018) propose a ring-style norm constraint to maintain a balance between the angular classification margins of multiple classes. Its objective is as follows:

$$\mathcal{L}_{ring} = \frac{1}{2n} \sum_{i=1}^{n} \big(\|T(\mathbf{x}_i)\|_2 - R\big)^2 \tag{8}$$

where $R$ is the learned norm value. However, ring loss is not robust and may cause mode collapse if the learned $R$ is small. Instead, we incorporate the ring loss into a Geman-McClure model and minimize the following loss function:

$$\mathcal{L}^{GM}_{ring} = \frac{\sum_{i=1}^{n} \big(\|T(\mathbf{x}_i)\|_2 - R\big)^2}{2n\beta + \sum_{i=1}^{n} \big(\|T(\mathbf{x}_i)\|_2 - R\big)^2} \tag{9}$$

where $\beta$ is the scale factor of the Geman-McClure model.
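The effect of the Geman-McClure form can be checked numerically. The sketch below assumes the ring loss is the mean squared deviation of the feature norm from the learned radius (with a factor of one half) and that the robust variant has the form s / (2nβ + s), where s is the summed squared deviation; both forms and all names are assumptions for illustration.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def ring_loss(features, radius):
    """Half the mean squared deviation of feature norms from the radius."""
    n = len(features)
    return sum((l2_norm(f) - radius) ** 2 for f in features) / (2.0 * n)

def geman_mcclure_ring_loss(features, radius, beta):
    """Ring loss passed through a Geman-McClure style saturation."""
    n = len(features)
    sq = sum((l2_norm(f) - radius) ** 2 for f in features)
    return sq / (2.0 * n * beta + sq)

feats = [[3.0, 4.0], [6.0, 8.0]]  # norms 5 and 10
plain = ring_loss(feats, radius=5.0)
robust = geman_mcclure_ring_loss(feats, radius=5.0, beta=1.0)
```

Unlike the plain ring loss, the robust variant is bounded above by one, so a few features with very large norm deviations cannot dominate the objective.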

Optimization Our model is trained in an end-to-end fashion. We train the class and domain disentanglement components, MINE, and the reconstruction component iteratively with the Stochastic Gradient Descent Kiefer et al. (1952) or Adam Kingma & Ba (2014) optimizer. We employ popular neural networks (e.g. LeNet, AlexNet, or ResNet) as our feature generator $G$. The detailed training procedure is presented in Algorithm 1.

4 Experiments


| Models | mt→mm,sv,sy,up | mm→mt,sv,sy,up | sv→mt,mm,sy,up | sy→mt,mm,sv,up | up→mt,mm,sv,sy | Avg |
|---|---|---|---|---|---|---|
| Source Only | 20.5±1.2 | 53.5±0.9 | 62.9±0.3 | 77.9±0.4 | 22.6±0.4 | 47.5 |
| DAN Long et al. (2015) | 21.7±1.0 | 55.3±0.7 | 63.2±0.5 | 79.3±0.2 | 40.2±0.4 | 51.9 |
| DANN Ganin & Lempitsky (2015) | 22.8±1.1 | 45.2±0.6 | 61.8±0.2 | 79.3±0.3 | 38.7±0.6 | 49.6 |
| ADDA Tzeng et al. (2017) | 23.4±1.3 | 54.8±0.8 | 63.5±0.4 | 79.6±0.3 | 43.5±0.5 | 52.9 |
| UFDN Liu et al. (2018a) | 20.2±1.5 | 41.6±0.7 | 64.5±0.4 | 60.7±0.3 | 44.6±0.2 | 46.3 |
| MCD Saito et al. (2018) | 28.7±1.3 | 43.8±0.8 | 75.1±0.3 | 78.9±0.3 | 55.3±0.4 | 56.4 |
| DADA+class (I) | 28.9±1.2 | 50.1±0.9 | 65.4±0.2 | 79.8±0.1 | 50.4±0.3 | 54.9 |
| DADA+domain (II) | 34.1±1.7 | 57.1±0.4 | 71.3±0.4 | 82.5±0.3 | 45.4±0.4 | 57.5 |
| DADA+ring (III) | 35.3±1.5 | 57.5±0.6 | 80.1±0.3 | 82.9±0.2 | 46.2±0.3 | 60.4 |
| DADA+rec (IV) | 39.4±1.4 | 61.1±0.7 | 80.1±0.4 | 83.7±0.2 | 47.2±0.4 | 62.3 |

Table 1: Accuracy (%) on the “Digit-Five” dataset with the domain-agnostic learning protocol. DADA achieves 62.3% average accuracy, significantly outperforming the other baselines. We incrementally add each component to our model to study its effect on the final results. (Model I: with class disentanglement; model II: I + domain disentanglement; model III: II + ring loss; model IV: III + reconstruction loss. mt, up, sv, sy, mm are abbreviations for MNIST, USPS, SVHN, Synthetic Digits, MNIST-M.)
Figure 3: Feature visualization: t-SNE plots of (a) source features, (b) UFDN Liu et al. (2018a) features, (c) MCD Saito et al. (2018) features, and (d) DADA features on the agnostic target domain in the sv→mm,mt,up,sy setting. We use different markers and different colors to denote different categories. (Best viewed in color.)

We compare the DADA model to state-of-the-art domain adaptation algorithms on the following tasks: digit classification (MNIST, SVHN, USPS, MNIST-M, Synthetic Digits) and image recognition (Office-Caltech10 Gong et al. (2012), DomainNet Peng et al. (2018)). Sample images of these datasets can be seen in Figure 2. Table 6 (supplementary material) shows the detailed number of images we use in our experiments. In the main paper, we only report major results; more implementation details are provided in the supplementary material. All of our experiments are implemented in PyTorch.

4.1 Experiments on Digit Recognition

Digit-Five This dataset is a collection of five benchmarks for digit recognition, namely MNIST LeCun et al. (1998), Synthetic Digits Ganin & Lempitsky (2015), MNIST-M Ganin & Lempitsky (2015), SVHN, and USPS. In our experiments, we take turns setting one domain as the source domain and the rest as the mixed target domain (discarding both the class and the domain labels), leading to five transfer tasks. To explore the effectiveness of each component in our model, we propose four different ablations, i.e. model I: with class disentanglement; model II: I + domain disentanglement; model III: II + ring loss; model IV: III + reconstruction loss. The detailed architecture of our model can be seen in Table 5 (supplementary material).
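The leave-one-domain-as-source protocol described above can be enumerated with a few lines of Python. The helper name `dal_tasks` is hypothetical; the abbreviations follow Table 1.

```python
def dal_tasks(domains):
    """Each domain serves once as the labeled source.

    The remaining domains are pooled into one unlabeled target whose
    class and domain labels are discarded.
    """
    return [(src, [d for d in domains if d != src]) for src in domains]

digit_five = ["mt", "mm", "sv", "sy", "up"]
for source, targets in dal_tasks(digit_five):
    print(f"{source} -> {','.join(targets)}")
```

Applied to the five digit benchmarks this yields the five transfer tasks that appear as columns in Table 1.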


| Method | A→C,D,W | C→A,D,W | D→A,C,W | W→A,C,D | Average |
|---|---|---|---|---|---|
| AlexNet Krizhevsky et al. (2012) | 83.1±0.2 | 88.9±0.4 | 86.7±0.4 | 82.2±0.3 | 85.2 |
| DAN Long et al. (2015) | 82.5±0.3 | 86.2±0.4 | 75.7±0.5 | 80.4±0.2 | 81.2 |
| RTN Long et al. (2016) | 85.2±0.4 | 89.8±0.3 | 81.7±0.3 | 83.7±0.4 | 85.1 |
| JAN Long et al. (2017) | 83.5±0.3 | 88.5±0.2 | 80.1±0.3 | 85.9±0.4 | 84.5 |
| DANN Ganin & Lempitsky (2015) | 85.9±0.4 | 90.5±0.3 | 88.6 | 90.4±0.2 | 88.9 |
| DADA (Ours) | 86.3±0.3 | 91.7±0.4 | 89.9±0.3 | 91.3±0.3 | 89.8 |
| ResNet He et al. (2016) | 90.5±0.3 | 94.3±0.2 | 88.7±0.4 | 82.5±0.3 | 89.0 |
| SE French et al. (2018) | 90.3±0.4 | 94.7±0.4 | 88.5±0.3 | 85.3±0.4 | 89.7 |
| MCD Saito et al. (2018) | 91.7±0.4 | 95.3±0.3 | 89.5±0.2 | 84.3±0.2 | 90.2 |
| DANN Ganin & Lempitsky (2015) | 91.5±0.4 | 94.3±0.4 | 90.5±0.3 | 86.3±0.3 | 90.6 |
| DADA (Ours) | 92.0 | 95.1 | 91.3±0.4 | 93.1±0.3 | 92.9 |

Table 2: Accuracy (%) on the Office-Caltech10 dataset with the DAL protocol. The methods in the upper block (AlexNet through the first DADA row) are based on the “AlexNet” backbone, and the methods in the lower block are based on the “ResNet” backbone. For both backbones, our model outperforms the other baselines on average.
Figure 4: Empirical analysis: (a) $\mathcal{A}$-distance of ResNet, MCD and DADA features on two different tasks; (b) training loss and accuracy on the C→A,D,W task; (c)-(d) confusion matrices of the MCD and DADA models on the W→A,C,D task.

We compare our model to state-of-the-art baselines: Deep Adaptation Network (DAN) Long et al. (2015), Domain Adversarial Neural Network (DANN) Ganin & Lempitsky (2015), Adversarial Discriminative Domain Adaptation (ADDA) Tzeng et al. (2017), Maximum Classifier Discrepancy (MCD) Saito et al. (2018), and Unified Feature Disentangler Network (UFDN) Liu et al. (2018a). Specifically, DAN applies the MMD loss Gretton et al. (2007) to align the source domain with the target domain in a reproducing kernel Hilbert space. DANN and ADDA align the source domain with the target domain by an adversarial loss. MCD is a domain adaptation framework which incorporates two classifiers. UFDN employs a variational autoencoder Kingma & Welling (2013) to disentangle domain-invariant representations. When conducting the baseline experiments, we utilize the code provided by the authors and keep the original experimental settings.

Results and Analysis The experimental results on the “Digit-Five” dataset are shown in Table 1. From these, we can make the following observations. (1) Model IV achieves 62.3% average accuracy, significantly outperforming other baselines on most of the domain-agnostic tasks. (2) The results of models I and II demonstrate the effectiveness of class disentanglement and domain disentanglement. Without minimizing the mutual information between disentangled features, UFDN performs poorly on this task. (3) In model III, the ring loss boosts the average accuracy by about three percentage points (from 57.5% to 60.4%), demonstrating that feature normalization is essential in domain-agnostic learning.

To dive deeper into the disentangled features, we plot in Figure 3(a)-3(d) the t-SNE embeddings of the feature representations learned on the sv→mm,mt,up,sy task with source-only features, UFDN features, MCD features, and DADA features, respectively. We observe that the features derived by our model are more separated between classes than UFDN and MCD features.


| Models | clp→rest | inf→rest | pnt→rest | qdr→rest | rel→rest | skt→rest | Avg |
|---|---|---|---|---|---|---|---|
| AlexNet Krizhevsky et al. (2012) | 22.5±0.4 | 15.3±0.2 | 21.2±0.3 | 6.0±0.2 | 17.2±0.3 | 21.8±0.3 | 17.3 |
| DAN Long et al. (2015) | 23.7±0.3 | 14.9±0.4 | 22.7±0.2 | 7.6±0.3 | 19.4±0.4 | 23.4±0.5 | 18.6 |
| RTN Long et al. (2016) | 21.4±0.3 | 14.2±0.3 | 21.0±0.4 | 7.7±0.2 | 17.8±0.3 | 20.8±0.4 | 17.2 |
| JAN Long et al. (2017) | 21.1±0.4 | 16.5±0.2 | 21.6±0.3 | 9.9±0.1 | 15.4±0.2 | 22.5±0.3 | 17.8 |
| DANN Ganin & Lempitsky (2015) | 24.1±0.2 | 15.2±0.4 | 24.5±0.3 | 8.2±0.4 | 18.0±0.3 | 24.1±0.4 | 19.1 |
| DADA (Ours) | 23.9±0.4 | 17.9±0.4 | 25.4±0.5 | 9.4±0.2 | 20.5±0.3 | 25.2±0.4 | 20.4 |
| ResNet101 He et al. (2016) | 25.6±0.2 | 16.8±0.3 | 25.8±0.4 | 9.2±0.2 | 20.6±0.5 | 22.3±0.1 | 20.1 |
| SE French et al. (2018) | 21.3±0.2 | 8.5±0.1 | 14.5±0.2 | 13.8±0.4 | 16.0±0.4 | 19.7±0.2 | 15.6 |
| MCD Saito et al. (2018) | 25.1±0.3 | 19.1±0.4 | 27.0±0.3 | 10.4±0.3 | 20.2±0.2 | 22.5±0.4 | 20.7 |
| DADA (Ours) | 26.1±0.4 | 20.0±0.3 | 26.5±0.4 | 12.9±0.4 | 20.7±0.4 | 22.8±0.2 | 21.5 |

Table 3: Accuracy (%) on the DomainNet dataset Peng et al. (2018) with the DAL protocol. Each column is named by its source domain; the remaining five domains form the target. The upper block (AlexNet through the first DADA row) shows the results based on the AlexNet Krizhevsky et al. (2012) backbone, and the lower block shows the results based on the ResNet He et al. (2016) backbone. For both settings, our model outperforms the other baselines on average.


| Source | clp | inf | pnt | qdr | rel | skt | Avg |
|---|---|---|---|---|---|---|---|
| DAN (o-o) | 25.2 | 14.9 | 24.1 | 7.8 | 20.4 | 25.2 | 19.6 |
| DAN (o-m) | 23.7 | 14.9 | 22.7 | 7.6 | 19.4 | 23.4 | 18.6 |
| JAN (o-o) | 24.2 | 18.1 | 23.2 | 7.8 | 15.8 | 23.8 | 18.8 |
| JAN (o-m) | 21.1 | 16.5 | 21.6 | 9.9 | 15.4 | 22.5 | 17.8 |

Table 4: One-to-one (o-o) vs. one-to-many (o-m) alignment. Only the source domain is shown in the table; the remaining five domains are set as the target domain.

4.2 Experiments on Office-Caltech10

Office-Caltech10 Gong et al. (2012) This dataset includes 10 common categories shared by the Office-31 Saenko et al. (2010) and Caltech-256 Griffin et al. (2007) datasets. It contains four domains: Caltech (C), which is sampled from the Caltech-256 dataset; Amazon (A), which contains images collected from the Amazon website; and Webcam (W) and DSLR (D), which contain images taken by a web camera and a DSLR camera in an office environment. In our experiments, we take turns setting one domain as the source domain and the rest as the heterogeneous target domain, leading to four DAL tasks.

In our experiments, we leverage two popular networks, AlexNet Krizhevsky et al. (2012) and ResNet He et al. (2016), as the backbone of the feature generator $G$. Both networks are pre-trained on ImageNet Deng et al. (2009). Other components are randomly initialized from a normal distribution. In the optimization procedure, we set the learning rate of the randomly initialized parameters to ten times that of the pre-trained parameters. The architecture of the other components can be seen in Table 7 (supplementary material).

In addition to the baselines mentioned in Section 4.1, we add three baselines: Residual Transfer Network (RTN) Long et al. (2016), Joint Adaptation Network (JAN) Long et al. (2017), and Self Ensembling (SE) French et al. (2018). Specifically, RTN employs residual layers He et al. (2016) for better knowledge transfer, building on DAN Long et al. (2015). JAN leverages a joint MMD-loss layer to align the features in two consecutive layers. SE applies self-ensembling learning based on a teacher-student model and was the winner of the Visual Domain Adaptation Challenge. We do not apply these methods in digit recognition because the LeNet-based model LeCun et al. (1989) is too simple to add a residual or joint training layer. We also omit the ADDA and UFDN baselines, as these models fail to converge while training on Office-Caltech10 under the domain-agnostic setup.

Results The experimental results on the Office-Caltech10 dataset are shown in Table 2. For fair comparison, we utilize the same backbone as the baselines and show the results separately. From these results, we make the following observations. (1) Our model achieves 89.8% accuracy with an AlexNet backbone Krizhevsky et al. (2012) and 92.9% accuracy with a ResNet backbone, outperforming the corresponding baselines on most shifts. (2) The adversarial method (DANN) works better than the feature alignment methods (DAN, RTN, JAN). More interestingly, negative transfer Pan & Yang (2010) occurs for feature alignment methods. This is somewhat expected, as these models align the entangled features directly, including the class-irrelevant features. (3) From the ResNet results, we observe limited improvements for the baselines over the source-only model, especially for the boosting-based SE method. This phenomenon suggests that the boosting procedure works poorly when the target domain is heterogeneously distributed.

To better analyze the error modes, we plot the confusion matrices for MCD (84.3% accuracy) and DADA (93.1% accuracy) on the W→A,C,D task in Figure 4(c)-4(d). The figures illustrate that MCD mainly confuses “calculator” vs. “keyboard”, “backpack” vs. “headphones”, and “monitor” vs. “projector”, while DADA is able to distinguish them with disentangled features.

$\mathcal{A}$-Distance Ben-David et al. (2010) suggest the $\mathcal{A}$-distance as a measure of domain discrepancy. Following Long et al. (2015), we calculate the approximate $\mathcal{A}$-distance $\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon)$ for the W→A,C,D and D→A,C,W tasks, where $\epsilon$ is the generalization error of a two-sample classifier (kernel SVM) trained on the binary problem of distinguishing input samples between the source and target domains. Figure 4(a) displays $\hat{d}_{\mathcal{A}}$ for the two tasks with raw ResNet features, MCD features, and DADA features, respectively. We observe that $\hat{d}_{\mathcal{A}}$ for both MCD features and DADA features is smaller than for ResNet features, and that $\hat{d}_{\mathcal{A}}$ on DADA features is smaller than on MCD features, which is consistent with the quantitative results and demonstrates the effectiveness of our disentangled features.
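The proxy distance is a one-line transformation of the domain classifier's error, commonly written as 2(1 - 2ε). The sketch below assumes that form; the function name is hypothetical and the error values are made up for illustration, not taken from Figure 4(a).

```python
def proxy_a_distance(error):
    """Map the source-vs-target classifier error to a proxy A-distance."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)

# Illustrative error rates only.
well_aligned = proxy_a_distance(0.45)   # classifier is close to chance
poorly_aligned = proxy_a_distance(0.05)  # domains are easy to tell apart
```

A classifier at chance (error 0.5) gives a distance of zero, while a perfect classifier gives the maximum value of two, so smaller values indicate features in which source and target are harder to distinguish.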

Convergence Analysis As DADA involves multiple losses and a complex learning procedure including adversarial learning and disentanglement, we analyze the convergence performance for the C→A,D,W task, as shown in Figure 4(b) (lines are smoothed for easier analysis). We plot the cross-entropy loss on the source domain, the ring loss defined by Equation 9, the mutual information defined by Equation 7, and the accuracy in the figure. Figure 4(b) illustrates that the training losses gradually converge and the accuracy becomes steady after about 20 epochs of training.

4.3 Experiments on the DomainNet dataset

DomainNet Peng et al. (2018) This dataset contains approximately 0.6 million images distributed among 345 categories. It contains six distinct domains: Clipart (clp), a collection of clipart images; Infograph (inf), infographic images of specific objects; Painting (pnt), artistic depictions of objects in the form of paintings; Quickdraw (qdr), drawings from the worldwide players of the game “Quick Draw!”; Real (rel), photos and real-world images; and Sketch (skt), sketches of specific objects. It is very large-scale and includes rich, informative visual cues across different domains, providing a good testbed for DAL. Sample images can be seen in Figure 2. Following Section 4.2, we take turns setting one domain as the source domain and the rest as the heterogeneous target domain, leading to six DAL tasks.

Results The experimental results on DomainNet Peng et al. (2018) are shown in Table 3. The results show that our model achieves 21.5% accuracy with a ResNet backbone. Note that this dataset contains about 0.6 million images, so a one percent accuracy improvement is not a trivial achievement. Our model achieves results comparable to the best-performing baseline when the source domain is pnt or qdr, and outperforms the other baselines on the rest of the tasks. From the experimental results, we make two interesting observations. (1) In DAL, the SE model French et al. (2018) performs poorly when the number of categories is large, which is consistent with the results in Peng et al. (2018). (2) The adversarial alignment method (DANN) performs better than feature alignment methods in DAL, a trend similar to that in Section 4.2.

One-to-one vs. one-to-many alignment In the DAL task, the UDA models are performing one-to-many alignment as the target data have no domain labels. However, traditional feature alignment methods such as DAN and JAN are designed for one-to-one alignment. To investigate the effectiveness of domain labels, we design a controlled experiment for DAN and JAN. First, we provide the domain labels and perform one-to-one unsupervised domain adaptation. Then we take away the domain labels and perform one-to-many domain-agnostic learning. The results are shown in Table 4. We observe the one-to-one alignment does indeed outperform one-to-many alignment, even though the models in one-to-many alignment have seen more data. These results further demonstrate that DAL is a more challenging task and that traditional feature alignment methods need to be re-thought for this problem.

5 Conclusion

In this paper, we first propose a novel domain-agnostic learning (DAL) schema and demonstrate the importance of DAL in practical scenarios. To tackle the DAL task, we have proposed a novel Deep Adversarial Disentangled Autoencoder (DADA) to disentangle domain-invariant features in the latent space. We have proposed leveraging class disentanglement and a mutual information minimizer to enhance the feature disentanglement. Empirically, we demonstrate that the ring-loss-style normalization boosts the performance of DADA on the DAL task. An extensive empirical evaluation on DAL benchmarks demonstrates the efficacy of the proposed model against several state-of-the-art domain adaptation algorithms.

6 Acknowledgement

We thank Saito Kuniaki, Ben Usman, and Ping Hu for their useful discussions and suggestions. We thank the anonymous reviewers and area chairs for their useful insights, which helped improve this work. This work was partially supported by NSF and Honda Research Institute.


  • Belghazi et al. (2018) Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 531–540, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Cao et al. (2018) Cao, J., Katzir, O., Jiang, P., Lischinski, D., Cohen-Or, D., Tu, C., and Li, Y. Dida: Disentangled synthesis for domain adaptation. arXiv preprint arXiv:1805.08019, 2018.
  • Carlucci et al. (2018a) Carlucci, F. M., Russo, P., Tommasi, T., and Caputo, B. Agnostic domain generalization. CoRR, abs/1808.01102, 2018a.
  • Carlucci et al. (2018b) Carlucci, F. M., Russo, P., Tommasi, T., and Caputo, B. Agnostic domain generalization. CoRR, abs/1808.01102, 2018b.
  • Crammer et al. (2008) Crammer, K., Kearns, M., and Wortman, J. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
  • Duan et al. (2012) Duan, L., Xu, D., and Chang, S.-F. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1338–1345. IEEE, 2012.
  • Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • French et al. (2018) French, G., Mackiewicz, M., and Fisher, M. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations, 2018.
  • Ganin & Lempitsky (2015) Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
  • Ghifary et al. (2014) Ghifary, M., Kleijn, W. B., and Zhang, M. Domain adaptive neural networks for object recognition. In Pacific Rim international conference on artificial intelligence, pp. 898–904. Springer, 2014.
  • Gong et al. (2012) Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2066–2073. IEEE, 2012.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Gretton et al. (2007) Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample-problem. In Advances in neural information processing systems, pp. 513–520, 2007.
  • Griffin et al. (2007) Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Hoffman et al. (2014) Hoffman, J., Darrell, T., and Saenko, K. Continuous manifold based adaptation for evolving visual domains. In Computer Vision and Pattern Recognition (CVPR), 2014.
  • Hoffman et al. (2018) Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A., and Darrell, T. CyCADA: Cycle-consistent adversarial domain adaptation. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1989–1998, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456, 2015.
  • Kiefer et al. (1952) Kiefer, J., Wolfowitz, J., et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
  • Kim et al. (2017) Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1857–1865, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kingma et al. (2014) Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589, 2014.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee et al. (2018) Lee, H.-Y., Tseng, H.-Y., Huang, J.-B., Singh, M., and Yang, M.-H. Diverse image-to-image translation via disentangled representations. In Ferrari, V., Hebert, M., Sminchisescu, C., and Weiss, Y. (eds.), Computer Vision – ECCV 2018, pp. 36–52, Cham, 2018. Springer International Publishing. ISBN 978-3-030-01246-5.
  • Li et al. (2018a) Li, H., Jialin Pan, S., Wang, S., and Kot, A. C. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409, 2018a.
  • Li et al. (2018b) Li, Y., Gong, M., Tian, X., Liu, T., and Tao, D. Domain generalization via conditional invariant representation, 2018b.
  • Liu et al. (2018a) Liu, A. H., Liu, Y., Yeh, Y., and Wang, Y. F. A unified feature disentangler for multi-domain image translation and manipulation. CoRR, abs/1809.01361, 2018a.
  • Liu & Tuzel (2016) Liu, M.-Y. and Tuzel, O. Coupled generative adversarial networks. In Advances in neural information processing systems, pp. 469–477, 2016.
  • Liu et al. (2018b) Liu, Y.-C., Yeh, Y.-Y., Fu, T.-C., Wang, S.-D., Chiu, W.-C., and Wang, Y.-C. F. Detach and adapt: Learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018b.
  • Long et al. (2015) Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 97–105, Lille, France, 07–09 Jul 2015. PMLR.
  • Long et al. (2016) Long, M., Zhu, H., Wang, J., and Jordan, M. I. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pp. 136–144, 2016.
  • Long et al. (2017) Long, M., Zhu, H., Wang, J., and Jordan, M. I. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2208–2217, 2017.
  • Makhzani et al. (2016) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. ICLR workshop, 2016.
  • Mansour et al. (2009) Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21, pp. 1041–1048. Curran Associates, Inc., 2009.
  • Mathieu et al. (2016) Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., and LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.
  • Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2642–2651, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Pan & Yang (2010) Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • Peng & Saenko (2018) Peng, X. and Saenko, K. Synthetic to real adaptation with generative correlation alignment networks. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pp. 1982–1991, 2018. doi: 10.1109/WACV.2018.00219.
  • Peng et al. (2018) Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. Moment matching for multi-source domain adaptation. arXiv preprint arXiv:1812.01754, 2018.
  • Quionero-Candela et al. (2009) Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. The MIT Press, 2009. ISBN 0262170051, 9780262170055.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1278–1286, Beijing, China, 22–24 Jun 2014. PMLR.
  • Romijnders et al. (2018) Romijnders, R., Meletis, P., and Dubbelman, G. A domain agnostic normalization layer for unsupervised adversarial domain adaptation. CoRR, abs/1809.05298, 2018.
  • Saenko et al. (2010) Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In European conference on computer vision, pp. 213–226. Springer, 2010.
  • Saito et al. (2018) Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Sun & Saenko (2016) Sun, B. and Saenko, K. Deep CORAL: correlation alignment for deep domain adaptation. CoRR, abs/1607.01719, 2016.
  • Tzeng et al. (2014) Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, pp.  4, 2017.
  • Xu et al. (2018) Xu, R., Chen, Z., Zuo, W., Yan, J., and Lin, L. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3964–3973, 2018.
  • Yi et al. (2017) Yi, Z., Zhang, H. R., Tan, P., and Gong, M. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, pp. 2868–2876, 2017.
  • Zellinger et al. (2017) Zellinger, W., Grubinger, T., Lughofer, E., Natschläger, T., and Saminger-Platz, S. Central moment discrepancy (CMD) for domain-invariant representation learning. CoRR, abs/1702.08811, 2017.
  • Zheng et al. (2018) Zheng, Y., Pal, D. K., and Savvides, M. Ring loss: Convex feature normalization for face recognition. arXiv preprint arXiv:1803.00130, 2018.
  • Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

Appendix A Model Architecture

We provide the detailed model architecture (Table 5 and Table 7) for each component in our model: Generator, Disentangler, Domain Classifier, Classifier and MINE.


| layer | configuration |
|---|---|
| Feature Generator | |
| 1 | Conv2D (3, 64, 5, 1, 2), BN, ReLU, MaxPool |
| 2 | Conv2D (64, 64, 5, 1, 2), BN, ReLU, MaxPool |
| 3 | Conv2D (64, 128, 5, 1, 2), BN, ReLU |
| Disentangler | |
| 1 | FC (8192, 3072), BN, ReLU |
| 2 | DropOut (0.5), FC (3072, 2048), BN, ReLU |
| Domain Identifier | |
| 1 | FC (2048, 256), LeakyReLU |
| 2 | FC (256, 2), LeakyReLU |
| Class Identifier | |
| 1 | FC (2048, 10), BN, Softmax |
| Reconstructor | |
| 1 | FC (4096, 8192) |
| Mutual Information Estimator | |
| fc1_x | FC (2048, 512) |
| fc1_y | FC (2048, 512), LeakyReLU |
| 2 | FC (512, 1) |

Table 5: Model architecture for Digit-Five. For each convolution layer, we list the input dimension, output dimension, kernel size, stride, and padding. For each fully-connected layer, we provide the input and output dimensions. For drop-out layers, we provide the probability of an element being zeroed.
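The layer sizes in Table 5 can be cross-checked with a little arithmetic. The sketch below assumes a 32×32 input image and max-pooling that halves each spatial dimension; neither is stated in the table, so both are labeled assumptions. Under them, the generator output flattens to the 8192 inputs expected by the first fully-connected layer.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def generator_feature_dim(input_size=32, pool_factor=2):
    """Flattened feature size after the three conv layers of Table 5.

    input_size=32 and pool_factor=2 are assumptions, not values from the table.
    Each tuple: (in_ch, out_ch, kernel, stride, padding, followed_by_maxpool).
    """
    convs = [
        (3, 64, 5, 1, 2, True),
        (64, 64, 5, 1, 2, True),
        (64, 128, 5, 1, 2, False),
    ]
    size = input_size
    channels = None
    for _in_ch, out_ch, kernel, stride, padding, pooled in convs:
        size = conv_out(size, kernel, stride, padding)
        if pooled:
            size //= pool_factor
        channels = out_ch
    return channels * size * size


print(generator_feature_dim())  # 128 channels * 8 * 8 under the stated assumptions
```

The same bookkeeping explains the FC (4096, 8192) layer: 4096 matches two concatenated 2048-dimensional features, and 8192 matches the flattened generator output.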

Appendix B Details of datasets

We provide the detailed information of datasets (Table 6). For Digit-Five and the DomainNet dataset, we provide the train/test split for each domain and for Office-Caltech10, we provide the number of images in each domain.

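Digit-Five

| Splits | mnist | mnist_m | svhn | syn | usps | Total |
|---|---|---|---|---|---|---|
| Train | 55,000 | 55,000 | 25,000 | 25,000 | 7,348 | 167,348 |
| Test | 10,000 | 10,000 | 14,549 | 9,000 | 1,860 | 37,309 |

Office-Caltech10

| Splits | amazon | caltech | dslr | webcam | Total |
|---|---|---|---|---|---|
| Total | 958 | 1,123 | 157 | 295 | 2,533 |

DomainNet

| Splits | clp | inf | pnt | qdr | rel | skt | Total |
|---|---|---|---|---|---|---|---|
| Train | 34,019 | 37,087 | 52,867 | 120,750 | 122,563 | 49,115 | 416,401 |
| Test | 14,818 | 16,114 | 22,892 | 51,750 | 52,764 | 21,271 | 179,609 |

Table 6: Detailed information for datasets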


| layer | configuration |
|---|---|
| Feature Generator: ResNet101 or AlexNet | |
| Disentangler | |
| 1 | Dropout (0.5), FC (2048, 2048), BN, ReLU |
| 2 | Dropout (0.5), FC (2048, 2048), BN, ReLU |
| Domain Identifier | |
| 1 | FC (2048, 256), LeakyReLU |
| 2 | FC (256, 2), LeakyReLU |
| Class Identifier | |
| 1 | FC (2048, 10), BN, Softmax |
| Reconstructor | |
| 1 | FC (4096, 2048) |
| Mutual Information Estimator | |
| fc1_x | FC (2048, 512) |
| fc1_y | FC (2048, 512), LeakyReLU |
| 2 | FC (512, 1) |

Table 7: Model architecture for Office-Caltech10 and DomainNet. For each convolution layer, we list the input dimension, output dimension, kernel size, stride, and padding. For each fully-connected layer, we provide the input and output dimensions. For drop-out layers, we provide the probability of an element being zeroed.
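The Mutual Information Estimator rows describe a network with two input branches and a single scalar output per sample pair. In MINE (Belghazi et al., 2018), such scalar scores feed the Donsker-Varadhan lower bound: the mean score on jointly drawn pairs minus the log of the mean exponentiated score on independently paired samples. The sketch below computes that generic bound from already computed scores. It is not taken from the authors' code, it may differ in detail from the paper's Equation 7, and the function name is hypothetical.

```python
import math


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: mean(T on joint pairs) - log(mean(exp(T on shuffled pairs))).

    Both arguments are lists of scalar outputs of a statistics network T
    (here assumed to be the FC (512, 1) head listed in the table).
    """
    if not joint_scores or not marginal_scores:
        raise ValueError("both score lists must be non-empty")
    joint_term = sum(joint_scores) / len(joint_scores)
    # Log-mean-exp, shifted by the maximum for numerical stability.
    shift = max(marginal_scores)
    mean_exp = sum(math.exp(s - shift) for s in marginal_scores) / len(marginal_scores)
    return joint_term - (shift + math.log(mean_exp))
```

When the network scores joint and shuffled pairs identically, the bound is 0, which is the value a disentanglement objective that minimizes mutual information would push towards.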