Adversarial Variational Domain Adaptation

Manuel Pérez-Carrasco
Dept. of Computer Science
University of Concepción
maperezc@udec.cl
&Guillermo Cabrera-Vives
Dept. of Computer Science
University of Concepción
guillecabrera@inf.udec.cl
&Pavlos Protopapas
IACS
Harvard University
pavlos@seas.harvard.edu
&Nicolás Astorga
Dept. of Electrical Engineering
University of Chile
nicolas.astorga.r@ing.uchile.cl
&Marouan Belhaj
IACS
Harvard University
blhmarouan@gmail.com
Abstract

In this work we address the problem of transferring knowledge obtained from a vast annotated source domain to a sparsely labeled or unlabeled target domain. We propose Adversarial Variational Domain Adaptation (AVDA), a semi-supervised domain adaptation method based on deep variational embedded representations. We use approximate inference and adversarial methods to map samples from source and target domains into an aligned semantic embedding. We show that, in a semi-supervised few-shot scenario, our approach obtains a significant speed-up in performance as an increasing number of labels on the target domain is used.

1 Introduction

Deep neural networks have become the state of the art for many machine learning problems. However, these methods usually require a large amount of labeled data in order to avoid overfitting and to be able to generalize. Furthermore, it is assumed that training and test data come from the same distribution and feature space. This becomes a major problem when labeling is costly and/or time-consuming. One way to address this challenge is to use a source domain that contains a vast amount of annotated data and to reduce the domain shift between this domain and a different, but similar, target domain in which we have few or even no annotations.

Domain adaptation (DA) methods aim at reducing the domain shift between datasets pan_2010 , allowing a model trained on the source domain to perform similarly on the target domain by finding a common shared space between them. Deep DA uses deep neural networks to achieve this task. Previous works in deep DA have addressed the problem of domain shift by using statistical measures long_2017 ; long_2015 ; tzeng_2014 ; yan_2017 ; Zhuang_2015 ; sun_2016 ; peng_2017 or by introducing class-based loss functions tzeng_2015 ; gebru_2017 ; motiian_2017a ; motiian_2017b in order to diminish the distance between domain distributions. Since the appearance of Generative Adversarial Networks goodfellow_2014 , new approaches have been developed that focus on adversarial domain adaptation (ADA) techniques ganin_2015 ; ganin_2014 . The goal of adversarial domain adaptation tzeng_2017 is to learn, from the source data distribution, a model that predicts on the target distribution by finding a common representation of the data through an adversarial objective with respect to a domain discriminator. This way, a domain-invariant feature space can be used to solve a classification task on both the source and the target.

Despite ADA methods being good at aligning distributions even in an unsupervised domain adaptation (UDA) scenario (i.e., with no labels from the target), they struggle with some domain adaptation challenges. First, since most of these methods were designed to tackle UDA problems, they usually fail when there is a significant covariate shift between domains zou_2019 . Second, these methods are not able to take advantage of the semi-supervised scenario to produce more accurate models when a small number of labels is available from the target, generating poor decision boundaries near annotated target data saito_2019 . This behavior has been studied in different works, which try to adapt domain-invariant features from different classes independently saito_2018 ; kang_2019 .

In this work, we propose Adversarial Variational Domain Adaptation (AVDA), a domain adaptation model that works in unsupervised and semi-supervised scenarios, exploiting target labels when they are available by using variational deep embedding (VaDE jiang_2016 ) and adversarial methods goodfellow_2014 . The idea behind AVDA is to correct the domain shift of each class independently by using an embedded space composed of a mixture of Gaussians, in which each class corresponds to a Gaussian mixture component.

The performance of AVDA was validated on benchmark digit recognition tasks using the MNIST lecunn_1998 , USPS usps_1988 , and SVHN svhn_2011 datasets, and on a real case consisting of galaxy images, using the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey (CANDELS, grogin011 ) as source and the Cluster Lensing and Supernova Survey with Hubble (CLASH, Postman_2012 ) as target. We demonstrate results competitive with other state-of-the-art methods for the digits task, and then show the potential of our model, obtaining a speed-up in performance when a few target labels are available, even in a high domain shift scenario.

2 Related Work

Due to the capability of deep networks to learn transferable bengio_2012 ; yosinski_2014 ; oquab_2014 and invariant goodfellow_2009 representations of the data, the idea of transferring knowledge acquired from a vast labeled source to increase the performance on a target domain has become a wide area of research Tan_2018 . Domain adaptation methods deal with this challenge by reducing the domain shift between source and target domains pan_2010 , aligning a common internal representation for them.

Some statistical metrics have been proposed in order to align source and target distributions, such as maximum mean discrepancy (MMD) long_2017 ; long_2015 ; tzeng_2014 ; yan_2017 , Kullback-Leibler (KL) divergence Zhuang_2015 , or correlation alignment (CORAL) sun_2016 ; peng_2017 . Since the appearance of Generative Adversarial Networks goodfellow_2014 , significant work has been developed around adversarial domain adaptation (ADA) techniques ganin_2015 . The idea of ADA methods is to use a domain classifier that discriminates whether a sample belongs to the source or the target domain, while a generator learns how to create indistinguishable representations of the data in order to fool the domain classifier. By doing this, a domain-invariant representation of the data distribution is produced in a latent space.

Despite ADA models achieving good results either by matching distributions in a feature representation (i.e., feature-level) ganin_2014 ; ming_2017 ; long_2018 ; shu_2018 ; russo_2017 or by generating target images that look as if they were part of the source dataset (i.e., pixel-level) isola_2016 ; zhu_2017 ; hoffman_2017 ; hu_2018 ; hosseini_2019 , when they are used in a UDA scenario they have difficulties dealing with large covariate shifts between domains zou_2019 . Furthermore, when a small number of annotated target samples is included, these models often do not improve performance relative to simply training with the labeled target samples saito_2019 . In order to deal with few labels, few-shot domain adaptation methods have been proposed motiian_2017a ; motiian_2017b , but they are not meant to work with unlabeled data, often producing overfitted representations and struggling to generalize on the target domain wang_2018 .

Semi-supervised domain adaptation (SSDA) deals with these challenges by using both labeled and unlabeled samples during training gong_2012 ; gopalan_2011 ; glorot_2011 ; santos_2017 ; belhaj_2018 ; saito_2018 . Usually, in SSDA we are interested in finding a space in which labeled and unlabeled target samples belonging to the same class have a similar internal representation donahue_2013 ; yao_2015 ; zou_2019 ; saito_2019 . A promising approach to deal with labeled and unlabeled samples during training is semi-supervised variational autoencoders kingma_2014 ; rasmus_2015 ; maloe_2016 . These models seek to learn a latent space which depends on the labeled data. As the latent space is shared between labeled and unlabeled data, points from the same class will be closer in the latent space. This latent space can be extended to an embedding in which each class is represented by an embedding component. Our proposed AVDA framework uses a variational deep embedding jiang_2016 representation, in which source and target samples that belong to the same class are mapped into the same embedding component, allowing the model to obtain a significant speed-up in performance as more labels are used from the target domain.

3 Adversarial Variational Domain Adaptation

In this work we propose Adversarial Variational Domain Adaptation (AVDA), a model based on semi-supervised variational deep embedding and adversarial methods. We use a Gaussian mixture model as a prior for the embedded space and align samples from source and target domains that belong to the same class into the same Gaussian component.

3.1 Problem Definition

In a semi-supervised domain adaptation scenario, we are given a source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ with $N_s$ labeled samples and a target domain $\mathcal{D}_t^{l} = \{(x_i^t, y_i^t)\}_{i=1}^{N_t}$ with $N_t$ labeled samples. Also, for the target domain we have a subset of $N_u$ unlabeled samples $\mathcal{D}_t^{u} = \{x_i^u\}_{i=1}^{N_u}$. Both domains share the same $K$ classes, i.e., $\mathcal{Y}_s = \mathcal{Y}_t = \{1, \dots, K\}$. Source and target data are drawn from unknown joint distributions $p_s(x, y)$ and $p_t(x, y)$ respectively, where $p_s \neq p_t$.

The goal of this work is to build a model that provides an embedding space in which source and target data have the same representation for each of the $K$ classes. We propose the use of a Semi-supervised Variational Deep Embedding [20]. This model is composed of the inference models $q_{\phi_s}(z \mid x^s)$ and $q_{\phi_t}(z \mid x^t)$, which encode source and target data into a latent representation that we set to be a mixture of Gaussian distributions depending on the labels, and which are parametrized by $\phi_s$ and $\phi_t$ for source and target respectively. Also, the generative model $p_\theta(x \mid z)$ describes the data as if they were generated from a latent variable $z$ and is parametrized by $\theta$. A discriminative process is included to enforce the separability between the Gaussian mixture components. The overall model is displayed in Figure 1.

Figure 1: Overall architecture for Adversarial Variational Domain Adaptation. The model works in three steps. First, source data is encoded into an embedded space using the inference model $q_{\phi_s}$ with parameters $\phi_s$ and decoded using the generative model $p_\theta$ with parameters $\theta$. Each data point is mapped to the embedded component associated with its class. Second, a classifier $D_\psi$ with parameters $\psi$ is trained to discriminate whether a sample comes from the source domain (and its class) or from the target domain. Third, the target inference model $q_{\phi_t}$ is trained in an adversarial fashion using target data and $D_\psi$, generating an aligned embedding representation for source and target domains.
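To make the three-component architecture concrete, the sketch below shows one possible PyTorch layout for the source and target encoders, the shared decoder, the Gaussian-mixture prior, and the (K+1)-way discriminator. The layer sizes, the MLP structure, and the module names are illustrative assumptions rather than the exact networks used in the paper; the later sketches in this section reuse these names.

```python
import torch
import torch.nn as nn

K, Z_DIM, X_DIM, H = 10, 20, 784, 500  # illustrative sizes for the digits experiments

class Encoder(nn.Module):
    """Maps an input x to the mean and log-variance of q(z|x)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(X_DIM, H), nn.ReLU(),
                                  nn.Linear(H, H), nn.ReLU())
        self.mu = nn.Linear(H, Z_DIM)
        self.logvar = nn.Linear(H, Z_DIM)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Generative model p_theta(x|z), trained on the source domain."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(Z_DIM, H), nn.ReLU(),
                                 nn.Linear(H, X_DIM), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)

class GMMPrior(nn.Module):
    """One Gaussian component per class; means and variances learned by backprop."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Parameter(torch.eye(K, Z_DIM))        # means initially along different axes
        self.logvar = nn.Parameter(torch.zeros(K, Z_DIM))  # unit variance at initialization

class Discriminator(nn.Module):
    """(K+1)-way classifier over embeddings: K source classes plus one 'target' class."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(Z_DIM, H), nn.ReLU(),
                                 nn.Linear(H, K + 1))

    def forward(self, z):
        return self.net(z)

source_enc, target_enc = Encoder(), Encoder()
decoder, prior, disc = Decoder(), GMMPrior(), Discriminator()
```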

3.2 Adversarial Variational Domain Adaptation Model

For the source domain, we define a generative process as follows:

$y \sim \mathrm{Cat}(\pi), \qquad z \sim \mathcal{N}\big(\mu_y, \sigma_y^{2} I\big), \qquad x \sim p_\theta(x \mid z) \qquad (1)$

where $\mathrm{Cat}(\pi)$ is a multinomial distribution parametrized by $\pi$, with $\pi_k$ the prior probability for class $k$, $\pi_k \geq 0$ and $\sum_{k=1}^{K} \pi_k = 1$. At the same time, $\mu_y$ and $\sigma_y^{2}$ are the mean and variance of the embedded normal distribution corresponding to class label $y$. $p_\theta(x \mid z)$ is a likelihood function whose parameters are formed by non-linear transformations of the variable $z$ using a neural network with parameters $\theta$. In this work, we use deep neural networks as non-linear function approximators.
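As a sanity check of the generative story in Equation 1, the sketch below performs ancestral sampling: a class from the categorical prior, a latent code from the corresponding Gaussian component, and an image from the decoder. It reuses the hypothetical `K`, `prior`, and `decoder` objects from the architecture sketch above, with a uniform class prior assumed by default.

```python
import torch

def sample_from_generative_model(prior, decoder, pi=None):
    """Ancestral sampling: y ~ Cat(pi), z ~ N(mu_y, sigma_y^2 I), x ~ p_theta(x|z)."""
    pi = torch.full((K,), 1.0 / K) if pi is None else pi        # uniform class prior by default
    y = torch.distributions.Categorical(probs=pi).sample()      # class index
    mu_y, std_y = prior.mu[y], (0.5 * prior.logvar[y]).exp()
    z = mu_y + std_y * torch.randn_like(std_y)                  # z ~ N(mu_y, sigma_y^2 I)
    x = decoder(z)                                              # mean of p_theta(x|z), in [0, 1]
    return y, z, x
```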

For the source and target domains, we define two inference models. We use variational inference to find an approximation to the true posterior distribution using the approximate posteriors $q_{\phi_s}(z, y \mid x^s)$ and $q_{\phi_t}(z, y \mid x^t)$, which are parametrized by deep neural networks with parameters $\phi_s$ for the source domain and $\phi_t$ for the target domain. We assume the approximate posterior can be factorized as $q_\phi(z, y \mid x) = q_\phi(z \mid x)\, q(y \mid z)$, and model it using normal and categorical distributions as follows:

$q_{\phi_s}(z, y \mid x^s) = \mathcal{N}\big(z;\, \tilde{\mu}_s, \tilde{\sigma}_s^{2} I\big)\, \mathrm{Cat}\big(y;\, \pi_s(z)\big) \qquad (2)$
$q_{\phi_t}(z, y \mid x^t) = \mathcal{N}\big(z;\, \tilde{\mu}_t, \tilde{\sigma}_t^{2} I\big)\, \mathrm{Cat}\big(y;\, \pi_t(z)\big) \qquad (3)$

where $\tilde{\mu}_s, \tilde{\sigma}_s^{2}$ and $\tilde{\mu}_t, \tilde{\sigma}_t^{2}$ are the outputs of the source and target deep neural networks with parameters $\phi_s$ and $\phi_t$ respectively, and are then used to sample from a Gaussian mixture distribution using the reparametrization trick defined in kingma_2013 . $\pi_s(z)$ and $\pi_t(z)$ represent the source and target categorical processes modeled through independent neural networks. These networks take the latent variables $z$ and return the parameters $\pi_s(z)$ and $\pi_t(z)$ used to sample a categorical variable from a Gumbel-Softmax distribution jang_2016 . With this estimator, we can generate labels and backpropagate through the sampled categorical variable by using the continuous relaxation defined by Jang et al. jang_2016 , avoiding the marginalization over all the categorical labels introduced in kingma_2014 ; jiang_2016 , significantly reducing computational costs.
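A minimal sketch of this inference path, assuming the encoders from the architecture sketch and a small hypothetical network `cat_net` for q(y|z): the latent code is drawn with the usual Gaussian reparametrization, and the class variable is drawn with PyTorch's Gumbel-Softmax relaxation so that gradients can flow through the sampled label.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

cat_net = nn.Linear(Z_DIM, K)   # hypothetical q(y|z) network producing class logits

def infer(encoder, x, tau=1.0):
    mu, logvar = encoder(x)
    std = (0.5 * logvar).exp()
    z = mu + std * torch.randn_like(std)                     # reparametrized z ~ q(z|x)
    logits = cat_net(z)
    y_soft = F.gumbel_softmax(logits, tau=tau, hard=False)   # differentiable sample of q(y|z)
    return z, mu, logvar, logits, y_soft
```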

3.3 Variational Objectives

The supervised variational lower bound on the marginal likelihood for a single source data point can be derived similarly to a deep generative model kingma_2013 ; kingma_2014 as follows:

$\log p_\theta(x^s) \;\geq\; \mathbb{E}_{q_{\phi_s}(z \mid x^s)}\big[\log p_\theta(x^s \mid z)\big] - D_{KL}\big(q_{\phi_s}(z \mid x^s)\,\|\,p(z)\big) \qquad (4)$

We can notice that the observed label $y^s$ never appears in this equation. However, the supervision comes in when computing the Kullback-Leibler divergence between $q_{\phi_s}(z \mid x^s)$ and the prior $p(z)$: we force $p(z)$ to be equal to the Gaussian component belonging to the observed class $y^s$, with distribution $\mathcal{N}(\mu_{y^s}, \sigma_{y^s}^{2} I)$. At the same time, a predictive function $q(y \mid z)$ is included in order to enforce the separability between the embedding components. The lower bound for the source domain can be optimized by minimizing the following objective:

$\mathcal{L}_s(x^s, y^s) = D_{KL}\big(q_{\phi_s}(z \mid x^s)\,\|\,p(z \mid y^s)\big) - \mathbb{E}_{q_{\phi_s}(z \mid x^s)}\big[\log p_\theta(x^s \mid z)\big] - \alpha_s\, \mathbb{E}_{q_{\phi_s}(z \mid x^s)}\big[\log q(y^s \mid z)\big] \qquad (5)$

where $\alpha_s$ is the hyper-parameter that controls the relative importance between the generative and discriminative processes of the model.
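Putting the three terms of Equation 5 together, a hedged sketch of the per-batch source loss could look as follows; it reuses the hypothetical modules and `cat_net` from the earlier sketches, assumes images scaled to [0, 1] so that a binary cross-entropy reconstruction term applies, and uses `alpha_s` as an assumed name for the trade-off hyper-parameter. The `kl_to_component` helper anticipates the closed-form KL made explicit in Equation 10 below.

```python
import torch
import torch.nn.functional as F

def kl_to_component(mu, logvar, mu_y, logvar_y):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(mu_y, diag(exp(logvar_y))) ), per sample."""
    return 0.5 * torch.sum(
        logvar_y - logvar + (logvar.exp() + (mu - mu_y) ** 2) / logvar_y.exp() - 1.0, dim=-1)

def source_loss(x, y, source_enc, decoder, prior, cat_net, alpha_s=1.0):
    mu, logvar = source_enc(x)
    z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)                      # reparametrized sample
    recon = F.binary_cross_entropy(decoder(z), x, reduction="none").sum(-1)   # -log p_theta(x|z)
    kl = kl_to_component(mu, logvar, prior.mu[y], prior.logvar[y])            # KL to the class component
    disc = F.cross_entropy(cat_net(z), y, reduction="none")                   # -log q(y|z)
    return (recon + kl + alpha_s * disc).mean()
```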

On the other hand, for the target domain, we would like the inference model to map the samples into the same embedding obtained by the source generative model. Taking this into account, the objective can be decomposed in two parts: one for labeled data and another for unlabeled data. For a single labeled target data point we can obtain the supervised objective as follows:

$\mathcal{L}_t^{l}(x^t, y^t) = D_{KL}\big(q_{\phi_t}(z \mid x^t)\,\|\,p^{*}(z \mid y^t)\big) - \alpha_t\, \mathbb{E}_{q_{\phi_t}(z \mid x^t)}\big[\log q(y^t \mid z)\big] \qquad (6)$

where $p^{*}(z \mid y)$ is the optimized prior distribution obtained from the source, and $\alpha_t$ is the hyper-parameter that controls the relative importance of the discriminative process in the model. For a single unlabeled target data point we can derive the unsupervised objective as follows:

$\mathcal{L}_t^{u}(x^u) = D_{KL}\big(q(y \mid z)\,\|\,p(y)\big) + \mathbb{E}_{q(y \mid z)}\big[D_{KL}\big(q_{\phi_t}(z \mid x^u)\,\|\,p^{*}(z \mid y)\big)\big] \qquad (7)$

The minimization of this term maps each unlabeled target sample into its corresponding embedding component using a predicted categorical variable.
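Under the same assumptions, the two target objectives of Equations 6 and 7 can be sketched as below: the labeled loss pulls the target embedding toward the (frozen) source prior component of the observed class, while the unlabeled loss weighs the same KL by the predicted class probabilities and adds a KL to a uniform class prior. The exact form of the unlabeled term is a reconstruction from the surrounding text, not the paper's verbatim objective; `kl_to_component` is the helper from the source-loss sketch.

```python
import torch
import torch.nn.functional as F

def target_labeled_loss(x, y, target_enc, prior, cat_net, alpha_t=1.0):
    mu, logvar = target_enc(x)
    z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
    kl = kl_to_component(mu, logvar, prior.mu[y], prior.logvar[y])   # KL to the optimized source prior
    disc = F.cross_entropy(cat_net(z), y, reduction="none")
    return (kl + alpha_t * disc).mean()

def target_unlabeled_loss(x, target_enc, prior, cat_net):
    mu, logvar = target_enc(x)
    z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
    q_y = F.softmax(cat_net(z), dim=-1)                              # q(y|z), shape (batch, K)
    # Expected KL to each class component, weighted by the predicted class probabilities.
    kl_all = torch.stack([kl_to_component(mu, logvar, prior.mu[k], prior.logvar[k])
                          for k in range(prior.mu.shape[0])], dim=-1)
    expected_kl = (q_y * kl_all).sum(-1)
    # KL between q(y|z) and a uniform class prior (a reconstruction assumption).
    log_K = torch.log(torch.tensor(float(q_y.shape[-1])))
    kl_cat = (q_y * (q_y.clamp_min(1e-8).log() + log_K)).sum(-1)
    return (kl_cat + expected_kl).mean()
```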

3.4 Adversarial Objective

We would like the aggregated posterior over the embedding to be the same for source and target domains (i.e., $q_{\phi_s}(z) = q_{\phi_t}(z)$). These distributions are embedded spaces which depend on the labels, hence we use the semi-supervised generative adversarial model proposed by Odena odena_2016 in order to encourage the alignment between these distributions. In particular, we use a discriminator $D$ with parameters $\psi$ which takes the form of a classifier distinguishing between $K+1$ classes, where the first $K$ classes correspond to the source classes and the $(K{+}1)$-th class indicates that the data was generated by the target inference model. By doing this, we encourage the discriminator to learn the underlying distribution of each class independently. This discriminator can be optimized using the following objective:

$\mathcal{L}_D = \mathbb{E}_{(x^s, y^s) \sim p_s}\big[\ell_{CE}\big(D_\psi(z^s),\, y^s\big)\big] + \mathbb{E}_{x^t \sim p_t}\big[\ell_{CE}\big(D_\psi(z^t),\, K{+}1\big)\big], \qquad z^s \sim q_{\phi_s}(z \mid x^s),\; z^t \sim q_{\phi_t}(z \mid x^t) \qquad (8)$

where $\ell_{CE}$ is the cross-entropy loss. On the other hand, we try to confuse $D_\psi$ via an adversarial loss which forces the inference model of the target to learn a mapping from the samples to their corresponding embedding component. We can optimize the parameters of the target inference model using the following objective:

$\mathcal{L}_{adv} = \mathbb{E}_{x^t \sim p_t}\big[\ell_{CE}\big(D_\psi(z^t),\, \hat{y}^{t}\big)\big], \qquad z^t \sim q_{\phi_t}(z \mid x^t) \qquad (9)$

where $\hat{y}^{t}$ are the predicted categorical variables sampled from the Gumbel-Softmax distribution for unlabeled target samples, and the real class for labeled target samples.
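As reconstructed here, the (K+1)-way discriminator objective of Equation 8 and the adversarial objective of Equation 9 reduce to two cross-entropy terms. The sketch below reuses the earlier hypothetical modules, reserving the extra class index for "generated by the target encoder"; using the posterior means rather than sampled codes is a simplification of this sketch.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(x_s, y_s, x_t, source_enc, target_enc, disc, num_classes=10):
    with torch.no_grad():                                  # the encoders are fixed in this step
        z_s = source_enc(x_s)[0]
        z_t = target_enc(x_t)[0]
    # Index num_classes is the extra "(K+1)-th" class marking target-generated embeddings.
    fake_label = torch.full((x_t.shape[0],), num_classes, dtype=torch.long)
    return F.cross_entropy(disc(z_s), y_s) + F.cross_entropy(disc(z_t), fake_label)

def adversarial_loss(x_t, y_t_or_pred, target_enc, disc):
    """Fool the discriminator: target embeddings should be classified as their (predicted or true) class."""
    z_t = target_enc(x_t)[0]
    return F.cross_entropy(disc(z_t), y_t_or_pred)
```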

3.5 Overall Objectives and Optimization

In this section we describe the overall objective of the model and how this objective can be optimized. The training process is done in three steps: train the source model, train the discriminator, and train the target model.

Source Optimization: The first step consists in optimizing the source model. By doing this, we can obtain an embedding space which will be used later to map target samples into the same embedding components. The overall objective for the source domain is defined in Equation 5 and it is composed of three terms. The first term can be computed analytically by following the proof of jiang_2016 as follows:

$D_{KL}\big(q_{\phi_s}(z \mid x^s)\,\|\,p(z \mid y)\big) = \frac{1}{2}\sum_{j=1}^{J}\left(\log\frac{\sigma_{y,j}^{2}}{\tilde{\sigma}_{j}^{2}} + \frac{\tilde{\sigma}_{j}^{2}}{\sigma_{y,j}^{2}} + \frac{(\tilde{\mu}_{j} - \mu_{y,j})^{2}}{\sigma_{y,j}^{2}} - 1\right) \qquad (10)$

where $\tilde{\mu}$ and $\tilde{\sigma}^{2}$ are the approximate mean and variance given by the neural network, $\mu_y$ and $\sigma_y^{2}$ are the mean and variance of each embedding component, and $J$ is the dimensionality of $z$ and of the embedding components. The second term can be optimized by computing the expectation of gradients using the reparametrization trick defined in kingma_2013 ; rezende_2014 as follows:

$\nabla_{\theta, \phi_s}\, \mathbb{E}_{q_{\phi_s}(z \mid x^s)}\big[\log p_\theta(x^s \mid z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_{\theta, \phi_s} \log p_\theta\big(x^s \mid \tilde{\mu} + \tilde{\sigma} \odot \epsilon\big)\big] \qquad (11)$

where $\nabla_{\theta, \phi_s}$ denotes simple gradients over the parameters $\theta$ and $\phi_s$, and $\odot$ is the element-wise product. The third, discriminative term can be trivially optimized by minimizing the cross-entropy loss between the real labels and the labels estimated by the predictive function.
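As a quick numerical check of the closed-form expression in Equation 10, the snippet below compares it against PyTorch's built-in Gaussian KL. In practice, the reparametrized gradients of Equation 11 come for free from autograd by calling backward() on a loss built from a sample mu + sigma * eps, so no explicit estimator needs to be coded.

```python
import torch
from torch.distributions import Normal, kl_divergence

J = 20
mu, logvar = torch.randn(J), torch.randn(J)        # approximate posterior parameters
mu_y, logvar_y = torch.randn(J), torch.randn(J)    # parameters of one embedding component

# Closed form of Equation 10 for diagonal Gaussians.
closed = 0.5 * torch.sum(logvar_y - logvar
                         + (logvar.exp() + (mu - mu_y) ** 2) / logvar_y.exp() - 1.0)

# Reference value from torch.distributions (sum of per-dimension KLs).
reference = kl_divergence(Normal(mu, (0.5 * logvar).exp()),
                          Normal(mu_y, (0.5 * logvar_y).exp())).sum()

assert torch.allclose(closed, reference, atol=1e-5)
```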

Discriminative Step: The discriminative step is done by minimizing Equation 8 with respect to the parameters $\psi$ of the discriminator. The goal of this step is to encourage the discriminator to learn the embedding representation generated using the source domain. Then, we optimize the target inference model in order to fool the discriminator by mapping samples into the same embedding. For this purpose, the discriminative step and the target step introduced below are performed alternately.

Target Step: The overall objective for the target domain can be written as follows:

$\mathcal{L}_t = \mathcal{L}_t^{l} + \mathcal{L}_t^{u} + \mathcal{L}_{adv} \qquad (12)$

In this equation, the KL term of $\mathcal{L}_t^{l}$ can be optimized using Equation 10. $\mathcal{L}_t^{u}$ can be decomposed into two terms as in Equation 7. Following the derivation of belhaj_2018 , we can compute the derivatives of the second term using the expectation of gradients and the reparametrization trick defined in jang_2016 to derive a Monte Carlo estimator as follows:

$\nabla_{\phi_t}\, \mathbb{E}_{q(y \mid z)}\big[D_{KL}\big(q_{\phi_t}(z \mid x^u)\,\|\,p^{*}(z \mid y)\big)\big] \approx \nabla_{\phi_t}\, D_{KL}\big(q_{\phi_t}(z \mid x^u)\,\|\,p^{*}(z \mid \hat{y})\big) \qquad (13)$

where $\hat{y}$ is a predicted categorical variable sampled using the Gumbel-Softmax relaxation, defined as:

$\hat{y}_k = \frac{\exp\big((\log \pi_k(z) + g_k)/\tau\big)}{\sum_{j=1}^{K}\exp\big((\log \pi_j(z) + g_j)/\tau\big)}, \qquad k = 1, \dots, K \qquad (14)$

where $\pi_k(z)$ are the parameters output by the neural network $q(y \mid z)$, $g_k$ is a random variable sampled from a $\mathrm{Gumbel}(0, 1)$ distribution, and $\tau$ is a hyperparameter that regulates the entropy of the sampling. This reparametrization trick allows us to discretize $\hat{y}$ during the forward pass, while using a continuous approximation in the backward pass. The KL divergence of Equation 13 is similar to the one introduced for source optimization, hence we can use the analytical solution introduced in Equation 10. The derivatives of the second term of Equation 12 can be computed as:

(15)

The third term of Equation 12 can be optimized by minimizing Equation 9 with respect to the parameters $\phi_t$.
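The alternation between the discriminative step and the target step described above can be organized as in the sketch below. The optimizers, data loaders, and single-update-per-step schedule are assumptions of this sketch, and the loss helpers and modules refer to the hypothetical components from the earlier sketches; only the parameters being stepped in each phase are actually updated.

```python
import torch.nn.functional as F
import torch.optim as optim

# source_enc, target_enc, prior, cat_net, disc and the loss helpers come from earlier sketches;
# the three loaders are assumed to yield batches of image tensors (and integer labels where noted).
opt_disc = optim.Adam(disc.parameters(), lr=1e-3)                 # placeholder learning rates
opt_tgt = optim.Adam(target_enc.parameters(), lr=1e-3)

for (x_s, y_s), (x_t, y_t), x_u in zip(src_loader, tgt_labeled_loader, tgt_unlabeled_loader):
    # Discriminative step: update only the discriminator (Equation 8).
    opt_disc.zero_grad()
    discriminator_loss(x_s, y_s, x_u, source_enc, target_enc, disc).backward()
    opt_disc.step()

    # Target step: update only the target inference model (Equation 12).
    opt_tgt.zero_grad()
    y_hat = F.gumbel_softmax(cat_net(target_enc(x_u)[0]), hard=True).argmax(-1)  # predicted classes
    loss_t = (target_labeled_loss(x_t, y_t, target_enc, prior, cat_net)
              + target_unlabeled_loss(x_u, target_enc, prior, cat_net)
              + adversarial_loss(x_u, y_hat, target_enc, disc))
    loss_t.backward()
    opt_tgt.step()
```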

4 Experiments

We evaluate our framework on the digits dataset composed of MNIST lecunn_1998 , USPS usps_1988 , and Street View House Numbers (SVHN) svhn_2011 . We then apply it to a real case scenario using galaxy images from the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey (CANDELS, grogin011 ) as source and the Cluster Lensing and Supernova Survey with Hubble (CLASH, Postman_2012 ) as target.

4.1 Datasets

Digits: We use three benchmark digits datasets: MNIST (M), USPS (U), and Street View House Numbers (SVHN) (S). These datasets contain images of digits from 0 to 9 in different environments: M and U contain handwritten digits, while S contains natural scene images. Each dataset is composed of its own train and test set: 60,000 train and 10,000 test samples for MNIST; 7,291 train and 2,007 test samples for USPS; and 73,257 train and 26,032 test samples for SVHN. For adaptation purposes we used the evaluation protocol proposed in CyCADA hoffman_2017 , i.e., three domain adaptation tasks are performed: from MNIST to USPS (M→U), from USPS to MNIST (U→M), and from SVHN to MNIST (S→M).

Galaxies: We use galaxy images from CANDELS and CLASH and address the problem of classifying them according to their morphologies: smooth, features, irregular, point source, and unclassifiable (data is publicly available at https://drive.google.com/open?id=1BSc42VfAb2Mw0zlQShTFUnbCQaf11q4q ). This is a challenging problem due to the high domain shift between source and target, which is given by changes in the filters in which the images were captured. Specifically, the CANDELS dataset contains images from the GOODS-S giavalisco_2003 field in the near-infrared F160W filter, with labels created by expert astronomers karteltepe . The CLASH dataset contains images from 25 galaxy clusters in 16 different filters ranging from the ultraviolet to the near-infrared. Labels for CLASH were also created by experts as described in perez2018 .

4.2 Implementation Details

For both tasks, images were rescaled and resized to a common input size. The SVHN dataset contains RGB images, so for S→M the MNIST images were replicated along each of the three channels in order to use the same input tensor size for the network. The hyperparameters were empirically selected by measuring the performance on 100 randomly selected samples from the training set for digits and 3 for the galaxies, performing 5 cross-validated experiments; the same hyperparameter values were used for all scenarios. Our embedding is created in a 20-dimensional space, where the means and standard deviations of each Gaussian component are learned via backpropagation. The means are initially set along different axes and the standard deviations are initialized to the all-ones vector. Training was performed using the Adam optimizer kingma_2015 with mini-batches of 128 samples. For a fair comparison, we used network architectures similar to the ones proposed in ming_2017 ; hu_2018 .
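The prior initialization and optimizer setup described above can be sketched as follows. The scale of the initial means, the Adam learning rate, and the betas are placeholders, since the exact values are not recoverable from this text.

```python
import torch
import torch.nn as nn
import torch.optim as optim

Z_DIM, K = 20, 10
# Means initially placed along different axes; standard deviations start at the all-ones vector.
mu_prior = nn.Parameter(torch.eye(K, Z_DIM))           # placeholder scale of 1 along each axis
logvar_prior = nn.Parameter(torch.zeros(K, Z_DIM))     # sigma_k = 1 at initialization

params = [mu_prior, logvar_prior]                      # plus encoder/decoder parameters in practice
optimizer = optim.Adam(params, lr=1e-3, betas=(0.9, 0.999))   # placeholder hyperparameters
batch_size = 128                                       # mini-batches of 128 samples, as in the text
```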

4.3 Results

1) Digits: We compare our results with state-of-the-art methods in UDA and SSDA scenarios. In the UDA scenario, we use a fully labeled source and a fully unlabeled target to perform the adaptation. In Table 1, we show our results and compare them to previous approaches as reported in their papers. We report the mean accuracy and its standard deviation across ten random experiments, as well as the accuracy obtained by our best run. Even though our method is not designed for an unsupervised scenario, it is competitive with other methods on M→U and U→M. Our method performs poorly on S→M, which presents a higher domain shift; there, ACAL performs the best by matching features at the pixel level, adding a relaxation of the cycle consistency and replacing it with a task-specific loss.

For the SSDA scenario, we perform experiments using one label per class on the target (denoted as 1-shot) and five labels per class on the target (denoted as 5-shot). The rest of the training samples are used in an unsupervised fashion. We compare our results against other methods that utilize the same number of labels per class. Notice that CCSA motiian_2017b and FADA motiian_2017a do not utilize unlabeled samples during training while we do. The results are computed using the same procedure as before and are displayed in Table 2. Here, we outperform all previous approaches. Our approach has a higher speed of adaptation in the sense that by using a small number of labels in the target domain we are able to obtain competitive results.

Unsupervised
Method M→U U→M S→M
UNIT ming_2017
SBADA-GAN russo_2017
CyCADA hoffman_2017
CDAN long_2018
DupGAN hu_2018
ACAL hosseini_2019
CADA zou_2019
AVDA (ours) best
AVDA (ours) random
Table 1: Results on the Digits datasets for the unsupervised task. All results are reported as accuracy over 10 random experiments.
1-shot 5-shot
Method M→U U→M S→M M→U U→M S→M
CCSA motiian_2017b - -
FADA motiian_2017a
F-CADA zou_2019
AVDA (ours) best
AVDA (ours) random
Table 2: Results on the Digits datasets for the semi-supervised task. All results are reported as accuracy over 10 random experiments.

Ablation Study: We examined the performance of AVDA adopting three different training strategies, in which we change critical components of our framework. First, we investigate the use of fixed priors during training (i.e., they are not updated via backpropagation); we denote this experiment as AVDA-FP. Second, we investigate the model in a classical adversarial domain adaptation scenario (i.e., the discriminator only tries to distinguish between samples generated by the source or the target); we denote this experiment as AVDA-ADA. Third, we investigate the model when a target generative model is included; we denote this experiment as AVDA-GT. The experiments were performed on the most difficult digit scenario, S→M, using five labels per class. In the first experiment, AVDA-FP lowered the accuracy of our model. In the second experiment, AVDA-ADA increased the variance of the model performance. In the third experiment, AVDA-GT decreased the discriminative capability of our model. Consequently, the full AVDA configuration obtains the best performance compared to these three slightly different variants.

Visualization: In order to visualize the alignment between source and target domains, we visualize the embedding space by using t-distributed stochastic neighbor embedding (t-SNE, maaten_2008 ) for the S→M task considering a 5-shot scenario. Figure 2 shows this visualization. On the left, we show each class in a different color, demonstrating the classifying capability of the model. On the right, we show the source and target in different colors, demonstrating the ability of the model to generate good alignments between the data labels and the embedded components for both domains.
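A typical way to produce a plot like Figure 2 is to run t-SNE on the concatenated source and target embeddings and color the points twice, once by class and once by domain. The sketch below uses random placeholder arrays where the real embeddings and labels would go.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
z_src, y_src = rng.normal(size=(500, 20)), rng.integers(0, 10, 500)   # placeholders for real embeddings
z_tgt, y_tgt = rng.normal(size=(500, 20)), rng.integers(0, 10, 500)

z_all = np.concatenate([z_src, z_tgt])
y_all = np.concatenate([y_src, y_tgt])
domain = np.concatenate([np.zeros(len(z_src)), np.ones(len(z_tgt))])

z_2d = TSNE(n_components=2, perplexity=30).fit_transform(z_all)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(z_2d[:, 0], z_2d[:, 1], c=y_all, s=3, cmap="tab10")       # colored by class
ax2.scatter(z_2d[:, 0], z_2d[:, 1], c=domain, s=3, cmap="coolwarm")   # colored by domain
ax1.set_title("classes"); ax2.set_title("source vs. target")
plt.show()
```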

2) Galaxies: For the morphology classification task, we trained 6 different models using 0, 1, 5, 10, 25 and 50 labeled target samples per class. As in a classical semi-supervised setting, all unlabeled target images were used for training and evaluation. We show the results in Figure 3. We can notice that just a small number of labeled samples is enough to make important corrections to the domain shift, observing a significant speed-up in performance when labels are included.

Figure 2: Visualization of the embedding space using t-SNE for the S→M task considering a 5-shot scenario. Colors on the left panel represent the data labels, showing the capability of the model to generate good discriminative boundaries. Colors on the right panel represent source and target data, showing the capability of the model to align source and target domains into the same embedding components.
Figure 3: Performance in terms of accuracy on the morphology recognition task using 0, 1, 5, 10, 25, and 50 labeled target samples per class.

5 Conclusion

In this paper we present Adversarial Variational Domain Adaptation (AVDA), a semi-supervised approach for domain adaptation problems where a vast annotated source domain is available but few or no labels exist for the target domain. Unlike previous methods, which align source and target domains into a single common feature space, we use a variational embedding and align samples that belong to the same class into the same embedding component using adversarial methods. Experiments on digits and galaxy morphology classification problems are used to validate the proposed approach. Our model presents a significant speed-up in terms of the increase in accuracy as more labeled examples are used from the target domain, increasing the accuracy considerably with only one label per class for the morphology classification task. Though our framework does not show competitive results in the unsupervised scenario, we demonstrate the capability of the model to align embedding spaces even in high domain shift scenarios by outperforming state-of-the-art methods in the semi-supervised scenario.

References
