Generalizing to Unseen Domains via Adversarial Data Augmentation


Riccardo Volpi¹ (Istituto Italiano di Tecnologia), Hongseok Namkoong¹ (Stanford University), Ozan Sener (Intel Labs), John Duchi (Stanford University), Vittorio Murino (Istituto Italiano di Tecnologia; Università di Verona), Silvio Savarese (Stanford University)

¹Equal contribution. Work done while the author was a Visiting Student Researcher at Stanford University.
Abstract

We are concerned with learning models that generalize well to different unseen domains. We consider a worst-case formulation over data distributions that are near the source domain in the feature space. Using only training data from the source domain, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers (e.g., ridge or lasso) that regularize towards zero. On digit recognition and semantic segmentation tasks, we empirically observe that our method learns models that improve performance across a priori unknown data distributions.

1 Introduction

In many modern applications of machine learning, we wish to learn a system that can perform uniformly well across multiple populations. Due to high costs of data acquisition, however, it is often the case that datasets consist of a limited number of population sources. While performance evaluated on the validation dataset—usually from the same population as the training dataset—is a standard metric on which many systems are optimized, it has been observed that performance on populations different from that of the training data can be much worse Daume2006 ; Blitzer2006 ; BenDavid2006 ; Saenko2010 ; NameTheDataset . In this paper, we are concerned with generalizing to populations different from the training distribution, in settings where we have no access to any data from the unknown target distributions.

A number of authors have proposed domain adaptation methods (for example, see  Ganin ; ADDA ; DeepCORAL ; morerio2018 ; DIFA ) in settings where a fully labeled source dataset and an unlabeled (or partially labeled) set of examples from fixed target distributions are available. Although such algorithms can successfully learn models that perform well on known target distributions, the assumption of a priori fixed target distributions can be restrictive in practical scenarios. For example, consider a semantic segmentation algorithm used by a robot: every task, robot, environment and camera configuration will result in a different target distribution, and these diverse scenarios can be identified only after the model is trained and deployed, making it difficult to collect samples from them.

In this work, we develop methods that can learn to better generalize to new unknown domains. We consider the restrictive setting where training data only comes from a single source domain. For example, consider a module for self-driving cars that needs to generalize well across weather conditions unexplored during training. Inspired by recent developments in distributionally robust optimization and adversarial training Certifiable ; LeeRa17 ; Heinze-DemlMe17 , we consider the following worst-case problem around the (training) source distribution $P_0$

$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \sup_{P:\, D(P, P_0) \le \rho} \; \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big]. \tag{1}$$

Here, $\theta \in \Theta$ is the model, $(X, Y)$ is a source data point $X$ with its labeling $Y$, $\ell$ is the loss function, and $D$ is a distance metric on the space of probability distributions.

The solution to the worst-case problem (1) guarantees good performance against data distributions that are at most distance $\rho$ away from the source domain $P_0$. To allow data distributions that have different support to that of the source $P_0$, we use a Wasserstein distance as our metric $D$. Our distance will be defined on the semantic space¹, so that target populations $P$ satisfying $D(P, P_0) \le \rho$ will be realistic covariate shifts that preserve the same semantic representation of the source (e.g., adding color to a greyscale image). In this regard, we expect the solution to the worst-case problem (1)—the model that we wish to learn—to have favorable performance across covariate shifts in the semantic space.

¹By semantic space we mean learned representations, since recent works perceptual1 ; perceptual2 suggest that distances in the space of learned representations of high capacity models typically correspond to semantic distances in visual space.

We propose an iterative procedure that aims to solve the problem (1) for a small value of $\rho$ at a time, and performs stochastic gradient updates to the model with respect to these fictitious worst-case target distributions (Section 2). Each iteration of our method uses small values of $\rho$, and we provide a number of theoretical interpretations of our method. First, we show that our iterative algorithm is an adaptive data augmentation method where we add adversarially perturbed samples—at the current model—to the dataset (Section 3). More precisely, our adversarially generated samples roughly correspond to Tikhonov regularized Newton-steps Levenberg44 ; Marquardt63 on the loss in the semantic space. Further, we show that for softmax losses, each iteration of our method can be thought of as a data-dependent regularization scheme where we regularize towards the parameter vector corresponding to the true label, instead of regularizing towards zero like classical regularizers such as ridge or lasso.

From a practical viewpoint, a key difficulty in applying the worst-case formulation (1) is that the magnitude $\rho$ of the covariate shift is a priori unknown. We propose to learn an ensemble of models that correspond to different distances $\rho$. In other words, our iterative method generates a collection of datasets, each corresponding to a different inter-dataset distance level $\rho$, and we learn a model for each of them. At test time, we use a heuristic method to choose an appropriate model from the ensemble.

We test our approach on a simple digit recognition task and on a more realistic semantic segmentation task across different seasons and weather conditions. In both settings, we observe that our method allows us to learn models that improve performance across a priori unknown target distributions at varying distances from the original source domain.

Related work

The literature on adversarial training FastGradientMethod ; Certifiable ; LeeRa17 ; Heinze-DemlMe17 is closely related to our work, since its main goal is to devise training procedures that learn models robust to fluctuations in the input. Departing from the imperceptible attacks considered in adversarial training, we aim to learn models that are resistant to larger perturbations, namely out-of-distribution samples. Sinha et al. Certifiable propose a principled adversarial training procedure, where new images that maximize some risk are generated and the model parameters are optimized with respect to those adversarial images. Being devised as a defense against imperceptible adversarial attacks, the new images are learned with a loss that penalizes differences between the original images and the new ones. In this work, we rely on a minimax game similar to the one proposed by Sinha et al. Certifiable , but we impose the constraint in the semantic space, in order to allow our adversarial samples from a fictitious distribution to differ at the pixel level while sharing the same semantics.

There is a substantial body of work on domain adaptation Daume2006 ; Blitzer2006 ; Saenko2010 ; Ganin ; ADDA ; DeepCORAL ; morerio2018 ; DIFA , which aims to better generalize to a priori fixed target domains whose labels are unknown at training time. This setup is different from ours in that these algorithms require access to samples from the target distribution during training. Domain generalization methods DG0 ; DG1 ; DG2 ; DG3 ; Mancini2018 , which propose different ways to better generalize to unknown domains, are also related to our work. These algorithms require the training samples to be drawn from multiple domains (with access to the domain labels during training) rather than from a single source, a limitation that our method does not have. In this sense, one could interpret our problem setting as unsupervised domain generalization. Tobin et al. DomainRandomization propose domain randomization, which applies to simulated data and creates a variety of random renderings with the simulator, in the hope that the real world will be interpreted as one of them. Our goal is the same, since we aim at obtaining data distributions more similar to real-world ones, but we accomplish it by actually learning new data points, making our approach applicable to any data source, without the need for a simulator.

Hendrycks and Gimpel SoftmaxICLR2016 suggest that a good empirical way to detect whether a test sample is out-of-distribution for a given model is to evaluate the statistics of the softmax outputs. We adapt this idea to our setting, learning an ensemble of models trained with our method and choosing at test time the model with the greatest maximum softmax value.

2 Method

The worst-case formulation (1) over domains around the source hinges on the notion of distance $D$, which characterizes the set of unknown populations we wish to generalize to. Conventional notions of Wasserstein distance used for adversarial training Certifiable are defined with respect to the original input space $\mathcal{X}$, which for images corresponds to raw pixels. Since our goal is to consider fictitious target distributions corresponding to realistic covariate shifts, we define our distance on the semantic space. Before properly defining our setup, we first give a few notations. Letting $m$ denote the dimension of the output of the last hidden layer, we write $\theta = (\theta_c, \theta_f)$, where $\theta_c \in \mathbb{R}^{m \times C}$ is the set of weights of the final layer, and $\theta_f$ is the rest of the weights of the network. We denote by $g(\theta_f; x)$ the output of the embedding layer of our neural network. For example, in the classification setting, $C$ is the number of classes and we consider the softmax loss

$$\ell(\theta; (x, y)) = \log \left( \sum_{k=1}^{C} \exp\big(a_k^\top g(\theta_f; x)\big) \right) - a_y^\top g(\theta_f; x), \tag{2}$$

where $a_k$ is the $k$-th column of the classification layer weights $\theta_c = [a_1, \dots, a_C]$.

Wasserstein distance on the semantic space

On the space $\mathcal{Z} \times \mathcal{Y}$, where $\mathcal{Z} := \mathbb{R}^m$ is the semantic (feature) space, consider the following transportation cost $c$—the cost of moving mass from $(z, y)$ to $(z', y')$:

$$c\big((z, y), (z', y')\big) := \frac{1}{2} \left\| z - z' \right\|_2^2 + \infty \cdot \mathbf{1}\{ y \neq y' \}.$$

The transportation cost takes value $\infty$ for data points with different labels, since we are only interested in perturbations to the marginal distribution of $X$. We now define our notion of distance on the semantic space. For inputs $x, x'$ coming from the original space $\mathcal{X}$, we consider the transportation cost $c_\theta$ defined with respect to the output of the last hidden layer

$$c_\theta\big((x, y), (x', y')\big) := c\big((g(\theta_f; x), y),\, (g(\theta_f; x'), y')\big),$$

so that $c_\theta$ measures distance with respect to the feature mapping $g(\theta_f; \cdot)$. For probability measures $P$ and $Q$ both supported on $\mathcal{X} \times \mathcal{Y}$, let $\Pi(P, Q)$ denote their couplings, meaning measures $M$ with $M(A, \mathcal{X} \times \mathcal{Y}) = P(A)$ and $M(\mathcal{X} \times \mathcal{Y}, A) = Q(A)$. Then, we define our notion of distance by

$$D_\theta(P, Q) := \inf_{M \in \Pi(P, Q)} \mathbb{E}_{M}\Big[ c_\theta\big((X, Y), (X', Y')\big) \Big]. \tag{3}$$
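To make this definition concrete, the following minimal PyTorch sketch—our own illustration, not code from the paper—computes the cost $c_\theta$ between two labeled inputs; the small feature extractor and the input shape are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical embedding network g(theta_f; x): any module mapping inputs to features.
feature_extractor = nn.Sequential(
    nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.ReLU(), nn.Linear(128, 64)
)

def semantic_cost(x, y, x_prime, y_prime, g):
    """Transportation cost c_theta: 1/2 ||g(x) - g(x')||^2, infinite if the labels differ."""
    if y != y_prime:
        return torch.tensor(float("inf"))
    z, z_prime = g(x), g(x_prime)
    return 0.5 * (z - z_prime).pow(2).sum()

x = torch.randn(1, 3, 32, 32)           # original image
x_adv = x + 0.1 * torch.randn_like(x)   # perturbed image
print(semantic_cost(x, 0, x_adv, 0, feature_extractor))  # finite: same label
print(semantic_cost(x, 0, x_adv, 1, feature_extractor))  # inf: different labels
```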

Armed with this notion of distance on the semantic space, we now consider a variant of the worst-case problem (1) where we replace the distance $D$ with $D_\theta$ (3), our adaptive notion of distance defined on the semantic space:

$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \sup_{P:\, D_\theta(P, P_0) \le \rho} \; \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big].$$

Computationally, the above supremum over probability distributions is intractable. Hence, we consider the following Lagrangian relaxation with penalty parameter $\gamma \ge 0$

$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \sup_{P} \Big\{ \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big] - \gamma\, D_\theta(P, P_0) \Big\}. \tag{4}$$

Taking the dual reformulation of the penalty relaxation (4), we can obtain an efficient solution procedure. The following result is a minor adaptation of (BlanchetMu16, Theorem 1); to ease notation, let us define the robust surrogate loss

$$\phi_\gamma\big(\theta; (x_0, y_0)\big) := \sup_{x \in \mathcal{X}} \Big\{ \ell\big(\theta; (x, y_0)\big) - \gamma\, c_\theta\big((x, y_0), (x_0, y_0)\big) \Big\}. \tag{5}$$

Lemma 1. Let $\ell(\theta; \cdot)$ be continuous. For any distribution $Q$ and any $\gamma \ge 0$, we have

$$\sup_{P} \Big\{ \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big] - \gamma\, D_\theta(P, Q) \Big\} = \mathbb{E}_{Q}\big[ \phi_\gamma(\theta; (X, Y)) \big]. \tag{6}$$

In order to solve the penalty problem (4), we can now perform stochastic gradient descent on the robust surrogate loss $\phi_\gamma$. Under suitable conditions BoydVa04 , we have

$$\nabla_\theta\, \phi_\gamma\big(\theta; (x_0, y_0)\big) = \nabla_\theta \Big[ \ell\big(\theta; (x^\star, y_0)\big) - \gamma\, c_\theta\big((x^\star, y_0), (x_0, y_0)\big) \Big], \tag{7}$$

where $x^\star = \arg\max_{x \in \mathcal{X}} \big\{ \ell(\theta; (x, y_0)) - \gamma\, c_\theta((x, y_0), (x_0, y_0)) \big\}$ is an adversarial perturbation of $x_0$ at the current model $\theta$. Hence, computing gradients of the robust surrogate requires solving the maximization problem (5). Below, we describe a (heuristic) iterative procedure that performs stochastic gradient steps on the robust surrogate $\phi_\gamma$.

Iterative Procedure

We propose an iterative training procedure that alternates two phases: a maximization phase, where new data points are learned by solving the inner maximization problem (5), and a minimization phase, where the model parameters are updated according to stochastic gradients of the loss evaluated on the adversarial examples generated in the maximization phase. The latter step is equivalent to stochastic gradient steps on the robust surrogate loss $\phi_\gamma$, which motivates its name. The main idea is to iteratively learn "hard" data points from fictitious target distributions, while preserving the semantic features of the original data points.

Concretely, in the $k$-th maximization phase, we compute adversarially perturbed samples at the current model $\theta$

$$x_i^{k} \in \underset{x \in \mathcal{X}}{\arg\max} \Big\{ \ell\big(\theta; (x, y_i)\big) - \gamma\, c_\theta\big((x, y_i), (x_i, y_i)\big) \Big\}, \tag{8}$$

where $(x_i, y_i)$ are the original samples from the source distribution $P_0$. The minimization phase then performs repeated stochastic gradient steps on the augmented dataset $\{(x_i, y_i)\} \cup \{(x_i^1, y_i)\} \cup \dots \cup \{(x_i^k, y_i)\}$. The maximization phase (8) can be computed efficiently for smooth losses $\ell$ whenever $\gamma$ is large enough that the objective in (8) is strongly concave in $x$ (Certifiable, Theorem 2); for example, this is provably true for any linear network. In practice, we use gradient ascent steps to solve for the worst-case examples (8); see Algorithm 1 for the full description of our algorithm.

1: Input: original dataset $\{(x_i, y_i)\}_{i=1}^{n}$ and initialized weights $\theta_0$; hyperparameters $K$, $T_{\max}$, $T_{\min}$, penalty $\gamma$, learning rates $\alpha$ and $\eta$
2: Initialize: $\theta \leftarrow \theta_0$, $\mathcal{D} \leftarrow \{(x_i, y_i)\}_{i=1}^{n}$
3: for $k = 1, \dots, K$ do   ▷ Run the minimax procedure $K$ times
4:     Sample a batch $\{(x_i, y_i)\}_{i=1}^{b}$ from the dataset $\mathcal{D}$
5:     for $i = 1, \dots, b$ do, with $x_i^{\mathrm{adv}}$ initialized to $x_i$
6:         for $t = 1, \dots, T_{\max}$ do   ▷ Run the maximizer for $T_{\max}$ iterations
7:             $x_i^{\mathrm{adv}} \leftarrow x_i^{\mathrm{adv}} + \alpha\, \nabla_{x} \big[ \ell\big(\theta; (x_i^{\mathrm{adv}}, y_i)\big) - \gamma\, c_\theta\big((x_i^{\mathrm{adv}}, y_i), (x_i, y_i)\big) \big]$
8:         Append $(x_i^{\mathrm{adv}}, y_i)$ to dataset $\mathcal{D}$
9:     for $t = 1, \dots, T_{\min}$ do   ▷ Run the minimizer for $T_{\min}$ iterations
10:        Sample $(x, y)$ uniformly from dataset $\mathcal{D}$
11:        $\theta \leftarrow \theta - \eta\, \nabla_{\theta}\, \ell\big(\theta; (x, y)\big)$
12: for $t = 1, \dots, T_{\min}$ do   ▷ Run the minimizer for $T_{\min}$ iterations
13:     Sample $(x, y)$ uniformly from dataset $\mathcal{D}$
14:     $\theta \leftarrow \theta - \eta\, \nabla_{\theta}\, \ell\big(\theta; (x, y)\big)$
Algorithm 1 Adversarial Data Augmentation
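A compact PyTorch sketch of this loop is given below. It is our own illustration rather than the authors' released code: the helper names (adversarial_augment, train), the split of the network into model_feat and model_cls, and the default iteration counts are assumptions; only the structure—gradient ascent in pixel space on the penalized objective of (8), followed by SGD on the growing augmented dataset—mirrors Algorithm 1.

```python
import torch
import torch.nn.functional as F

def adversarial_augment(model_feat, model_cls, x, y, gamma, alpha=1.0, t_max=15):
    """Maximization phase (Eq. 8): ascend on loss - gamma * semantic cost w.r.t. the input."""
    with torch.no_grad():
        z_orig = model_feat(x)                               # features of the original samples
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(t_max):
        z = model_feat(x_adv)
        loss = F.cross_entropy(model_cls(z), y)
        cost = 0.5 * (z - z_orig).pow(2).sum(dim=1).mean()   # c_theta in the semantic space
        grad, = torch.autograd.grad(loss - gamma * cost, x_adv)
        x_adv = (x_adv + alpha * grad).detach().requires_grad_(True)
    return x_adv.detach()

def train(model_feat, model_cls, loader, optimizer, gamma, k_runs=2, t_min=100):
    dataset = [(x, y) for x, y in loader]                    # accumulated (then augmented) batches
    for _ in range(k_runs):                                  # K minimax runs
        x, y = dataset[torch.randint(len(dataset), (1,)).item()]
        dataset.append((adversarial_augment(model_feat, model_cls, x, y, gamma), y))
        for _ in range(t_min):                               # minimization phase on augmented data
            x, y = dataset[torch.randint(len(dataset), (1,)).item()]
            optimizer.zero_grad()
            F.cross_entropy(model_cls(model_feat(x)), y).backward()
            optimizer.step()
```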

Ensembles for classification

The hyperparameter $\gamma$—which is inversely proportional to $\rho$, the distance between the fictitious target distribution and the source—controls the ability to generalize outside the source domain. Since target domains are unknown, it is difficult to choose an appropriate level of $\gamma$ a priori. We propose a heuristic ensemble approach in which we train $p$ models $\{\theta^1, \dots, \theta^p\}$. Each model is associated with a different value of $\gamma$, and thus with fictitious target distributions at varying distances from the source $P_0$. To select the best model at test time—inspired by Hendrycks and Gimpel SoftmaxICLR2016 —given a sample $x$, we select the model with the greatest maximum softmax score

$$u^\star(x) = \underset{u \in \{1, \dots, p\}}{\arg\max} \;\; \max_{\hat{y} \in \mathcal{Y}} \; p_{\theta^u}(\hat{y} \mid x), \tag{9}$$

where $p_{\theta^u}(\hat{y} \mid x)$ denotes the softmax probability that model $\theta^u$ assigns to class $\hat{y}$ for input $x$.
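In code, the selection rule (9) amounts to querying every member of the ensemble and keeping the most confident one. A minimal PyTorch sketch (ours; it assumes the ensemble is stored as a list of models returning class logits and that a single test example is given) is:

```python
import torch.nn.functional as F

def predict_with_ensemble(models, x):
    """Eq. (9): pick the model with the greatest maximum softmax score, use its prediction."""
    best_conf, best_pred = -1.0, None
    for model in models:                     # each model trained with a different gamma
        probs = F.softmax(model(x), dim=1)   # softmax over class logits
        conf, pred = probs.max(dim=1)        # maximum softmax score and predicted class
        if conf.item() > best_conf:
            best_conf, best_pred = conf.item(), pred
    return best_pred, best_conf
```

For batched inputs, the same rule is applied independently to each example.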

3 Theoretical Motivation

In our iterative algorithm (Algorithm 1), the maximization phase (8) is the key step that augments the dataset with adversarially perturbed data points; it is followed by standard stochastic gradient updates to the model parameters. In this section, we provide some theoretical understanding of the augmentation step (8). First, we show that the augmented data points (8) can be interpreted as Tikhonov regularized Newton-steps Levenberg44 ; Marquardt63 on the loss in the semantic space (under the current model). Roughly speaking, this quantifies the sense in which Algorithm 1 is an adaptive data augmentation algorithm that adds data points from fictitious "hard" target distributions. Secondly, recall that the robust surrogate (5) is the loss whose stochastic gradients are used to update the model parameters in the minimization step (Eq (7)). In the classification setting, we show that the robust surrogate (5) roughly corresponds to a novel data-dependent regularization scheme on the softmax loss (2). Instead of penalizing towards zero like classical regularizers (e.g., ridge or lasso), our data-dependent regularization term penalizes deviations from the parameter vector corresponding to the true label.

3.1 Adaptive Data Augmentation

We now give an interpretation for the augmented data points in the maximization phase (8). Concretely, we fix $\theta$, $(x_0, y_0)$, and $\gamma$, and consider an $\epsilon$-maximizer

$$x_\epsilon \;\; \text{such that} \;\; \ell\big(\theta; (x_\epsilon, y_0)\big) - \gamma\, c_\theta\big((x_\epsilon, y_0), (x_0, y_0)\big) \ge \phi_\gamma\big(\theta; (x_0, y_0)\big) - \epsilon.$$

We let $z_0 := g(\theta_f; x_0)$, and abuse notation by writing $\ell(\theta; (z, y))$ for the loss as a function of the feature vector $z$. In what follows, we show that the feature mapping satisfies

$$g(\theta_f; x_\epsilon) \approx z_0 + \big( \gamma I - \nabla_z^2\, \ell(\theta; (z_0, y_0)) \big)^{-1} \nabla_z\, \ell(\theta; (z_0, y_0)). \tag{10}$$

Intuitively, this implies that the adversarially perturbed sample is drawn from a fictitious target distribution where probability mass on $(z_0, y_0)$ was transported to $(g(\theta_f; x_\epsilon), y_0)$. We note that the transported point in the semantic space corresponds to a Tikhonov regularized Newton-step Levenberg44 ; Marquardt63 on the loss $z \mapsto \ell(\theta; (z, y_0))$ at the current model $\theta$. Noting that computing the right-hand side of (10) involves backsolves on a large dense matrix, we can interpret our gradient ascent updates in the maximization phase (8) as an iterative scheme for approximating this quantity.
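To see where the Newton-step form (10) comes from, a short heuristic derivation (ours; error terms are suppressed and made precise by the theorem below) starts from the first-order optimality condition of the inner problem $\sup_z \{\ell(\theta;(z,y_0)) - \tfrac{\gamma}{2}\|z - z_0\|_2^2\}$ in the semantic space:

$$
\begin{aligned}
0 &= \nabla_z \ell\big(\theta;(z^\star,y_0)\big) - \gamma\,(z^\star - z_0) && \text{(first-order condition)}\\
\gamma\,(z^\star - z_0) &\approx \nabla_z \ell\big(\theta;(z_0,y_0)\big) + \nabla_z^2 \ell\big(\theta;(z_0,y_0)\big)\,(z^\star - z_0) && \text{(linearize } \nabla_z\ell \text{ around } z_0\text{)}\\
z^\star &\approx z_0 + \big(\gamma I - \nabla_z^2 \ell(\theta;(z_0,y_0))\big)^{-1}\nabla_z \ell\big(\theta;(z_0,y_0)\big). &&
\end{aligned}
$$

The $\gamma I$ term plays the role of the Tikhonov (Levenberg–Marquardt) damping, and a few gradient ascent steps approximate the backsolve.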

We assume sufficient smoothness, where we use $\|A\|_{\mathrm{op}}$ to denote the operator norm of a matrix $A$. Assumption 3.1: there exists a constant $M < \infty$ such that, for all $z$, we have $\|\nabla_z \ell(\theta;(z, y_0))\|_2 \le M$ and $\|\nabla_z^2 \ell(\theta;(z, y_0))\|_{\mathrm{op}} \le M$. Assumption 3.2: there exists a constant $L_H < \infty$ such that, for all $z, z'$, we have $\|\nabla_z^2 \ell(\theta;(z, y_0)) - \nabla_z^2 \ell(\theta;(z', y_0))\|_{\mathrm{op}} \le L_H \|z - z'\|_2$. Then, we have the following bound on the approximation (10), whose proof we defer to Appendix A.1. Theorem 3.1: let Assumptions 3.1–3.2 hold; if $\gamma$ is sufficiently large relative to the smoothness constants and $\epsilon$ is sufficiently small, then the error in the approximation (10) is bounded by a quantity that vanishes as $\epsilon \to 0$ and $\gamma \to \infty$.

3.2 Data-Dependent Regularization

In this section, we argue that, under suitable conditions on the loss, the robust surrogate loss (5) corresponds to a particular data-dependent regularization scheme. Let $\ell$ be the $C$-class softmax loss (2), which we again write as a function of the feature vector $z = g(\theta_f; x)$:

$$\ell\big(\theta; (z, y)\big) = \log \left( \sum_{k=1}^{C} \exp\big(a_k^\top z\big) \right) - a_y^\top z,$$

where $a_k$ is the $k$-th column of the classification layer weights $\theta_c$. Then, the robust surrogate is an approximate regularizer on the classification layer weights

$$\phi_\gamma\big(\theta; (x_0, y_0)\big) \approx \ell\big(\theta; (x_0, y_0)\big) + \frac{1}{2\gamma} \Big\| \sum_{k=1}^{C} p_k\, a_k - a_{y_0} \Big\|_2^2, \quad \text{where } p_k := \frac{\exp(a_k^\top z_0)}{\sum_{j=1}^{C} \exp(a_j^\top z_0)}. \tag{11}$$

The expansion (11) shows that the robust surrogate (5) is roughly equivalent to data-dependent regularization where we minimize the distance between $\sum_k p_k a_k$, our "average estimated linear classifier", and $a_{y_0}$, the linear classifier corresponding to the true label $y_0$. Letting $z_0 = g(\theta_f; x_0)$ for a fixed $x_0$, we have the following result, whose proof we defer to Appendix A.3. Theorem 3.2: for $\gamma$ sufficiently large, the softmax loss (2) satisfies the expansion (11) up to an error of order $1/\gamma^2$.
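The gradient identity behind the expansion (11)—that the softmax loss gradient in feature space equals the gap between the average estimated classifier $\sum_k p_k a_k$ and the true-class classifier $a_y$—can be verified numerically. The sketch below (ours; dimensions are arbitrary) compares the autograd gradient with the closed form:

```python
import torch

m, C, y = 64, 10, 3                       # feature dim, number of classes, true label
A = torch.randn(C, m)                     # rows are the class vectors a_k
z = torch.randn(m, requires_grad=True)    # feature vector z = g(theta_f; x)

logits = A @ z
loss = torch.logsumexp(logits, dim=0) - logits[y]     # softmax loss (2) as a function of z
loss.backward()

p = torch.softmax(logits.detach(), dim=0)             # class probabilities p_k
manual_grad = p @ A - A[y]                            # sum_k p_k a_k - a_y
print(torch.allclose(z.grad, manual_grad, atol=1e-5))  # True
```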

4 Experiments

Figure 1: Samples from MNIST MNIST , SVHN SVHN , USPS USPS , SYN and MNIST-M Ganin ; Ganin2 (top), and from Dawn, Fog, Spring, Night and Winter sequences from SYNTHIA SYNTHIA dataset (bottom).

We evaluate our method for both classification and semantic segmentation settings, following the evaluation scenarios of domain adaptation techniques Ganin ; ADDA ; FCNInTheWild , though in our case the target domains are unknown at training time. We summarize our experimental setup including implementation details, evaluation metrics and datasets for each task. We compare our method against the Empirical Risk Minimization (ERM) baseline in all of our results.

Digit classification

We train on the MNIST MNIST dataset and test on MNIST-M Ganin , SVHN SVHN , SYN Ganin and USPS USPS (see Figure 1 (top)). We use MNIST digit samples for training and evaluate our models on the respective test sets of the different target domains, using accuracy as the metric. In order to work with comparable datasets, we resize all images to a common resolution and treat images from MNIST and USPS as RGB. We use a ConvNet ConvNet with architecture conv-pool-conv-pool-fc-fc-softmax, and set the hyperparameters $K$, $T_{\max}$, $T_{\min}$ and $\gamma$ of Algorithm 1 before training. In the minimization phase, we use Adam Adam with a fixed batch size.
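For concreteness, a PyTorch module matching the stated conv-pool-conv-pool-fc-fc-softmax layout is sketched below; it is our own illustration, and the channel widths, kernel sizes, hidden dimension, and the assumed 32×32 RGB input are not taken from the paper.

```python
import torch.nn as nn

class DigitConvNet(nn.Module):
    """conv-pool-conv-pool-fc-fc-softmax; widths and kernels are illustrative assumptions."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),    # conv-pool
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # conv-pool
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1024), nn.ReLU(),   # fc: the embedding layer g(theta_f; x)
        )
        self.classifier = nn.Linear(1024, num_classes)  # final fc; softmax is applied in the loss

    def forward(self, x):
        return self.classifier(self.features(x))        # class logits
```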

Semantic scene segmentation

We use the SYNTHIA SYNTHIA dataset for semantic segmentation. The dataset contains images from different locations (we use Highway, New York-like City and Old European Town), and different weather/time/date conditions (we use Dawn, Fog, Night, Spring and Winter; see Figure 1 (bottom)). We train models on a source domain and test on the other domains, using the standard mean Intersection Over Union (mIoU) metric to evaluate performance VOC2008 . We arbitrarily chose images from the left front camera throughout our experiments. For each domain, we sample random images (resized to a fixed resolution) from the training set. We use a Fully Convolutional Network (FCN) FCN with a ResNet-50 ResNet body, and set the hyperparameters $K$, $T_{\max}$, $T_{\min}$ and $\gamma$ of Algorithm 1 before training. For the minimization phase, we use Adam Adam with a fixed batch size.
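As a stand-in for the segmentation network described above, one can instantiate an FCN with a ResNet-50 body directly from torchvision (our illustration; the class count and input resolution below are assumptions, not the authors' configuration):

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(num_classes=13)     # FCN with a ResNet-50 body; 13 classes assumed
x = torch.randn(1, 3, 192, 320)          # a single RGB frame (resolution assumed)
out = model(x)["out"]                    # per-pixel class logits, shape (1, 13, 192, 320)
print(out.shape)
```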

4.1 Results on Digit Classification

Figure 2: Results associated with models trained on MNIST samples and tested on SVHN, MNIST-M, SYN and USPS (first, second, third and fourth columns, respectively). The red line/bar in each panel is associated with models trained with ERM, the blue bars with models trained with our method. Top row: results obtained by fixing the number of iterations $K$ and varying $\gamma$. Middle row: results obtained by fixing $\gamma$ and varying $K$. Bottom row: results obtained with the ensemble method by varying $K$. Black bars indicate the range of accuracies spanned. Results are obtained by averaging over different runs.

In this section, we present and discuss the results of the digit classification experiment. Figure 2 (middle) shows the performance of models trained with different numbers of iterations $K$. Comparing models trained with Algorithm 1 (blue) and models trained with ERM (red), it can be observed that our method leads to statistically significant improvements when the models are tested on data distributions very different from the training one (SVHN, MNIST-M and SYN). On USPS, our method causes accuracies to drop: since MNIST and USPS are very similar datasets (see Figure 1), the image domain that USPS belongs to is not explored by our algorithm during the training procedure, which optimizes for worst-case performance. Figure 2 (top) shows the performance of models trained with different values of the hyperparameter $\gamma$ (with $K$ fixed). These plots show that on out-of-sample domains (SVHN, MNIST-M and SYN), our method outperforms ERM for any value of $\gamma$, a desirable property, since this hyperparameter cannot be properly cross-validated. We further report in Appendix B a comparison between our method and an unsupervised domain adaptation algorithm (ADDA ADDA ), and results associated with different values of the hyperparameters $K$ and $\gamma$.

Finally, we report the results obtained by learning an ensemble of models. Since the hyperparameter $\gamma$ is nontrivial to set a priori, we use the softmax confidences (9) to choose which model to use at test time. We learn an ensemble of models, each of which is trained by running Algorithm 1 with a different value of $\gamma$. Figure 2 (bottom) shows the comparison between our method with different numbers of iterations $K$ (blue) and ERM (red). In order to separate the role of ensemble learning, we also learn an ensemble of baseline models, each corresponding to a different initialization, and we fix the number of models in the ensemble to be the same for both the baseline (ERM) and our method. As can be observed by comparing Figure 2 (middle) and (bottom), our ensemble approach yields higher accuracy in most of the testing scenarios. It is also worth noting the reduced performance gap in the USPS experiment.

4.2 Results on Semantic Scene Segmentation

Figure 3: Results obtained with semantic segmentation models trained with ERM (red) and with our method (blue). Leftmost panels are associated with models trained on Highway, rightmost panels are associated with models trained on New York-like City. Test datasets are Highway, New York-like City and Old European Town.

We report a comparison between models trained with ERM and models trained with our method (Algorithm 1). We fix the value of $\gamma$ in every experiment; let us stress that this is an arbitrary choice, as we did not observe a strong correlation between the different values of $\gamma$ and the general behavior of the models in this case. Its role would be much more meaningful in an ensemble setting, where each model would be associated with a different level of robustness, as discussed in Section 2. In this setting, we do not apply the ensemble approach, but only evaluate the performance of the single models. The main reason for this choice is that the heuristic developed to choose the correct model at test time cannot be applied in a straightforward fashion to a semantic segmentation problem. One could apply it to the single pixels and combine the outputs of the different models according to the softmax distributions; some preliminary results showed that this approach could be applicable, but we reserve a detailed analysis of it for future work.

Figure 3 reports the numerical results obtained. Specifically, the leftmost plots report results associated with models trained on sequences from the Highway split and tested on the New York-like City and the Old European Town splits (top-left and bottom-left, respectively); the rightmost plots report results associated with models trained on sequences from the New York-like City split and tested on the Highway and the Old European Town splits (top-right and bottom-right, respectively). The training sequences (Dawn, Fog, Night, Spring and Winter) are indicated on the x-axis. Red and blue bars indicate the average mIoUs achieved by models trained with ERM and by models trained with our method, respectively. These results were calculated by averaging over the mIoUs obtained with each model on the different conditions of the test set. As can be observed, models trained with our method mostly generalize better to unknown data distributions. In particular, our method always outperforms the baseline by a statistically significant margin when the training images are from Night scenarios. This is because the baseline model trained on images from Night is strongly biased towards dark scenery, while, as a consequence of training over worst-case distributions, our models can overcome this strong bias and better generalize across different unseen domains.

5 Conclusions and Future Work

We study a new adversarial data augmentation procedure for learning models that better generalize across unseen data distributions, and define an ensemble method to exploit this technique in a classification framework. This is in contrast to domain adaptation algorithms, which require a sufficient number of samples from a known, a priori fixed target distribution. Our experimental results show that our iterative procedure provides broad generalization behavior on digit recognition and cross-season and cross-weather semantic segmentation tasks.

For future work, we hope to extend the ensemble methods by defining novel decision rules. The proposed heuristics (9) only apply to classification settings, and extending them to a broad realm of tasks including semantic segmentation is an important direction. Many theoretical questions still remain. For instance, quantifying the behavior of data-dependent regularization schemes presented in Section 3 would help us better understand adversarial training methods in general.

References

Appendix A Proofs

A.1 Proof of Theorem 3.1

Recall that we consider a fixed , , , and . We begin by noting that since , we have

(12)

Similarly as , let be an -optimizer to the problem (12)

To further ease notation, let us denote

the first- and second-order approximation of around respectively.

First, we note that by hypothesis and hence, attains the maximum in the problem (12)

(13)

Now, note that is - strongly concave since

by Assumption 3.1, where denotes the maximum eigenvalue. Recalling the definition of given in Eq (12), we then have

(14)

where we used the definition of in the last inequality.

Next, we note that and are close by Taylor expansion. {lemma}[(Nesterov06, , Lemma 1)] Let have a -Lipschitz Hessian so that for all , . Then, for all ,

Applying Lemma A.1, we have that

Using this inequality in the bound (14), we arrive at

(15)

From definition (13) of , we have

(16)

Next, to bound in the bound (15), we show that and are at most -away. We defer the proof of the following lemma to Appendix A.2 {lemma} Let Assumption 3.1 hold and . Then,

Applying Lemma A.1 to bound on the right hand side of inequality (15), and using the bound (16) for , we obtain

This gives the final result.

A.2 Proof of Lemma A.1

We use the following key lemma which says that for functions that satisfy a growth condition, its minimum is stable to perturbations to the function. {lemma}[(BonnansSh13, , Proposition 4.32)] Suppose that satisfies the second-order growth condition: there exists a such that if we denote by the minimizer of so that , we have for all

If there is a function such that is -Lipschitz on a neighborhood of , then , any -approximate minimizer of in , satisfies

Letting and , note first that is strongly convex. Further, is -Lipschitz by Assumption 3.1. Applying Lemma A.2, we obtain the result.

A.3 Proof of Theorem 3.2

Again, we abuse notation by writing for , and similarly and . We begin by noting that since , we have

The following claim will be crucial. {claim} If is -Lipschitz with respect to the -norm, then

{proof-of-claim}

From Taylor’s theorem, we have

Using this approximation in the definition of , we get

Similarly, we can compute the lower bound

Combining the two bounds, the claim follows.

From the claim, it suffices to show that is -Lipschitz. From , we have

Now, since

we conclude that

Appendix B Additional Experimental Results

Table 1 reports results associated with the digit experiment (Section 4.1, Figure 2). In particular, it reports numerical results (averaged over different runs) obtained with models trained with Algorithm 1 while varying the hyperparameters $K$ and $\gamma$. The training set consists of MNIST samples, and models were tested on SVHN, MNIST-M, SYN and USPS (see Figure 1 (top)). The baseline accuracies achieved by models trained with ERM are:

  • SVHN:

  • MNIST-M:

  • SYN:

  • USPS:

Table 2 reports results associated with the semantic segmentation experiment (Section 4.2, Figure 3). To summarize, it reports results obtained by training models on Highway and testing them on New York-like City and Old European Town, and by training models on New York-like City and testing them on Highway and Old European Town (see Figure 1 (bottom) to observe the different weather/time/date conditions). The comparison is between models trained with ERM (ERM rows) and with our method (Ours rows), i.e., Algorithm 1.

Finally, Figure 4 reports a comparison between our method (blue) and the unsupervised domain adaptation algorithm ADDA ADDA (yellow), obtained by varying the number of target images fed to the latter during training. Note that, since unsupervised domain adaptation algorithms make use of target data during training while our method does not, the comparison is not fair. However, we are interested in evaluating to what extent our method can compete with a well-performing unsupervised domain adaptation algorithm ADDA . While on the MNIST → USPS split ADDA clearly outperforms our method, on MNIST → MNIST-M the accuracies reached by our method are only slightly lower than the ones reached by ADDA, and on MNIST → SYN our method outperforms it, even though the domain adaptation algorithm has access to a large number of samples from the target domain. Finally, note that MNIST → SVHN results are not provided because ADDA would not converge on this split (indeed, these results are not reported in the original work ADDA ). Instead, models trained on MNIST samples using our method generalize better to SVHN, as shown in Section 4.1.

[Table 1: accuracies on SVHN, MNIST-M, SYN and USPS for models trained with Algorithm 1 with $K \in \{1, 2, 3, 4\}$ and varying $\gamma$; numerical entries omitted.]