Generalizing to Unseen Domains
via Adversarial Data Augmentation
Abstract
We are concerned with learning models that generalize well to different unseen domains. We consider a worst-case formulation over data distributions that are near the source domain in the feature space. Using only training data from the source domain, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers (e.g., ridge or lasso) that regularize towards zero. On digit recognition and semantic segmentation tasks, we empirically observe that our method learns models that improve performance across a priori unknown data distributions.
1 Introduction
In many modern applications of machine learning, we wish to learn a system that can perform uniformly well across multiple populations. Due to high costs of data acquisition, however, it is often the case that datasets consist of a limited number of population sources. While performance evaluated on the validation dataset—usually from the same population as the training dataset—is a standard metric on which many systems are optimized, it has been observed that performance on populations different from that of the training data can be much worse Daume2006 ; Blitzer2006 ; BenDavid2006 ; Saenko2010 ; NameTheDataset . In this paper, we are concerned with generalizing to populations different from the training distribution, in settings where we have no access to any data from the unknown target distributions.
A number of authors have proposed domain adaptation methods (for example, see Ganin ; ADDA ; DeepCORAL ; morerio2018 ; DIFA ) in settings where a fully labeled source dataset and an unlabeled (or partially labeled) set of examples from fixed target distributions are available. Although such algorithms can successfully learn models that perform well on known target distributions, the assumption of a priori fixed target distributions can be restrictive in practical scenarios. For example, consider a semantic segmentation algorithm used by a robot: every task, robot, environment and camera configuration will result in a different target distribution, and these diverse scenarios can be identified only after the model is trained and deployed, making it difficult to collect samples from them.
In this work, we develop methods that can learn to better generalize to new unknown domains. We consider the restrictive setting where training data only comes from a single source domain. For example, consider a module for self-driving cars that needs to generalize well across weather conditions unexplored during training. Inspired by recent developments in distributionally robust optimization and adversarial training Certifiable ; LeeRa17 ; HeinzeDemlMe17 , we consider the following worst-case problem around the (training) source distribution $P_0$:
$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \sup_{P \,:\, D(P, P_0) \le \rho} \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big] \tag{1}$$
Here, $\theta \in \Theta$ is the model, $(X, Y) \sim P_0$ is a source data point with its labeling, $\ell$ is the loss function, and $D$ is a distance metric on the space of probability distributions.
The solution to the worst-case problem (1) guarantees good performance against data distributions that are at distance $\rho$ away from the source domain $P_0$. To allow data distributions that have different support from that of the source $P_0$, we use a Wasserstein distance as our metric $D$. Our distance will be defined on the semantic space¹, so that target populations $P$ satisfying $D(P, P_0) \le \rho$ will be realistic covariate shifts that preserve the same semantic representation of the source (e.g., adding color to a greyscale image). In this regard, we expect the solution to the worst-case problem (1)—the model that we wish to learn—to have favorable performance across covariate shifts in the semantic space.

¹By semantic space we mean learned representations, since recent works perceptual1 ; perceptual2 suggest that distances in the space of learned representations of high-capacity models typically correspond to semantic distances in visual space.
We propose an iterative procedure that aims to solve the problem (1) for a small value of $\rho$ at a time, and performs stochastic gradient updates to the model with respect to these fictitious worst-case target distributions (Section 2). Each iteration of our method uses a small value of $\rho$, and we provide a number of theoretical interpretations of our method. First, we show that our iterative algorithm is an adaptive data augmentation method where we add adversarially perturbed samples—at the current model—to the dataset (Section 3). More precisely, our adversarially generated samples roughly correspond to Tikhonov-regularized Newton steps Levenberg44 ; Marquardt63 on the loss in the semantic space. Further, we show that for softmax losses, each iteration of our method can be thought of as a data-dependent regularization scheme where we regularize towards the parameter vector corresponding to the true label, instead of regularizing towards zero like classical regularizers such as ridge or lasso.
From a practical viewpoint, a key difficulty in applying the worst-case formulation (1) is that the magnitude $\rho$ of the covariate shift is a priori unknown. We propose to learn an ensemble of models that correspond to different distances $\rho$. In other words, our iterative method generates a collection of datasets, each corresponding to a different inter-dataset distance level $\rho$, and we learn a model for each of them. At test time, we use a heuristic method to choose an appropriate model from the ensemble.
We test our approaches on a simple digit recognition task, and a more realistic semantic segmentation task across different seasons and weather conditions. In both settings, we observe that our method allows us to learn models that improve performance across a priori unknown target distributions at varying distances from the original source domain.
Related work
The literature on adversarial training FastGradientMethod ; Certifiable ; LeeRa17 ; HeinzeDemlMe17 is closely related to our work, since the main goal is to devise training procedures that learn models robust to fluctuations in the input. Departing from the imperceptible attacks considered in adversarial training, we aim to learn models that are resistant to larger perturbations, namely out-of-distribution samples. Sinha et al. Certifiable propose a principled adversarial training procedure, where new images that maximize some risk are generated and the model parameters are optimized with respect to those adversarial images. Being devised for defense against imperceptible adversarial attacks, the new images are learned with a loss that penalizes differences between the original images and the new ones. In this work, we rely on a minimax game similar to the one proposed by Sinha et al. Certifiable , but we impose the constraint in the semantic space, allowing our adversarial samples from a fictitious distribution to differ at the pixel level while sharing the same semantics.
There is a substantial body of work on domain adaptation Daume2006 ; Blitzer2006 ; Saenko2010 ; Ganin ; ADDA ; DeepCORAL ; morerio2018 ; DIFA , which aims to better generalize to a priori fixed target domains whose labels are unknown at training time. This setup differs from ours in that these algorithms require access to samples from the target distribution during training. Domain generalization methods DG0 ; DG1 ; DG2 ; DG3 ; Mancini2018 , which propose different ways to better generalize to unknown domains, are also related to our work. These algorithms require training samples drawn from several different domains (with access to the domain labels during training) rather than from a single source, a limitation that our method does not have. In this sense, one could interpret our problem setting as unsupervised domain generalization. Tobin et al. DomainRandomization propose domain randomization, which applies to simulated data and creates a variety of random renderings with the simulator, hoping that the real world will be interpreted as one of them. Our goal is the same, since we aim at obtaining data distributions more similar to real-world ones, but we accomplish it by actually learning new data points, making our approach applicable to any data source and without the need for a simulator.
Hendrycks and Gimpel SoftmaxICLR2016 suggest that a good empirical way to detect whether a test sample is out-of-distribution for a given model is to evaluate the statistics of the softmax outputs. We adapt this idea to our setting, learning an ensemble of models trained with our method and choosing at test time the model with the greatest maximum softmax value.
2 Method
The worst-case formulation (1) over domains around the source hinges on the notion of distance $D$, which characterizes the set of unknown populations we wish to generalize to. Conventional notions of Wasserstein distance used for adversarial training Certifiable are defined with respect to the original input space $\mathcal{X}$, which for images corresponds to raw pixels. Since our goal is to consider fictitious target distributions corresponding to realistic covariate shifts, we define our distance on the semantic space. Before properly defining our setup, we first give a few notations. Letting $m$ be the dimension of the output of the last hidden layer, we denote $\theta = [\theta_c, \theta_f]$, where $\theta_c \in \mathbb{R}^{m \times K}$ is the set of weights of the final layer and $\theta_f$ is the rest of the weights of the network. We denote by $g(\theta_f; x)$ the output of the embedding layer of our neural network. For example, in the classification setting, $K$ is the number of classes and we consider the softmax loss
$$\ell(\theta; (x, y)) = \log \sum_{k=1}^{K} \exp\!\big(\theta_{c,k}^{\top}\, g(\theta_f; x)\big) - \theta_{c,y}^{\top}\, g(\theta_f; x) \tag{2}$$
where $\theta_{c,k}$ is the $k$-th column of the classification layer weights $\theta_c$.
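As a concrete reference for the loss in Eq. (2), the following sketch (in NumPy; the function name and shapes are our own illustrative choices, not from any released code) evaluates the softmax loss from the classification-layer weights and a feature vector:

```python
import numpy as np

def softmax_loss(theta_c, z, y):
    """Softmax loss (2) evaluated at features z = g(theta_f; x).

    theta_c : (m, K) classification-layer weights; column k is theta_{c,k}.
    z       : (m,) output of the last hidden layer.
    y       : int, true class label in {0, ..., K-1}.
    """
    scores = theta_c.T @ z              # (K,) logits theta_{c,k}^T z
    # log-sum-exp with max subtraction for numerical stability
    s = scores.max()
    return s + np.log(np.exp(scores - s).sum()) - scores[y]
```

The log-sum-exp trick leaves the value of (2) unchanged while avoiding overflow for large logits.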
Wasserstein distance on the semantic space
On the space $\mathcal{Z} \times \mathcal{Y}$ of semantic features and labels, consider the following transportation cost $c$—the cost of moving mass from $(z, y)$ to $(z', y')$:
$$c\big((z, y), (z', y')\big) := \tfrac{1}{2}\|z - z'\|_2^2 + \infty \cdot \mathbf{1}\{y \neq y'\}.$$
The transportation cost takes value $\infty$ for data points with different labels, since we are only interested in perturbations to the marginal distribution of $X$. We now define our notion of distance on the semantic space. For inputs $x, x'$ coming from the original space $\mathcal{X}$, we consider the transportation cost defined with respect to the output of the last hidden layer,
$$c_\theta\big((x, y), (x', y')\big) := c\big((g(\theta_f; x), y), (g(\theta_f; x'), y')\big),$$
so that $c_\theta$ measures distance with respect to the feature mapping $g(\theta_f; \cdot)$. For probability measures $P$ and $Q$ both supported on $\mathcal{X} \times \mathcal{Y}$, let $\Pi(P, Q)$ denote the set of their couplings, meaning measures $M$ with marginals $M(A, \mathcal{X} \times \mathcal{Y}) = P(A)$ and $M(\mathcal{X} \times \mathcal{Y}, A) = Q(A)$. Then, we define our notion of distance by
$$D_\theta(P, Q) := \inf_{M \in \Pi(P, Q)} \mathbb{E}_{M}\big[c_\theta\big((X, Y), (X', Y')\big)\big]. \tag{3}$$
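A minimal sketch of the pointwise transportation cost $c_\theta$ underlying the distance (3) might look as follows; here `g` stands in for the feature map $g(\theta_f; \cdot)$ and is an arbitrary callable, since the definition is agnostic to the network:

```python
import numpy as np

def transport_cost(g, x1, y1, x2, y2):
    """Semantic-space transportation cost c_theta((x1,y1),(x2,y2)):
    half the squared Euclidean distance between feature embeddings,
    infinite when labels differ (only the marginal over X is perturbed)."""
    if y1 != y2:
        return np.inf
    z1, z2 = g(x1), g(x2)
    return 0.5 * np.sum((z1 - z2) ** 2)
```

With `g` the identity map, the cost reduces to half the squared pixel-space distance, recovering the conventional choice used for adversarial training.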
Armed with this notion of distance on the semantic space, we now consider a variant of the worst-case problem (1) where we replace the distance $D$ with $D_\theta$ defined in (3), our adaptive notion of distance on the semantic space:
$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \sup_{P \,:\, D_\theta(P, P_0) \le \rho} \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big].$$
Computationally, the above supremum over probability distributions is intractable. Hence, we consider the following Lagrangian relaxation with penalty parameter $\gamma \ge 0$:
$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \sup_{P} \Big\{ \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big] - \gamma\, D_\theta(P, P_0) \Big\}. \tag{4}$$
Taking the dual reformulation of the penalty relaxation (4), we obtain an efficient solution procedure. The following result is a minor adaptation of (BlanchetMu16, Theorem 1); to ease notation, let us define the robust surrogate loss
$$\phi_\gamma\big(\theta; (x^0, y^0)\big) := \sup_{x \in \mathcal{X}} \Big\{ \ell\big(\theta; (x, y^0)\big) - \gamma\, c_\theta\big((x, y^0), (x^0, y^0)\big) \Big\}. \tag{5}$$
Let the loss $\ell$ be continuous. For any distribution $Q$ and any $\gamma \ge 0$, we have
$$\sup_{P} \Big\{ \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big] - \gamma\, D_\theta(P, Q) \Big\} = \mathbb{E}_{Q}\big[\phi_\gamma(\theta; (X, Y))\big]. \tag{6}$$
In order to solve the penalty problem (4), we can now perform stochastic gradient descent on the robust surrogate loss $\phi_\gamma$. Under suitable conditions BoydVa04 , we have
$$\nabla_\theta\, \phi_\gamma\big(\theta; (x^0, y^0)\big) = \nabla_\theta\, \ell\big(\theta; (x^\star, y^0)\big), \tag{7}$$
where $x^\star$ is an adversarial perturbation of $x^0$ at the current model $\theta$, i.e., a point attaining the supremum in (5). Hence, computing gradients of the robust surrogate $\phi_\gamma$ requires solving the maximization problem (5). Below, we consider a heuristic procedure that iteratively performs stochastic gradient steps on the robust surrogate $\phi_\gamma$.
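The inner maximization in (5) can be approximated by gradient ascent. The sketch below assumes explicit gradient callables (`loss_grad`, `feat`, `feat_grad`) standing in for automatic differentiation; the names and step schedule are illustrative, not the paper's exact procedure:

```python
import numpy as np

def inner_maximization(loss_grad, feat, feat_grad, x0, y, gamma,
                       lr=0.1, steps=15):
    """Gradient ascent on the surrogate objective of Eq. (5):
        l(theta; (x, y)) - gamma * c_theta((x, y), (x0, y)).
    loss_grad(x, y) returns dl/dx, feat(x) returns g(theta_f; x),
    and feat_grad(x) returns the Jacobian of the feature map at x."""
    x = x0.copy()
    z0 = feat(x0)
    for _ in range(steps):
        # gradient of the c_theta penalty: J_g(x)^T (g(x) - g(x0))
        penalty_grad = feat_grad(x).T @ (feat(x) - z0)
        x = x + lr * (loss_grad(x, y) - gamma * penalty_grad)
    return x
```

On a toy problem with a linear loss and the identity feature map, the objective is strongly concave and the iterates converge to the exact maximizer $x^0 + a/\gamma$ for loss gradient $a$.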
Iterative Procedure
We propose an iterative training procedure that alternates two phases: a maximization phase, where new data points are learned by computing the inner maximization problem (5), and a minimization phase, where the model parameters are updated according to stochastic gradients of the loss evaluated on the adversarial examples generated in the maximization phase. The latter step is equivalent to stochastic gradient steps on the robust surrogate loss $\phi_\gamma$, which motivates its name. The main idea is to iteratively learn "hard" data points from fictitious target distributions, while preserving the semantic features of the original data points.
Concretely, in the $k$-th maximization phase, we compute adversarially perturbed samples at the current model $\theta$:
$$X_i^{k} \in \arg\max_{x \in \mathcal{X}} \Big\{ \ell\big(\theta; (x, Y_i)\big) - \gamma\, c_\theta\big((x, Y_i), (X_i, Y_i)\big) \Big\}, \tag{8}$$
where $(X_i, Y_i) \sim P_0$ are the original samples from the source distribution. The minimization phase then performs repeated stochastic gradient steps on the augmented dataset. The maximization phase (8) can be efficiently computed for smooth losses $\ell$ whenever $\gamma$ is large enough that the objective in (8) is strongly concave in $x$ (Certifiable, Theorem 2); for example, this is provably true for any linear network. In practice, we use gradient ascent steps to solve for the worst-case examples (8); see Algorithm 1 for the full description of our algorithm.
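The alternation between the two phases can be sketched as follows. The `model.sgd` and `perturb` interfaces are hypothetical stand-ins for a training framework and for the inner maximization of (8); the paper's Algorithm 1 additionally interleaves minibatch sampling, which we elide here:

```python
import numpy as np

def adversarial_augmentation_training(X, Y, model, perturb, sgd_epochs, K=2):
    """Skeleton of the alternating procedure: K maximization phases, each
    followed by a minimization phase on the growing augmented dataset.
    `perturb(model, x, y)` approximately solves Eq. (8) at the current
    model; `model.sgd(X, Y, epochs)` runs stochastic gradient steps."""
    X_aug, Y_aug = list(X), list(Y)
    for _ in range(K):
        # maximization phase: adversarial samples at the current model
        X_new = [perturb(model, x, y) for x, y in zip(X, Y)]
        X_aug.extend(X_new)
        Y_aug.extend(Y)
        # minimization phase: SGD on the augmented dataset
        model.sgd(np.array(X_aug), np.array(Y_aug), epochs=sgd_epochs)
    return model, np.array(X_aug), np.array(Y_aug)
```

Each phase appends one adversarial copy of the source samples, so after $K$ phases the dataset is $(K+1)$ times its original size.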
Ensembles for classification
The hyperparameter $\gamma$—which is inversely proportional to $\rho$, the distance between the fictitious target distribution and the source—controls the ability to generalize outside the source domain. Since target domains are unknown, it is difficult to choose an appropriate level of $\gamma$ a priori. We propose a heuristic ensemble approach where we train $p$ models $\theta^1, \ldots, \theta^p$. Each model $\theta^i$ is associated with a different value of $\gamma_i$, and thus with fictitious target distributions at varying distances from the source $P_0$. To select the best model at test time—inspired by Hendrycks and Gimpel SoftmaxICLR2016 —given a sample $x$, we select the model with the greatest maximum softmax score:
$$\hat{\imath} = \arg\max_{i \in \{1, \ldots, p\}} \; \max_{k \in \{1, \ldots, K\}} \Big[\mathrm{softmax}\big(\theta_c^{i\,\top}\, g(\theta_f^i; x)\big)\Big]_k. \tag{9}$$
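A sketch of the selection rule (9), assuming each model in the ensemble is a callable returning logits (an illustrative interface, not the paper's implementation):

```python
import numpy as np

def select_model(models, x):
    """Pick from an ensemble the model with the largest maximum softmax
    score on input x, following Eq. (9)."""
    def max_softmax(logits):
        e = np.exp(logits - logits.max())   # stable softmax
        return (e / e.sum()).max()
    scores = [max_softmax(m(x)) for m in models]
    return models[int(np.argmax(scores))]
```

The rule favors the model that is most confident on the given sample, following the out-of-distribution detection heuristic of Hendrycks and Gimpel.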
3 Theoretical Motivation
In our iterative algorithm (Algorithm 1), the maximization phase (8) was the key step that augmented the dataset with adversarially perturbed data points, followed by standard stochastic gradient updates to the model parameters. In this section, we provide some theoretical understanding of the augmentation step (8). First, we show that the augmented data points (8) can be interpreted as Tikhonov-regularized Newton steps Levenberg44 ; Marquardt63 on the loss in the semantic space (under the current model). Roughly speaking, this quantifies the sense in which Algorithm 1 is an adaptive data augmentation algorithm that adds data points from fictitious "hard" target distributions. Secondly, recall that the robust surrogate (5) is the loss whose stochastic gradients were used to update the model parameters in the minimization step (Eq. (7)). In the classification setting, we show that the robust surrogate (5) roughly corresponds to a novel data-dependent regularization scheme on the softmax loss. Instead of penalizing towards zero like classical regularizers (e.g., ridge or lasso), our data-dependent regularization term penalizes deviations from the parameter vector corresponding to the true label.
3.1 Adaptive Data Augmentation
We now give an interpretation of the augmented data points in the maximization phase (8). Concretely, we fix the model $\theta$, a data point $(x^0, y^0)$, a penalty level $\gamma$, and consider an $\epsilon$-maximizer $x^\star$ of the problem (5). We let $z^0 = g(\theta_f; x^0)$, and abuse notation by writing $\ell(\theta; (z, y))$ for the loss evaluated directly at a feature vector $z$. In what follows, we show that the feature mapping $g(\theta_f; x^\star)$ satisfies
$$g(\theta_f; x^\star) \approx z^0 + \big(\gamma I - \nabla_{zz}\, \ell(\theta; (z^0, y^0))\big)^{-1}\, \nabla_z\, \ell(\theta; (z^0, y^0)). \tag{10}$$
Intuitively, this implies that the adversarially perturbed sample is drawn from a fictitious target distribution where probability mass on $(x^0, y^0)$ was transported to $(x^\star, y^0)$. We note that the transported point in the semantic space corresponds to a Tikhonov-regularized Newton step Levenberg44 ; Marquardt63 on the loss at the current model $\theta$. Noting that computing the right-hand side of (10) involves back-solves on a large dense matrix, we can interpret our gradient ascent updates in the maximization phase (8) as an iterative scheme for approximating this quantity.
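To see the Newton-step interpretation concretely, the toy check below maximizes a concave quadratic loss in feature space, where the Tikhonov-regularized Newton step is exact because the Hessian is constant (the matrices here are arbitrary illustrative choices):

```python
import numpy as np

# Toy check: for a quadratic loss l(z) = b^T z + 0.5 z^T A z, the maximizer
# of l(z) - (gamma/2)||z - z0||^2 coincides exactly with the
# Tikhonov-regularized Newton step z0 + (gamma*I - A)^{-1} grad_l(z0);
# for a general smooth loss the correspondence holds approximately.
rng = np.random.default_rng(0)
A = -np.eye(3) * 0.5                 # concave quadratic, Hessian A
b = rng.normal(size=3)
z0 = rng.normal(size=3)
gamma = 2.0

grad_z0 = b + A @ z0
newton_step = z0 + np.linalg.solve(gamma * np.eye(3) - A, grad_z0)

# direct maximization: stationarity of b + A z - gamma (z - z0) = 0
z_star = np.linalg.solve(gamma * np.eye(3) - A, b + gamma * z0)
assert np.allclose(newton_step, z_star)
```

Large $\gamma$ (a heavily regularized, hence short, step) corresponds to a small transport budget in the semantic space.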
We assume sufficient smoothness, where we use $\|A\|_{\mathrm{op}}$ to denote the operator norm of a matrix $A$.

Assumption 1. There exists $L_0 > 0$ such that, for all $z$, we have $\|\nabla_z \ell(\theta; (z, y))\|_2 \le L_0$ and $\|\nabla_{zz} \ell(\theta; (z, y))\|_{\mathrm{op}} \le L_0$.

Assumption 2. There exists $L_1 > 0$ such that, for all $z, z'$, we have $\|\nabla_{zz} \ell(\theta; (z, y)) - \nabla_{zz} \ell(\theta; (z', y))\|_{\mathrm{op}} \le L_1 \|z - z'\|_2$.

Then, we have the following bound of the form (10), whose proof we defer to Appendix A.1.

Theorem 3.1. Let Assumptions 1 and 2 hold. If $\gamma$ is sufficiently large and $\epsilon$ is sufficiently small, then the approximation (10) holds up to an error that vanishes as $\gamma$ grows.
3.2 DataDependent Regularization
In this section, we argue that, under suitable conditions on the loss, the robust surrogate loss (5) corresponds to a particular data-dependent regularization scheme. Let $\ell$ be the $K$-class softmax loss (2), given by
$$\ell(\theta; (z, y)) = \log \sum_{k=1}^{K} \exp\big(\theta_{c,k}^{\top} z\big) - \theta_{c,y}^{\top} z,$$
where $\theta_{c,k}$ is the $k$-th column of the classification layer weights $\theta_c$. Then, the robust surrogate $\phi_\gamma$ acts as an approximate regularizer on the classification layer weights:
$$\phi_\gamma\big(\theta; (z^0, y)\big) \approx \ell\big(\theta; (z^0, y)\big) + \frac{1}{2\gamma} \Big\| \sum_{k=1}^{K} p_k\, \theta_{c,k} - \theta_{c,y} \Big\|_2^2, \quad \text{where } p_k = \frac{\exp(\theta_{c,k}^{\top} z^0)}{\sum_{j=1}^{K} \exp(\theta_{c,j}^{\top} z^0)}. \tag{11}$$
The expansion (11) shows that the robust surrogate (5) is roughly equivalent to data-dependent regularization, where we minimize the distance between $\sum_k p_k\, \theta_{c,k}$, our "average estimated linear classifier", and $\theta_{c,y}$, the linear classifier corresponding to the true label $y$. Letting $\bar{\ell}(z) = \ell(\theta; (z, y))$ for a fixed label $y$, we have the following result, whose proof we defer to Appendix A.3.

Theorem 3.2. If $\gamma$ is sufficiently large, the softmax loss (2) satisfies the expansion (11) up to higher-order terms in $1/\gamma$.
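The regularizer in the expansion (11) is driven by the feature-space gradient of the softmax loss, which equals the "average estimated linear classifier" minus the true-label classifier. The following sketch verifies this identity numerically against finite differences ($\theta_{c,k}$ maps to the columns of `theta_c`; names are illustrative):

```python
import numpy as np

def softmax_feature_grad(theta_c, z, y):
    """Gradient of the softmax loss (2) with respect to the features z:
    sum_k p_k theta_{c,k} - theta_{c,y}, i.e. the average estimated
    linear classifier minus the classifier of the true label."""
    scores = theta_c.T @ z
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return theta_c @ p - theta_c[:, y]

# finite-difference check of the identity
rng = np.random.default_rng(1)
theta_c, z, y, eps = rng.normal(size=(4, 3)), rng.normal(size=4), 1, 1e-6

def loss(zz):
    s = theta_c.T @ zz
    return np.log(np.exp(s).sum()) - s[y]

num = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
                for e in np.eye(4)])
assert np.allclose(num, softmax_feature_grad(theta_c, z, y), atol=1e-5)
```

The squared norm of this gradient, scaled by $1/(2\gamma)$, is exactly the data-dependent penalty appearing in (11).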
4 Experiments
We evaluate our method in both classification and semantic segmentation settings, following the evaluation scenarios of domain adaptation techniques Ganin ; ADDA ; FCNInTheWild , though in our case the target domains are unknown at training time. We summarize our experimental setup, including implementation details, evaluation metrics, and datasets for each task. We compare our method against the empirical risk minimization (ERM) baseline in all of our results.
Digit classification
We train on the MNIST MNIST dataset and test on MNIST-M Ganin , SVHN SVHN , SYN Ganin and USPS USPS (see Figure 1 (top)). We use digit samples for training and evaluate our models on the respective test sets of the different target domains, using accuracy as the metric. In order to work with comparable datasets, we resized all the images to a common resolution and treated images from MNIST and USPS as RGB. We use a ConvNet ConvNet with architecture conv-pool-conv-pool-fc-fc-softmax and set the hyperparameters , , and . In the minimization phase, we use Adam Adam with batch size equal to .
Semantic scene segmentation
We use the SYNTHIA SYNTHIA dataset for semantic segmentation. The dataset contains images from different locations (we use Highway, New York-like City and Old European Town) and different weather/time/date conditions (we use Dawn, Fog, Night, Spring and Winter; see Figure 1 (bottom)). We train models on a source domain and test on the other domains, using the standard mean Intersection Over Union (mIoU) metric to evaluate performance VOC2008 . We arbitrarily chose images from the left front camera throughout our experiments. For each domain, we sample random images (resized to  pixels) from the training set. We use a Fully Convolutional Network (FCN) FCN with a ResNet-50 ResNet body and set the hyperparameters , , and . For the minimization phase, we use Adam Adam with batch size equal to .
4.1 Results on Digit Classification
In this section, we present and discuss the results of the digit classification experiment. Figure 2 (middle) shows the performance of models trained with different numbers of iterations. Comparing models trained with Algorithm 1 (blue) and models trained with ERM (red), it can be observed that our method leads to statistically significant improvements when the models are tested on data distributions very different from the training one (SVHN, MNIST-M and SYN). On USPS, our method causes accuracies to drop: since MNIST and USPS are very similar datasets (see Figure 1), the image domain that USPS belongs to is not explored by our algorithm during the training procedure, which optimizes for worst-case performance. Figure 2 (top) shows the performance of models trained with different values of the hyperparameter. These plots show that on out-of-sample domains (SVHN, MNIST-M and SYN), our method outperforms ERM for any value of this hyperparameter, a desirable property, since it cannot be properly cross-validated. We further report in Appendix B a comparison between our method and an unsupervised domain adaptation algorithm (ADDA ADDA ), and results associated with different values of the hyperparameters.
Finally, we report the results obtained by learning an ensemble of models. Since the hyperparameter $\gamma$ is non-trivial to set a priori, we use the softmax confidences (9) to choose which model to use at test time. We learn an ensemble of models, each of which is trained by running Algorithm 1 with a different hyperparameter value. Figure 2 (bottom) shows the comparison between our method with different numbers of iterations (blue) and ERM (red). In order to separate out the role of ensemble learning, we also learn an ensemble of baseline models, each corresponding to a different initialization. We fix the number of models in the ensemble to be the same for both the baseline (ERM) and our method. Comparing Figure 2 (middle) and (bottom), our ensemble approach yields higher accuracy in most of the testing scenarios. It is also worth noting the reduced performance gap in the USPS experiment.
4.2 Results on Semantic Scene Segmentation
We report a comparison between models trained with ERM and models trained with our method (Algorithm 1). We set the same hyperparameter value in every experiment; let us stress that this is an arbitrary choice, as we did not observe a strong correlation between different values and the general behavior of the models in this case. Its role would be much more meaningful in an ensemble setting, where each model would be associated with a different level of robustness, as discussed in Section 2. In this setting, we do not apply the ensemble approach, but only evaluate the performance of the single models. The main reason for this choice is that the heuristic developed to choose the correct model at test time cannot be applied in a straightforward fashion to a semantic segmentation problem. One could apply it to single pixels and combine the outputs of the different models according to the softmax distributions. Some preliminary results showed that this approach could be applicable, but we reserve a detailed analysis for future work.
Figure 3 reports the numerical results obtained. Specifically, the leftmost plots report results for models trained on sequences from the Highway split and tested on the New York-like City and Old European Town splits (top-left and bottom-left, respectively); the rightmost plots report results for models trained on sequences from the New York-like City split and tested on the Highway and Old European Town splits (top-right and bottom-right, respectively). The training sequences (Dawn, Fog, Night, Spring and Winter) are indicated on the x-axis. Red and blue bars indicate the average mIoUs achieved by models trained with ERM and by models trained with our method, respectively. These results were calculated by averaging over the mIoUs obtained with each model on the different conditions of the test set. As can be observed, models trained with our method mostly generalize better to unknown data distributions. In particular, our method always outperforms the baseline by a statistically significant margin when the training images are from Night scenarios. This is because a baseline model trained on images from Night is strongly biased towards dark scenery, while, as a consequence of training over worst-case distributions, our models can overcome this strong bias and generalize better across different unseen domains.
5 Conclusions and Future Work
We study a new adversarial data augmentation procedure that learns to better generalize across unseen data distributions, and define an ensemble method to exploit this technique in a classification framework. This is in contrast to domain adaptation algorithms, which require a sufficient number of samples from a known, a priori fixed target distribution. Our experimental results show that our iterative procedure provides broad generalization behavior on digit recognition and cross-season and cross-weather semantic segmentation tasks.
For future work, we hope to extend the ensemble methods by defining novel decision rules. The proposed heuristics (9) only apply to classification settings, and extending them to a broad realm of tasks including semantic segmentation is an important direction. Many theoretical questions still remain. For instance, quantifying the behavior of datadependent regularization schemes presented in Section 3 would help us better understand adversarial training methods in general.
References
 [1] Shai BenDavid, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.
 [2] Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. arXiv:1604.01446 [math.PR], 2016.
 [3] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pages 120–128, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
 [4] J Frédéric Bonnans and Alexander Shapiro. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
 [5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 [6] J. S. Denker, W. R. Gardner, H. P. Graf, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, H. S. Baird, and I. Guyon. Advances in neural information processing systems 1. chapter Neural Network Recognizer for Handwritten Zip Code Digits, pages 323–331. 1989.
 [7] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
 [8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascalnetwork.org/challenges/VOC/voc2008/workshop/index.html.
 [9] Yaroslav Ganin and Victor S. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 611 July 2015, pages 1180–1189, 2015.
 [10] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domainadversarial training of neural networks. J. Mach. Learn. Res., 17(1), January 2016.
 [11] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 [13] Christina HeinzeDeml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. arXiv preprint arXiv:1710.11469, 2017.
 [14] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and outofdistribution examples in neural networks. CoRR, abs/1610.02136, 2016.
 [15] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. Fcns in the wild: Pixellevel adversarial and constraintbased adaptation. CoRR, abs/1612.02649, 2016.
 [16] Hal Daumé III and Daniel Marcu. Domain adaptation for statistical classifiers. CoRR, abs/1109.6341, 2011.
 [17] Justin Johnson, Alexandre Alahi, and Li FeiFei. Perceptual losses for realtime style transfer and superresolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
 [18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, December 1989.
 [20] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
 [21] Jaeho Lee and Maxim Raginsky. Minimax statistical learning and domain adaptation with wasserstein distances. arXiv preprint arXiv:1705.07815, 2017.
 [22] K. Levenberg. A method for the solution of certain problems in least squares. Quarterly of Applied Mathematics, 2:164–168, 1944.
 [23] Da Li, Yongxin Yang, YiZhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. CoRR, abs/1710.03077, 2017.
 [24] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
 [25] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. Robust place categorization with deep domain generalization. IEEE Robotics and Automation Letters, 3(3):2093–2100, July 2018.
 [26] Donald W Marquardt. An algorithm for leastsquares estimation of nonlinear parameters. Journal of the society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
 [27] Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. Minimalentropy correlation alignment for unsupervised deep domain adaptation. International Conference on Learning Representations, 2018.
 [28] Saeid Motiian, Marco Piccirilli, Donald A. Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [29] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 10–18, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
 [30] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 [31] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
 [32] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [33] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV’10, pages 213–226, Berlin, Heidelberg, 2010. SpringerVerlag.
 [34] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via crossgradient training. In International Conference on Learning Representations, 2018.
 [35] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
 [36] Baochen Sun and Kate Saenko. Deep CORAL: correlation alignment for deep domain adaptation. In ECCV Workshops, 2016.
 [37] Joshua Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. CoRR, abs/1703.06907, 2017.
 [38] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’11, pages 1521–1528, Washington, DC, USA, 2011. IEEE Computer Society.
 [39] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [40] Riccardo Volpi, Pietro Morerio, Silvio Savarese, and Vittorio Murino. Adversarial feature augmentation for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Appendix A Proofs
A.1 Proof of Theorem 3.1
Recall that we consider a fixed , , , and . We begin by noting that since , we have
(12) 
Similarly as , let be an optimizer of the problem (12).
To further ease notation, let us denote by
the first- and second-order approximations of around , respectively.
First, we note that by hypothesis, and hence attains the maximum in the problem (12):
(13)  
Now, note that is -strongly concave since
by Assumption 3.1, where denotes the maximum eigenvalue. Recalling the definition of given in Eq. (12), we then have
(14) 
where we used the definition of in the last inequality.
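The inequality above rests on the standard quadratic bound implied by strong concavity; since the extraction dropped the original symbols, we state the generic form with assumed notation ($f$ for the objective, $\gamma$ for the concavity modulus):

```latex
% Quadratic upper bound from gamma-strong concavity (standard fact; notation assumed)
\[
\nabla^2 f(x) \preceq -\gamma I \ \ \forall x
\quad\Longrightarrow\quad
f(y) \;\le\; f(x) + \nabla f(x)^\top (y - x) - \frac{\gamma}{2}\,\|y - x\|^2
\qquad \forall x, y,
\]
```

so the gap between the objective at any point and at the maximizer controls the squared distance to the maximizer, which is how the derivation trades function values for distances.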
Next, we note that and are close by Taylor expansion.

Lemma ((Nesterov06, Lemma 1)). Let have a Lipschitz Hessian, so that for all , . Then, for all ,
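For reference, the standard statement of this result (Nesterov, 2006, Lemma 1) is the following cubic accuracy bound for the second-order Taylor model; the symbols $f$, $L_H$ are assumed here, since the extraction dropped the originals:

```latex
% Cubic accuracy of the second-order Taylor expansion under an L_H-Lipschitz Hessian
\[
\|\nabla^2 f(x) - \nabla^2 f(y)\|_{\mathrm{op}} \le L_H \|x - y\| \ \ \forall x, y
\;\Longrightarrow\;
\Bigl| f(y) - f(x) - \nabla f(x)^\top (y - x)
      - \tfrac{1}{2}(y - x)^\top \nabla^2 f(x)\,(y - x) \Bigr|
\;\le\; \frac{L_H}{6}\,\|y - x\|^3 .
\]
```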
Applying Lemma A.1, we have that
Using this inequality in the bound (14), we arrive at
(15) 
From definition (13) of , we have
(16) 
Next, to bound in the bound (15), we show that and are at most away. We defer the proof of the following lemma to Appendix A.2.

Lemma. Let Assumption 3.1 hold and . Then,
Applying Lemma A.1 to bound on the right-hand side of inequality (15), and using the bound (16) for , we obtain
This gives the final result.
A.2 Proof of Lemma A.1
We use the following key lemma, which says that the minimum of a function satisfying a growth condition is stable under perturbations of the function.

Lemma ((BonnansSh13, Proposition 4.32)). Suppose that satisfies the second-order growth condition: there exists a such that, denoting by the minimizer of so that , we have for all
If there is a function such that is Lipschitz on a neighborhood of , then any approximate minimizer of in satisfies
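As a sanity check of this stability principle (not the exact constants of Proposition 4.32), the following self-contained snippet verifies, on a toy one-dimensional problem, that perturbing a function with $\lambda$-quadratic growth by an $L$-Lipschitz term moves the minimizer by at most $L/\lambda$; the specific functions and constants are illustrative assumptions:

```python
import math

# f has second-order growth with lam = 1 around its minimizer x* = 0: f(x) >= f(0) + 1 * x**2
f = lambda x: x ** 2
# g is L-Lipschitz with L = 0.3 (|g'| = 0.3 |cos| <= 0.3)
g = lambda x: 0.3 * math.sin(x)
lam, L = 1.0, 0.3

# brute-force minimizer of the perturbed function f + g on a fine grid over [-2, 2]
grid = [i * 1e-4 for i in range(-20000, 20001)]
x_hat = min(grid, key=lambda x: f(x) + g(x))

# Stability argument: lam * |x_hat - 0|**2 <= f(x_hat) - f(0) <= g(0) - g(x_hat)
#                     <= L * |x_hat - 0|,  hence |x_hat| <= L / lam.
assert abs(x_hat) <= L / lam
print(f"minimizer shift {abs(x_hat):.4f} <= bound {L / lam:.4f}")
```

Here the perturbed minimizer solves $2x + 0.3\cos x = 0$, i.e. $x \approx -0.148$, comfortably within the $L/\lambda = 0.3$ bound.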
A.3 Proof of Theorem 11
Again, we abuse notation by writing for , and similarly and . We begin by noting that since , we have
The following claim will be crucial.

Claim. If is Lipschitz with respect to the norm, then
From Taylor’s theorem, we have
Using this approximation in the definition of , we get
Similarly, we can compute the lower bound
Combining the two bounds, the claim follows.
From the claim, it suffices to show that is Lipschitz. From , we have
Now, since
we conclude that
Appendix B Additional Experimental Results
Table 1 reports results associated with the digit experiment (Section 4.1, Figure 2). In particular, it reports numerical results (averaged over different runs) obtained with models trained with Algorithm 1, varying the hyperparameters and . The training set consists of MNIST samples; models were tested on SVHN, MNIST-M, SYN and USPS (see Figure 1 (top)). The baselines (accuracies achieved by models trained with ERM) are:
- SVHN:
- MNIST-M:
- SYN:
- USPS:
Table 2 reports results associated with the semantic segmentation experiment (Section 4.2, Figure 3). In summary, it reports results obtained by training models on Highway and testing them on New York-like City and Old European Town, and by training models on New York-like City and testing them on Highway and Old European Town (see Figure 1 (bottom) for the different weather/time/date conditions). The comparison is between models trained with ERM (ERM rows) and our method (Ours rows), \ie, Algorithm 1 with and .
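For concreteness, the augmentation step underlying these results (as described in the abstract: appending fictitious examples that are "hard" under the current model, obtained by ascending the loss minus a distance penalty to the source point) can be sketched as below. This is a minimal illustration on a toy logistic model, not the paper's implementation; the function names, the fixed weights `theta`, and the values of `gamma`, `lr`, and `steps` are all assumptions.

```python
import numpy as np

def adversarial_augment(x0, y, grad_loss_x, gamma=1.0, lr=0.1, steps=15):
    """Gradient ascent on  loss(theta; x, y) - gamma * ||x - x0||^2 :
    returns a fictitious 'hard' example that stays near the source point x0.
    (Sketch of the inner maximization; gamma / lr / steps are illustrative.)"""
    x = x0.copy()
    for _ in range(steps):
        x += lr * (grad_loss_x(x, y) - 2.0 * gamma * (x - x0))
    return x

# Toy model: logistic loss with fixed weights theta (illustrative assumption).
theta = np.array([1.0, -2.0])

def loss(x, y):
    return np.log1p(np.exp(-y * theta @ x))

def grad_loss_x(x, y):
    # d/dx log(1 + exp(-y * theta.x)) = -y * sigmoid(-y * theta.x) * theta
    s = 1.0 / (1.0 + np.exp(y * theta @ x))
    return -y * s * theta

x0 = np.array([0.5, 0.5])
y = 1
x_adv = adversarial_augment(x0, y, grad_loss_x)

# The fictitious point is harder for the current model, and (x_adv, y) would
# be appended to the training set before the next minimization phase.
assert loss(x_adv, y) > loss(x0, y)
print("loss increase:", loss(x_adv, y) - loss(x0, y))
```

The distance penalty `gamma * ||x - x0||^2` is what keeps the fictitious target distribution near the source domain in the worst-case formulation; larger `gamma` yields more conservative augmentations.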
Finally, Figure 4 reports a comparison between our method (blue) and the unsupervised domain adaptation algorithm ADDA ADDA (yellow), varying the number of target images fed to the latter during training. Note that, since unsupervised domain adaptation algorithms make use of target data during training while our method does not, the comparison is not fair; however, we are interested in evaluating to what extent our method can compete with a well-performing unsupervised domain adaptation algorithm ADDA . While on the MNIST → USPS split ADDA clearly outperforms our method, on MNIST → MNIST-M the accuracies reached by our method are only slightly lower than those reached by ADDA, and on MNIST → SYN our method outperforms it, even though the domain adaptation algorithm has access to a large number of samples from the target domain. Finally, note that MNIST → SVHN results are not provided because ADDA would not converge on this split (indeed, these results are not reported in the original work ADDA either). In contrast, models trained on MNIST samples using our method generalize better to SVHN, as shown in Section 4.1.
K=1  K=2  K=3  K=4
SVHN
MNIST-M