Learning to Optimize Domain Specific Normalization
with Domain Augmentation for Domain Generalization
Abstract
We propose a simple but effective multi-source domain generalization technique based on deep neural networks by incorporating optimized normalization layers specific to individual domains. Our approach employs multiple normalization methods while learning a separate affine parameter per domain. For each domain, the activations are normalized by a weighted average of multiple normalization statistics, and the statistics are tracked separately for each normalization type where necessary. Specifically, we employ batch and instance normalization in our implementation and attempt to identify the best combination of the two normalization methods in each domain and normalization layer. In addition, we augment new domains through combinations of multiple existing domains to increase the diversity of source domains available during training. The optimized normalization layers and the domain augmentation effectively enhance the generalizability of the learned model. We demonstrate the state-of-the-art accuracy of our algorithm on standard benchmark datasets.
1 Introduction
Domain generalization aims to learn generic feature representations agnostic to domains and make trained models perform well in completely new domains. To achieve this challenging goal, one needs to train models that can capture useful information observed commonly in multiple domains and recognize semantically related but visually inconsistent examples effectively. Many real-world problems have similar objectives, so this task can be widely used in various practical applications. Domain generalization is closely related to unsupervised domain adaptation, but there is a critical difference regarding the availability of target domain data; contrary to unsupervised domain adaptation, domain generalization cannot access examples in the target domain during training but is still required to capture transferable information across domains. Due to this constraint, the domain generalization problem is typically considered more difficult and usually involves multiple source domains to make the problem more feasible.
Domain generalization techniques are classified into several groups depending on their approaches. Some algorithms define novel loss functions to learn domain-agnostic representations [muandet2013domain, motiian2017unified, li2018domain] while others are more interested in designing deep neural network architectures to realize similar goals [pacs, dinnocente2018domain, mancini2018best]. Algorithms based on meta-learning have been proposed under the assumption that there exists a held-out validation set [balaji2018metareg, li2019episodic, li2018learning].
Our algorithm belongs to the second category, i.e., network architecture design methods. In particular, we are interested in exploiting normalization layers in deep neural networks to solve the domain generalization problem. The most naïve approach would be to train a single deep neural network with batch normalization using all training examples regardless of their domain memberships. This method works fairly well, partly because batch normalization regularizes feature representations from heterogeneous domains and a trained model is often capable of adapting to unseen domains. However, the benefit of batch normalization is limited when the domain shift is significant, and we are often required to remove domain-specific styles for better generalization. Instance normalization [in] turns out to be an effective scheme for this goal, and incorporating both batch and instance normalization further improves accuracy through a data-driven balancing of the two normalization methods [bin]. Our approach also employs the two normalizations but proposes a more sophisticated algorithm designed for domain generalization.
We explore domain-specific normalizations to learn domain-agnostic discriminative representations by discarding domain-specific ones. The goal of our algorithm is to optimize the combination of normalization techniques in each domain while different domains learn separate parameters for the mixture of normalizations. The intuition behind this approach is that we can learn domain-invariant representations by controlling the types of normalization and the parameters in normalization layers. Note that all other parameters, including the ones in convolutional layers, are shared across domains. Although our approach is similar to [bin, sn] in the sense that they attempt to learn the optimal mixing weights between normalization types, our domain-specific optimized normalization technique is unique since it learns normalization parameters and maintains normalization statistics in each domain separately. Figure 1 illustrates the main idea of our approach.
In addition, we propose a simple domain augmentation technique to increase the number of source domains used for training. Specifically, we generate a new source domain by unifying multiple existing ones, by which the resulting domains have different statistics compared to the original ones. Albeit simple, an ensemble of the models trained using multiple combinations of the original and generated domains consistently improves the accuracy compared to the comparable ensemble classifiers based only on the original domains.
Our contributions are as follows:

- We propose a domain generalization technique using multiple heterogeneous normalization methods specific to individual domains, which facilitates extracting domain-agnostic feature representations of input examples by effectively removing domain-specific information.

- Our approach optimizes the weights of individual normalization types jointly with the rest of the network, and the resulting model is capable of balancing between normalization methods to capture information useful for classification in unseen domains.

- We diversify source domains by domain augmentation, which constructs new domains as simple combinations of multiple existing ones, to improve accuracy.

- The proposed algorithm achieves state-of-the-art accuracy on multiple standard benchmark datasets and presents consistent benefits of individual components.
2 Related Work
This section discusses existing domain generalization approaches and reviews two related topics, multi-source domain adaptation and normalization techniques in deep neural networks.
2.1 Domain Generalization
Domain generalization algorithms learn domain-invariant representations given input examples regardless of their domain memberships. Since target domain information is not available at training time, they typically rely on multiple source domains to extract knowledge applicable to any unseen domain. Existing domain generalization approaches can be roughly categorized into three classes. The first group of methods proposes novel loss functions that encourage learned representations to generalize well to new domains. Muandet et al. [muandet2013domain] propose domain-invariant component analysis, which performs dimensionality reduction to make the feature representation invariant to domains. A few recent works [motiian2017unified, li2018domain] also attempt to learn a shared embedding space appropriate for semantic matching across domains. Another line of approaches tackles the domain generalization problem by manipulating deep neural network architectures. Domain-specific information is handled by designated modules within deep neural networks [pacs, dinnocente2018domain], while [mancini2018best] proposes a soft model selection technique to obtain generalized representations. Recently, meta-learning-based techniques have started to be used to solve domain generalization problems. MLDG [li2018learning] extends MAML [finn2017model] to the domain generalization task. Balaji et al. [balaji2018metareg] point out the limitation of [li2018learning] and propose a regularizer to address domain generalization in a meta-learning framework directly. Also, [li2019episodic] presents an episodic training technique appropriate for domain generalization. Note that, to our knowledge, none of the existing methods exploit normalization types and their parameters for domain generalization.
2.2 Multi-Source Domain Adaptation
Multi-source domain adaptation can be considered the middle ground between domain adaptation and generalization, where data from multiple source domains are used for training in addition to examples in an unlabeled target domain. Although unsupervised domain adaptation is a very popular problem, its multi-source version is relatively less investigated. Zhao et al. [zhao2018adversarial] propose to learn features that are invariant to multiple domain shifts through adversarial training, and Guo et al. [guo2018multi] use a mixture-of-experts approach by modeling the inter-domain relationships between source and target domains. A recent work using domain-specific batch normalization [dsbn] has shown competitive performance in multi-source domain adaptation by aligning the representations of heterogeneous domains in a single common feature space.
2.3 Normalization in Neural Networks
Normalization techniques in deep neural networks were originally designed for regularizing trained models and improving their generalization performance. Various normalization techniques [bn, in, ln, spectralnorm, wn, gn, bin, sn, ssn, dsbn] have been studied actively in recent years. The most popular technique is batch normalization (BN) [bn], which normalizes activations over individual channels using data in a mini-batch, while instance normalization (IN) [in] performs the same operation per instance instead of per mini-batch. In general, IN is effective at removing instance-specific characteristics (e.g., style in an image); adding IN makes a trained model focus on instance-invariant information and increases the generalization capability of the model to unseen domains. Other normalizations such as layer normalization (LN) [ln] and group normalization (GN) [gn] follow the same concept, while weight normalization [wn] and spectral normalization [spectralnorm] normalize weights over the parameter space.
Recently, batch-instance normalization (BIN) [bin], switchable normalization (SN) [sn], and sparse switchable normalization (SSN) [ssn] employ combinations of multiple normalization types to maximize the benefit. Note that BIN considers batch and instance normalization while SN additionally uses LN. On the other hand, DSBN [dsbn] adopts separate batch normalization layers for each domain to deal with domain shift and generate domain-invariant representations.
3 Domain-Specific Optimized Normalization for Domain Generalization
This section describes our main algorithm, called domain-specific optimized normalization (DSON), in detail and also presents how the proposed method is employed to solve domain generalization problems.
3.1 Overview
Domain generalization aims to learn a domain-agnostic model, typically using multiple domains, to be applied to an unseen domain. Consider a set of training examples $\mathcal{X}_d$ with its corresponding label set $\mathcal{Y}_d$ in a source domain $d$. Our goal is to train a classifier using the data in $D$ ($D > 1$) source domains to correctly classify images in a target domain, which is unavailable during training.
The main objective of the domain generalization problem is to learn a joint embedding space across all source domains, which is expected to be valid in target domains as well. To this end, we train domain-invariant classifiers from each of the source domains and ensemble their predictions. To embed each example onto a domain-invariant feature space, we employ domain-specific normalization, which is described in the following sections.
Our classification network consists of a set of domain-specific feature extractors $\{F_d\}_{d=1}^{D}$ and a single fully connected layer $C$ shared across domains. Specifically, the feature extractors share all parameters across domains except for the ones in the normalization layers.
For each source domain $d$, the loss function is defined using the shared classifier $C$ and the domain-specific feature extractor $F_d$ as

$$\mathcal{L}_d = \frac{1}{|\mathcal{X}_d|} \sum_{(x, y)} \ell\big(C(F_d(x)), y\big), \tag{1}$$

where $\ell(\cdot, \cdot)$ is the cross-entropy loss and the sum runs over the example–label pairs $(x, y)$ of domain $d$. All network parameters are jointly optimized to minimize the sum of classification losses over all source domains:

$$\mathcal{L}_{\text{total}} = \sum_{d=1}^{D} \mathcal{L}_d. \tag{2}$$
Our domain-specific deep neural network model is obtained by backpropagating the total loss $\mathcal{L}_{\text{total}}$. To facilitate generalization, in the validation phase, we follow the leave-one-domain-out validation strategy proposed in [dinnocente2018domain]; the label of a validation example from domain $d$ is predicted by averaging the predictions from all domain-specific classifiers except for the one associated with domain $d$.
3.2 Batch and Instance Normalization
Normalization techniques [bn, in, ln, gn] are widely applied in recent network architectures for better optimization and regularization. Among them, we combine batch normalization (BN) and instance normalization (IN) to construct a domain-agnostic feature extractor.
Our intuition about the two normalization methods is as follows. For each channel, BN whitens activations over all the spatial locations of the instances within a mini-batch, whereas IN does so over the locations within a single instance. Compared to BN, IN is effective at reducing cross-category variance because it normalizes each instance independently. Figure 2 visualizes the change of the feature distribution after applying batch and instance normalization. When IN is integrated into a feature extractor, the features become less discriminative with respect to object categories compared to the ones with BN layers only. At the same time, the features become less discriminative with respect to domains, thereby making the learned representations less overfit to a particular domain. A mixture of IN with BN serves as regularization, and the resulting classifier tends to focus on high-level semantic information and is more likely to become invariant to domain shifts.
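To make this difference concrete, the following NumPy sketch (hypothetical helper names, independent of any specific deep learning framework) computes the two sets of statistics over a feature map of shape (N, C, H, W):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # BN: one mean/variance per channel, pooled over the batch (N)
    # and spatial (H, W) dimensions of the whole mini-batch.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # IN: a separate mean/variance per instance and channel, pooled
    # over the spatial (H, W) dimensions only.
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(8, 4, 16, 16)  # a mini-batch of feature maps
# After IN, every single instance has zero mean and unit variance per
# channel, so instance-specific (style) statistics are removed; after
# BN, only the batch as a whole is standardized.
```

Because IN wipes out per-instance first- and second-order statistics, which often encode style, the normalized features become less domain-discriminative, matching the behavior described above.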
Based on this intuition, we employ IN in addition to BN in all normalization layers of our network, where the means and variances of the two normalization types are linearly interpolated to make the feature extractor domain-agnostic.
3.3 Optimization for Domain-Specific Normalization
Based on the intuitions above, we propose domain-specific optimized normalization (DSON) for domain generalization. Given an example from domain $d$, the proposed domain-specific normalization layer transforms channel-wise whitened activations using affine parameters $\gamma_d$ and $\beta_d$. Note that the whitening is also performed per domain. At each channel, the activations $x_{(d)}$ are transformed as

$$\mathrm{DSON}(x_{(d)}; \gamma_d, \beta_d) = \gamma_d \cdot \bar{x}_{(d)} + \beta_d, \tag{3}$$

where the whitening is performed using the domain-specific mean and variance, $\mu_d$ and $\sigma_d^2$:

$$\bar{x}_{(d)} = \frac{x_{(d)} - \mu_d}{\sqrt{\sigma_d^2 + \epsilon}}. \tag{4}$$

We combine batch normalization (BN) and instance normalization (IN) in a similar manner to [sn] as

$$\mu_d = w_d\, \mu_d^{\mathrm{BN}} + (1 - w_d)\, \mu_d^{\mathrm{IN}}, \tag{5}$$

$$\sigma_d^2 = w_d\, \big(\sigma_d^{\mathrm{BN}}\big)^2 + (1 - w_d)\, \big(\sigma_d^{\mathrm{IN}}\big)^2, \tag{6}$$

where both sets of statistics are calculated separately in each domain. For activations $x_{(d)} \in \mathbb{R}^{N \times C \times H \times W}$, the BN statistics at channel $c$ are computed over the mini-batch and spatial dimensions,

$$\mu_d^{\mathrm{BN}} = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{(d)}^{(n,c,i,j)}, \tag{7}$$

$$\big(\sigma_d^{\mathrm{BN}}\big)^2 = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \big(x_{(d)}^{(n,c,i,j)} - \mu_d^{\mathrm{BN}}\big)^2, \tag{8}$$

and the IN statistics are computed per instance $n$ over the spatial dimensions only,

$$\mu_d^{\mathrm{IN}} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{(d)}^{(n,c,i,j)}, \tag{9}$$

$$\big(\sigma_d^{\mathrm{IN}}\big)^2 = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \big(x_{(d)}^{(n,c,i,j)} - \mu_d^{\mathrm{IN}}\big)^2. \tag{10}$$

The optimal mixture weights between BN and IN, $w_d \in [0, 1]$, are trained to minimize the loss in Eq. (2).
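Putting the equations of this section together, a single DSON layer for one domain can be sketched as follows. This is a minimal NumPy illustration under our notation, not the actual implementation: in practice the BN statistics would be replaced by running averages at test time, and w, gamma, and beta are learned parameters.

```python
import numpy as np

def dson_normalize(x, gamma, beta, w, eps=1e-5):
    """Domain-specific optimized normalization for one domain d.

    x:     activations of shape (N, C, H, W)
    gamma: domain-specific scale of shape (1, C, 1, 1)
    beta:  domain-specific shift of shape (1, C, 1, 1)
    w:     mixture weight in [0, 1]; w = 1 recovers BN, w = 0 recovers IN
    """
    mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)   # (1, C, 1, 1)
    var_bn = x.var(axis=(0, 2, 3), keepdims=True)
    mu_in = x.mean(axis=(2, 3), keepdims=True)      # (N, C, 1, 1)
    var_in = x.var(axis=(2, 3), keepdims=True)
    # Linearly interpolate the two sets of statistics.
    mu = w * mu_bn + (1.0 - w) * mu_in
    var = w * var_bn + (1.0 - w) * var_in
    # Whiten with the mixed statistics, then apply the domain-specific
    # affine transform.
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Setting the mixture weight to 1 or 0 recovers plain BN or IN, respectively; intermediate values trade batch-level statistics against instance-level (style-removing) statistics.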
3.4 Inference
A test example in a target domain is unseen during training. Hence, for inference, we feed the example $x$ to the feature extractors of all source domains. The final label prediction is given by computing the logits with the shared fully connected layer $C$, averaging the logits over domains, i.e., $\bar{z} = \frac{1}{D} \sum_{d=1}^{D} C(F_d(x))$, and finally applying a softmax function.
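This inference step can be sketched as follows (hypothetical function names; each slice of logits_per_domain would come from the shared classifier applied on top of one domain-specific feature extractor):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_domain):
    # logits_per_domain: array of shape (D, N, num_classes), one slice
    # per source-domain feature extractor. Average the logits across
    # domains, then apply softmax once for the final prediction.
    avg_logits = np.mean(logits_per_domain, axis=0)
    return softmax(avg_logits)
```

Since softmax is monotone, the predicted class is simply the argmax of the averaged logits.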
One potential issue in the inference step is whether target domains can rely on a model trained only on source domains. This is the main challenge of domain generalization, which assumes that reasonably good representations of target domains can be obtained from the information in source domains only. In our algorithm, we believe instance normalization in each domain has the capability to remove domain-specific styles and standardize the representation. Since each domain has different characteristics, we learn the relative weight of instance normalization in each domain separately. Thus, the predictions of each domain-specific branch should be accurate enough even for data in target domains. Additionally, aggregating the predictions of multiple networks trained on different source domains should further improve accuracy.
4 Domain Diversification via Augmentation
As domain generalization aims at working robustly over arbitrary domains, having diverse source domains in the training set is crucial for generalization ability. Based on this intuition, we propose to diversify source domains by domain augmentation. Our key idea is that a mixed set of samples from multiple domains can be interpreted as a new domain with deviated statistics from the original ones.
We employ a simple generation strategy; a new domain is constructed as a union of multiple existing domains. For example, suppose that there exist three original domains denoted by $\mathcal{D}_1$, $\mathcal{D}_2$, and $\mathcal{D}_3$, where each domain contains the instances belonging to it. We construct a new domain by taking the union of all the elements in two or more original domains, e.g., $\mathcal{D}_{12} = \mathcal{D}_1 \cup \mathcal{D}_2$. Although these are trivial augmentations, we believe this strategy is useful since an interpolated domain can be considered indirectly. There are various ways to use these new domains. We first construct three datasets, $\mathcal{S}_1$, $\mathcal{S}_2$, and $\mathcal{S}_3$, where $\mathcal{S}_1$ is the same as the original dataset. Then, we run DSON on each of the datasets and make an ensemble of the learned models to obtain the final results.
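The augmentation step above can be sketched as follows (a hypothetical helper, with domains represented simply as lists of samples):

```python
from itertools import combinations

def augment_domains(domains):
    """Create new 'domains' as unions of two or more original ones.

    domains: dict mapping a domain name to its list of samples.
    Returns a dict of augmented domains whose names concatenate the
    originals, e.g. 'AB' is the union of domains 'A' and 'B'.
    """
    augmented = {}
    names = sorted(domains)
    for r in range(2, len(names) + 1):
        for combo in combinations(names, r):
            augmented["".join(combo)] = [s for n in combo for s in domains[n]]
    return augmented

# Example: three original domains yield four augmented ones
# ('AB', 'AC', 'BC', and 'ABC'), whose statistics deviate from the
# statistics of any single original domain.
```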
5 Experiments
To demonstrate the effectiveness of domain-specific optimized normalization (DSON), we evaluate it on domain generalization benchmarks and provide an extensive ablation study of the algorithm.
5.1 Experimental Settings
Datasets
We evaluate the proposed method on three domain generalization benchmarks. The PACS dataset [pacs] is commonly used in domain generalization and is favored due to its large inter-domain shift across four domains: Photo, Art Painting, Cartoon, and Sketch. It contains a total of 9,991 images in 7 categories, with an image resolution of 227 × 227. We follow the experimental protocol in [pacs], where the model is trained on any three of the four domains (source domains) and then tested on the remaining domain (target domain). Office-Home [officehome] is a popular domain adaptation dataset, which consists of four distinct domains: Artistic Images, Clip Art, Product, and Real-World Images. Each domain contains 65 categories, with around 15,500 images in total. While the dataset is mostly used in the domain adaptation context, it can easily be repurposed for domain generalization by following the same protocol as for PACS. Finally, we employ five datasets—MNIST, MNIST-M, USPS, SVHN, and Synthetic Digits—for digit recognition and split the training and testing subsets following [xu2018deep].
Implementation details
For a fair comparison with prior work [balaji2018metareg, dinnocente2018domain, li2017domain, li2019episodic], we employ a ResNet-18 model as the backbone network in all experiments. The convolutional and BN layers are initialized with ImageNet-pretrained weights, while the fully connected layer is replaced and randomly initialized to match the number of classes of the object recognition task. We use a batch size of 32 images per source domain and optimize the network parameters over 10K iterations using SGD with a momentum of 0.9 and an initial learning rate $\eta_0$. As suggested in [zhao2018adversarial], the learning rate is annealed by $\eta_p = \eta_0 / (1 + \alpha p)^{\beta}$, where $\alpha = 10$, $\beta = 0.75$, and $p$ increases linearly from 0 to 1 as training progresses. We follow the domain generalization convention by training on the “train” split from each of the source domains and then testing on the combined “train” and “validation” splits of the target domain. We share the mixture weights among all channels and layers in our network, which facilitates optimization, especially in the lower layers according to our observation. This strategy improves accuracy substantially and consistently in all settings.
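For reference, the annealing schedule above can be written as a small function. The constants alpha = 10 and beta = 0.75 follow the common convention for this schedule, and eta0 is a placeholder for the initial learning rate (whose value was lost in extraction), so treat them as assumptions:

```python
def annealed_lr(p, eta0, alpha=10.0, beta=0.75):
    # eta_p = eta0 / (1 + alpha * p) ** beta, where p grows linearly
    # from 0 to 1 over the course of training.
    return eta0 / (1.0 + alpha * p) ** beta
```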
Table 1: Ablation study on the PACS (Art painting, Cartoon, Sketch, Photo) and Office-Home (Art, Clipart, Product, Real-World) datasets (accuracy, %).

| Method | Art painting | Cartoon | Sketch | Photo | Avg. | Art | Clipart | Product | Real-World | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (BN) | 78.27 | 71.54 | 73.96 | 93.65 | 79.36 | 58.71 | 44.20 | 71.75 | 73.19 | 61.96 |
| DSBN [dsbn] | 78.61 | 66.17 | 70.15 | 95.51 | 77.61 | 59.04 | 45.02 | 72.67 | 71.98 | 62.18 |
| SN [sn] | 82.50 | 76.80 | 80.77 | 93.47 | 83.38 | 54.10 | 44.97 | 64.54 | 71.40 | 58.75 |
| DSON ($\mathcal{S}_1$) | 82.91 | 75.70 | 81.52 | 94.31 | 83.61 | 58.72 | 44.61 | 72.04 | 73.65 | 62.25 |
| DSON ($\mathcal{S}_2$) | 85.06 | 78.58 | 78.44 | 95.03 | 84.27 | 59.95 | 45.25 | 72.29 | 74.62 | 63.02 |
| DSON ($\mathcal{S}_3$) | 84.57 | 76.82 | 81.83 | 95.51 | 84.68 | 59.29 | 45.74 | 71.60 | 73.51 | 62.53 |
| BN-Ensemble | 78.91 | 71.42 | 74.14 | 94.49 | 79.73 | 59.70 | 43.85 | 72.47 | 73.67 | 62.42 |
| DSON-Ensemble | 85.40 | 77.82 | 82.01 | 95.33 | 85.13 | 60.28 | 46.19 | 72.94 | 75.28 | 63.67 |
Table 2: Ablation study on the Digits datasets (accuracy, %).

| Method | MNIST | MNIST-M | USPS | SVHN | Synthetic | Avg. |
|---|---|---|---|---|---|---|
| Baseline (BN) | 86.15 | 74.44 | 90.07 | 81.29 | 94.46 | 85.28 |
| DSBN [dsbn] | 87.01 | 71.20 | 91.18 | 78.23 | 94.30 | 84.38 |
| SN [sn] | 89.28 | 78.40 | 88.54 | 79.12 | 95.66 | 86.20 |
| DSON ($\mathcal{S}_1$) | 89.68 | 75.86 | 90.38 | 82.19 | 95.48 | 86.89 |
| DSON ($\mathcal{S}_2$) | 90.53 | 80.48 | 90.33 | 80.38 | 95.41 | 87.43 |
| DSON ($\mathcal{S}_3$) | 91.62 | 79.73 | 89.69 | 80.16 | 95.52 | 87.34 |
| DSON ($\mathcal{S}_4$) | 89.62 | 79.00 | 91.63 | 81.02 | 95.34 | 87.32 |
| BN-Ensemble | 87.77 | 75.30 | 89.89 | 82.28 | 94.93 | 86.03 |
| DSON-Ensemble | 92.02 | 80.11 | 90.93 | 82.74 | 95.71 | 88.30 |
5.2 Ablation Study
PACS and Office-Home Datasets
Before comparing with the state-of-the-art domain generalization methods, we conduct an ablation study on the PACS and Office-Home datasets to assess the contribution of the individual components of our full algorithm. Table 1 presents the results, where our complete method is denoted by DSON. It also presents the accuracies of other methods, including external methods as well as internal variations of our algorithm. We first present results from the baseline method, where the model is trained naïvely with BN layers that are not specific to any single domain. Then, to examine the effect of domain-specific normalization layers, the BN layers are made specific to each of the source domains, which is denoted by DSBN [dsbn]. We also examine the suitability of SN [sn] by replacing BN layers with adaptive mixtures of BN, IN, and LN. We show three variations of the proposed network, DSON ($\mathcal{S}_1$), DSON ($\mathcal{S}_2$), and DSON ($\mathcal{S}_3$), which are trained using the three training datasets $\mathcal{S}_1$, $\mathcal{S}_2$, and $\mathcal{S}_3$, respectively; here, we follow the notation of the previous section. Finally, ensembling the three models (DSON-Ensemble) further improves the accuracy and shows much stronger performance than an ensemble of three independently trained BN models, denoted by BN-Ensemble. We do not include batch-instance normalization (BIN) [bin] in our experiment because it is hard to optimize and the results are unstable. The ablation study clearly illustrates the benefits of the individual components of our algorithm: the integration of multiple normalization methods, domain-specific normalization, and mixing-weight sharing across layers.
Digits Dataset
The results on the five digits datasets are shown in Table 2. We show four variations of the proposed network trained with four different training datasets: $\mathcal{S}_1$, $\mathcal{S}_2$, $\mathcal{S}_3$, and $\mathcal{S}_4$. Our best single model achieves 87.43% average accuracy, outperforming all other baselines by large margins. Ensembling further improves the accuracy compared to an ensemble of four independently trained BN models. Note that DSON is more effective in the hard domain (MNIST-M).
Table 3: Comparison with other domain generalization methods on PACS (accuracy, %).

| Method | Art painting | Cartoon | Sketch | Photo | Avg. |
|---|---|---|---|---|---|
| Baseline (BN) | 80.76 | 71.59 | 74.52 | 94.19 | 80.27 |
| JiGen [li2017domain] | 79.42 | 75.25 | 71.35 | 96.03 | 80.51 |
| D-SAM [dinnocente2018domain] | 77.33 | 72.43 | 77.83 | 95.30 | 80.72 |
| Epi-FCR [li2019episodic] | 82.10 | 77.00 | 73.00 | 93.90 | 81.50 |
| MetaReg* [balaji2018metareg] | 83.70 | 77.20 | 70.30 | 95.50 | 81.70 |
| DSON | 84.57 | 76.82 | 81.83 | 95.51 | 84.68 |
Table 4: Comparison with other domain generalization methods on Office-Home (accuracy, %).

| Method | Art | Clipart | Product | Real-World | Avg. |
|---|---|---|---|---|---|
| Baseline (BN) | 58.71 | 44.20 | 71.75 | 73.19 | 61.96 |
| JiGen [li2017domain] | 53.04 | 47.51 | 71.47 | 72.79 | 61.20 |
| D-SAM [dinnocente2018domain] | 58.03 | 44.37 | 69.22 | 71.45 | 60.77 |
| DSON | 59.29 | 45.74 | 71.60 | 73.51 | 62.53 |
Table 5: Domain generalization accuracy (%) on PACS with label noise; the last column reports the average accuracy drop relative to the noise-free setting.

| Noise level | Method | Art painting | Cartoon | Sketch | Photo | Avg. | Avg. drop |
|---|---|---|---|---|---|---|---|
| 0.2 | Baseline (BN) | 75.16 | 70.41 | 68.17 | 92.13 | 76.47 | 2.89 |
| 0.2 | DSBN [dsbn] | 77.10 | 66.00 | 59.43 | 94.85 | 74.35 | 3.27 |
| 0.2 | SN [sn] | 78.56 | 75.21 | 77.42 | 91.08 | 80.57 | 2.57 |
| 0.2 | DSON | 83.11 | 79.07 | 80.15 | 94.79 | 84.03 | 0.65 |
| 0.5 | Baseline (BN) | 73.49 | 64.68 | 58.95 | 89.22 | 71.59 | 7.77 |
| 0.5 | DSBN [dsbn] | 75.05 | 57.98 | 59.99 | 93.35 | 71.59 | 6.02 |
| 0.5 | SN [sn] | 78.27 | 74.06 | 72.23 | 88.86 | 78.36 | 4.78 |
| 0.5 | DSON | 80.22 | 73.85 | 77.37 | 94.91 | 81.59 | 3.09 |
5.3 Comparison with Other Methods
In this section, we compare our method with other domain generalization methods. For a fair comparison, we use DSON ($\mathcal{S}_3$) as our method.
PACS
Table 3 portrays the cross-domain object recognition accuracy on the PACS dataset. The proposed algorithm is compared with several existing methods, including JiGen [li2017domain], D-SAM [dinnocente2018domain], Epi-FCR [li2019episodic], and MetaReg [balaji2018metareg], and it outperforms both the baseline and the state-of-the-art technique by significant margins (4.41% and 3.18%, respectively). DSON achieves the highest accuracy on two of the four target domains and reaches a close second in the remaining two. The result shows that DSON is particularly useful for hard domains, achieving 7.31% and 5.23% accuracy gains over the baseline in the Sketch and Cartoon domains, respectively.
OfficeHome
We also evaluate DSON on the Office-Home dataset; the results are presented in Table 4. As on PACS, DSON outperforms the recently proposed D-SAM [dinnocente2018domain] as well as our baseline. DSON achieves the best score on the Art and Real-World domains and outperforms the baseline on every domain except “Product”, where it is slightly worse than the best model. Again, DSON is more advantageous in hard domains; it shows a 1.54% improvement over the baseline in the Clipart domain.
5.4 Additional Experiments
Domain Generalization with Label Noise
The performance of the proposed algorithm on the PACS dataset is tested in the presence of label noise, and the results are compared against other approaches. Two noise levels are tested (0.2 and 0.5), and the results are presented in Table 5. Although all algorithms undergo performance degradation, DSON suffers the smallest accuracy drop at both noise levels and thus turns out to be more reliable with respect to label noise than the other models.
Table 6: Multi-source domain adaptation accuracy (%) on PACS.

| Method | Art painting | Cartoon | Sketch | Photo | Avg. |
|---|---|---|---|---|---|
| Baseline (BN) | 78.85 | 76.79 | 78.14 | 99.42 | 83.30 |
| DSBN [dsbn] | 88.94 | 83.54 | 77.39 | 99.42 | 87.32 |
| SN [sn] | 79.00 | 77.66 | 79.11 | 97.78 | 83.39 |
| DSON | 86.54 | 88.61 | 86.93 | 99.42 | 90.38 |
Multi-Source Domain Adaptation
DSON can be extended to the multi-source domain adaptation task, where we gain access to unlabeled data from the target domain. To compare the effect of different normalization methods, we adopt Shu et al. [shu2019transferable] as the baseline method and vary only the normalization. The results are shown in Table 6, where we compare DSON with SN and DSBN [dsbn]. In direct contrast to the results from the ablation analysis in Table 1, DSBN is clearly superior to SN here. This is unsurprising, given that DSBN is designed specifically for the domain adaptation task. We find that DSON outperforms not only the baseline but also DSBN, which demonstrates how effectively DSON can be extended to the domain adaptation task. It shows that domain-specific models consistently outperform their domain-agnostic counterparts.
6 Conclusion
We presented a simple but effective domain generalization algorithm based on domain-specific optimized normalization layers. The proposed algorithm uses multiple normalization methods while learning a separate affine parameter set per domain, and the mixture weights are employed to compute a weighted average of the normalization statistics. This strategy turns out to be helpful for learning domain-invariant representations since instance normalization removes domain-specific styles effectively and makes the trained model focus on semantic category information. In addition, we proposed a domain augmentation method to diversify the source domains. The proposed algorithm achieves state-of-the-art accuracy consistently on multiple standard benchmark datasets, even with substantial label noise. We also showed that the domain-specific optimization of normalization types is well suited for unsupervised multi-source domain adaptation.