Improve Model Generalization and Robustness to Dataset Bias with Bias-regularized Learning and Domain-guided Augmentation
Deep learning has thrived on the emergence of biomedical big data. However, medical datasets acquired at different institutions carry inherent bias caused by various confounding factors such as operating policies, machine protocols, and treatment preferences. As a result, a model trained on one dataset, regardless of its volume, cannot be confidently applied to another. In this study, we investigated model robustness to dataset bias using three large-scale chest X-ray datasets: first, we assessed the dataset bias using a vanilla training baseline; second, we proposed a novel multi-source domain generalization model by (a) designing a new bias-regularized loss function and (b) synthesizing new data for domain augmentation. We show that our model significantly outperforms the baseline and other approaches on data from an unseen domain, in terms of both accuracy and several bias measures, without retraining or fine-tuning. Our method is generally applicable to other biomedical data, providing new algorithms for training models robust to bias in big data analysis and applications. Demo training code is publicly available at https://github.com/ydzhang12345/Domain-Generalization-by-Domain-guided-Multilayer-Cross-gradient-Training.
Dataset bias is inherent to big biomedical data, due to its heterogeneity, incidental endogeneity and dynamic nature [1, 2, 3]. Such bias violates the central assumption of machine learning models: that train and test data are independent and identically distributed (IID). As a result, dataset bias leads to a difference between the estimated and the true value of the desired model parameters, which is the bias commonly used in statistics to describe the performance of an estimator [4, 5]. Fig. 1a & b describes the potential problems and sources of dataset bias in medical datasets, where $P_h$, $P_p$ and $P_d$ are the distributions of the hospital-specific process, the patients and the doctors (labelers), respectively. Due to different hardware conditions, diagnosis policies and many other hospital-specific factors, $P_h$ varies across hospitals. As a result, a model trained on internal hospitals cannot generalize to external sites if the internal datasets do not span the distribution of $P_h$. This can be dangerous, because clinicians might wrongly trust the system based on internal observations alone.
I-A Case Study - Chest X-rays
Recently, Zech et al. [6, 7] collected in total 158,323 CXRs from three institutions (NIH, MSU and IU) and studied the dataset bias problem carefully. By manually reviewing the CXRs, they observed that: (1) more than 80% of CXRs in MSU were tagged as portable inpatient scans, as those patients were too sick to move; in contrast, all CXRs of IU were outpatient and carried no such tags; as a result, a model trained on MSU leveraged detection of the portable tag to calibrate its prediction, leading to poor generalization on the IU dataset; (2) NIH CXRs labelled with Pneumothorax frequently show chest tubes; consequently, a model trained on NIH heavily relied on detecting the obvious chest tube instead of the subtler pneumothorax itself. This feature is not useful for prediction, because a chest tube is placed only after the patient has been treated. They further noted that these hospital-system-specific biases were complicated, making it hard to fully assess which factors exactly were contributing to the predictions. But clearly all models suffered performance degradation on external sets.
Following this line of research, we also collected more than 550,000 CXRs from three publicly available datasets: NIH, CheXpert and MIMIC [8, 9, 10]. Instead of manually studying the images, we use a statistical tool called the Classifier Two-sample Test [11] to unveil dataset biases. The basic idea is to train a dataset classifier to tell which dataset an input comes from; if there is no domain shift or hospital-specific bias, the classifier should perform no better than random guessing. Surprisingly, we found that although the data are of the same modality and body part, a simple convolutional neural network (CNN) achieved near-perfect classification accuracy (see Appendix A for more details), meaning that severe dataset biases were induced during data creation.
I-B Problem Statement - Learning under Unknown Bias
Developing useful machine learning models under intractable dataset bias has been well studied over the past few years [12, 13]. Without knowledge of the underlying mechanisms, most methods try to utilize datasets from multiple sources to learn domain-invariant models [14, 15]. This is a common setting called multi-source domain generalization (MSDG). However, in practice we usually have rich samples $x$ (data points) but very few datasets $D_i$. Thus, how to effectively utilize the limited available datasets from different sources is the core of MSDG.
We here define our problem setting. Our goal is to train a model that performs well on in-domain datasets and generalizes well to unseen domains. Formally, denote $\mathcal{D} = \{D_1, \dots, D_K\}$, where $D_i$ is the $i$-th dataset or sub-domain in a shared domain; $D_1, \dots, D_S$ are the internal sets available during training and $D_{S+1}, \dots, D_K$ are the external sets, which are completely hidden until test time. Here we focus on the classification task and assume all the datasets share common labels. We then aim to

$$\max_{\theta} \; \sum_{i=1}^{S} \sum_{(x, y) \in D_i} M\big(f_\theta(x), y\big) \;+\; \sum_{i=S+1}^{K} \sum_{(x, y) \in D_i} M\big(f_\theta(x), y\big), \tag{1}$$

where $f_\theta(x)$ is the model prediction of sample $x$ parametrized by $\theta$ and $M$ is our evaluation metric. The first double summation of (1) is the internal-set performance and the second part is that of the external sets. Notice that we have no access to the latter part of (1) during training, so our optimization can only target the former part.
To learn domain-invariant predictors more efficiently, we: 1. designed a new loss function that regularizes the bias learning by modelling each training domain explicitly; 2. proposed a new data augmentation method that improves model generalization by generating domain-guided perturbed hidden activations (see Fig. 1). We verified our methods through extensive experiments and obtained superior performance compared to prior approaches.
The rest of the paper is organized as follows: we first review existing work on dataset bias and domain generalization in Section II; Section III presents the intuition and methodology behind our proposed methods; as a case study, Section IV presents our experiments with quantitative and qualitative results on chest X-ray datasets; we discuss our major findings and conclude the paper in Section V.
II Related Work
Various metrics have been proposed to quantify dataset bias. The pioneering work in [12] suggested using cross-dataset generalization: measure the relative performance drop between the original test set and a new dataset, provided the two come from the same domain. [13] proposed replacing the relative measure with a direct difference followed by a sigmoid function, in order to better preserve information about the internal test set. Cross-dataset generalization is an intuitive and interpretable measure, hence it is widely used in the machine learning community [14, 16]. On the other hand, one can also perform a Classifier Two-sample Test (C2ST) to verify whether two datasets are drawn from an identical distribution [11]. The idea is that if two datasets are in the same domain, a binary classifier trained on their union should predict only at chance level which dataset a sample is drawn from.
Besides quantitative measures, several visualization techniques can be applied to qualitatively understand the source of biases. In [12], the trained weights of a linear-SVM classifier were overlaid on the original images to reveal the pattern of their spatial distribution; [17] generated class activation heatmaps (CAM) to visualize the regions of the input image that contribute most to a trained CNN's prediction; [18] proposed guided-backpropagation gradient activation heatmaps (guided Grad-CAM) to provide pixel-level attention for CNNs. In other fields such as natural language processing, the attention mechanism [19] is an effective way to visualize the focus of a model; one can also use the Local Interpretable Model-Agnostic Explanations method (LIME) [20] to understand model attention. By comparing the model "attention" with human judgment, we can verify whether the learning algorithm captures the correct representative features and infer the source of dataset biases. If a human attention heatmap is given, one can also use Spearman's rank correlation coefficient [21] or the earth mover's distance [22] to obtain quantitative measures [23, 24].
Several frameworks have been developed to address dataset bias or domain generalization, where the goal is to train a model that generalizes to unseen datasets or domains. One of the earlier works [14] was based on max-margin learning (SVM), which modeled biases as per-dataset bias vectors in classifier space. During training, the SVM optimized the objective of each dataset by constructing its classifier as the sum of a dataset-specific bias vector and a shared bias-free vector; at inference, the bias-free vector alone served as the bias-removed classifier. Building on top of it, [13] conducted more extensive experiments using DeCAF features [25] as model input. The authors found that the bias-removal technique of [14] worked better with classical BOW-SIFT features [26], while for DeCAF features the opposite held. [15] further extended this shallow bias-modelling structure to end-to-end training of a low-rank parametrized deep model and observed better performance.
Another line of work on domain generalization focuses on the feature level and aims to learn domain-invariant feature representations. In [27], a kernel-based method projects the data into a common feature space where domain dissimilarity is minimized while the functional relationship to the label is preserved. In [28], domain-robust features are learnt with multi-task data-reconstruction autoencoders. Domain-adversarial training can also be used to learn domain-independent features, either by fooling a domain classifier [29] or by aligning distributions across domains [30].
There are also efforts to address domain generalization by modifying the input data. [31] shuffled the original image patches and added an auxiliary recomposition task to improve generalization. [32] used a generative adversarial network (GAN) to generate domain-independent images. [33] developed a dataset-resampling paradigm (REPAIR) that improves model generalization by training on a re-weighted dataset. The approach most similar to ours is cross-gradient training [34], which generates inter-domain data via domain-guided perturbations of the inputs.
A related attempt to address dataset bias in chest X-ray (CXR) data is [35], where the authors collected ten CXR datasets internationally. However, they did not provide an effective method for handling dataset bias beyond directly training and testing in a leave-one-out scheme, and their task (predicting normal vs. abnormal) is simpler than ours.
III Proposed Methods
III-A Bias-regularized Learning
We start by revisiting the undoing-bias framework of [14]. Formally, let $x^i_j$ be the extracted feature of the $j$-th sample of dataset $i$ with label $y^i_j$; we aim to solve the following soft-constrained max-margin (SVM) optimization problem:

$$\min_{w_{vw},\,\{\Delta_i\}} \; \frac{1}{2}\lVert w_{vw}\rVert^2 + \frac{\lambda}{2}\sum_{i=1}^{S}\lVert \Delta_i\rVert^2 + C_1\sum_{i=1}^{S}\sum_{j} L\!\left(w_i^\top x^i_j,\, y^i_j\right) + C_2\sum_{i=1}^{S}\sum_{j} L\!\left(w_{vw}^\top x^i_j,\, y^i_j\right), \tag{2}$$

where $w_{vw}$ is our visual-world (debiased) classifier, $w_i = w_{vw} + \Delta_i$ is the biased classifier for dataset $i$, $L$ is the hinge loss, and $\lambda$, $C_1$, $C_2$ are balancing hyper-parameters. Intuitively, the learning of bias is controlled by regularizing the norms of the classifiers: $w_{vw}$ is encouraged to capture only domain-invariant features, while each $\Delta_i$ leverages dataset-specific features.
The above SVM framework has several drawbacks: 1. the feature representation is not learnt end-to-end (the original authors use BOW-SIFT features [26]); 2. the hinge loss is not optimizer-friendly since it is not differentiable everywhere; 3. the prediction has no probabilistic interpretation, which is crucial for assisting medical diagnosis; 4. it models the bias weights through an additive relation only. Thus, we first propose to train the feature extractor end-to-end with a deep model: $x^i_j = \phi_\theta(I^i_j)$, where $\phi_\theta$ is a neural-network feature extractor parametrized by $\theta$ and $I^i_j$ is the raw input. During training, $\theta$ is updated by back-propagating the gradient through $\phi_\theta$. Secondly, we propose to train the network with a cross-entropy loss, which addresses drawbacks 2 and 3. Lastly, for drawback 4 we introduce an additional trainable parameter $\alpha_i$ for each dataset, such that $w_i = \alpha_i \odot w_{vw} + \Delta_i$, where $\odot$ denotes the element-wise product. Here $\alpha_i$ models the multiplicative relation between the dataset bias and the visual world, enabling the model to capture both feature shifts and feature scaling of the biased datasets. With these changes, our proposed cross-entropy training objective is

$$\min_{\theta,\, w_{vw},\, \{\alpha_i, \Delta_i\}} \; \frac{\lambda_1}{2}\lVert w_{vw}\rVert^2 + \frac{\lambda_2}{2}\sum_{i=1}^{S}\left(\lVert \Delta_i\rVert^2 + \lVert \alpha_i\rVert^2\right) + \sum_{i=1}^{S}\sum_{j}\left[\ell\!\left(w_i^\top \phi_\theta(I^i_j),\, y^i_j\right) + \ell\!\left(w_{vw}^\top \phi_\theta(I^i_j),\, y^i_j\right)\right], \tag{3}$$

where $w_i = \alpha_i \odot w_{vw} + \Delta_i$ and $\ell$ is the negative log-likelihood (cross-entropy) between the output of the last linear layer and the ground-truth label.
We highlight the importance of the regularization term on the visual-world classifier in our proposed cross-entropy loss, because $w_{vw}$ can easily overfit to a solution that takes advantage of all the bias features, leading to poor generalization on the external set. To see this, consider a learned feature embedding $\phi(x) = [f_c, f_1, f_2]$, where $f_c$ is the common feature and $f_1$, $f_2$ are bias features present in $D_1$ and $D_2$, respectively. Ideally, we want $w_{vw}$ to rely only on $f_c$, so that it can generalize to some unseen dataset or domain. However, without proper regularization, a $w_{vw}$ that exploits $f_1$ and $f_2$ can still be a valid solution on the training sets, because for $x \in D_1$ we have $f_2 = 0$ and for $x \in D_2$ we have $f_1 = 0$. By penalizing the norm of $w_{vw}$ more than those of $\Delta_i$ and $\alpha_i$, we push the bias learning into those bias parameters instead of the visual-world classifier.
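The bias-regularized head described above can be sketched in PyTorch as follows. This is a minimal illustration under our own naming and initialization choices (the paper does not publish this exact module); it keeps a shared visual-world weight $w_{vw}$ plus per-dataset multiplicative ($\alpha_i$) and additive ($\Delta_i$) bias parameters, and penalizes $w_{vw}$ more heavily so bias is pushed into the per-dataset parameters:

```python
import torch
import torch.nn as nn

class BiasRegularizedHead(nn.Module):
    """Hypothetical sketch of the bias-regularized classifier head:
    a shared 'visual world' weight w_vw plus per-dataset multiplicative
    (alpha_i) and additive (delta_i) parameters, so that
    w_i = alpha_i * w_vw + delta_i (element-wise product)."""

    def __init__(self, feat_dim: int, n_classes: int, n_datasets: int):
        super().__init__()
        self.w_vw = nn.Parameter(torch.randn(n_classes, feat_dim) * 0.01)
        self.alpha = nn.Parameter(torch.ones(n_datasets, n_classes, feat_dim))
        self.delta = nn.Parameter(torch.zeros(n_datasets, n_classes, feat_dim))

    def forward(self, feats: torch.Tensor, dataset_idx: int):
        # Visual-world (debiased) logits and dataset-specific (biased) logits.
        w_i = self.alpha[dataset_idx] * self.w_vw + self.delta[dataset_idx]
        return feats @ self.w_vw.t(), feats @ w_i.t()

    def reg_loss(self, lam_vw: float = 1.0, lam_bias: float = 0.1):
        # Penalize w_vw more heavily than the bias parameters, pushing
        # dataset-specific information into alpha/delta (lam values assumed).
        return (lam_vw * self.w_vw.pow(2).sum()
                + lam_bias * (self.delta.pow(2).sum()
                              + (self.alpha - 1).pow(2).sum()))
```

During training, both logits would be fed to the cross-entropy loss; at test time on an unseen domain, only the visual-world logits are used.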
III-B Domain-guided Augmentation
The above framework models the bias in the higher-level classifier space. However, since we have limited training domains, our feature extractor can still overfit to the in-domain data. Synthesizing new-domain data and adding it to the training set is a potential solution, as the model then has a higher chance of spanning the target-domain distribution. One simple strategy for augmenting unseen-domain data is the Mixup technique [36, 37], where synthetic data are generated by linearly mixing samples from different datasets. However, as we show in the following section, when the dataset bias is severe this method suffers from convergence problems. Moreover, diagnosis from medical images usually relies on fine-grained features; naively mixing samples destroys these crucial details.
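For concreteness, the Mixup baseline discussed above can be sketched in a few lines (the `alpha` hyper-parameter and function name are our illustrative choices, not the paper's):

```python
import numpy as np

def mixup_batch(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Minimal Mixup sketch: blend inputs and labels from two domains
    with lambda ~ Beta(alpha, alpha). For fine-grained medical findings
    this blending can destroy the very details the diagnosis depends on,
    which is the failure mode noted above."""
    rng = rng if rng is not None else np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```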
Another idea is to augment the training data with the Cross-gradient Training method [34]. Formally, consider a dataset (domain) classifier $g$ with dataset-classification loss $J_d$ and a data point $(x, y)$ with domain label $d$: we can generate a new sample $\tilde{x} = x + \epsilon \, \nabla_x J_d(g(x), d)$, where $\epsilon$ is the step size. We then train the label classifier $f$ on the synthetic pair $(\tilde{x}, y)$. To ensure the gradient perturbation has minimal effect on the label $y$, we also augment the training of $g$ with $(\tilde{x}_d, d)$, where $\tilde{x}_d = x + \epsilon \, \nabla_x J_l(f(x), y)$ and $J_l$ is the label-classification loss. This makes the dataset classifier insensitive to label features, so the domain-guided perturbation tends not to change $y$.
Our proposed method is built upon this idea. However, unlike [34], we do not train a separate network for the dataset classifier $g$. Instead, $g$ is just a linear layer that directly takes the feature embedding as input and outputs the dataset-classification logits. During training, its gradient is not propagated into the feature extractor and is only used to update $g$ itself. The reasons for this design are two-fold: training a separate feature extractor for $g$, or propagating its gradient through the backbone, leads to gradient vanishing, because the dataset classifier is significantly easier to train; more importantly, since the dataset loss is now differentiable with respect to the intermediate features, we can augment intermediate features in addition to the input $x$, leading to a new multi-layer augmentation training paradigm. Specifically, given a pre-determined set of layer outputs, e.g. $S = \{h_{conv}, h_{fc}\}$, where $h_{conv}$ and $h_{fc}$ are the outputs of the last convolution and fully-connected layers respectively, at each training step we sample one layer output, say $h$, compute $\nabla_h J_d$, generate an augmented feature point $\tilde{h} = h + \epsilon \, \nabla_h J_d$ and feed it to the following layers. Following Cross-gradient Training, we also generate label-gradient perturbations to augment the training of $g$. Because an augmented point does not belong to any specific dataset, it is fed only to the visual-world classifier for training. In this way, we improve the robustness of our feature extractor to unseen data. We name this novel domain-guided augmentation method Multi-layer Cross-gradient Training (MCT). Together with Bias-regularized Learning, we summarize the overall model pipeline in Fig. 2 and the training pseudo-code in Algorithm 1.
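One MCT perturbation step can be sketched as follows. The function and variable names are ours; the key points it illustrates are that the gradient is taken with respect to the sampled hidden feature only, and that it never flows back into the feature extractor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def domain_guided_perturb(h, domain_head, d, eps=1.0):
    """Sketch of one MCT augmentation step: perturb a sampled hidden
    feature h along the gradient of the dataset-classification loss.
    The gradient is taken w.r.t. h only and is never propagated into
    the feature extractor, mirroring the stop-gradient design above."""
    h = h.detach().requires_grad_(True)          # cut the graph to the backbone
    loss_d = F.cross_entropy(domain_head(h), d)  # dataset-classification loss J_d
    grad_h, = torch.autograd.grad(loss_d, h)
    return (h + eps * grad_h).detach()           # synthetic cross-domain feature
```

The returned feature is fed to the remaining layers and scored only by the visual-world classifier, since it belongs to no specific dataset.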
We use three large-scale chest X-ray datasets as a case study, all of which are open-sourced: NIH ChestX-ray14 from the NIH Clinical Center [8], CheXpert from Stanford Hospital [9] and MIMIC-CXR from Beth Israel Deaconess Medical Center [10, 38]. Since these datasets have different label categories, we select 5 common diseases (Atelectasis, Cardiomegaly, Consolidation, Edema and Effusion) that they all share. We also discard all lateral scans in CheXpert and MIMIC-CXR, as NIH contains only frontal views. Table I summarizes the basic information of each processed dataset. We use a roughly 7:1:2 split for the train, val and test sets of each dataset, except for NIH, which has an official split with the same ratio. We also ensure that X-ray scans of the same patient belong to the same split, preventing information leakage.
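The leakage-free split described above can be sketched as follows: whole patients, not individual scans, are assigned to train/val/test, so no patient's scans cross split boundaries (function name and record format are our illustrative choices):

```python
import random
from collections import defaultdict

def patient_wise_split(records, ratios=(0.7, 0.1, 0.2), seed=0):
    """Sketch of a patient-wise 7:1:2 split. `records` is a list of
    (patient_id, scan_id) pairs; returns [train, val, test] as sets of
    scan ids, with each patient's scans confined to a single split."""
    by_patient = defaultdict(list)
    for pid, scan in records:
        by_patient[pid].append(scan)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)      # deterministic shuffle
    n = len(patients)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    groups = (patients[:n_train],
              patients[n_train:n_train + n_val],
              patients[n_train + n_val:])
    return [{s for p in grp for s in by_patient[p]} for grp in groups]
```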
We also include three popular datasets of domain generalization to specifically verify our proposed Multi-layer Cross-gradient Training algorithm. Dataset details and experiments can be found in Appendix B.
Datasets | # Patients | # Scans | # Atelectasis | # Cardiomegaly | # Consolidation | # Edema | # Effusion
IV-B Bias Measurement Metrics
In this section we introduce two quantitative metrics for measuring biases of trained models.
IV-B1 Generalization-Based Metrics
Let
$$P_{int} = \frac{1}{S}\sum_{i=1}^{S} M_i$$
be the internal test-set performance and
$$P_{ext} = \frac{1}{K-S}\sum_{i=S+1}^{K} M_i$$
be the average external test-set performance, where $M_i$ is the evaluation score on dataset $D_i$. The cross-dataset performance drop of a particular model can then be defined as
$$PD = \frac{P_{int} - P_{ext}}{P_{int}}. \tag{4}$$
Intuitively, $PD$ measures the change in cross-dataset performance, normalized by the internal-set score. $PD > 0$ indicates that biases are present, and the closer it gets to 1 the more severe they are. If $PD < 0$, the internal performance is itself sub-optimal and no informative conclusion can be drawn from cross-dataset evaluation.
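As a minimal worked example of the performance-drop metric above:

```python
def performance_drop(internal_score, external_score):
    """Cross-dataset performance drop, normalized by the internal score.
    PD > 0 suggests dataset bias; PD < 0 means the internal model itself
    is sub-optimal, so no conclusion about bias can be drawn."""
    return (internal_score - external_score) / internal_score
```

For instance, an internal AUC of 0.90 dropping to 0.72 externally gives PD = 0.2, i.e., a 20% relative degradation.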
IV-B2 Classifier Two-sample Test
The Classifier Two-sample Test (C2ST) [11] determines whether two datasets are drawn from the same distribution by training a binary classifier to tell them apart. Formally, given two datasets $D_1 = \{x_i\}_{i=1}^{n}$ and $D_2 = \{x'_j\}_{j=1}^{m}$, we construct a new dataset
$$D = \{(x_i, 0)\}_{i=1}^{n} \cup \{(x'_j, 1)\}_{j=1}^{m}$$
and train a binary classifier $c(x)$ as the conditional probability estimate of $p(z = 1 \mid x)$. We then obtain the classification accuracy on a held-out split $D_{te}$ according to
$$\mathrm{acc} = \frac{1}{|D_{te}|}\sum_{(x, z) \in D_{te}} \mathbb{1}\!\left[\,\mathbb{1}\!\left(c(x) > \tfrac{1}{2}\right) = z\,\right].$$
Intuitively, if the two datasets come from the same distribution, the test accuracy should be close to 0.5, i.e., no better than random guessing. Otherwise, there must be distinct features in one of the datasets that the classifier exploits.
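A minimal C2ST sketch follows. We use a least-squares linear classifier as a stand-in for the CNN the paper trains; the split scheme and function name are our illustrative choices:

```python
import numpy as np

def c2st_accuracy(x1, x2, seed=0):
    """Minimal Classifier Two-sample Test sketch: label the two datasets
    0/1, fit a least-squares linear classifier on half the pooled data,
    and report held-out accuracy. Accuracy near 0.5 is consistent with
    identical distributions; accuracy near 1.0 signals severe bias."""
    x = np.vstack([x1, x2]).astype(float)
    y = np.concatenate([np.zeros(len(x1)), np.ones(len(x2))])
    idx = np.random.default_rng(seed).permutation(len(x))
    half = len(x) // 2
    tr, te = idx[:half], idx[half:]
    xb = np.hstack([x, np.ones((len(x), 1))])      # append bias column
    w, *_ = np.linalg.lstsq(xb[tr], y[tr], rcond=None)
    pred = (xb[te] @ w > 0.5).astype(float)        # threshold at 1/2
    return float((pred == y[te]).mean())
```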
IV-C Results on Large-scale CXRs
In this section, we evaluate our proposed methods on the large-scale CXR datasets. For training, we resize all images to 256×256, followed by a random crop. Unlike some prior work, we do not use random horizontal flips, since some diseases (e.g., Cardiomegaly) rely on spatial information. AlexNet [39] pretrained on ImageNet [40] is used as the feature-extraction backbone for all compared models in the following sections unless specified otherwise. As each patient can have multiple diseases at the same time, our task is essentially multi-label classification, so we use a binary cross-entropy loss for each disease. Hyperparameters are determined on the validation set, and the layer set selected for MCT here consists of the input and the last dense layer of the feature-extraction network. We implement all experiments in PyTorch [41].
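The multi-label objective above amounts to one independent binary cross-entropy term per disease on the raw logits; a minimal PyTorch sketch (batch size and tensors here are illustrative stand-ins):

```python
import torch
import torch.nn as nn

# One independent binary cross-entropy term per disease (5 shared labels),
# applied to raw logits, as in the multi-label setup described above.
criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(8, 5)                     # 8 scans, 5 disease logits
targets = torch.randint(0, 2, (8, 5)).float()  # multi-hot disease labels
loss = criterion(logits, targets)              # scalar: mean over batch and labels
```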
IV-C1 Classification on Seen Datasets
Recall that we want our model to generalize well not only to unseen sets but also on the available internal sets. Thus, we first evaluate how our proposed methods perform on seen datasets. We use a leave-one-out scheme for splitting the domains and run experiments on all possible dataset combinations. Table II shows the internal performance of the various models.
For the last four models, where we have both a biased classifier $w_i$ and a visual-world classifier $w_{vw}$, we report the AUC scores of both; the value in brackets is the biased-classifier result.
We choose vanilla AlexNet as our baseline and also compare our methods to several advanced models for domain adaptation and generalization. Specifically, the baseline combines all datasets and trains on them together (train-them-all-together) without any special strategy. Domain-adversarial training (DANN) [29] tries to learn a domain-invariant feature representation by fooling a dataset discriminator. REPAIR [33] fixes a trained feature extractor and assigns a trainable weight to each sample to minimize the dataset representation bias; the dataset is then resampled according to the weights and the network retrained. In Mixup [36], inputs and labels are replaced by weighted sums of data from different domains. In CrossGrad [34], adversarial inputs are synthesized under domain-guided perturbations. We found that DANN fails to converge because the gradient of the domain discriminator dominates the feature learning; REPAIR does not help internal performance; Mixup is worse than the vanilla baseline; and CrossGrad suffers from the gradient-vanishing problem. In contrast, our proposed undoing-bias framework with cross-entropy loss (E2E-CE) surpasses all alternatives, suggesting that modelling dataset bias carefully yields a performance gain in multi-source training. The performance improves further with our proposed MCT augmentation. Notice that the bias weight vectors of our model perform better than the visual-world ones on the internal sets, indicating that our model effectively encodes dataset-specific information in the bias parameters.
IV-C2 Classification on Unseen Datasets
Table III reports the AUC score of each model tested on the external set. Unlike the finding in [13] that the undoing-bias framework [14] performs worse with DeCAF features, we show that by training the model end-to-end we can in fact obtain better external generalization. Moreover, we observe results similar to the internal-set performance: our proposed method surpasses all compared methods in every domain split, closing the performance gap between internal and external domains. We also find that popular domain-adaptation methods such as DANN [29] and data-augmentation methods such as Mixup [36] do not work well for CXR data.
IV-C3 Measuring the Bias
In this section, we further present the bias measurements of each model. We extract the hidden representation at the last layer of each trained model's feature extractor and train a dataset classifier on top of it to perform the Classifier Two-sample Test. Table IV summarizes the results. Several observations can be drawn: 1. although CXR data look very similar regardless of origin, naively training a deep model can significantly absorb dataset bias, leading to a large generalization gap when testing on other sources of data; 2. our proposed methods obtain the best average performance with a much smaller performance drop between internal and external sets; 3. our proposed methods exhibit smaller dataset bias.
Models | Perf. Drop (PD) | Classifier Test | Rank Cor.
For better illustration, we plot the feature representations of the baseline model and our proposed methods using t-SNE visualization [42]. As Fig. 3 shows, dataset bias is clearly present in the vanilla baseline model, whereas the t-SNE embeddings of the different datasets are mixed together under our proposed methods, indicating the effectiveness of our bias mitigation.
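This inspection can be sketched as follows, using scikit-learn's `TSNE` as a stand-in for the cited implementation (the feature array here is a random placeholder for the pooled, per-dataset CXR features):

```python
import numpy as np
from sklearn.manifold import TSNE

# Embed pooled features from several datasets into 2-D and inspect
# whether points cluster by dataset (bias present) or mix together
# (bias mitigated). Stand-in features; real ones come from the backbone.
rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 16))
emb = TSNE(n_components=2, perplexity=10.0, init="random",
           random_state=0).fit_transform(feats)
```

Coloring `emb` by dataset label then reproduces a plot in the style of Fig. 3.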
To understand how the bias affects disease prediction, we plot gradient-guided activation heatmaps (Grad-CAM) [18] for the models. Figure 4 visualizes two sets of results randomly chosen from the external set. Vanilla AlexNet tends to exploit unrelated tags and is susceptible to noise, while our proposed model is more discriminative and robust.
We also quantify the generated heatmaps by comparing them to the ground-truth annotations using Spearman's rank correlation coefficient [21], which measures the rank-order agreement between two sets. The results are shown in Table IV: our proposed method correlates better with the human annotations.
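The heatmap agreement score above can be computed with SciPy (function name is our illustrative choice):

```python
import numpy as np
from scipy.stats import spearmanr

def heatmap_rank_correlation(model_map, human_map):
    """Flatten the model and human attention maps and compute Spearman's
    rank correlation between their pixel rankings; 1.0 means identical
    rank order, -1.0 means fully reversed."""
    rho, _ = spearmanr(model_map.ravel(), human_map.ravel())
    return float(rho)
```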
Original Image | Grad-CAM of AlexNet | Grad-CAM of Our Model
V Discussion and Conclusion
Due to heterogeneous and confounded data-creation processes, dataset bias is a common problem in medical datasets. In this study, we showed that (1) naively training and deploying deep models on medical data is subject to dataset bias, leading to poor generalization; (2) our proposed MCT with the Bias-regularized Learning framework effectively utilizes multi-source training data, mitigating the damage of dataset bias and closing the performance gap between internal and unseen domains without retraining or domain adaptation. Although our case study focused on medical images, the approach is easy to adapt to other forms of biomedical data such as electronic health records, since our methods operate mainly in the hidden space of deep models. Our future work includes investigating the choice of augmentation layer set for different neural network architectures and testing our model on other tasks.
Fig. 5 shows the classification results on a random subset with a simple CNN. The architecture is simply a stack of five convolution blocks followed by a fully-connected layer, each block being conv-ReLU-maxpool. We found that the deep CNN could easily tell which hospital the data came from, indicating the presence of dataset bias.
We explicitly test our novel MCT domain-guided augmentation method on three popular domain-generalization datasets:
Google Fonts [34]: the task is to classify 36 characters collected from 109 fonts;
Rotated-MNIST [28]: created by rotating the original MNIST dataset by 6 different angles: 0, 15, 30, 45, 60 and 75 degrees. Each image has a digit label and its rotation angle as its domain;
Office-Caltech [44]: in total ten common object categories across four domains (Amazon, Caltech, DSLR and Webcam). The domains differ greatly in viewing angle, object background, and so on.
The experimental details are as follows: for Google Fonts and Rotated-MNIST (R-MNIST), we use the same configuration as in [34]; for Office-Caltech, we use the same setting as for Google Fonts. The selected layers for all three experiments are all layers including the input, but excluding the first convolution and the last dense layer. Results are shown in Table V, where the baseline for Google Fonts and Office-Caltech is LeNet [45] without special training, while that for R-MNIST is CCSA [46]. Our proposed MCT augmentation method surpasses the other compared models by a large margin.
Model | Google Fonts | R-MNIST | Office-Caltech
Baseline | 68.5* | 95.6* | 41.6
DANN [29] | 68.9* | 98.0* | 43.8
CrossGrad [34] | 72.6* | 98.6* | 44.3
*Results taken directly from [34].
[1] J. Fan, F. Han, and H. Liu, "Challenges of big data analysis," National science review, vol. 1, no. 2, pp. 293–314, 2014.
[2] K. Ferryman and M. Pitcan, "Fairness in precision medicine," Data & Society, 2018.
[3] C. H. Lee and H.-J. Yoon, "Medical big data: promise and challenges," Kidney research and clinical practice, vol. 36, no. 1, p. 3, 2017.
[4] B. A. Walther and J. L. Moore, "The concepts of bias, precision and accuracy, and their use in testing the performance of species richness estimators, with a literature review of estimator performance," Ecography, vol. 28, no. 6, pp. 815–829, 2005.
[5] M. J. West, "Stereological methods for estimating the total number of neurons and synapses: issues of precision and bias," Trends in neurosciences, vol. 22, no. 2, pp. 51–61, 1999.
[6] J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann, "Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study," PLoS medicine, vol. 15, no. 11, p. e1002683, 2018.
[7] J. Zech, "Medium," Jul 2018. [Online]. Available: https://medium.com/@jrzech
[8] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2097–2106.
[9] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya et al., "Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison," arXiv preprint arXiv:1901.07031, 2019.
[10] A. E. Johnson, T. J. Pollard, S. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, and S. Horng, "Mimic-cxr: A large publicly available database of labeled chest radiographs," arXiv preprint arXiv:1901.07042, 2019.
[11] D. Lopez-Paz and M. Oquab, "Revisiting classifier two-sample tests," arXiv preprint arXiv:1610.06545, 2016.
[12] A. Torralba, A. A. Efros et al., "Unbiased look at dataset bias," in CVPR, vol. 1, no. 2. Citeseer, 2011, p. 7.
[13] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars, "A deeper look at dataset bias," in Domain Adaptation in Computer Vision Applications. Springer, 2017, pp. 37–55.
[14] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba, "Undoing the damage of dataset bias," in European Conference on Computer Vision. Springer, 2012, pp. 158–171.
[15] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, "Deeper, broader and artier domain generalization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5542–5550.
[16] N. McLaughlin, J. M. Del Rincon, and P. Miller, "Data-augmentation for reducing dataset bias in person re-identification," in 2015 12th IEEE International conference on advanced video and signal based surveillance (AVSS). IEEE, 2015, pp. 1–6.
[17] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929.
[18] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-cam: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[19] S. Wang and J. Jiang, "Machine comprehension using match-lstm and answer pointer," arXiv preprint arXiv:1608.07905, 2016.
[20] G. J. Katuwal and R. Chen, "Machine learning model interpretability for precision medicine," arXiv preprint arXiv:1610.09045, 2016.
[21] J. L. Myers, A. D. Well, and R. F. Lorch Jr, Research design and statistical analysis. Routledge, 2013.
[22] E. Levina and P. Bickel, "The earth mover's distance is the mallows distance: Some insights from statistics," in Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2. IEEE, 2001, pp. 251–256.
[23] Y. Zhang, J. C. Niebles, and A. Soto, "Interpretable visual question answering by visual grounding from attention supervision mining," in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 349–357.
[24] D. Huk Park, L. Anne Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach, "Multimodal explanations: Justifying decisions and pointing to the evidence," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8779–8788.
[25] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "Decaf: A deep convolutional activation feature for generic visual recognition," in International conference on machine learning, 2014, pp. 647–655.
[26] D. G. Lowe, "Object recognition from local scale-invariant features," in ICCV, 1999, pp. 1150–1157. [Online]. Available: https://doi.org/10.1109/ICCV.1999.790410
[27] K. Muandet, D. Balduzzi, and B. Schölkopf, "Domain generalization via invariant feature representation," in International Conference on Machine Learning, 2013, pp. 10–18.
[28] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, "Domain generalization for object recognition with multi-task autoencoders," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[29] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[30] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot, "Domain generalization with adversarial feature learning," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[31] F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi, "Domain generalization by solving jigsaw puzzles," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2229–2238.
[32] F. M. Carlucci, P. Russo, T. Tommasi, and B. Caputo, "Hallucinating agnostic images to generalize across domains," 2018.
[33] Y. Li and N. Vasconcelos, "Repair: Removing representation bias by dataset resampling," arXiv preprint arXiv:1904.07911, 2019.
[34] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi, "Generalizing across domains via cross-gradient training," arXiv preprint arXiv:1804.10745, 2018.
[35] L. Yao, J. Prosky, B. Covington, and K. Lyman, "A strong baseline for domain adaptation and generalization in medical imaging," arXiv preprint arXiv:1904.01638, 2019.
[36] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.
[37] V. Verma, A. Lamb, C. Beckham, A. Najafi, A. Courville, I. Mitliagkas, and Y. Bengio, "Manifold mixup: Learning better representations by interpolating hidden states," 2018.
[38] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, "Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals," Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
[39] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.
[40] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
[41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in pytorch," 2017.
[42] L. van der Maaten and G. Hinton, "Visualizing data using t-sne," Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[43] U. Ozbulak, "Pytorch cnn visualizations," https://github.com/utkuozbulak/pytorch-cnn-visualizations, 2019.
[44] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2066–2073.
[45] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[46] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto, "Unified deep supervised domain adaptation and generalization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5715–5725.